• Conquering the Curse of Dimensionality by Using Background Knowledge
The Client: Javna agencija za raziskovalno dejavnost RS (Slovenian Research Agency, ARRS)
Project type: ARRS research project
Project duration: 2013 - 2016
  • Description
The functioning of today's society is based on collecting and analysing large amounts of data. Since collecting and storing data have both become very cheap, it is no longer our usual practice to observe small sets of carefully selected variables; instead, we routinely gather large quantities of measurements. This is true across all fields of work, from science, where we sequence whole genomes and observe the activity of all genes at the same time, to business, where we capture snapshots of share or market prices at short intervals. In principle, observing large quantities of variables should let us discover more complex and unexpected patterns in data than before. In practice, however, such a large amount of data looks like a giant haystack, and we lack efficient methods to look for needles in it or, worse, to distinguish the needles from the hay. In more formal terms, current methods for knowledge discovery from data find a large number of models and patterns that fit the data equally well. Even though most of them are spurious, it is impossible to distinguish them from real patterns by mathematical means alone. We believe the problem lies in the current approach to knowledge discovery, which uses (only) the data for constructing new theories, a bad practice that has been called "data fishing". So far, we have avoided the problems this approach causes by seeking the simplest possible theories (e.g. by using linear models, various regularisations, Occam's razor, etc.). For high-dimensional problems, however, this is no longer possible, because too many equally complex theories fit the data equally well.
In this project, we plan to investigate what we believe to be the only practical solution to the problem. Just as traditional science does not build theories from observation alone, the search for models, patterns and visualisations in automatic knowledge discovery from data has to rest on existing knowledge from the research area. This background knowledge can take any form that describes connections between variables: ontologies or networks whose entities correspond to variables, correlations of variables from previous experiments, rules explicitly drawn up by a domain expert, or texts associated with the field from which the connectedness of variables can be determined statistically.
Background knowledge can be used in all phases of knowledge discovery. In this project we intend to develop data transformation methods that will, for example, reduce the dimensionality of data by using background knowledge to form new variables from the observed ones (a sketch of this idea follows below). This approach differs from existing dimensionality reduction techniques, which derive the reduced representation from the data alone. We will develop visualisation methods that construct useful and informative visualisations on the basis of existing knowledge. The construction of predictive models, especially with machine learning methods, rests on searching a vast space of possible models; this search can be navigated by background knowledge about the connections between variables (see the second sketch at the end of this description). Finally, existing knowledge can be used to select models and patterns from the vast set of those that fit the given data equally well.
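To illustrate the proposed data transformation, the minimal Python sketch below collapses observed variables into new variables, one per group defined by background knowledge (for instance, genes grouped by ontology terms). The function name, group definitions and data are hypothetical placeholders under our assumptions, not the project's actual method.

import numpy as np

def collapse_by_groups(X, var_names, groups):
    """Replace the columns of X with one averaged column per knowledge group.

    X          -- (n_samples, n_vars) data matrix
    var_names  -- list of n_vars variable names
    groups     -- dict mapping a group name to a set of variable names,
                  e.g. derived from an ontology or a curated pathway database
    """
    index = {name: j for j, name in enumerate(var_names)}
    new_cols, new_names = [], []
    for group, members in groups.items():
        cols = [index[m] for m in members if m in index]
        if cols:  # skip groups with no measured members
            new_cols.append(X[:, cols].mean(axis=1))
            new_names.append(group)
    return np.column_stack(new_cols), new_names

# Hypothetical example: 4 samples, 5 genes, 2 ontology-derived groups.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))
genes = ["g1", "g2", "g3", "g4", "g5"]
groups = {"cell_cycle": {"g1", "g2"}, "dna_repair": {"g3", "g4", "g5"}}
Z, names = collapse_by_groups(X, genes, groups)
print(Z.shape, names)  # (4, 2) ['cell_cycle', 'dna_repair']

The point of the design is that the number of new variables is fixed by the background knowledge rather than estimated from the data, so the reduction does not itself overfit the sample at hand.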
We will model our work on contemporary methods of genetic data analysis, the field that has recently contributed most towards overcoming the curse of dimensionality, as well as on statistical techniques for dimensionality reduction and on procedures for constraining search in machine learning, which currently do not use background knowledge, at least not in the way we intend to in this project. The developed methods will be implemented in open-source packages for knowledge discovery from data and will thus be immediately available for practical use. Use in practice will in turn facilitate the testing and refinement of the algorithms developed in the project.
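As a second illustration, of the search-navigation idea described above, the sketch below shows a greedy feature selection that, after the first pick, only considers variables connected to an already selected variable in a background-knowledge graph. The toy score, the graph and the data are assumptions made for the example, not the project's actual procedure.

import numpy as np

def knowledge_guided_selection(X, y, edges, k):
    """Greedily pick k features, expanding only along knowledge-graph edges.

    edges -- set of frozensets {i, j} stating that variables i and j are
             related according to background knowledge (ontology, texts, ...)
    """
    def score(cols):
        # Toy score: absolute correlation of the feature average with y.
        z = X[:, cols].mean(axis=1)
        return abs(np.corrcoef(z, y)[0, 1])

    n = X.shape[1]
    selected = []
    while len(selected) < k:
        if selected:
            # Restrict the search to knowledge-graph neighbours of the
            # variables chosen so far, instead of the full variable set.
            candidates = [j for j in range(n) if j not in selected
                          and any(frozenset((i, j)) in edges for i in selected)]
        else:
            candidates = list(range(n))
        if not candidates:
            break
        selected.append(max(candidates, key=lambda j: score(selected + [j])))
    return selected

# Hypothetical data: 30 samples, 6 variables, a small knowledge graph.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 6))
y = X[:, 0] + X[:, 2] + 0.1 * rng.normal(size=30)
edges = {frozenset(e) for e in [(0, 2), (2, 4), (1, 3)]}
print(knowledge_guided_selection(X, y, edges, k=3))

The knowledge graph here prunes the search space: of the many feature subsets that would score equally well on a small sample, only those whose variables are known to be connected are ever examined.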