• Advancement of computationally intensive methods for efficient modern general-purpose statistical analysis and inference
Client: Javna agencija za raziskovalno dejavnost Republike Slovenije (L1-7542)
Project type: Research projects ARRS
Project duration: 2016 - 2019
  • Description

It is difficult to overstate the importance of statistical data analysis in today's world: all the empirical sciences, health, finance, fraud detection, telecommunications, social networking, and marketing are just a few of the areas that rely heavily on data and their analysis. While applied statistics, especially modern Bayesian statistics, has progressed tremendously and has become much more accessible, progress has recently been slowing down, because current state-of-the-art computation cannot handle the models and volumes of data we want to analyze today.

The issue of inefficient statistical computation has recently been highlighted as one of the top 5 open problems in statistics. Our primary objective is to contribute to solving this problem by researching an approach to more efficient general-purpose computation and implementing the findings in a tool that would allow us to analyze ever-growing volumes of data at a reasonable cost.

We plan to achieve this objective by automatically parallelizing the most expensive parts of general-purpose Markov chain Monte Carlo algorithms (in particular, Metropolis-Hastings and Hamiltonian Monte Carlo) and running them on graphics processing units. As a result of our project, we anticipate at least 100-fold speedups at a low cost (less than €1,000). Furthermore, we have attracted top researchers and experts from the University of Ljubljana, the Slovenian Academy of Sciences and Arts, and industry to participate in the project. Every data set and statistical inference problem we use to gain insight and to develop, evaluate, and validate our methodology will be part of a relevant practical problem faced by Slovenian researchers.
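As a rough illustration of where the parallelism comes from (a sketch under stated assumptions, not the project's implementation): in Metropolis-Hastings, the costly step is usually evaluating the log-posterior, and for independent observations the log-likelihood is a sum of per-observation terms that can be computed separately. The model below (normal data with unknown mean, unit variance, and a flat prior) and all function names are hypothetical. A thread pool stands in for the parallel hardware only to show the structure; it is not expected to speed anything up in CPython.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def chunk_log_lik(chunk, mu):
    """Log-likelihood of one data chunk under Normal(mu, 1), up to a constant."""
    return -0.5 * sum((x - mu) ** 2 for x in chunk)


def log_posterior(mu, chunks, pool):
    """Flat prior assumed, so the log-posterior is the sum of the chunk terms."""
    # Each chunk term is independent of the others: this is the step that
    # could be distributed across parallel workers.
    return sum(pool.map(lambda chunk: chunk_log_lik(chunk, mu), chunks))


def metropolis_hastings(data, n_iter=2000, step=0.2, n_chunks=4, seed=0):
    """Random-walk Metropolis-Hastings for mu with a chunked likelihood."""
    rng = random.Random(seed)
    chunks = [data[i::n_chunks] for i in range(n_chunks)]
    samples = []
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        mu = 0.0
        current = log_posterior(mu, chunks, pool)
        for _ in range(n_iter):
            proposal = mu + rng.gauss(0.0, step)
            candidate = log_posterior(proposal, chunks, pool)
            # Symmetric proposal: accept with probability min(1, posterior ratio).
            if rng.random() < math.exp(min(0.0, candidate - current)):
                mu, current = proposal, candidate
            samples.append(mu)
    return samples
```

The chain itself stays sequential in this sketch; only the likelihood evaluation inside each iteration is split into independent pieces.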

There have been successful attempts at efficient statistical computation for very limited cases, but what we are aiming for - general-purpose inference that is automatically parallelized for highly efficient computation - is novel and has so far not been achieved. This makes the project extremely relevant, both as a significant scientific achievement in the field of computation and for the numerous practical benefits of low-cost, accessible, high-performance statistical inference.

Indications from related work suggest that the speedups we are aiming for are achievable. While this is a research project and several technical details and implementation issues remain to be resolved, we are confident of the project's feasibility: we have a set of well-defined and directly measurable requirements, we have laid out a clear plan for achieving them, and we have assembled a project team of experts from varied backgrounds with all the required knowledge and know-how. We have also attracted co-financing from industry to supplement our budget, and we will actively promote student participation.

The main contributions of the project will be the theoretical research that leads to efficient computation, the practical implementation of this research in a software tool for general-purpose statistical computation, and, as a by-product, empirical research achievements in other fields of science made possible by our methodological research. Efficient computation will cut time and costs, which will directly benefit industry and, given the ubiquity and growing volumes of data, everyday life. And last, but not least, the collaboration between researchers, applied researchers, industry, and students will raise the general level of applied statistical knowledge, a field that is extremely underdeveloped in Slovenia.

 

Yearly scope

1.78 FTE

 

Partner research organizations

Project researchers

 

Project phases

  • Drafting and parallelization of specific models. [complete]
  • Research on automated parallelization. [current]
  • Implementation of the research findings in a practical tool.
  • Testing and consolidation of results.

 

Financed by

Publications

Direct results

  • ČEŠNOVAR, Rok, ŠTRUMBELJ, Erik. Parallel draws from the Polya-Gamma distribution for faster Bayesian multinomial and count model inference. Slovenian Conference on Artificial Intelligence : proceedings of the 19th International Multiconference Information Society - IS 2016, 12 October 2016, Ljubljana, Slovenia : volume A, 2016, pp. 9-12.

 

Indirect results and use of project results

  • DIMITRIEV, Aleksandar, ŠTRUMBELJ, Erik. Bayesian binary and ordinal regression with structured uncertainty in the inputs. Slovenian Conference on Artificial Intelligence : proceedings of the 19th International Multiconference Information Society - IS 2016, 12 October 2016, Ljubljana, Slovenia : volume A, 2016, pp. 17-20.
  • ROPRET, Matevž, GAŠPARAC, Greta, ŠTRUMBELJ, Erik. Pollution source attribution using air mass back-trajectories : a machine learning approach. Zbornik petindvajsete mednarodne Elektrotehniške in računalniške konference ERK 2016, 2016, vol. B, pp. 95-98.
  • ROPRET, Matevž. Forecasting air pollutant concentrations and identifying source regions : master's thesis : second level program of Computer and Information Science. Ljubljana, 2016.