A next-generation analytics toolbox for integrated high-throughput genomic data analysis

Client ID: ARRS (BI-US/11-12-020)
Project type: Bilateral Collaboration Project
Project duration: 2011 - 2012


The major bottleneck in high-throughput genomics is data analysis. Modern biomedical laboratories can readily collect vast amounts of data from a variety of experimental platforms. Biomedical interpretation of these data usually begins with a search for promising subsets of genes, proteins or connections between them. Investigators then want to find existing ontologies and annotations, perform enrichment analyses, find similar experiments in local and public databases, and search the literature for related findings. From there they want to carry out text mining on the results, cluster the outcomes using literature annotations, extract lists of associated genes, generate hypotheses and test them against the original data set.

In this project we propose a pioneering next-generation bioinformatics data analysis framework that includes web-based applications, a visual programming environment and a scripting environment for integrative data analysis. We will develop it around our expertise in transcriptional data mining, but with a vision of general applicability. The framework will allow biologists to perform sophisticated analysis tasks through visual analytics and interactive data exploration without any programming knowledge. It will also allow developers to use existing components to craft new procedures, and to integrate data and knowledge using only a few lines of code. The framework will be based on open source data mining suites and built on an innovative software architecture designed for speed, flexibility and simplicity. We will use Flash programs for interactive web applications, visual programming, data analytics and interactive visualization, server-based repositories of preprocessed data, and modern scripting languages.