Proteomics analysis using cloud infrastructure

Proteomics is the study of the global protein expression of cells and tissues. In proteomics, measurements are often carried out using mass spectrometers, and the resulting data is both complex and large in volume. Proteins are complex macromolecules consisting of hundreds or thousands of amino acids drawn from 20 types. Each amino acid can also undergo modifications. As a result, an estimated 1 million different protein types exist in complex organisms such as humans, and their abundance varies over 7 orders of magnitude.

Computational proteomics aims to generate interpretable information from the thousands of mass spectra produced each hour. In general, the computational workflows need to be adapted to new data acquisition strategies, and sometimes even to individual projects. To accommodate this, typical workflows consist of many tools produced by research groups, consortia or companies. Below, we describe the technology stack we use to provide stable workflows to both experienced and novice users while remaining flexible enough to accommodate special analysis cases.

All data, both measured and derived, is ingested into the openBIS data manager (Bauch et al. 2011) and is ultimately stored on Swestore. Workflows can automatically stage data on the computation infrastructure in use. GC3PIE is used to manage the workflow and to interact with the computational resources as follows: a user submits a new workflow, the GC3PIE head node downloads the data and creates cloud workers, and the workers then execute the various tools that constitute the workflow. The final results are registered in the data manager in relation to the input data. The results consist of both result data and interactive reports.
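
The sequence described above (stage the input, fan out to workers, run the tools in order, register the result linked to its input) can be illustrated with a minimal Python sketch. It does not use the openBIS or GC3PIE APIs: `DataManager`, `run_workflow`, the dataset identifiers and the two toy "tools" are all hypothetical stand-ins, and a local thread pool plays the role of the cloud workers.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class DataManager:
    """Hypothetical in-memory stand-in for the data manager (not the openBIS API)."""
    datasets: dict = field(default_factory=dict)
    parents: dict = field(default_factory=dict)

    def register(self, dataset_id, content, parent_id=None):
        # Store the dataset and, if given, record which input it was derived from.
        self.datasets[dataset_id] = content
        if parent_id is not None:
            self.parents[dataset_id] = parent_id

    def download(self, dataset_id):
        return self.datasets[dataset_id]


def run_workflow(manager, input_id, tools, n_workers=2):
    """Stage the input, run every tool in order on each item, register the result."""
    items = manager.download(input_id)  # the "head node" stages the data

    def worker(item):
        # Each worker applies the tools of the workflow one after another.
        for tool in tools:
            item = tool(item)
        return item

    # A local thread pool stands in for the cloud workers (an assumption).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(worker, items))

    # Register the result in relation to the input data.
    result_id = f"{input_id}-result"
    manager.register(result_id, results, parent_id=input_id)
    return result_id


# Toy usage: numbers stand in for spectra, lambdas stand in for analysis tools.
manager = DataManager()
manager.register("raw-001", [3.0, 1.0, 2.0])
tools = [lambda x: x * 2, lambda x: x + 1]
result_id = run_workflow(manager, "raw-001", tools)
print(result_id, manager.download(result_id), manager.parents[result_id])
```

Recording the parent identifier at registration time is what lets a result be traced back to the measurement it came from, which is the relationship the data manager keeps between input and result data.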

Johan Malmström (Lund University) and Lars Malmström (ETH Zurich)