"Abstract The Cancer Genome Atlas (TCGA) set the standards for large-scale cancer genome projects worldwide. In the next phase, the National Cancer Institute (NCI) and its Center for Cancer Genomics are planning large-scale projects closely tied to clinical questions and trials. To analyze these data, the NCI is creating a Genome Data Analysis Network (GDAN) composed of different types of Genome Data Analysis Centers (GDACs). Central to this Network is a single Processing GDAC, which will take all the harmonized data stored in the NCI's Genomic Data Commons and perform higher-level integrated analyses to support both the Analysis Working Groups (AWGs) within the Network, which will be formed for each project to perform specialized analyses and write manuscripts, and the entire biomedical research community. Herein we propose to build the centralized Processing GDAC on top of our FireCloud platform, an infrastructure for running large-scale computation in the cloud in a fully rigorous and reproducible fashion. FireCloud development was based on our experience with Firehose, the Broad internal platform on which the standard TCGA data and analyses currently run. We propose to create and operate the GDAN Standard Workflow, incorporating tools actively developed and used within the GDAN and across the entire field, with particular emphasis on clinical tools. This Workflow will serve as the starting point for AWGs and set the highest standards of transparency, reproducibility and rigor for cancer genome analysis. The results of the Standard Workflow will be stored in a public database, made accessible via standard APIs, and used together with a continuously updated database of prior knowledge to create scientific reports that will be made available to the community before publication. 
Finally, a major innovation is that AWG members will be able to log in to FireCloud and rerun the entire workflow, or parts of it, with their own parameters and subsets of the data, thus making the entire GDAN analysis fully reproducible and scalable. Our goals are therefore to: (1) create a global infrastructure for collaborative extreme-scale cancer analysis; (2) operate the Standard Workflows at scale; (3) rapidly and continuously evolve the Standard Workflows; and (4) create improved capabilities for reporting, exploring the results, clinical diagnostics and reproducibility." |