Abstract: Technical batch effects pose a fundamental challenge to quality control and reproducibility of even single-laboratory research projects, but the possibilities for serious error are greatly magnified in complex, multi-institutional enterprises such as the cancer molecular profiling projects being undertaken by the NCI Center for Cancer Genomics (CCG). To aid in detection, quantitation, interpretation, and (when appropriate) correction of technical batch effects in such data, we have developed the MBatch computational tool and web portal. MBatch has become indispensable for quality-control "surveillance" of data in The Cancer Genome Atlas (TCGA) project, but detecting and quantitating batch effects (or trend effects or statistical outliers) are just the first steps in a process. The next steps involve detective work in collaboration with those who generated the data, drawing upon expertise in integrative analysis across data types, pathways, and systems-level biology. That detective work usually succeeds in diagnosing the cause of a batch effect as technical or biological. If the cause is technical, then computational correction can be done (judiciously). The primary aim of the proposed Genome Data Analysis Center (GDAC) is to translate that successful quality-control model from TCGA to other current and future large-scale molecular profiling projects sponsored by the CCG. We will be ready to do that on Day 1. The second aim is to increase the power of MBatch to perform the basic quality-control functions. We will add a number of innovative new algorithms (Replicates-Based Normalization, Empirical Bayes++, and CorNet) and increase the repertoire of standard methods. We will also add major visualization resources, including our interactive Next-Generation Clustered Heat Maps.
The third aim is to make the system sufficiently robust, user-friendly, interactive, carefully documented, and easy to install that bench biologists and clinical researchers can use it to explore CCG-generated data or their own. Toward those ends, we have established collaborations to implement MBatch in Galaxy and on the cloud. We bring a number of assets to the proposed GDAC, including (i) multidisciplinary expertise in bioinformatics, biostatistics, software engineering, biology, and clinical oncology; (ii) PIs with a combined 21 years of experience in high-throughput molecular profiling studies of clinical cancers (in a highly consortial context); (iii) international leadership in batch effects analysis; (iv) a highly professional software engineering team with a track record of producing high-end, highly visual bioinformatics packages and websites; (v) a team of 20 Analysts whose expertise can be called on; (vi) extensive computing resources, including one of the most powerful academically based machines in the world; (vii) strong institutional support; and (viii) close working relationships with first-class basic, translational, and clinical researchers throughout MD Anderson, one of the foremost cancer centers in the country. The bottom-line mission of the GDAC will be to aid the research community's effort to understand cancer and to prevent, detect, diagnose, and treat it more effectively for the benefit of patients and their families.