Scientific computation is emerging as absolutely central to the scientific method, but the prevalence of very relaxed practices is leading to a credibility crisis affecting many scientific fields. It is impossible to verify most of the results that computational scientists present at conferences and in papers today. Reproducible computational research, in which all details of computations -- code and data -- are made conveniently available to others, is a necessary response to this crisis.
This session addresses reproducible research from three critical vantage points: the consequences of relying on unverified code and results as a basis for clinical drug trials; groundbreaking new software tools for facilitating reproducible research, pioneered in a bioinformatics setting; and new survey results elucidating the barriers scientists face in practicing open science, along with proposed policy solutions designed to encourage open data and code sharing. A rapid transition is now under way -- visible particularly over the past two decades -- that will leave computation absolutely central to the scientific enterprise, cutting across disciplinary boundaries and international borders and offering a new opportunity to share knowledge widely.
Audio for the symposium is here. Warning: it's 3 hours, 139MB, and not indexed by speaker or slide. I have, however, given time markers below for each of the talks.
 
The Importance of Reproducibility in High-Throughput Biology: Case Studies (slides, 00:00 to 29:26 on the audio file)
Keith A. Baggerly, University of Texas M.D. Anderson Cancer Center
High-throughput biological assays let us ask very detailed questions about how diseases operate, and promise to let us personalize therapy. Data processing, however, is often not described well enough to allow for reproduction, leading to exercises in “forensic bioinformatics” where raw data and reported results are used to infer what the methods must have been. Unfortunately, poor documentation can shift from an inconvenience to an active danger when it obscures not just methods but errors.
In this talk, we examine several related papers using array-based signatures of drug sensitivity derived from cell lines to predict patient response. Patients in clinical trials were allocated to treatment arms based on these results. However, we show in several case studies that the reported results incorporate several simple errors that could put patients at risk. One theme that emerges is that the most common errors are simple (e.g., row or column offsets); conversely, in our experience, the simplest errors are common. We briefly discuss steps we are taking to avoid such errors in our own investigations.
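To make concrete what a "row offset" error looks like, here is a minimal, hypothetical illustration (invented numbers and labels, not Baggerly's actual analysis): a one-row shift between an expression score table and its sample annotations silently changes which cell lines are called sensitive or resistant, and nothing in the code raises an error.

# Hypothetical illustration of a one-row offset between scores and labels.
import pandas as pd

# Toy signature scores for six cell lines (invented values).
scores = pd.Series([0.91, 0.85, 0.44, 0.38, 0.12, 0.07],
                   index=["A", "B", "C", "D", "E", "F"],
                   name="signature_score")

# Sensitivity calls exported from another file; an extra header row at the
# top shifts every value down by one position.
raw_labels = ["label", "sensitive", "sensitive", "sensitive",
              "resistant", "resistant", "resistant"]

correct = pd.Series(raw_labels[1:], index=scores.index)   # intended alignment
shifted = pd.Series(raw_labels[:-1], index=scores.index)  # off-by-one: "label" lands on cell line A

# Both assignments have the right length, so neither raises an error;
# only a cross-check against known controls would reveal the misalignment.
print(pd.DataFrame({"score": scores, "correct": correct, "shifted": shifted}))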
 
Policies for Scientific Integrity and Reproducibility: Data and Code Sharing (slides, 32:52 to 47:40 on the audio file)
Victoria C. Stodden, Columbia University
As computation emerges as central to the scientific enterprise, new modalities are necessary to ensure scientific findings are reproducible. Without access to the underlying code and data, it is all but impossible to reproduce today's published computational scientific results. As seen in recent widely reported events, such as Climategate and the clinical trials scandal at Duke University, a lack of transparency in computational research undermines public confidence in science and slows scientific progress, engendering a credibility crisis.
In order to communicate scientific discoveries and knowledge through the release of the associated code and data, scientists face issues of copyright. In this talk I address copyright as a barrier to reproducible research and present open licensing solutions for computational science. My efforts, labeled The Reproducible Research Standard, are designed to realign the communication of modern computational scientific research with longstanding scientific norms.
With transparency in scientific research, a new era of access to computational science is at hand that does not confine understanding to specialists, but permits knowledge transfer not just between disciplines but to any interested person in the world with an internet connection.
 
Reproducible Software versus Reproducible Research (slides, extended abstract, 53:25 to 1:10:47 on the audio file)
Fernando Perez, University of California, Berkeley
As an active participant in both the scientific research and the open-source software development communities, I have observed that the latter often lives up to our ideals of scientific openness and reproducibility better than the former. I will explore the reasons behind this, and I will argue that these problems are particularly acute in computational domains, where they should in fact be less prevalent.
Open source software development uses public fora for most discussion, and systems for sharing code and data that are, in practice, powerful provenance tracking systems. There is a strong culture of public disclosure, tracking, and fixing of bugs, and development often includes exhaustive automated validation suites that are executed whenever changes are made to the software and whose output is publicly available on the internet. This helps with early detection of problems, mitigates their recurrence, and ensures that the state and quality of the software is a known quantity under a wide variety of conditions (operating systems, inputs, parameter ranges, etc.).
Additionally, the very systems that are used for sharing the code track the
authorship of contributions. All of this ensures that open collaboration does
not dilute the merit or recognition of any individual developer, and allows for
a meritocracy of contributors to develop while enabling highly effective
collaboration.
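As a rough sketch of the kind of provenance tracking described above (the analyze() function and result.json file are invented placeholders, not part of any talk), a script can record the exact code revision and environment alongside each result it produces:

# Minimal provenance-stamping sketch: store the computed result together with
# the git commit and Python version that produced it. Names are hypothetical.
import json
import platform
import subprocess
from datetime import datetime, timezone

def git_revision():
    # Ask git for the current commit hash of the analysis code.
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

def analyze():
    # Stand-in for the real computation.
    return {"mean_signature_score": 0.42}

record = {
    "result": analyze(),
    "code_revision": git_revision(),
    "python_version": platform.python_version(),
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

with open("result.json", "w") as fh:
    json.dump(record, fh, indent=2)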
In sharp contrast, the incentives in computational research are strongly biased towards rapid publication of papers without any realistic requirement of validation, and they lead to a completely different outcome. Publications in computationally based research (applied to any specific discipline) often lack any realistic hope of being reproduced, since the code behind them is not available or, if it is, it rarely has any automated validation, history tracking, bug database, etc.
I will discuss how we can draw specific lessons from the open source community, both in terms of technical approaches and in terms of changing the structure of incentives, to make progress towards a more solid base for reproducible computational research.
 
GenePattern (slides, 1:16:36 to 1:32:33 on the audio file)
Michael Reich, Broad Institute of MIT and Harvard
The rapid increase in biological data acquisition has made computational analysis essential to research in the life sciences. However, the myriad software tools for analyzing these data were developed in diverse settings, without the capability to interact with one another or to capture the information necessary to reproduce an analysis. The burden of maintaining analytical provenance is therefore placed on the individual scientist. As a consequence, publications in biomedical research usually do not contain sufficient information for reproduction of the presented results. To alleviate these problems, we created a computational genomics environment called GenePattern which tracks the steps in the analysis of genomic data. Recently, in collaboration with Microsoft, we linked GenePattern to Microsoft Word. The resulting combination provides a Reproducible Research System that enables users to link analytical tools into workflows, to automatically record their work, to transparently embed that recording into a publication without ever leaving their word processing environment, and, importantly, to allow exact reproduction of published results.
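The underlying idea of capturing each analysis step so a pipeline can be replayed later can be sketched in a few lines; this is an illustrative toy, not GenePattern's actual API, and the step names and pipeline.json file are invented:

# Toy workflow recorder: log every analysis step with its parameters so the
# pipeline can be replayed later. Not GenePattern's API; names are invented.
import json

class WorkflowRecorder:
    def __init__(self):
        self.steps = []

    def run(self, name, func, **params):
        # Execute one analysis step and append its name and parameters
        # to the provenance log.
        result = func(**params)
        self.steps.append({"step": name, "params": params})
        return result

    def save(self, path):
        with open(path, "w") as fh:
            json.dump(self.steps, fh, indent=2)

# Placeholder analysis functions standing in for real modules.
def normalize(threshold):
    return f"normalized at {threshold}"

def cluster(k):
    return f"{k} clusters"

rec = WorkflowRecorder()
rec.run("normalize", normalize, threshold=0.5)
rec.run("cluster", cluster, k=3)
rec.save("pipeline.json")   # the saved log is enough to re-run the same steps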
 
Strategies for Reproducible Research (slides, 1:35:40 to 1:53:12 on the audio file)
Robert Gentleman, Bioinformatics & Computational Biology, Genentech, Inc.
I will discuss some practical considerations for engaging in
the practice of reproducible research. These will consider different
users (individual investigators, research groups and whole departments
or organizations) as well as different uses (publication, internal
communication, documentation of approaches that were tried).
 
A Universal Identifier for Computational Results (slides, 2:01:18 to 2:28:03 on the audio file)
David Donoho, Stanford University; Matan Gavish, Stanford University
We propose that scientific publications recognize the primacy of computational results -- figures, tables, and charts -- and follow a protocol we have developed which asks publishers for subtle, easy changes in article appearance, asks authors for simple, easy changes in a few lines of code in their programs and word processors, and yet has far-reaching and, we think, lasting consequences. The effect of these small changes will be the following. Each author would permanently register each computational result in a published article (figure, table, computed number in in-line text) with a unique universal result identifier (URI). Each figure or table appearing in a published article would have its URI clearly indicated next to that item: a string that permanently and uniquely identifies that computational result.
In our proposal, an archive, run by the publisher under the standard
client/server architecture described here, will respond to queries
about the URI and provide: (a) the figure/table itself; (b) metadata
about the figure's creation; (c) (with permissions) data from the
figure/table itself; (d) (with permissions) a related figure/table,
obtained by changing the underlying parameters that created the
original figure, but keeping everything else about the figure's
creation the same.
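As a rough sketch of how such a query might look from a reader's side, under the client/server architecture described above (the archive host, endpoint paths, and identifier below are invented for illustration and are not part of the speakers' implementation):

# Hypothetical client-side query against a VCR-style archive.
# The host, endpoint names, and identifier are invented for illustration.
import json
import urllib.parse
import urllib.request

ARCHIVE = "https://vcr-archive.example.org"
RESULT_ID = "vcr://example.org/2011/paper42/figure3"

def fetch(kind):
    # kind is "artifact", "metadata", or "data" in this sketch.
    url = f"{ARCHIVE}/{kind}?id=" + urllib.parse.quote(RESULT_ID, safe="")
    with urllib.request.urlopen(url) as resp:
        return resp.read()

metadata = json.loads(fetch("metadata"))   # (b) metadata about the figure's creation
figure_bytes = fetch("artifact")           # (a) the figure/table itself
# (c) data and (d) parameter-varied re-runs would require the permissions noted above.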
The whole package we propose -- a URI, its associated content, and the server behaviors that record and serve up that content -- we call a Verifiable Computational Result (VCR).
Our talk will describe an existing implementation of this idea and the
advantages to journals, scientists, and government agencies of this approach.
 
Lessons for Reproducible Science from the DARPA Speech and Language Program (slides, 2:35:30 to 3:07:12 on the audio file)
Mark Liberman, University of Pennsylvania
Since 1987, DARPA has organized most of its speech and language research in terms of formal, quantitative evaluation of computational solutions to well-defined "common task" problems. What began as an attempt to ensure against fraud turned out to be an extraordinarily effective way to foster technical communication and to explore a complex space of problems and solutions. This engineering experience offers some useful (if partial) models for reproducible science, especially in the area of data publication; and it also suggests that the most important effects may be in lowering barriers to entry and in increasing the speed of scientific communication.
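The mechanics of a common-task evaluation are simple to sketch: every team is scored against the same held-out reference data with the same published metric. The snippet below is a generic, hypothetical example of such scoring (word error rate via edit distance), not DARPA's actual evaluation code:

# Generic common-task evaluation sketch: score every submission against the
# same held-out reference with the same metric. Data and names are invented.
def word_error_rate(reference, hypothesis):
    # Levenshtein distance between word sequences, normalized by reference length.
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / max(len(r), 1)

reference = "reproducible research needs shared data and shared metrics"
submissions = {
    "team_a": "reproducible research needs shared data and shared metrics",
    "team_b": "reproducible research need shared data and metrics",
}
for team, hyp in sorted(submissions.items()):
    print(team, round(word_error_rate(reference, hyp), 3))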