Buggy Software: Achilles Heel of Big-Data-Powered Science?

We've heard a lot about scientific fraud recently, and it's a serious concern. But how reliable are honest research results? On the science website iSGTW, the journalist Adrian Giordani points to a growing concern with software defects:

In October 2012, a workshop about maintainable software practices in e-science highlighted that unchecked errors in code have caused retractions in major research papers. For example, in December 2006, Geoffrey Chang from the Department of Molecular Biology at the Scripps Research Institute, California, US, was horrified when a hand-written program flipped two columns of data, inverting an electron-density map. As a result, a number of papers had to be retracted from the journal Science.

In an earlier paper that Giordani co-authored with Leslie Hatton, a professor of forensic software engineering, Hatton had written:

I (Hatton) have worked for 40 years in meteorology, seismology, and computing, and most of the software I've used has been corrupted to some extent by such defects - no matter how earnestly the programmers performed their feats of testing. The defects, when they eventually surface, always seem to come as a big surprise.

The defects themselves arise from many causes, including: a requirement might not be understood correctly; the physics could be wrong; there could be a simple typographical error in the code, such as a + instead of a - in a formula; the programmer may rely on a subtle feature of a programming language which is not defined properly, such as uninitialized variables; there may be numerical instabilities such as over-flow, under-flow or rounding errors; or basic logic errors in the code. The list is very large. All are essentially human in one form or another but are exacerbated by the complexity of programming languages, the complexity of algorithms, and the sheer size of the computations.
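To make the "numerical instabilities" in that list concrete, here is a minimal Python sketch. It is an illustration only, not code from Hatton's paper or from any of the programs mentioned here; the function names and the sample data are invented for the example. It computes the same variance two ways: a textbook one-pass formula that subtracts two huge, nearly equal numbers, and a two-pass formula that avoids doing so.

```python
def variance_naive(xs):
    """Population variance via E[x^2] - (E[x])^2.

    Algebraically correct, but it subtracts two huge, nearly equal
    floating-point numbers, so rounding error can swamp the answer.
    """
    n = len(xs)
    mean = sum(xs) / n
    mean_of_squares = sum(x * x for x in xs) / n
    return mean_of_squares - mean * mean


def variance_two_pass(xs):
    """Population variance via the mean of squared deviations.

    Subtracting the mean first keeps the intermediate values small.
    """
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


# Hypothetical measurements: small fluctuations on top of a large offset.
# The true population variance of these four values is 22.5.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

if __name__ == "__main__":
    print("two-pass:", variance_two_pass(data))
    print("naive:   ", variance_naive(data))
    # Rounding and underflow also show up in one-liners:
    print("0.1 + 0.2 == 0.3 ?", 0.1 + 0.2 == 0.3)
    print("1e-200 * 1e-200 =", 1e-200 * 1e-200)
```

Both functions pass a casual test on small numbers, which is the point: a defect of this kind stays hidden until the data happen to land in the range where it bites.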

A site called RunMyCode, developed by the Columbia University computer scientist Victoria Stodden, helps scientists discover errors by sharing code and data, accelerating replication of experiments. And fortunately many software glitches, as in more familiar consumer products, occur only occasionally and don't seriously affect functionality most of the time. The real problem is that once in a while a software error can turn lethal, as it did notoriously in radiation therapy in the 1980s.

Replicability can help scientists correct inevitable bugs. But what about the many other programs that govern everyday life, from forensic lab tests to credit scores and anti-terrorism watch lists, whose code is often a commercial or national security secret? In the Defense Department's troubled F-35 jet program,

the "gorilla in the room," [the project manager Air Force Major] General [Christopher] Bogdan said, is testing and securing the 24 million lines of software code for the plane and its support systems, a mountain of instructions that goes far beyond what has been tried in any plane.

In civilian life, for example in credit score calculations, there is often no effective appeal from these programs' results. The question now is whether we can develop better tools for catching false positives and false negatives before they do serious damage.