It's The Computer Code, Silly, Ctd

A reader writes:

As someone who worked as an experimental physicist for more than 15 years, I have to take issue with your reader's characterization of the data as a "hopeless mess."  Data sets acquired by different groups at different times with (presumably) different instrumentation and methodologies will have different formats and systematic biases.  It's necessary to correct for these if the aggregate is to be properly understood.  This process is often painstaking and "messy."  But it doesn't indicate a problem with the code.
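To make that concrete, here is a deliberately tiny sketch (in Python) of what "correcting for systematic biases" can mean in the simplest case: each group's instrument reads consistently high or low by a known, calibrated offset, which you remove before aggregating.  The group names, offsets, and readings are all invented for illustration.

```python
# Toy illustration of a systematic-bias correction.  Each group's
# instrument is assumed to read high or low by a known, calibrated
# offset.  Every name and number here is invented.

CALIBRATION_OFFSETS = {"group_a": +0.3, "group_b": -0.1}  # degrees C

def correct(source, readings):
    """Subtract the source's known systematic offset from each reading."""
    offset = CALIBRATION_OFFSETS[source]
    return [r - offset for r in readings]

raw = {"group_a": [15.6, 15.9], "group_b": [15.1, 15.3]}
corrected = [v for src, vals in raw.items() for v in correct(src, vals)]
print(sum(corrected) / len(corrected))  # aggregate only after correcting
```

Real corrections are rarely a constant offset -- they depend on instrument drift, siting, time of observation, and so on -- which is exactly why the process is painstaking.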

The readme file that your reader points to looks to me like a detailed set of procedural notes being kept by a person eyeball-deep in this difficult, often frustrating work.  Scanning it, my initial impression is that this person was approaching the problem with a great deal of integrity.  That said, I agree that avoiding bias when reducing data is a very tricky problem.  In fact, I think that it is by far the most difficult thing that experimentalists do.  Good scientists never stop worrying about bias and, as Feynman points out, they are not always successful in avoiding it.  But over time, as experiments are repeated and hypotheses are tested in new contexts, scientific communities reach a consensus -- until a result comes along that overturns it.  By the way, there is no better way to establish your reputation as a scientist than overturning a consensus.  It's true that it's not easily done, but any scientist would be thrilled to do it.

Another reader adds:

One of the big problems I've been seeing in the commentary surrounding Climategate is a fundamental lack of understanding of the culture of scientists and software developers.  I saw the post you made earlier about the readme files that came out of Climategate.  I am a software developer, and this looks like the same kind of thing that every developer on any large-scale project deals with.

It definitely looks like the incoming data files were a total mess, but that's pretty standard for any massive collection of data assembled with poorly defined standards in a large, distributed fashion.  The commentary walks through what had to be done to turn that data into something coherent, and it was clearly written for other computer-systems people to read.  The offhand remarks about how things were done, and how dumb that was, are routine in programmer culture.  So if you look at this without that perspective it can seem somewhat nefarious, but the reality is that this is how the world of software development works.
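For anyone who hasn't lived through this kind of cleanup, it tends to look something like the sketch below.  Everything in it is hypothetical -- the field layouts, the -9999 missing-value sentinel, the function names -- it's meant only to show how files in mismatched formats get coerced into one coherent record set, not to reproduce CRU's actual code.

```python
# Hypothetical sketch of a normalization pass over messy station files.
# The field layouts and the -9999 missing-value sentinel are assumptions
# for illustration, not CRU's real formats.

def parse_fixed_width(line):
    """Assumed legacy layout: 7-char station id, 5-char year, 12 x 5-char temps."""
    station_id = line[0:7].strip()
    year = int(line[7:12])
    temps = [int(line[12 + 5 * i : 12 + 5 * (i + 1)]) for i in range(12)]
    return station_id, year, temps

def parse_csv_fields(fields):
    """Assumed newer layout: station id, year, then 12 monthly values."""
    return fields[0], int(fields[1]), [int(v) for v in fields[2:14]]

def normalize(records):
    """Raw values are tenths of a degree; -9999 means missing."""
    for station_id, year, temps in records:
        cleaned = [None if t == -9999 else t / 10.0 for t in temps]
        if all(t is None for t in cleaned):
            print(f"note: {station_id}/{year} entirely missing, skipped")
            continue
        yield station_id, year, cleaned

for rec in normalize([("0012345", 1987, [102, 98, -9999] + [55] * 9)]):
    print(rec)
```

The running notes in a readme like Harry's are the natural byproduct of writing and debugging dozens of passes like this one.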

I've seen similar commentary about the e-mails discussing the "tricks" scientists used.  But if you talk to an actual scientist, "trick" is just lingo -- a casual way of referring to some algorithm or methodology you used.  I will grant that in science there is a risk of coming out with the result you want rather than the result you should get.  That's why you have rigorous peer review: it creates a somewhat competitive environment that weeds out other people's bad assumptions.

Sure, much of this looks sketchy to an outsider, but I guarantee you that if you dug through the e-mails of any large organization in a similar manner, you'd find all manner of seemingly sketchy things.  It's a side effect of the casual language we use in e-mail and the assumed context that goes unstated.  That doesn't mean they were hiding something; it just means they were doing their jobs like everybody else in this world.

Yet another reader:

Where's the quote from the text file suggesting that temperature data was made up at all, let alone made up to conform to expectations?  A cursory reading suggests that this is the section your reader is referencing: "What the hell is supposed to happen here? Oh yeah - there is no 'supposed', I can make it up. So I have."  What is Harry talking about here?  He's not talking about making up temperature data!  He's talking about cases where a single weather station has a gap in its record.  What should he do with the data in that situation, since the analysis requires continuous data?  Harry lays out three options: treat the data as contiguous, throw out the more recent data, or treat the data as coming from two separate stations at the same location.  It's not pretty, but it's not making things up.
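A minimal sketch of those three options, assuming a station record is just a list of (year, value) pairs -- that representation, and everything else below, is invented for illustration and is not Harry's code:

```python
# Minimal sketch of the three gap-handling options described above.
# The (year, value) record format is an assumption for illustration.

def split_at_gap(record):
    """Return (before, after) around the first gap in yearly coverage."""
    for i in range(1, len(record)):
        if record[i][0] != record[i - 1][0] + 1:
            return record[:i], record[i:]
    return record, []

def option_contiguous(record):
    """Option 1: ignore the gap and treat the series as continuous."""
    return record

def option_truncate(record):
    """Option 2: throw out everything after the gap."""
    return split_at_gap(record)[0]

def option_split(record, station_id):
    """Option 3: treat the segments as separate stations at one location."""
    before, after = split_at_gap(record)
    return {station_id + "-a": before, station_id + "-b": after}

record = [(1950, 14.1), (1951, 14.3), (1960, 14.9), (1961, 15.0)]
print(option_truncate(record))          # [(1950, 14.1), (1951, 14.3)]
print(option_split(record, "0012345"))  # two segments, one location
```

None of these is "making it up" in the fabrication sense; each is a documented, defensible choice about how to handle a hole in the record.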

Is there any actual evidence that the data in the CRU database was massaged to match expectations of warming?  I haven't seen any, and your reader doesn't point to any.  As a scientist, I'll be the first to say that neither I nor any scientist I know can claim to be strictly objective, but that doesn't mean we make up data.  It's the one thing you don't do, and there are very, very few cases of conspiracies amongst scientists to do so (there are several examples of individual scientists doing so, however).

What's interesting about the Millikan oil-drop experiment -- the example Feynman uses -- is that Millikan originally massaged his data to make it look more precise.  The original measurement was somewhat fraudulent.  Analysis has shown that Mendel committed similar crimes in his work on heredity.  More here.  Making up or massaging data is fundamentally different from assuming that data which conforms to the status quo is correct.