To get around this problem, the In Codice Ratio team had to teach their software some common sense—practical intelligence. They found a corpus of 1.5 million already-digitized Latin words, and examined them in two- and three-letter combinations. From this, they determined which combinations of letters are common, and which never occur. The OCR software could then use those statistics to assign probabilities to different strings of letters. As a result, the software learned that nn is far more likely than iiii.
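The statistical idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the In Codice Ratio team's actual code: the tiny word list stands in for the 1.5-million-word corpus, and the function names (`train`, `log_prob`), the padding symbols, and the add-k smoothing constant are all hypothetical choices made for this example.

```python
import math
from collections import Counter

# Toy stand-in for the digitized Latin corpus (assumption: the real one is far larger).
CORPUS = ["annus", "anno", "annis", "omnis", "omnes", "dominus", "domini",
          "nomen", "nomine", "anima", "minus", "in", "non", "enim", "unus", "una"]

def train(words, n=3):
    """Count letter n-grams and their (n-1)-letter contexts, padding word boundaries."""
    ngrams, contexts = Counter(), Counter()
    alphabet = {"$"}  # "$" marks the end of a word
    for word in words:
        word = word.lower()
        alphabet.update(word)
        padded = "^" * (n - 1) + word + "$"  # "^" marks the start of a word
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts, len(alphabet)

def log_prob(word, model, n=3, k=0.1):
    """Score a candidate reading; add-k smoothing keeps unseen n-grams above zero."""
    ngrams, contexts, vocab_size = model
    padded = "^" * (n - 1) + word.lower() + "$"
    total = 0.0
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        # Counter returns 0 for n-grams that never occurred in the corpus.
        total += math.log((ngrams[gram] + k) / (contexts[gram[:-1]] + k * vocab_size))
    return total

model = train(CORPUS)

# Two candidate readings of the same run of pen strokes.
candidates = ["annus", "aiiiius"]
best = max(candidates, key=lambda w: log_prob(w, model))
```

Here `best` is the reading whose letter combinations look most like the training words, which is how a recognizer could prefer nn over iiii when the strokes alone are ambiguous.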
With this refinement in place, the OCR was finally ready to read some texts on its own. The team decided to feed it some documents from the Vatican Registers, a more than 18,000-page subset of the Secret Archives consisting of letters to European kings, rulings on legal matters, and other correspondence.
The initial results were mixed. In texts transcribed so far, a full one-third of the words contained one or more typos, places where the OCR guessed the wrong letter. If yov were tryinj to read those lnies in a bock, that would gct very aiiiioying. (The most common typos involved m/n/i confusion and another commonly confused pair: the letter f and an archaic, elongated form of s.) Still, the software got 96 percent of all handwritten letters correct. And even “imperfect transcriptions can provide enough information and context about the manuscript at hand” to be useful, says Merialdo.
Like other machine-learning systems, the software should improve over time as it digests more text. Even more exciting, the general strategy of In Codice Ratio, which combines jigsaw segmentation with crowdsourced training of the software, could easily be adapted to read texts in other languages. This could potentially do for handwritten documents what Google Books did for printed matter: open up letters, journals, diaries, and other papers to researchers around the world, making it far easier both to read these documents and to search for relevant material.
That said, relying on artificial intelligence does have limitations, says Rega Wood, a historian of philosophy and paleographer (expert on ancient handwriting) at Indiana University. It “will be problematic for manuscripts that are not professionally written but copied by nonprofessionals,” she says, since the handwriting and letter shapes will vary far more in those documents, making it harder to teach the OCR. In addition, in cases where there’s only a small sample size of material to work with, “it is not only more accurate, but just as quick to make transcriptions without such technology.”
Pace Dan Brown, the “secret” in the Vatican Secret Archives’ name doesn’t refer to anything clandestine or conspiratorial. It merely means that the archives are the personal property of the pope; “private archives” would probably be a better translation of the original name, Archivum Secretum. Still, until recently, the VSA might as well have been secret to most of the world—locked away and largely inaccessible. “It is amazing for us to bring these manuscripts back to life,” Merialdo says, “and make their comprehension available to everybody.”