Inside the Google Books Algorithm

Google built its empire on the power of the link, but books don't have them. Here's how the company attacks the problems of the universal library.

Google is famous for the brilliance of its algorithm for searching web pages. While the company looks at dozens of factors in determining which results to display, the heart of the search engine is using links between pages to rank their relevancy. We have come to depend on Google to give us exactly what we want.

But what about when the company has to reach outside the web? The printed volumes represented on Google Books pose a completely different kind of problem. Google's famous algorithm can't be deployed to search through books because they don't link to each other in the way that web pages do. There is no perfect BookRank analogue to PageRank.

All of which made me wonder: How does Google Books work? What makes it tick? It turns out that it's actually a great place for the company's engineers to learn how to function in a linkless, physical world.

"There is a meaningful effort to say, how do we tune for books? We've got a lot of people doing very focused [work] on the web. How do we take the lessons from what we learned on the web and invent new things that are unique to books?" Matthew Gray, lead software engineer of Google Books, told me.

The system they've come up with has become increasingly sophisticated, as highlighted by their latest tweak, Rich Results, which begins rolling out this afternoon. The feature selectively presents you with one extra-large result when it detects that you're probably searching for an individual title and not a specific mote of information or general topic.

Rich Results is the latest in a series of smaller front-end tweaks that have been matched by backend improvements. The book search algorithm now takes into account more than 100 "signals," individual data categories that Google statistically integrates to rank your results. When you search for a book, Google Books doesn't just look at word frequency or how closely your query matches the title of a book. It also weighs web search frequency, recent book sales, the number of libraries that hold the title, and how often an older book has been reprinted.

So, if you search "Help" now, you get a big blow-up of Kathryn Stockett's 2009 book, not one of the dozens of other books with the same title. Or if you search "dragon tattoo," you get Stieg Larsson's blockbuster, not the 2008 children's book actually called Dragon Tattoo.

"One of the fundamental things we've learned is that the whole is greater than the sum of the parts," Gray said.

This is deeply Google thinking but without the dominant algorithm. It's a Google subspecies that evolved by feeding on a different corpus. There is less data about books than web pages, but there is more structure to it, and there's less spam to contend with. Yet the focus on optimizing an experience from vast amounts of data remains. "You want it to have the standard Google quality as much as possible," Gray said. "[You want it to be] a merger of relevance and utility based on all these things."

The most difficult part of making Google Books work, said James Crawford, the team's engineering director, was determining the intent of the service's heterogeneous user base. Scholars who search Google Books have very different wants and expectations from casual users looking to find a trade fiction title.

"Sometimes they are looking for a preview. Sometimes they are looking for information about that book. Third, they want to buy a copy of that book," Crawford said.

Rich Results will help people who are looking specifically for a title, but Crawford said that they aren't ruling out other presentations or features for other user types (e.g., quasi-scholars like myself).

All the Google Books tweaks I've noticed are small. Earlier this year, they introduced a sidebar for customizing your search. This summer, they added a Books-specific "Suggest" function, so when you type "sh" you get the suggestion of "Sherlock Holmes" instead of "Shoppers," which is what you get on the web. Now you can sort by date, too, or restrict your queries by subject.
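
The Books-specific suggestions can be pictured as the same prefix lookup run against a different query log. The sketch below assumes that each corpus keeps a count of past queries and that suggestions are simply the most frequent queries starting with what you have typed. The function name `suggest` and the counts are hypothetical, and Google's actual Suggest feature is certainly more involved.

```python
from collections import Counter


def suggest(prefix: str, query_counts: Counter, limit: int = 3) -> list[str]:
    """Return the most frequent past queries that start with the prefix."""
    prefix = prefix.lower()
    matches = [
        (query, count)
        for query, count in query_counts.items()
        if query.lower().startswith(prefix)
    ]
    # Most popular first; ties broken alphabetically for stable output.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [query for query, _ in matches[:limit]]


# Invented counts for two separate corpora.
books_queries = Counter({"Sherlock Holmes": 120, "Shakespeare": 90, "Shoppers": 2})
web_queries = Counter({"Shoppers": 150, "Sherlock Holmes": 40})

print(suggest("sh", books_queries, limit=1))
print(suggest("sh", web_queries, limit=1))
```

The same two letters produce different top suggestions only because the two logs differ, which is the point of tuning the feature for books.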

But add them all up and apply them to the 15 million books Google has scanned, and the truly unprecedented nature of Google Books starts to emerge. It's not perfect -- and the Google Books Settlement is a whole separate issue -- but it is unique.

"We're in the middle of doing something radical. No one has ever pulled together this whole collection, scanning books from 40 different libraries," Crawford said. "I would say our general approach here has been to just get the books scanned because until they are digitized and OCR is done, you aren't even in the game. As we get more and more content online, the work that Matthew's team [does] gets to be more and more important and more and more doable."