Inside the Google Books Algorithm



Google is famous for the brilliance of its algorithm for searching web pages. While the company looks at dozens of factors in determining which results to display, the heart of the search engine is using links between pages to rank their relevancy. We have come to depend on Google to give us exactly what we want.

But what about when the company has to reach outside the web? The printed volumes represented on Google Books pose a completely different kind of problem. Google's link-based algorithm can't be deployed to search through books because they don't link to each other in the way that webpages do. There is no perfect BookRank counterpart to PageRank.

All of which made me wonder: How does Google Books work? What makes it tick? It turns out that it's actually a great place for the company's engineers to learn how to function in a linkless, physical world.

"There is a meaningful effort to say, how do we tune for books? We've got a lot of people [who are] very focused on the web. How do we take the lessons from what we learned on the web and invent new things that are unique to books?" Matthew Gray, lead software engineer of Google Books, told me.

The system they've come up with has become increasingly sophisticated, as highlighted by their latest tweak, Rich Results, which begins rolling out this afternoon. The feature selectively presents you with one extra-large result when it detects that you're probably searching for an individual title and not a specific mote of information or general topic.

Rich Results is the latest in a series of smaller front-end tweaks that have been matched by backend improvements. The book search algorithm now takes into account more than 100 "signals," individual data categories that Google statistically integrates to rank your results. When you search for a book, Google Books doesn't just look at word frequency or how closely your query matches the title of a book. It also weighs web search frequency, recent book sales, the number of libraries that hold the title, and how often an older book has been reprinted.
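Nothing Gray told me spells out how those signals are actually combined, so the following is only a toy sketch of the general idea: several signals folded into one score that decides the ordering. The `Book` fields, the weights, the log scaling, and every number in the example are my own assumptions for illustration, not Google's formula or data.

```python
import math
from dataclasses import dataclass


@dataclass
class Book:
    title: str
    title_match: float     # 0.0-1.0: how closely the query matches the title
    web_searches: int      # how often people search the web for this book
    recent_sales: int      # recent copies sold
    library_holdings: int  # number of libraries that hold the title
    reprints: int          # how many times the book has been reprinted


# Made-up weights for illustration; a real system would tune these statistically.
WEIGHTS = {
    "title_match": 3.0,
    "web_searches": 1.0,
    "recent_sales": 1.0,
    "library_holdings": 0.5,
    "reprints": 0.5,
}


def score(book: Book) -> float:
    """Weighted sum of signals; raw counts are log-scaled to dampen huge values."""
    return (
        WEIGHTS["title_match"] * book.title_match
        + WEIGHTS["web_searches"] * math.log1p(book.web_searches)
        + WEIGHTS["recent_sales"] * math.log1p(book.recent_sales)
        + WEIGHTS["library_holdings"] * math.log1p(book.library_holdings)
        + WEIGHTS["reprints"] * math.log1p(book.reprints)
    )


def rank(books: list[Book]) -> list[Book]:
    """Return books ordered from highest to lowest combined score."""
    return sorted(books, key=score, reverse=True)


# Invented numbers: an exact-title match with little popularity versus
# a partial-title match with heavy search, sales, and library signals.
candidates = [
    Book("Dragon Tattoo", 1.0, 50, 10, 20, 0),
    Book("The Girl with the Dragon Tattoo", 0.6, 500_000, 200_000, 5_000, 3),
]
best = rank(candidates)[0]
print(best.title)
```

With these invented inputs, the popularity signals outweigh the exact title match, which is the behavior the "dragon tattoo" search illustrates.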

So, if you search "Help" now, you get a big blow-up of Kathryn Stockett's 2009 book, not one of the dozens of other books with the same title. Or if you search "dragon tattoo," you get Stieg Larsson's blockbuster, not the 2008 children's book actually called Dragon Tattoo.
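How Rich Results decides that you are after one particular title was not explained to me in detail. One plausible way to think about it, and this is purely my assumption, is that the big result appears only when the top-scoring book clearly dominates the runner-up. The function name `pick_rich_result`, the `min_ratio` threshold, and the scores below are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def pick_rich_result(
    scored: Sequence[Tuple[str, float]],
    min_ratio: float = 2.0,  # hypothetical threshold, not a known Google value
) -> Optional[str]:
    """Return the title to feature, or None if no single book stands out."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if not ranked or ranked[0][1] <= 0:
        return None  # nothing to show, or the best score carries no weight
    if len(ranked) == 1:
        return ranked[0][0]  # only one candidate, so it wins by default
    top, runner_up = ranked[0], ranked[1]
    # Feature the top book only when it beats the runner-up by min_ratio.
    if runner_up[1] <= 0 or top[1] / runner_up[1] >= min_ratio:
        return top[0]
    return None


# Invented scores: one dominant title versus two near-equal ones.
print(pick_rich_result([("The Help", 30.0), ("Help", 9.0)]))
print(pick_rich_result([("Book A", 10.0), ("Book B", 9.0)]))
```

Under this sketch, a query with one runaway favorite gets the extra-large treatment, while a query with several comparable matches falls back to the ordinary list.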

"One of the fundamental things we've learned is that the whole is greater than the sum of the parts," Gray said.

This is deeply Google thinking but without the dominant algorithm. It's a Google subspecies that evolved by feeding on a different corpus. There is less data about books than web pages, but there is more structure to it, and there's less spam to contend with. Yet the focus on optimizing an experience from vast amounts of data remains. "You want it to have the standard Google quality as much as possible," Gray said. "[You want it to be] a merger of relevance and utility based on all these things."

googlebooks2.jpg

The most difficult part of making Google Books work, said James Crawford, the team's engineering director, was determining the intent of the service's heterogeneous user base. Scholars who search Google Books have very different wants and expectations from casual users looking to find a trade fiction title.

"Sometimes they are looking for a preview. Sometimes they are looking for information about that book. Third, they want to buy a copy of that book," Crawford said.

Rich Results will help people who are looking specifically for a title, but Crawford said that they aren't ruling out other presentations or features for other user types (e.g., quasi-scholars like me).

All the Google Books tweaks I've noticed are small. Earlier this year, they introduced a sidebar for customizing your search. This summer, they added a Books-specific "Suggest" function, so when you type "sh" you get the suggestion of "Sherlock Holmes" instead of "Shoppers," which is what you get on the web. Now you can sort by date, too, or restrict your queries by subject.
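The difference between the Books and web suggestions can be pictured as the same prefix lookup run against two different popularity tables. This is a minimal sketch under that assumption; the `suggest` function and both tables of counts are invented for illustration and say nothing about how Google's Suggest is really built.

```python
def suggest(prefix: str, query_counts: dict[str, int], limit: int = 3) -> list[str]:
    """Return up to `limit` queries starting with `prefix`, most popular first."""
    needle = prefix.lower()
    matches = [
        (query, count)
        for query, count in query_counts.items()
        if query.lower().startswith(needle)
    ]
    # Sort by descending popularity, then alphabetically to break ties.
    matches.sort(key=lambda pair: (-pair[1], pair[0]))
    return [query for query, _ in matches[:limit]]


# Invented popularity counts for two separate corpora.
BOOK_QUERIES = {"Sherlock Holmes": 900, "Shakespeare": 700, "Shoppers": 5}
WEB_QUERIES = {"Shoppers": 800, "Sherlock Holmes": 300, "Shakespeare": 250}

print(suggest("sh", BOOK_QUERIES))
print(suggest("sh", WEB_QUERIES))
```

The lookup code is identical in both calls; only the corpus behind it changes, which is why the same two letters lead to different suggestions.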

But add them all up and apply them to the 15 million books Google has scanned, and the truly unprecedented nature of Google Books starts to emerge. It's not perfect -- and the Google Books Settlement is a whole separate issue -- but it is unique.

"We're in the middle of doing something radical. No one has ever pulled together this whole collection, scanning books from 40 different libraries," Crawford said. "I would say our general approach here has been to just get the books scanned because until they are digitized and OCR is done, you aren't even in the game. As we get more and more content online, the work that Matthew's team [does] gets to be more and more important and more and more doable."


Alexis C. Madrigal

Alexis Madrigal is a senior editor at The Atlantic, where he oversees the Technology Channel. He's the author of Powering the Dream: The History and Promise of Green Technology.

The New York Observer calls Madrigal "for all intents and purposes, the perfect modern reporter." He co-founded Longshot magazine, a high-speed media experiment that garnered attention from The New York Times, The Wall Street Journal, and the BBC. While at Wired.com, he built Wired Science into one of the most popular blogs in the world. The site was nominated for best magazine blog by the MPA and best science Web site in the 2009 Webby Awards. He also co-founded Haiti ReWired, a groundbreaking community dedicated to the discussion of technology, infrastructure, and the future of Haiti.

He's spoken at Stanford, CalTech, Berkeley, SXSW, E3, and the National Renewable Energy Laboratory, and his writing was anthologized in Best Technology Writing 2010 (Yale University Press).

Madrigal is a visiting scholar at the University of California at Berkeley's Office for the History of Science and Technology. Born in Mexico City, he grew up in the exurbs north of Portland, Oregon, and now lives in Oakland.
