How Netflix Reverse Engineered Hollywood

As the thousands of genres flicked by on my little netbook, I began to see other patterns in the data: Netflix had a defined vocabulary. The same adjectives appeared over and over. Countries of origin also showed up, as did a larger-than-expected number of noun descriptions like Westerns and Slashers. There were ways of saying where the idea for the movie came from ("Based on Real Life," "Based on Classic Literature") and where the movies were set ("Set in Edwardian Era"). Of course, there were the various time periods as well ("From the 1980s," and so on), and references to children ("For Ages 8 to 10").

Most intriguingly, there were the subjects, a complete list of which forms a window onto the American soul.

As the hours ticked by, the Netflix grammar (how it pieced together the words to form comprehensible genres) became apparent as well.

If a movie was both romantic and Oscar-winning, Oscar-winning always went to the left: Oscar-winning Romantic Dramas. Time periods always went at the end of the genre: Oscar-winning Romantic Dramas from the 1950s.

The single-word adjectives (such as romantic) could basically just pile up, though, at least to a point: Oscar-winning Romantic Forbidden-Love Movies. 

And the content-area categories were generally tacked onto the end: Oscar-winning Romantic Movies about Marriage.

In fact, there was a hierarchy for each category of descriptor. Generally speaking, a genre would be formed out of a subset of these components:

Region + Adjectives + Noun Genre + Based On... + Set In... + From the... + About... + For Age X to Y
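For readers who think in code, here is a minimal sketch of that ordering rule in Python. The slot names and the `build_genre` helper are my own labels for illustration, not anything Netflix uses; the only thing taken from the data is the left-to-right order of the components.

```python
# Slot order follows the template above; the slot names are illustrative labels.
SLOT_ORDER = [
    "region", "adjectives", "noun_genre", "based_on",
    "set_in", "from_the", "about", "for_age",
]

def build_genre(**parts):
    """Join whichever components are present, always in template order."""
    pieces = []
    for slot in SLOT_ORDER:
        value = parts.get(slot)
        if not value:
            continue  # a genre uses only a subset of the slots
        if isinstance(value, (list, tuple)):
            pieces.extend(value)  # adjectives can pile up
        else:
            pieces.append(value)
    return " ".join(pieces)

print(build_genre(
    from_the="from the 1950s",
    noun_genre="Dramas",
    adjectives=["Oscar-winning", "Romantic"],
))
```

However the components are passed in, the time period lands after the noun genre, just as it does in the real genres.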

There were a few wildcards, too, like everyone's favorite, "With a Strong Female Lead" and "For Hopeless Romantics."

And, of course, there were all the genres that are for movies or TV shows starring or directed by certain individuals. 

But that was it. All 76,897 genres that my bot eventually returned were formed from these basic components. While I couldn't understand that mass of genres, the atoms and logic that were used to create them were comprehensible. I could fully wrap my head around the Netflix system.

I should note that the success of my bot had made me giddy by this point. A few Netflix categories put together are funny and intriguing. What could we do with 76,897 of them?!

And it was then that Ian Bogost, my colleague, suggested that we build the generator you see at the top of this article. 


Imaginary movies for an imaginary genre. Illustration by Darth.

 

Decoding Netflix's Grammar 

To build a generator, however, our understanding of the grammar needed to become more precise. I turned to another piece of software called AntConc, a freeware program maintained by a professor in Japan. It's generally used by linguists, digital humanities scholars, and librarians for working with corpora, or large bodies of text. If you've ever played with Google's Ngram tool, then you've seen at least one of the capabilities of AntConc.

What AntConc can do, essentially, is turn a bunch of text into data that can be manipulated. It can count the number of times each word appears in the mass of text that forms Netflix's database, for example.

So, it becomes trivial to create a list of the top 10 ways that Netflix likes to describe movies in its personalized genres.
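If you don't have AntConc handy, the same kind of word count can be approximated in a few lines of Python. This is a rough stand-in, not a description of how AntConc works internally, and the four genres below are only a tiny sample taken from this article rather than the full list.

```python
from collections import Counter
import re

# A tiny sample of genre strings; the real corpus had tens of thousands.
genres = [
    "Oscar-winning Romantic Dramas from the 1950s",
    "Oscar-winning Romantic Forbidden-Love Movies",
    "Oscar-winning Romantic Movies about Marriage",
    "Post-Apocalyptic Comedies About Friendship",
]

def word_counts(lines):
    """Count each word, ignoring case and keeping hyphenated words whole."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"[a-z0-9'-]+", line.lower()))
    return counts

counts = word_counts(genres)
for word, n in counts.most_common(10):
    print(n, word)
```

Keeping hyphenated terms such as "Oscar-winning" as single tokens is a choice I made here so that the compound adjectives survive the count intact.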

Or you can have it count the appearances of all three-word phrases that begin with "from," which yields the top decades in Netflix genres, with the 1980s rightfully and expectedly on top. When you're looking for an '80s movie, nothing else will do, you know?
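The phrase search works the same way. Here is a hedged Python sketch of counting three-word phrases that start with a given word; the helper name `phrases_starting_with` is hypothetical, and the sample genres are made up for illustration, so the tallies say nothing about the real data.

```python
from collections import Counter

def phrases_starting_with(lines, first_word, n=3):
    """Count every n-word phrase whose first word matches first_word."""
    counts = Counter()
    target = first_word.lower()
    for line in lines:
        words = line.lower().split()
        for i in range(len(words) - n + 1):
            if words[i] == target:
                counts[" ".join(words[i:i + n])] += 1
    return counts

# Illustrative sample only, not real counts from the Netflix data.
sample = [
    "Oscar-winning Romantic Dramas from the 1950s",
    "Classic Action Movies from the 1980s",
    "Family-Friendly Westerns from the 1980s",
]

decades = phrases_starting_with(sample, "from")
print(decades.most_common())
```

Swapping "from" for "set" or "for" (and adjusting the phrase length) would be the equivalent of the other searches described here.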

By searching for phrases beginning with "Set in," I found all the locations mentioned in genres.

By searching for phrases beginning with "For," I created a list of the age-specific genre descriptions. Netflix has content "for kids" generally, as well as for ages 0 to 2, 0 to 4, 2 to 4, 5 to 7, 8 to 10, 8 to 12, and 11 to 12.  

I took all of this data about Netflix's vocabulary and I created one large spreadsheet. Separately, I calculated the top actors, directors, and creators, and stashed those in a separate file. 

Ian then took these spreadsheets and created several different grammars. The first and easiest method just lets lots of adjectives pile up and throws all the different descriptors into the mix very often. That's the GONZO setting in the generator. It outputs amazing stuff that you immediately want to copy and paste to your friends like:

  • Deep Sea Father-and-Son Period Pieces Based on Real Life Set in the Middle East For Kids
  • Assassination Bounty-Hunter Secret Society Dramas Based on Books Set in Europe About Fame For Ages 8 to 10
  • Post-Apocalyptic Comedies About Friendship

Gosh, those are good, no? The second you read one, don't you just want that movie to exist? Can't you just imagine it? All that to say, Gonzo, for me, is films that should exist but won't. Or at least pitches that should exist and might soon.

Then, we scaled back the fun stuff, allowing only a few adjectives into the titles. Suddenly, we found ourselves staring at the extant movie-production logic of the Hollywood studios. Basically: endless recombination of the same few themes.

  • Classic Action Movies
  • Family-Friendly Westerns
  • Buddy Period Pieces

That's the Hollywood button. (And that's Hollywood.)
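To make the difference between the two settings concrete, here is a toy version in Python. It is not Ian's actual code: the word lists are a handful of terms from this article, and the adjective limits and probabilities are numbers I picked as assumptions to mimic the behavior described above.

```python
import random

ADJECTIVES = ["Classic", "Family-Friendly", "Buddy", "Raunchy", "Chilling", "Post-Apocalyptic"]
NOUNS = ["Action Movies", "Westerns", "Period Pieces", "Dramas", "Comedies", "Mysteries"]
# Trailing descriptors, listed in the order the grammar places them.
TAILS = ["Based on Real Life", "Set in Europe", "About Friendship", "For Kids"]

# Assumed knobs: Gonzo piles things on, Hollywood keeps it to adjective + noun.
SETTINGS = {
    "gonzo": {"max_adjectives": 3, "tail_probability": 0.8},
    "hollywood": {"max_adjectives": 1, "tail_probability": 0.0},
}

def generate(setting, rng):
    """Build one made-up genre using the chosen setting."""
    cfg = SETTINGS[setting]
    n_adjectives = rng.randint(1, cfg["max_adjectives"])
    adjectives = rng.sample(ADJECTIVES, n_adjectives)
    tails = [t for t in TAILS if rng.random() < cfg["tail_probability"]]
    return " ".join(adjectives + [rng.choice(NOUNS)] + tails)

rng = random.Random(2014)  # seeded so reruns give the same titles
for setting in ("gonzo", "hollywood"):
    print(setting, "->", generate(setting, rng))
```

With the tail probability at zero, the Hollywood setting can only ever produce an adjective followed by a noun genre, which is the endless recombination in miniature.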

Finally, we played and played around with different grammatical structures until we started to see Netflix's trademark level of specificity. 

  • Raunchy Absurd Slashers
  • Fight-the-System Political Love Triangle Mysteries
  • Chilling Action Movies About Royalty

As we worked on the generator, I could tell someone had gone down this road before. A single human brain had had to make the decisions that we had. How many adjectives? How long should they be? And even more basic: what should the adjectives be? Why cerebral and not brainy? Why differentiate between gory and violent? 

As a writer, I kept asking myself: why are the adjectives just right? Mind-bending and sandal-and-sword (you know, Conan!) and Twisty Tale and Rogue-Cop and Mad Scientist and Underdog and Feel-Good and Understated. 

The words themselves were carefully chosen. By whom?

There were questions we still had, too. From a Los Angeles Times article, we knew the basics of tagging. But how did the tags relate to Netflix's "personalized genres"? What algorithm converted this mass of tags into precisely 76,897 genres?

If most people attempting to understand Netflix's genres were like the classic blind man trying to comprehend an elephant, I felt like I could see the front half of the beast, perhaps, but not the whole thing. I needed someone to explain the back end.

So, after I'd secured my data, I called up Netflix's PR liaison, a Dutch guy named Joris Evers who keeps a miniature windmill on his desk. I told him we had to talk. 

After I filled him in on what we'd done, I waited to hear his reaction, wondering if I was about to have my Netflix account permanently canceled. Instead, he said, "And now you want to come in and talk to Todd Yellin, I guess?"


Alexis C. Madrigal

Alexis Madrigal is the deputy editor of TheAtlantic.com. He's the author of Powering the Dream: The History and Promise of Green Technology.

The New York Observer has called Madrigal "for all intents and purposes, the perfect modern reporter." He co-founded Longshot magazine, a high-speed media experiment that garnered attention from The New York Times, The Wall Street Journal, and the BBC. While at Wired.com, he built Wired Science into one of the most popular blogs in the world. The site was nominated for best magazine blog by the MPA and best science website in the 2009 Webby Awards. He also co-founded Haiti ReWired, a groundbreaking community dedicated to the discussion of technology, infrastructure, and the future of Haiti.

He's spoken at Stanford, CalTech, Berkeley, SXSW, E3, and the National Renewable Energy Laboratory, and his writing was anthologized in Best Technology Writing 2010 (Yale University Press).

Madrigal is a visiting scholar at the University of California at Berkeley's Office for the History of Science and Technology. Born in Mexico City, he grew up in the exurbs north of Portland, Oregon, and now lives in Oakland.
