How Netflix Reverse Engineered Hollywood

After some troubleshooting and help from Bogost, the bot got up and running and simply copied and pasted from URL after URL, essentially replicating a human doing the work. It took nearly a day of constantly running a little Asus laptop in the corner of our kitchen to grab it all.


Imaginary movies for an imaginary genre. Illustration by Darth.

 

As the software ran, I began to familiarize myself with the data. I randomly selected a snippet, so you can see what the raw genre data looks like:

Emotional Independent Sports Movies
Spy Action & Adventure from the 1930s
Cult Evil Kid Horror Movies
Cult Sports Movies
Sentimental set in Europe Dramas from the 1970s
Visually-striking Foreign Nostalgic Dramas
Japanese Sports Movies
Gritty Discovery Channel Reality TV
Romantic Chinese Crime Movies
Mind-bending Cult Horror Movies from the 1980s
Dark Suspenseful Sci-Fi Horror Movies
Gritty Suspenseful Revenge Westerns
Violent Suspenseful Action & Adventure from the 1980s
Time Travel Movies starring William Hartnell
Romantic Indian Crime Dramas
Evil Kid Horror Movies
Visually-striking Goofy Action & Adventure
British set in Europe Sci-Fi & Fantasy from the 1960s
Dark Suspenseful Gangster Dramas
Critically-acclaimed Emotional Underdog Movies

The first thing that I noticed was that not every genre had streaming movies attached to it. The reason for that is the streaming catalog rotates and the genres that I was looking at represented the total possible universe of different genres, not just the ones that people were being shown on that particular day in this particular geography (the United States). So, right now, category 91,300, "Feel-good Romantic Spanish-Language TV Shows" doesn't show me anything I can stream. But category 91,307, "Visually Striking Latin American Comedies" has two movies and category 6,307, "Visually Striking Romantic Dramas" has 20. 

So this is the main caveat to keep in mind as we go through this data: The existence of a genre in the database doesn't precisely correspond to the number of movies that Netflix has in its vaults. All the genre's existence means is that, based on an algorithm we'll get into later, there are some movies out there that fit the description.

As the thousands of genres flicked by on my little netbook, I began to see other patterns in the data: Netflix had a defined vocabulary. The same adjectives appeared over and over. Countries of origin also showed up, as did a larger-than-expected number of noun descriptions like Westerns and Slashers. There were ways of saying where the idea for the movie came from ("Based on Real Life," "Based on Classic Literature") and where the movies were set ("Set in Edwardian Era"). Of course, there were the various time periods as well ("from the 1980s," and so on) and references to children ("For Ages 8 to 10").

Most intriguingly, there were the subjects, a complete list of which forms a window unto the American soul:

As the hours ticked by, the Netflix grammar—how it pieced together the words to form comprehensible genres—began to become apparent as well.

If a movie was both romantic and Oscar-winning, Oscar-winning always went to the left: Oscar-winning Romantic Dramas. Time periods always went at the end of the genre: Oscar-winning Romantic Dramas from the 1950s.

The single-word adjectives (such as romantic) could basically just pile up, though, at least to a point: Oscar-winning Romantic Forbidden-Love Movies. 

And the content-area categories were generally tacked onto the end: Oscar-winning Romantic Movies about Marriage.

In fact, there was a hierarchy for each category of descriptor. Generally speaking, a genre would be formed out of a subset of these components:

Region + Adjectives + Noun Genre + Based On... + Set In... + From the... + About... + For Age X to Y
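
The slot order above can be sketched in a few lines of Python. This is only an illustration of the pattern as described here, not Netflix's actual code: the function name build_genre, its parameter names, and the default noun "Movies" are all assumptions made for the example. It fills whichever slots are given and joins them left to right.

```python
from typing import Optional, Sequence


def build_genre(
    noun: str = "Movies",              # noun genre, e.g. "Dramas" (default is an assumption)
    region: Optional[str] = None,      # e.g. "Japanese"
    adjectives: Sequence[str] = (),    # e.g. ["Oscar-winning", "Romantic"]
    based_on: Optional[str] = None,    # e.g. "Based on Real Life"
    set_in: Optional[str] = None,      # e.g. "Set in Edwardian Era"
    from_the: Optional[str] = None,    # e.g. "from the 1950s"
    about: Optional[str] = None,       # e.g. "about Marriage"
    for_ages: Optional[str] = None,    # e.g. "For Ages 8 to 10"
) -> str:
    """Join whichever components are present, in the fixed slot order."""
    slots = [region, *adjectives, noun, based_on, set_in, from_the, about, for_ages]
    # Empty slots are skipped, so a genre can use any subset of components.
    return " ".join(part for part in slots if part)


print(build_genre(noun="Dramas",
                  adjectives=["Oscar-winning", "Romantic"],
                  from_the="from the 1950s"))
print(build_genre(adjectives=["Oscar-winning", "Romantic"],
                  about="about Marriage"))
```

The two calls reproduce genres quoted earlier: "Oscar-winning Romantic Dramas from the 1950s" and "Oscar-winning Romantic Movies about Marriage." The real data is looser than this sketch. A scraped genre like "Sentimental set in Europe Dramas from the 1970s" puts the setting before the noun, which is why the order only holds "generally speaking."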

There were a few wildcards, too, like everyone's favorite, "With a Strong Female Lead" and "For Hopeless Romantics."

And, of course, there were all the genres that are for movies or TV shows starring or directed by certain individuals. 

But that was it. All 76,897 genres that my bot eventually returned were formed from these basic components. While I couldn't understand that mass of genres, the atoms and logic that were used to create them were comprehensible. I could fully wrap my head around the Netflix system.

I should note that the success of my bot had made me giddy by this point. A few Netflix categories put together are funny and intriguing. What could we do with 76,897 of them?!

And it was then that Ian Bogost, my colleague, suggested that we build the generator you see at the top of this article. 


Decoding Netflix's Grammar 

To build a generator, however, our understanding of the grammar needed to get precise. I turned to another piece of software called AntConc, a freeware program maintained by a professor in Japan. It's generally used by linguists, digital humanities scholars, and librarians for dealing with corpuses, meaning large bodies of text. If you've ever played with Google's Ngram tool, then you've seen at least one of the capabilities of AntConc.

What AntConc can do, essentially, is turn a bunch of text into data that can be manipulated. It can count the number of times each word appears in the mass of text that forms Netflix's database, for example.
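
To give a feel for that kind of word counting, here is a rough Python equivalent run on five of the genres from the raw snippet shown earlier. It is not how AntConc works internally. It simply splits each genre on whitespace and tallies the pieces, and the name word_counts is made up for this example.

```python
from collections import Counter

# Five genres taken from the raw snippet earlier in the article.
genres = [
    "Cult Evil Kid Horror Movies",
    "Cult Sports Movies",
    "Japanese Sports Movies",
    "Evil Kid Horror Movies",
    "Dark Suspenseful Sci-Fi Horror Movies",
]


def word_counts(titles):
    """Count how often each whitespace-separated word appears across titles."""
    counts = Counter()
    for title in titles:
        counts.update(title.split())
    return counts


counts = word_counts(genres)
print(counts.most_common(2))  # [('Movies', 5), ('Horror', 3)]
```

Even on five lines, the repeated vocabulary shows up: "Movies" and "Horror" dominate, and "Cult," "Evil," "Kid," and "Sports" each appear twice. Splitting on whitespace breaks multi-word terms such as "Evil Kid" into separate words, so a careful count would need to treat phrases as units.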
