If you use Netflix, you've probably wondered about the specific genres that it suggests to you. Some of them just seem so specific that it's absurd. Emotional Fight-the-System Documentaries? Period Pieces About Royalty Based on Real Life? Foreign Satanic Stories from the 1980s?
If Netflix can show such tiny slices of cinema to any given user, and they have 40 million users, how vast did their set of "personalized genres" need to be to describe the entire Hollywood universe?
This idle wonder turned to rabid fascination when I realized that I could capture each and every microgenre that Netflix's algorithm has ever created.
Through a combination of elbow grease and spam-level repetition, we discovered that Netflix possesses not several hundred genres, or even several thousand, but 76,897 unique ways to describe types of movies.
There are so many that just loading, copying, and pasting all of them took the little script I wrote more than 20 hours.
We've now spent several weeks understanding, analyzing, and reverse-engineering how Netflix's vocabulary and grammar work. We've broken down its most popular descriptions, and counted its most popular actors and directors.
To my (and Netflix's) knowledge, no one outside the company has ever assembled this data before.
What emerged from the work is this conclusion: Netflix has meticulously analyzed and tagged every movie and TV show imaginable. They possess a stockpile of data about Hollywood entertainment that is absolutely unprecedented. The genres that I scraped and that we caricature above are just the surface manifestation of this deeper database.
Netflix cooperated with my quest to understand what they internally call "altgenres," and made VP of product innovation Todd Yellin, the man who conceived of the system, available for an in-depth interview. Georgia Tech professor and Atlantic contributing editor, Ian Bogost, worked closely with me recreating the Netflix grammar, and he programmed the magical genre generator above.
If we reverse engineered Yellin's system, it was Yellin himself who imagined a much more ambitious reverse-engineering process. Using large teams of people specially trained to watch movies, Netflix deconstructed Hollywood. They paid people to watch films and tag them with all kinds of metadata. This process is so sophisticated and precise that taggers receive a 36-page training document that teaches them how to rate movies on their sexually suggestive content, goriness, romance levels, and even narrative elements like plot conclusiveness.
They capture dozens of different movie attributes. They even rate the moral status of characters. When these tags are combined with millions of users viewing habits, they become Netflix's competitive advantage. The company's main goal as a business is to gain and retain subscribers. And the genres that it displays to people are a key part of that strategy. "Members connect with these [genre] rows so well that we measure an increase in member retention by placing the most tailored rows higher on the page instead of lower," the company revealed in a 2012 blog post. The better Netflix shows that it knows you, the likelier you are to stick around.
And now, they have a terrific advantage in their efforts to produce their own content: Netflix has created a database of American cinematic predilections. The data can't tell them how to make a TV show, but it can tell them what they should be making. When they create a show like House of Cards, they aren't guessing at what people want.
Imaginary movies for an imaginary genre. Illustration by Darth.
Operation Scrape All the Data
This journey began when I decided I wanted a comprehensive list of Netflix microgenres. It seemed like a fun story, though one that would require some fresh thinking, as many other people had done versions of it.
I started on Twitter, asking my followers to submit the categories that showed up for them on Netflix to a shared document. "To my knowledge, no such list exists, but obviously one should," I wrote. "And then we can see what Netflix is really doing to us."
That call for help yielded about 150 genres, which seemed like a lot, relative to your average Blockbuster (RIP). But it was at that point that Sarah Pavis, a writer and engineer, pointed out to me that Netflix's genre URLs were sequentially numbered. One could pull up more and more genres by simply changing the number at the end of the web address.
That is to say, http://movies.netflix.com/WiAltGenre?agid=1 linked to "African-American Crime Documentaries" and then http://movies.netflix.com/WiAltGenre?agid=2 linked to" Scary Cult Movies from the 1980s." And so on.
After walking through a few dozen URLs, I began to try out what seemed like arbitrarily high numbers. 1000: Movies directed by Otto Preminger. 3000: Dramas Starring Sylvester Stallone. 5000! Critically-Acclaimed Crime Movies from the 1940s. 20000! Mother-Son Movies from the 1970s. There were a lot of blanks in the data, but the entries extended into the 90,000s.
This database probing told me three things: 1) Netflix had an absurdly large number of genres, an order of magnitude or two more than I had thought, 2) it was organized in a way that I didn't understand, and 3) there was no way I could go through all those genres by hand.
But I also realized there was a way to scrape all this data. I'd been playing with an expensive piece of software called UBot Studio that lets you easily write scripts for automating things on the web. Mostly, it seems to be deployed by low-level spammers and scammers, but I decided to use it to incrementally go through each of the Netflix genres and copy them to a file.