If you use Netflix, you've probably wondered about the specific genres that it suggests to you. Some of them just seem so specific that it's absurd. Emotional Fight-the-System Documentaries? Period Pieces About Royalty Based on Real Life? Foreign Satanic Stories from the 1980s?
If Netflix can show such tiny slices of cinema to any given user, and they have 40 million users, how vast did their set of "personalized genres" need to be to describe the entire Hollywood universe?
This idle wonder turned to rabid fascination when I realized that I could capture each and every microgenre that Netflix's algorithm has ever created.
Through a combination of elbow grease and spam-level repetition, we discovered that Netflix possesses not several hundred genres, or even several thousand, but 76,897 unique ways to describe types of movies.
There are so many that just loading, copying, and pasting all of them took the little script I wrote more than 20 hours.
We've now spent several weeks understanding, analyzing, and reverse-engineering how Netflix's vocabulary and grammar work. We've broken down its most popular descriptions, and counted its most popular actors and directors.
To my (and Netflix's) knowledge, no one outside the company has ever assembled this data before.
What emerged from the work is this conclusion: Netflix has meticulously analyzed and tagged every movie and TV show imaginable. They possess a stockpile of data about Hollywood entertainment that is absolutely unprecedented. The genres that I scraped and that we caricature above are just the surface manifestation of this deeper database.
Netflix cooperated with my quest to understand what they internally call "altgenres," and made VP of product innovation Todd Yellin, the man who conceived of the system, available for an in-depth interview. Georgia Tech professor and Atlantic contributing editor, Ian Bogost, worked closely with me recreating the Netflix grammar, and he programmed the magical genre generator above.
If we reverse engineered Yellin's system, it was Yellin himself who imagined a much more ambitious reverse-engineering process. Using large teams of people specially trained to watch movies, Netflix deconstructed Hollywood. They paid people to watch films and tag them with all kinds of metadata. This process is so sophisticated and precise that taggers receive a 36-page training document that teaches them how to rate movies on their sexually suggestive content, goriness, romance levels, and even narrative elements like plot conclusiveness.
They capture dozens of different movie attributes. They even rate the moral status of characters. When these tags are combined with millions of users viewing habits, they become Netflix's competitive advantage. The company's main goal as a business is to gain and retain subscribers. And the genres that it displays to people are a key part of that strategy. "Members connect with these [genre] rows so well that we measure an increase in member retention by placing the most tailored rows higher on the page instead of lower," the company revealed in a 2012 blog post. The better Netflix shows that it knows you, the likelier you are to stick around.
And now, they have a terrific advantage in their efforts to produce their own content: Netflix has created a database of American cinematic predilections. The data can't tell them how to make a TV show, but it can tell them what they should be making. When they create a show like House of Cards, they aren't guessing at what people want.
Imaginary movies for an imaginary genre. Illustration by Darth.
Operation Scrape All the Data
This journey began when I decided I wanted a comprehensive list of Netflix microgenres. It seemed like a fun story, though one that would require some fresh thinking, as many other people had done versions of it.
I started on Twitter, asking my followers to submit the categories that showed up for them on Netflix to a shared document. "To my knowledge, no such list exists, but obviously one should," I wrote. "And then we can see what Netflix is really doing to us."
That call for help yielded about 150 genres, which seemed like a lot, relative to your average Blockbuster (RIP). But it was at that point that Sarah Pavis, a writer and engineer, pointed out to me that Netflix's genre URLs were sequentially numbered. One could pull up more and more genres by simply changing the number at the end of the web address.
That is to say, http://movies.netflix.com/WiAltGenre?agid=1 linked to "African-American Crime Documentaries" and then http://movies.netflix.com/WiAltGenre?agid=2 linked to" Scary Cult Movies from the 1980s." And so on.
After walking through a few dozen URLs, I began to try out what seemed like arbitrarily high numbers. 1000: Movies directed by Otto Preminger. 3000: Dramas Starring Sylvester Stallone. 5000! Critically-Acclaimed Crime Movies from the 1940s. 20000! Mother-Son Movies from the 1970s. There were a lot of blanks in the data, but the entries extended into the 90,000s.
This database probing told me three things: 1) Netflix had an absurdly large number of genres, an order of magnitude or two more than I had thought, 2) it was organized in a way that I didn't understand, and 3) there was no way I could go through all those genres by hand.
But I also realized there was a way to scrape all this data. I'd been playing with an expensive piece of software called UBot Studio that lets you easily write scripts for automating things on the web. Mostly, it seems to be deployed by low-level spammers and scammers, but I decided to use it to incrementally go through each of the Netflix genres and copy them to a file.
After some troubleshooting and help from Bogost, the bot got up and running and simply copied and pasted from URL after URL, essentially replicating a human doing the work. It took nearly a day of constantly running a little Asus laptop in the corner of our kitchen to grab it all.
Imaginary movies for an imaginary genre. Illustration by Darth.
As the software ran, I began to familiarize myself with the data. I randomly selected a snippet, so you can see what the raw genre data looks like:
Emotional Independent Sports Movies
Spy Action & Adventure from the 1930s
Cult Evil Kid Horror Movies
Cult Sports Movies
Sentimental set in Europe Dramas from the 1970s
Visually-striking Foreign Nostalgic Dramas
Japanese Sports Movies
Gritty Discovery Channel Reality TV
Romantic Chinese Crime Movies
Mind-bending Cult Horror Movies from the 1980s
Dark Suspenseful Sci-Fi Horror Movies
Gritty Suspenseful Revenge Westerns
Violent Suspenseful Action & Adventure from the 1980s
Time Travel Movies starring William Hartnell
Romantic Indian Crime Dramas
Evil Kid Horror Movies
Visually-striking Goofy Action & Adventure
British set in Europe Sci-Fi & Fantasy from the 1960s
Dark Suspenseful Gangster Dramas
Critically-acclaimed Emotional Underdog Movies
The first thing that I noticed was that not every genre had streaming movies attached to it. The reason for that is the streaming catalog rotates and the genres that I was looking at represented the total possible universe of different genres, not just the ones that people were being shown on that particular day in this particular geography (the United States). So, right now, category 91,300, "Feel-good Romantic Spanish-Language TV Shows" doesn't show me anything I can stream. But category 91,307, "Visually Striking Latin American Comedies" has two movies and category 6,307, "Visually Striking Romantic Dramas" has 20.
So this is the main caveat to keep in mind as we go through this data: The existence of a genre in the database doesn't precisely correspond to the number of movies that Netflix has in its vaults. All the genre's existence means is that, based on an algorithm we'll get into later, there are some movies out there that fit the description.