YouTube Videos Are a Gold Mine for Health Researchers
“Digital exhaust” from online life could be transformed into health insights. Should it be?

Earlier this summer, a team at England’s Keele University published a behavioral study on children with autism. But the researchers didn’t interview subjects or administer questionnaires. Instead, they used YouTube videos. Bappaditya Mandal and his colleagues trained an artificial-intelligence system to study the body movements of children with autism and to classify their behaviors as either typical or atypical. The researchers’ goal, Mandal told me, is to use computers to more quickly evaluate edge cases that might normally require lab equipment or invasive tactile sensors.
Mandal’s research builds on algorithms that track the appearance of tremors or seizures in children with epilepsy. Epilepsy is slightly more common in people with autism, and vice versa. Video analysis can help scientists and families establish a narrative—when behaviors appear, what triggers them, and which parts of the body are most affected—“all the things the doctors need to know in order to do a good diagnosis,” Mandal explained.
He and his team scraped YouTube videos to build a database, repurposing the clips as valuable analytic data. Some of the videos were uploaded by autism advocacy groups; others were uploaded by parents. Mandal notes that although children’s faces appear in the database, the software doesn’t scan their faces or identify them; it just uses machine learning to read their body language. But the children—and their parents—did not opt in to having their home videos used for scientific research.
Using passive social-media or smartphone data to infer someone’s health status or to study health dynamics broadly is called digital phenotyping, and it’s a growing field of study. Researchers are now using the great wealth of information that users provide to Facebook, Twitter, YouTube, and Instagram to create algorithms that might detect HIV, obesity, Parkinson’s disease, and suicide risk, allowing, they hope, for preventative interventions. Refining machine-learning algorithms requires immense amounts of data, and YouTube offers an abundance of accessible, public videos, already collated and categorized with tags such as #autism, #behavior, #parents, and #kids. Because most social-media sites make content public by default, researchers can generally scrape and use what they want.
Digital phenotyping is just a subset of a larger phenomenon. Every status update and every selfie has a second life—as the raw materials for ad targeting, data brokering, and research. Twitter data are useful for detecting earthquakes. The White House is funding research to see whether Fitbit data can be used to prevent mass shootings. YouTube’s “mannequin challenge” videos are used to train self-driving-car systems. Internet-connected thermometers might detect flu outbreaks. All the flotsam and jetsam of digital life offer insights far beyond what users consider when they hit “upload.”
Using video analysis to study atypical behaviors associated with autism dates back to at least 2005. More recently, researchers have hoped that with enough training data, machine-learning tools might notice the same things a pediatrician would: Does the child respond to a parent calling his name? Can the child easily shift her focus from one object to another? By quantifying these responses, algorithms could learn to pick up patterns from uploaded videos. A 2018 autism behavioral study, for example, used YouTube videos and wearables to classify typical and atypical movements. A decade ago, researchers relied on home videos to train their algorithms. Now the social-media age offers enormous amounts of potential training data.
But Matthew Bietz, a bioethicist at UC Irvine, argues that the abundance of readily accessible data can obscure the potential privacy risks of scraping sites such as YouTube for research. “I think sometimes these [AI studies] are being driven by the people with the tools and not the people with the problem,” he says.
Bietz studies how “digital exhaust”—the data we create with our mouse strokes, selfies, and even battery usage—is turned into health data. Digital exhaust feels ephemeral, but lasts forever and can be personally identifiable. Bietz notes that it can be nearly impossible to ask permission from the people who generate this exhaust to use it in large-scale scraping projects, and there’s no specific threshold that makes a sample too large to ask for permission. Researchers have taken different approaches to consent when a study has a large cohort, including directly asking the platform for access to user data, and running opt-in ads.
“I think that this is one of those places where there’s still a decent amount of controversy, and we haven’t quite decided on what the best way to do this is,” Bietz says. “But I think one thing that we can agree [on] is that sort of hiding behind the claim, Oh, it’s public and thus it’s ethical—that doesn’t fly anymore.”
Researchers who use digital phenotyping hope to one day detect conditions from depression to Parkinson’s to schizophrenia as early as possible. Dennis Wall is a Stanford researcher and a co-founder of Cognoa, an app that combines a parent questionnaire and AI video analysis to assess autism risk in young children. In an interview, he explained the many compounding problems of a delayed autism diagnosis. “If early intervention isn’t provided to the child, their long-term prognosis is worse,” he said. On the other hand, “those kids who receive early intervention often progress to a point where they no longer qualify for an autism diagnosis.”
But, Wall said, the wait time for a screening with an autism specialist can be, in extreme cases, upwards of 12 months. All the while, parents rack up expenses—from visits to specialists to stopgap therapies—that often aren’t covered by insurance until their child receives an official diagnosis. An app that might help pediatricians make faster diagnoses or flag problematic behaviors for parents sooner could be revolutionary.
Social media have always been a complicated trade-off. On a site like YouTube, being watched is hard work, and it can pay millions—while all that browsing gives Google valuable data. Digital phenotyping only deepens this complexity, creating new risks and opportunities to consider as we try to measure the costs and benefits of our ever more digital lives. When we’re this comfortable eschewing privacy for rewards, the conditions are already set for daily life to become a lab.