Using passive social-media or smartphone data to infer someone’s health status or to study health dynamics broadly is called digital phenotyping, and it’s a growing field of study. Researchers are now using the great wealth of information that users provide to Facebook, Twitter, YouTube, and Instagram to create algorithms that might detect HIV, obesity, Parkinson’s disease, and suicide risk, allowing, they hope, for preventative interventions. Refining machine-learning algorithms requires immense amounts of data, and YouTube offers an abundance of accessible, public videos, already collated and categorized with tags such as #autism, #behavior, #parents, and #kids. Because most social-media sites make content public by default, researchers can generally scrape and use what they want.
Digital phenotyping is just a subset of a larger phenomenon. Every status update and every selfie has a second life—as the raw materials for ad targeting, data brokering, and research. Twitter data are useful for detecting earthquakes. The White House is funding research to see whether Fitbit data can be used to prevent mass shootings. YouTube’s “mannequin challenge” videos are used to train self-driving-car systems. Internet-connected thermometers might detect flu outbreaks. All the flotsam and jetsam of digital life offer insights far beyond what users consider when they hit “upload.”
Using video analysis to study atypical behaviors associated with autism dates back to at least 2005. More recently, researchers have hoped that with enough training data, machine-learning tools might notice the same things a pediatrician would: Does the child respond to a parent calling his name? Can the child easily shift her focus from one object to another? By quantifying these responses, algorithms could learn to pick up patterns from uploaded videos. A 2018 autism behavioral study, for example, used YouTube videos and wearables to classify typical and atypical movements. A decade ago, researchers relied on home videos to train their algorithms. Now the social-media age offers enormous amounts of potential training data.
But Matthew Bietz, a bioethicist at UC Irvine, argues that the abundance of readily accessible data can obscure the potential privacy risks of scraping sites such as YouTube for research. “I think sometimes these [AI studies] are being driven by the people with the tools and not the people with the problem,” he says.
Bietz studies how “digital exhaust” (the data we create with our mouse strokes, selfies, and even battery usage) is turned into health data. Digital exhaust feels ephemeral, but it lasts forever and can be personally identifiable. Bietz notes that it can be nearly impossible to ask the people who generate this exhaust for permission to use it in large-scale scraping projects, and there’s no specific threshold that makes a sample too large to ask for permission. Researchers have taken different approaches to consent when a study has a large cohort, including asking the platform directly for access to user data and running opt-in ads.