Back in 2000, a group of mildly inebriated geneticists set up a lighthearted sweepstakes to guess how many genes the human genome would turn out to contain once it was fully sequenced. More than 460 bets were placed, and the lowest guess of 25,947 eventually won when the Human Genome Project was completed in 2003. Fifteen years later, the exact number of human genes is still being debated, with estimates ranging from 19,000 to 22,000. And regardless of the true count, it’s clear that many of these genes are largely unknown.
Since 2003, several researchers have noticed that scientists tend to study genes that are already well studied, and the genes that become popular aren’t necessarily the most biologically interesting ones. Even among genes, it seems, the rich get richer. This trend hasn’t changed in the past two decades, according to a new study by Thomas Stoeger of Northwestern University. Through a massive analysis of existing biomedical data, he found that he can predict how intensely a given gene is studied based on a small number of basic biochemical traits. Most of these, he says, reflect how easy a gene was to investigate in the 1980s and 1990s, rather than how important it is.
“People said that knowing all the genes was going to change everything,” says Luis Amaral, who led the new study. But the 16 percent of genes that were known in 1991 still accounted for half of all biomedical papers in 2015. By contrast, genes discovered more recently are less well understood, and roughly a quarter (27 percent) have never been the focus of a scientific paper. Based on current trends, Stoeger estimates that it would take at least five decades before every gene was characterized at the most basic level, let alone fully understood. “There’s a chance that we are missing out on a lot of interesting biology,” he says.
“I find that very depressing,” says Jay Shendure from the University of Washington. “It is stunning that we sit here 15 years after the Human Genome Project, and still know little to nothing about so many genes. In a world of finite resources, it does not make sense to invest equal effort in every gene. But it’s clear that something is amiss in the status quo of research allocation.”
In what Amaral describes as a “heroic effort,” Stoeger spent years collating information from dozens of databases about every known gene. Using machine-learning tools, he then showed that he could accurately predict how many papers have been published about a given gene using just 15 traits.
Some of these telltale traits—how often the gene is mutated, or the negative consequences of losing it entirely—certainly reflect the gene’s importance and its relevance to human disease. They’re the kind of characteristics scientists should be paying attention to.
But other traits—how big the gene is, how active it is, how many tissues it is active in, whether it produces proteins that are secreted from cells, whether those proteins are soluble in water, and more—reflect how amenable the genes are to experiments. Highly active genes, for example, are easier to detect using older methods. “That definitely had a substantial impact on whether you were even able to study a gene in [the 1980s and 1990s],” says Sharon Plon from Baylor College of Medicine. And those historical quirks are better at predicting how the National Institutes of Health currently allocates its money than thousands of other features that more directly reflect what we now know about the role of genes in disease.
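To make the idea concrete, here is a minimal sketch of what predicting a gene's popularity from a handful of traits can look like. It is not Stoeger's method or data: the three trait names, the simulated paper counts, and the simple linear model are all assumptions chosen for illustration, and the study itself drew on dozens of databases and more sophisticated machine-learning tools.

```python
import math
import random

# Hypothetical trait names, chosen only for illustration; the study's
# actual 15 traits and model are not reproduced here.
TRAITS = ["gene_length", "expression_level", "tissues_expressed"]


def make_toy_genes(n, seed=0):
    """Generate synthetic genes: standardized trait values plus a paper count.

    Counts are simulated so that more experiment-friendly genes attract
    more papers. No real data is involved.
    """
    rng = random.Random(seed)
    true_weights = [0.5, 1.0, 0.8]  # made-up effect sizes
    genes = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in TRAITS]
        signal = sum(w * v for w, v in zip(true_weights, x))
        log_papers = 2.0 + signal + rng.gauss(0, 0.3)
        papers = max(0, round(math.expm1(log_papers)))
        genes.append((x, papers))
    return genes


def fit_linear(genes, lr=0.05, epochs=500):
    """Fit log(1 + papers) ~ bias + weights . traits by batch gradient descent."""
    weights = [0.0] * len(TRAITS)
    bias = 0.0
    n = len(genes)
    for _ in range(epochs):
        grad_w = [0.0] * len(TRAITS)
        grad_b = 0.0
        for x, papers in genes:
            pred = bias + sum(w * v for w, v in zip(weights, x))
            err = pred - math.log1p(papers)
            grad_b += err
            for j, v in enumerate(x):
                grad_w[j] += err * v
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return bias, weights


def predict_papers(model, x):
    """Predicted number of papers for a gene with trait values x."""
    bias, weights = model
    return math.expm1(bias + sum(w * v for w, v in zip(weights, x)))


def r_squared(model, genes):
    """Share of variance in log(1 + papers) explained on the given genes."""
    bias, weights = model
    actual = [math.log1p(p) for _, p in genes]
    predicted = [bias + sum(w * v for w, v in zip(weights, x)) for x, _ in genes]
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


genes = make_toy_genes(400)
train, held_out = genes[:300], genes[300:]  # score on genes the model never saw
model = fit_linear(train)
score = r_squared(model, held_out)
```

In this toy setup, `score` measures how well trait values alone account for the simulated paper counts of held-out genes. The point is only the shape of the argument: if a few traits that track experimental convenience predict attention well, then attention is being driven by convenience at least as much as by importance.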
It’s possible, of course, that scientists have already identified all the really important genes, and are allocating their attention appropriately. There are good reasons, for example, why p53 is the most popular human gene: It protects our cells from cancer, and is itself mutated in half of all tumors. More broadly, Stoeger found that compared to the least popular genes, the most popular ones are three to five times more likely to have been linked to diseases in large studies, or to wreak havoc when they accrue incapacitating mutations. The problem is that those celebrity genes get over 8,000 times more attention than their neglected counterparts. Scientists do tend to study important genes, Stoeger says, but even then, they do so disproportionately.
That’s partly because there are substantial barriers to studying something that no one else has studied before. A researcher might spend years trying to, for example, engineer a line of laboratory rodents that lack the gene in question. They might create bespoke antibodies or other chemical reagents that can help track or visualize the gene. This all takes time, money, and effort. “Many investigators identify an important gene and then spend their whole career studying it,” says Plon.
To do otherwise is risky. Stoeger showed that over the past two decades, junior researchers who focused their attention on the least studied genes were 50 percent less likely to eventually run their own lab. “Those people get pushed out of the biomedical workforce, and then don’t get a chance to set up a lab that explores some of the previously unknown biology,” he says.
Stoeger and Amaral “have done a remarkable job of comprehensively analyzing the reasons why many important genes are ignored,” adds Purvesh Khatri from Stanford University. “Their results underscore the need to change how we study human biology.”
Amaral blames the research imbalance on the erosion of funding from the National Institutes of Health, which forces scientists to compete for a dwindling number of grants and pushes them toward safer research. “When resources stop growing, the entire system is telling people not to take chances,” he says. The NIH does have grants that are meant to promote innovative, exploratory, high-risk research, but even these end up augmenting the same imbalances: Half of the papers that emerge from them still focus on the same 5 percent of well-studied genes. Even supposedly game-changing techniques like CRISPR have altered the landscape of gene popularity very little. “You get all these new tools but you end up using them on the same set of genes that you were using them on before,” says Amaral.
Within the past decade, only six genes have escaped the doldrums of obscurity and become newly popular, mainly because researchers recently realized that they are medically important. C9orf72, for example, was recently identified as a common link between two neurodegenerative diseases: frontotemporal dementia and ALS. IDH1 is commonly mutated in brain cancers. SAMHD1 protects certain cells from HIV. “It’s clear that if sufficiently motivated, the field can tack,” says Shendure, “but I still would have expected more exceptions. We don’t want communism for genes, but we do want to lower the activation energy for intensively attacking the biology of genes that clearly merit more attention.”
Stoeger and Amaral have already created a wish list of genes that, based on their data, should be easier to study with modern methods, and are probably worthy of attention. They also think that agencies like the NIH should create grants that encourage junior scientists to pursue new and unpredictable lines of research, and, crucially, provide them with enough years of funding to offset the initial risk of heading down those paths. “If we don’t take targeted approaches to incentivize the study of unstudied genes, the system is not going to change,” Amaral says.