The Y Chromosome's Still-Uncharted Regions

The human genome has never actually been complete.

Fifteen years ago this April, scientists announced that the human genome sequence was complete. I regret to inform you this is not true.

If you have been misled, it is because many scientists themselves have long ignored the last unassembled regions of human DNA, which consist mostly of short, repeating sequences that do not look like genes. “These huge gaps still remain,” says Karen Miga, a genomics researcher at the University of California at Santa Cruz. That’s because it has been impossible to sequence and assemble those repeating stretches of DNA—until now.

In a major milestone, Miga and her colleagues reveal the complete 300,000-letter sequence of one of those odd, poorly understood regions: the centromere of the Y chromosome.

It’s astonishing that a centromere sequence has never been assembled before, given how fundamental centromeres are. Chromosomes are tightly packed structures of DNA, and each one has a specialized region called the centromere. When a cell divides, threadlike proteins attach to the centromere to pull chromosomes apart. Without functioning centromeres, cells can end up with too few or too many chromosomes, as happens in Down syndrome. Malfunctioning centromeres have also been linked to diseases such as cancer.

“Here’s this region on every chromosome that is absolutely essential,” says Beth Sullivan, a molecular biologist at Duke who was not involved in the study. “You’d think we’d know a lot about the centromere.”

Yet centromeres have been tough to crack. They contain similar or even identical sequences that are perhaps 170 letters long and repeated hundreds or thousands of times. Traditional sequencing machines chop up a strand of DNA into short pieces that are “read” and then assembled like a puzzle. “The problem with centromeres is all the pieces look the same. It’ll be like putting together a puzzle of the Sahara Desert,” says Sullivan. Biologists studying genes have the benefit of reams of gene-sequence information, but those studying centromeres have essentially been stuck in the pre-sequencing days of the 1990s.

Enter nanopore sequencing, a new technology that can read longer stretches of DNA. Miga and her colleagues decided to tackle centromeres with it. Nanopore sequencing still cannot span the hundreds of thousands of letters of the Y chromosome’s centromere in one go. But it yields fewer, bigger puzzle pieces, which makes the sequence much easier to assemble.
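For readers who want to see why bigger pieces help, here is a rough toy sketch in Python. It is not the method Miga's team used. It assumes an invented array of 60 copies of a random 170-letter repeat, with a single-letter change in every sixth copy, and asks what fraction of possible reads of a given length show up only once in the array, since only those can be placed without guessing. The function names, the read lengths of 100 and 2,000 letters, and the spacing of the changes are all made up for illustration.

```python
import random
from collections import Counter

MONOMER_LEN = 170   # repeat-unit length mentioned in the article
N_COPIES = 60       # toy value; real centromere arrays hold far more copies


def build_array(with_variants: bool) -> str:
    """Build a toy tandem-repeat array, optionally with sparse single-letter variants."""
    rng = random.Random(0)  # fixed seed so the toy data is reproducible
    monomer = rng.choices("ACGT", k=MONOMER_LEN)
    copies = [list(monomer) for _ in range(N_COPIES)]
    if with_variants:
        # Hypothetical layout: every 6th copy carries one substitution,
        # each at a different position within the repeat unit.
        for j, copy_index in enumerate(range(0, N_COPIES, 6)):
            pos = 17 * j + 5
            old = copies[copy_index][pos]
            copies[copy_index][pos] = "ACGT"[("ACGT".index(old) + 1) % 4]
    return "".join("".join(c) for c in copies)


def unique_fraction(array: str, read_len: int) -> float:
    """Fraction of possible reads of this length whose sequence occurs only once."""
    reads = [array[i:i + read_len] for i in range(len(array) - read_len + 1)]
    counts = Counter(reads)
    return sum(1 for r in reads if counts[r] == 1) / len(reads)


if __name__ == "__main__":
    array = build_array(with_variants=True)
    for read_len in (100, 2000):
        # Short reads mostly fall between variants and look identical;
        # long reads usually span at least one variant that pins them down.
        print(read_len, round(unique_fraction(array, read_len), 3))
```

In this toy, a read can be placed only if it overlaps one of the rare single-letter changes. Short reads seldom do, while a read longer than the gap between changes always does. Real centromeres are far larger and messier, and real reads contain errors, so this illustrates the puzzle-piece idea rather than modeling an actual assembly.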

The Y chromosome centromere Miga and her colleagues sequenced and assembled came from an anonymous man in Buffalo, New York, whose DNA was also used for most of the Human Genome Project. The sequence didn’t contain too many surprises. That’s good news, because it means nanopore sequencing, still a relatively new technique, isn’t coughing up errors. And it opens the door to more centromere sequencing. “To me, this is just the bedrock of future analysis,” says Miga.

Sequencing one centromere is a technical curiosity, but sequencing many centromeres is where the really interesting findings will come. For example, the Y chromosome has long been used to study past human migrations and map genetic variation. Centromeres add another layer to the data because they vary so much. Not only do the letters of the underlying repeated sequences change, but the length of the centromere on a given chromosome can vary as much as 20-fold from person to person. “If you want to look at a human variation, I think this is the place to look,” says Steve Henikoff, who studies centromeres at the Fred Hutchinson Cancer Research Center. He called the new study a “landmark” in the study of centromeres.

Scientists will want to look at the centromeres of other chromosomes, too. Miga started with the Y chromosome simply because it was the easiest. Its centromere is only hundreds of thousands of letters long, whereas the centromere on chromosome 17, which Sullivan studies, is 4 million letters long. Defects in that centromere have been linked to diseases, most notably breast cancer. If scientists could fully sequence the long centromere, they could understand how subtle changes, like minor typos in the sequence or shifts in the order of repeats, affect centromere function.

It will be harder to span these longer centromeres. Still, Matthew Loose, a biologist at the University of Nottingham who recently led a project sequencing the human genome (minus the centromeres) with nanopore technology, says he thinks it will be “routine” in the “near future” to get more complete genome sequences.

And finally, the gaps are not just in centromeres. A large chunk of the Y chromosome, for example, is heterochromatin, yet another region of highly repetitive DNA. “The Y chromosome is this gnarly chromosome,” says Miga.

With nanopore sequencing, scientists are just beginning to close the gaps—building toward a truly complete human genome sequence.