Robert Lang is a master of origami, known for his elegant and almost impossibly accurate sculptures. On his website, you can find his “crease patterns”: all the folds that go into his compositions, drawn out on flat sheets of paper. The patterns are beautiful in their own right, not least because it is almost impossible to look at one and divine what it will eventually become. How could you ever guess that one pattern would become a beetle, that another folds into a rhino, or that a third is a tarantula-in-the-making?

That challenge, incidentally, is exactly what many scientists have struggled with for decades, because life—all life—depends on origami.

Specifically, it depends on proteins: essential molecular machines that do all the critical jobs that keep us alive. They’re built according to instructions encoded in our genes, which are used to assemble a long sequence of building blocks called amino acids. This linear chain naturally folds into a complicated three-dimensional shape in a feat of spontaneous origami. It’s that shape that determines what proteins can do; it’s that shape that we need to understand. If we want to use CRISPR, the newly famous gene-editing technique, we need to know the shape of Cas9, the protein that actually does the editing. If we want to create drugs against viruses like flu or HIV, we need to know the shape of the proteins on their surface.

But divining these shapes is really hard. Traditionally, you’d grow pure crystals of the protein, bounce X-rays off the crystals, and interpret the resulting patterns. It’s laborious, tricky, and expensive work. The alternative is to predict how a protein will fold based on the sequence of its amino acids. Remember that these folds happen spontaneously, so there must be some code that translates the properties of the amino acids—say, their size or electrical charge—into a 3-D shape. But deciphering that code is also really hard. It’s like looking at Robert Lang’s crease patterns and trying to see a rhino or beetle.

This “protein-folding problem” was first posed over 50 years ago, after Max Perutz and John Kendrew won the 1962 Nobel Prize in Chemistry for solving the very first protein structure the old-fashioned way. The problem has since spawned an international competition and a computer game. IBM’s Blue Gene supercomputers were designed specifically to simulate protein folding. Scientists have made tremendous headway, but the protein-folding problem is still “one of the grand challenges in biology and chemistry,” says David Baker from the University of Washington.

Over the years, his team has made many important steps towards fully solving the protein-folding problem, and their latest is perhaps the most impressive. Of the 15,000 protein families, a third have at least one member whose structure has been solved with traditional techniques, another third have at least one shape that was predicted by a computer, and a third have nothing. In one fell swoop, Baker’s team has now computed the shapes of 614 of those unsolved families—around 12 percent of them.

A selection of the protein structures that Baker’s team recently solved. Credit: AAAS.

“This is a very impressive result, especially given that between 2000 and 2015, the NIH poured hundreds of millions of dollars into the Protein Structure Initiative, with nowhere near as spectacular results,” says Patricia Clark from the University of Notre Dame, who was not involved in the study. That initiative used traditional X-ray crystallography. “You had to make the proteins, which was hard, and you had to solve their structures, which was hard,” says Baker. “Many were solved but it turned out to be very expensive. Here, we achieved the same goal, but the cost was very much less because it was all done on the computer.”

To be fair to the NIH, Baker’s methods weren’t available back in 2000, although their seeds certainly were. In the late 1990s, he developed a program for predicting protein structures called Rosetta, which remains the leader in the field. Rosetta works by looking at a protein’s sequence, considering the many ways it can fold, and finding the most stable one. The problem is that even small proteins can potentially fold in an unimaginably large number of ways, and there just isn’t enough computing power to analyze all the possibilities. It’s like trying to find the lowest point on Earth, explains Baker. “Even if you have an accurate altimeter, if you’re never in the Middle East, you’ll never find the Dead Sea.”
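Baker’s altimeter analogy can be made concrete with a toy search. The sketch below is not Rosetta’s algorithm; it is a minimal illustration, with a made-up list of elevations and hypothetical function names, of why a search that only ever walks downhill finds the nearest valley rather than the deepest one, and why the choice of starting points matters so much.

```python
# Toy illustration only: the landscape and function names are invented
# for this example and have nothing to do with Rosetta's real energy model.

def descend(elevations, start):
    """Step to the lower neighbouring cell until no neighbour is lower."""
    i = start
    while True:
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(elevations)]
        best = min(neighbours, key=lambda j: elevations[j])
        if elevations[best] >= elevations[i]:
            return i  # stuck in a valley, which may or may not be the deepest
        i = best


def search(elevations, starts):
    """Run one descent per starting point and keep the lowest valley found."""
    valleys = [descend(elevations, s) for s in starts]
    return min(valleys, key=lambda i: elevations[i])


# A rugged landscape with one "Dead Sea" hidden at index 8.
ELEVATIONS = [5, 3, 4, 6, 2, 7, 9, 4, -430, 3, 8, 1, 6]

if __name__ == "__main__":
    far_away = search(ELEVATIONS, starts=[0, 3])
    nearby = search(ELEVATIONS, starts=[0, 3, 7])
    print("Starting far from the basin:", far_away, ELEVATIONS[far_away])
    print("With one start near the basin:", nearby, ELEVATIONS[nearby])
```

With starting points far from the deep basin, the search settles for a shallow valley; it only reaches the lowest point once one start happens to land nearby. Anything that tells the search where to begin, or which regions to skip, shrinks the problem dramatically.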

There are ways around that. Baker’s team launched a program called Rosetta@Home, where citizens can donate some of their own computing power to the task. They developed a computer game called Foldit, which allows the intuition of human players to steer Rosetta towards the right structures. And mainly, they’ve been working on ways of narrowing down the list of possible structures, so that Rosetta knows where to focus its search.

One method relies on the fact that when proteins fold, two amino acids that are far apart in sequence can end up touching each other in 3-D space. If one of these amino acids changes, it could destabilize the shape of the entire protein, so its partner often changes too to compensate. You can detect these pairs of correlated amino acids by comparing different versions of the same protein in different species. When that protein folds, those correlated pairs are likely to touch. And having that information greatly reduces the number of possible structures for Rosetta to consider.
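To give a flavor of how correlated pairs might be spotted, here is a minimal sketch that scores every pair of positions in a tiny, invented alignment by mutual information, a standard measure of how much knowing one column tells you about another. This is a stand-in chosen for simplicity, not the statistical model Baker’s team used; the six sequences, the function names, and the choice of score are all assumptions made for illustration, and a real analysis needs far more sequences.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a, count_b = Counter(col_a), Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    return sum(
        (c / n) * log2((c / n) / ((count_a[a] / n) * (count_b[b] / n)))
        for (a, b), c in count_ab.items()
    )


def top_pairs(alignment, k=1):
    """Return the k column pairs (0-based indices) with the highest score."""
    columns = list(zip(*alignment))
    scores = {
        (i, j): mutual_information(columns[i], columns[j])
        for i, j in combinations(range(len(columns)), 2)
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical versions of one short protein from six species.
# Positions 1 and 4 swap K and E together; other changes are unrelated.
ALIGNMENT = [
    "AKLGEV",
    "AKLGEV",
    "AELGKV",
    "AELGKV",
    "AKIGEV",
    "AELAKV",
]

if __name__ == "__main__":
    print(top_pairs(ALIGNMENT))
```

In this toy data, positions 1 and 4 always change together, so they score highest and would be flagged as a likely contact. Constant columns score zero, and columns that vary independently score close to it.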

In 2014, team member Sergey Ovchinnikov used this technique in an international structure-prediction competition called CASP. “He made one prediction of stunning accuracy for a far more complex structure than anyone had modelled,” says Baker. “That was very exciting.” Buoyed by that success, in 2015, Ovchinnikov repeated the trick for 58 entire families of proteins, including 400,000 members altogether. The structures of six of these have since been solved through traditional methods, confirming that Ovchinnikov’s predictions were accurate.

But this technique has one important limitation. To accurately identify those tell-tale pairs of correlated amino acids, you need a lot of protein sequences from many different species. And in his 2015 study, Ovchinnikov had exhausted what he could find in public databases. “It seemed like there wasn’t really more to do, but we got rescued in a way,” says Baker.

In recent years, biologists have been busy sampling microbes from soils, water, bodies, and all kinds of other environments. In the process, they’ve been amassing protein sequences at a breakneck pace. After approaching the Joint Genome Institute for their sequences, Ovchinnikov predicted the structures of 614 protein families, including upwards of a million different proteins. Some control the development of embryos, others move iron around cells, and yet others break down other proteins, to name just a few. “It takes strides toward the goal of describing the entire protein structure universe,” says Karen Allen from Boston University.

Still, she notes that computer predictions can’t replace the traditional, laborious methods like X-ray crystallography. Why? Because proteins are built to a precision that would make human engineers blush; every atom is always in exactly the right position. The models that Rosetta produces aren’t accurate enough to capture that precision. They’re good enough for, say, understanding what a virus’s protein does, but not for designing a drug that targets that protein. For that, you still need the X-rays.

But even there, the computer models help. Recently, a German team solved the structure of a confusing three-part protein after Ovchinnikov gave them the model that Rosetta had churned out. Only then could they make sense of the data from their X-ray experiments. Clark also predicts that Baker’s technique will be increasingly useful to scientists who study microbes, since most of these can’t be grown in a lab. For this “unculturable majority,” it’s almost impossible to purify enough of a protein to analyze with X-rays. Computer predictions sidestep that problem.

The technique will also become more useful with time, as scientists collect more protein sequences. So far, most of these come from microbes, but researchers are increasingly collecting data from animals, plants, and other complex organisms. “We’re getting in touch with people from various genome projects and trying to collect them all,” says Baker. “If we get enough, we can determine the structural biology of a lot of the tree of life.”