“With just a mountain of images, no one will ever inspect them one by one. You can never make a discovery,” he says. “You need to convert it into something that machines can understand.”
This means trying to train computers to recognize patterns as well as we can, one of the thorniest problems in computer science. Computers are still second to humans on this and they have much longer learning curves, says Matias Carrasco Kind, an astronomer at the University of Illinois at Urbana-Champaign.
“We can recognize faces in a big crowd, blurry objects in a picture, and notice people from behind or from the way they walk,” he says. “By just looking at a few examples, we can extrapolate much better than computers, which need a much larger training set, and more time to process. [And computers have] a harder time [thinking] ‘outside the box.’”
This is especially true when it comes to characterizing galaxies. You have to account for brightness, which is different across pixels; the galaxies’ shape and symmetry; and their orientation, including whether we’re looking at them face-on or sideways. Humans can do this very quickly, which is why astronomers created something called the Galaxy Zoo. For the better part of a decade, thousands of citizen scientists have volunteered to organize galaxies from the Sloan Digital Sky Survey, a robotic mission that has mapped an astonishing chunk of the observable universe, including some 208 million galaxies. The Zoo website shows you a small photo of a galaxy, and you answer simple questions, like whether or not it’s a spiral. It’s oddly meditative, but decidedly slow work: it’s taken years to populate the Zoo, and it doesn’t come close to classifying everything the SDSS has seen. Shamir estimates that at this rate, it would take human volunteers 120,000 years to classify everything that comes through the LSST.
So Shamir is trying to give computers an edge. He and coauthor Evan Kuminski fed some galaxies to a machine-learning algorithm called Wndchrm, which can classify images based on data in their pixels. Shamir, who designed it, has also used the algorithm to categorize microscope images and distinguish a fake Jackson Pollock painting from the real thing.
It works by turning physical attributes into numbers, and it uses 2,885 of these numerical descriptors for each galaxy image. Each relates to characteristics like textures, shapes and edges, allowing the algorithm to categorize a galaxy as spiral or elliptical.
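To give a feel for what “turning physical attributes into numbers” can mean, here is a minimal Python sketch. It is not Wndchrm’s code, and its four descriptors are illustrative stand-ins rather than any of the algorithm’s 2,885; the function name `describe` and the tiny 3-by-3 “galaxy” are assumptions made up for this example.

```python
from statistics import mean, pstdev


def describe(image):
    """Turn a 2D grid of pixel brightness values into a few numbers.

    These descriptors are hypothetical examples, not Wndchrm's own.
    """
    pixels = [p for row in image for p in row]
    # Edge strength: average absolute difference between horizontal neighbours.
    edges = mean(
        abs(row[i + 1] - row[i]) for row in image for i in range(len(row) - 1)
    )
    # Asymmetry: average absolute difference between each row and its mirror image.
    asymmetry = mean(
        abs(a - b) for row in image for a, b in zip(row, reversed(row))
    )
    return {
        "brightness": mean(pixels),   # overall light level
        "texture": pstdev(pixels),    # how much brightness varies across pixels
        "edges": edges,
        "asymmetry": asymmetry,
    }


# A made-up 3x3 "galaxy": bright core, dim outskirts, perfectly symmetric.
galaxy = [
    [0, 1, 0],
    [1, 9, 1],
    [0, 1, 0],
]
features = describe(galaxy)
print(features)
```

A classifier never sees the picture itself, only a list of numbers like these, which is why the choice of descriptors matters so much.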
Shamir and Kuminski trained it with 300 galaxies Shamir classified himself. Confident Wndchrm had learned the ropes, they fed it 3 million SDSS galaxies, and then compared its classifications with those made by Galaxy Zoo volunteers. For that comparison they used only “superclean” SDSS galaxies, which meant many different Zoo volunteers had looked at the same images and 95 percent of them agreed on the galaxies’ physical attributes. Wndchrm gave each classification a bit of a hedge: say, 85 percent certainty it’s a spiral, 15 percent certainty it’s an elliptical. This uncertainty was built in because a galaxy’s attributes are often subjective, and difficult for a computer to discern, Shamir says.
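The comparison described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the researchers’ pipeline: the function names, the vote counts, and the reading of “95 percent agreed” as “at least 95 percent chose the same label” are all hypothetical.

```python
def agreement(votes):
    """Return the most popular label and the fraction of volunteers who chose it."""
    total = sum(votes.values())
    label = max(votes, key=votes.get)
    return label, votes[label] / total


def is_superclean(votes, threshold=0.95):
    """Assumed rule: at least 95 percent of volunteers picked the same label."""
    return agreement(votes)[1] >= threshold


def top_label(certainties):
    """Pick the class the machine is most certain about."""
    return max(certainties, key=certainties.get)


# Made-up numbers: 39 of 40 volunteers say "spiral" (97.5 percent agreement).
volunteer_votes = {"spiral": 39, "elliptical": 1}
# The hedge from the example in the text: 85 percent spiral, 15 percent elliptical.
machine_certainties = {"spiral": 0.85, "elliptical": 0.15}

human_label, fraction = agreement(volunteer_votes)
matches = (
    is_superclean(volunteer_votes)
    and top_label(machine_certainties) == human_label
)
print(human_label, fraction, matches)
```

Restricting the check to galaxies on which humans almost unanimously agree means a mismatch is more likely to be the machine’s mistake than a genuinely ambiguous image.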