Once it’s stolen, valuable data tends to crop up for sale in the shady alleyways of the Internet. Online forums frequented by hackers are popular places for hawking data dumps, and full-blown marketplaces on the dark web provide anonymity to buyers and sellers.
For an organization that’s charged with protecting sensitive data—which is nearly any company with payroll records or employee health files—one good way to know when a data breach has occurred is to monitor these markets. That’s where Matchlight, a service from Baltimore-based Terbium Labs, comes in.
Matchlight scans the recesses of hacker forums and marketplaces on both the surface web and the dark web—a part of the Internet accessible only through the anonymizing Tor network—and notifies clients if their confidential data turns up.
The service has two parts: The first is a web crawler, also known as a spider, that automatically searches and indexes the websites where stolen data is likely to appear. On the part of the Internet that most people browse every day, Google is the king of indexing. Every traffic-hungry site conforms to certain standards in order to get picked up by Google’s spiders and rank as high as possible in its search results.
This makes Matchlight’s job relatively easy on the surface web. But there’s a different story on the dark web. “We’re trying to index what people don’t want indexed,” said Danny Rogers, Terbium’s CEO. “There’s no desire to make things easy to find. Fundamentally, it’s a more hostile environment to crawl.”
Marketplaces that sell illicit goods on the dark web come and go: The FBI shut down the notorious Silk Road market in 2013, and a set of coordinated raids in 2014 took down 400 dark-web markets hosted in 17 different countries. That’s a small slice of all the sites on the dark web, but these fluctuations make it difficult to monitor the most active marketplaces—so to help its spider out, Terbium’s employees are also on the lookout for new sites in need of indexing.
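The core of any such spider is the same whether it crawls the surface web or the dark web: fetch a page, record its content for indexing, and follow its links. Here is a minimal breadth-first sketch in Python, using only the standard library. The function names and the `fetch` callback are illustrative, not Terbium's actual design; a real crawler like Matchlight's would also need Tor connectivity, rate limiting, and resilience to hostile or vanishing sites.

```python
# A minimal breadth-first web spider (illustrative, not Terbium's code).
# `fetch(url)` is a caller-supplied function that returns a page's HTML,
# so the same logic works over plain HTTP or a Tor-aware client.
from html.parser import HTMLParser
from urllib.parse import urljoin
from collections import deque

class LinkExtractor(HTMLParser):
    """Collect the href target of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(base_url, html):
    """Return absolute URLs for every link found in `html`."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl starting at `start_url`.

    Returns a dict mapping each visited URL to its raw HTML, which
    an indexing service would then tokenize and make searchable."""
    index, queue = {}, deque([start_url])
    while queue and len(index) < max_pages:
        url = queue.popleft()
        if url in index:          # skip pages we've already indexed
            continue
        html = fetch(url)
        index[url] = html
        queue.extend(extract_links(url, html))
    return index
```

On the surface web, `fetch` could simply wrap `urllib.request.urlopen`; on the dark web it would have to route requests through Tor, and, as Rogers notes, cope with sites that actively resist being found.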
Once Matchlight has an index of what’s being traded on the Internet, it needs to compare it against its clients’ data. But instead of keeping a database of sensitive and private client information to compare against, Terbium uses cryptographic hashes to find stolen data.
Hash functions create an effectively unique fingerprint from a file or a message. They’re particularly useful here because they only work in one direction: You can’t figure out what the original input was just by looking at a fingerprint. So clients can use hashing to create fingerprints of their sensitive data, and send them on to Terbium; Terbium then uses the same hash function on the data its web crawler comes across. If anything matches, the red flag goes up. Rogers says the program can find matches in a matter of minutes after a dataset is posted.
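The matching scheme described above can be sketched in a few lines of Python. SHA-256 stands in here as the hash function, and the sample records are invented; Terbium's actual fingerprinting method is not public, so this only illustrates the principle that the service never needs to see the client's plaintext data.

```python
# A sketch of one-way fingerprint matching (assumes SHA-256; Terbium's
# actual fingerprinting scheme is proprietary and may differ).
import hashlib

def fingerprint(record: str) -> str:
    """Return a hex SHA-256 digest: a one-way fingerprint of `record`.
    The original text cannot be recovered from the digest."""
    return hashlib.sha256(record.encode("utf-8")).hexdigest()

# Client side: hash sensitive records locally and share only the
# fingerprints, never the records themselves.
client_records = ["SSN 078-05-1120", "card 4111-1111-1111-1111"]
watchlist = {fingerprint(r) for r in client_records}

# Service side: hash each item the crawler finds and check for a match.
def scan(crawled_records, watchlist):
    """Return every crawled record whose fingerprint is on the watchlist."""
    return [r for r in crawled_records if fingerprint(r) in watchlist]
```

Because both sides apply the same function, a stolen record that surfaces in a data dump produces the exact digest the client registered, and the match can be flagged without the monitoring service ever storing the sensitive data itself.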