#WikipediaProblems: How Do You Classify Everything?

Design the wrong system, and “selenium” and “sex” get sorted the same.

Like a card catalog, but, you know, for everything (Glyn Lowe)

How do you organize all the world’s information?

A decision made by the editors in charge of Wikipedia’s newest, biggest project reveals the difficulty of such a task.

Wikidata is the newest project from the Wikimedia Foundation, the organization that runs Wikipedia. As Becca wrote here last year, Wikidata promises a single, shared infrastructure of knowledge beneath Wikipedia in every language. This underlying data layer can be read by both humans and machines, and it propagates changes from one language’s version of Wikipedia to the others. If Canada, for instance, gets a new finance minister, and someone edits Wikipedia in English to reflect that change, then Wikidata will propagate that information to Wikipedia in other languages.

That’s a relatively straightforward use of Wikidata, though. Its promise is machine analysis of the Wikipedia body of knowledge in a complex, ongoing, holistic way — and for computation like that, its macro-organizational system matters.

Imagine a live visualization of the entire Wikipedia system, organized partly by the subject matter of each article. What system needs to exist to make that possible?

During the early part of its development, Wikidata used a hierarchical taxonomy to organize its data entries. The system was called GND, a German initialism for Gemeinsame Normdatei, which translates to “Integrated Authority File.” GND was originally meant to organize bibliographic information across library systems, though it was expanded recently by Internet technologists to work for non-library systems, too.

Which sounds pretty good, right? If you have to schematize the set of known information, you might start with a system originally built for librarians. After all, libraries are institutions tasked with the sustainability and categorization of knowledge. They’re the historical experts.

Which all sounds good... until you encounter certain problems. Silly, wonderful, ridiculous problems.

Gerard Meijssen, an employee of the Wikimedia Foundation, talked about them with Emw, a Wikidata editor, in a recent blog post. Here are the kinds of problems GND entailed for Wikidata:

GND groups everything into huge taxonomical categories. Those categories are:

  • Person.
  • Organization.
  • Place.
  • Event.
  • Work.
  • Term.

... and that’s it. Everything known in the world must fit into one of those containers. Everything ever knowable, to some degree, must fit into one of those containers. Information, that system proposes, comes in six essential types.

(Which makes the “person” macro-classification a little poignant: At the universal base layer, we give a starring role to ourselves. A person is unlike an organization or place or work; a person is so unlike anything else as to deserve its own piece of cosmic tupperware. It’s an anthropocentric view, and, given our current understanding of what kinds of things live in the universe, maybe a bit of a correct one.)

So what are the problems of this system, with its six terms?

First of all, it doesn't differentiate, at this level, between the physical and the abstract, and that includes people. Turns out giving “person” a starring role is its downfall, as Barack Obama and Jay Gatsby then have to share a macro-taxonomical level. You can’t separate out the fictional from the non-fictional at this level.

Second, in the words of Emw:

Any item that is not a person, place, event, organization or work is classified as a “term,” which contains virtually no information. We need to be able to classify things like gravity, carbon, DNA, cancer, clarinet, Twelver Shia Islam, fashion boot, dog and potato as more than simply "terms".

Sousaphone, selenium, sex: They’re so similar, said GND, that they should be sorted as the same species of stuff. But that’s, well, silly, and when Wikidata ditches GND it will lose that organizational fluke.
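To see how little a six-bucket scheme says, here is a minimal sketch in Python. The six type names come from the list above and the item assignments follow the examples in this article; the dictionary layout and the `group_by_type` helper are assumptions made for illustration, not anything GND or Wikidata actually ships.

```python
from collections import defaultdict

# The six top-level types named above.
GND_TYPES = ("person", "organization", "place", "event", "work", "term")

# Hand-labeled items for illustration only; the assignments mirror the
# article's examples, and the data structure itself is hypothetical.
ITEMS = {
    "Barack Obama": "person",
    "Jay Gatsby": "person",
    "sousaphone": "term",
    "selenium": "term",
    "sex": "term",
    "gravity": "term",
    "potato": "term",
}


def group_by_type(items):
    """Group item names under their single top-level type."""
    groups = defaultdict(list)
    for name, gnd_type in items.items():
        if gnd_type not in GND_TYPES:
            raise ValueError(f"unknown type: {gnd_type!r}")
        groups[gnd_type].append(name)
    return dict(groups)


groups = group_by_type(ITEMS)
print(groups["person"])
print(groups["term"])
```

A real president and a fictional bootlegger land in one bucket, and a brass instrument, a chemical element and a potato land in another. Nothing in the structure can tell them apart.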

Overall, writes Emw, “these main types are fine as a way to classify items of general interest in a large library, but they're much too small to form a sound basis for a classification system for all human knowledge.”

So what will Wikidata use now? At the moment, it won’t use an ontology at all. It hopes to get by using only “instance of” and “subclass of,” two types of relationship that don’t prescribe a single, totalizing taxonomy.
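A rough sketch of how those two relationships can do the sorting without a fixed set of top-level boxes, again in Python. Only the relationship names “instance of” and “subclass of” come from Wikidata; every class name and statement below is made up for illustration, and the `all_classes` helper is hypothetical.

```python
# Hypothetical "instance of" statements: item -> classes it belongs to directly.
INSTANCE_OF = {
    "Barack Obama": {"human"},
    "Jay Gatsby": {"fictional human"},
    "selenium": {"chemical element"},
}

# Hypothetical "subclass of" statements: class -> its broader classes.
SUBCLASS_OF = {
    "human": {"person"},
    "fictional human": {"person", "fictional character"},
    "chemical element": {"substance"},
}


def all_classes(item):
    """Return every class reachable from an item: one "instance of" step,
    followed by any number of "subclass of" steps."""
    seen = set()
    stack = list(INSTANCE_OF.get(item, ()))
    while stack:
        cls = stack.pop()
        if cls in seen:
            continue  # already visited; also guards against cycles
        seen.add(cls)
        stack.extend(SUBCLASS_OF.get(cls, ()))
    return seen


print(sorted(all_classes("Jay Gatsby")))
print(sorted(all_classes("Barack Obama")))
```

In this toy graph both men are still people, but only Gatsby is also a fictional character. A class can have more than one parent, and no class is declared the top of everything, which is the sense in which the approach avoids a single, totalizing taxonomy.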

But some Wikidata editors, including Emw, hope to figure out what comes next. “Although it is complex,” he writes, “the world has structure, and classes or types are a useful way to express that structure.” He mentioned the Suggested Upper Merged Ontology as a possible system to adopt. SUMO, as that system is called, is a little more than a decade old. Unlike GND, it classifies everything as either essentially “physical” or “abstract,” then subdivides physical things into “objects” and “processes.”

Right now, though, there is no macro-organizational system being used by Wikidata. Data goes in, linked, complex, alive — but not sorted.