#WikipediaProblems: How Do You Classify Everything?

Design the wrong system, and “selenium” and “sex” get sorted the same.
Like a card catalog, but, you know, for everything (Glyn Lowe)

How do you organize all the world’s information?

A decision made by the editors in charge of Wikipedia’s newest, biggest project reveals the difficulty of such a task.

Wikidata is the newest project from the Wikimedia Foundation, the organization that runs Wikipedia. As Becca wrote here last year, Wikidata promises a single, shared infrastructure of knowledge beneath Wikipedia in every language. This underlying data layer can be read by both humans and machines, and it propagates changes from one language’s version of Wikipedia to the others. If Canada, for instance, gets a new finance minister, and someone edits Wikipedia in English to reflect that change, then Wikidata will propagate that information to Wikipedia in other languages.

That’s a relatively straightforward use of Wikidata, though. Its promise is machine analysis of the Wikipedia body of knowledge in a complex, ongoing, holistic way — and for computation like that, its macro-organizational system matters.

Imagine a live visualization of the entire Wikipedia system, organized partly by the subject matter of each article. What system needs to exist to make that possible?

During the early part of its development, Wikidata used a hierarchical taxonomy to organize its data entries. The system was called GND—a German initialism, Gemeinsame Normdatei, which translates to “Integrated Authority File.” GND was originally meant to organize bibliographic information across library systems, though it was expanded recently by Internet technologists to work for non-library systems, too.

Which sounds pretty good, right? If you have to schematize the set of known information, you might start with a system originally built for librarians. After all, libraries are institutions tasked with the sustainability and categorization of knowledge. They’re the historical experts.

Which all sounds good... until you encounter certain problems. Silly, wonderful, ridiculous problems.

Gerard Meijssen, an employee of the Wikimedia Foundation, talked about them with Emw, a Wikidata editor, in a recent blog post. Here are the kinds of problems GND entailed for Wikidata:

GND groups everything into huge taxonomical categories. Those categories are:

  • Person.
  • Organization.
  • Place.
  • Event.
  • Work.
  • Term.

... and that’s it. Everything known in the world must fit into one of those containers. Everything ever knowable, to some degree, must fit into one of those containers. Information, that system proposes, comes in six essential types.

(Which makes the “person” macro-classification a little poignant: At the universal base layer, we give a starring role to ourselves. A person is unlike an organization or place or work; a person is so unlike anything else as to deserve its own piece of cosmic tupperware. It’s an anthropocentric view, and, given our current understanding of what kind of things live in the universe, maybe a bit of a correct one.)

So what are the problems of this system, with its six terms?

First of all, it doesn't differentiate, at this level, between the physical and the abstract, and that includes people. Turns out giving “person” a starring role is its downfall, as Barack Obama and Jay Gatsby then have to share a macro-taxonomical level. You can’t separate out the fictional from the non-fictional at this level.

Second, in the words of Emw:

Any item that is not a person, place, event, organization or work is classified as a “term,” which contains virtually no information. We need to be able to classify things like gravity, carbon, DNA, cancer, clarinet, Twelver Shia Islam, fashion boot, dog and potato as more than simply "terms".

Sousaphone, selenium, sex: They’re so similar, said GND, that they should be sorted as the same species of stuff. But that’s, well, silly, and when Wikidata ditches GND it will lose that organizational fluke.
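To make the fluke concrete, here is a minimal sketch of a six-bucket classifier in Python. It is not GND's or Wikidata's actual code: the `GndType` enum, the `KNOWN_TYPES` table, and the `classify` function are all hypothetical names invented for illustration, and the handful of labeled items come from the examples in this article.

```python
from enum import Enum


class GndType(Enum):
    """The six top-level categories described above."""
    PERSON = "person"
    ORGANIZATION = "organization"
    PLACE = "place"
    EVENT = "event"
    WORK = "work"
    TERM = "term"


# Hypothetical, hand-assigned labels for illustration only.
KNOWN_TYPES = {
    "Barack Obama": GndType.PERSON,
    "Jay Gatsby": GndType.PERSON,  # fictional, but still a "person"
    "Canada": GndType.PLACE,
}


def classify(item: str) -> GndType:
    """Anything that is not one of the five specific types becomes a term."""
    return KNOWN_TYPES.get(item, GndType.TERM)


def group_by_type(items):
    """Bucket items by their top-level type, preserving input order."""
    groups = {}
    for item in items:
        groups.setdefault(classify(item), []).append(item)
    return groups


if __name__ == "__main__":
    sample = ["sousaphone", "selenium", "sex", "Barack Obama", "Jay Gatsby"]
    for gnd_type, members in group_by_type(sample).items():
        print(gnd_type.value, members)
```

Under these assumptions, the instrument, the element and the act all land in the same "term" bucket, and the president and the fictional bootlegger share the "person" bucket, which is exactly the complaint.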

Overall, writes Emw, “these main types are fine as a way to classify items of general interest in a large library, but they're much too small to form a sound basis for a classification system for all human knowledge.”

So what will Wikidata use now? At the moment, it won’t use an ontology at all. It hopes to get by using only “instance of” and “subclass of,” two types of relationship which don’t prescribe a single, totalizing taxonomy.
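What might getting by on two relationships look like? The sketch below is one way to picture it, written in Python. It is an assumption-laden toy, not Wikidata's data model or API: the `ItemGraph` class and its methods are hypothetical, and the example statements (selenium, sousaphone and their classes) are illustrative labels rather than data taken from Wikidata.

```python
from collections import defaultdict


class ItemGraph:
    """A toy store of "instance of" and "subclass of" statements."""

    def __init__(self):
        self.instance_of = defaultdict(set)  # item  -> classes
        self.subclass_of = defaultdict(set)  # class -> parent classes

    def add_instance(self, item, cls):
        self.instance_of[item].add(cls)

    def add_subclass(self, cls, parent):
        self.subclass_of[cls].add(parent)

    def classes_of(self, item):
        """Every class an item belongs to, following subclass links upward."""
        seen = set()
        stack = list(self.instance_of.get(item, ()))
        while stack:
            cls = stack.pop()
            if cls in seen:
                continue  # already visited; also guards against cycles
            seen.add(cls)
            stack.extend(self.subclass_of.get(cls, ()))
        return seen


# Illustrative statements only; not copied from Wikidata.
graph = ItemGraph()
graph.add_instance("selenium", "chemical element")
graph.add_subclass("chemical element", "substance")
graph.add_instance("sousaphone", "brass instrument")
graph.add_subclass("brass instrument", "musical instrument")

if __name__ == "__main__":
    print(sorted(graph.classes_of("selenium")))
    print(sorted(graph.classes_of("sousaphone")))
```

Nothing here forces the class chains to meet at a shared root. Selenium and the sousaphone each get as much structure as editors have written down, and no more, which is the sense in which the two relationships avoid prescribing a single top-level taxonomy.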

But some Wikidata editors, including Emw, hope to figure out what comes next. “Although it is complex,” he writes, “the world has structure, and classes or types are a useful way to express that structure.” He mentioned the Suggested Upper Merged Ontology as a possible system to adopt. SUMO, as that system is called, is a little more than a decade old. Unlike GND, it classifies everything as either essentially “physical” or “abstract,” then subdivides physical things as “objects” or “processes.”

Right now, though, there is no macro-organizational system being used by Wikidata. Data goes in, linked, complex, alive — but not sorted.

Robinson Meyer is an associate editor at The Atlantic, where he covers technology.
