Navigating the Galaxies

New programs are trying to make sense of the uncodified information on the Internet


THE great problem of the information age is that there's too much information. The easier it becomes to store any kind of data on a computer or to dump material onto the Internet, the harder it can be to find what you are looking for. As more people have exchanged E-mail, joined "newsgroups" and other online discussion forums, and set up their own "home pages" on the World Wide Web, the conventional wisdom has held that the computer age will be a time of decentralized, truly democratic information flow. According to this theory, editors and, say, government censors will no longer have the ability to filter, select, or in other ways distort the news, because any citizen with a computer and a modem will be able to prowl through the Internet's vast data troves and find the truth.


Computers are obviously changing the way people communicate, and giving us online friends and colleagues we have never met. Yet for the time being they seem to be making editors and other data-winnowers more rather than less important. Precisely because no one can keep up with all the discussion groups, all the new Web sites, and all the online libraries, people who will do preliminary screening and point others toward promising sites have an increasingly valuable service to sell. Already the Internet teems with recommendations for "hot lists" and "cool sites" and digests of the best postings from various bulletin boards. The most popular online discussion forums tend to be not purely democratic but quasi-authoritarian in spirit, with an active "Sysop" (systems operator) who both steers and stimulates debate.

The people performing such functions will not be quite like traditional print editors -- mainly because the act of reading on the Internet seems destined to remain very different from reading a printed page. Reading from even the nicest computer screen is so unpleasant -- and the expectation is so strong that the computer will always be doing something more active than just displaying text -- that computers will remain better suited to jumping from topic to topic than to the sustained intellectual, artistic, or emotional experience that print can provide. People can read books by the hour; it is hard to imagine anyone's spending even ten minutes straight reading a single document on a computer screen. Yet while editing standards for the screen may differ from those for print, the basic editorial functions of selecting, highlighting, and ordering remain important in the Web world.

THE natural impulse of the computer culture is to look for ways to automate everything. Software designers are now working on systems that might automatically edit the material that is being abundantly and automatically propagated. In an article in last month's issue of this magazine ("The Java Theory") I discussed two computerized systems that can under certain circumstances be tremendously effective in finding data: the electronic version of the Encyclopaedia Britannica and the Lexis-Nexis database. Each has made the most of an unusual advantage. The Britannica's edge is its pre-existing Propaedia, a conceptual index to the encyclopedia's full contents. The Propaedia makes it possible for electronic searching programs to look not just for specific names or phrases in the Britannica but for discussions of broad themes. Lexis-Nexis has the advantage of standardization. With the texts of many newspaper and magazine articles and wire-service stories, and transcripts of broadcast news programs, reaching back as far as the early 1980s, all collected in one computer system, Lexis-Nexis allows researchers to find information even when they don't know where it originally appeared.

The conditions that allow these two systems to work -- a sophisticated index in the Britannica's case, a centralized data collection for Lexis-Nexis -- are conspicuously absent elsewhere on the Internet. Last December the Digital Equipment Corporation unveiled its Alta Vista search system (found at http://www.altavista.digital.com), which searches for Web pages containing particular phrases or names far more quickly than other search systems such as Yahoo and WebCrawler. But to use even the ultra-high-speed Alta Vista effectively you must know what you are looking for before you start. The response to a general query, about an idea or trend, may point toward hundreds of Web sites with no indication of which really has the data you want.

Recently, at two laboratories near Boston, I saw projects that were designed to cope with the Internet's limitations and ultimately to make online data less of a gimmick and more of a tool. One location was the famed Media Lab at the Massachusetts Institute of Technology, founded in the mid-1980s and since then the object of both admiration and suspicion in the computer business. The admiration has been for the glamour of the lab's projects and the success of its director, Nicholas Negroponte, in attracting both money and press attention. The suspicion concerns whether the lab's self-consciously "visionary" projects will turn out to have practical, profitable uses in the long run -- and if they do, where the profits will go.

In the back of everyone's mind is the nightmare example of the Xerox PARC lab in Palo Alto. Through the 1970s Xerox PARC generated some of the most influential ideas in computing, but almost none of them did Xerox any good. For example, the concepts of the computer mouse and the graphical interface, now nearly universal because of their application first in Apple products and then in Windows, came from the Xerox lab but enriched other companies.

I have no idea whether the tricks I saw when guided around the Media Lab by a researcher named David Small will ever make it to market. For example, Small demonstrated a work in progress that could be thought of as a very elaborate way to avoid carrying business cards, or as an exploration of futuristic ways to transmit information. This project, directed by MIT's Neil Gershenfeld, uses shoes as computers and the human body as a network. Data could be stored in computer chips in your shoes, and could travel to your fingertips through your body, which can carry a small current. When you shook someone's hand, it would be like making a modem connection: the computers in people's shoes could swap basic information -- fax number, E-mail address -- with one another. Hmmmm.

Small, an MIT graduate student who was wearing a T-shirt reading THINK SMALL, works much of the time in a warehouselike room lit only by the glow of computer screens -- "to keep up my sallow complexion" (for a glimpse, see his Web site at http://www.media.mit.edu/~dsmall). His main work area is a two-by-five-foot Lego board covered with elaborate Lego structures. When he picked up a Lego plane, complete with propeller, and started flying it zoom, zoom, in the fashion of a four-year-old, my worries about the commercial prospects of the Media Lab increased. In fact the plane was the housing for an advanced and expensive position-sensing device, and as Small moved it, a large computer screen showed the "world" (that is, the Lego structures) as it appeared from the vantage point of the plane's nose.

To my eye, the most enticing toys at the Media Lab were the products of the Visible Language Workshop, of which Small's Lego apparatus was just one example. The Visible Language Workshop was created by Muriel Cooper, an MIT graphic designer turned computer expert who was highly influential at the lab until her death in 1994. Under her direction researchers attempted to use computer graphics not as a substitute for text (the Macintosh-Windows approach) but as a way of making text more meaningful. This can be seen as merely an extension of the centuries-old evolution of typography, in which varying fonts and type sizes enhance the meaning of printed words. Yet the tools of modern computing can make it look like a major evolutionary leap.

The dozen or so examples Small and his colleague Brygg Ullmer showed me were all intriguing, including Small's own project allowing users to cruise through the complete works of Shakespeare -- or the Bible, or other vast texts -- as if they were piloting a spaceship through a galaxy. Two projects in particular made me wish that I were a venture capitalist who could put them into effect. One is called the Galaxy of News. It starts out on the computer screen with what looks like a view of several distant nebulae. As you use the mouse to move closer to these star clusters, they turn out to be large subject areas -- government, sports, entertainment, and so on. As you near one of these areas, tiny rays appear, leading to subtopics -- local politics, national politics, foreign affairs, and so on. The farther you move in any direction, the more refined the topics become, until at last the rays lead to a variety of headlines from relevant articles. This may sound unendurably gimmicky -- and the whole experience was reminiscent of a Star Trek scene. But speaking as one who has bragged about being skeptical of graphics and preferring just words, I found this enhancement powerful and natural -- much quicker to make sense of, for example, than a long list of article titles produced by a Nexis search.
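
For readers who want the structure made literal: beneath the starfield is, in effect, a tree of topics, subtopics, and headlines. The sketch below, in Python, is an invented illustration of such a tree, not the Media Lab's software; the topics and headlines are made up.

    # A hypothetical sketch of the nested structure the Galaxy of News
    # flies through: subject areas refine into subtopics, and the leaves
    # are headlines. (The real system rendered this as a 3-D starfield;
    # here it is reduced to a plain tree, with invented headlines.)

    GALAXY = {
        "government": {
            "local politics": ["Mayor unveils budget"],
            "national politics": ["Senate debates tax bill"],
            "foreign affairs": ["Trade talks resume"],
        },
        "sports": {
            "baseball": ["Opening day lineups set"],
        },
    }

    def zoom(tree, *path):
        """Descend one level per step, as the viewer flies closer."""
        for step in path:
            tree = tree[step]
        return tree

    print(zoom(GALAXY, "government", "foreign affairs"))
    # ['Trade talks resume']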

A less flamboyant but equally appealing system was Highway News, created by a woman named Yin Yin Wong. As you look at your computer screen, you seem to be flying at low altitude above a flat midwestern plain marked with billboards announcing different topics -- sports, corporate news, and so on. If you descend to get a closer look, you can see the names of subtopics behind each billboard, and then the articles you're looking for. Again, this may sound obtrusive, but I can imagine using it for research.

THESE and other Visible Language projects all explore how to display information. The idea is that some automatic indexing system will sort and link articles, assigning them to the right galaxy or billboard area. The need for such a product might seem to bring us back to the original problem of the Internet -- that its information, unlike that of the Encyclopaedia Britannica, is not already indexed. I prefer to think that it leads instead to the East Coast laboratories of Sun Microsystems, in Chelmsford, Massachusetts, where an ambitious project to create an automatic "conceptual indexer" is under way.

The principal investigator of Sun's indexing project, William Woods, has worked for nearly thirty years in the realm where mathematics, formal logic, and linguistics intersect. People from each of these disciplines have attempted to create abstract models of how human language works, along with specific models of the structural and semantic quirks of specific languages. The goal has generally been to create "expert systems" that can mimic human understanding in fields ranging from petroleum geology to the interpretation of x-rays. In addition, this research has been applied in "natural-language interfaces" -- which would, for example, allow you to ask a computer, "How much tax do I owe this year?" and have it calculate the answer.

Natural-language systems, as a rule, don't work well. You have to learn to talk like a computer if you want them to understand you; if you slip into the gappy, unpredictable style that real people use when they speak, the computer gets confused. The classic illustration of the tangles of real language is the Groucho Marx line "Time flies like an arrow, but fruit flies like a banana."

Rather than attempting to create a full natural-language system, the Sun indexing project has attempted to develop a computer program "smart" enough to categorize new information it encounters. One of the project's fundamental concepts is that the words in the English language (and the things or ideas they represent) fit together in a "kind of" relationship: a salmon is a kind of fish, a fish is a kind of animal, and an animal is a kind of living thing. Therefore, when a computer found the word "salmon" in a passage, it would know for indexing purposes that the passage concerned not just salmon but also fish, animals, and living things.
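
In code, the idea amounts to a chain of pointers from narrower terms to broader ones. The following sketch, in Python, illustrates the principle only; it is not Sun's program, and its tiny lexicon is invented for the example the researchers themselves use.

    # A minimal sketch of the "kind of" idea: each word points to the
    # broader category it belongs to, and indexing a word credits every
    # ancestor up the chain. Illustration only, not Sun's indexer.

    KIND_OF = {
        "salmon": "fish",
        "fish": "animal",
        "animal": "living thing",
    }

    def index_terms(word):
        """Return the word plus every broader category it implies."""
        terms = [word]
        while word in KIND_OF:
            word = KIND_OF[word]
            terms.append(word)
        return terms

    print(index_terms("salmon"))
    # ['salmon', 'fish', 'animal', 'living thing']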

It would take hundreds of thousands of such hierarchical rules to account for the meanings in the English language, with its notorious ambiguities and nuances. Yet Woods says that his team at Sun has found that it can produce a useful indexer with a relatively small number of such facts. A recent version of their system contained 34,000 individual "kind of" relationships. Apart from tracing the hierarchies themselves, the team's work has involved countless fine-tuning steps -- for instance, helping the indexer to recognize that although "intend" is related to "mean," and "mean" is related to "cruel," "intend" and "cruel" are not related terms. Similarly, the system must be ready to distinguish between "standard" as an adjective meaning "normal" and "standard" as a noun meaning "criterion" or even "flag." It has rules for boiling down words to their root meanings, and for recognizing how those meanings are changed by prefixes like "un-" and "dis-" or suffixes like "-able" and "-ent." And on and on.
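
One way to picture the intend/mean/cruel problem is to key every relation to a particular sense of a word rather than to its spelling alone. The sketch below is an invented illustration of that flavor of rule, not Woods's actual data.

    # A hypothetical sketch of why word senses matter. Keyed by spelling
    # alone, "mean" falsely chains "intend" to "cruel":
    BY_SPELLING = {"mean": {"intend", "cruel"}}
    print("intend" in BY_SPELLING["mean"] and "cruel" in BY_SPELLING["mean"])
    # True -- the spurious link

    # Keyed by sense, the two terms never share a relation:
    BY_SENSE = {
        ("mean", "signify"): {"intend"},
        ("mean", "unkind"): {"cruel"},
        ("standard", "normal"): {"usual"},         # adjective sense
        ("standard", "criterion"): {"benchmark"},  # noun sense
    }

    def linked_by_sense(term_a, term_b):
        """True only if some single word sense relates both terms."""
        return any(
            term_a in linked and term_b in linked
            for linked in BY_SENSE.values()
        )

    print(linked_by_sense("intend", "cruel"))   # False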

On and on might seem to stretch toward infinity: the program requires rules to allow for special cases and then rules to offset the errors those produce. Indeed, when I tried a working model of the system in Sun's offices, half a dozen glitches showed up in ten minutes. Several dozen magazine articles and radio transcripts by one writer had been fed into the machine, which had then attempted to index individual sentences for their meaning. A few mistakes were obvious. It indexed several references to South America, but it had never been taught that Brazil belonged in that category. It listed a number of references to foodstuffs, with the frequently occurring subcategory "ale." Since I had written the articles in question and didn't remember ever mentioning ale, I went to the sentences involved and found that they all contained the word "reality." The program was smart enough to know that "ale" was a kind of food, that "re-" was a prefix, and that "-ity" was a suffix for root words ending in e, but no one had ever told it that "reality" is not something you can drink.
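
A few naive affix-stripping rules are enough to reproduce the glitch. The sketch below is an invented reconstruction of the behavior described above, not Sun's code.

    # A hypothetical reconstruction of the "reality" glitch: strip the
    # prefix "re-", strip the suffix "-ity" while restoring a final "e",
    # and the remainder "ale" matches a lexicon entry for a foodstuff.

    PREFIXES = ["re", "un", "dis"]
    SUFFIXES = ["ity", "able", "ent"]
    LEXICON = {"ale": "foodstuff"}

    def naive_roots(word):
        """Strip known affixes and return any lexicon roots left over."""
        candidates = {word}
        for p in PREFIXES:
            if word.startswith(p):
                candidates.add(word[len(p):])
        for c in list(candidates):
            for s in SUFFIXES:
                if c.endswith(s):
                    candidates.add(c[:-len(s)])
                    candidates.add(c[:-len(s)] + "e")  # roots ending in e
        return {c: LEXICON[c] for c in candidates if c in LEXICON}

    print(naive_roots("reality"))   # {'ale': 'foodstuff'}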

Still, these were exceptions. I was less impressed by the failures than by how many sensible judgments the indexing system had made. Unlike computerized systems for translating one language into another, which often produce gibberish or ridiculous sentences, the indexer failed gracefully when it failed. It gave a number of clues for finding information, so that if one was misleading, the others would get me there. At all times it let me observe and second-guess the structure of its logic. If I was looking for references to "aphasia," which it considered a kind of speech problem, it would show me references to discussions of other kinds of speech problems, in case these had the information I sought.
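
That behavior falls naturally out of the same taxonomy: a term's siblings under its parent category make ready-made "see also" leads. Again a sketch, with an invented lexicon rather than the system's real one.

    # A hypothetical sketch of graceful failure: offer the siblings of a
    # query term -- other kinds of the same parent category -- as leads.

    KIND_OF = {
        "aphasia": "speech problem",
        "stutter": "speech problem",
        "lisp": "speech problem",
    }

    def suggestions(term):
        """Return the term's category and other terms of the same kind."""
        parent = KIND_OF.get(term)
        others = [t for t, p in KIND_OF.items() if p == parent and t != term]
        return {"query": term, "category": parent, "see also": others}

    print(suggestions("aphasia"))
    # {'query': 'aphasia', 'category': 'speech problem',
    #  'see also': ['stutter', 'lisp']}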

"When will this be ready?" I asked eagerly after my test drive. Wood and his associates looked uneasily at one another and said, "Soon, we hope." When I asked whether I would eventually be able to run the system on my home machine, they laughed at me. One said, "We can barely make it run on ours" -- a supercruiser-style Sun workstation. Then I realized that this system, like the Visible Language Workshop entertainments, doesn't have to fit on your home machine. It was meant to be applied to the data resources of the Internet, giving us a way to survey, map, and settle that frontier.


The Atlantic Monthly; April 1996; Navigating the Galaxies; Volume 277, No. 4; pages 119-124.