The Ultimate Toy

Debugging the computer "Eagle"

In the past month or so, the IP had been found responsible for a number of bugs. This fact bothered Veres. It did not matter that the IP was a much more complex piece of hardware than many others in Eagle, nor that when he had helped to design it Veres had been a novice and pressed for time. Like most members of the team, Veres felt what Holberger called "the peer pressure": "If I screw up this, I'll be the only one, and I'm not gonna be the only one."

Though they were friends, Veres was a little annoyed with Guyer, and Guyer knew it. So Guyer did not feel inclined to blame this latest bug on the IP. It appeared that either the IP or the system cache was responsible for the failure. Another engineer in the team had designed the system cache, and he wasn't in the lab that afternoon. Laughing about their choice, Veres, Holberger, and Guyer decided to interrogate the system cache first.

They hooked up analyzers to the circuits of the system cache and took some pictures, but they received no immediate enlightenment. Weary after more than ten hours in the lab, Holberger and Veres departed, leaving Guyer alone with Gollum. He spent the night with it.

Some time after midnight, Guyer was sitting in front of a couple of logic analyzers, peering into their screens, when suddenly he touched his mouth, wheeled around in his chair, and started flipping through one of the large bound volumes on the lab table. He had discovered that the diagnostic program periodically changed the location of the target instruction, the one that Gollum failed to execute. Maybe the IP wasn't getting the word to change the instruction's location. The system cache was supposed to see to it that the IP did so. Maybe the system cache was indeed to blame. With mounting enthusiasm, Guyer recorded his theory in the log book and returned to the machines to gather evidence. By three that morning, however, he had found none, and his enthusiasm for his theory had waned. He felt, he said, "sort of neutral." It was beginning to look more complex, he thought on his way out of the lab.

Around dawn, Veres sat down in front of Gollum again. He examined Guyer's entries in the log book. These deepened Veres's suspicions that the problem involved one of the figurative time bombs that they had been encountering frequently in the past several weeks. One way to approach such a difficult bug was to follow clues back through the diagnostic program. First, Veres found out exactly where in the program the failure initially occurred. Next, he examined "addresses." One could imagine the machine's memory system as a collection of mailboxes, organized in neighborhoods. Like each mailbox, each neighborhood has a unique address; that address is a number, which is called a "tag." At the moment of failure, Veres discovered, the I-cache contained a collection of mailboxes identified by the tag 21. But in the system cache at that moment, the tag for what should have been the corresponding collection of mailboxes was 45. Which was the right number? The answer would reveal whether the IP or the system cache was at fault. Veres hooked up logic analyzers to Gollum once again.

Old hands in the group had used oscilloscopes for debugging previous machines. Veres once said, "An oscilloscope is what cavemen used for debugging fire." Analyzers were newer, more versatile tools; one of their important features was that they had memories. They could take and save pictures of electronic events that occurred in 256 different cycles of the computer's operation. Veres ran the diagnostic program all over again. Then he started looking back through the pictures that his analyzers had taken and saved. He had scarcely begun this tedious search when Holberger arrived, and when Holberger saw what Veres was up to, he retreated, to work with another Hardy Boy on Coke.

Nothing turned up. By the time Guyer came in that afternoon, Veres had not found a single new clue. Holberger held a short conference. "We need ideas. We're going to defer it," he said. So Guyer worked on other problems that night. But Veres went home with the two tags on his mind. One was right, the other wrong. Obviously, there had to be a way of finding out which was which.

Veres was up early the next morning. He didn't want any interference; he wanted time alone with this bug. He got into the shower and his working day commenced. "I get quite a lot of work done in the morning while taking a shower," Veres remarked later on. "Showers are kind of boring things, all things considered." That morning, he conceived a new approach. Evidently, the cause of the failure lay further back in the diagnostic program than a logic analyzer could look. So why not search from the other direction, forward through the program, instead of backward? He would run the diagnostic up to the start of the fourth pass, and then every time Gollum performed a JSR and Return, he'd have the computer halt so that he could get a picture with the analyzer and certain printed information on the system console. This technique would take some time; but he was willing to try it.

Holberger entered the lab a few hours after Veres did. The scene that greeted him made him smile wryly. Near Gollum, on the floor, lay a great heap of computer paper, and Veres was sitting beside the pile. "I found it," he said to Holberger.

At iteration 122 of the subtest in question, the I-cache contained the block of instructions numbered 21. Millions of ticks of the computer's clock and thousands of instructions later, at iteration 151 of the subtest, Veres had observed the system cache telling the IP to replace tag 21 with tag 45. He had exonerated the system cache. The IP must have disobeyed the order, because at iteration 158 Veres had found the I-cache still harboring tag 21. "Which, I'm very sorry to say, is wrong."
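For readers who think in code, Veres's forward search can be sketched roughly as follows. This is only an illustration, not anything the team ran: the `Snapshot` record and the `first_divergence` helper are hypothetical names, and the tag values paired with each iteration are assumed from the account above (in particular, that the two caches agreed on tag 21 at iteration 122).

```python
from typing import Iterable, NamedTuple, Optional


class Snapshot(NamedTuple):
    """One halt of the machine: what each cache held at that moment."""
    iteration: int
    icache_tag: int      # tag held by the IP's instruction cache
    syscache_tag: int    # tag the system cache says should be there


def first_divergence(snapshots: Iterable[Snapshot]) -> Optional[Snapshot]:
    """Walk forward through the halts and return the first one
    where the two caches disagree, or None if they never do."""
    for snap in snapshots:
        if snap.icache_tag != snap.syscache_tag:
            return snap
    return None


# Assumed, illustrative data modeled on the iterations named in the text.
snapshots = [
    Snapshot(iteration=122, icache_tag=21, syscache_tag=21),
    Snapshot(iteration=151, icache_tag=21, syscache_tag=45),
    Snapshot(iteration=158, icache_tag=21, syscache_tag=45),
]

culprit = first_divergence(snapshots)
print(culprit)
```

Searching forward from a known-good state means the first disagreement marks where the trouble begins, however far back that is; a backward search is limited by how many cycles the analyzer can remember.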

Holberger and Veres moved swiftly. They hooked up analyzers. Seen at such a moment, among their machines, flicking switches, speaking cryptically in a language that even another computer engineer, from a different project, would have found largely incomprehensible, they resembled airline pilots in a cockpit preparing for takeoff. In fairly short order, the crucial picture appeared on the small blue screen of one of the analyzers.

"There it is."

"Yup."

They saw the IP throwing out tag 45 and keeping the old, invalid tag 21. A few more pictures showed that the IP was, quite literally, getting its signals crossed. The IP received from the system cache the signal to throw away tag 21, but before the IP could obey, the signal from the system cache was altered, by another signal coming from another part of the machine. The solution lay in delaying the arrival of that second signal, so that the IP would always have time to clear out an old collection of mailboxes before it was asked to perform some other task.
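The crossed signals and their cure can be modeled in a few lines. This is a toy sketch under stated assumptions, not a description of Eagle's actual timing: the function name `icache_tag_after`, the `hold_off` parameter, and every cycle count are hypothetical, chosen only to show why delaying the second signal lets the order go through.

```python
def icache_tag_after(old_tag: int, new_tag: int,
                     second_signal_at: int, cycles_to_obey: int,
                     hold_off: int = 0) -> int:
    """Return the tag the I-cache ends up holding.

    The replace order arrives at cycle 0, and the IP needs
    `cycles_to_obey` cycles to act on it. A second, unrelated signal
    arrives at cycle `second_signal_at + hold_off`. If it lands before
    the IP has obeyed, it alters the order and the old tag survives.
    """
    arrival = second_signal_at + hold_off
    order_survives = arrival >= cycles_to_obey
    return new_tag if order_survives else old_tag


# Hypothetical timing: the IP needs 2 cycles, the second signal comes at cycle 1.
broken = icache_tag_after(21, 45, second_signal_at=1, cycles_to_obey=2)
fixed = icache_tag_after(21, 45, second_signal_at=1, cycles_to_obey=2, hold_off=1)
print(broken, fixed)
```

With no delay the stale tag 21 stays put; holding the second signal off by a single assumed cycle is enough for tag 45 to replace it.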

The solution took the material form of a circuit that cost eight cents wholesale. This circuit produced a signal. Writing up the engineering change order, Holberger christened the signal "NOT YET." Other engineering teams used formal, technical names for signals. The Eclipse Group usually looked for something simple that fit, and if they couldn't come up with an appropriate title they'd use their own given names. That, Holberger noted, defined the Eclipse Group's style. It was a way -- a small one, to be sure -- of leaving something of oneself inside one's creation.

They were having fun now. They installed the new circuit. They ran the diagnostic. Holberger wrote in the log book, "With this ECO installed, Eclipse 21 runs 10 passes." Just one more routine chore remained. They had to make sure that the new circuit didn't foul up some other operation. So they started running the other diagnostic programs that Gollum had already mastered, and everything was proceeding satisfactorily when all of a sudden the console started scratching out an error message.

"We didn't do it," said Holberger. "We didn't do it right."

It seemed to Holberger that they were on the brink of another lengthy search, and he had no appetite for it. They hooked up analyzers and studied some pictures, but in a desultory way. The problem looked complex. Then Veres remembered that they had forgotten to do something basic.

He took out the new circuit. They ran the failing program. Gollum committed the new failure anyway. So it was not the new circuit that had caused this problem. Greatly relieved, smiling now, Holberger pointed out that they had placed the IP circuit board out on the "extender." The IP was hooked up to Gollum, but it was sitting in a small frame of its own, outside the main frame of the machine. Use of the extender is standard debugging practice; it makes the board easy to get at. But computers aren't created with extenders in mind, and in some cases a perfectly good circuit board won't function correctly while out on an extender. They put the IP board back in its proper place, among the other boards, and the failure no longer occurred.

That afternoon, Gollum was made to perform all of the basic diagnostics, including Eclipse 21. The machine did not fail once. They had reached a milestone, but it was one that they had thought they had reached before. The trickiest diagnostic programs lay ahead. Veres said he had "a feeling of accomplishment." He added, "But then again, there's lots more feeling of accomplishment to go."
