How DARPA's Augmented Reality Software Works
Why is the military succeeding where Google Glass failed?
Six years ago, the Defense Advanced Research Projects Agency (DARPA) had a new dream. The agency wanted a system that would overlay digital tactical information directly on top of the physical world.
So, they created a program called Urban Leader Tactical Response, Awareness and Visualization (ULTRA-Vis) to develop a novel and sophisticated augmented reality system for use by soldiers.
Through half a decade and with the help of several military contractors, they succeeded. "To enable this capability, the program developed and integrated a light-weight, low-power holographic see-through display with a vision-enabled position and orientation tracking system," DARPA says:
Using the ULTRA-Vis system, a Soldier can visualize the location of other forces, vehicles, hazards and aircraft in the local environment even when these are not visible to the Soldier. In addition, the system can be used to communicate to the Soldier a variety of tactically significant (local) information including imagery, navigation routes, and alerts.
Last week, I spoke with the core of the team for the lead contractor on the program, Applied Research Associates. They don't build the display—that was BAE Systems—but they do build the brains inside, the engine for doing the geolocation and orientation.
They think that their software, which they call ARC4, could end up in consumer products, and quickly. As they imagine it—and the DARPA prototype confirms—ARC4-powered systems would go beyond what Google Glass and other AR systems are currently able to accomplish.
In the following Q&A, we take a deep dive into how their technology works, what problems they solved, and how they see augmented reality continuing to develop.
There were four ARC associates on the line: Alberico Menozzi, a senior engineer; Matt Bennett, another senior engineer; Jennifer Carter, a senior scientist; and Dave Roberts, a senior scientist and the group leader in military operations and sensing systems.
Last year, I was obsessed with augmented reality. I was just very excited about it. But then when I looked at what, say, Google Glass, could do, I realized that it couldn't do much that was interesting in that realm.
CARTER: And they still can't.
I think I thought things were further along, because there was this time when people began to imagine what Google Glass might be able to do, and those visions were more advanced than the reality.
CARTER: You're right about the commercial space. People have shown things using really great graphic artists that they can't actually do. What we're talking about with ARC4 is true augmented reality. You see icons overlaid on your real world view that are georegistered. And they are in your look direction in your primary field of view. So, basically, if you turn your head, you won't see the icon anymore. You can share information. You can tag stuff that's georegistered.
Google Glass was trying to do augmented reality. They were trying to do what we're doing. But I think they've failed in that they have this thing outside your primary field of view, and it's really information display, not augmented reality.
So, how'd you develop this system—what was your relationship with DARPA?
ROBERTS: We got started on all this six years ago. It started with the DARPA ULTRA-Vis program. We carried that through three different phases. It's now complete. We've been the prime contractor developing that technology. There were companies that came on and it became competitive. Over those different phases, we ended up doing well enough to carry on and the other companies dropped off. We think we're getting to that point where what we think of as augmented reality is going to become something that people see in the real world.
What were the big technology challenges that you had to overcome? What were the things that you couldn't do six years ago?
ROBERTS: The two big fundamental technology challenges from the beginning have been number one, a display that can show information that's—I'm gonna get technical—focused at infinity with a large enough field of view and a high enough brightness that it's usable in outdoor conditions, so military folks in a squad can use it and see information overlaid on the real world.
The other big one was, I've got this display and I can put stuff on it that's bright enough to see, but how in the world can I make sure that this information is embedded in my real world. That's the pose estimation, or headtracking. It's the ability to know where I, as the user, am located and where I'm looking when I'm outdoors, so I can put information on top of the real world view so it's georegistered.
Basically, the system receives latitude, longitude, and elevation, three pieces of information associated with some object. And we get that over a network we're integrated with. But at the end of the day, the system has to take that information and render an icon out there in the real world that sticks exactly on what it is supposed to be sticking on. And that's the fundamental challenge we've been working on for these six years and can now do out in the field with real stuff.
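To make that rendering problem concrete, here is a rough sketch of the geometry in Python. This is not ARA's code; the flat-earth approximation and all names are mine, purely for illustration:

```python
import math

def geodetic_to_enu(lat, lon, alt, lat0, lon0, alt0):
    """Flat-earth approximation: convert a target's latitude/longitude/
    elevation to east-north-up offsets (meters) from the user at
    (lat0, lon0, alt0). Fine for nearby targets; real systems use a
    proper geodetic transform."""
    R = 6378137.0  # WGS-84 equatorial radius, meters
    north = math.radians(lat - lat0) * R
    east = math.radians(lon - lon0) * R * math.cos(math.radians(lat0))
    up = alt - alt0
    return east, north, up

def icon_bearing(east, north, up):
    """Azimuth (degrees clockwise from north) and elevation angle of the
    target as seen from the user. The renderer compares these against the
    head's estimated orientation to decide where the icon goes."""
    azimuth = math.degrees(math.atan2(east, north)) % 360.0
    elevation = math.degrees(math.atan2(up, math.hypot(east, north)))
    return azimuth, elevation
```

The hard part, of course, is not this projection but knowing the head's own position and orientation accurately enough for the icon to "stick."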
When you looked at that as a technical challenge, break that down for me a little. What were the components of the challenge?
ROBERTS: There are sensors required to track a person's head, in terms of position and orientation. Right now, inertial sensors—gyroscopes, which are angular rate sensors, and accelerometers—are used to understand the motion of the head. In addition, GPS is an available input to help understand position.
And then a magnetometer is typically used to understand azimuth, or heading—where someone is looking. Those four pieces come together and people can fuse that data to try to figure out position and orientation of the head. That's what people, for the most part, have been doing up until now.
There are problems with just using those sensors. One of the big problems is the magnetometer. It senses the earth's magnetic field—it's a compass, basically. And it's not necessarily terribly accurate.
ALBERICO: The sensor may be accurate, but the field it's trying to measure may not be just that of the earth. That sensor is useful in its capacity to measure the earth's magnetic field to give you a measurement of azimuth, but if you have other things that superimpose their magnetic field on top of the earth's, then you end up measuring that, and it becomes useless for figuring out your azimuth. It's not so much noise or inaccuracies in the sensor itself, but disturbances of the magnetic field itself.
What are a few of the big disturbances you see?
ALBERICO: It doesn't take much because the Earth's magnetic field is relatively weak, so it's disturbed by anything that's ferromagnetic. Steel, iron, or anything that generates a magnetic field, like electric motors. If you were to walk by a car, that's enough to give you a disturbance that would throw your azimuth estimate off a little bit if you just based it on the magnetometer.
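A common first line of defense against this—a generic technique, not necessarily what ARC4 does internally—is to flag magnetometer samples whose magnitude strays too far from the expected local field strength, since a nearby car or motor usually changes the magnitude, not just the direction. A minimal sketch:

```python
def mag_disturbed(mx, my, mz, expected_ut=50.0, tol=0.15):
    """Flag a magnetometer sample (microtesla) as disturbed when its
    magnitude deviates from the locally expected Earth-field strength
    by more than `tol` (fractional). Earth's field is roughly 25-65 uT
    depending on location; `expected_ut` would come from a field model."""
    magnitude = (mx * mx + my * my + mz * mz) ** 0.5
    return abs(magnitude - expected_ut) / expected_ut > tol
```

A flagged sample would then be down-weighted or discarded by the fusion filter rather than trusted for azimuth.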
So you need to introduce some other signal that can serve as a correction for the raw data coming out of the magnetometer.
ALBERICO: Yes, and that is one of the main benefits or achievements of our technology. Anything you see out there that people try to put together based on just a magnetometer will suffer from disturbances.
So, how does this work in practice? And can it work everywhere or do you need special markers to make the world legible to your system?
ROBERTS: Because of the DARPA program, we're trying to solve this in outdoor environments where users are walking around quite a bit. Conventional augmented reality today, indoor or outdoor, is limited to a single fixed location, and those systems might end up being very heavy, using computer vision just to maintain an understanding of the scene in that one fixed location. In order to break free of that mold—which is very novel—we combine the basic inertial sensing that we talked about with a series of what we call “signals of opportunity,” which are computer vision methods. For example, a lot of the terrain that these soldiers are operating on is mountainous.
So one of our methods that we use to overcome this azimuth problem with magnetometers is to provide an absolute heading input based on a comparison of the known mountainous terrain from digital terrain elevation data (DTED), basically a height map across the world. We look at what the image is in the camera and compare it to the actual DTED data. By doing a match between that data, we're able to achieve a very accurate orientation of the user's head. In that case, the system doesn't need to rely on the magnetometer at all anymore. It can leverage this new opportunistic input to do that. And if the user is standing next to a vehicle, he can maintain very high accuracy, even though he is extremely magnetically disturbed.
And we call that signal of opportunity “horizon matching.”
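The idea can be sketched as a brute-force profile match. This toy Python version—hypothetical names, drastically simplified search, not ARA's implementation—slides the camera's extracted horizon profile around a 360-degree horizon profile predicted from terrain data at the user's position:

```python
import numpy as np

def best_azimuth(image_horizon, dted_horizon_360, step_deg=1.0):
    """image_horizon: horizon elevation profile (degrees) across the
    camera's field of view. dted_horizon_360: the profile predicted from
    digital terrain elevation data for a full 360-degree sweep, sampled
    every step_deg. Returns the azimuth (degrees) whose window of the
    predicted profile best matches the observed one (smallest SSD)."""
    n = len(image_horizon)
    best, best_err = 0.0, float("inf")
    for shift in range(len(dted_horizon_360)):
        window = np.take(dted_horizon_360,
                         range(shift, shift + n), mode="wrap")
        err = float(np.sum((window - image_horizon) ** 2))
        if err < best_err:
            best, best_err = shift * step_deg, err
    return best
```

A real system would refine this with sub-degree interpolation and fold the result, with a confidence value, into the sensor fusion rather than trusting it outright.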
What are the other signals of opportunity?
ROBERTS: Another one we incorporated involved using landmarks, or using something of known position. For a military user, he may plan a mission ahead of time. He may know the position of cell tower or a mountain peak or a steeple of a mosque that he'll be looking on during the mission. If he can enter those coordinates into the system ahead of time, he can very efficiently look toward that point of interest and do a single alignment step to align his system to that. And from that point forward, it uses computer vision to understand the orientation of his head based on that absolute fix. We call that “landmark matching.”
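That single alignment step amounts to solving for a heading bias: the bearing to the landmark is known from coordinates, so whatever the sensors report when the user sights it reveals the error. A minimal Python sketch, with all names hypothetical:

```python
import math

def heading_bias(user_e, user_n, lm_e, lm_n, sensed_heading_deg):
    """One-click landmark alignment: the user centers a landmark of known
    position (east/north coordinates, meters) in the display. The gap
    between the bearing computed from coordinates and the heading the
    sensors report is the azimuth bias to subtract from later readings."""
    true_bearing = math.degrees(
        math.atan2(lm_e - user_e, lm_n - user_n)) % 360.0
    return (sensed_heading_deg - true_bearing) % 360.0
```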
Another one involves the sun itself. Whenever the camera has the sun in its field of view, it can essentially pick out pixels where the sun is on the image and it can support a computation of the orientation of the head. Using the sun alone, we can overcome the deficiency of the magnetometer.
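Assuming an ephemeris supplies the sun's true azimuth for the current time and position, the geometry reduces to one pixel offset. A small-angle pinhole-model sketch in Python—an illustration of the principle, not ARA's implementation:

```python
import math

def camera_azimuth_from_sun(sun_px_x, image_cx, focal_px,
                            solar_azimuth_deg):
    """Given the sun's detected horizontal pixel position, the image
    center, the focal length in pixels, and the sun's true azimuth from
    an ephemeris, recover where the camera is pointing, independent of
    the magnetometer. Ignores roll and elevation for simplicity."""
    offset_deg = math.degrees(math.atan2(sun_px_x - image_cx, focal_px))
    return (solar_azimuth_deg - offset_deg) % 360.0
```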
Also, instead of the terrain of mountains, we can use the terrain of buildings, using LIDAR data coming in from aircraft to create models of the environment.
It's all of these methods that when you bring them together provide these opportunistic signals to enable you to have high accuracy across a broad operational envelope.
Is this all automatic? If I or some future commercial application, if a user were to go into downtown Oakland and then the Berkeley hills, is the software going to know, 'Oh, OK, this is the signal of opportunity that I should be using?' Or is that pre-set in a planning mode?
ROBERTS: The idea is that these are automated methods. Horizon matching, for example, would be something that's automatically assessed. There are all sorts of confidence measures that are internal to the system and it's all brought together in the sensor fusion. Urban terrain would be automatic. Sun matching would be automatic. The landmark matching requires a single click of the user, but there are extensions that wouldn't require that. It could know automatically that you're looking in a particular direction and do vision-based matching based on an understanding of what that landmark might look like. The intent is for all of these things to be automated and not impose cognitive load on the user. The whole idea is for the user to be aware of his surroundings at all times and not have to think about stuff that is outside of his mission or task.
For a consumer application—someone walking down the street—that person is only concerned with questions like: where is my friend, and where did this message come from? We want them to see an icon overlaid on their friend and have their attention directed to their friend. Or maybe to a point on a building that a friend just marked.
CARTER: That's something we can do with networked users. So, for military, it's more of a team environment, but those are networked users.
You can think of this almost as your Facebook feed. You can choose whose information you want to see and whose you don't. This is something that we think this application could eventually be used for. You're walking down the street and you want to see information displayed from a friend who tweeted something that they think is cool. And they could tag buildings. Information like that can be shared over a networked group of people.
And you guys would license this technology to someone who would build that? Or you'd build it yourself?
ROBERTS: The way we approach this is that we're developing the software capability, the engine behind the scenes. And it can work with any number of displays, whether that's a high-end military display or a more cost-effective consumer product from any of the companies developing those. You're right, licensing is really what we would provide. We're not developing the hardware of the display or the computing module itself. Our software and algorithms, even though they use vision, are very lightweight. We leverage vision only when it's required. We're not trying to do the full SLAM (simultaneous localization and mapping) that a lot of the fixed-location applications are doing. That requires a heavy amount of computer vision. Even people who want to go out into the outdoor environment have to carry a backpack with a computer in it. We're implementing all of this on mobile processing devices.
Give me a sense of what one of these algorithms actually looks like. How does it do the weighting of the various signals? How is the integration done?
ALBERICO: We view this as a plug-and-play system for accepting signals of opportunity as they become available. And it's scalable: as we implement capabilities to make use of more and more signals of opportunity, the same architecture will be able to integrate them.
If no signal of opportunity is available, we're back to a GPS-aided inertial system, which makes use of accelerometers, gyros, and a magnetometer. We give weights to those measurements as well—the gyro, the accelerometer, and so on—based on characterization of the actual sensor data.
And then, any signal of opportunity that would like to plug and play would have to provide a measure of confidence and that's how we've designed this.
What integrates everything together—and takes into consideration these measures of confidence—is an Extended Kalman Filter, or EKF. It's a popular way of integrating various measurements coming from various sensors. Of course, we've customized it to, for example, detect disturbances in the magnetic field and perform other sanity checks on the sensor signals. There is a lot of that going on when you design one of these systems.
But the underlying principle and engine is the Extended Kalman Filter.
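As a toy illustration of the principle—not ARA's filter, which estimates full position and orientation and uses the extended (nonlinear) form—here is a one-state Kalman filter on heading alone. The gyro drives the prediction, and any signal of opportunity supplies an absolute-azimuth measurement whose confidence is expressed as a variance. All noise values here are made up:

```python
class HeadingEKF:
    """Minimal one-state Kalman filter for azimuth. predict() integrates
    the gyro's yaw rate; update() fuses an absolute heading from any
    signal of opportunity (horizon match, landmark, sun), weighted by
    its reported measurement variance."""

    def __init__(self, azimuth=0.0, variance=100.0, gyro_noise=0.01):
        self.az = azimuth   # state: heading, degrees
        self.P = variance   # state uncertainty, deg^2
        self.Q = gyro_noise # process noise added per second

    def predict(self, yaw_rate_dps, dt):
        self.az = (self.az + yaw_rate_dps * dt) % 360.0
        self.P += self.Q * dt

    def update(self, measured_az, measurement_variance):
        # Kalman gain balances our uncertainty against the sensor's.
        K = self.P / (self.P + measurement_variance)
        # Wrap the innovation into [-180, 180) so 359 -> 1 is a +2 step.
        innovation = (measured_az - self.az + 180.0) % 360.0 - 180.0
        self.az = (self.az + K * innovation) % 360.0
        self.P *= (1.0 - K)
```

A confident measurement (small variance) pulls the estimate hard; a dubious one barely moves it—which is exactly how the confidence measures from the signals of opportunity get their say.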
Bennett, I assume that you did a piece of this that we haven't touched on as I've been dragging us in these various directions.
BENNETT: I'm a systems integrator, knitting everything together, making all the various parts and pieces work together seamlessly. I do a lot of hardware and software integration. I do a lot of calibration. I've developed a lot of our calibration methodologies.
This is interesting. I've been writing a lot about Google cars and one of the big things is calibration of the LIDAR, so I've been wanting to learn more about this.
BENNETT: Calibration is a big deal for these systems. It's kind of a garbage in, garbage out situation. If you don't have calibration between camera, inertial suite, and display, you won't see an accurate georegistration of the icons on your display.
We have an instrumented room in our facility that we use to calibrate the camera to the inertial sensor suite. We have some fiducials on the wall and we know exactly where they are in 3D space and we take a bunch of synchronized data. That is to say, camera images are synchronized with the inertial data, and then we run some specialized algorithms to find the fiducials and produce calibration parameters that account for the camera intrinsics, which is how the lens distorts the image, and camera extrinsics, which are parameters that describe the 3D orientation between the camera and the sensor suite.
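Calibration routines of this kind typically search for the intrinsic and extrinsic parameters that minimize reprojection error over the known fiducials. A generic Python sketch of the quantity being minimized—hypothetical names, not ARA's code:

```python
def reprojection_error(points_3d, points_px, project):
    """Mean pixel distance between observed fiducial detections and where
    the current calibration parameters project the known 3D fiducial
    positions. `project` maps a 3D point to (u, v) pixels using the
    candidate intrinsics and extrinsics; calibration minimizes this."""
    total = 0.0
    for X, px in zip(points_3d, points_px):
        u, v = project(X)
        total += ((u - px[0]) ** 2 + (v - px[1]) ** 2) ** 0.5
    return total / len(points_3d)
```

In practice this objective is handed to a nonlinear least-squares optimizer; a perfect calibration drives it to zero on the instrumented-room data.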
We have a sensor suite that we've used to integrate the systems, but you could do this for any system that combines a camera and a set of inertial sensors. So that is to say, any smart phone in the world or tablet.
So, you've got the camera-inertial navigation assembly that's been calibrated, and you need to calibrate that to a display. The way we've done it historically is with a camera positioned behind the display, but what we're doing now is letting users do it themselves. If you take the image from the camera and reproject it on the display, then your system is calibrated when it lines up with what you see in the real world.
I know you're familiar with the Q-Warrior, that's our inertial sensor plus camera suite. When the sensors are fixed to the display, you'd only need to do that calibration once.
So, one issue it seems with these systems is the lag between people's real-world vision and the software displaying stuff on top of that world. How have you tackled latency?
ROBERTS: One of the big things is that these displays have an inherent latency attached to them. Let's say you did have a sensor module attached to the display, and let's say you were wearing it and you turned your head. There is going to be a certain amount of time—on the order of 50ms—between when the sensors sense that motion and when it's rendered on the display. So the way we overcome that basically involves trying to predict what the motion is going to be, or what the motion entails, during that 50ms. If we can forward-predict in time what the sensor signals will be, then we can render information on the display that will match the environment.
ALBERICO: Latency is something that is always there. You can't eliminate it because you have the ground truth behind the glass—reality—and it is instantaneous. You need time to transfer measurements from the sensor acquisition, process, and render them. 40-50ms is typically what we see.
But as soon as the user starts moving, even a tiny movement, we go through and we start predicting what will happen next based on that little motion. So, we send to the renderer, the position we predict for 40-50ms into the future, so that by the time it gets there, it's what it needs to be for the symbology to be properly aligned.
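At its simplest, that render-ahead step is a linear extrapolation of the current motion. A toy Python sketch of the idea (the real predictor is part of the full pose filter):

```python
def predict_pose(azimuth_deg, yaw_rate_dps, latency_s=0.05):
    """Extrapolate the head's azimuth forward by the display pipeline's
    latency (~40-50 ms), so the symbology drawn now lines up with where
    the head will actually be when the frame appears."""
    return (azimuth_deg + yaw_rate_dps * latency_s) % 360.0
```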
ROBERTS: At the end of the day, all that means is that when I turn my head, the icon stays locked on the real world and I don't see that lag effect. That's one thing we've been able to do very well. A lot of the AR systems out there have real issues with that.
To do that prediction, is that a thing where you put tons of measurements into a machine learning model and crank it out?
ALBERICO: We actually have started looking at what can you say about the human motion that would allow you to make better predictions. That's under development, too. But right now what's working is an extrapolation of the motion. As soon as we detect a bit of motion, we extrapolate that into the future. If the user keeps up with that bit of motion for 40-50ms then the outcome is that the icon will be locked.
ROBERTS: We do it without just throwing more compute cycles at the problem. We're not adding computational power—we're using it more intelligently.