To employ autonomous weapons systems like pilotless aircraft and other artificial intelligence-powered innovations, the U.S. military will have to overhaul not just its strategy and tactics in every domain, but also the way it tests its technology, according to the Defense Department’s first-ever AI chief.
The Pentagon “is not well postured yet” for testing and evaluation, or T&E, of artificial intelligence, or any other kind of cutting-edge software that requires continuous updating, said retired Air Force Lt. Gen. John “Jack” Shanahan, who served as the inaugural director of the DOD’s Joint Artificial Intelligence Center (JAIC) from 2018 to 2020.
Still, “there’s nobody better in the world at T&E than our military services … we’ve been doing it forever,” he told an audience at the Center for a New American Security.
That applies to software as well as hardware, he added, so long as it is developed in a “very linear process.” Software like the Air Force’s Theater Battle Management Control System or Distributed Common Ground System was upgraded in “blocks” every couple of years, which allowed for extensive multi-stage testing. “Upgrades went through development tests, operational tests, initial fielding, follow-on fielding, and all that,” he said.
But AI is different, he said: “We have got to get used to the fact that these updates may be happening in hours and days, not months and years.”
Like other systems built with the latest software engineering techniques, AI has to be updated regularly, Shanahan said, especially during a conflict. “If you don’t do that, it’s going to go stale. It’s not going to work as advertised. The adversary is going to corrupt it, and it’ll be worse than not having AI in the first place,” he said.
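Shanahan did not say how that kind of staleness would be caught in practice. In commercial machine-learning operations, one common approach is to track a deployed model’s rolling performance on operator-confirmed outcomes and flag it for re-evaluation when accuracy slips. The minimal sketch below illustrates the idea; the window size, threshold, and feedback source are all assumptions for illustration, not details of any fielded system.

```python
# Minimal sketch of continuous performance monitoring for a deployed model.
# Purely illustrative: the window size, threshold, and outcome feed are
# assumptions, not details of any DOD or fielded system.

from collections import deque

WINDOW = 200            # number of recent predictions to track
ALERT_THRESHOLD = 0.90  # rolling accuracy below this triggers a retest

recent = deque(maxlen=WINDOW)  # 1 = prediction confirmed correct, 0 = incorrect

def record_outcome(correct: bool) -> None:
    """Log whether the latest model output matched ground-truth feedback."""
    recent.append(1 if correct else 0)

def needs_retest() -> bool:
    """Flag the model for re-evaluation once rolling accuracy drops."""
    if len(recent) < WINDOW:
        return False  # not enough feedback yet to judge
    return sum(recent) / len(recent) < ALERT_THRESHOLD

# Example: operators confirm or correct model outputs as they come in.
for outcome in [True] * 150 + [False] * 50:
    record_outcome(outcome)
print("Retest needed:", needs_retest())  # True here: rolling accuracy fell to 0.75
```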
But how would such testing work in the midst of a shooting war, Shanahan mused, when the stakes and the pressure are high?
“For continuous integration/continuous deployment, I think we ought to be thinking about it down at unit level. … Does it always have to go back to some centralized T&E facility? Not in the heat of war,” he said. “Does someone wear a special patch, like they’ve been through Top Gun or the Air Force Weapons School, that says I am qualified to do AI T&E at the unit level? Maybe something like that. We’ve got to think our way through it.”
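Shanahan did not describe what a unit-level check might look like, but in commercial continuous-integration pipelines the idea is usually expressed as an automated evaluation gate: an updated model is promoted only if it clears predefined test thresholds. The sketch below is a notional illustration of that pattern; the metrics, thresholds, and names are hypothetical, not drawn from any DOD pipeline.

```python
# Notional CI/CD evaluation gate for a model update (illustrative only).
# All names, metrics, and thresholds here are hypothetical assumptions,
# not drawn from any DOD or JAIC system.

from dataclasses import dataclass

@dataclass
class EvalResult:
    accuracy: float            # fraction correct on a held-out test set
    false_positive_rate: float
    latency_ms: float          # average inference time

# Minimum bars an update must clear before it is fielded.
GATE = {
    "min_accuracy": 0.95,
    "max_false_positive_rate": 0.02,
    "max_latency_ms": 50.0,
}

def passes_gate(result: EvalResult) -> bool:
    """Return True only if the candidate model clears every threshold."""
    return (
        result.accuracy >= GATE["min_accuracy"]
        and result.false_positive_rate <= GATE["max_false_positive_rate"]
        and result.latency_ms <= GATE["max_latency_ms"]
    )

if __name__ == "__main__":
    candidate = EvalResult(accuracy=0.97, false_positive_rate=0.01, latency_ms=42.0)
    if passes_gate(candidate):
        print("Candidate update cleared the automated T&E gate; promote for fielding review.")
    else:
        print("Candidate update failed the gate; hold back and retest.")
```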

T&E can no longer be viewed as a “one and done” proposition that a system has to complete before deployment, he said. Instead, it has to be seen as a continuous process that runs throughout a system’s life cycle.
“So we do have to look at this as a full life cycle approach, and that’s where we can start mitigating and managing risk, at the design and development phase, all the way through T&E, all the way through fielding and all the way through sustainability,” Shanahan said.
He added that the competitive pressure to deploy game-changing technology like AI before U.S. rivals creates a risk that T&E will get shortchanged.
“If we start saying we’re going to lose the competition against China unless we put this out in the field as fast as possible, that’s risky, because we will find [on the battlefield] that systems don’t work as intended, and the adversary always gets a vote. They will try to counter our systems,” he said.
Shanahan, now retired and working as a consultant, spoke during the launch of the new CNAS report “Safe and Effective: Advancing Department of Defense Test and Evaluation for Artificial Intelligence and Autonomous Systems.”
Testing AI is challenging for other reasons too, report author and CNAS scholar Josh Wallin said. When the Defense Advanced Research Projects Agency (DARPA) tested its autonomous dogfighting program, it did so in a digital simulation, where the program handily beat a human pilot. But when the agency put the program in a real plane with a human safety operator—a highly experienced test pilot who could disable the AI, take control, and land the plane if needed—it no longer performed as it had in the simulator.
“They really quickly ran into this problem, which is … that so many of the novel behaviors that they’d been excited about in simulation were reasons that the test pilots would shut off the autonomy and kill the test right from the beginning,” he said.
User acceptance issues are a major challenge for the deployment of AI, he explained, adding that one of the major recommendations of the report was “just how important it is to integrate operators early from a development perspective and also from a testing perspective. You can’t wait until you get to operational tests to start talking to operators. You have to do it much earlier than that.”
Another big concern is integration, Wallin said: how autonomous systems will interact not just with friendly operators and enemy troops, but with each other. Because it isn’t practical to test AI in every possible combination of circumstances, he said, “there are always going to be edge cases” where systems behave in unanticipated ways. And because humans may be further removed from the process, there are a lot of questions about how to handle those situations.
“When we’re deploying different [unmanned aerial systems] with each other, or with different [unmanned surface vehicles], when we have a [command and control] system that’s integrating some form of autonomy—How do all of these things actually work together now that we’re removing a lot of the operator role?” Wallin asked.
Wallin said he worries about DOD “getting bogged down in process when we’re developing these systems.” An AI-enabled administrative system that handles HR issues, for example, should be tested very differently, and against different criteria, than a weapons system.
“I’m concerned about not moving quickly, because we’re lumping everything together rather than looking very specifically at what are the things that actually make these systems different,” he said.