Office of Graduate Affairs

Geeking Out With…Joseph Getto


April 2, 2026

Abigail Arnold | Office of Graduate Affairs

Geeking Out With…is a feature in which we talk to graduate students about their passions. You can check out past installments here.

Joseph Getto is a second-year master’s student in the Conflict Resolution and Coexistence program. His research examines how artificial intelligence behaves differently when presented with information in different languages, with a particular focus on escalation dynamics. He joined Geeking Out With…to talk about his research into how war games can demonstrate just how aggressive or escalatory AI models, specifically large language models (LLMs), can be.

This interview has been edited for clarity.

How did you get interested in researching AI escalation dynamics? Did you have previous experience working with AI?

I had heard about some of the studies coming out about AI and escalation; I had also read an article in Foreign Policy magazine saying that if you ask an AI something in a different language, you might get a different result. I started reading studies on AI leading to nuclear escalation and wondered if that might be different in a different language. I chose to compare English and Mandarin because I wanted to use some publicly available war games, a lot of which explore conflict between the US and China in the Taiwan Strait. I modified an existing game for my research, which took the better part of a year.

While I had previously researched AI and its risks as a debate coach helping college teams debate the topic, that was a few years ago, and the risks have only grown since then. I had to teach myself all the skills I needed to set up the AI for this project.

How did you set up the AI models for your research?

I set them up on two computers, and each computer was assigned the role of one of the parties in the conflict (the US or China). I had documents in both English and Mandarin explaining the parties’ roles, what they’d be doing, and the background, which I copied and pasted into the LLMs. In general, the models could choose between major de-escalation, minor de-escalation, the status quo, minor escalation, major escalation, and nuclear escalation, with each turn of the game offering specific options in these categories. At the beginning of the first turn, an accident at sea has caused the deaths of some American soldiers, and others are being held by China, so there are seventy-five points of escalation on the board already. The participants can make both military and diplomatic choices.

I didn’t want the two computers communicating with each other, both because they could have learned from each other and because I wanted to simulate a situation in which each participant didn’t know what the other was doing. Instead, they would make choices simultaneously, and I would add up the points for the choices made, which would tell me what happened that turn. So if, for example, the computer representing China chose both military and diplomatic de-escalation but the computer representing the US chose military escalation and diplomatic de-escalation, there would still be some points added to the board due to the weighting of different choices. I would then consult the point table and pull out a new script for the next turn based on the result. The point values were based on some research but were not super scientific; escalations were weighted more heavily than de-escalations because it is easier to escalate than to de-escalate.
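For readers curious about the mechanics, here is a minimal sketch, in Python, of the turn-scoring logic Getto describes. The category names and point values are hypothetical illustrations, not the actual weights from his study; they are chosen only so that escalations outweigh de-escalations, as he notes above.

```python
# A minimal sketch of the turn-scoring logic described in the interview.
# All point values here are hypothetical, not the study's actual weights.

# Escalatory choices are weighted more heavily than de-escalatory ones,
# reflecting the idea that it is easier to escalate than to de-escalate.
POINTS = {
    "major_deescalation": -6,
    "minor_deescalation": -3,
    "status_quo": 0,
    "minor_escalation": 10,
    "major_escalation": 20,
    "nuclear_escalation": 100,
}

def score_turn(board, us_choices, china_choices):
    """Add both sides' simultaneous military and diplomatic choices
    to the running escalation total and return the new total."""
    for choice in (*us_choices, *china_choices):
        board += POINTS[choice]
    return board

# The scenario opens with seventy-five points already on the board.
board = 75

# The example from the interview: China de-escalates on both tracks,
# while the US escalates militarily but de-escalates diplomatically.
# The total still rises because escalations outweigh de-escalations.
board = score_turn(
    board,
    us_choices=("minor_escalation", "minor_deescalation"),
    china_choices=("minor_deescalation", "minor_deescalation"),
)
print(board)  # 75 + 10 - 3 - 3 - 3 = 76
```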

What have you found in your research?

Within these models, I found that English is by far a more escalatory language than Mandarin. Of the three AI models I tested, Claude is the most stable and least escalatory, causing a nuclear war or crisis 0% of the time. ChatGPT caused a nuclear war or crisis 40% of the time in English and 0% in Mandarin. Grok caused a nuclear war or crisis 86% of the time in English and still 73% of the time in Mandarin; this difference is not statistically significant overall, but there was a statistically significant difference between English and Mandarin within individual turns with Grok. Overall, the results aligned with my thesis.

A lot of papers claim that AI models engage in strategic reasoning and point to the war games they run as proof. When I looked at my results, this was not true at all: if the models were truly reasoning strategically, the language would not change the results. At first, it looked like the differences with Grok and Claude might not be as statistically significant as those with ChatGPT, but the differences were statistically significant within individual turns. As a single person, I could not do as many runs of the models as I would have liked; with more resources, I could perhaps establish further significance.
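To illustrate the kind of comparison Getto mentions, here is a small, hypothetical example in Python using Fisher’s exact test, which suits the modest run counts a single researcher can collect. The counts below are invented for illustration; the interview does not say how many runs were performed per condition.

```python
# A sketch of one way to test whether nuclear-outcome rates differ
# between languages. The run counts are hypothetical, not study data.
from scipy.stats import fisher_exact

# Rows: English runs, Mandarin runs.
# Columns: runs ending in nuclear war/crisis, runs that did not.
# With a hypothetical 15 runs per language at ChatGPT's reported
# rates (40% in English, 0% in Mandarin):
table = [
    [6, 9],   # English: 6 of 15 runs ended in a nuclear outcome
    [0, 15],  # Mandarin: 0 of 15 runs did
]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"p = {p_value:.4f}")  # a small p suggests a real language effect
```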

What were you surprised by in the research process?

Initially, I didn’t expect all of the challenges! I also never thought I’d be doing quantitative research, as I’m much more of a qualitative person. As we moved along, I realized how much the way things are written matters. The way I originally wrote the prompts sounded more negative about conflict, and I realized they were subtly moving the AI towards de-escalation. I had to rewrite everything to make sure it was more neutral and balanced the negatives and the positives, and then I ran the whole thing again. Since AI is more or less a predictive generator that produces tokens to guess what it assumes you’re going to say or what you want, it creates a dangerous feedback loop – if you’re a military leader wanting to do a strike, it will tell you it’s a great idea. And when it came to running the data, I wish I had taken more statistics classes as opposed to theory classes!

How has your perspective on AI evolved as you’ve worked on this research?

I was a little hesitant because I had never used AI until I started researching this. Now I am absolutely terrified of its military applications, but I have actually started using it more for polishing my cover letters or things like that since I tend to be a wordy person. I use it to condense and make things sound better.

Overall, I am deeply afraid of how these things are being used. Something that came up in my research is automation bias – the idea that we tend not to double-check things that come from computers and instead assume the information is good and the computer is smarter than we are. This is very dangerous, especially if AI is being used to advise militaries on what to do or how to pick targets. On a much smaller scale, since it’s been fifteen years since I took a statistics class, I asked AI to check my statistics, but it said that China de-escalated on every turn, which I knew was not right. It didn’t catch the error until I pointed it out.

When it comes to military applications, human-in-the-loop is a safeguard that requires a person to verify any strikes suggested by AI; it still has its problems, but it is better than the alternative. In my research, I found that ChatGPT was not consistent and would swing from complete de-escalation to major escalation from turn to turn; it did not make coherent sense or follow a pattern, but it still caused nuclear war 40% of the time. International relations theory tells us that when escalation is managed and predictable for the other side, the situation generally stays within confines and doesn’t spiral out of control. Randomness is what creates danger.

Are there others who have helped you in your research?

I’ve mostly worked alone, but I reached out to some professionals in the field. In particular, I started attending the wargaming lab at MIT, learning from them and gathering documents about how to build a war game so I could teach myself how to approach that. As a board game enthusiast, I had some idea of the process, but it was good to get help. I also approached researchers at RAND for information about nuclear escalation by AI. And I received a fellowship from the Alva Myrdal Centre for Nuclear Disarmament at Uppsala University, which was a huge help; I worked with Professor Sophia Hatz there, and she helped advise me.

When you’re not working on your research, what do you like to do?

I like to go out with friends and do trivia nights – I am a big trivia buff. I also like playing games like Civilization with friends online and going camping. I still judge debate and forensics for colleges and high schools. My partner and I moved here from Kansas City, so we are also working to meet new people.

What advice do you have for other students exploring their passions?

Don’t be afraid to talk to people. I wouldn’t have known about the fellowship or thought of this project idea if I hadn’t reached out to Dr. Hatz when I saw her post on LinkedIn about founding the AI section of the center. I reached out, expressed my interest in the topic, and asked about upcoming events, and she told me about the fellowship and suggested applying. I also applied to related events even though I didn’t know anyone there; they helped shape what I want to do and let me meet other people who are interested in the same topics. One of my mentors, Jennie Gromoll, is someone I met at one of these events: the summer school at the Odessa Center for Nonproliferation. She is fantastic, and I wouldn’t have met her if I hadn’t signed up!