Limitations of AI Detection Tools

Studies have demonstrated that AI-generated text is not always easy to identify, even for experienced faculty, and that AI detection tools are unreliable and can be biased against non-native English speakers and students who are underrepresented in higher education. Below is a collection of articles and studies you can use to understand the limitations of AI detection tools.

  • AI detection tools have high false positive rates and are easy to evade. "Detecting AI may be impossible. That’s a big problem for teachers." by Geoffrey A. Fowler (Washington Post, June 2, 2023), which cites a related preprint written from a computer science perspective.
    This article discusses Turnitin, an education software company, which claims that its AI detector has a false positive rate of 1%, meaning it mistakenly flags about 1 in 100 human-written documents as AI-generated; when the author tested the tool himself, the rate was even higher. The author argues that this false positive rate is high enough to get students falsely accused of cheating, and that the tool is easy to evade. As LLMs improve, distinguishing human writing from AI writing only gets harder. (A back-of-the-envelope sketch of this false positive arithmetic appears after this list.)
  • "We tested a new ChatGPT-detector for teachers. It flagged an innocent student." - Washington Post Article, Geoffrey A. Fowler (April 3, 2023)
    This article details the author's testing of the ChatGPT detector that Turnitin rolled out to 2.1 million teachers; the testing found the detector to be inaccurate, including falsely flagging an innocent student's work as AI-generated.
  • "Who Wrote This? Detecting Artificial Intelligence–Generated Text from Human-Written Text" - Brock University Article (Hosted by University of Calgary), Rahul Kumar & Michael Mindzak (2024)
    This study presents a small experiment based on 135 responses from participants including faculty members and graduate and undergraduate students. The results show that human participants recognized AI-generated text at a rate of just over 24% (true positive rate). Given that AI detection tools are generally not reliable either, the study raises the question of the way forward: developing better strategies for detecting plagiarism in education, or designing new assessment assignments that are AI-proof.
  • "Testing of Detection Tools for AI-Generated Text" - International Journal for Educational Integrity, Debora Weber-Wulff et al. (December 25, 2023)
    ABSTRACT: Recent advances in generative pre-trained transformer large language models have emphasized the potential risks of unfair use of artificial intelligence (AI) generated content in an academic environment and intensified efforts in searching for solutions to detect such content. The paper examines the general functionality of detection tools for AI-generated text and evaluates them based on accuracy and error type analysis. Specifically, the study seeks to answer research questions about whether existing detection tools can reliably differentiate between human-written text and ChatGPT-generated text, and whether machine translation and content obfuscation techniques affect the detection of AI-generated text. The research covers 12 publicly available tools and two commercial systems (Turnitin and PlagiarismCheck) that are widely used in the academic setting. The researchers conclude that the available detection tools are neither accurate nor reliable and have a main bias towards classifying the output as human-written rather than detecting AI-generated text. Furthermore, content obfuscation techniques significantly worsen the performance of tools. The study makes several significant contributions. First, it summarizes up-to-date similar scientific and non-scientific efforts in the field. Second, it presents the result of one of the most comprehensive tests conducted so far, based on a rigorous research methodology, an original document set, and a broad coverage of tools. Third, it discusses the implications and drawbacks of using detection tools for AI-generated text in academic settings.
  • "Black Students Are More Likely to Be Falsely Accused of Using AI to Cheat" - Education Week Article, A. Klein (September 18, 2024)
    Overall, about 10 percent of teens of any background said they had their work inaccurately identified as generated by an AI tool, Common Sense found. But 20 percent of Black teens were falsely accused of using AI to complete an assignment, compared with 7 percent of white and 10 percent of Latino teens.
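
The false positive concern running through several of these pieces is ultimately a base-rate problem: even a detector with a seemingly low false positive rate will wrongly flag a large number of students once it screens enough mostly human-written work. Below is a minimal Python sketch of that arithmetic (referenced in the first bullet above); the submission count, the share of AI-written essays, and the detector rates are all hypothetical illustrations, not figures from the articles or studies listed here.

    # Minimal sketch of the base-rate arithmetic behind detector false
    # positives. All numbers are hypothetical, chosen only for illustration;
    # they are not figures from the articles or studies listed above.

    def flagging_outcomes(n_submissions, ai_share, fpr, tpr):
        """Expected outcomes when a detector screens a pool of submissions.

        n_submissions: total essays screened
        ai_share:      fraction of essays that are actually AI-generated
        fpr:           false positive rate (human work flagged as AI)
        tpr:           true positive rate (AI work correctly flagged)
        """
        human_essays = n_submissions * (1 - ai_share)
        ai_essays = n_submissions * ai_share
        false_accusations = human_essays * fpr  # innocent students flagged
        true_detections = ai_essays * tpr       # AI essays actually caught
        flagged = false_accusations + true_detections
        # Of everything the detector flags, what fraction is really AI?
        precision = true_detections / flagged if flagged else 0.0
        return false_accusations, true_detections, precision

    # Hypothetical scenario: 10,000 essays, 5% actually AI-written, and a
    # detector with a 1% false positive rate and a 90% true positive rate.
    fa, td, prec = flagging_outcomes(10_000, 0.05, fpr=0.01, tpr=0.90)
    print(f"False accusations: {fa:.0f}")   # 95 innocent students flagged
    print(f"True detections:   {td:.0f}")   # 450 AI essays caught
    print(f"Flag precision:    {prec:.0%}") # ~83% of flags are correct

Even in this optimistic scenario, roughly one in six flagged essays belongs to an innocent student, and the number of false accusations grows linearly with the number of essays screened, which is the core objection raised in the articles above.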