UPDATED 12:26 EST / JULY 28 2023

ChatGPT detectors still have trouble separating human and AI-generated texts

The growth of ChatGPT and other chatbots over the past year has also stimulated the growth of software that can detect whether a text most likely originated from these automated tools.

That market continues to evolve, but recent news has been mixed: Not all detector programs are accurate, and at least one has been discontinued.

Earlier this week, in fact, OpenAI LP, the company that brought ChatGPT to the world, quietly killed off its detector tool, called AI Classifier. The company said it did so because of the tool’s low accuracy rate. Decrypt discovered the announcement as an addendum to the original January post announcing the tool’s availability.

Among the biggest customers for these detector tools are educators looking to validate written assignments from their students. In the spirit of that market segment is a new comparative review of eight different tools, including the recently departed AI Classifier, and how they performed.

The report, entitled “Detecting LLM-Generated Text in Computing Education: A Comparative Study for ChatGPT Cases,” was written by four university professors teaching in Canada, Indonesia and Ecuador and shared as a preprint for an upcoming issue of the journal of the Association for Computing Machinery.

The review collected 124 submissions that were actually written by computer science students, along with 40 papers that were created by ChatGPT. The papers were drawn from typical computer science assignments, such as protocol implementation and database analysis.

None of the papers included any programming code snippets, to make them easier to analyze. The student-written papers were authored before 2018 to guarantee that they were legitimately human-created. Each paper was then evaluated by each of the tools in April, using current versions at the time.

Each tool was evaluated on three metrics: accuracy in identifying AI-generated text; false positives, that is, text flagged as AI-generated when it was actually written by a human; and resilience, meaning how well detection held up after the text was altered with paraphrasing tools (researchers used QuillBot) or by deliberate editing.
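To make the three metrics concrete, here is a minimal Python sketch of how they might be computed from a detector's verdicts. The `Verdict` record, the function names and the toy data are all hypothetical, and the resilience formula in particular is an assumed definition; the article does not say how the researchers calculated it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Verdict:
    """One detector decision on one paper (hypothetical record layout)."""
    is_ai: bool       # ground truth: True if an LLM generated the paper
    flagged_ai: bool  # detector output: True if it labeled the paper AI-generated


def ai_detection_accuracy(verdicts: List[Verdict]) -> float:
    """Fraction of AI-generated papers that the detector flagged as AI."""
    ai_papers = [v for v in verdicts if v.is_ai]
    return sum(v.flagged_ai for v in ai_papers) / len(ai_papers)


def false_positive_count(verdicts: List[Verdict]) -> int:
    """Number of human-written papers wrongly flagged as AI-generated."""
    return sum(1 for v in verdicts if not v.is_ai and v.flagged_ai)


def resilience(flags_before: List[bool], flags_after: List[bool]) -> float:
    """Assumed definition: of the AI papers flagged before paraphrasing,
    the fraction that are still flagged after paraphrasing."""
    still_flagged = [after for before, after in zip(flags_before, flags_after) if before]
    return sum(still_flagged) / len(still_flagged)


# Toy data, invented purely to exercise the functions.
sample = [
    Verdict(is_ai=True, flagged_ai=True),
    Verdict(is_ai=True, flagged_ai=False),
    Verdict(is_ai=False, flagged_ai=True),
    Verdict(is_ai=False, flagged_ai=False),
]
print(ai_detection_accuracy(sample))   # 0.5
print(false_positive_count(sample))    # 1
print(resilience([True, True, False], [True, False, False]))  # 0.5
```

Keeping false positives as a raw count rather than a rate mirrors how the article reports them.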

They found that CopyLeaks’ AI Content Detector was the most accurate overall of the eight tools, correctly identifying human text 99% of the time and AI-generated text 95% of the time. GPTKit was the only tool with no false positives, while the others ranged from a single false positive to 52.
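To put those false-positive counts in proportion, the short sketch below converts them to rates. It assumes, as the article describes, that every tool scored all 124 student-written papers; the function name is illustrative only.

```python
HUMAN_PAPERS = 124  # student-written submissions in the study, per the article


def false_positive_rate(false_positives: int, human_total: int = HUMAN_PAPERS) -> float:
    """Share of human-written papers wrongly flagged as AI-generated."""
    return false_positives / human_total


# Counts mentioned in the article: none, a single one, and the worst case of 52.
for count in (0, 1, 52):
    print(f"{count:>2} false positives -> {false_positive_rate(count):.1%}")
```

Under that assumption, 52 false positives means roughly 42% of the genuine student papers were wrongly flagged, while a single false positive is under 1%.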

Finally, Giant Language Test Room was the most resilient detector against the QuillBot paraphrasing. Below is a before-and-after screenshot of another detector tool showing how its evaluation changed:

One of the tools tested was GPTZero, which claims to be the first detector. It was among the poorest on all three metrics and was the tool that produced 52 false positives. Some of the other detectors combine results from multiple models: OpenAI’s AI Classifier uses 34 different models, for example. Some of the detectors, such as CopyLeaks, highlight both text written by humans and text written by AI, while others highlight just one or the other or compute various probabilities.

The team found that ChatGPT was far better at creating English than Spanish texts and less accurate when it generated computer code. “Modern detectors are still in need of improvements so that they can offer a [foolproof] solution to help maintain academic integrity,” the team wrote. “Further, their usability can be improved by facilitating a smooth API integration, providing clear documentation of their features and the understandability of their models, and supporting more commonly used languages besides English.” They also concluded that “the current state of LLM-generated text detectors suggests that they are not yet ready to be trusted blindly for academic integrity purposes.”

The team claims this is the first such independent review in this area, by which the professors seem to mean that they are transparent about how they created their metrics and how they assembled their samples.

A second academic team reported its results examining 14 different tools earlier this month in MIT Technology Review. To ensure the texts were original, these academics wrote 54 essays of their own in a variety of languages, including Bosnian, Czech, German, Latvian, Slovak, Spanish and Swedish. The non-English texts were then machine-translated into English.

They also paraphrased text both manually and with QuillBot. The 14 detector tools weren’t enumerated, other than mentioning GPTZero, Turnitin and Compilatio. The last two weren’t assessed by the first team.

This second team found the detector tools were good at identifying text written by a human with an average 96% accuracy, whatever that means. However, they fared more poorly when it came to spotting AI-generated text, especially when it had been edited.

Although the tools identified ChatGPT text with 74% accuracy — and again, that is hard to track, because how these numbers are calculated isn’t explained — this fell to 42% when the ChatGPT-generated text had been modified. No specific tool metrics were published, however, so it’s difficult to make any comparisons of the efficacies of the different tools.

The future of AI detector tools

So where does this leave the detector marketplace? Not in the best of circumstances. Students determined to pass off their work as human-generated just have to be clever about paraphrasing the AI-generated text or use a tool such as QuillBot to alter enough text to pass muster. Not all detectors are created equal, as these tests reveal, and we have already seen many instructors altering the circumstances by which students submit essays to prevent the more obvious cheating and guarantee that they actually wrote the works without any machine help.

Both academic research teams made specific mention of the false-positive problem, especially if the detectors are going to be used to detect academic plagiarism.

An instructor looking to evaluate students’ work faces practical issues as well, because many of the detectors don’t offer application programming interfaces, or they charge for access to them. This means papers have to be copied and individually uploaded, and that can be a chore, particularly if the detector can’t recognize PDF or other file formats.
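For a sense of what API access would save, here is a rough Python sketch of scoring a whole folder of plain-text submissions in one pass. The `detect` callable is a placeholder for whatever a given vendor might offer; it is not the interface of any tool named here, and the score it returns is assumed to be a probability that the text is AI-generated.

```python
from pathlib import Path
from typing import Callable, Dict


def score_folder(folder: Path, detect: Callable[[str], float]) -> Dict[str, float]:
    """Run a detector over every .txt file in a folder.

    `detect` is a hypothetical stand-in for a vendor API call that takes
    the text of a paper and returns a score between 0 and 1.
    """
    results: Dict[str, float] = {}
    for path in sorted(folder.glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        results[path.name] = detect(text)
    return results
```

An instructor would call `score_folder(Path("submissions"), detect)` once instead of pasting each paper into a web form. PDFs and other formats would still need a separate text-extraction step, which this sketch leaves out.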

The world is more than just academic essays, and AI-generated texts can be useful under appropriate human supervision, including checks that factual citations are correct. But the detector market has a long way to go before its results can be trusted.

Images: Pixabay, ACM
