
AI Can Write a More Believable Restaurant Review Than a Human Can

Yale SOM’s Balázs Kovács used ChatGPT to write a series of Yelp-style reviews, collected real reviews from the site, and then asked human subjects to decide which were the real thing. They were more convinced of the authenticity of the AI-written reviews.

Like a lot of people, Balázs Kovács has come to rely on Yelp reviews when it comes to choosing a new restaurant.

“I don’t look at the numbers,” he says. “I read to connect with the experience. It’s more personable if someone writes about their experience. If they complain about having to wait 45 minutes for soup, I know what that means.”

A professor of organizational behavior at Yale SOM whose research interests include large language models and generative AI programs like OpenAI’s ChatGPT, Kovács knew that generative AI had become a lot more sophisticated over the past few years. Previous AI-generated text had read as simplistic and crude, but the most recent version of ChatGPT sounded almost human. And if AI text did sound human, he wondered, how easy would it be to create fake Yelp reviews?

Surprisingly easy, it turned out. In a series of experiments for a new study, Kovács found that a panel of human testers was unable to distinguish between reviews written by humans and those written by GPT-4, the LLM powering the latest iteration of ChatGPT. In fact, they were more confident about the authenticity of AI-written reviews than they were about human-written reviews.

The idea of testing the capacity of artificial intelligence by seeing how successfully a computer could mimic a human being goes all the way back to 1950, when Alan Turing proposed “the imitation game”—now known as the Turing test. For decades, it was a test that AI systems always failed. “In the past, it was so bad, we could always tell,” Kovács says. “We cannot tell anymore.”

Kovács began his study by collecting an assortment of reviews from a Yelp dataset, all written in 2019, before generative AI tools were widely available. He fed them to GPT-4 and asked the program to generate similar reviews of the same restaurants, of about the same length and in the same style, complete with human quirks like typos and all-caps for emphasis. He ended up with a set of 100 fake reviews.

Then he showed a random assortment of these reviews mixed in with a selection of 100 real reviews to a group of paid research study participants and asked them to identify which were written by humans and which were written by AI. As an incentive, he promised the participants a bonus payment if they correctly classified 16 of 20 reviews. Of the 151 participants, only 6 received the extra payment.
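For a sense of scale, it is easy to compute how often pure guessing would clear that bonus threshold. The calculation below is an illustrative back-of-the-envelope check, not part of the study itself: it assumes each guess is an independent coin flip (probability 0.5 per review) and sums the binomial tail for 16 or more correct out of 20.

```python
from math import comb

# Probability that a pure guesser (p = 0.5 per review) correctly
# classifies at least 16 of 20 reviews -- the study's bonus threshold.
p_guess = sum(comb(20, k) for k in range(16, 21)) / 2**20

# Expected number of bonus winners among 151 participants if all guessed.
expected_winners = 151 * p_guess

print(f"P(at least 16/20 by chance) = {p_guess:.4f}")      # ~0.0059
print(f"Expected winners by chance: {expected_winners:.2f}")  # ~0.89
```

Under pure guessing, fewer than one of the 151 participants would be expected to win the bonus; the 6 observed winners suggest participants did somewhat better than coin-flipping, but still overwhelmingly failed the task.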

In the second phase of the experiment, instead of asking the participants for a simple yes or no answer, he asked them to place their responses on a five-point scale, from “most likely human” to “most likely AI.” Overall, they were able to identify human-written reviews about half the time, roughly the same level of success as flipping a coin.

The responses to the AI-written reviews, though, were more interesting. The participants were able to identify them correctly only about a third of the time, and when they incorrectly labeled an AI-written review as human, they were more confident in that judgment than they were when rating genuinely human-written reviews. In other words, GPT-4 was able to produce restaurant reviews that read as more human than work by actual humans.

The results of these experiments were more or less what Kovács had expected, but he was surprised by how well GPT-4 had learned to write human language. “I wouldn’t have expected it to be more human than human,” he says. (The technical term for this is “AI hyperrealism.”)

Kovács is quick to note that an AI device that performs a human task better than a human is not always a bad thing. “In the case of the self-driving car,” he says, “a lot of people are not good drivers. If an AI drives a car perfectly and never makes a mistake, that’s what I want.”

An influx of AI-written restaurant reviews is not necessarily catastrophic. It doesn’t take AI to write a review of a restaurant you’ve never been to. And if people decide they’re no longer able to trust crowdsourced reviews, they’ll just return to seeking recommendations the way they did in the pre-Yelp era: by asking their friends or relying on professional critics. But, says Kovács, the study demonstrates the potential of AI-created content to wreak havoc in other domains. “People have to know it’s very scary. I can do these manipulations in an afternoon. Professionals can do an even better job than me. If people can’t tell anymore what’s written by an AI or a human, there will be more fake news than before.”

In February, right before the New Hampshire primary, a fake robocall imitating President Biden went out urging voters not to go to the polls; the call was later shown to be the work of an operative for the rival Democratic candidate Dean Phillips. And just a few weeks after Kovács published his paper, an investigation at a high school in Maryland showed that an employee had used AI software to fabricate a recording of what appeared to be the school’s principal delivering a racist, antisemitic rant. The recording led to calls for the principal to be fired. (After the New Hampshire incident, the FCC clarified that using AI-generated voices in robocalls is illegal, but there are still no laws on the books to prosecute the perpetrator of the Maryland incident; the best law enforcement could do was charge him with other crimes, like stalking and disrupting school activities.)

Fake Yelp reviews are only the tip of the iceberg, Kovács says. “There are fake images, fake movies,” he says. “People won’t be able to tell. There will be a lot of consequences. This is an election year. Nobody’s going to believe anything.”

Department: Research