Balázs Kovács, a professor of organizational behavior at Yale SOM, and his colleague Gaël Le Mens, a professor of economics and business at Universitat Pompeu Fabra in Barcelona, share an interest in studying typicality—that is, how well something, be it a restaurant, a mystery novel, or a social media post, conforms to the general opinion of what that sort of thing should be.
To determine that abstract ideal, they spent three years training a natural-language processing model called BERT to recognize and evaluate “typicality,” an ideal determined by aggregating the opinions of as many people as possible. They fed it thousands of book descriptions and tweets and, with a team of assistants hired on the internet who manually tagged each item, spent hours fine-tuning the model.
In the fall of 2022, they finally had enough data for a paper, which was accepted by the journal Sociological Science. But between the time the paper was accepted and when it was printed, a new piece of AI software called ChatGPT was released to the general public. ChatGPT had absorbed a huge dataset of publicly available information—including millions of books and, probably, tweets—and could automatically simulate human responses, including answers to the question of whether something was typical of its genre. It didn’t need to be trained with additional data tagged by research assistants.
In other words, three years of work could now be duplicated instantaneously.
“You could cry,” says Le Mens.
But instead the researchers decided to investigate whether ChatGPT, with no additional training, performed as well as BERT fine-tuned on the data tagged by research assistants. If so, ChatGPT could be a valuable tool in their work and in the work of other social scientists whose research relies on coding large amounts of text. They recently published their results in a new paper in the journal Proceedings of the National Academy of Sciences.
“This came at a good time,” Kovács says. “ChatGPT is the right tool for questions we’ve been asking for a long time.” And because they’d been thinking about typicality ratings for a long time, they knew exactly how to use it.
Their experiment was fairly simple. They gave ChatGPT descriptions of 1,000 mystery novels, and asked it to rate how typical each description was of the genre on a scale of 0 to 100, with 100 being the most typical. Then they compared ChatGPT’s typicality ratings with ratings previously produced by a team of human judges recruited via a crowdworking platform. They also did the same with descriptions of romance novels and then with tweets by Republican and Democratic members of Congress, which were rated based on how typical they were of their respective parties.
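The protocol described above, asking a model for a 0-to-100 typicality rating and then checking its agreement with human judges, can be sketched roughly in Python. The prompt wording, function names, and toy ratings below are illustrative assumptions, not the authors' actual materials; in the study itself, the model ratings would come from ChatGPT rather than hard-coded numbers.

```python
# Illustrative sketch of the rating-and-comparison protocol.
# Prompt text and sample ratings are invented for demonstration.

def build_prompt(description: str) -> str:
    """Ask the model for a 0-100 typicality rating of a book description."""
    return (
        "On a scale of 0 to 100, where 100 is the most typical, rate how "
        "typical the following book description is of the mystery genre. "
        "Reply with a single number.\n\n" + description
    )

def pearson(xs: list[float], ys: list[float]) -> float:
    """Correlation between model ratings and human reference ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy numbers standing in for ChatGPT's ratings and the crowdworkers'
# ratings of the same 5 descriptions; the real study used 1,000.
model_ratings = [85, 40, 10, 70, 95]
human_ratings = [80, 35, 20, 75, 90]
print(round(pearson(model_ratings, human_ratings), 2))  # prints 0.98
```

A high correlation on data like this is the kind of evidence the researchers used to conclude that the untrained model tracks aggregated human judgment.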
They discovered that ChatGPT, even without training, performed as well as, or even better than, BERT when it came to determining the typicality of book descriptions and tweets. The implications for future research were enormous.
“It’s good and bad,” Le Mens says. “It’s good because people with less research money and who are at smaller colleges can do research and process text in a way that’s more cost efficient. They also don’t need a computer science background. It makes this type of research less elitist.”
For the political tweet study, for example, Kovács and Le Mens spent $3,000 hiring people to read each tweet and determine its typicality. Even using a paid version of ChatGPT, the cost of examining the same data with the tool was $4.
The downside is that researchers may not always understand that, like any other tool, ChatGPT isn’t applicable to every research project. With Kovács and Le Mens’s project, it worked well because determining typicality depends on aggregating opinions from many different people, but that might not be the case with every experiment.
“It’s a big leap from one setting to another,” Le Mens warns. “One should be happy, but one should not be overenthusiastic. If you use the tool without making sure it’s accurately mimicking what it’s supposed to measure—that is, if you don’t use human raters to test what the model is doing—you could find that the model is very biased in some ways.” Another question is whether ChatGPT will ever be able to explain its ratings the way a human can.
He and Kovács are excited to see what the program can do next. They’re already planning follow-up papers comparing European political party manifestos with government policies and promises made in corporate press releases with what the companies actually do.
They’re also looking to see if ChatGPT can take different perspectives into account—for example, what an American would consider a typical Italian restaurant as opposed to what an Italian would. ChatGPT can also analyze pictures and audio, and Kovács is looking forward to following up on previous work he’s done on the music industry.
“Everyone is talking about AI,” he says. “A lot of tasks become much faster and cheaper. In this one small example, AI seems to perform as well as humans. However, you don’t know what the future brings.”