When AI Learns the Why, It Becomes Smarter—and More Responsible
A new Yale SOM study shows that when generative AI is trained to understand why certain headlines resonate—not just which ones perform best—it avoids clickbait and produces more engaging, trustworthy content. The researchers say this hypothesis-driven approach could help AI generate new knowledge across fields while advancing more responsible AI design.
Which headline are you more likely to click on?
Headline A: “Stocks Plunge Amid Global Fears.”
Headline B: “Markets Decline Today.”
Online publications frequently test headline options like this in what’s called an A/B test. In this case, a publication shows headline A to half of its readers, headline B to the other half, then measures which receives more clicks.
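To make the comparison concrete, here is a minimal sketch of the arithmetic behind such a test; the click and impression counts are invented for illustration, and the two-proportion z-test shown is one common way to judge whether the difference in click-through rates is real, not necessarily the procedure Upworthy uses.

```python
# Minimal A/B headline test sketch (hypothetical counts, illustrative only).
from statistics import NormalDist

def ab_test(clicks_a, views_a, clicks_b, views_b):
    """Compare two headlines' click-through rates with a two-proportion z-test."""
    ctr_a, ctr_b = clicks_a / views_a, clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = (pooled * (1 - pooled) * (1 / views_a + 1 / views_b)) ** 0.5
    z = (ctr_a - ctr_b) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    return ctr_a, ctr_b, p

# Headline A vs. headline B, each shown to half of the readers
ctr_a, ctr_b, p = ab_test(clicks_a=520, views_a=10_000, clicks_b=410, views_b=10_000)
print(f"A: {ctr_a:.1%}  B: {ctr_b:.1%}  p = {p:.3f}")
```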
Marketers have long used A/B tests to determine what drives engagement. Generative AI is now positioned to accelerate the process, automating the tests and iterating rapidly on headlines—or any other content—to optimize outcomes like click-through rates. But sometimes, according to Yale SOM’s Tong Wang and K. Sudhir, simply knowing what works and shaping content accordingly leads to bad outcomes.
“After fine-tuning an LLM—such as GPT-5—on A/B test data, it may conclude that the winning strategy is simply to use words like ‘shocking’ as often as possible, essentially producing clickbait,” Sudhir says. “The model is exploiting superficial correlations in the data. Our idea was: if the AI can develop a deeper understanding of why things work—not just what works—would that knowledge help it avoid these shallow patterns and instead generate content that is more robust and meaningful?”
Wang and Sudhir, working with pre-doctoral research associate Hengguang Zhou, used an LLM designed to generate competing hypotheses about why one headline is more engaging than another. The model then tested these hypotheses against the full dataset to see which ones generalized broadly. Through repeated rounds of this process, the LLM converged on a small set of validated hypotheses grounded not in superficial correlations but in deeper behavioral principles.
This method mirrors how researchers develop knowledge: starting with abduction, where a small set of observations sparks potential explanations, and then moving to induction, where those explanations are tested on a broader sample to see which ones hold. The team believed that this knowledge-guided approach would allow the LLM to boost engagement without tricking readers—teaching it to write headlines people click on because they are genuinely interesting and relevant, not because they rely on superficial clickbait cues.
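In code, that abduction-induction cycle might look roughly like the loop below. The names `propose_hypotheses` and `test_hypothesis` are placeholders for the LLM prompting and validation steps the researchers describe, not actual APIs, and the sample size, number of rounds, and threshold are illustrative.

```python
# Illustrative sketch of the abduction-induction loop; propose_hypotheses and
# test_hypothesis stand in for LLM prompting and validation steps, not real APIs.
import random
from typing import Callable, Sequence

def refine_hypotheses(
    ab_results: Sequence[tuple[str, str, float]],        # (winner, loser, CTR lift) per A/B test
    propose_hypotheses: Callable[[list], list[str]],      # abduction: explain a small sample
    test_hypothesis: Callable[[str, Sequence], float],    # induction: how broadly it generalizes
    rounds: int = 5,
    sample_size: int = 20,
    keep_threshold: float = 0.6,
) -> list[str]:
    """Repeatedly propose explanations from small samples and keep those that hold broadly."""
    validated: list[str] = []
    for _ in range(rounds):
        sample = random.sample(list(ab_results), min(sample_size, len(ab_results)))
        for hypothesis in propose_hypotheses(sample):          # candidate "why" explanations
            score = test_hypothesis(hypothesis, ab_results)    # check against the full dataset
            if score >= keep_threshold and hypothesis not in validated:
                validated.append(hypothesis)
    return validated
```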
For their new study, they set out to test and refine this approach. They started with 23,000 headlines, covering 4,500 articles, from the online media brand Upworthy, which focuses on positive stories. The publication had already run A/B tests on all of these headlines, so the researchers knew which headlines had induced more readers to click through.
The team began by giving the LLM various subsets of articles and their associated headlines, along with their click-through rates. Using this information, the model generated a set of hypotheses about why one headline might be more compelling than another. After forming these hypotheses, the researchers asked the LLM to generate new headlines for a larger sample of articles, systematically varying the hypotheses used. They then evaluated the quality of each generated headline with a pre-trained scoring model built on Upworthy’s A/B-test results.
This process allowed the team to identify the combination of hypotheses—or the “knowledge”—that consistently improved headline quality. Once this knowledge was extracted, they fine-tuned the LLM to write headlines that maximize click-through rates while being guided by the validated hypotheses. In other words, the model learned not only to optimize for engagement, but to do so for the right underlying reasons.
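A rough sketch of that selection step, assuming `generate_headline` wraps the hypothesis-guided LLM and `score_headline` wraps the pre-trained scoring model (both hypothetical names standing in for the researchers' components), might look like this:

```python
# Illustrative sketch of hypothesis selection; generate_headline and score_headline
# are placeholders for the hypothesis-guided LLM and the pretrained CTR scorer.
from itertools import combinations
from statistics import mean
from typing import Callable, Sequence

def best_hypothesis_set(
    articles: Sequence[str],
    hypotheses: Sequence[str],
    generate_headline: Callable[[str, Sequence[str]], str],   # LLM guided by a hypothesis set
    score_headline: Callable[[str, str], float],               # scorer trained on A/B-test results
    max_size: int = 3,
) -> tuple[tuple[str, ...], float]:
    """Systematically vary which hypotheses guide generation; keep the best-scoring set."""
    best, best_score = (), float("-inf")
    for k in range(1, max_size + 1):
        for combo in combinations(hypotheses, k):
            avg = mean(score_headline(a, generate_headline(a, combo)) for a in articles)
            if avg > best_score:
                best, best_score = combo, avg
    return best, best_score
```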
“A headline should be interesting enough to make people curious, but it should be interesting for the right reasons – something deeper than just using clickbait words to trick users into clicking,” Wang says. “The problem with the standard approach of fine-tuning an AI model is that it focuses narrowly on improving a metric, which can lead to deceptive headlines that ultimately disappoint or even annoy readers. Our point is that when an LLM understands why certain content is more engaging, it becomes much more likely to generate headlines that are genuinely better, not just superficially optimized.”
The researchers tested the results of their model with about 150 people recruited to judge the quality of headlines from three sources: the original Upworthy headlines (written by people), headlines generated by standard AI, and headlines generated by the new framework. They found that the human-written and standard AI headlines performed about equally well, each chosen as the best roughly 30% of the time; headlines from the new model were chosen as the best 44% of the time.
When asked about their choices, many participants noted that the standard AI model created “catchy” headlines that evoked curiosity but resembled clickbait, which made them wary. An analysis of the language used in the headlines—comparing word choice from the standard AI with that of the new model—corroborated this skepticism, revealing that the standard model did, in fact, rely much more heavily on sensational language.
“Importantly, the potential for this work is not simply about realizing better content generation,” Wang says. What makes it even more consequential is how the content generation was improved: by teaching an LLM to generate its own hypotheses. “The fact that this can propose hypotheses from a small set of data allows it to generate new theories and, ideally, improve our understanding of the world.”
Sudhir points to ongoing work with a company to develop personalized AI coaching for customer service agents. If some interactions lead to better outcomes than others, the new framework could be used to review scripts from customer interactions and generate hypotheses about why one approach is superior; after those hypotheses are validated, that knowledge could be used to offer agents personalized advice on how to do better.
“In many social science problems, there is not a well-defined body of knowledge,” Sudhir says. “We now have an approach that can help discover it.” The input data needn’t be textual, either; it could be audio or visual. “In a larger sense, this is not just about better headlines—it’s about accelerating knowledge generation. As it turns out, knowledge-guided AI is also more responsible and trustworthy.”