Seeing Peer Ratings Pushes Professionals to Align Their Evaluations
Evaluation platforms are designed to harness the “wisdom of the crowd.” But new research from Yale SOM’s Tristan Botelho reveals that even savvy professionals tend to defer to the ratings given by their peers who evaluated before them.
Few people like to consider themselves followers, but when it comes to evaluating goods, services, and even our own colleagues, it turns out many of us are. That matters because collective evaluation processes—whether a restaurant’s ratings on Yelp or performance reviews at work—can play a significant role in sinking or launching a product, service, or someone’s career.
New research from Yale SOM’s Tristan Botelho shows how the design of evaluation processes affects outcomes by demonstrating that people who submit evaluations are influenced by the ratings they have already seen. “A lot of what we see in terms of evaluative outcomes is actually directly affected by the structure” of the evaluation process, Botelho says. “So these design choices actually have significant implications for the outcomes.”
People’s livelihoods and careers may be at stake in those ratings, he notes, especially for those who work on gig platforms such as Upwork or Uber. “Why this is so fascinating to me is because evaluation processes dictate most of the resources in our economy and society,” he says.
Botelho sought to understand how seeing prior evaluations affects subsequent ones—or as he puts it, what happens when the audience can become the evaluator. For his new study, he used data from an online platform where investment professionals, mostly people working at hedge funds and mutual funds, share investment recommendations and rate the quality of one another’s recommendations. That data allowed him to compare evaluations made by professionals who had seen prior ratings with evaluations made before any such ratings were visible. The platform also had the advantage of being made up of evaluators who all had a baseline level of professional expertise, since their day jobs essentially consist of evaluating investment recommendations.
Existing research offers several hypotheses on the circumstances under which evaluations tend to converge (or resemble prior evaluations): competitive threat, in which people follow their peers out of fear of losing a competitive advantage; reputation and status management, in which people use their evaluation as a sign that they are in alignment with their peers; and peer deference, in which an evaluator assumes their peer has expertise and defers to their judgment. The structure of the platform in Botelho’s data set helped eliminate at least one of those hypotheses and various alternatives: all of the participants are anonymous, ruling out the possibility that ratings might be influenced by reputation or status management.
The platform offered its participants the chance to rate any recommendation on a scale of 1 to 5 along one or both of two axes: the quality of the investment analysis (its justification rating) and its expected performance as a stock investment (its return rating). The first four ratings of either type were kept hidden from other users, but once the fourth one had been submitted, the average of all ratings was made public to all subsequent professionals.
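As a concrete illustration, here is a minimal Python sketch of that visibility rule, assuming a simple data model; the class, field, and threshold names are invented for illustration and are not the platform’s actual code.

```python
from statistics import mean

# Hypothetical sketch of the visibility rule described above: each
# recommendation collects 1-5 ratings on two axes, and the average on an
# axis stays hidden until the fourth rating arrives.
REVEAL_THRESHOLD = 4  # per the article: the average becomes public after 4 ratings

class Recommendation:
    def __init__(self):
        self.ratings = {"justification": [], "return": []}

    def add_rating(self, axis: str, score: int) -> None:
        if not 1 <= score <= 5:
            raise ValueError("ratings are on a 1-5 scale")
        self.ratings[axis].append(score)

    def visible_average(self, axis: str):
        """Return the public average, or None while ratings are still hidden."""
        scores = self.ratings[axis]
        if len(scores) < REVEAL_THRESHOLD:
            return None  # early ratings stay private
        return mean(scores)

rec = Recommendation()
for score in (4, 5, 3):
    rec.add_rating("return", score)
print(rec.visible_average("return"))  # None -- still hidden after 3 ratings
rec.add_rating("return", 4)
print(rec.visible_average("return"))  # 4.0 -- average now public
```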
Botelho first found that seeing an existing rating made a professional less likely to rate the investment recommendation themselves after reading it. For recommendations whose rating was not yet public, 10% of viewers chose to submit a rating, while just 4% did for recommendations whose rating was public.
Unlike users of the platform, Botelho could see the private ratings. Those, he found, tended to be distinct from one another. But once a recommendation had received four total ratings and its score became public, subsequent evaluations converged. Overall, evaluations made after the scores became visible were 54% to 63% closer to the prior average rating than evaluations made while the ratings were still hidden. Further, this convergence was immediate rather than a gradual shift over time.
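As a rough illustration of what “closer to the prior average” means, the toy sketch below (with made-up numbers) computes the average gap between each new rating and the running average of the ratings that preceded it; the study’s actual estimates come from a more careful statistical analysis, not this simple calculation.

```python
from statistics import mean

def distance_to_prior_average(ratings):
    """Mean absolute gap between each rating and the average of the ratings
    submitted before it (a crude stand-in for a convergence measure)."""
    gaps = [abs(r - mean(ratings[:i])) for i, r in enumerate(ratings) if i > 0]
    return mean(gaps)

# Invented numbers: ratings scatter while hidden, then cluster near the
# now-visible average afterward.
hidden_period = [2, 5, 3, 4]          # private ratings vary widely
public_period = [4, 3.5, 4, 3.5, 4]   # later ratings hug the public average

print(distance_to_prior_average(hidden_period))  # larger gap: little convergence
print(distance_to_prior_average(public_period))  # smaller gap: convergence
```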
This convergence occurred in a context that at first glance would seem to be relatively immune to peer pressure, Botelho notes.
“These are all professionals,” he explains. “Their day-to-day job is evaluating recommendations like this, so they shouldn’t really care what other people say, and they’re all anonymous, so whatever they say about someone else’s work product or recommendations has no bearing on them whatsoever. But yet we still see that just the mere visibility of what others have said affects not only whether they share their opinion, but, conditional on sharing their opinion, they’re more likely to follow the crowd’s evaluation.”
Might it be that the ratings reflected a professional consensus about the quality of the recommendations? That might make sense if highly rated recommendations do genuinely perform better in the stock market. But that explanation didn’t pan out. Botelho found that the initial private ratings were unrelated to the future performance of the stock that was recommended, demonstrating that a higher rating was not “just an accuracy story.”
Botelho fears that “the design choices by a lot of these platforms or organizations are creating a self-fulfilling prophecy.” “That’s problematic—it separates the haves and the have-nots,” he says. “The initial ratings in this context separate everything. Recommendations with a higher private rating received more attention from others, something every professional values.”
One factor that did counteract the herd instinct was subject-area expertise. When an evaluator had demonstrated expertise in the industry of the recommendation (for instance, someone with experience in healthcare evaluating a recommendation for a pharmaceutical company), they resisted convergence and provided ratings that diverged from the prior ones.
These findings point to ways that companies can adjust their evaluation processes to obtain more meaningful results. For example, a platform could institute an extended “quiet period,” not posting a public average until at least 50 people had privately submitted their ratings. Or a company vote on whether to introduce a new product could be sequenced to have more subject-area experts provide their input first.
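Here is a brief sketch of how those two adjustments could be parameterized, with an invented threshold and toy evaluator data rather than anything specified in the research:

```python
# Illustrative only: a longer "quiet period" before the average is revealed,
# and sequencing evaluators so subject-area experts weigh in first.
QUIET_PERIOD = 50  # hold the public average until 50 private ratings exist

def average_is_public(num_private_ratings: int) -> bool:
    return num_private_ratings >= QUIET_PERIOD

def sequence_evaluators(evaluators, topic):
    """Ask evaluators whose expertise matches the topic to rate before everyone else."""
    return sorted(evaluators, key=lambda e: topic not in e["expertise"])

panel = [
    {"name": "A", "expertise": {"energy"}},
    {"name": "B", "expertise": {"healthcare"}},
    {"name": "C", "expertise": {"tech", "healthcare"}},
]
print(average_is_public(12))  # False -- average still hidden
print([e["name"] for e in sequence_evaluators(panel, "healthcare")])  # ['B', 'C', 'A']
```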
“A lot of companies are trying to open the decision-making process, allowing employees to vote for things. I think these innovations hold a lot of promise,” says Botelho. “However, these findings show that these processes could introduce issues, if you’re not careful. Instead, how can we structure evaluation processes in a way that we can come to an unbiased answer?”