Review aggregators: the trouble with replacing critics with consensus

Earlier this summer I went to a panel debate on the future of criticism in the arts. Two things came out of the discussion: firstly, that the art of criticism should be appreciated; and secondly, that the internet was causing ructions in the field. The effect of the internet wasn’t deemed to be all bad, but the professional critics had a particular disdain for one of the internet’s more revolutionary innovations: the review aggregator.

What they were talking about were popular outlets such as Rotten Tomatoes and Metacritic. These websites take the verdicts critics have given on a film and average them out to come up with a single standard rating for that screen offering. So in the past you would need to open that day’s paper to obtain just one critic’s review of a film, whereas today you can get an average score from lots of critics without reading a word of any review. That’s progress, right?

The more I thought about this, the more statistical questions I had about how you derive one complete score from a dataset consisting of various critics’ ramblings. This was made worse when the panel of critics at the debate admitted they were ambivalent about assigning a numerical score to something they had just written a considered and nuanced article about.

The problem is that the review aggregators rely on exactly this number, while the critics seemed to treat this aspect of their review as an afterthought demanded by their editor.

Take Rotten Tomatoes, which divides reviews into ‘good’ and ‘bad’, categories either submitted by the critics themselves or judged by the site’s staff after reading a review. The percentage of ‘good’ reviews is then calculated and displayed. To be fair, the site has taken the time to rigorously vet who can be an approved critic and who cannot, but a critic’s verdict is still distilled down into only two categories.

Rotten Tomatoes then assigns one of three labels to a film. On its ‘Tomatometer’, a film that achieves 60% or more ‘good’ reviews is considered ‘Fresh’; below 60%, it is considered ‘Rotten’. The third label is ‘Certified Fresh’, which means the film scored 75% or more from at least 40 critics, including five ‘top critics’ (whose scores are weighted accordingly).
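For the statistically minded, the labelling rules described above can be sketched in a few lines of code. This is a simplification based only on the thresholds mentioned here, not Rotten Tomatoes’ actual implementation; the function name and input format are my own invention.

```python
def tomatometer(reviews):
    """Sketch of the Tomatometer labelling rules.

    reviews: list of (is_good, is_top_critic) boolean pairs,
    one pair per critic's review.
    """
    total = len(reviews)
    good = sum(1 for is_good, _ in reviews if is_good)
    top_critics = sum(1 for _, is_top in reviews if is_top)

    score = 100.0 * good / total  # percentage of 'good' reviews

    # 'Certified Fresh': 75%+ from at least 40 critics, five of them top critics
    if score >= 75 and total >= 40 and top_critics >= 5:
        label = "Certified Fresh"
    elif score >= 60:  # 'Fresh' at 60% or above
        label = "Fresh"
    else:              # everything below 60% is 'Rotten'
        label = "Rotten"
    return score, label
```

Note how much information is thrown away before this calculation even starts: a four-star review and a grudging three-star review both enter as a single ‘good’.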

Metacritic attempts a more numerical approach. It converts each critic’s rating into a score on a scale of 0 to 100 and then averages these, but some critics (we don’t know which) are given more weight than others. It then displays the weighted average score and colour codes it (green for good, yellow for average and red for bad).
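A weighted average of this kind is straightforward to compute; what is opaque is the weights themselves. The sketch below uses made-up weights purely for illustration, and the colour-band cutoffs (61 for green, 40 for yellow) are assumptions on my part rather than figures from the article.

```python
def metascore(scores_and_weights):
    """Weighted average of critic scores, Metacritic-style.

    scores_and_weights: list of (score_0_to_100, weight) pairs.
    The weights are hypothetical -- Metacritic does not publish them.
    """
    total_weight = sum(w for _, w in scores_and_weights)
    avg = sum(s * w for s, w in scores_and_weights) / total_weight
    avg = round(avg)

    # Assumed colour bands: green for good, yellow for average, red for bad
    if avg >= 61:
        colour = "green"
    elif avg >= 40:
        colour = "yellow"
    else:
        colour = "red"
    return avg, colour
```

For example, a critic weighted twice as heavily can drag a film from one colour band into another even when the unweighted average would not, which is precisely the complaint aired at the panel.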

This system of weighting some critics and not others is a bone of contention. At the critics’ panel event, one member of the audience related how his low-budget film had managed to gain a theatrical release and garner some positive reviews. However, he was annoyed that the review aggregators displayed an inferior score for his film because it had not gone down well among their weighted critics.

So what is the difference between the scores that Rotten Tomatoes and Metacritic give? I took 20 recent releases and compared the ratings given by both websites to each film.

{mbox:significance/graphs/rev-agg-rating.jpg|width=630|height=406|caption=Click to enlarge|title=Ratings (on a scale of 0-100) given to 20 recent film releases}

The two websites give very different ratings to many of the films in the graph above; however, Rotten Tomatoes has a far larger sample size on which to base its scores.

{mbox:significance/graphs/rev-agg-sample-size.jpg|width=630|height=424|caption=Click to enlarge|title=Sample sizes from which average ratings are derived}

So while Metacritic tries to feed a more exact score from each review into its calculation, this could be a failing: it is trying to be precise with something that cannot be scaled accurately. That matters when the aggregator’s score is taken as an accurate measure of performance, as happened when a contract clause for the game Fallout: New Vegas stipulated that it had to achieve a Metacritic score of 85 or higher to trigger the release of royalties to the developer.

Of course, both websites also offer a separate user review section, which gets around both the sample size and the review score problems. There will always be far more ordinary film fans available to offer reviews, and each site’s online forms can standardise the survey questions, so nothing gets lost in translation.

However, users occupy the whole spectrum of cinematic tastes. It doesn’t matter how big the sample size is for the rating given to A Walk Among the Tombstones; it still doesn’t tell me whether the film will appeal to me if I have an affection for films where Liam Neeson acts all moody and beats up lots of baddies. Netflix has recognised this: its recommendation algorithm tries to learn your genre preferences rather than tell you what everyone liked.

Personally, I would not consult a review aggregator before deciding what to watch, but that’s probably because I’m a film snob. I am also a hypocrite, because when I recently wanted to buy a blender I went straight to the review section of Amazon to research it. The top seller had 201 reviews, 55% of which gave the product five stars. That was good enough for me, so I bought it.

But user-submitted reviews open up more problems, because the democracy of allowing anyone to post can skew the data, and Amazon is a good example of this. Fake reviews have become an increasing problem for book releases in a highly competitive marketplace, and then there are the fake reviews posted purely for comedy purposes, like this example at the expense of Haribo’s hapless gummy bears. The point is that Amazon holds lots of data that could start to filter this kind of stuff out, such as IP addresses, profile history and email addresses.

From films to fire alarms, the internet has made a vast amount of product reviews available, and these are now vitally important to a product’s success or failure. In one study, 70% of people said they trust online reviews as a source of consumer advice. So this problem isn’t going away, and consumers surely deserve the most accurate average review scores possible.

When search engine optimisers try to fool Google into pushing a website up the rankings, Google fights back by refining its algorithm to counter them. Review aggregators need to keep up too if they are to offer an accurate representation of the consensus view.