Wednesday 11 November 2015

TANDQ 11: Rate This Post

In 2014-2015 I wrote an education column called "There Are No Dumb Questions" for the website "MuseHack". As that site has evolved, I have decided to republish those columns here (updating the index page as I go) every Wednesday. This eleventh column originally appeared on Thursday, January 29, 2015.

Are rating systems skewed?


If responses are voluntary, yes. If they’re not - the ratings are probably still skewed. Despite this fact, people will often check a product’s “rating” before making a purchase. Online reviewers (for movies, video games, etc.) will also tend to use some variant of a “star” system in their regular column/show. Perhaps you’ve even been asked to code up a rating system for someone else to use? Regrettably, while there is more than one type of rating scale out there, the problem of skew - which tends to push scores higher than reality - is pervasive. Let’s explore that further.

The first problem is one of averaging. If every review is given equal weight, we can end up with a situation like in this xkcd comic, where the most important review is lost in the noise. (“You had one job!” comes to mind - though of course that phrase itself isn’t accurate.) In the same vein, an item with 3 positive reviews out of 4 would get the same mean rating as an item with 75 positive reviews out of 100. But while the percentage is the same, the second item poses less risk to the consumer, because 96 more people have tried it out. There’s also the question of when these reviews were posted - are all the positive reviews recent, perhaps after an update? All of this is useful information, which becomes lost (or is perhaps presented but ignored) once a final average score is made available. That’s not to say that the problem has never been addressed - the reddit system, for instance, tackled the problem mathematically. Randall Munroe (of xkcd, see above) blogged about this back in 2009. But in general, the issue of weighted averaging is not something a typical consumer considers.
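If memory serves, the approach Munroe wrote about is the lower bound of the Wilson score confidence interval - rank items not by their raw percentage of positive reviews, but by how much confidence the sample size lets us place in that percentage. Here is a rough Python sketch (the function name and the 95% confidence level are my own choices, not taken from that post):

```python
import math

def wilson_lower_bound(positive, total, z=1.96):
    """Lower bound of the Wilson score interval for the true share of
    positive ratings (z = 1.96 is roughly a 95% confidence level)."""
    if total == 0:
        return 0.0
    phat = positive / total
    denom = 1 + z * z / total
    centre = phat + z * z / (2 * total)
    spread = z * math.sqrt((phat * (1 - phat) + z * z / (4 * total)) / total)
    return (centre - spread) / denom

# Both items are "75% positive", but the larger sample earns more trust.
print(round(wilson_lower_bound(3, 4), 2))     # ~0.30
print(round(wilson_lower_bound(75, 100), 2))  # ~0.66
```

Notice how the 3-out-of-4 item drops well below the 75-out-of-100 item, even though both sit at the same 75%.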

Even after all of that, there is a second problem. Who is writing these reviews? Everyone who made the purchase, or who saw the movie? Of course not. Generally, a high emotional response (either good or bad) is needed to motivate us to provide feedback. This means that a majority of responses will be either at the highest level (5) or the lowest (1). Does anyone reading this remember when YouTube had a five-star rating system? It has since become “I like this” (thumbs up) or “I dislike this” (thumbs down), because five years ago, YouTube determined that on their old system, “when it comes to ratings it’s pretty much all or nothing”. Now, given these polar opposite opinions, one might expect a typical “five star” graph to form a “U” shape, with a roughly equal number of high and low ratings, tapering down to nothing in the middle. Interestingly, that’s not the graph we get.
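To make that selection effect concrete, here is a toy simulation - the response probabilities are entirely my own invention, not YouTube’s data - where everyone forms an opinion but only the people with strong feelings bother to post a rating:

```python
import random
from collections import Counter

random.seed(0)

# Toy model (made-up numbers): opinions are spread evenly from 1 to 5,
# but only strong feelings are likely to prompt an actual rating.
true_opinions = [random.randint(1, 5) for _ in range(10_000)]
chance_of_rating = {1: 0.6, 2: 0.1, 3: 0.05, 4: 0.2, 5: 0.7}

posted = [o for o in true_opinions if random.random() < chance_of_rating[o]]

print(Counter(true_opinions))  # roughly flat across 1-5
print(Counter(posted))         # hollowed-out middle: mostly 1s and 5s
```

The opinions that exist and the opinions that get posted are two very different distributions.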


J Walking


The graph from the YouTube blog link above is typical, known to some as the “J-shaped distribution” or “J-curve” (not to be confused with the one in economics). It’s so named because there is an overwhelming number of “five star” reviews on the right, tapering back to almost nothing in the middle - with a small hook on the left, as the “one star” reviews nudge the curve back up slightly. Calculating the mean of a system like this, where both the mode and the median are equal to the maximum rating, will place the “average” somewhere in the 4’s. In fact, this column came about because of a tweet I saw questioning why an “average” review (3 out of 5) would be considered by some people to be “bad”. Setting aside the fact that some dislike being called “average”... if the J-curve predicts a mean higher than four, the three IS below that. Isn’t that “bad”?
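As a quick arithmetic check, here is a made-up set of J-curve counts (illustrative only, not from any real data set) - the mode and median are both the maximum, and the mean still lands in the 4’s:

```python
from statistics import mean, median, mode

# Illustrative J-curve counts for 1-5 stars (made-up numbers): lots of
# 5s, a small hook of 1s, and very little in between.
counts = {1: 10, 2: 3, 3: 5, 4: 17, 5: 65}
ratings = [star for star, n in counts.items() for _ in range(n)]

print(mode(ratings), median(ratings), mean(ratings))  # 5 5 4.24
```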


The trouble with comparisons is how useless they are until you acknowledge what it is you’re comparing yourself against. If you’re comparing a “3” against the rating scale, it’s average - even above average, if the scale runs 0-5, not 1-5! On the other hand, if you’re comparing a “3” against existing ratings for similar products, or against prior data for the same product, the “3” might seem less good... its origins may even be called into question. Which actually brings up a third problem, namely that a person may intend to rate something at a “3”... but upon logging in and seeing all the other people who have rated it higher, succumb to “peer pressure” and give it a “4” in the heat of the moment! And we haven’t even touched on the problem of illegitimate accounts, created solely to inflate (or lower) the average score of a product. (Of course, what you should probably be doing is comparing the “3” against other scores from that same reviewer - ideally their scores on your own previous outputs.)
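One crude way to make that last comparison - just a rough sketch of my own, with hypothetical data - is to centre each new rating on that reviewer’s own average:

```python
# Hypothetical rating histories (made-up data). Centring a new score on
# the reviewer's own average shows whether their "3" is praise or a pan.
history = {
    "easy_grader":  [5, 5, 4, 5],
    "tough_critic": [2, 3, 1, 2],
}

def relative_score(reviewer, new_rating):
    baseline = sum(history[reviewer]) / len(history[reviewer])
    return new_rating - baseline

print(relative_score("easy_grader", 3))   # -1.75: a bad sign from them
print(relative_score("tough_critic", 3))  # +1.0: high praise from them
```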

Now, is there a way we can fix this rating system problem? One solution might be to force every user/viewer to provide a review. If all the people with a “meh” opinion were forced to weigh in, it would fill in the middle of the J-curve. But should their input be given equal weight? After all, being forced to do something you don’t want to do is liable to either lower your satisfaction or cause you to make stuff up. (Though implementation is not impossible - for instance, AnimeMusicVideos.org requires you to rate a certain number of downloaded videos before you can download more.) Another solution might be to adjust the scale, as YouTube did (or, going the other way, IMDb uses 10 stars), but this merely tends to expand or compress the J-curve, rather than actually solving the underlying issue. In fact, I have not come across any foolproof rating system in my research - even the great critic Roger Ebert once said “I curse the Satanic force that dreamed up the four-star scale” in his post “You give out too many stars”. (I recommend reading that, as it also points out the problem of having a middle position.)

Which means it comes down to this: I don’t have a perfect solution. Much like Steven Savage and the issue of franchises from earlier this week, I’m just putting it out there. In particular, should we really trust the ratings we find online? Is an unbiased rating system impossible to achieve (short of reading minds) - but the skew something we can ultimately compensate for, the more we understand it? Then again, despite this being the age of social media where everyone’s weighing in, reviews might be better left in the hands of the professionals - those people who are paid to assign such ratings for a living. I dunno, do you have an opinion?

For further viewing:

1. A Statistical Analysis of 1.2 Million Amazon Reviews

2. The Problem With Online Ratings

3. Top Food Critics Tell All: The Star Rating System (3 min video)


Got an idea or a question for a future TANDQ column? Let me know in the comments, or through email!

3 comments:

  1. A lot of the same idea here from the crew at 538 http://fivethirtyeight.com/features/fandango-movies-ratings/
    http://fivethirtyeight.com/datalab/rating-subjective-experiences-is-hard-but-fandango-is-really-bad-at-it/

    Replies
    1. Thanks for that, David! That's pretty amazing... and not merely because Fandango is so far off the J-curve, but also because an analysis elsewhere implies that some of what I say has a broader relevance. :)

    2. Glad I could help corroborate your theory :-)
