What is a better way to sort by a 5 star rating?

sorting statistics user-experience rating bayesian

23,113

Solution 1

Prior to 2015, the Internet Movie Database (IMDb) publicly listed the formula used to rank their Top 250 movies list. To quote:

The formula for calculating the Top Rated 250 Titles gives a true Bayesian estimate:
weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C
where:

R = average for the movie (mean)

v = number of votes for the movie

m = minimum votes required to be listed in the Top 250 (currently 25000)

C = the mean vote across the whole report (currently 7.0)

For the Top 250, only votes from regular voters are considered.

It's not so hard to understand. The formula is:

rating = (v / (v + m)) * R +
         (m / (v + m)) * C;

Which can be mathematically simplified to:

rating = (R * v + C * m) / (v + m);

The variables are:

R – The item's own rating. R is the average of the item's votes. (For example, if an item has no votes, its R is 0. If someone gives it 5 stars, R becomes 5. If someone else gives it 1 star, R becomes 3, the average of [1, 5]. And so on.)
C – The average item's rating. Find the R of every single item in the database, including the current one, and take the average of them; that is C. (Suppose there are 4 items in the database, and their ratings are [2, 3, 5, 5]. C is 3.75, the average of those numbers.)
v – The number of votes for an item. (To give another example, if 5 people have cast votes on an item, v is 5.)
m – The tuneable parameter. The amount of "smoothing" applied to the rating is based on the number of votes (v) in relation to m. Adjust m until the results satisfy you. And don't misinterpret IMDb's description of m as "minimum votes required to be listed" – this system is perfectly capable of ranking items with fewer votes than m.

All the formula does is add m imaginary votes, each with a value of C, before calculating the average. In the beginning, when there isn't enough data (i.e. the number of votes is dramatically less than m), this causes the blanks to be filled in with average data. However, as votes accumulate, the imaginary votes are eventually drowned out by real ones.

In this system, votes don't cause the rating to fluctuate wildly. Instead, they merely perturb it a bit in some direction.

When there are zero votes, only imaginary votes exist, and all of them are C. Thus, each item begins with a rating of C.
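
As a rough illustration of the formula above, here is a minimal Python sketch applied to the two products from the question. The function name `bayesian_average`, the value C = 3.75 (borrowed from the example in the variable list) and the choice m = 5 are assumptions made for the example, not values recommended by IMDb.

```python
def bayesian_average(R, v, C, m):
    """Return (R*v + C*m) / (v + m): the item's mean after adding m imaginary votes of value C."""
    return (R * v + C * m) / (v + m)

C = 3.75  # assumed average rating across all items
m = 5     # assumed smoothing parameter; tune to taste

# Hypothetical items: name -> (average rating R, number of votes v)
items = {
    "3x 5-star": (5.0, 3),
    "100x 5-star, 2x 2-star": ((100 * 5 + 2 * 2) / 102, 102),
    "no votes yet": (0.0, 0),
}

scores = {name: bayesian_average(R, v, C, m) for name, (R, v) in items.items()}
for name in sorted(scores, key=scores.get, reverse=True):
    print(f"{name}: {scores[name]:.2f}")
# 100x 5-star, 2x 2-star: 4.89
# 3x 5-star: 4.22
# no votes yet: 3.75
```

With these assumed values the product with 102 votes outranks the one with three perfect votes, and the unrated item sits exactly at C.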

Solution 2

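Evan Miller shows a Bayesian approach to ranking 5-star ratings. The sort criterion is

$$S(n_1, \dots, n_K) = \sum_{k=1}^{K} s_k \frac{n_k + 1}{N + K} - z_{\alpha/2} \sqrt{\left( \sum_{k=1}^{K} s_k^2 \frac{n_k + 1}{N + K} - \left( \sum_{k=1}^{K} s_k \frac{n_k + 1}{N + K} \right)^2 \right) \Big/ (N + K + 1)}$$

where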

n_k is the number of k-star ratings,
s_k is the "worth" (in points) of k stars,
N is the total number of votes,
K is the maximum number of stars (e.g. K=5, in a 5-star rating system),
z_alpha/2 is the 1 - alpha/2 quantile of a normal distribution. If you want 95% confidence (based on the Bayesian posterior distribution) that the actual sort criterion is at least as big as the computed sort criterion, choose z_alpha/2 = 1.65.

In Python, the sorting criterion can be calculated with

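import math

def starsort(ns):
    """
    Lower bound on the expected star rating, after Evan Miller:
    http://www.evanmiller.org/ranking-items-with-star-ratings.html

    ns lists the vote counts from the highest star value down to 1 star.
    """
    N = sum(ns)                  # total number of votes
    K = len(ns)                  # number of star levels
    s = list(range(K,0,-1))      # worth of each level: K, K-1, ..., 1
    s2 = [sk**2 for sk in s]     # squared worths, for the second moment
    z = 1.65                     # normal quantile z_alpha/2

    def f(s, ns):
        # Weighted mean of the worths, with one pseudo-vote added per level
        return sum(sk*(nk+1) for sk, nk in zip(s,ns)) / (N+K)

    fsns = f(s, ns)
    # Mean minus z standard deviations
    return fsns - z*math.sqrt((f(s2, ns)- fsns**2)/(N+K+1))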

For example, if an item has 60 five-stars, 80 four-stars, 75 three-stars, 20 two-stars and 25 one-stars, then its overall star rating would be about 3.4:

x = (60, 80, 75, 20, 25)
starsort(x)
# 3.3686975120774694

and you can sort a list of 5-star ratings with

sorted([(60, 80, 75, 20, 25), (10,0,0,0,0), (5,0,0,0,0)], key=starsort, reverse=True)
# [(10, 0, 0, 0, 0), (60, 80, 75, 20, 25), (5, 0, 0, 0, 0)]

This shows the effect that more ratings can have upon the overall star value.

You'll find that this formula tends to give an overall rating which is a bit lower than the overall rating reported by sites such as Amazon, Ebay or Wal-mart, particularly when there are few votes (say, fewer than 300). This reflects the higher uncertainty that comes with fewer votes. As the number of votes increases (into the thousands), all of these overall rating formulas should tend to the (weighted) average rating.

Since the formula only depends on the frequency distribution of 5-star ratings for the item itself, it is easy to combine reviews from multiple sources (or, update the overall rating in light of new votes) by simply adding the frequency distributions together.

Unlike the IMDb formula, this formula does not depend on the average score across all items, nor an artificial minimum number of votes cutoff value.

Moreover, this formula makes use of the full frequency distribution -- not just the average number of stars and the number of votes. And it makes sense that it should since an item with ten 5-stars and ten 1-stars should be treated as having more uncertainty than (and therefore not rated as highly as) an item with twenty 3-star ratings:

In [78]: starsort((10,0,0,0,10))
Out[78]: 2.386028063783418

In [79]: starsort((0,0,20,0,0))
Out[79]: 2.795342687927806

The IMDb formula does not take this into account.
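
To make that concrete, here is a short sketch that feeds the same two distributions through the IMDb-style weighted rating. The values C = 3.75 and m = 5 and the helper names are assumptions chosen only for the example.

```python
def bayesian_average(R, v, C, m):
    """IMDb-style weighted rating: (R*v + C*m) / (v + m)."""
    return (R * v + C * m) / (v + m)

def mean_and_votes(ns):
    """ns = counts from 5 stars down to 1 star; returns (mean rating, vote count)."""
    stars = range(len(ns), 0, -1)
    v = sum(ns)
    return sum(s * n for s, n in zip(stars, ns)) / v, v

C, m = 3.75, 5  # assumed global mean and smoothing parameter

polarized = mean_and_votes((10, 0, 0, 0, 10))  # ten 5-stars, ten 1-stars
consensus = mean_and_votes((0, 0, 20, 0, 0))   # twenty 3-stars

print(bayesian_average(*polarized, C, m))  # 3.15
print(bayesian_average(*consensus, C, m))  # 3.15
```

Both items have a mean of 3 from 20 votes, so the weighted rating cannot tell them apart, whatever values of C and m are chosen.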

Solution 3

See this page for a good analysis of star-based rating systems, and this one for a good analysis of upvote-/downvote- based systems.

For up and down voting you want to estimate the probability that, given the ratings you have, the "real" score (if you had infinite ratings) is greater than some quantity (like, say, the similar number for some other item you're sorting against).

See the second article for the answer, but the conclusion is you want to use the lower bound of the Wilson score confidence interval. The article gives the equation and sample Ruby code (easily translated to another language).
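
For reference, a minimal Python sketch of the lower bound of the Wilson score interval for up/down votes. The function name and the default z = 1.96 (roughly 95% two-sided confidence) are choices made for this example, not taken from the article's code.

```python
import math

def wilson_lower_bound(upvotes, total, z=1.96):
    """Lower bound of the Wilson score interval for the fraction of positive votes."""
    if total == 0:
        return 0.0  # assumption: unrated items sort last
    p = upvotes / total                      # observed fraction of positive votes
    z2 = z * z
    centre = p + z2 / (2 * total)
    margin = z * math.sqrt((p * (1 - p) + z2 / (4 * total)) / total)
    return (centre - margin) / (1 + z2 / total)

print(round(wilson_lower_bound(3, 3), 3))      # 0.438
print(round(wilson_lower_bound(100, 102), 3))  # 0.931
```

Three positive votes out of three score well below 100 out of 102, because the small sample leaves much more room for the "real" fraction to be lower.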

Solution 4

You could sort by median instead of arithmetic mean. In this case both examples have a median of 5, so both would have the same weight in a sorting algorithm.

You could use a mode to the same effect, but median is probably a better idea.

If you want to assign additional weight to the product with 100 5-star ratings, you'll probably want to go with some kind of weighted mode, assigning more weight to ratings with the same median, but with more overall votes.
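
One way to sketch that idea in Python is to sort by median first and break ties by the number of votes. The product names and the tie-breaking rule are assumptions for the example; other tie-breakers (such as the mean) would work the same way.

```python
from statistics import median

def median_key(ratings):
    """Sort key: median rating first, then vote count as the tie-breaker."""
    return (median(ratings), len(ratings))

# Hypothetical products, using the two cases from the question
products = {
    "A": [5, 5, 5],             # 3x 5-star
    "B": [5] * 100 + [2, 2],    # 100x 5-star, 2x 2-star
}

ranked = sorted(products, key=lambda name: median_key(products[name]), reverse=True)
print(ranked)  # ['B', 'A']
```

Both products have a median of 5, so the vote count decides and the product with more ratings comes first.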

Solution 5

Well, depending on how complex you want to make it, you could have ratings additionally be weighted based on how many ratings the person has made, and what those ratings are. If the person has only made one rating, it could be a shill rating, and might count for less. Or if the person has rated many things in category a, but few in category b, and has an average rating of 1.3 out of 5 stars, it sounds like category a may be artificially weighed down by the low average score of this user, and should be adjusted.

But enough of making it complex. Let’s make it simple.

Assuming we’re working with just two values, ReviewCount and AverageRating, for a particular item, it would make sense to me to look at ReviewCount as essentially being the “reliability” value. But we don’t just want to bring scores down for low ReviewCount items: a single one-star rating is probably as unreliable as a single 5-star rating. So what we probably want to do is average towards the middle: 3.

So, basically, I’m thinking of an equation something like X * AverageRating + Y * 3 = the-rating-we-want. In order to make this value come out right we need X+Y to equal 1. Also, we need X to increase in value as ReviewCount increases: with a review count of 0, X should be 0 (giving us a result of 3), and with an infinite review count X should be 1 (which makes the result equal AverageRating).

So what are the X and Y equations? For the X equation we want the dependent variable to asymptotically approach 1 as the independent variable approaches infinity. A good set of equations is something like: Y = 1/(factor^RatingCount) and (utilizing the fact that X must be equal to 1-Y) X = 1 - (1/(factor^RatingCount)), where RatingCount is the ReviewCount from above and factor is greater than 1.

Then we can adjust "factor" to fit the range that we're looking for.

I used this simple C# program to try a few factors:

        // We can adjust this factor to adjust our curve.
        double factor = 1.5;  

        // Here's some sample data
        double RatingAverage1 = 5;
        double RatingCount1 = 1;

        double RatingAverage2 = 4.5;
        double RatingCount2 = 5;

        double RatingAverage3 = 3.5;
        double RatingCount3 = 50000; // 50000 is not infinite, but it's probably plenty to closely simulate it.

        // Do the calculations
        double modfactor = Math.Pow(factor, RatingCount1);
        double modRating1 = (3 / modfactor)
            + (RatingAverage1 * (1 - 1 / modfactor));

        double modfactor2 = Math.Pow(factor, RatingCount2);
        double modRating2 = (3 / modfactor2)
            + (RatingAverage2 * (1 - 1 / modfactor2));

        double modfactor3 = Math.Pow(factor, RatingCount3);
        double modRating3 = (3 / modfactor3)
            + (RatingAverage3 * (1 - 1 / modfactor3));

        Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}", 
            RatingAverage1, RatingCount1, modRating1));
        Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
            RatingAverage2, RatingCount2, modRating2));
        Console.WriteLine(String.Format("RatingAverage: {0}, RatingCount: {1}, Adjusted Rating: {2:0.00}",
            RatingAverage3, RatingCount3, modRating3));

        // Hold up for the user to read the data.
        Console.ReadLine();

In case you don’t want to bother running it, it gives this output:

RatingAverage: 5, RatingCount: 1, Adjusted Rating: 3.67
RatingAverage: 4.5, RatingCount: 5, Adjusted Rating: 4.30
RatingAverage: 3.5, RatingCount: 50000, Adjusted Rating: 3.50

Something like that? You could obviously adjust the "factor" value as needed to get the kind of weighting you want.
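
The same calculation could be written as a small reusable function. This is a Python sketch rather than a port of the exact program above; the name `adjusted_rating` and the `midpoint` parameter are introduced here for illustration.

```python
def adjusted_rating(average, count, factor=1.5, midpoint=3.0):
    """Blend the item's average with the midpoint: X * average + Y * midpoint."""
    # Y = 1 / factor**count, written with a negative exponent so that very
    # large counts underflow to 0.0 instead of overflowing.
    y = factor ** -count
    x = 1.0 - y
    return x * average + y * midpoint

for avg, n in [(5, 1), (4.5, 5), (3.5, 50000)]:
    print(f"RatingAverage: {avg}, RatingCount: {n}, Adjusted Rating: {adjusted_rating(avg, n):.2f}")
# RatingAverage: 5, RatingCount: 1, Adjusted Rating: 3.67
# RatingAverage: 4.5, RatingCount: 5, Adjusted Rating: 4.30
# RatingAverage: 3.5, RatingCount: 50000, Adjusted Rating: 3.50
```

In Python, `factor ** count` with a count of 50000 would raise an OverflowError, which is why the weight is computed with a negative exponent.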



Vizjerai

Updated on July 05, 2022

Comments

Vizjerai almost 2 years

I'm trying to sort a bunch of products by customer ratings using a 5 star system. The site I'm setting this up for does not have a lot of ratings and continues to add new products, so it will usually have a few products with a low number of ratings.

I tried using average star rating but that algorithm fails when there is a small number of ratings.

For example, a product that has 3x 5 star ratings would show up better than a product that has 100x 5 star ratings and 2x 2 star ratings.

Shouldn't the second product show up higher because it is statistically more trustworthy because of the larger number of ratings?
Vizjerai almost 15 years

If I were to use the median method how would you determine which should be rated better 5x 5 star ratings with 4x 2 star ratings or 5x 5 star ratings with 4x 1 star ratings? Both would come up with 5 for the rating.
Welbog almost 15 years

That would be up to you at that point. It depends on which you think is superior. Maybe you sort first by median, then by mean. Or maybe first by median, then by total number of votes.
Fear605 almost 15 years

Weighted median: Sort by median first, then by mean. Total number of votes improves the reliability (confidence level) of the score, but says nothing about the score itself.
Ang Jit Wei Aaron over 14 years

Even though I can recommend that book to everyone who is interested in that field, your answer does not provide a solution the question asked.
Richard Garside about 10 years

The wiki answers article quoted suggests that the formula is WR = (v * R + m * C) / (v + m) which seems more likely as C is taken into account and the values I'm getting seem better.
ParoX almost 10 years

The formula is actually the same one; you must have put the original one in incorrectly, as (v/(v+m))*R+(m/(v+m))*C is the same as (v * R + m * C) / (v + m). Link: goo.gl/IW9s1A
alecbz over 9 years

Wilson confidence intervals only work for binomial distributions (eg, +1/-1 style ratings); it's not clear what approach to take for something like a 5 star rating scheme.
Daniel over 6 years

I think 1 vote for rating 5 is bigger than 5 votes for rating 4 if I did it right. And it's not right for ranking system
Dan Fabulich over 5 years

Thanks very much! I ported this answer to JavaScript. gist.github.com/dfabulich/fc6b13a8bffc5518c4731347de642749
Dan Fabulich over 5 years

I also ported this answer to SQL, assuming columns rated5, rated4, rated3, rated2, and rated1, which are counts of how many people gave that rating. select ((5*(rated5+1)+4*(rated4+1)+3*(rated3+1)+2*(rated2+1)+1*(rated1+1))/(5+rated5+rated4+rated3+rated2+rated1))-1.65*SQRT((((25*(rated5+1)+16*(rated4+1)+9*(rated3+1)+4*(rated2+1)+1*(rated1+1))/(5+rated5+rated4+rated3+rated2+rated1)) - POWER(((5*(rated5+1)+4*(rated4+1)+3*(rated3+1)+2*(rated2+1)+1*(rated1+1))/(5+rated5+rated4+rated3+rated2+rated1)), 2))/(6+rated5+rated4+rated3+rated2+rated1)) as x from mytable
Cypher about 4 years

This is hands down the best answer.
eozzy over 3 years

So if there's just one 5 star rating, then how come the average is 2.5? e.g. starsort([1,0,0,0,0]) 2.4036636531319653
Dan Fabulich over 3 years

Evan Miller's formula looks complicated, but it's actually quite simple. First, before computing the average and the standard deviation, add five fake ratings for each widget: one 1-star, one 2-star, one 3-star, one 4-star, and one 5-star rating. Then, when it comes time to sort, subtract the standard deviation σ from the average first, multiplying σ by a constant factor z to put more weight on consensus, i.e. X = A - zσ. At z = 1.65, each widget has a 90% confidence of having a "true" average greater than X.
Dan Fabulich over 3 years

@eozzy See my explanation above. avg([2,1,1,1,1]) = ~3.3. We then subtract the standard deviation from that, times 1.65, to get 2.4.
Kieran Ryan almost 3 years

For simple answer just compare like with like i.e. compare corresponding 5 star ratings.. so in your example the product with 100x 5 star rating beats product with 3x 5 star ratings.
Mueez Khan about 2 years

Can I use m = total votes in DB / total movies, i.e. the average number of votes per movie?