How to Build a 5 Star Rating System with Wilson Interval in MySQL
In this article I’m going to explain how to build a 5 star rating system with MySQL. We will be using triggers to make the queries lightweight and fast. Triggers are supported in MySQL since version 5.0.2 so make sure you’re up to date. I will not explain how to implement the results in PHP, Ruby or whatever floats your boat. You can find plenty of tutorials on the web about that so no need to write another article. This article is based on Evan Miller’s tutorial ‘How Not To Sort By Average Rating‘.
The Problem
There are a lot of rating systems available and many of them are wrong. The majority of rating tutorials you’ll find on the web take the average rating, without taking the total number of votes into account. But it won’t stop there. Average rating systems can be found anywhere on the web. Even on big websites like Amazon. So why is it wrong?
Suppose you have 10 products on your website. Product 1 has a total of 10 votes, where 9 are positive and 1 is negative. With an average five star rating system that rating would be (9 / 10) * 5 = 4.5 stars. Now you have another product with 1 vote, which is positive. That would mean the product would have (1 / 1) * 5 = stars. So product 2 would rank higher than product 1, even though product 1 has 9 positive votes and only 1 negative vote.
The Solution
The solution to this problem is normal approximation interval. Wikipedia tells us:
“The most commonly used formula for a binomial confidence interval relies on approximating the distribution of error about a binomially-distributed observation, with a normal distribution… The Wilson interval is an improvement (the actual coverage probability is closer to the nominal value) over the normal approximation interval and was first developed by Edwin Bidwell Wilson (1927)”
Say what? In case you’re interested, Edwin Bidwell Wilson came up with a formula that would solve all our problems and it looks like this:
Evan Miller explains in his article: “Score = Lower bound of Wilson score confidence interval for a Bernoulli parameter. We need to balance the proportion of positive ratings with the uncertainty of a small number of observations.”
Now, we don’t need to dig deeper in this matter because Evan Miller came up with this SQL query:
SELECT widget_id, ((positive + 1.9208) / (positive + negative) - 1.96 * SQRT((positive * negative) / (positive + negative) + 0.9604) / (positive + negative)) / (1 + 3.8416 / (positive + negative)) AS ci_lower_bound FROM widgets WHERE positive + negative > 0 ORDER BY ci_lower_bound DESC;
What this query basically does, is give a rating a lower bound score. But let’s get back to our products. For example, with 9 positive votes and 1 negative vote this query would result in a lower bound of 0.595844. 1 positive and 0 negative votes would result in lower bound of 0.206543, which is lower. That means our first product would rank higher than our second and that’s exactly what we want. We’re going to use this query later in our triggers.
Stars
With a 5 star system there are more options than just positive or negative. Every time someone votes, 1 vote is added to the total votes. So we have to spread that vote over the positive and negative votes. Assume that 3 stars is not good and not bad either. So we take that as a middle. This means if someone votes 3 stars for our product, 0.5 goes to positive and 0.5 goes to negative.
Stars | Negative | Positive | Total |
---|---|---|---|
0 | 0 | 0 | 0 |
1 | 1 | 0 | 1 |
2 | 0.75 | 0.25 | 1 |
3 | 0.5 | 0.5 | 1 |
4 | 0.25 | 0.75 | 1 |
5 | 0 | 1 | 1 |
Enough with the theory. We’re going to write the queries in SQL.
The Table
Now let’s create a table with the following columns: id
, product_id
, positive
, negative
, stars
, total
and lower_bound
. In this example stars can have a half value. So 3,5 stars is possible. Total , product_id
and id
will be an integer, the others will be a float. We can achieve this with the following query:
CREATE TABLE `ratings` ( `id` int(11) unsigned NOT NULL AUTO_INCREMENT, `product_id` int(11) DEFAULT NULL, `positive` float NOT NULL DEFAULT 0, `negative` float NOT NULL DEFAULT 0, `stars` float DEFAULT 0, `total` int(11) NOT NULL, `lower_bound` float DEFAULT NULL, PRIMARY KEY (`id`) )
Alright, done.
Triggers
Triggers in MySQL are very useful when a defined action is executed for the table. The trigger can be executed when you run one of the following MySQL statements on the table: INSERT
, UPDATE
and DELETE
. We want to calculate stars, total and lower_bound as soon as a rating is inserted or updated. Yeah, triggers are your friend! This is what the insert trigger looks like:
CREATE TRIGGER `insert_rating` BEFORE INSERT ON `ratings` FOR EACH ROW SET new.total = new.positive + new.negative, new.stars = ROUND( (((new.positive / new.total) * 4) + 1) * 2, 0) / 2, new.lower_bound = ((new.positive + 1.9208) / (new.positive + new.negative) - 1.96 * SQRT((new.positive * new.negative) / (new.positive + new.negative) + 0.9604) / (new.positive + new.negative)) / (1 + 3.8416 / (new.positive + new.negative))
Every time a rating is inserted total, stars and lower_bound will be calculated. The stars are rounded to half decimals. Next is the update trigger which is almost the same as the insert trigger:
CREATE TRIGGER `update_rating` BEFORE UPDATE ON `ratings` FOR EACH ROW SET new.total = new.positive + new.negative, new.stars = ROUND( (((new.positive / new.total) * 4) + 1) * 2, 0) / 2, new.lower_bound = ((new.positive + 1.9208) / (new.positive + new.negative) - 1.96 * SQRT((new.positive * new.negative) / (new.positive + new.negative) + 0.9604) / (new.positive + new.negative)) / (1 + 3.8416 / (new.positive + new.negative))
We have created the triggers and now we are ready to test them.
INSERT into ratings (product_id, positive, negative) VALUES (1, 0, 0)
You should see the calculated total, stars and lower_bound
rows which are 0
or NULL
. Now we’re going to add the first rating. Someone rated 4 stars for our product number 1. So we’re going to run the following query:
UPDATE ratings SET negative = negative + 0.25, positive = positive + 0.75 WHERE id =1
And last but not least, to get the list of most popular products:
SELECT * FROM ratings ORDER BY lower_bound DESC
Conclusion
That’s it for this tutorial. Now you have a rating table in SQL with the Wilson score confidence interval. But there are a lot of ways to extend this tutorial. You might want to store the IP addresses in another table to make sure people only vote once.