Bayesian Averaging in SAS
Here's a hypothetical situation: let's say you've got a list of movies that you want to rank on a website or in a report, and you have user-submitted ratings for them. Some movies are more popular than others, so your data looks like this:
data ratings;
input name $ rating;
datalines;
Lincoln 9
Lincoln 8
Lincoln 9
Amour 9
Argo 5
Argo 10
;
run;
| Obs | name | rating |
|---|---|---|
| 1 | Lincoln | 9 |
| 2 | Lincoln | 8 |
| 3 | Lincoln | 9 |
| 4 | Amour | 9 |
| 5 | Argo | 5 |
| 6 | Argo | 10 |
The easiest thing to do would be to calculate an average rating for each movie like this:
proc sql;
  /* One row per movie with its plain mean rating, best first */
  select name, avg(rating) as average format=5.2
  from ratings
  group by name
  order by average desc;
quit;
| name | average |
|---|---|
| Amour | 9.00 |
| Lincoln | 8.67 |
| Argo | 7.50 |
But hey! That's not cool. Amour wins because its average rating is 9, but that average comes from a single vote. We might want to rank Lincoln higher, since three people rated it highly. A good way to deal with this is to take a Bayesian average instead.
This means we add some "dummy" votes for each movie, each of which gives the movie the overall average rating across all ratings in the data. How many to add (C) is a judgment call: the more we add, the harder we make it for an obscure movie to be near the top. Likewise, if a movie's first rating is low, the dummy votes keep it from suddenly dropping to the bottom of the list. If we expect thousands of ratings for each movie, C=1000 might be appropriate. In this example, I use a small C of 10.
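Written as a formula, the idea looks roughly like this. The symbols $n$, $r_i$, and $m$ are my own notation for this sketch and don't appear in the SAS code; $C$ is the number of dummy votes described above.

```latex
% b : Bayesian average for one movie
% n : number of real ratings that movie has
% r_i : the i-th real rating for that movie
% m : mean of all ratings across every movie (the value each dummy vote casts)
% C : number of dummy votes added
\[
  b = \frac{\sum_{i=1}^{n} r_i + C \cdot m}{n + C}
\]
```

With $C = 0$ this is just the plain average, and as $C$ grows every movie gets pulled toward $m$.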
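proc sql;
  /* Store the mean of all ratings in the macro variable &average */
  select avg(rating) into :average
  from ratings;
  /* Add C = 10 dummy votes at that mean to each movie's total */
  select
    name,
    (sum(rating) + &average * 10) / (count(*) + 10) as b_average format=5.2
  from ratings
  group by name
  order by b_average desc;
quit;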
| name | b_average |
|---|---|
| Lincoln | 8.41 |
| Amour | 8.39 |
| Argo | 8.19 |
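If you want to check the table by hand, here is the arithmetic, assuming 10 dummy votes that each cast the mean of all six ratings. The symbol $m$ for that mean is my own shorthand, not something from the SAS code.

```latex
% m is the mean of all six ratings in the data set
\[
  m = \frac{9 + 8 + 9 + 9 + 5 + 10}{6} = \frac{50}{6} \approx 8.33
\]
% Each movie: (sum of its ratings + 10 * m) / (its rating count + 10)
\begin{align*}
  \text{Lincoln} &: \frac{26 + 10 \cdot 8.33}{3 + 10} = \frac{109.33}{13} \approx 8.41 \\
  \text{Amour}   &: \frac{9 + 10 \cdot 8.33}{1 + 10}  = \frac{92.33}{11}  \approx 8.39 \\
  \text{Argo}    &: \frac{15 + 10 \cdot 8.33}{2 + 10} = \frac{98.33}{12}  \approx 8.19
\end{align*}
```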
And look! Lincoln is back on top, since its Bayesian average reflects both how many ratings it has and what those ratings are.