I recently added a Composite Rating page to the site. Aggregate analysis is often useful and more accurate than any one method on its own, so I figured I would use this post to explain some of the methods and discoveries of my analysis.
How the ratings are determined
Every rating system uses a different scale to rate its teams, so the first challenge in building a composite ranking is determining how to compare these different systems. What I've seen in the past is taking the geometric mean (a cousin of the familiar arithmetic average) of each team's rank. That method is perfectly acceptable for ranking the teams, but I wanted to determine a composite rating on the same scale as my system. The difference between averaging ratings and averaging rankings is subtle, but meaningful: when several teams are rated at nearly the same level, the difference between their ratings is small, but the difference between their rankings can be large.
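To make that distinction concrete, here's a tiny sketch with made-up ratings for three closely matched teams:

```python
# Hypothetical ratings for three closely matched teams.
ratings = {"Team A": 0.805, "Team B": 0.801, "Team C": 0.798}

# Ranking turns a 0.007 rating gap into a two-place rank gap.
ranked = sorted(ratings, key=ratings.get, reverse=True)
for rank, team in enumerate(ranked, start=1):
    print(rank, team, ratings[team])

# Averaging ranks would treat A and C as far apart; averaging ratings
# preserves how close the three teams actually are.
rating_gap = max(ratings.values()) - min(ratings.values())
print(f"best-to-worst rating gap: {rating_gap:.3f}")
```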
In order to average teams' ratings, I had to transform each system's ratings onto a 0-1 scale. Every system I analyzed distributed its ratings approximately normally, so I was able to use Excel's NORMDIST() function to place each team on a 0-1 scale. From there I averaged each team's rating across systems, excluding the maximum and minimum values (also known as an Olympic average), to find its composite rating. By using the Olympic average, I hopefully minimized the influence of any outlier values. The math was pretty simple, but cleaning up the team names was a nightmare: nearly half of the teams in the set go by different names in different systems (Wis. Whitewater/Wisc.-Whitewater/UW-Whitewater, for example). Can't we all just use the same team names that I use?
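In Python terms, the same pipeline looks roughly like this (a sketch with made-up ratings; `statistics.NormalDist.cdf` plays the role of Excel's NORMDIST):

```python
from statistics import NormalDist, mean, stdev

def normalize(raw):
    """Place one system's ratings on a 0-1 scale by treating them as
    approximately normal and taking each team's normal-CDF value."""
    mu, sigma = mean(raw.values()), stdev(raw.values())
    dist = NormalDist(mu, sigma)
    return {team: dist.cdf(r) for team, r in raw.items()}

def olympic_average(values):
    """Mean after dropping one max and one min, to damp outliers."""
    trimmed = sorted(values)[1:-1]
    return mean(trimmed)

# Hypothetical raw ratings from four systems, each on its own scale:
systems = [
    {"UW-Whitewater": 31.9,  "Mount Union": 30.2,  "Wesley": 24.8},
    {"UW-Whitewater": 102.4, "Mount Union": 104.0, "Wesley": 91.3},
    {"UW-Whitewater": 0.93,  "Mount Union": 0.88,  "Wesley": 0.71},
    {"UW-Whitewater": 18.2,  "Mount Union": 17.5,  "Wesley": 12.9},
]

scaled = [normalize(s) for s in systems]
composite = {
    team: olympic_average([s[team] for s in scaled])
    for team in scaled[0]
}
```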
Next, I decided to check which systems were most similar to the aggregate and to each other. First I found the absolute difference between each system's rating and the aggregate rating for every team, then averaged those values; the inverse of that average is the similarity score. A quick primer on interpreting these values: a similarity score of infinity means two systems produce exactly the same ratings, while a score of two is as dissimilar as two systems can be expected to get on this scale (each system's ratings end up spread roughly evenly across 0-1, so the largest expected average difference is 0.5). As the value increases, the two systems become more similar. A score of 10 means that two systems differ by an average of 0.100 per team (or, essentially, one win per season). Here's how each system compared to the average:
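A minimal version of the similarity score (the team names and 0-1 ratings here are made up):

```python
from statistics import mean

def similarity(a, b):
    """Inverse of the mean absolute rating difference between two
    systems over the teams they share; higher = more similar."""
    shared = a.keys() & b.keys()
    avg_diff = mean(abs(a[t] - b[t]) for t in shared)
    return float("inf") if avg_diff == 0 else 1 / avg_diff

sys1 = {"A": 0.9, "B": 0.5, "C": 0.1}
sys2 = {"A": 0.8, "B": 0.6, "C": 0.2}   # off by 0.100 everywhere

print(similarity(sys1, sys2))  # ~10: average difference of 0.100
print(similarity(sys1, sys1))  # inf: identical ratings
```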
To illustrate just how big the difference between scores of 39.9 and 10.5 is, here's a side-by-side scatter plot of the Laz Index and the CSL Ratings against the system average.
Next I compared the individual rating models to each other to see which produced the most similar results. What I discovered was that the Laz Index and Maas Ranking are the two systems most similar to each other, with a similarity score of 67.1. For some context, the next-highest similarity score was 39.6, meaning these two systems are nearly twice as similar as the next-closest pair. That can't be pure coincidence. So I graphed these two systems against each other like I did above:
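The pairwise comparison is just the same similarity calculation run over every pair of systems; a sketch with hypothetical 0-1 ratings and system names:

```python
from itertools import combinations
from statistics import mean

def similarity(a, b):
    """Inverse of the mean absolute rating difference over shared teams."""
    avg_diff = mean(abs(a[t] - b[t]) for t in a.keys() & b.keys())
    return float("inf") if avg_diff == 0 else 1 / avg_diff

# Hypothetical normalized ratings keyed by system name:
systems = {
    "Laz":  {"A": 0.91, "B": 0.55, "C": 0.12},
    "Maas": {"A": 0.90, "B": 0.56, "C": 0.13},
    "CSL":  {"A": 0.60, "B": 0.80, "C": 0.35},
}

# Score every unordered pair, then pull out the extremes.
scores = {
    pair: similarity(systems[pair[0]], systems[pair[1]])
    for pair in combinations(systems, 2)
}
most_alike = max(scores, key=scores.get)
least_alike = min(scores, key=scores.get)
```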
Notice how in the first graph the points scatter above and below the x=y line, while here they follow a smooth trajectory. This suggests that the Laz Index and Maas Ranking are probably using the same method to rank teams. Martien Maas has a detailed description of his methodology on his site, but there's no such description on the Laz Index site. If I were a betting man, I would wager that they're using the same formula with slightly different inputs.
Next I found the two systems most dissimilar from each other, and they happened to be the CSL Ratings and the Nutshell Retrodictive, with a similarity score of 6.4. Here they are plotted against each other:
It would be hard for these two systems to be any more different. As I mentioned earlier, the lowest similarity score we'd expect between any two systems is 2. Both Ray Waits (Nutshell) and Craig Loest (CSL) provide descriptions of their methodology, so it should be possible to determine why they produce such different results.
The Nutshell Retrodictive description is as follows (emphasis mine):
The Nutshell Ratings are based on two components:
1. Margin of Victory
2. Upsets
Margin of victory is the number of points one team beats another team by. An upset is when the underdog beats the favorite. Each of these factors is treated in a different manner. Upsets award the underdog, but not as much as if the winner had been the favorite. There is no limit on the margin of victory since it only counts as a small portion of a team's rating. When an underdog wins, it does not gain a greater rating than the team it beat. An underdog is expected to earn its higher rating either by margin of victory or by beating more favorites. Once the team is the favorite it will gain points more quickly.
Short and sweet; I like it. The CSL description is slightly longer, so I won't post the whole thing here, but here are the important points:
1. A team's wins minus that team's losses
2. The sum of the wins of the teams I beat, minus the sum of the losses of the teams that beat me
3. The sum of the wins of the teams beaten by the teams I beat, minus the sum of the losses of the teams that were beaten by the teams that beat me
Clearly, a win against a team with a "good" record will supply more of a reward in the second and third factors than a win over a team with a "poor" record. Similarly, a team is not penalized very much for losing to a good team, but is penalized more heavily for losing to a poor team.
Each factor is weighted evenly and then summed.
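The three factors above can be sketched directly. This is my own reading of the quoted description; the record format, team names, and function name are all hypothetical:

```python
def csl_score(team, beat, lost_to, records):
    """Sum of the three CSL factors as quoted above.
    beat[t]/lost_to[t]: lists of t's opponents; records[t]: (wins, losses)."""
    wins, losses = records[team]
    # Factor 1: my wins minus my losses.
    f1 = wins - losses
    # Factor 2: wins of the teams I beat, minus losses of the teams that beat me.
    f2 = (sum(records[t][0] for t in beat[team])
          - sum(records[t][1] for t in lost_to[team]))
    # Factor 3: wins of teams beaten by the teams I beat, minus losses
    # of teams beaten by the teams that beat me.
    f3 = (sum(records[u][0] for t in beat[team] for u in beat[t])
          - sum(records[u][1] for t in lost_to[team] for u in beat[t]))
    # "Each factor is weighted evenly and then summed."
    return f1 + f2 + f3

# A made-up three-team round robin: A beat B and C; B beat C.
records = {"A": (2, 0), "B": (1, 1), "C": (0, 2)}
beat = {"A": ["B", "C"], "B": ["C"], "C": []}
lost_to = {"A": [], "B": ["A"], "C": ["A", "B"]}

print(csl_score("A", beat, lost_to, records))  # 3
print(csl_score("C", beat, lost_to, records))  # -8
```

Note that margin of victory never enters, and a loss to a poor team hurts in every factor.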
In reading the descriptions, it becomes clear why they produce such vastly different results: Nutshell includes margin of victory while CSL doesn't, and the two descriptions explicitly describe how an underdog win affects their ratings in completely opposite ways.
When I read the CSL description, I was struck by how similar it is to both the NCAA's Strength of Schedule calculation and DI college basketball's RPI, which makes sense considering the teams that are the biggest outliers in the system. The same quirks that I pointed out in my critique of the NCAA's SoS are likely to plague the CSL Ratings. Look back at the first plot of CSL versus the system average. The teams CSL overrates the most are teams from conferences like the UMAC, MWC, and ECFC (bad conferences), while the teams it underrates the most are teams from conferences like the WIAC, OAC, MIAC, and E8 (good conferences).
Interestingly, my system was most similar to the Born Power Index, Laz Index, and Maas Ranking - three of the systems most similar to the system average - yet my ratings are among the most dissimilar from the system average. I'm not sure how to feel about this, but I think it's probably a good thing. I set out to make a unique rating system, which I think I did, but when you differ sharply from the aggregate knowledge of others, it usually means you're wrong.
I guess we'll have to wait until next season to find out.