OK, so we’re three weeks into the season, and I’ve put the “new model” into play and am publishing the probabilities. While I think the site looks a little cluttered now, I’m pretty happy with how it’s working out, both on a game-by-game basis and as far as the sim projections are concerned. The margin of error looks good now, but it’s early days, and I’m expecting the clubs will shuffle around enough to throw that out a bit. (Anyway, stasis is boring.)
If you’re new here and you’re curious about how all this is worked out, read on.
Firstly, the name GRAFT is a sort of backronym for “G” Rudimentary Arithmetic Form Tracker. That’s all I’m doing – adding and subtracting, with a bit of dividing and multiplying.
For a good while I called it RAFT while it was just about the margins, until I introduced the attack/defense elements so I could also work out expected scores and each team’s attacking tendencies. To reflect the new version, I added the “G” for no reason because I definitely didn’t name it after myself.
The GRAFT algorithm is the engine of the beast. I compare the two sides’ ratings, add the “interstate” factor if warranted, and from that I get a margin and a pair of “par” scores, which set out where each team supposedly stands relative to the other.
After each game I compare the actual result with the par margin and scores and adjust accordingly by a factor of 10% (this actually means the two competing teams move closer together or further apart by 20%, so if they played next week at the same venue the expected margin would be 20% closer to the actual scoreline). I’ve messed around with that factor somewhat, but 10% seems to get the best results while being flexible enough to account for teams improving or deteriorating over time.
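As a rough sketch of those two steps – the helper names are invented for illustration, and only the 10% factor and the “two teams drift 20%” effect come from what I’ve described:

```python
FACTOR = 0.1  # the weekly adjustment factor described above

def par_margin(home_rating, away_rating, interstate=0.0):
    # Compare the two sides' ratings, adding the "interstate"
    # factor to the home side if warranted.
    return home_rating - away_rating + interstate

def adjust(rating, par, actual, factor=FACTOR):
    # Nudge a rating 10% of the way from par toward what actually
    # happened; the opponent moves 10% the other way, so the pair
    # end up 20% closer together (or further apart).
    return rating + factor * (actual - par)
```

So a team rated 10 points clear with an 8-point interstate bonus gets an 18-point line, and if it beats its par by 10 it picks up a single rating point.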
I feel it is an absurdly reductive system in some respects, but it is mine! For what I put into it, its capacity for self-adjustment is quite effective. Then again, any power rating system worth its salt is based on self-adjustment to at least some degree.
You should take a grain of that salt with GRAFT, though, because it does not take into account player changes, particularly whether the coach decides to rest half the first-choice side before the finals. The lines are based on three things only: the two clubs, and where they’re playing*. You would still have to do some homework based on the ins-and-outs and other ineffables. Of course there are plenty of other great ratings systems out there so you can sort of take your pick.
* – I’ve modelled against difference in resting days as another possible factor, but it doesn’t seem to matter much. The dead-rubber thing is something I haven’t tested, but since it would affect only a small number of games towards the end of the season, it hasn’t really been a priority of mine – I mean, how can you tell the difference between a team that’s deliberately tanking, a team that doesn’t care about winning anymore, and a team that’s trying its best for its fans but is really actually complete garbage?
For the 2018 season, I’ve made one major change to the GRAFT system: expressing it on a 1:1 scale of actual scoreboard points. In previous years it was 10 times the scoreboard.
I made this decision for a couple of reasons:
Firstly, I felt scaling the ratings in this way would be more intuitive.
Secondly, for all of RAFT/GRAFT’s life the ratings had been expressed as integers. This was practical in most instances, but then I’d get to cases where there was a 5-point difference in the ratings – would I round up or down? By moving to floating-point numbers, while the dots look a little more daunting in the tables, I can also deal with edge cases with fewer of them “too close to call”.
That came in quite handy last weekend, when the working out showed the Swans 0.1 points better than the Giants. With the old integer system I probably would’ve called this too close to call; now, while the line is still pretty much a draw, I could at least pick an expected winner, although not with a great deal of confidence. (The fact that Sydney won by 16 points doesn’t really vindicate that at all, because it’s supposed to be close, dammit!)
Despite the change of scale, the GRAFT system is still pretty much the same; it works out expected scores and then nudges them accordingly based on the actual result.
As a rule of thumb, consider teams on around 120 GRAFT points (or +30 over the league mean) as premiership quality, while 100 points (or mean+10) probably gets you into the finals. Teams that go on a tear like Essendon 2000 or Geelong 2008 have gotten as high as 150, while GWS and Gold Coast in their first few seasons would’ve been floundering around 40 to 50. Low scores seem to be more common than similarly high scores, but they’re also easier to recover from – it only takes a few good wins or near-misses for a lowly team to get back to the pack.
The Probability Problem
You may be familiar with Elo-based systems – the fun thing about those is that, properly calibrated, they give you the probability of the result straight out of the box. Then you have to figure out the line margin from that.
GRAFT does that arse-backwards. It works out the expected score and margin, from which one has to derive the probabilities.
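For reference, this is the standard Elo expectation formula – the “straight out of the box” probability mentioned above. This is textbook Elo, not anything from GRAFT:

```python
def elo_win_prob(rating_a, rating_b, scale=400):
    # Classic Elo logistic curve: a gap of `scale` rating points
    # corresponds to roughly 10:1 odds for the higher-rated side.
    return 1 / (1 + 10 ** ((rating_b - rating_a) / scale))
```

Elo hands you that probability first and leaves the margin as the derived quantity; GRAFT starts from the margin and has to get to the probability some other way.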
For the longest time I used a horrible fudge to work out the probability, based on a one-size-fits-all bell curve that was not constrained at zero (that is, when I was running the Monte Carlo sims, it could throw out negative scores, which in practice I bumped up so that at least the scores were positive while I preserved the margin).
So it was really quite an awful hack, statistics-wise. It worked OK in practice in some ways but the sliding normal curve just annoyed me and I was not confident about publicising them on the site.
Besides, the normal distribution doesn’t fit (literally, in the statistical sense). The historical score distribution from AFL/VFL games doesn’t conform to the bell curve; it skews, with a bit of a tail towards the high scores – the median score is slightly below the mean.
I won’t get into too much detail here because the Towards A New Model series of articles covered that, but I went with the Gamma distribution because it was a nice fit with the actual results, both overall and under constraints, and with that I could devise a reasonably sane model to determine probabilities for individual matches, as well as whole seasons using Monte Carlo methods.
It’s worth posting again because I went to a fair amount of trouble working something out:
import scipy.stats

def score_model(rating):
    a = rating * .1 + 3
    return scipy.stats.gamma(a, 0, 7.5)
(This is in Python form, “rating” is the expected score according to GRAFT, and the output is a “frozen” distribution model as implemented in scipy.)
Essentially it doesn’t rule out the possibility of a team not expected to do well suddenly cutting loose, but it acknowledges that doing so is harder for them than it is for a team on top of the league.
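To show how those frozen curves turn into a match probability, here’s a rough Monte Carlo sketch. It uses the standard library’s gamma sampler rather than scipy, and the function names are mine; the shape and scale parameters mirror the snippet above:

```python
import random

def sample_score(expected, rng):
    # Same parameterisation as the scipy snippet:
    # shape = expected/10 + 3, scale = 7.5.
    return rng.gammavariate(expected * 0.1 + 3, 7.5)

def win_prob(exp_a, exp_b, n=20_000, seed=1):
    # Estimate P(team A outscores team B) by simulating n games,
    # each side drawing independently from its own gamma curve.
    rng = random.Random(seed)
    wins = sum(
        sample_score(exp_a, rng) > sample_score(exp_b, rng)
        for _ in range(n)
    )
    return wins / n
```

Run over a whole fixture, the same sampling loop drives the season sims; for a single game you just read off the share of simulated wins.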
So now that I am using the Gamma model to calculate means and probabilities, it tends to come across a bit more conservative than the margin lines that GRAFT comes out with – mostly because, overall, the mean of actual results is more conservative. It’s almost as if, since the GRAFT rating is foremost a form tracker that expresses how each team has performed up to that point, when they step onto the field they have to prove themselves worthy of that rating all over again.
When I was doing the prep work over summer, I found that, averaging across each margin band as determined by GRAFT, the actual margins were about 25% closer to the mean than what GRAFT was expecting. That is, I would check the range of results about 20 points above the global mean, and find the average actual result was more like 15 points – sure, some blowouts for the favourite, and some major upsets against the grain, but overall a little closer to zero than standard GRAFT had expected.
However, when I decided to tone down GRAFT itself by reducing the weekly adjustment factor, it didn’t actually improve the number of successful tips, and weirdly the actual margin averages were still regressing compared with the expectations.
But decoupling the Gamma line (which took into account that regression) from GRAFT seemed to work much better.
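In code terms, that decoupling might amount to something like this. The 25% regression figure comes from the prep work described above; the league-mean value and the function name are placeholders of mine:

```python
REGRESSION = 0.25    # actual results averaged ~25% closer to the mean
LEAGUE_MEAN = 90.0   # placeholder for the long-run average score

def gamma_expected(graft_expected):
    # Shrink the GRAFT expected score toward the league mean before
    # it feeds the Gamma model; the published GRAFT ratings used for
    # ranking the teams are left untouched.
    return LEAGUE_MEAN + (graft_expected - LEAGUE_MEAN) * (1 - REGRESSION)
```

So a side GRAFT expects to score 20 above the mean goes into the Gamma model at 15 above, which is roughly what the historical bands showed.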
So, the probabilities and lines I am publishing for aggregation through http://graftratings.com/aft/graft_tips.csv are based on the Gamma model; however, the ratings that I use to rank and compare the teams are still based on the usual week-by-week GRAFT model.
Setting a conservative slant on the Gamma model has also come in handy for the season sims because, since it effectively imposes a regression to the mean and brings all the teams closer together (awwww), it actually gives outliers more credence to show up in the mix.
(Regressing ratings to the mean is a pretty common practice in ratings systems, in that it is applied in the off-season before the competition begins afresh. Weirdly, though, I don’t actually do that; for each new season, I reset the seed ratings based on the home-and-away performances from the previous season. This is done by weighting the for-and-against totals, halving the scores from the matches where the clubs met twice in the season.)
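I haven’t spelled out how those weighted totals become seed ratings, but the halving step itself might look like this – the data shape and names are my invention for the sketch:

```python
def weighted_for_against(games):
    # games: (points_for, points_against, met_twice) per match.
    # Matches against a twice-met opponent count at half weight,
    # so no single opponent dominates the totals.
    pf = pa = weight = 0.0
    for scored, conceded, met_twice in games:
        w = 0.5 if met_twice else 1.0
        pf += scored * w
        pa += conceded * w
        weight += w
    return pf / weight, pa / weight
```

The output is a weighted per-game for and against, which is the raw material the new season’s seed ratings get built from.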
All of this is still a work in progress. For instance, now I’ve gotten to this point, I would like to work out and account for covariance – once GRAFT emits the two teams’ expected scores for the Gamma curves, those two curves are independent, which is not quite how footy works – like, you can’t have both teams scoring at the same time. I think that will be quite tricky to work out, but it’s the next logical step as far as improving the model goes.
In the meantime, the current version is set and being put into practice for match-by-match “predictions” and season projections. I’m really at the next stage of this, which is devising visualisations that I hope will be informative and intuitive, without being too misleading.
Of course, at the moment I have spammed a bunch of numbers under each match info box, but it seems like maybe it’s too much? Anyway. This thing is pretty much a hobby for me (the site is low-traffic so my bills are modest), and I get as much fun out of working out the design part of it and coming up with weird charts as I do from the heavy maths.
And just in case you’re wondering, if I’m watching or listening to a game, I don’t even really think about this stuff. I just turn into every other moron yelling baaaaawl (actually I’m probably more “WHAT THE HELL WAS THAT FREE FOR”), have a bit of a laugh if something stupid happens, and then at the end of the game, that’s when I put the numbers in and turn the handle and have a look at what comes out the end.