We live in the era of Big Data – of massive databases that can be interrogated with complex statistical tools to produce all manner of insights – but let’s also hear it for the little guy. Big Data may be all the rage more widely, but Little Data remains highly important in the realms of horse racing analysis.
By its nature, much of what matters in horse racing is individualised and dynamic, transitory even. It is, in other words, on the small side in terms of sample sizes and timescales.
It is in the nature of betting on horse racing, and of betting on many uncertain outcomes, that you need to act upon incomplete information. It is also in the nature of betting on horse racing that the precedents you use to inform your decisions may well change over time.
In few places is this more apparent than in the field of so-called “trends”.
The theory behind trends is that outcomes in horse racing tend to be repeated across time. There is something in this, but there is also a clear danger of simplifying a complex problem primarily for convenience.
Trends tend to look at individual races, and at what it has taken to win, or run well in, that race previously. Actually, it would be a lot better if they sought more often to establish what it took to “run well” in a race: more usually, they obsess over winners.
The flaw in this should be obvious to the more astute reader. Winners are just one extreme of a wide spectrum of performance, from winning like a champion to losing like a bum. By turning performance into a binary “did it win, or did it not?”, a huge amount of important information will be lost.
This is exacerbated by the fact that, if you are looking at one race over a period of time, there can only ever be one winner each year, and your sample size will be puny unless you extend the search back many years (by which point the older results are likely to be no longer relevant).
One solution is to use measures which are more nuanced than “winners only”. The more sophisticated the measure, the smaller (and therefore the more current) the sample can be while it remains possible to draw reasonable conclusions.
If not “winners only”, then how about placed horses? How about comparing the number of winners and placed horses with the number that could be expected by chance (a ratio known as an “impact value”)?
How about lengths beaten, or Timeform rating run to, or the proportion of rivals beaten? All of these have validity (and all of them have drawbacks or complications!), and are an improvement on the binary world of “win or not?”
One consequence of an increasingly nuanced way of looking at the information available is that effects which seemed to exist when looked at crudely (such as through wins only) often lessen or disappear. The corollary to that is that if an effect persists it is far more likely to have substance.
It is worth looking at an example in detail: age-related data and the Old Borough Handicap, due to be run at Haydock on 5 September. All runnings of the race from 2005 to 2014 were considered, except for 2008, when the contest did not take place, which leaves nine runnings in all.
Age Related Data in the Old Borough Handicap
In the first instance, we can simply consider the number of wins achieved by each age-group (Row 1). There seems to be no effect: each of the age-groups has won three races. Believe it or not, some people – including those who really should know better – look no further than this.
If nothing else, we should really consider opportunity, starting with the number of horses to have run from each age-group (Row 2). In terms of complexity and an evolution in analysis, this simple step is akin to life emerging from the primordial swamp. It can be seen that three-year-olds had far fewer opportunities to gain their three wins than the other age-groups.
But we can do better. Achievement compared to opportunity can be expressed as a strike-rate (Row 3): that is, the percentage of horses from that age-group which won. Things are looking better still for three-year-olds, but what if that age-group had only ever contested small-field runnings of the race, in which any one runner has a better chance of winning?
Row 4 establishes the number of wins each age-group would be expected to achieve at random (the sum of 1/(number of runners) over every runner in that category, so that field size is taken into account) and, from that, an impact value: wins achieved divided by wins expected at random. An impact value above 1 means the group has won more often than chance alone would suggest.
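For anyone wanting to reproduce the strike-rate and impact-value arithmetic, a minimal Python sketch follows. The runner records are made up purely for illustration (they are not the Old Borough figures), and the function and field names are hypothetical.

```python
from collections import defaultdict

# Hypothetical records, one per runner: (age_group, field_size, won).
# These are illustrative values only, not real race data.
runners = [
    ("3yo", 10, True),
    ("3yo", 10, False),
    ("4yo", 10, False),
    ("4yo", 20, True),
    ("5yo+", 20, False),
    ("5yo+", 20, False),
]


def impact_values(records):
    """Summarise runs, wins, strike-rate, expected wins and impact value per group."""
    runs = defaultdict(int)
    wins = defaultdict(int)
    expected = defaultdict(float)

    for group, field_size, won in records:
        runs[group] += 1
        wins[group] += int(won)
        # Each runner in an n-runner race has a 1/n chance of winning at random.
        expected[group] += 1.0 / field_size

    summary = {}
    for group in runs:
        summary[group] = {
            "runs": runs[group],
            "wins": wins[group],
            "strike_rate_pct": 100.0 * wins[group] / runs[group],
            "expected_wins": expected[group],
            # Wins achieved divided by wins expected at random.
            "impact_value": wins[group] / expected[group],
        }
    return summary


if __name__ == "__main__":
    for group, stats in impact_values(runners).items():
        print(group, stats)
```

Summing 1/(field size) runner by runner, rather than dividing wins by runs, is what stops a group that has mostly contested small fields from being flattered.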
But we have still considered winners and winners only. If we want to do better still – the evolutionary equivalent of walking on our hind legs or discovering fire – then we need to consider beaten horses also. If an effect exists, then it should apply across the board, and not just conveniently to those horses which win.
Age Related Data in the Old Borough Handicap (first 4 finishers)
The message is similar, but the measures can be relied upon more. Three-year-olds have actually provided fewer first-four finishers than the other two age-groups, but those seven have come from just a dozen runners. This translates into a superior strike-rate, impact value and % of rivals beaten.
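The same approach can be extended to first-four finishes, sketched below in Python. This rests on an assumption that the article does not spell out: that a runner's random chance of finishing in the first four is taken as 4/(number of runners), capped at 1 for fields of four or fewer. The records and names are again hypothetical.

```python
# Hypothetical records, one per runner: (age_group, field_size, finishing_position).
# Illustrative values only, not real race data.
results = [
    ("3yo", 8, 2),
    ("3yo", 16, 5),
    ("older", 8, 1),
    ("older", 16, 4),
]


def place_impact_values(records, places=4):
    """Impact value for finishing in the first `places`, allowing for field size."""
    runs, placed, expected = {}, {}, {}

    for group, field_size, position in records:
        runs[group] = runs.get(group, 0) + 1
        placed[group] = placed.get(group, 0) + (1 if position <= places else 0)
        # Assumed random chance of a first-`places` finish: places / field size.
        chance = min(places, field_size) / field_size
        expected[group] = expected.get(group, 0.0) + chance

    return {
        group: {
            "runs": runs[group],
            "placed": placed[group],
            "strike_rate_pct": 100.0 * placed[group] / runs[group],
            "expected_placed": expected[group],
            "impact_value": placed[group] / expected[group],
        }
        for group in runs
    }


if __name__ == "__main__":
    for group, stats in place_impact_values(results).items():
        print(group, stats)
```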
That measure of % of rivals beaten takes things one step further than any measure based on first-four places by considering exactly where in the field a horse finished and the % of its rivals that finished behind it. “Par” will be 50%, and a figure of 68.8% is very high indeed.
There are drawbacks to % rivals beaten as a measure – principally that information further down the field gets accorded undue significance – but things tend to come out in the wash, even in fairly small sample sizes, and the 50% par is clear and intuitive.
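The % of rivals beaten is easy to compute. One common way to do so, assumed here rather than taken from the article, is (field size minus finishing position) divided by (field size minus 1). The function names below are hypothetical.

```python
def pct_rivals_beaten(position, field_size):
    """Percentage of rivals that finished behind a horse in a given race."""
    if field_size < 2:
        raise ValueError("a race needs at least two runners")
    if not 1 <= position <= field_size:
        raise ValueError("position must lie between 1 and field_size")
    rivals = field_size - 1
    beaten = field_size - position
    return 100.0 * beaten / rivals


def mean_pct_rivals_beaten(results):
    """Average % of rivals beaten over (position, field_size) pairs."""
    values = [pct_rivals_beaten(pos, size) for pos, size in results]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Every runner in a hypothetical five-runner race: the average is par.
    whole_field = [(pos, 5) for pos in range(1, 6)]
    print(mean_pct_rivals_beaten(whole_field))
```

Averaged over every runner in a race, the figure always comes out at 50%, which is why that serves as par. The winner scores 100% and the last horse home scores 0%.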
The message is simple: offset the smallness of sample sizes inescapably associated with trends by using measures more sophisticated than winners only (and the more sophisticated those measures the better).
And it is simple in another respect: three-year-olds have done notably well in recent Old Borough Cups, even if wins alone disguise that fact!