Money Balling Cricket: Averaging Babar Azam’s runs

Arslan Shahid
12 min read · Apr 19, 2021


One of the key elements in the movie Moneyball (2011) is that Billy Beane (Brad Pitt) and Peter Brand (Jonah Hill) discover an anomaly in the way baseball clubs recruit players. Baseball scouts (recruiters) value characteristics that have nothing to do with how well a player is going to perform. Peter (the statistics whizz) states that baseball teams should focus on buying wins by buying runs. A similar case could be made for cricket: one could argue that the most important statistic for a batsman is how many runs he or she can score.

Babar Azam has been a star addition to the Pakistani team; his consistent performances have won him plenty of fans. Despite being a young player, he is considered one of the greatest batsmen ever to play for Pakistan. Let’s see how his performances break down.

Data

Before any analysis it’s imperative that you have the right data. Fortunately, you can find historic cricket data on Cricsheet (https://cricsheet.org/). The website is blocked in Pakistan, so you’ll need a VPN to access it.

Cricsheet provides ball-level data for all international matches and domestic leagues (PSL, IPL etc.). The dataset is in YAML format, which you can parse with Python or other tools. If you want to avoid the hassle of converting and constructing your own dataset, you can find the dataset I’ll be using here.

Pakistan Super League

Below you can see a screenshot of the data for all PSL matches, with the ball, striking batsman, non-striking batsman, runs scored, innings, wicket and so on.

Screenshot for PSL matches dataset

Babar Azam has played 49 matches and faced 1507 balls in the PSL as of March 23rd 2021. You can get the per-match aggregates for Babar Azam using this code snippet.

import json
import numpy as np

# complete_data holds the ball-by-ball data for every PSL match
# babar keeps only the balls faced by Babar Azam
babar = complete_data[complete_data['batsman'] == 'Babar Azam']
# scores aggregates per match: runs scored, dismissal record, balls faced (count of 'index'), date and innings
scores = babar.groupby('Match').agg({'batsmen runs':'sum','wicket':'first','index':'count','date':'first','innings':'first'}).reset_index()
# Matches without a dismissal record are marked 'N/A' and labelled 'censored' (he remained not out)
scores['wicket'] = scores['wicket'].replace(np.nan, 'N/A')
scores['wicket_type'] = [json.loads(x.replace('\'','\"'))['kind'] if x!='N/A' else 'censored' for x in scores['wicket']]

Here is the aggregated dataset.

Aggregates for Babar Azam, with runs scored in each match, balls played (index) etc.

Let’s visualize some of the results.

Fig. 1: Wicket type. Fig. 2: Balls played in the match and runs scored, colored by opposing team.
Fig. 3: Maximum runs scored against each team and balls played in the match. Fig. 4: Frequency of each outcome.

From the above graphs we can see that the most common way for Babar Azam to get out is being caught, followed by being bowled. He remains not out (censored) in 8 out of 49 matches. Secondly, we can see a roughly linear relationship between the number of balls played in a match and the runs scored, with his best total being 90 runs off 63 balls against Multan Sultans. Out of the 1507 balls he played in the PSL, 644 resulted in 1 run, 546 were dot balls (no runs), 191 went for 4 and 28 went for 6.

Averaging his score

As stated in the start, the most important metric for a batsman is how much he scores, as runs are what result in wins.

In his 49 PSL matches, Babar scored a total of 1774 runs off 1507 balls. That puts his average score per ball at 1.177. If we take his totals for each match and weight them by the number of balls played in the respective match, we get 51.177 average runs per match.

However, there is a problem with this kind of averaging: it is dominated by a few long innings. For example, suppose that in 4 matches he scores 100, 30, 50 and 0 runs off 70, 40, 60 and 1 balls respectively. His average is then 65.5 runs: (100*(70/171) + 30*(40/171) + 50*(60/171) + 0*(1/171)). This is misleading, as it suggests he should score around 65 runs in every match, while in our example he passes 65 in only one of the four matches. In cricket it is natural that the more balls a batsman plays, the more runs they score. So we should adjust his scores by the probability of him staying not out, or “surviving”, until the nth ball played in the match.
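To make the toy example concrete, here is a minimal sketch that recomputes the weighted figure next to the plain mean. The four (runs, balls) pairs are the made-up numbers from the paragraph above, not real PSL data, and the variable names are my own.

```python
# Toy figures from the example: (runs scored, balls faced) in four matches.
matches = [(100, 70), (30, 40), (50, 60), (0, 1)]

total_balls = sum(balls for _, balls in matches)   # 171
total_runs = sum(runs for runs, _ in matches)      # 180

# Plain mean: every match counts equally.
simple_mean = total_runs / len(matches)

# Ball-weighted mean: each match total is weighted by its share of balls faced.
weighted_mean = sum(runs * balls / total_balls for runs, balls in matches)

# Matches whose total actually exceeds the weighted mean.
above = [runs for runs, _ in matches if runs > weighted_mean]

print(round(simple_mean, 1), round(weighted_mean, 1), above)
# prints: 45.0 65.5 [100]
```

The weighted figure sits well above the plain mean because the 100-run innings also carries the largest weight.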

Another problem is that often batsmen remain not out by the end of the innings and we need to account for that. In statistics this is described as censoring, meaning we were not able to observe our subject until the event of interest occurs.

Survival analysis is an intriguing way of measuring the time until a certain event occurs. You can explore this article and other resources on the internet to get an in-depth explanation of the topic. In survival analysis it is typical to estimate a survival function S(t) that measures the probability of surviving until time t.

One of the most common techniques to estimate the survival function is the Kaplan-Meier estimator. Before delving into that, let’s define some terminology. Let d_i be the number of records (matches) in which the event of interest (getting out) occurs on the ith ball, and let n_i be the number of records (matches) in which he is still batting going into the ith ball, meaning the event has not yet occurred and the innings has not ended. A censored (not out) innings counts towards n_i up to the ball at which it was censored.

The Kaplan-Meier (KM) estimate of S(t) at time t=0 is always 1. Intuitively, this means that Babar is always not out if he hasn’t played a ball. Let’s say we have records of 10 matches, and in 9 of them Babar survived the 1st ball. Then his survival function at t=1 is (n_1 = 10, d_1 = 1):

S(1) = S(0) * (10-1)/10 or S(1) = S(0)*(1-1/10) = 0.9

If he survives the 2nd ball in 8 of those 9 matches (the ones in which he is still batting), then his survival function at t=2 is (n_2 = 9, d_2 = 1):

S(2) = S(1)*((9-1)/9) or S(2) = 0.9*(1-1/9) = 0.8
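The two steps above can be reproduced with a few lines of plain Python. This is only a sketch of the hand calculation, not how lifelines implements it; the function name and the ten innings below are hypothetical, chosen so that one dismissal falls on ball 1 and one on ball 2.

```python
def kaplan_meier(durations, observed):
    """Return {t: S(t)} for every ball t on which at least one dismissal occurred.

    durations: balls faced in each innings
    observed:  1 if the innings ended in a dismissal, 0 if not out (censored)
    """
    survival = {0: 1.0}
    s = 1.0
    for t in sorted(set(durations)):
        # n_t: innings still in progress going into ball t
        n_t = sum(1 for d in durations if d >= t)
        # d_t: dismissals that happened exactly on ball t
        d_t = sum(1 for d, o in zip(durations, observed) if d == t and o == 1)
        if d_t:
            s *= 1 - d_t / n_t
            survival[t] = s
    return survival

# Hypothetical 10 innings: out on ball 1, out on ball 2, and eight longer innings.
durations = [1, 2, 10, 12, 15, 20, 25, 30, 40, 50]
observed = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

curve = kaplan_meier(durations, observed)
print(round(curve[1], 3), round(curve[2], 3))
# prints: 0.9 0.8
```

Censored innings never multiply the curve down; they only leave the at-risk count once their last ball has passed.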

More generally, if we unroll the recursion we get the general formula for the survival function, a product over every ball i up to t:

S(t) = \prod_{i \le t} (1 - d_i / n_i)

Don’t be intimidated by this formula or these calculations. Take the time to read up more, and it will become clearer as you follow along. We can use Python to compute the Kaplan-Meier estimates with the lifelines library. To apply the Kaplan-Meier fitter you need two lists: one that stores the durations, in this case the balls played in each match, and a second that flags whether the event was observed (1 if he got out, 0 if the data point was censored because he was not out at the end of the match). The aggregated dataset above already has an index column that gives the duration and a column for the wicket type, so you can fit a Kaplan-Meier fitter as follows.

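import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

# The 'index' column holds the balls played in each match
durations = scores['index']
# True when the innings was censored, i.e. wicket_type is 'censored' because he was not out
scores['Event'] = scores['wicket_type'] == 'censored'
# lifelines expects 1 when the dismissal was observed and 0 when the innings was censored
event_observed = [0 if censored else 1 for censored in scores['Event']]
kmf = KaplanMeierFitter()
# Fit the data into the model
kmf.fit(durations, event_observed, label='Kaplan Meier Estimate')
# Plot the survival function (the style is set first so that it applies to the new figure)
plt.style.use('ggplot')
fig, ax = plt.subplots(figsize=(9,5))
kmf.plot(ci_show=False, ax=ax, label='Survival Probability')
ax.set_xlabel('Balls Played')
ax.set_ylabel('Survival Probability')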
Survival function for Babar Azam: the y-axis denotes the probability of survival, the x-axis denotes balls played.

You can notice the flat parts in between that indicate we haven’t observed him getting out in those periods. Also in this case the survival probability is never zero due to censored data points where Babar remains not out. Let’s look at his survival times individually.

Survival timelines for each match, the blue lines are instances where he remained Not Out (censored) and the red dot tells when he was out.

As indicated by the above graphs, we can see how the probability of Babar Azam getting out changes with balls played. We might be interested in his median survival time, or mathematically, the number of balls at which his survival probability reaches 0.5, which in this instance is 34 balls. Kaplan-Meier curves also allow us to compare different groups. In this case we can see how Babar’s survival functions differ based on the opposing team.
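As a rough illustration of what the median means, the sketch below walks a Kaplan-Meier curve until it first falls to 0.5 or below, once for all innings together and once per group. The seven innings and the grouping by innings number are hypothetical; with lifelines you would fit one KaplanMeierFitter per group instead.

```python
def km_median(durations, observed):
    """Smallest ball count t at which the Kaplan-Meier S(t) falls to 0.5 or below."""
    s = 1.0
    for t in sorted(set(durations)):
        at_risk = sum(1 for d in durations if d >= t)
        outs = sum(1 for d, o in zip(durations, observed) if d == t and o)
        s *= 1 - outs / at_risk
        if s <= 0.5:
            return t
    return None  # the curve never reaches 0.5 (too many not-out innings)

# Hypothetical innings: (balls faced, 1 = out / 0 = not out, group label)
innings = [(4, 1, 1), (9, 1, 2), (15, 1, 1), (22, 0, 2),
           (45, 1, 2), (50, 1, 2), (60, 0, 1)]

overall = km_median([b for b, _, _ in innings], [o for _, o, _ in innings])

# Split durations and event flags by group, then take the median of each group.
by_group = {}
for balls, out, group in innings:
    durs, obs = by_group.setdefault(group, ([], []))
    durs.append(balls)
    obs.append(out)
medians = {g: km_median(d, o) for g, (d, o) in by_group.items()}

print(overall, medians)
# prints: 45 {1: 15, 2: 45}
```

With this few innings per group the medians are very noisy, which is worth keeping in mind when reading the per-team curves.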

Comparison of survival functions based on Opposing Team

In general, if the survival function of one group is above that of another group, its survival probability is higher. In our comparison, that means he is less likely to get out against that team. Another way of comparing survival curves statistically is the log-rank test. For a detailed explanation you can read this article. For simplicity’s sake, all you need to know is that, like other statistical tests such as the z-test, it looks for differences in the curves and tests whether those differences are statistically significant. We need to apply the log-rank test pairwise for each pair of curves. Once again you can use the lifelines library’s built-in functions to do this easily.

Result of the Log-Rank Test for all different teams
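For readers curious about what the test computes, here is a hand-rolled sketch of the two-group log-rank statistic using only the standard library. In practice lifelines.statistics provides logrank_test and pairwise_logrank_test; the function below and the two sets of innings are hypothetical stand-ins, written out to show the observed-minus-expected logic.

```python
import math

def logrank_test(dur_a, obs_a, dur_b, obs_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    all_durations = list(dur_a) + list(dur_b)
    all_observed = list(obs_a) + list(obs_b)
    event_times = sorted({t for t, o in zip(all_durations, all_observed) if o})
    o_minus_e = 0.0  # observed minus expected dismissals in group A
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for d in dur_a if d >= t)  # at risk in A
        n_b = sum(1 for d in dur_b if d >= t)  # at risk in B
        d_a = sum(1 for d, o in zip(dur_a, obs_a) if d == t and o)
        d_b = sum(1 for d, o in zip(dur_b, obs_b) if d == t and o)
        n, d = n_a + n_b, d_a + d_b
        o_minus_e += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    stat = o_minus_e ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical balls faced against two opponents (1 = out, 0 = not out)
team_a = ([5, 12, 20, 33, 41], [1, 1, 1, 0, 1])
team_b = ([8, 15, 27, 38, 52], [1, 1, 0, 1, 1])

stat, p_value = logrank_test(*team_a, *team_b)
print(round(stat, 3), round(p_value, 3))
```

A p-value at or above the chosen threshold means the test cannot distinguish the two curves, which is different from showing that they are the same.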

If we take our significance level as 0.05, we can see that none of the curves are statistically different from one another: the p-value is not less than 0.05 in any comparison. This means that, as per the test, Babar Azam’s survival probability is not significantly impacted by the team he is playing against. However, the log-rank test is limited in that it can be affected by the curves crossing, which in this case they do. In such cases differences in the early stages may be offset by differences in the later stages, which could be one reason why the test doesn’t see statistically significant differences.

So our next best alternative is to look at chunks of the survival curves and see if there are differences. Looking just at the shapes of the curves, we can see that Babar is most likely to survive when batting against Multan Sultans. In the early stages of the game his survival curve for Peshawar Zalmi is the lowest, indicating that he is more likely to lose his wicket early against them than against any other team. Yet in the later stages he survives better against them than the baseline. After 30 balls or so Babar is least likely to survive against Quetta Gladiators, but he survives better than the baseline against them in the middle overs of his innings.

Let’s compare his first innings survival with his 2nd innings survival.

Survival Curves based on innings — whether he is setting the score (1st) or chasing (2nd)

One interesting thing you can note, although the differences are not statistically significant on the whole (as per the log-rank test), is that the tail of the 2nd innings curve is above the tail of the 1st innings curve, meaning that he is more likely to survive at the end of the 2nd innings than at the end of the 1st. In the middle overs the curves invert, with the 1st innings curve on top, suggesting that he is more likely to survive the middle overs if he is setting the score as opposed to chasing.

Now, to the main objective: averaging his score. One way to go about it is to take his average score per ball, multiply it by the survival probability at each ball and sum the results. That gives his survival-adjusted average, which comes out to 38.25 runs. This metric is better weighted than the simple average, but it would be naïve to assume that his average score per ball doesn’t change as he plays longer. The graph below shows his score per ball for each of the 49 matches he played in the PSL.
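Here is a minimal sketch of that survival-adjusted average on three made-up innings. It assumes the sum runs over S(t) for t = 1 up to the longest innings, which is one reasonable reading of the description above; the helper name and the data are hypothetical.

```python
def km_step_curve(durations, observed):
    """Kaplan-Meier S(t) for t = 0..max(durations), as a list indexed by ball number."""
    curve, s = [1.0], 1.0
    for t in range(1, max(durations) + 1):
        at_risk = sum(1 for d in durations if d >= t)
        outs = sum(1 for d, o in zip(durations, observed) if d == t and o)
        if at_risk and outs:
            s *= 1 - outs / at_risk
        curve.append(s)
    return curve

# Hypothetical innings: balls faced, runs scored, 1 = out / 0 = not out
durations = [10, 20, 30]
runs = [8, 25, 40]
observed = [1, 1, 0]

runs_per_ball = sum(runs) / sum(durations)    # 73 / 60
curve = km_step_curve(durations, observed)

# Expected balls survived: sum of S(t) over balls 1..30 (assumed convention)
expected_balls = sum(curve[1:])
adjusted_average = runs_per_ball * expected_balls

print(round(expected_balls, 2), round(adjusted_average, 2))
# prints: 19.33 23.52
```

Because the runs per ball is held constant here, the result is just that rate scaled by the area under the survival curve.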

Fig. 1: Match-by-match score per ball, color-coded by opposing team. Fig. 2: Average score per ball by over.

As can be seen in Fig. 1, he usually scores less than 0.2 runs per ball in matches in which he gets out within the first 15 balls. However, as he survives longer his score per ball jumps and then keeps increasing at a decreasing rate. In matches where he survives more than 40 balls his runs per ball stabilizes in a range between 1 and 1.4. The relationship looks like a monotonically increasing function with a decreasing rate of increase. Fig. 2, which plots his average score per ball against the over in which his innings ended, shows this functional relationship more clearly. Mathematically, there are numerous curves you can fit to such a dataset, like polynomial and logarithmic functions. In the absence of regressors in our dataset, the best option seems to be to fit the curve with the smallest squared difference. You can use SciPy’s curve_fit to define and fit mathematical functions to variables. I fitted two functions of the following forms:

  1. y = a*log(x +c) — parameters a & c
  2. y = a*(x^c +d) — parameters a,c,d

SciPy will try to find the parameters that best fit the data. The graph shows the results.

Functions fitted against the data
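A sketch of how the two functional forms could be fitted with scipy.optimize.curve_fit and compared by their sum of squared residuals. The ten data points, the starting guesses (p0) and the bounds are illustrative assumptions, not the values behind the figure.

```python
import numpy as np
from scipy.optimize import curve_fit

def log_curve(x, a, c):
    """Form 1: y = a*log(x + c)"""
    return a * np.log(x + c)

def power_curve(x, a, c, d):
    """Form 2: y = a*(x^c + d)"""
    return a * (x ** c + d)

# Hypothetical (balls faced, runs per ball) observations
balls = np.array([2, 5, 8, 12, 18, 25, 33, 42, 50, 60], dtype=float)
runs_per_ball = np.array([0.30, 0.56, 0.64, 0.79, 0.87, 0.99, 1.05, 1.14, 1.17, 1.24])

# Non-negative bounds keep x + c positive so the logarithm stays defined.
log_params, _ = curve_fit(log_curve, balls, runs_per_ball,
                          p0=(0.3, 1.0), bounds=(0, np.inf))
power_params, _ = curve_fit(power_curve, balls, runs_per_ball,
                            p0=(0.4, 0.3, -0.5), maxfev=10000)

def sum_squared_error(func, params):
    residuals = runs_per_ball - func(balls, *params)
    return float(np.sum(residuals ** 2))

sse_log = sum_squared_error(log_curve, log_params)
sse_power = sum_squared_error(power_curve, power_params)
print("log fit:", log_params, "SSE:", sse_log)
print("power fit:", power_params, "SSE:", sse_power)
```

The power form has one more parameter than the log form, so a slightly lower error on so few points does not by itself make it the better model.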

You can pick the curve which has the least error. You can define the error as the sum of squared differences between the fitted and actual values (dividing it by the number of points gives the mean squared error). In this particular case the logarithmic fit has the lower error. Now that we can predict the score per ball based on how long his innings lasted, let’s see how our predicted totals for the matches fare against the actual runs that Babar scored.

Predicted scores vs Actual scores plotted against balls played

Using both the survival function and the runs predictor, we can construct ranges for his runs. You can use lifelines’ conditional_time_to_event_ property to get the median remaining survival time given that he has already survived a certain number of balls.

Timeline column states how long he has already survived and the next column tells the median time until he gets out.
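The idea behind that table can be sketched by hand: given that he has survived t balls, find how many more balls it takes for the survival curve to fall to half of its value at t. The code below is my own plain-Python approximation of that logic on hypothetical innings, not the lifelines implementation.

```python
import math

def km_step_curve(durations, observed):
    """Kaplan-Meier S(t) for t = 0..max(durations), as a list indexed by ball number."""
    curve, s = [1.0], 1.0
    for t in range(1, max(durations) + 1):
        at_risk = sum(1 for d in durations if d >= t)
        outs = sum(1 for d, o in zip(durations, observed) if d == t and o)
        if at_risk and outs:
            s *= 1 - outs / at_risk
        curve.append(s)
    return curve

def median_remaining(curve, survived):
    """Balls until S(t) drops to half of S(survived); math.inf if it never does."""
    target = 0.5 * curve[survived]
    for t in range(survived, len(curve)):
        if curve[t] <= target:
            return t - survived
    return math.inf

# Hypothetical innings: balls faced and 1 = out / 0 = not out
durations = [4, 9, 15, 22, 30, 45, 60]
observed = [1, 1, 1, 1, 0, 1, 0]
curve = km_step_curve(durations, observed)

for survived in (0, 10, 50):
    print(survived, median_remaining(curve, survived))
# prints:
# 0 22
# 10 35
# 50 inf
```

The infinite value at 50 balls reflects the not-out innings at the end of the data: the curve never halves again, so no median can be read off.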

Combining our score fitter and the median time until he gets out, you can plot the score he is expected to make given how long he has already survived.

Fig. 1: Median additional score he is going to make given the number of balls he has already survived. Fig. 2: The x-axis shows how long Babar has currently played/survived; the y-axis shows end-of-match scores based on the median survival time from that point.

From Fig. 1, you can see the additional score he stands to make if he has already survived up to a certain point. The most interesting aspects are the spikes in the curve, which can be interpreted as milestones after which his batting performance increases.

As you can see from Fig. 2, when Babar survives the first 20 balls he scores 60+ runs 50% of the time, and when he survives longer than 40 balls he scores 70+ runs 50% of the time. His median score at the beginning is around 40 runs. Using this we can estimate, during a match, how much Babar is likely to score.

Ending Remarks

Although I tried to construct the best models possible, one important thing to remember is that no model is perfect. These models rely heavily on assumptions. There may also be insights that I was unable to see or examine, so I encourage you to play around with the dataset I shared to gain more valuable insights.

I plan on exploring the topic of analytics and cricket further and next up would like to compare Babar with other star players of the game.

With that being said, data science and artificial intelligence are great tools to learn. I hope you found the article interesting. If so, please do follow me on Medium, as I aim to explore more interesting things beyond cricket. If you have any feedback or questions, please don’t hesitate to ask in the comments.
