Dualnoise: October 2012

Sunday, October 28, 2012

Analytics and Cricket - IX : Book cricket v/s T20 cricket

Introduction
This previous post on cricket in this tab can be found here. We discovered how long a game of snakes and ladders is expected to last a while ago. Calculating the duration of a book-cricket game appears to be relatively simpler. It's a two-player game that used to be popular among kids in India and requires a book (preferably a thick one with strong binding), a pencil, and a sheet of paper for scoring. A player opens a random page, and notes down the last digit on the left (even numbered) page.

(image linked from krishcricket.com)

A page value of 'zero' indicates that the player is out, and an '8' indicates a single run (or a no-ball). The remaining possibilities in the sample space {2, 4, 6} are counted as runs scored off that 'ball'. A player keeps opening pages at random until they are out. Here's a sample inning from a simple simulation model of traditional book-cricket:
6, 2, 2, 1, 4, 4, 1, 2, 1, 2, 4, 2, 1, 4, 6, 6, 6, 4, 0
score is 58 off 19 balls
The counting process terminates when the first zero is encountered. Given this game structure, we try to answer two questions: What is the expected duration of a player's inning, and what the expected team total is (i.e., across 10 individual innings).

Conditional Probability Model
Assume a page is opened at random and the resultant page values are IID (uniform) random variables.

Let p(i) = probability of opening page with value i, where
p(i) = 0 if i is odd, and equals 0.2 otherwise.

D = E[Duration of a single inning]

S = E[Score accumulated over a single inning]

Conditioning on the value of the first page opened, and noting that the counting process resets for non-zero page values:
D = 1*0.2 + (1+D)*4 *0.2
⇒ D = 5.
Next, let us compute F, the E[score in a single delivery]:
F = 0.2*(0+2+4+6+1) = 2.6 runs per ball, which yields a healthy strike rate of 260 per 100 balls

S = FD = 13 runs per batsman, so we can expect a score of 130 runs in a traditional book-cricket team inning that lasts 50 balls on average.

Introduction of the Free-Hit
The International Cricket Conference (ICC) added a free-hit rule to limited overs cricket in 2007. To approximate this rule, we assume that a page ending in '8' results in a no-ball (one run bonus, like before) that also results in a 'free hit' the next delivery, so the player is not out even if the number of the next page opened ends in a zero. This change will make an innings last slightly longer, and the score, a little higher. Here's a sample inning (a long one):
1, 6, 1, 0, 1, 4, 2, 2, 4, 6, 1, 6, 2, 4, 2, 4, 6, 6, 6, 4, 1, 1, 6, 0,
score is 76 off 24 balls
Note that the batsman was "dismissed" of the 4th ball but cannot be ruled 'out' because it is a free-hit as a consequence of the previous delivery being a no-ball. All such free-hit balls are marked in bold above.

D = 1*0.2 + (1+D)*0.2 + (1+D)*0.2 + (1+D)*0.2 + (1+d)*0.2
= 1.0 + 0.6D + 0.2d
where d = E[duration|previous ball was a no-ball]. By conditioning on the current ball:
d = (1 + d)*prob{current ball and previous ball are no-balls} + (1+D)*prob{current ball is not a no ball but previous ball was a no ball)
= (1+d)*0.2 + (1+D) * 0.8
⇒ d = 1.25+D

⇒ D = 1 + 0.6D + 0.2(1.25+D)
⇒ D = 6.25

Under the free-hit rule, a team innings in book-cricket lasts 62.5 balls on average, which is 12.5 page turns more than the traditional format. A neat way to calculate S is based on the fact that the free-hit rule only increases the duration of an inning on average, but cannot alter the strike rate that is based on the IID page values, so S = 6.25 * 2.6 = 16.25. To confirm this, let us derive a value for S the hard way by conditioning on the various outcomes of the first page turn:
S = 0*0.2 + (S+2)*0.2 + (S+4)*0.2 + (S+6)*0.2 + (s+1)*0.2
= 2.6 +0.6S + 0.2s.
where s = E[score|current ball is a no-ball] and can be expressed by the following recurrence equation:
s = (1 + s)*prob{next ball is a no-ball} + (r+S)*prob{next ball is not a no-ball), where
r = E[score in next ball | next ball is not a no-ball]
= 0.25*(0 + 2 + 4 + 6) = 3

Substituting for r, we can now express s in terms of S:

s = (1+s)*0.2 + (3+S) * 0.8
⇒ S = 2.6 + 0.6S + 0.2(3.25+S) = 16.25, as before.

Under the free-hit rule, the average team total in book cricket is 162.5 runs (32.5 runs more than the total achieved in the traditional format). The average strike rate based on legal deliveries, i.e. excluding no-balls, is 162.5 * 100/(0.8*62.5) = 325 per 100 balls. A Java simulation program yielded the following results:
num trials is 10000000
average score per team innings is 162.422832
average balls per team innings is 62.477603
average legal balls per team innings is 49.981687
scoring rate per 100 legal balls is 324.9646855657353

Result: In comparison to real-life T20 cricket (~ 120 balls max per team inning), book-cricket is roughly 50% shorter in duration, but the higher batting strike rate usually yields bigger team totals in book cricket. The fact that we can even rationally compare statistics between these two formats says something about the nature of T20 cricket!

The cost of front-foot no-balls and big wides in T20
We can use the simple conditional probability ideas used to analyze book-cricket to estimate the expected cost of bowling a front-foot no-ball and wide balls in real-life T20 matches by replacing the book-cricket probability model with a more realistic one:

Assume p[0] = 0.25, p[1] = 0.45, p[2] = 0.15, p[3] = 0.05, p[4] = 0.05, p[5] ~ 0, p[6] = 0.05, p[7, 8, ...] ~ 0.

E[score in a ball] = 0 + 0.45 + 0.3 + 0.15 + 0.2 + 0.3 =1.4

This probability model yields a reasonable strike rate of 140 per 100 balls)

E[cost | no ball] = 1 + 1.4 + 1.4 = 3.8

Bowling a front-foot no-ball in T20 matches is almost as bad as giving away a boundary (apart from paying the opportunity cost of having almost no chance of getting a wicket due to the no-ball and the subsequent free-hit). Similarly,
E[cost | wide-ball down the leg-side] = (5|wide and four byes)*prob{4 byes} + (1| wide but no byes)*prob{no byes} + 1.4.

Assuming a 50% chance of conceding 4 byes, the expected cost is 4.4. On average, a bowler may be marginally better off bowling a potential boundary ball (e.g., bad length) than risk an overly leg-side line that can result in 5 wides and a re-bowl.

More sophisticated simulation models based on actual historical data can help analyze more realistic cricketing scenarios and support tactical decision making.

Tuesday, October 23, 2012

The Gaussian Hare, the Laplacian Tortoise, and Big Data

Alternative Optimal Solutions
A recurrent theme of this tab is to highlight an important contribution of O.R decision modeling: alerting us to the presence of alternative optimal solutions (AOS). Prior posts relating to AOS can be found here and here.

Customers unfamiliar or uncomfortable with the jagged-edged linear programming (LP) models often seek refuge within the smooth world of classical calculus (a world now known to have been initially crafted by Kerala mathematicians, but i digress). Here, alpine slopes of gracefully changing functions invariably allow you to rapidly ski down to a uniquely best solution. Like the truism "Football is a simple game; 22 men chase a ball for 90 minutes and at the end, the Germans win", an applied math legend is that the Gaussian hare invariably trumps the Laplacian tortoise. Unless of course, modern day optimization solvers and decision science come into play and allow Laplace to overcome the various complications posed by non-smoothness. (Image below linked from http://www.econ.uiuc.edu)

Degeneracy
Well specified decision models in a variety of industrial situations admit plenty of good quality answers, so the problem is usually one of having 'too many' rather than too few options (my favorite academic researchers tackle increasingly difficult unsolved problems, whereas my favorite OR practitioners compete on identifying the easiest unsolved problems). A fundamental thumb-rule of practical decision optimization modeling is to advantageously exploit and defuse this often-deadly problem of 'degeneracy' that characterizes practical LP formulations, and a reasonably skilled analyst can turn a potential numerical liability of the LP model into a business analytical asset as follows.

AOS often hide in plain sight
The presence of alternative answers forces us to revisit our translation of business rules into an LP and devote some time toward gainfully analyzing the differences in the potential business impact of these seemingly equal solutions. The customer can use this feedback to re-examine and further refine his/her business goals and priorities. This process of iteratively improving the design specification provides valuable insight to customers, and helps setup a richer, and a more practical and robust optimization model. Not that a similar exercise is impossible to accomplish using smooth approximations - just that the flexibility afforded by an LP model is often superior, and the tool kits for analyzing LP solutions have gotten amazingly better over the years.

Means and Ends
It just doesn't make sense to crunch through 21st century "Big Data" analytics using bleeding edge data mining, econometric, and machine learning methods on the one hand, and on the other hand, downgrade to 19th century techniques, or random black-box methods to manage the underlying decision optimization problems and produce severely suboptimal and non-robust solutions. Using such shortcuts because "a quick one-iteration improvement is all that is needed" brings along with some risky side effects and potentially leaves a big chunk of 'Big-Data' value on the table. Do everybody a favor and upgrade to a Laplacian tortoise (e.g. CPLEX) and you will be surprised to see how fast it runs, especially on Big Data.

Friday, October 12, 2012

Quoting an Optimal Price

Suppose you are an eBay-like online store operator selling wonderful home furniture sets made in Indonesia. For example a typical sofa set you sell may consist of a central offering in the form of a large three-person couch that is accompanied by a couple of matching chairs and tables (sample picture from an online wholesale dealer below).

While the center-piece tends to catch the eye of couch potatoes, you would like your customers to also buy the accessory pieces that go with this core product to boost your margin. To make this an attractive proposition, you recognize that every customer has a different willingness to pay and budget in mind. So rather than employ an inflexible fixed-price all-or-nothing approach, or risk a full-blown auction, you keep prospective buyers interested by allowing them to buy what they like and even quote their own price that you either accept or counter with a higher price. Unlike you, the buyer cannot see the price offers accepted and rejected in the past. How should you price a customer request taking all this information into account?

This scenario, as many would recognize, is not uncommon at all and plays out across multiple industries in the real and virtual world. Practitioners of Operations Research and Business Analytics creatively design a variety of probabilistic optimization models that allow a seller to provide a real-time price quote to a prospective buyer that maximizes the statistical chance of a profitable sale while staying within a customer's intended budget. Such mathematical bid-pricing methods help create a win-win situation by matching a customer's willingness-to pay with a seller's willingness to sell.

Where can you read more about such customized pricing models for bid response? A good start is the superb book on Pricing and Revenue Management by Robert Phillips or this pdf link. And if you are lucky enough to attend the Informs 2012 annual conference in Phoenix, AZ along with thousands of OR fans, a brilliant colleague of mine will be presenting a novel real-world instance not involving furniture, and talk about its tremendous business impact measured using non-monopoly currency. If real-life Integer Programming based probabilistic bid pricing can wake you up very very early in the morning, this talk is scheduled for 8 A.M, October 17.

Tuesday, October 9, 2012

Predicting the Future Size of the Nehru Dynasty

A glance through Indian newspapers will tell you about corruption in the highest places - specifically within India's so-called first family of Gandhis. Those not familiar with Indian politics would be surprised to find 'Gandhi' and 'corruption' in the same sentence. The puzzle is quickly resolved once you discover that this Gandhi family thankfully does not have the Mahatma in their family tree. This is the family of Nehrus, and at some point, a surviving daughter married a relatively unknown chap bearing that hallowed last name, engineering the most profitable branding coup the world has ever seen.

It is easy to write reams about how this family has institutionalized poverty and corruption in India over the last 60 years but it suffices for the purposes of this post to note that starting a few years prior to India's political independence from the British in 1947, members of the Nehru dynasty have directly or indirectly controlled (and destroyed) the futures of several hundred million Indians. Sadly, their level of incompetence has increased every generation, and as their numbers slowly grow, it becomes important for Indians to know this: how many dynasty members will a person have to get through to reclaim power in New Delhi in the future? Take a quick look at these time-series data in 20-year chunks:

Date    Number     Dynasty members
1924-1944:    0              No Nehru calling the shots
1944-1964:    1              Jawaharlal Nehru
1964-1984:    1              Indira Nehru Gandhi
1984-2004:    2              Rajiv Gandhi & Sonia Gandhi
2004-2014:    3              SoniaG, RahulG, & PriyankaG

----------------------------------------------------------------

2014-2024:    3              SoniaG, RahulG, & PriyankaG
2024-2044:    5              SG, RG, PG and PG's two children

The numbers below the dashed line are future predictions based on the current family count, and assuming that Priyanka's two kids today (Rahul is unmarried with no pending paternity cases) will be baptized into the family tradition of absolute power in their 20s-30s, like every generation before them.

Of course, it should not be surprising that the counts shown above are exactly the first six numbers of the Fibonacci sequence. There is another interesting case of poetic injustice in the nomenclature that is hidden here. Calling them Fibonacci numbers perpetuates an injustice to the mathematicians in India who discovered the series a long time before Fibonacci and unlike the Gandhi-Nehru mix up, this fact was well-known outside India too.