
Sunday, June 16, 2013

Traveling Surveyor: Contributions of Mahalanobis to Analytics - 2

Bengal
This post is the second in a series of blogs based on the work done by P. C. Mahalanobis in the area of statistics, analytics, and operations research between 1930 and 1960. We move north-east from the location of our prior post on storm-flood forecasting (Orissa) to Bengal: a land of science, wisdom, and dharma that gave the world a Vivekananda, whose thoughts deeply affected Gandhi's contribution to India's freedom struggle, which in turn shaped the work of Martin Luther King, Jr., and Nelson Mandela, and thus the civil liberties of a significant portion of the world's population.

Much of the discussion here is gleaned from ISI archives, websites, and various papers from Sankhya, ISI's flagship journal. Those interested in a more detailed and accurate analysis of this work are referred directly to ISI's journal material.

Jute
Bengal (including Bangladesh, formerly East Bengal) produces much of the world's jute. India is the largest producer and consumer of Jute today, followed by Bangladesh. Jute is an incredibly useful crop and has been a significant contributor to Bengal's revenue for a long time.  Here are some contemporary pictures of standing Jute crop in Bengal.



(pics source: informedfarmers.com)

Why this work is important
Prior to 1947, when India was still occupied by the British Raj (there's a very relevant reason for bringing this up, but we'll get to that later), forecasting the supply of this valuable and lucrative crop in Bengal was largely a product of bad guesswork. Like most other sectors in India, the agricultural sector too is highly decentralized, which means a myriad of tiny farms growing jute and other crops, all of which had to be surveyed if one wanted an exact, enumerated production number. In the 1930s and '40s, Mahalanobis came up with an alternative using methods derived from statistics and a field that is now termed 'Operations Research': a scarce-resource-optimized method for accurate crop forecasting. Today, the Government of India employs sophisticated remote sensing, including a satellite survey system, to improve crop forecasts, but the methods developed then are still relevant and valuable. The seminal work of Mahalanobis in developing an optimal sample-based survey is also interesting to read from a practitioner's perspective. The combination of ideas employed in this 1930s work (data analysis, statistical modeling, pilot studies, scarce-resource allocation, and mathematical optimization) ranks among the great achievements in the practice of Operations Research and Analytics.

A map of undivided Bengal, circa 1850 C. E. (source: http://jrahman.files.wordpress.com)

Motivation
Jute and cotton were two of the most important exports out of India after the manufacturing sector was crippled by the British Raj: 24% of the total revenue between 1927 and 1937 came from Jute. Up until the 1940s, estimating the total Jute produce in Bengal was largely guesswork, and ad-hoc estimates provided by the administrative chain of the British Raj produced wildly varying numbers. Like other parts of India, cultivation in Bengal was decentralized and spread over nearly 100 million small farms, on average less than half an acre in area, scattered across more than 60,000 sq. miles; Jute was grown in a subset of these farms. Furthermore, the cultivation lifecycle of Jute is very short: about two months from planting to harvesting. Consequently, even if the administration was willing to cough up the expenses for an enumeration survey, covering all these farms within 8-9 weeks would be extremely expensive, if not impossible. Add to this the fact that many plots (about 30%) that cultivated Jute also cultivated other crops in parallel. Thus, while in theory we can expect a total enumeration to give us near-zero error, in practice, apportioning multi-crop areas to Jute, along with other human-induced errors, would introduce noise. In fact, the report states that the biggest negative associated with an enumerative survey was not the prohibitive cost but its unreliability, and this motivated Mahalanobis to develop and implement an alternative approach based on random sampling that accomplished the task at a fraction of the cost and time, and at a higher level of accuracy.

Random Sampling
The nearly 100M farms were spread over Bengal in a non-homogeneous manner: some areas were densely cultivated, some sparsely. The approach was to partition the total area into zones i = 1, ..., n (with areas A_i) that were internally homogeneous (kinda like the way finite element analysis is used in structural engineering). Within each zone, a number of areas or grids were selected and sampled at random. If a sufficient number of such grids are sampled, the average proportion of the area under Jute within zone i (call it J_i) can be estimated, which allows us to predict the total area under Jute as sum over i of A_i * J_i.
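
To make the estimator concrete, here is a minimal sketch of how the zone-level sample proportions roll up into a total-area estimate, along with the usual stratified-sampling error bound. Every number below (zone areas, grid counts, proportions) is invented for illustration and has nothing to do with the actual survey figures.

# Minimal sketch of the zonal (stratified) estimator described above.
# All numbers are made up for illustration; they are not survey data.
import numpy as np

rng = np.random.default_rng(0)

# A_i: area of each zone (sq. miles); one array of sampled grid proportions per zone.
zone_areas = np.array([5000.0, 12000.0, 8000.0])            # hypothetical
sampled_grid_props = [                                       # proportion of each sampled grid under jute
    rng.beta(2, 8, size=40),    # sparsely cultivated zone
    rng.beta(5, 5, size=60),    # densely cultivated zone
    rng.beta(3, 6, size=50),
]

total_estimate = 0.0
total_variance = 0.0
for A_i, props in zip(zone_areas, sampled_grid_props):
    J_i = props.mean()                      # estimated proportion under jute in zone i
    se_i = props.std(ddof=1) / np.sqrt(len(props))
    total_estimate += A_i * J_i             # contribution to sum_i A_i * J_i
    total_variance += (A_i * se_i) ** 2     # zones are sampled independently

print(f"estimated area under jute: {total_estimate:,.0f} sq. miles "
      f"(+/- {1.96 * np.sqrt(total_variance):,.0f})")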

Decision variables
1. The partition of the total area into approximately homogeneous zones
2. The number of random samples within a zone
3. The area of a sample

For simplicity, we assume that the first decision of partitioning the area within Bengal is an external input and thus our focus is on optimizing the remaining two decisions.

Constraints
1. The cost of the whole operation depends on the second and third sets of decision variables.  For a given budget, if the area of an individual sample is large, then the number of samples has to be reduced, and thus the samples would be more spread out and further away from each other.

2. The achievable precision (variance) varies similarly. If the sample area is large, the per-sample variance is smaller, but cost considerations limit the number of such large samples, and this can hurt the overall variance accumulated across the zone. On the other hand, a smaller sample area in tandem with a larger number of such small samples affects precision in the opposite manner.

Nonlinear Optimization Problem
Given either cost or precision as a hard constraint, select the sample area and the number of random samples so as to maximize precision for a given budget, or minimize cost for a target precision.

Mahalanobis' approach attempts to model the change in variance and cost as continuous functions of the two sets of decision variables. Once these functions are at hand, a local optimum is obtained using a derivative-based Lagrange-multiplier method. Mahalanobis used this approach to tabulate the achievable precision for a range of cost levels.
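
As a rough illustration of the kind of trade-off being optimized, here is a small sketch that picks the number of samples per zone and the sample area to minimize variance under a budget. The functional forms for variance and cost below are assumptions made up for this sketch; Mahalanobis fitted his functions from pilot-survey data, and they were certainly not these.

# Sketch of the sample-count vs. sample-area trade-off: minimize variance
# subject to a survey budget. Functional forms are assumed, for illustration only.
import numpy as np
from scipy.optimize import minimize

budget = 1000.0                      # hypothetical total budget (arbitrary units)

def variance(x):
    n, a = x                         # n samples per zone, each of area a
    per_sample_var = 0.5 / a ** 0.7  # assumed: larger plots -> smaller per-sample variance
    return per_sample_var / n

def cost(x):
    n, a = x
    enumeration = 2.0 * n * a        # assumed: time on site grows with plot area
    travel = 5.0 * np.sqrt(n)        # assumed: tour length grows roughly as sqrt(n) (cf. the TSP note below)
    return enumeration + travel

res = minimize(variance, x0=[50.0, 1.0],
               constraints=[{"type": "ineq", "fun": lambda x: budget - cost(x)}],
               bounds=[(1.0, None), (0.05, None)])
n_opt, a_opt = res.x
print(f"samples per zone ~ {n_opt:.0f}, sample area ~ {a_opt:.2f}, "
      f"variance ~ {variance(res.x):.2e}, cost ~ {cost(res.x):.0f}")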

An exploratory, small-scale (pilot) survey was initially conducted at a small expense as a proof-of-concept and proof-of-technology validation of the methodology prior to embarking on a full-scale project. This type of approach is now widely adopted in many business analytics projects.

The effect of the decision variables on variance can be calculated relying on theoretical methods. However, human-induced errors were also common, and Mahalanobis used the idea of interpenetrating half-sample pairs, where two groups independently arrived at Jute area estimates for a given location (a toy sketch of this idea appears right after the cost list below). There are many important details here that are left out for brevity. The cost calculation is detailed and empirical, depends on the nature of the survey, and, among other things, includes:
a. the cost of staying and surveying at a given site, which depends on the size of the sampled area and the time spent;
b. the cost of traveling from sample to sample, which depends on the distances between the chosen samples and the sequence in which they are visited.
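
Here is a toy sketch of the interpenetrating-sample idea: two field parties independently estimate the same zone, and a disagreement that is large relative to pure sampling noise points to non-sampling (human) error. The numbers and the 'bias' knob are synthetic, purely for illustration.

# Toy sketch of interpenetrating (duplicate) sub-samples: two field parties
# independently estimate the same zone; disagreement beyond sampling noise
# hints at non-sampling (human) error. Synthetic numbers, for illustration only.
import numpy as np

rng = np.random.default_rng(1)
true_prop = 0.32                          # true proportion under jute in some zone
n_grids = 50

def field_party_estimate(bias):
    # each party samples its own random grids; 'bias' mimics human/recording error
    obs = rng.normal(true_prop + bias, 0.10, size=n_grids)
    return obs.mean()

est_a = field_party_estimate(bias=0.00)
est_b = field_party_estimate(bias=0.03)   # second party systematically over-records

combined = 0.5 * (est_a + est_b)
half_sample_spread = abs(est_a - est_b)
sampling_se = 0.10 / np.sqrt(n_grids)     # what chance alone should produce

print(f"party A: {est_a:.3f}, party B: {est_b:.3f}, combined: {combined:.3f}")
print(f"spread {half_sample_spread:.3f} vs. sampling-only scale ~{2 * sampling_se:.3f}")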

Again, we have left out a humongous amount of cost calculation that was done. Reading the reports that came out of this work, one is amazed by the time and effort devoted to meticulously tabulating the various costs, going beyond 'ball-park' estimates to produce an accurate cost function. For example, cost calculation (b) depends on the solution to the corresponding traveling salesman problem.

The TSP
One of the many reports that came out of this project notes:
(source: Sankhya journal, 1940)

This cost calculation is reviewed by Applegate, Bixby, et al. in their book on the TSP and in Bill Cook's 'In Pursuit of the Traveling Salesman'.  The literature review of this TSP in these books mentions that researchers later showed that the expected length of the optimal tour through samples scattered in a unit square lies approximately between 0.707 and 1.27 times the square root of the number of samples visited, so Mahalanobis' 1930s estimate was a remarkably good choice.
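
A quick way to see that square-root growth is to run a heuristic tour (nearest-neighbor here, so it somewhat overestimates the optimal-tour constant) through random points in a unit square and look at length divided by the square root of the number of points:

# Monte Carlo check of the ~ k*sqrt(n) growth of tour length through n random
# points in a unit square. Nearest-neighbor heuristic, so an upper-ish estimate
# of the optimal-tour constant; illustration only.
import numpy as np

rng = np.random.default_rng(42)

def nn_tour_length(points):
    unvisited = list(range(1, len(points)))
    tour = [0]
    length = 0.0
    while unvisited:
        last = points[tour[-1]]
        dists = np.linalg.norm(points[unvisited] - last, axis=1)
        j = int(np.argmin(dists))
        length += dists[j]
        tour.append(unvisited.pop(j))
    length += np.linalg.norm(points[tour[-1]] - points[tour[0]])  # close the tour
    return length

for n in (100, 400, 1600):
    lengths = [nn_tour_length(rng.random((n, 2))) for _ in range(5)]
    print(f"n={n:5d}: tour length / sqrt(n) ~ {np.mean(lengths) / np.sqrt(n):.3f}")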

Results and Business Impact
The cost- and precision-controlled random sampling approach proved to be revolutionary. It achieved greater precision at a fraction of the cost. Specifically, the margin of error was +/- 2%, and the cost was 1/15 that of an enumeration census performed the same year, which was found to be less accurate than the random-sampling approach. Thus the benefit and return on investment of this analytical approach were successfully demonstrated in practice; the method received widespread recognition and was later embraced by the Government of independent India for many nationwide surveys.

Prelude to Part-3: The Bengal Holocaust
Within a couple of years of the successful demonstration and publication of this work, Mahalanobis' Bengal lost between 2 and 6 million people to starvation and disease between 1942 and 1945, triggered in part, possibly, by a failure of the rice crop. The British Raj, locked in a grim Atlantic battle during WW2, may have suppressed reports and figures. It appears that most of the world, and even a vast majority of Indians, to this day remain unaware of the reality behind this event. How does one obtain a reasonable estimate of casualties due to this disaster? Who was responsible, and how? A recent book has brought this controversy into the open, and it appears that Mahalanobis (and his statistical sampling method) may have played a critical part in solving this puzzle.

To be continued.

Sunday, December 2, 2012

Managing Black Swans using Operations Research

LPHC Events
The importance of modeling Low-Probability High-Consequence (LPHC) events, popularized in recent times by Nassim Taleb as "Black Swans", has been recognized by Operations Researchers for quite some time. As this recent article (via Reddit) implies, it is both demystifying and useful to talk of consequential rare events in terms of a range of probabilities rather than a potentially misleading binary classification of black/white swans. For example, LPHC modeling is critical to identifying minimum-risk Hazmat routes, a popular research area within OR. In addition to standard risk measures such as the expected probability of an accident occurring on a given route, or the total expected cost of a chosen route, planners also explicitly limit the conditional expected cost associated with a catastrophic event along a prescribed route if in fact such an event were to occur along any link in the route. Practical OR models aim for robustness by never focusing on just one objective, given that extremal solutions tend to faithfully follow the no-free-lunch rule. OR planners typically track a variety of risk objectives and seek alternative optimal solutions that adequately address the multiple modes of failure that can be triggered over time.
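
As a simple illustration of the difference between these risk measures, here is a link-level calculation for one hypothetical hazmat route; the probabilities and consequence costs are invented:

# Illustrative risk measures for one candidate hazmat route. For each link i,
# p[i] is the incident probability and c[i] the consequence cost if it occurs.
# All numbers are hypothetical.
p = [1e-6, 4e-6, 2e-6]
c = [5e6, 9e8, 2e7]

route_incident_prob = 1.0
for pi in p:
    route_incident_prob *= (1 - pi)
route_incident_prob = 1 - route_incident_prob

expected_cost = sum(pi * ci for pi, ci in zip(p, c))
# conditional expectation: average consequence given that an incident occurs somewhere
conditional_expected_cost = expected_cost / sum(p)   # good approximation for tiny p[i]

print(f"P(incident on route)     ~ {route_incident_prob:.2e}")
print(f"expected cost            ~ {expected_cost:,.0f}")
print(f"expected cost | incident ~ {conditional_expected_cost:,.0f}")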

Narrative-Rational, Time-Dependent World
Until recently, I thought that coin tosses were 50-50 events. In an unbiased, narrative-free world they are, but as Venkatesh Rao argues in his book 'Tempo' (and this understanding is subject to my limited ability to grasp and interpret the dense writing style employed in this book), there is no absolutely narrative-independent useful model of rational decision-making: "... there is no meaningful way to talk about specific decisions outside of a narrative frame and a concrete context, any more than it is possible to talk about physics without reference to a specific, physical coordinate frame ..."
Rao agrees with the claim in Taleb's 'Black Swan' book that introducing a narrative injects bias, but is disinclined to adopt the approach of severing the narrative from the decision-making process. Instead, Rao chooses to embrace narrative, and his book examines elegant and sound decision-making within a given enactment over the narrative clock. On the other hand, he feels that narrative-independent "Calculative rational decision-making finesses situatedness by working only with highly controlled clocks. When you start this way, seductive general theories of decision-making take center stage, and local conditions are forgotten, minimized or dismissed as “details.”" Perhaps, by zooming in on the narrative of a guy who makes decisions based on the outcome of 'vigorously tossing and catching a coin', researchers were able to recognize these subtle 'local conditions' that allowed them to predict the bias in real-life coin tosses.

Risk and Analytics
How do we deal with the risk associated with decisions prescribed by today's predictive analytical models embedded within decision support systems? This is an important issue since our underlying predictive models are calibrated using noisy data generated from past decisions. As long as our recommendations lie within the historical range, their consequences can be expected to be reasonably bounded. However, once we create new history (e.g. pricing a product at a value never seen before), our predictive model is forced to extrapolate. Per 'Tempo', we now exit a closed world and enter an open one, where "The open world is a world that includes what Donald Rumsfeld called “unknown unknowns” and Nicholas Nassim Taleb calls “black swans” (rare, highly consequential events)." In such instances, history is of limited use as a guide, and statistically driven analytics can fail. Rao says that such calculative rational [analytical models] "... can be viewed as a way of systematically leveraging or amplifying intuition. But intuition amplified through calculative-rational models, while necessary, is not sufficient for true risk management."
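
A crude but useful guard-rail is to flag any recommendation that steps outside the range of decisions the model was calibrated on. A minimal sketch, with hypothetical prices:

# Minimal guard-rail sketch: flag a recommended price that falls outside the
# range the model was calibrated on (all numbers hypothetical).
historical_prices = [3.49, 3.79, 3.99, 4.29, 4.49]

def is_extrapolating(recommended_price, cushion=0.05):
    lo, hi = min(historical_prices), max(historical_prices)
    return not (lo * (1 - cushion) <= recommended_price <= hi * (1 + cushion))

for price in (4.19, 5.99):
    tag = "extrapolating - review manually" if is_extrapolating(price) else "within history"
    print(f"recommendation {price:.2f}: {tag}")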

Welcome to Entropic Decision Optimization
One option to survive the open world is to have in place a theoretical model that anticipates the consequences of such unforeseen events.  'Tempo' suggests that "Informal human mental models can comprehend open worlds because they can contain things that haven’t been understood: a high entropy state." Although Rao rightly suggests that analytical methods for such weakly understood "high entropy" situations are in their infancy, having even a simple and reasonable model helps. For example, in a recently patented pricing optimization method, we attempt to reduce a user-specified measure of 'entropy' (in the form of business disruption) associated with recommendations, which has worked reasonably well within a retail narrative so far. However, in the narrative-rational world of 'Tempo', the laws of thermodynamics dictate that all good things must come to an end, regardless of whether we encounter a black swan or not. A simple "Frankenstein design principle" that we discussed a couple of years ago suggests that the design of particularly complex systems should explicitly assume a failure at some point in time. 'Tempo' chooses the example of 'Tetris' (rather than Blackjack or Poker, which analytics folks more commonly use) to illustrate that we can maximally delay the inevitable by chalking up elegant but temporary wins via better entropic decision making. A time-knapsack, if you will.
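
To give a flavor of what a disruption-penalized recommendation might look like, here is a generic sketch with assumed demand elasticities and an assumed disruption measure; it is not the patented formulation mentioned above.

# Generic sketch of penalizing 'disruption' (how far prices move from today's
# prices) alongside revenue. Elasticities, demands, and the penalty weight are
# assumptions for illustration.
import numpy as np
from scipy.optimize import minimize

current_prices = np.array([4.00, 2.50, 6.00])
elasticity = np.array([-1.8, -1.2, -2.2])     # assumed constant elasticities
base_demand = np.array([100.0, 250.0, 60.0])  # demand at current prices (assumed)
lam = 2000.0                                  # weight on the disruption penalty (assumed)

def revenue(p):
    demand = base_demand * (p / current_prices) ** elasticity
    return float(np.dot(p, demand))

def disruption(p):
    return float(np.sum(((p - current_prices) / current_prices) ** 2))

res = minimize(lambda p: -revenue(p) + lam * disruption(p),
               x0=current_prices, bounds=[(1.0, 10.0)] * 3)
print("recommended prices:", np.round(res.x, 2))
print(f"revenue {revenue(res.x):,.0f} vs. {revenue(current_prices):,.0f} today, "
      f"disruption {disruption(res.x):.3f}")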

Thursday, June 28, 2012

Time-lapse view of a Soap Opera

One can speed-watch a bad 3-hour Shah Rukh Khan Hindi movie, set in some implausible post-modern world, in 15 minutes or less, and yet fully grasp the plot and story. Somewhere down the line, that wonderfully poetic, song-and-artistic-dance-filled world of authentically Indian movies and television programming disappeared, and along with it, my patience, leading to this compressed viewing method. On the other hand, it turned out to be quite interesting to watch Jack Bauer save the U.S. from annihilation in a '24'-hour day, over a contiguous 24-hour period (on DVD) without any cuts. This provided a completely different experience and understanding compared to watching '24' in live one-hour capsules across 24 weeks.

To experiment with a larger data set, I decided to watch a popular (five days a week) Hindi TV serial that achieved high viewership ratings over its two-year lifetime (2008-10), and whose episodes were meticulously labeled and archived on the net by some fans. That data set had about 520 observations (episodes), which I speed-parsed over an exhausting six-week period.

A TV Serial Storyline is a Random Process
Time-lapse viewing allows one to recall past events and key plot twists relatively more clearly, thus making it a little easier to figure out the 'rate' at which the storyline converges. If you think of a TV serial storyline as an origin-destination path that passes through a number of intermediate and interlinked stages (or states), you can identify the type of state transitions, including the many instances of 'back to square one' in a time-lapse mode. The soap storyline returns to a previously visited state (i.e. cycles or sub-tours) with a significant nonzero probability, much to the viewer's chagrin. You can also spot the random walks where the storyline drifts away from the original theme. When these events happen too often in soaps, it is a dead-giveaway that the plot is on life-support and the plug will be pulled shortly.

The storyline of most TV serials appears to be remarkably Markovian in design. If you were to watch the series in real time (i.e. slowly), you would be able to skip many episodes and yet rejoin in a few weeks without really missing anything. In a soap opera, the future is indeed independent of the past, given the present.
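
A toy Markov-chain version of a soap storyline makes the 'back to square one' behavior easy to see; the states and transition probabilities below are invented for illustration.

# Toy Markov-chain view of a soap storyline: states are plot 'situations', and a
# noticeable probability mass flows back to earlier states ('back to square one').
# Transition probabilities are invented for illustration.
import numpy as np

states = ["setup", "conflict", "twist", "reconciliation", "back_to_square_one"]
P = np.array([
    [0.10, 0.60, 0.20, 0.05, 0.05],
    [0.05, 0.30, 0.40, 0.10, 0.15],
    [0.05, 0.25, 0.30, 0.20, 0.20],
    [0.10, 0.20, 0.20, 0.30, 0.20],
    [0.70, 0.10, 0.10, 0.05, 0.05],   # ...and the plot restarts from 'setup'
])

rng = np.random.default_rng(7)
episode, state = [], 0
for _ in range(20):
    state = rng.choice(len(states), p=P[state])
    episode.append(states[state])
print(" -> ".join(episode))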


Another O.R analogy
There are some intermediate episodes that generate peak ratings and a spike in viewer comments in the online archive. These often turn out to be events that appear to be 'pivotal' in the sense that they promise a lot of exciting ripple effects down the road. There is just the right amount of uncertainty in the future trajectory of the storyline, but with some enticing structure around it. For example, if you were to look back at the Harry Potter book series, this would be around year 4 at Hogwarts. It's similar to what one experiences when solving a combinatorial MIP using branch-and-price (i.e. column generation) in practice. Starting with the root node, as you begin to apply restrictions and generate new columns, your LP objective function value usually drops*, and half-way through, your solver promises a really good solution. The possibilities are exciting. Alas, as your restrictions increase, you get close to the end state, and your solution becomes increasingly integral, the objective suffers a backlash from those nasty combinatorial constraints and its value goes through the roof. Invariably, much like a TV or book series, the ending is almost never quite as exciting as one expects it to be, after investing all that time and effort. Perhaps, like life, good soaps are more about the journey and less about the destination.

*update (July 1, 2012)
The LP relaxation objective function value drops because:
a) the unrestricted subproblem in many large-scale planning problems is discrete, nonlinear, nonconvex; it is quite difficult to solve this problem to provable optimality within a reasonable amount of time

b) adding restrictions initially eliminates many bad solution areas, i.e. it simulates the role of inadvertent cutting planes, thereby improving solution quality.
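
To see the drop concretely, here is a toy column-generation loop on the textbook cutting-stock problem (made-up data, and obviously much simpler than the planning problems referred to above); the restricted master LP objective falls as each newly priced pattern is added.

# Toy column-generation loop for the textbook cutting-stock problem. Data are
# made up; the point is only to watch the restricted master LP objective drop
# as new columns (cutting patterns) are priced in.
import numpy as np
from scipy.optimize import linprog

roll_width = 100
sizes = np.array([45, 36, 31, 14])        # piece widths (hypothetical)
demand = np.array([97, 610, 395, 211])    # pieces required (hypothetical)

# start with one trivial pattern per piece size
patterns = [np.eye(len(sizes), dtype=int)[i] * (roll_width // sizes[i])
            for i in range(len(sizes))]

def solve_master(patterns):
    A = np.column_stack(patterns)                     # piece-by-pattern matrix
    c = np.ones(A.shape[1])                           # minimize number of rolls used
    return linprog(c, A_ub=-A, b_ub=-demand, method="highs")

def best_new_pattern(duals):
    # pricing subproblem: unbounded knapsack maximizing the dual value of a pattern
    value = np.zeros(roll_width + 1)
    best = np.zeros((roll_width + 1, len(sizes)), dtype=int)
    for w in range(1, roll_width + 1):
        for i, s in enumerate(sizes):
            if s <= w and value[w - s] + duals[i] > value[w]:
                value[w] = value[w - s] + duals[i]
                best[w] = best[w - s]
                best[w, i] += 1
    return value[roll_width], best[roll_width]

for it in range(25):
    res = solve_master(patterns)
    print(f"iteration {it}: restricted master LP objective = {res.fun:.2f}")
    duals = -res.ineqlin.marginals                    # dual prices on the demand rows
    value, pattern = best_new_pattern(duals)
    if value <= 1 + 1e-9:                             # no column with negative reduced cost
        break
    patterns.append(pattern)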

Wednesday, May 23, 2012

Optimizing your Microwave Oven Performance using an Appalam

A far-out post on 'optimization' and 'parameter estimation'.

No two microwave ovens behave the same way even if they have the same power rating and capacity, and over time they show increased randomness in terms of energy usage and solution quality. Each one has its idiosyncrasies, and as you move between apartments over the years, it is irritating to adjust to a new oven. After you invest all that time figuring out that the optimal setting to warm your cold coffee in the previous home is 50 seconds, using the same timing with the new one results in a lot of spilled coffee. Furthermore, objects get heated slightly quicker when placed in certain locations of the oven. What is a quick way to optimize your vessel placement and minimize your turn-around time? Enter the Appalam.

The Appalam, or the Pappad as it is called in Northern India, is a circular, flattened, wafer-thin dry mixture of lentils and spices. It is an almost fat-free, delicious snack if you microwave it, although it can also be fried. In particular, the best brand for this test is the Lijjat Pappad, which is (or used to be) hand-made. The Lijjat company rose from a tiny all-women cooperative start-up in rural India (based on Mahatma Gandhi's principles of self-reliance, theirs is a remarkable and inspiring success story) and is still going strong, producing outstanding Pappads in a variety of flavors.
 

The idea is pretty simple: The Lijjat Pappad (LP) has a relatively large surface area that covers 75-100% of the circular glass tray in most residential microwaves. Microwave the LP for about 60 seconds and note which of its parts get heated up first (visually seen via a change in color) and how the cooking progresses. Within 60 seconds, you should be able to figure out the hottest and the coldest spots in the oven. In my current residence, the middle of the oven turned out to be the coldest, which was the exact opposite of the result in my prior residence. Of course, this is not intended to be a universal setting and only applies to a subset of foods being heated up.

Why the LP works relatively well for such a test:

1. The consistency of the LP mix is quite remarkable. It is neither too thick to resist microwaving nor too thin, and it very rarely exhibits any significant warping even after 120 seconds of microwaving (it will simply get carbonized before it warps).

2. The material naturally does not conduct heat well and convection doesn't help much either, so the localized heating effects show up visibly.

3. You can make a meal of your experimental subject once your test is complete. No test goes waste!