Traveling Surveyor: Contributions of Mahalanobis to Analytics - 2

Bengal
This post is the second in a series of blogs based on the work done by P. C. Mahalanobis in the areas of statistics, analytics, and operations research between 1930 and 1960. We move north-east from the location of our prior post on storm-flood forecasting (Orissa) to Bengal: a land of science, wisdom, and dharma that gave the world a Vivekananda, whose thoughts deeply affected Gandhi's contribution to India's freedom struggle. Gandhi in turn shaped the work of Martin Luther King, Jr., and Nelson Mandela, and thus the civil liberties of a significant population of the world.

Much of the discussion here is gleaned from ISI archives, websites, and various papers from Sankhya, ISI's flagship journal. Those interested in a more detailed and accurate analysis of this work are referred directly to ISI's journal material.

Jute
Bengal (including Bangladesh, formerly East Bengal) produces much of the world's Jute. India is the largest producer and consumer of Jute today, followed by Bangladesh. Jute is an incredibly useful crop and has been a significant contributor to Bengal's revenue for a long time. Here are some contemporary pictures of standing Jute crop in Bengal.



(pics source: informedfarmers.com)

Why this work is important
Prior to 1947, when India was still occupied by the British Raj (there's a very relevant reason for bringing this up, but we'll get to that later), forecasting the supply of this valuable and lucrative crop in Bengal was largely a product of bad guesswork. Like most other sectors in India, the agricultural sector is highly decentralized, which meant a myriad of tiny farms growing Jute and other crops, all of which had to be surveyed if one wanted an exact, enumerated production number. Mahalanobis came up with an alternative in the 1930s and 1940s using methods derived from statistics and a field that is now termed 'Operations Research': a method for accurate crop forecasting that was optimized for scarce resources. Today, the Government of India employs sophisticated remote sensing, including a Satellite Survey System, to improve crop forecasts, but the methods developed then are still relevant and valuable. The seminal work of Mahalanobis in developing an optimal sample-based survey is also interesting to read from a practitioner's perspective. The combination of ideas employed in this work includes data analysis, statistical modeling, a pilot study, scarce resource allocation, and mathematical optimization, and it ranks among the great achievements in the practice of Operations Research and Analytics.

A map of undivided Bengal, circa 1850 C. E. (source: http://jrahman.files.wordpress.com)

Motivation
Jute and cotton were two of the most important exports out of India after the manufacturing sector was crippled by the British Raj - 24% of the total revenue between 1927 and 1937 was from Jute. Up until the 1940s, estimating the total Jute produce in Bengal was largely a matter of guesswork, and the ad-hoc estimates provided by the administrative chain of the British Raj varied wildly. Like other parts of India, cultivation in Bengal was decentralized and spread over nearly 100 million small farms, which were on average less than half an acre in area and covered more than 60,000 sq. miles. Jute was grown in a subset of these farms. Furthermore, the cultivation lifecycle of Jute is very short - about two months from planting to harvesting. Consequently, even if the administration was willing to cough up the expenses for an enumeration survey, covering all these farms within 8-9 weeks would be extremely expensive, if not impossible. Add to this the fact that many plots (30%) that cultivated Jute also cultivated other crops in parallel. Thus, while in theory we can expect a total enumeration to give us near-zero error, in practice, apportioning multi-crop areas to Jute, along with other human-induced errors, would introduce noise. In fact, the report states that the biggest negative associated with an enumerative survey was not the prohibitive cost but its unreliability. This motivated Mahalanobis to develop and implement an alternative approach based on random sampling that accomplished the task at a fraction of the cost and time, and at a higher level of accuracy.

Random Sampling
The nearly 100M farms were spread over Bengal in a non-homogeneous manner. Some areas were densely cultivated with Jute, some sparsely. The approach was to partition the total area into zones, i = 1, ..., n (with area A_i), each of which was internally homogeneous (somewhat like the way finite element analysis is used in structural engineering). Within each zone, a number of areas or grids were selected and sampled at random. If a sufficient number of such grids are sampled, the average proportion of area under Jute within a zone (J_i) can be obtained, which allows us to predict the total area under Jute = sum(i) A_i * J_i.
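The zone-wise estimate above can be sketched in a few lines of Python. This is only an illustration of the formula sum(i) A_i * J_i; the function name, the data layout, and the numbers are made up for the example and do not come from the survey reports.

```python
def estimate_jute_area(zones):
    """Estimate total area under Jute as sum(i) A_i * J_i.

    `zones` is a list of (zone_area, grid_proportions) pairs, where
    grid_proportions holds the fraction of each randomly sampled grid
    found to be under Jute (hypothetical data layout).
    """
    total = 0.0
    for zone_area, grid_proportions in zones:
        # J_i: average proportion of area under Jute in zone i
        j_i = sum(grid_proportions) / len(grid_proportions)
        total += zone_area * j_i
    return total


# Made-up numbers: two zones with areas in arbitrary units
example_zones = [
    (1000.0, [0.40, 0.50, 0.45]),        # densely cultivated zone
    (2500.0, [0.10, 0.05, 0.15, 0.10]),  # sparsely cultivated zone
]
estimated_area = estimate_jute_area(example_zones)
```

Here the first zone contributes 1000 * 0.45 and the second 2500 * 0.10, so a dense zone of modest size can matter more than a large sparse one.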

Decision variables
1. The partition of the total area into approximately homogeneous zones
2. The number of random samples within a zone
3. The area of a sample

For simplicity, we assume that the first decision of partitioning the area within Bengal is an external input and thus our focus is on optimizing the remaining two decisions.

Constraints
1. The cost of the whole operation depends on the second and third sets of decision variables. For a given budget, if the area of an individual sample is large, then the number of samples has to be reduced, and thus the samples would be more spread out and further away from each other.

2. The achievable precision (variance) varies similarly. If the sample area is large, the per-sample variance is smaller, but cost considerations limit the number of such large samples, and this can hurt the overall variance accumulated across the zone. On the other hand, a smaller area in tandem with a larger number of such small samples affects precision in the opposite manner.

Nonlinear Optimization Problem
Given either cost or precision as a hard constraint, select the sampling area and the number of random samples to maximize precision, or minimize cost.

Mahalanobis' approach attempts to model the change in variance and cost as continuous functions of the two sets of decision variables. Once these functions are at hand, a local optimum is obtained using a derivative-based Lagrange-multiplier method. Mahalanobis used this approach to tabulate the achievable precision for a range of cost levels.
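To get a feel for the trade-off, here is a minimal Python sketch of the "fix the budget, maximize precision" version of the problem. Everything numeric in it is an assumption made for illustration: the variance model (per-sample variance falling as a power of the sample area), the cost model (an on-site cost per sample plus a travel cost growing with the square root of the number of samples), and all the constants. It also uses a plain grid search over the sample area rather than the Lagrange-multiplier method described above, so it should be read as a toy, not as a reconstruction of the original calculation.

```python
import math

# All functional forms and constants below are illustrative assumptions,
# NOT the fitted functions from the Bengal Jute survey.
S2, G = 1.0, 0.5                      # per-sample variance modeled as S2 / a**G
C_STAY_FIXED, C_STAY_AREA = 1.0, 0.5  # on-site cost per sample: fixed + per unit area
C_TRAVEL = 2.0                        # travel cost modeled as C_TRAVEL * sqrt(n)


def cost(n, a):
    """Total cost of n samples of area a under the assumed cost model."""
    return n * (C_STAY_FIXED + C_STAY_AREA * a) + C_TRAVEL * math.sqrt(n)


def variance(n, a):
    """Variance of the zone average from n samples of area a (assumed model)."""
    return S2 / (a ** G) / n


def max_samples(budget, a):
    """Largest n found such that cost(n, a) stays within the budget."""
    per_sample = C_STAY_FIXED + C_STAY_AREA * a
    # cost is quadratic in x = sqrt(n); take the positive root
    x = (-C_TRAVEL + math.sqrt(C_TRAVEL ** 2 + 4 * per_sample * budget)) / (2 * per_sample)
    n = int(x * x)
    while n > 0 and cost(n, a) > budget:  # guard against floating-point overshoot
        n -= 1
    return n


def best_design(budget, areas):
    """Grid search: for each candidate area, spend the budget, keep the lowest variance."""
    best = None
    for a in areas:
        n = max_samples(budget, a)
        if n == 0:
            continue
        v = variance(n, a)
        if best is None or v < best[2]:
            best = (a, n, v)
    return best


candidate_areas = [0.25 * k for k in range(1, 81)]  # 0.25, 0.50, ..., 20.0
a_opt, n_opt, v_opt = best_design(1000.0, candidate_areas)
```

Repeating `best_design` over a range of budgets gives a table of achievable precision against cost, which is the kind of table the text describes. With these made-up constants the best sample area lands in the interior of the grid: very small samples waste the budget on per-sample overhead and travel, while very large samples leave too few of them.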

An exploratory, small-scale (pilot) survey was initially conducted at a small expense to validate the methodology, as both a proof-of-concept and a proof-of-technology, prior to embarking on a full-scale project. This type of approach is now widely adopted in many business analytics projects.

The effect of the decision variables on variance can be calculated using theoretical methods. However, human-induced errors were also common, and Mahalanobis used the idea of interpenetrating half-sample pairs, where two groups independently arrived at Jute area estimates for a given location. There are many important details here that are left out for brevity. The cost calculation is detailed and empirical, depends on the nature of the survey, and includes, among other things:
a. cost of staying and surveying at a given site - this depends on the size of the sampled area and time spent
b. cost of traveling from sample to sample - this depends on the distances between the chosen samples and the sequence of visiting.
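Returning briefly to the interpenetrating half-sample idea mentioned above: one simple way to use paired estimates is to look at how much the two groups disagree. The Python sketch below is a minimal illustration under stated assumptions, not the procedure from the reports. It assumes both groups are unbiased and independent with the same error variance, in which case half the mean squared difference estimates that variance. The function name and the numbers are hypothetical.

```python
from statistics import mean


def compare_half_samples(estimates_a, estimates_b):
    """Compare paired Jute-proportion estimates from two independent groups.

    estimates_a[k] and estimates_b[k] are the two groups' estimates for
    the same location k (hypothetical data layout).
    Returns (combined_mean, mean_difference, per_group_error_variance).
    """
    diffs = [a - b for a, b in zip(estimates_a, estimates_b)]
    combined = [(a + b) / 2 for a, b in zip(estimates_a, estimates_b)]
    # A mean difference far from zero hints at a systematic bias in one group
    mean_difference = mean(diffs)
    # Under the stated assumptions E[(a - b)^2] = 2 * sigma^2
    per_group_error_variance = mean([d * d for d in diffs]) / 2
    return mean(combined), mean_difference, per_group_error_variance


# Made-up estimates for four locations
group_a = [0.30, 0.42, 0.25, 0.50]
group_b = [0.32, 0.40, 0.27, 0.46]
combined_mean, mean_diff, error_var = compare_half_samples(group_a, group_b)
```

A large error variance or a persistent mean difference would point to human-induced error that the theoretical sampling variance alone does not capture.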

Again, we have left out a huge number of the cost calculations that were done. Reading the reports that came out of this work, one is amazed by the time and effort devoted to meticulously tabulating the various costs, going well beyond 'ball-park' estimates, to produce an accurate cost function. For example, cost calculation (b) depends on the solution to the corresponding traveling salesman problem.

The TSP
One of the many reports that came out of this project notes:
(source: Sankhya journal, 1940)

This cost calculation is reviewed by Applegate, Bixby, et al. in their book on TSP and in Bill Cook's 'In Pursuit of the Traveling Salesman'. The literature review of this TSP in these books mentions that researchers later showed that the expected length of the optimal tour through samples in a unit square lies approximately between 0.707 and 1.27 times the square root of the number of samples visited, so Mahalanobis' 1930s estimate was a remarkably good choice.
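The square-root scaling is easy to poke at numerically. The Python sketch below drops random points in a unit square, builds a tour with the simple nearest-neighbour heuristic, and divides its length by the square root of the number of points. Note the assumptions: the heuristic does not find the optimal tour, so the ratio it produces only bounds the optimal one from above, and the function names and the seed are choices made for this example.

```python
import math
import random


def nearest_neighbour_tour_length(points):
    """Length of a closed tour built greedily: always visit the closest unvisited point."""
    unvisited = list(points[1:])
    current = points[0]
    total = 0.0
    while unvisited:
        nxt = min(unvisited, key=lambda p: math.dist(current, p))
        total += math.dist(current, nxt)
        unvisited.remove(nxt)
        current = nxt
    total += math.dist(current, points[0])  # return to the starting point
    return total


def scaled_tour_length(n, seed=0):
    """Heuristic tour length through n uniform random points, divided by sqrt(n)."""
    rng = random.Random(seed)
    points = [(rng.random(), rng.random()) for _ in range(n)]
    return nearest_neighbour_tour_length(points) / math.sqrt(n)


ratio = scaled_tour_length(400, seed=1)
```

Trying a few values of `n` shows the ratio staying roughly constant as `n` grows, which is what makes a square-root rule usable as a travel-cost term in a survey budget.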

Results and Business Impact
The cost- and precision-controlled random sampling approach proved to be revolutionary. It achieved greater precision at a fraction of the cost. Specifically, the margin of error was +/- 2%, and the cost was 1/15 of that of an enumeration census that was performed the same year and found to be less accurate than the random sampling approach. Thus the benefit and return-on-investment of this analytical approach were successfully demonstrated in practice. The approach received widespread recognition and was later embraced by the Government of independent India for many nationwide surveys.

Prelude to Part-3: The Bengal Holocaust
Within a couple of years of the successful demonstration and publication of this work, Mahalanobis' Bengal lost between 2 and 6 million people to starvation and disease during 1942-1945, triggered in part possibly by a failure of the rice crop. The British Raj, locked in a grim Atlantic battle during WW2, may have suppressed reports and figures. It appears that most of the world, and even a vast majority of Indians, to this day, remain unaware of the reality behind this event. How does one obtain a reasonable estimate of casualties due to this disaster? Who was responsible and how? A recent book has brought this controversy into the open, and it appears that Mahalanobis (and his statistical sampling method) may have played a critical part in solving this puzzle.

To be continued.
