Wednesday, May 21, 2014

Predicting the Indian Elections - A Win for Data Science

The exit polls for the recently concluded Indian elections threw up a spectrum of results. Several cable TV networks ran their own polls, and most of their numbers fell within a seemingly reasonable range. The exception was a public research group called 'Todays Chanakya', whose numbers were off the charts, predicting a massive win for Narendra Modi. People began to average these polls to come up with an 'expected result', and many of these 'polls of polls' excluded TC's result as an outlier, discarding it as unbelievable.

I spent quite a bit of time looking at the meager information provided on the Todays Chanakya (TC) website before the results were announced. Buzzwords aside, what caught my attention was the meticulous care they took to obtain a representative data sample in every single constituency. Their prior track record in predicting elections in India was simply stunning. In a recent state election too, their prediction was an outlier, and it turned out to be accurate. This data sampling step is important, especially given the incredibly diverse nature of India's population. Translating projected vote shares into actual seats won in India's 'first past the post' system is an incredibly daunting problem. If your sample is even slightly off, your seat predictions can be way off, regardless of the sophistication of the predictive analytics you employ. Human judgment and domain expertise are critical.

As this useful blog points out, it's not about 'sampling error', but sampling bias. And once we see this, it is not difficult to see why the English TV networks of India, virtually every single one a willing and well-compensated participant in the witch hunt of Narendra Modi since 2002, fail miserably in their predictions, time and again. Their reporting has rarely been fact-driven, and is usually ratings-driven. Few, if any, on their payroll are trained in the rigorous scientific method. Reporters appear to be hired based on ideology, Western-accented English-speaking ability, and political connections rather than merit or technical proficiency. So when, by force of habit, you look for a sample that you like, you will only get the predictions you want viewers to see on your TV shows, which have little to do with reality. As is now known, the media witch hunt against Modi, like their exit polls, was never fact-driven. It was doomed from the start. After this election, few will take their "predictions" seriously again unless they reform.

TC's predictions were quite accurate. Modi indeed won in a landslide as they predicted, with the incumbent Nehru dynasty (aka "UPA" coalition), whose corruption almost surely qualifies as a crime against humanity, getting deservedly annihilated. On election day, at around 1-2:00 AM EST, while I was following the election trends, UPA was leading in about a hundred of the 543 seats up for grabs, way higher than the range of 61-79 seats that TC predicted they would get. However, as the day progressed, it was amazing to see UPA's leads petering out one by one, as if an invisible rope were magically pulling it back into the predicted range. Statistical destiny. Only two parties appeared to be convinced about the result before May 16: TC, who adopted a scientific approach to gathering and analyzing data, and Narendra Modi, who created the history in the first place. Both of them dared to be different and put their reputations on the line, and were worthy winners.

This election result and Modi becoming the Prime Minister of India have taught many of us a scientific lesson. Data science is about being guided by facts, not emotion, prejudiced opinion, or preferred outcome. Carefully constructed fact-driven methods are less likely to fail. Gujarat's development, both rural and urban, spearheaded by Modi for 12 years, is real, and cannot be falsified. It happened, and it is there to be seen regardless of what the New York Times tells you. I blogged in 2012 that the heavy lifting done in Gujarat may pay rich dividends in the future. The people there lived that development and they knew, and thousands of migrants returned from Gujarat to other states and spoke about their experience there. TC's data sample accurately reflected this reality. The media heads sitting in Delhi, London, and New York were high on ideology-meth, low on fact. Few visited the state of Gujarat to make a factual assessment. Some of the open-minded critics who did ended up becoming Modi's strongest supporters. Not surprisingly, his fact-driven campaign won him every single parliamentary seat there. The amazing number of Indians who voted for Modi, cutting across religious, class, language, age, gender, and geographical 'barriers', cannot be brushed aside either. Facts cannot be ignored until time travel becomes practical.

And here's another prediction, an easy one. Modi will probably become India's best and most unifying leader since Mahatma Gandhi, if he isn't already that. If, as the Nehru dynasty says, "power is poison", India has surely found its Shiva.


  1. On a related topic, here is an interesting question on statistics, which got triggered off when I was watching the election results unfold on May 16th:
    Counting was underway, and partial vote counts were available for practically all the constituencies. At some point, NDA had leads in about 305 constituencies out of all 543. At this point, Prannoy Roy, with all his experience and wisdom, remarked: "As counting gets completed, this lead is likely to swell and NDA could reach 320 seats". As it turned out, NDA ended up with 336. I wondered whether it was a mere coincidence, or whether there is a probabilistic basis for Prannoy's observation. I think I figured out the answer.

    This phenomenon is the result of the "first past the post" system, where the winner takes all. The NDA won 62% of the seats but only 38.5% of the votes. During vote counting, early leads are influenced more by the vote share: for instance, if only one vote is counted in each constituency, the expected value of NDA leads would be only 38.5% of the seats (after making the simplifying assumption that the 38.5% vote share is uniform across constituencies). As more votes get counted, the probability of the dominant party taking the lead starts increasing and the expected value of leads starts moving towards the final result. In other words, if there is a dominant party, its domination gets established more and more firmly as a larger portion of the votes is counted.
    Another simple corner case would make this even clearer. Let's say one party gets around 50%-60% of the votes in every single constituency (God forbid that happens). Then they would end up getting all the seats. But leads based on counting only a fraction of the votes would be significantly fewer (with just one vote being counted, they would lead in about 50-60% of the seats). As more votes are counted, the leads would inch towards 100% of the seats.
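
The corner case above can be checked with a few lines of Python. This is only a rough sketch under assumptions that are not in the comment itself: exactly two parties, every counted vote an independent draw with the same vote share in every constituency, and an odd number of counted votes so that ties cannot occur. The function name `lead_probability` is made up for this illustration. Under these assumptions, the probability of leading in one constituency is also the expected fraction of seats in which the party leads.

```python
from math import comb


def lead_probability(share: float, counted: int) -> float:
    """Probability that a party with vote share `share` holds a strict
    majority among `counted` independently drawn votes (two-party model)."""
    need = counted // 2 + 1  # smallest vote count that is a strict majority
    return sum(
        comb(counted, k) * share**k * (1 - share) ** (counted - k)
        for k in range(need, counted + 1)
    )


if __name__ == "__main__":
    share = 0.55  # hypothetical uniform vote share of the dominant party
    for counted in (1, 11, 101, 501):  # odd counts, so no ties
        p = lead_probability(share, counted)
        print(f"{counted:4d} votes counted: expected share of seats led = {p:.3f}")
```

With a single vote counted, the expected share of leads equals the vote share itself, and it climbs towards 100% as more votes are counted, which is the drift described above. Real counting is not a random draw of ballots, so this shows the direction of the effect, not its size.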

    1. Fascinating and insightful observations. Will re-read and return via a separate post.


