Saturday, January 29, 2011

The honest politician and other rare events

This post is mildly motivated by the INFORMS blog challenge this month dealing with 'OR and politics'. This tab dabbles with OR-eyed viewpoints of Indian political events from time to time. Past posts on OR and politics can be found here, here and here. The idea for this post arose from real issues in OR practice and business analytics. Yet, there is an interesting element of politics as well.

Consider this hypothetical experiment. We select a list of many thousands of past political leaders from around the world and generate ratings on multiple attributes that provide valuable insight into their 'level of honesty', derived from fact-driven records during their leadership tenure. On the left-hand side, we have a yes/no binary indicator of whether that politician was generally considered "honest" or "dishonest". Our objective is simple: estimate the probability that an input politician is honest, given a set of scores for each of his/her performance attributes.

We use a binary logit model (i.e., logistic regression) to do this and calibrate its parameters on the historical data via maximum likelihood estimation. Since we have a fairly large sample size, we get a good model fit and hit all the right notes on confidence intervals and the like. But fitting well statistically and predicting well in real life are two different stories.
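As a toy sketch of this setup (the data, attribute names, and parameter settings are all invented for illustration, and the sample is kept small and balanced here; the rare-event wrinkle comes next), here is a binary logit fitted by maximum likelihood using plain gradient ascent:

```python
import math
import random

random.seed(1)

# Invented training data: two attribute scores per politician (say, 'record
# transparency' and 'promise-keeping'), each in [0, 1], plus an honest (1) /
# dishonest (0) label.
honest = [([random.uniform(0.6, 1.0), random.uniform(0.6, 1.0)], 1) for _ in range(10)]
dishonest = [([random.uniform(0.0, 0.4), random.uniform(0.0, 0.4)], 0) for _ in range(10)]
data = honest + dishonest

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Maximum likelihood estimation: gradient ascent on the log-likelihood.
w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(3000):
    gw, gb = [0.0, 0.0], 0.0
    for x, y in data:
        err = y - sigmoid(w[0] * x[0] + w[1] * x[1] + b)
        gw[0] += err * x[0]
        gw[1] += err * x[1]
        gb += err
    w = [w[0] + lr * gw[0] / len(data), w[1] + lr * gw[1] / len(data)]
    b += lr * gb / len(data)

def predict_proba(x):
    """Estimated probability that a politician with attribute scores x is honest."""
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)

print(predict_proba([0.9, 0.9]))  # high scores: looks like one of the good guys
print(predict_proba([0.1, 0.1]))  # low scores
```

On a tidy, balanced sample like this, the fitted model behaves exactly as advertised. The trouble starts when the "YES" class is as rare as honest politicians are.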

Politicians strongly rated as honest and statesmanlike are a rare species. Indian legend regards King Harishchandra as an exemplar of honesty in public life, which is not surprising given that he is said to have never uttered a single lie, and he greatly influenced the first person on the next list. More recently: 'Mahatma' Gandhi, Abe Lincoln, and Nelson Mandela. In current times, and keeping with contemporary mores, a Barack Obama (perhaps), a Dr. Abdul Kalam of India, or a Helen Clark of New Zealand, ..., the list of people keeping it on the level is quite short.

It is likely that we will find our predictive model is (far too) good at picking crooked politicians. If 99% of politicians are dishonest, then it is very easy to get a good fit. In fact, a one-line model that simply returns "crooked politician" is a good one: it is 99% accurate. However, this model is not very interesting. Our focus and curiosity are driven by finding those who fall in that elusive 1%, and a blanket "NO" model fails 100% in this regard. How well did our statistically calibrated predictive model fit the "YES" instances? Most likely it did a pretty poor job, flagging honest politicians at a rate far below their true frequency. In fact, if you were careless, your computer program may even have treated some of those 'YES' data points as noise or outliers! This situation is kind of like the inverse of the analytical problem of fraud detection (pun unintended). Consequently, if we fed the model, say, 'Honest' Abe Lincoln's attributes, we would be disappointed with the output. Our model has moved into the domain of truthiness. On the other hand, a 'monkey model' that randomly generates answers with a mean "YES" rate of 1% may be more useful. Our challenge is to do better than the monkey.
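The accuracy trap above takes only a few lines to reproduce (the 1000-politician sample is made up, matching the 99/1 split in the text):

```python
# Invented sample: 1000 politicians, 10 honest (label 1), 990 dishonest (label 0).
labels = [1] * 10 + [0] * 990

# The one-line model from the post: always answer "crooked politician".
predictions = [0 for _ in labels]

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
honest_found = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))

print(accuracy)      # 0.99 -- looks excellent on paper
print(honest_found)  # 0    -- finds none of the people we actually care about
```

A 99% accuracy number on a 99/1 sample tells you almost nothing; the metric that matters here is how many of the rare "YES" cases the model recovers.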

To do that, we turn to analytical work done in political science. Researchers there (and in areas like new drug discovery) often work with predictive math models for rare events, and a literature search in these areas indicates that there are quick (but not obvious) fixes to the plain-vanilla predictive models that we tend to use mechanically in OR projects. In particular, these corrections ensure that the natural imbalance inherent in the training data is accounted for in the right way and by the right amount.
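One such correction, sketched below, is in the spirit of the prior-correction fix from the rare-events logistic regression literature in political science (King and Zeng's work is the best-known example): if you rebalance the training sample so the model can learn the rare class, you must shift the fitted intercept back toward the true population rate before predicting. The numbers here (a 1% true rate, a 50/50 balanced sample, a zero fitted intercept) are invented for illustration:

```python
import math

tau = 0.01   # assumed true population rate of honest politicians
ybar = 0.50  # share of honest politicians in the artificially balanced training sample

def corrected_intercept(b0, tau, ybar):
    # Subtract the log of the sampling odds ratio from the fitted intercept.
    return b0 - math.log(((1 - tau) / tau) * (ybar / (1 - ybar)))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

b0 = 0.0  # hypothetical intercept fitted on the balanced sample
b0_adj = corrected_intercept(b0, tau, ybar)

print(sigmoid(b0))      # 0.5  -- naive baseline probability from the balanced sample
print(sigmoid(b0_adj))  # 0.01 -- baseline probability after correcting for rarity
```

The slope coefficients are left alone; only the baseline odds are pulled back down to match how rare honest politicians actually are.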

The lesson, if any, from this experiment is that the basic act of testing predictive models on hold-out (hidden) samples must never be bypassed. Fitting well to historical data is necessary for validation, but certainly not sufficient for a customer's satisfaction. It does NOT imply "useful predictor", not even if we have a lot of data. Furthermore, when we build a prescriptive analytical layer by embedding our predictive model within an optimization framework to determine the optimal attributes that maximize some objective, the ill effects of a bad predictive model become pronounced. Optimization magnifies the silliness of a bad prediction; it literally takes it to an extreme point. In fact, an advantage of having a prescriptive layer is that it can often tell you when the underlying predictive layer is playing politics with you.
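A minimal caricature of why the hold-out step cannot be skipped (all data invented): a 'model' that simply memorizes history fits it flawlessly and predicts nothing.

```python
import random

random.seed(7)

# Invented data: 1000 politicians, roughly 1% labeled honest (1), the rest dishonest (0).
data = [((random.random(), random.random()), 1 if random.random() < 0.01 else 0)
        for _ in range(1000)]
train, holdout = data[:800], data[800:]

# A deliberately silly 'model' that memorizes the training data outright.
memory = {x: y for x, y in train}

def predict(x):
    return memory.get(x, 0)  # unseen inputs default to the majority answer

train_acc = sum(predict(x) == y for x, y in train) / len(train)
honest_in_holdout = sum(1 for _, y in holdout if y == 1)
honest_found = sum(predict(x) == 1 for x, y in holdout if y == 1)
holdout_recall = honest_found / max(1, honest_in_holdout)

print(train_acc)       # 1.0 -- a flawless fit to history
print(holdout_recall)  # 0.0 -- finds no honest politician it hasn't already seen
```

No real model is this crude, but the same gap between historical fit and hold-out performance shows up, in milder form, whenever a calibrated model is never confronted with data it hasn't seen.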

Friday, January 7, 2011

About a microanalytical startup with 'OR Inside'

Continuing with the new year theme, we take a first look at CQuotient, an analytics-driven start-up in the retail industry based in the Boston area. I just finished reading this interview on a business info site. It's interesting to hear what founder and CEO Dr. Ramakrishnan has to say; he's also got a blog listed on the roll at the bottom right of this tab, though I don't expect frequent updates there for a while :). A couple of things were eye-openers.

They seem to be among the very first to sharply focus on individual customer behavior. Are we seeing some of the first practically viable applications of microanalytical (copyright, 2011 :) techniques this year? Retail science normally thrives on aggregating individual customers into sufficiently big bunches so that the law of large numbers kicks in; you can then reliably build statistical and econometric models to predict and optimize based on these high-level purchase patterns. A retail microanalytical approach that drills down to the individual customer level looks pretty challenging to pull off in reality, but given the team assembled at CQuotient and the computing power available today, I wouldn't be surprised if they are onto something here.

Next, CQ will provide an 'optimal prescriptive' answer to a retailer. This convinces me that they have an application with "OR Inside", and their 'coolness quotient' just went up :). Rather than just dumping a bunch of charts and qualitative insights on a tired, caffeine-deprived store-manager type and wishing them good luck, CQ seems to take it a step further and provides optimal decision recommendations to the retailer. Practical decision analytics can give you a pretty powerful edge, since it can potentially eliminate or minimize a lot of costly guesswork. In the retail industry, which is characterized by wafer-thin margins and brutal competition, such OR-based innovations can be a big deal.

A minor grouse is that the word "OR" doesn't show up in the interview, but the content shows that all the good stuff is likely to be hidden inside. The scope for OR in the new world remains undiminished, especially if somebody is brave enough to dip their hands in messy data and put their money where their model is!

Sunday, January 2, 2011

Skills for new graduates to succeed at OR practice

The first post this year is for OR students who plan to put their ideas into practice.

There is a significant ongoing transformation in the landscape of OR practice. At the end of the first decade of this century, we see that traditional industries where OR has succeeded in the past, such as airlines and logistics, will continue to use OR methods. However, due to saturation and the lack of radical breakthrough ideas, the minor incremental returns on R&D dollars will continue to discourage management from enhancing the science behind these OR approaches, and they will further outsource such tools to third-party vendors. In such a support mode, there is little that differentiates you from your competition.

On the other hand, the application of OR methods to new industries is very exciting, even lucrative. This year, OR will quietly make its way into more new industries. Most of the world (including the OR community?) will not know this, since OR is likely to remain hidden within a 'business analytics' agenda.

A new graduate who wants to practice OR should possess sound 'traditional' math and OR skills as well as the ability to work with large data sets locked inside databases. You should be strong enough in your fundamentals to deliver proofs-of-concept without asking your boss (typically someone who neither cares for OR nor knows what it is) to shell out big $$ for a new CPLEX or Gurobi license, or a new SAS license to analyze patterns in data.

Familiarity with open-source tools such as COIN-OR and R will help, since they are free for R&D. In such new industries, the ability to work with and analyze large volumes of messy data is perhaps even more important, so being at ease there will give you an edge over non-OR types, since you can 'take it all the way'. Remember, OR is an applied field that is tailor-made for analytics, and that is a powerful plus point.

A PhD would be preferable unless you are OK with being tagged as an OR-programmer/data analyst. The ability to communicate technical ideas to a non-OR audience in plain English is very, very important.

Business problems do not show up with a "use OR" label on them. The stalwarts of our field in the 1950s-1980s came up with original approaches that best suited the practical problem at hand, using the best computing technology available, and these breakthroughs eventually became part of OR folklore and textbooks. OR succeeds best when it is explainable and insightful, and at times a smart 10-line answer may just do the trick.

Finally, it's worth restating the obvious. The most important component of OR practice is that you build reliable solutions for real people who spend real $$ in a tough economy.