## Monday, May 6, 2013

### Guesstimation in Optimization and Analytics

Guesstimation
Browsing through the various examples examined in the 2008 book 'Guesstimation', it is apparent that the basic ideas behind this approach of "solving the world's problems on the back of a paper napkin" are useful for analyzing real-life optimization and analytics instances. Whether we are looking at (i) Smart-grid optimization ("how many hours does it take to charge an electric car"), or (ii) Hazmat routing ("what is the risk of dying per unit mile of road") or (iii) environmental problems ("how much farmland will it take to convert all gasoline cars to ethanol"), guesstimating values can be critical to effective model formulation in the absence of precise information. Guesstimation is a 'small data' method that can improve the effectiveness and performance of our 'big data' analytics.

Optimization
Let's look at an 'elastic' linear programming formulation:
LP: Minimize c.x + P.s
A.x - s <= b
l <= x <= u, s >= 0
where we've added artificial variables 's' to ensure feasibility of LP for any bound feasible 'x'. Often times, exact values for b, c, l, u, or portions of A are not available. Furthermore, we allow artificial variables to be active at some high but as yet unknown penalty rate of 'P'. We often end up guesstimating these values in industrial applications since many of these values are not known accurately.

Example (i) in the previous paragraph gives us a charge rate per car (a component of 'A' matrix). if 'x' is the number of cars to be charged, and 'b' is an upper bound on the time available and 'c' the cost of electricity for that duration, we can predict the 'ballpark' cost incurred and the charge stored. Similarly, example (ii) gives us the average risk per unit mile as a cost coefficient 'c'. By extending this calculation to denote high and low risk arcs in a network, we have an initial minimum risk network path optimization formulation at hand. Finally, example (iii) gives us an unconstrained resource requirement. This value allows us to quickly assess the impact of adding a resource constraint.

Thou shalt always Guesstimate!
Direct uses of guesstimation in practical optimization include the calculation of tighter bounds (l, u) for decision variables, and the smallest valid values for the big-M numbers that riddle integer programming formulations. If we think of (u - l) as a measure of our ignorance about the as yet unknown optimal 'x', guesstimating tight bounds is akin to maximally reducing its "prior variance". While guesstimation in the book is done using feasibility calculations, we can and should incorporate optimality considerations to obtain sharper bounds for such unknown variables. Using solver default (0, infinity) values is lazy and bad practice. Also, spending some time to improve bounds often gives us insight into the nature of the optimal solution, and which constraints are likely to be more restrictive, and helps catch modeling bugs earlier.

Example 1: what is the cost of not scheduling a crew for an airline flight?
In airline crew scheduling, the flights in a schedule are partitioned among crew pairings (see this post for a problem description). Every flight must be covered by exactly one pairing. This is a hard constraint, but in practice, we have to temporarily employ its elastic form of the type described above, and penalize violations via 'P'. How do we determine an appropriate penalty value? This would be the cost of leaving a flight uncovered. Such flights have to be covered by a reserve crew, which comes at a certain cost that can be calculated and is typically smaller than some default value we may have otherwise chosen. The dual variable value associated with an uncovered flight in an optimal LP solution attains a value close to 'P'. Any new column (x(j)) generated that covers this flight will subsequently will have a reduced cost of order (c(j)-P), and so on. Good guesstimation of P helps in producing manageable reduced costs, and accelerates algorithm convergence.

Example 2: what is the price elasticity of point-and-shoot digital cameras?
Price elasticity is a dimensionless (negative) quantity that gives us the expected percentage change in sales for a percentage change in price. We can expect that this value will be more than that for electricity, grocery, and dairy items, which are essential items that people are less willing to give up when price is hiked, and are likely to be inelastic (less than a percent drop in sales for a percent increase in price, say -0.5). The value will be lower than that of highly discretionary purchases like fashion apparel or leisure airline tickets (which have an elasticity of -4 to -2). The geometric and arithmetic mean of 3.0 and 0.5 gives us a range (-1.8, -1.2). A data-driven estimation of this value should be compared against this guesstimate as a sanity check. For example, if regression models on a data sample estimates an 'optimized' elasticity parameter value of -5 or -0.1, we may want to double check our modeling, or run the risk of producing wild predictions.

Another use of guesstimation is in generating plausible random optimization instances to demonstrate the effectiveness of new NP-Hard algorithms for a certain problem class if only limited real-life data is available. It can help weed out weird and contrived instances that skew the degree of difficulty of the problem and make it seem much harder than it actually is in reality.