Dualnoise: March 2012

Monday, March 26, 2012

Gender-Shaping III: Is Amartya Sen's Missing Women Count Exaggerated?

This is the third post in this series on gender-shaping. The previous installment can be found here. Thanks to a Twitter link, I came across a 2010 journal paper: "Missing Women: Age and Disease," by Siwan Anderson (University of British Columbia) and Debraj Ray (New York University), published in the Review of Economic Studies, Vol. 77.

Among other things, this paper investigates Amartya Sen's '100 Million Missing Women of India' claim, which attributes the shortfall to systemic discrimination. Anderson and Ray estimate the number of 'missing women' in India, China and sub-Saharan Africa by age and cause of death (not done before), while also moving away from the simplistic aggregate sex ratios that were used as baselines in prior works. The authors make the following useful observations:

"Defining missing women by differences in aggregate sex ratios can be misleading, or uninformative (or both). It is misleading because different countries have different fertility and death rates, and (in particular) different age distributions. They will have different disease compositions. They may also have different sex ratios at birth for genetic or environmental reasons that have nothing to do with missing females.

The procedure is also uninformative: we cannot tell at what ages the missing women are clustered, or what diseases are responsible. Thus, we cannot begin to ask about the various channels: discrimination, biology, social norms, and so on. Answering these questions is of profound importance. By unpacking missing women by age and disease, our paper takes a limited and preliminary step in this direction."

From an OR perspective, we rely extensively on similar customer-segmentation models (in revenue management, for example). This additional segmentation by age and causal factor appears to be quite important, and it yields two main results as well as a comparative result that may be interesting to a U.S. audience:

1. A large fraction of the missing women in India are not infants (who account for less than 20%) but adults, and their absence is attributable to other factors like disease and injury, apart from any systemic discrimination. Consequently, any claim that India's 'missing women' are driven exclusively by female infanticide is rejected. On the other hand, this paper finds that 44% of China's missing women are in the prenatal age group. Here is a snapshot of sex ratio by age, taken from the Anderson & Ray paper:

2. The authors make an interesting comparison with the U.S.: "we observe some similarities between age-specific percentages of missing women in the historical United States (ca. 1900) and India or sub-Saharan Africa today".

3. The Sen count (100 million missing women) appears to have been calculated with respect to a specific counterfactual: the overall sex ratio for N. America, U.S and Japan. An alternative calculation by Coale (1991) comes up with a more conservative estimate of 60 million. Anderson and Ray perform similar calculations, but at the segment level (i.e., by age and disease), using more carefully chosen counterfactuals as the baseline. They find approximately 20 million missing women in India (aggregated across all age groups), while the corresponding figure for China is 58 million. Furthermore, 'injury' is not an insignificant culprit in India across all age groups, a potentially worrying trend that its government must look into. (The paper alludes to the old bogey of 'dowry deaths' as a probable cause, which may not turn out to be the case. A similarly detailed analysis is required.)
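To see why the choice of counterfactual moves the count so much, here is a minimal sketch of an aggregate-ratio calculation. It is not the procedure used by Sen, Coale, or Anderson and Ray; the function name `missing_women`, the population figures, and the two reference ratios are all made-up round numbers chosen only to illustrate the sensitivity.

```python
def missing_women(males, females, reference_f_per_m):
    """Females expected under a reference female-per-male ratio, minus females observed."""
    expected_females = males * reference_f_per_m
    return expected_females - females


# Hypothetical populations in millions; NOT census data.
males, females = 520.0, 480.0

# Two hypothetical baselines: the only thing that changes is the counterfactual ratio.
for label, ratio in [("baseline A", 1.05), ("baseline B", 1.00)]:
    estimate = missing_women(males, females, ratio)
    print(f"{label}: {estimate:.1f} million")
```

The same observed population gives quite different counts under the two baselines, which is one reason a segment-level calculation with a baseline chosen per segment can land far from an aggregate one.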

The findings of this paper also weaken a statement in a previous post on this topic that a skewed overall male:female ratio in a region is a 'scary indicator' of female infanticide being practiced there. My statement ignored the age distribution as well as the 'cause of death' dimension. Bad O.R, but I have Amartya Sen for company.

Thursday, March 22, 2012

The Optimal Playlist

One of the problems with neighborhoods in parts of Connecticut is the lack of sidewalks coupled with crazy drivers (probably from a neighboring state to its left). To avoid getting run over, I decided it is safer to do my walking on the treadmill. I'm now getting all the exercise a creaky researcher needs, but I'm not getting anywhere. To overcome this monotony, I hooked up my old iPod classic for company, but it's time-consuming to generate my preferred playlist: start off with some up-tempo music for motivation, then switch to cruise mode, and tone down after my 30/60 minutes of walking.

In India, we have the concept of 'Rasa', a Sanskrit untranslatable that, very roughly speaking, includes notions of experiencing certain emotion(s), themes, ambiance, genre, etc. So the sequence of Rasas matters a great deal. Furthermore, I like to listen to complete songs and hate to end a virtuoso Carnatic performance halfway when the exercise session-clock runs out. There are also many languages in India, and many have their own pop-culture, folk, and classical genres in instrumental as well as vocal modes, and I prefer a diverse sampling of these to feel more at home.

Putting all this together to achieve an optimized playlist requires a constraint-programming approach. If I also want to optimize a certain objective (e.g., stay close to 12 songs), this turns into an exercise of solving an associated discrete decision optimization problem that can be stated as follows:

Find (preferably) 12 complete non-repetitive songs in a preferred sequence that lasts (almost) exactly 60 minutes, and includes at least n(i) songs having user-specified attribute (i), i = 1, ..., m.

If we restate the attribute requirement as a soft constraint by creating a score table for including any attribute (e.g., 10 points for including a song with attribute (i) once, 15 points for two songs with attribute (i), 17 for three or more) as opposed to the 'must satisfy' version stated earlier, then the playlist optimization problem can be posed as an attribute-score-maximizing multiple-choice knapsack problem with a cardinality constraint, followed by a sequencing step. Even with a huge home music database, practical instances of the latter formulation may be relatively easy to solve via combinatorial methods (iPhone app?) and may not require expensive MIP solvers. Then, as a second step, we can sort the included songs into another preference-score-maximizing sequence to generate the final playlist, unless of course the sequencing requirements are not that simple (in which case, a more sophisticated optimization approach may be required).
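The selection step can be sketched on a toy library. This is a minimal illustration, assuming plain exhaustive enumeration rather than a real knapsack algorithm, so it only suits a handful of songs. The song records, tags, durations, and the names `attribute_score`, `playlist_score`, and `best_playlist` are hypothetical; only the 10/15/17 score table comes from the text above.

```python
from itertools import combinations


def attribute_score(count):
    """Score table from the post: 10 for one song, 15 for two, 17 for three or more."""
    if count <= 0:
        return 0
    return {1: 10, 2: 15}.get(count, 17)


def playlist_score(songs, attributes):
    """Sum the diminishing-returns score of every attribute over the chosen songs."""
    return sum(
        attribute_score(sum(1 for s in songs if a in s["tags"]))
        for a in attributes
    )


def best_playlist(library, attributes, k, target_sec, tolerance_sec):
    """Try every k-song subset; keep the best-scoring one close enough to the target length."""
    best, best_score = None, -1
    for combo in combinations(library, k):
        total = sum(s["sec"] for s in combo)
        if abs(total - target_sec) > tolerance_sec:
            continue  # playlist too short or too long
        score = playlist_score(combo, attributes)
        if score > best_score:
            best, best_score = combo, score
    return best, best_score


# Hypothetical toy library (durations in seconds).
library = [
    {"title": "A", "sec": 300, "tags": {"carnatic", "vocal"}},
    {"title": "B", "sec": 310, "tags": {"folk"}},
    {"title": "C", "sec": 290, "tags": {"carnatic", "instrumental"}},
    {"title": "D", "sec": 600, "tags": {"hindustani"}},
    {"title": "E", "sec": 280, "tags": {"film", "vocal"}},
    {"title": "F", "sec": 320, "tags": {"folk", "instrumental"}},
]
attributes = ["carnatic", "folk", "vocal", "instrumental", "film", "hindustani"]

chosen, score = best_playlist(library, attributes, k=3, target_sec=900, tolerance_sec=30)
print([s["title"] for s in chosen], score)
```

Because the score table rewards the first song with an attribute more than the second or third, the search naturally leans toward a diverse sampling. For a real music database the enumeration would have to be replaced by a proper combinatorial method.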

Such an optimized playlist is also useful if you want to build an auto-pilot DJ for your next house party. If your approach can solve this problem on-demand, you would also be able to dynamically re-optimize the playlist after manual intervention.

It seems apt to terminate this post with a Carnatic-Western classical fusion piece.

Updated on March 30: The objective function above is deterministic, so there is a good chance that you will get the same set of songs to listen to each day, which is not very useful. To introduce diversity and exploit the fact that in practice you tend to get several alternative optimal solutions to such problems, add a small amount of clock-dependent noise to the attribute score and the sequence-preference score. This will likely do the trick.
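One way to sketch the clock-dependent noise is to seed a random generator with the calendar day and nudge each base score by a small amount. The function name `noisy_scores`, the `scale` parameter, and the example scores are assumptions made for illustration; the post does not prescribe a particular noise scheme.

```python
import random
from datetime import date


def noisy_scores(base_scores, day, scale=0.5):
    """Return a copy of base_scores with a small, day-dependent perturbation added."""
    rng = random.Random(day.toordinal())  # same day gives the same noise
    return {name: score + rng.uniform(0, scale) for name, score in base_scores.items()}


# Hypothetical base scores for two attributes that would otherwise tie.
base = {"carnatic": 10, "folk": 10}
print(noisy_scores(base, date(2012, 3, 30)))
print(noisy_scores(base, date(2012, 3, 31)))
```

Keeping `scale` small relative to the gaps in the score table means the noise mostly reshuffles ties among near-equivalent playlists instead of overriding the stated preferences.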

Thursday, March 1, 2012

House Hunting Efficiently

One of the consequences of the kind of convergence shown in this graph was that it created the need to buy a house. It's become a ritual to spend weekdays creating a list of houses that are feasible with respect to hard constraints (big kitchen, level lot, ..), and then converting that into a prioritized list based on how they score on soft constraints (pre-wired for Bose speakers, for example) that in turn motivates a preferred way of visiting these houses during weekends. I noticed that I rarely see each house in isolation and my view of a house tends to be colored by what I saw earlier. However, of late, the time to view these houses has become a scarce resource, so I created an O.R driven prioritized list that maximizes and optimally allocates viewing time, keeping the total duration equal to the limited time available. I used my automobile GPS unit as the solver.

This GPS unit "solves" the Traveling Salesman Problem (TSP) to figure out the optimal (?) order of visitation that minimizes total drive time, which automatically maximizes aggregate viewing time. (In particular, if houses are located on either side of a busy highway or Main Street, a good heuristic would be to ensure that the optimal path intersects such a link infrequently.) I can then allocate the optimal expected viewing time to the houses based on personal preferences. The total viewing time also informs me if I have spread myself too thin, in which case I can start deleting houses with the lowest scores from the list and re-optimize until the solution looks reasonable.
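The loop described above can be sketched in a few lines. This is a minimal illustration that assumes a brute-force search over visiting orders (fine for a handful of houses; it is not what a GPS unit does internally), a start and end point called "home", a proportional-to-score allocation of viewing time, and a threshold `min_avg_view` standing in for "looks reasonable". All names, drive times, and scores are hypothetical.

```python
from itertools import permutations


def symmetric(pairs):
    """Build a two-way drive-time lookup from one-way entries (minutes)."""
    drive = {}
    for (a, b), minutes in pairs.items():
        drive.setdefault(a, {})[b] = minutes
        drive.setdefault(b, {})[a] = minutes
    return drive


def tour_time(order, drive):
    """Total drive time for home -> houses in the given order -> home."""
    stops = ["home", *order, "home"]
    return sum(drive[a][b] for a, b in zip(stops, stops[1:]))


def best_tour(houses, drive):
    """Brute-force the visiting order with the smallest total drive time."""
    return min(permutations(houses), key=lambda order: tour_time(order, drive))


def plan(houses, drive, scores, budget_min, min_avg_view):
    """Drop the lowest-scoring house until average viewing time is acceptable."""
    houses = sorted(houses, key=scores.get, reverse=True)
    while houses:
        order = best_tour(houses, drive)
        viewing = budget_min - tour_time(order, drive)
        if viewing / len(houses) >= min_avg_view:
            total = sum(scores[h] for h in order)
            # Split the remaining time in proportion to preference scores.
            return order, {h: viewing * scores[h] / total for h in order}
        houses.pop()  # spread too thin: remove the lowest-scoring house
    return (), {}


# Hypothetical drive times (minutes) and preference scores.
drive = symmetric({
    ("home", "A"): 10, ("home", "B"): 15, ("home", "C"): 20,
    ("A", "B"): 5, ("A", "C"): 12, ("B", "C"): 6,
})
scores = {"A": 3, "B": 2, "C": 5}

order, minutes = plan(["A", "B", "C"], drive, scores, budget_min=120, min_avg_view=30)
print(order, minutes)
```

With the two-hour budget in this toy instance, three houses leave too little viewing time per house, so the lowest-scoring one is dropped and the tour is re-optimized.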

If viewing order is important from a 'relative comparison' perspective, the resultant constrained TSP becomes a bit harder to solve using a GPS unit. A simple heuristic rule could be to fix the second node ("house to visit first") and/or the second-last node ("house to visit last") of the tour and let the others be visited based on time-optimality. If your realtor drives you around, her/his office is the start and end node of the Hamiltonian circuit.
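The pinning heuristic can be sketched as follows: fix the first and/or last house and brute-force the order of the rest. This assumes a tiny instance, a tour that starts and ends at the realtor's office, and Manhattan distance on made-up grid coordinates as a stand-in for drive time; `COORDS`, `drive_time`, and `best_tour_with_pins` are hypothetical names.

```python
from itertools import permutations

# Hypothetical grid coordinates; Manhattan distance stands in for drive time.
COORDS = {"office": (0, 0), "A": (2, 0), "B": (2, 2), "C": (0, 2), "D": (1, 3)}


def drive_time(a, b):
    (x1, y1), (x2, y2) = COORDS[a], COORDS[b]
    return abs(x1 - x2) + abs(y1 - y2)


def tour_time(order, depot="office"):
    """Drive time for depot -> houses in the given order -> depot."""
    stops = [depot, *order, depot]
    return sum(drive_time(a, b) for a, b in zip(stops, stops[1:]))


def best_tour_with_pins(houses, first=None, last=None):
    """Fix the first and/or last house; order the remaining houses for least drive time."""
    middle = [h for h in houses if h not in (first, last)]
    head = [first] if first is not None else []
    tail = [last] if last is not None else []
    candidates = (head + list(p) + tail for p in permutations(middle))
    return min(candidates, key=tour_time)


houses = ["A", "B", "C", "D"]
free = best_tour_with_pins(houses)
pinned = best_tour_with_pins(houses, first="D")
print(free, tour_time(free))
print(pinned, tour_time(pinned))
```

In this toy instance, insisting on seeing house D first lengthens the tour, which is the price paid for controlling the order in which the houses are compared.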

One issue I encountered while using the GPS unit to merely drive by a house as part of a local neighborhood search (pun unintended) is that I had to get close enough to the house and maybe pause a bit to inform the GPS that this house had been reached. Otherwise, the GPS unit would continually re-route me back to the house, resulting in considerable confusion.