One of my maxims for mathematicians is: in public, curb the impulse to present mathematics as a pinnacle of human creative achievement. The profession is better served by being less arrogant, and instead portraying mathematics as useful intellectual infrastructure, analogous to the useful physical infrastructure of the internet. My point is that there is not a Windows exponential function, a different Linux exponential function and a different OS X exponential function; there is just the exponential function. That's infrastructure.
But there's a curious irony in this view. Many of the sciences have official bodies to decide on matters like the definition of a kilogram or the name of a new chemical element or species; these are analogous to the official bodies that determine what is correct French or German. But Mathematics and Statistics have no official bodies to rule on what exactly a topological group or an analysis of variance is, just as English has no official body to rule on spelling, grammar or word meaning. One might expect such superficial anarchy to lead to chaos, but remarkably it does not. Occasionally confusion arises when a word has two meanings: using the same phrase standard deviation for data and for theoretical probability distributions causes, in my opinion, unnecessary conceptual confusion in elementary statistics courses; and a man wearing suspenders is viewed differently in Britain and the U.S. But the topic of this article is one instance of an opposite annoyance, that a simple concept may not have a standard name.
Consider $(p_1, \ldots, p_n)$, either relative frequencies or a probability distribution, in the context of categorical data. There are many summary statistics one could define to measure where a given distribution is on the spectrum from being thinly spread over many categories to being concentrated on one category. The two statistics that seem most fundamental are

$$E = -\sum_i p_i \log p_i \qquad \text{and} \qquad S = \sum_i p_i^2 .$$
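To make the definitions concrete, here is a minimal sketch in Python, with a hypothetical frequency vector, computing both statistics (natural log for E, following the usual convention):

```python
import math

def entropy(p):
    """E = -sum(p_i log p_i), with the convention 0 log 0 = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def sum_of_squares(p):
    """S = sum(p_i^2): the chance two independent draws land in the same category."""
    return sum(pi ** 2 for pi in p)

p = [0.5, 0.25, 0.125, 0.125]   # hypothetical relative frequencies
print(entropy(p))                # E = 1.2130... (natural log)
print(sum_of_squares(p))         # S = 0.34375
```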
For E, everyone (or at least every reader of this article) surely uses the same name, entropy, or a variant such as Shannon entropy. But what is the name of the statistic S?
Let me digress for a short rant. Names of mathematical objects and theorems are just labels, just identifiers. Attributing theorems and other matters of substance to their discoverers is good scholarship, and the custom of naming them after the discoverers may (if initially done correctly) provide the double benefit of both attribution and a label. For instance, the Kolmogorov 0-1 law is a perfectly good name for that theorem. But asserting that objects like S should be named after the first person to write them down is frankly ridiculous. If the tail sigma-field featuring in the Kolmogorov 0-1 law had first been defined by some Professor Wagstaff, should we then rename it the Wagstaff sigma-field? In fact, the academic preoccupation with identifying who did something first was well satirized four centuries ago:
He forgot to tell us who was the first man in the world that had a cold in his head ... but I give it accurately set forth, and quote more than five-and-twenty authors in proof of it, so you may perceive I have labored to good purpose and that the book will be of service to the whole world. (Cervantes, Don Quixote).
Ideally, identifiers are unique, short and memorable; using people's names is an option, not a requirement; entropy and tail sigma-field and ham sandwich theorem work at least as well as Kolmogorov 0-1 law and Riemann hypothesis and Stone-Tukey theorem. End rant.
Anyway, what I am seeking is the name of S in actual current usage, since 2000 say. Given any candidate name, one can quickly use Google Scholar to get a rough idea of how extensively it is used. And it does not take long to find the relevant Diversity index page of Wikipedia, which contains a useful set of possibly relevant names. Using these to start a search (note I am not relying on Wikipedia as an authority, merely as a starting point), one very quickly discovers the following.
1. In ecology (e.g. populations of different species), S has long been called Simpson's diversity index or Simpson's index. Since 2000 the name Gini-Simpson (diversity) index has become more common, apparently stemming from a 1981 paper of C.R. Rao.
2. In demography (e.g. populations of different ethnicities) D = 1 - S is called the diversity index, in particular by the U.S. census. (Intuitively, it is D not S that increases as diversity increases).
3. In economics (e.g. market shares of different companies) S is called the Herfindahl index.
At this point one suspects that another hour of searching would find yet more names used in other quantitative academic disciplines. Indeed I happen to have recently seen susceptibility used in graph theory (for component sizes), the word deriving indirectly from Ising models of magnetism. And Manjunath Krishnapur pointed out that in areas of information theory the quantity -log S is called Rényi entropy or collision entropy.
Anyway, this plethora of names for S demonstrates the disadvantages of naming simple mathematical concepts after people or in some application-specific way. A descriptive name such as L2 categorical diversity would surely be much more desirable.
In case you wonder what prompted me to write about S in particular, here are three ways it came to my attention.

First, back in the 1990s it arose in my own technical work on stochastic coalescence.

Second, in my current efforts at undergraduate/popular level exposition, I am creating a list (suggestions welcome) of the ten most interesting explicit formulas in applied probability. One of these is from population genetics: in a population of constant size N and with chance q of a mutation to a new neutral allelic type, the “effective number” 1/S of neutral alleles is approximately 1 + 4Nq. (Any summary statistic of diversity can be interpreted as an effective number, the number of categories for which the uniform distribution has the given value of the statistic; for our statistics the effective number is 1/S or exp(E).)

Third, I have an interest in reading popular expositions of probability, of which I regard Warren Weaver's 1963 Lady Luck as a classic benchmark for comparison of subsequent works. Weaver only once mentions a new idea of his own, which is to measure the unlikeliness of a particular outcome i by the relative probability $p_i/S$, and then to call its inverse $S/p_i$ the surprise index associated with that outcome. This notion has not come into widespread use, though there is an interesting article Surprise Index, written by I.J. Good, in the Encyclopedia of Statistical Sciences. Good develops a little mathematics, in particular observing (as many others have) that S and E can be regarded as members of a certain one-parameter family of summary statistics, displayed below; this family is often named Rényi entropy. Good also attributes the idea of S to Gini, as the Gini index of homogeneity; but this name invites confusion with the Gini coefficient for quantitative data and does not seem to be in widespread use.
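For the record, the one-parameter family in question is usually written as follows (a standard textbook formulation, not quoted from Good's article):

$$H_\alpha = \frac{1}{1-\alpha}\,\log\Big(\sum_i p_i^{\alpha}\Big), \qquad \alpha \neq 1,$$

with $H_\alpha \to E$ as $\alpha \to 1$ and $H_2 = -\log S$; the corresponding effective numbers $\exp(H_\alpha)$ recover exp(E) and 1/S as the two cases mentioned above.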
The entropy statistic E suffers from no such terminological confusion, but perhaps the prominence of the name (cf. the third law of thermodynamics) leads to an opposite problem – overuse. The Wikipedia Diversity index page describes the intuitive significance of entropy for species abundance data as follows.
As we walk around and observe individual organisms, we call out (a binary codeword for their species). This gives a binary sequence. If we have used an (optimal) code, we will be able to save some breath by calling out a shorter sequence than would otherwise be the case. If so, the average codeword length we call out as we wander around will be close to the (entropy).
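Before commenting, here is a minimal sketch of the claim in the quote, assuming hypothetical species frequencies: build a Huffman code (one standard optimal prefix code) and compare the average codeword length with E measured in bits.

```python
import heapq
import math

freqs = {"sparrow": 0.5, "robin": 0.25, "crow": 0.125, "wren": 0.125}

# Each heap entry: (probability, tiebreaker, {species: codeword-so-far}).
heap = [(p, i, {name: ""}) for i, (name, p) in enumerate(freqs.items())]
heapq.heapify(heap)
tiebreak = len(heap)
while len(heap) > 1:
    # Merge the two least probable subtrees, prefixing their codewords.
    p1, _, left = heapq.heappop(heap)
    p2, _, right = heapq.heappop(heap)
    merged = {name: "0" + c for name, c in left.items()}
    merged.update({name: "1" + c for name, c in right.items()})
    heapq.heappush(heap, (p1 + p2, tiebreak, merged))
    tiebreak += 1
codes = heap[0][2]

avg_len = sum(freqs[name] * len(code) for name, code in codes.items())
E_bits = -sum(p * math.log2(p) for p in freqs.values())
print(codes)             # e.g. {'sparrow': '0', 'robin': '10', ...}
print(avg_len, E_bits)   # both equal 1.75 for these dyadic frequencies
```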
This is basically correct but is hardly a convincing argument for using E as a statistic of biological significance. Because there are specific contexts (e.g. data compression) where E really is the relevant statistic, and because E has nice mathematical properties, it is often used as a default statistic to quantify diversity, loosely analogous to the use of standard deviation as a default statistic to measure spread. But for a typical observed categorical data set it is hard to give any positive justification or natural interpretation for E. Consider for instance the extensive data at http://www.ssa.gov/oact/babynames/ on U.S. birth names. Any statistic would reveal the dramatic increase in diversity of names over the last 50 years. The statistic S has a natural interpretation: the chance that two random babies have the same name. Can you think of an interpretation for E that is less artificial than the story of the breath-saving biologist above?
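Returning to the interpretation of S, here is a quick Monte Carlo check of the two-random-babies reading, using a hypothetical five-name toy distribution rather than the actual SSA data.

```python
import random

names = ["Emma", "Olivia", "Noah", "Liam", "Ava"]
probs = [0.30, 0.25, 0.20, 0.15, 0.10]   # hypothetical name frequencies

S = sum(p ** 2 for p in probs)            # exact collision probability, 0.225

trials = 100_000
matches = sum(
    random.choices(names, probs)[0] == random.choices(names, probs)[0]
    for _ in range(trials)
)
print(S, matches / trials)   # the empirical match rate should be close to S
```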
David Aldous, Berkeley
Editors’ note: This is the third installment of a regular opinion column.