Best of Chance News 13.03

Best of Chance News
APRIL 17, 2004

Prepared by J. Laurie Snell, Bill Peterson, Jeanne Albert, Charles Grinstead, and Myles McLeod with help from Fuxing Hou and Joan Snell. We are now using a listserv to send out notices that a new Chance News has been posted on the Chance Website. You can sign on or off or change your address here. This listserv is used only for this posting and not for comments on Chance News. We do appreciate comments and suggestions for new articles. Please send these to jlsnell@dartmouth.edu. Chance News is based on current news articles described briefly in Chance News Lite.

The current and previous issues of Chance News and other materials for teaching a Chance course are available from the Chance web site.

Chance News is distributed under the GNU General Public License (so-called 'copyleft'). See the end of the newsletter for details.

Contents of Best of Chance News

(1) Benford's Law

(2) Estimating the number of species

(3) Oscar winners live longer.

(4) Weather forecasting.


 

BENFORD'S LAW

He's got their number: Scholar uses math to foil financial fraud.
Wall Street Journal, 10 July 1995
Lee Berton


Mark Nigrini, who teaches accounting at St. Mary's University in Halifax, wrote his PhD thesis on "The detection of income evasion through an analysis of digital distributions". He has persuaded business and government people to use Benford's law to test suspicious financial records such as bookkeeping entries, checks, and tax returns. The article states that Benford's law "lays out the statistical frequency with which the numbers 1 through 9 appear in any set of random numbers". Actually, Benford's law states that the distribution of the leading digit in data sets is typically not equi-distributed but rather given by the distribution p(k) = log(k+1) - log(k) for k = 1,2,...,9. (The leading digit of .0034 is 3, that of 243 is 2, etc.) Numerous explanations for this have been given, but perhaps the most persuasive is that Benford's distribution is the unique distribution for the leading digits that is not changed by a change of units, i.e. by multiplying the data by a constant c.

(For a recent discussion of Benford's distribution and further references see: Theodore P. Hill, The significant digit phenomenon, "The American Mathematical Monthly", April 1995.)

Nigrini's idea is that, if we are honest, the numbers in our tax returns and on our checks should satisfy Benford's law, and if they do not there may be some skullduggery.

The article states that "Mr. Nigrini has also lent his expertise to federal and state tax authorities, officials in Denmark and the Netherlands and to several companies. He has even put President Clinton's tax returns to the Benford's Law test. When he analyzed the president's returns for the past 13 years he found that 'the returns by Clinton follow Benford's Law quite closely'".

DISCUSSION QUESTIONS:

1. Would you expect Benford's distribution to apply to the number of hits a baseball player gets in a year? To the prices of stocks on a given day? To the populations of cities in the United States?

2. Mr. Smith is quite wealthy and makes over a hundred charitable contributions each year. Do you think the distribution of the leading digits of these numbers would have a Benford distribution if he is honest? If he cheats, why might they not have a Benford distribution?

3. Compute the first 100 powers of 2 and show that the leading digits have a Benford distribution.
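For readers who want to try question 3, here is a short Python sketch of ours (not part of the article) that tabulates the leading digits of the first 100 powers of 2 and compares them with the Benford probabilities:

import math
from collections import Counter

# Leading digits of 2^1, 2^2, ..., 2^100.
counts = Counter(int(str(2 ** n)[0]) for n in range(1, 101))

print("digit  observed  Benford")
for d in range(1, 10):
    benford = math.log10(d + 1) - math.log10(d)
    print(f"{d}      {counts[d] / 100:.3f}     {benford:.3f}")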


On the peculiar distribution of the U.S. stock indexes' digits.
American Statistician, Vol. 50, No. 4, Nov. 1996, pp. 311-313
Eduardo Ley


As we have discussed in previous issues of Chance News (see Chance News 4.10), Benford's distribution for leading digits is supposed to fit "natural data." The Benford distribution assigns probabilities log((i+1)/i) to digits i = 1,2,..,9. Ley asks if this distribution fits the leading digits of the one-day return on stock indexes, defined as r(t) = 100*(ln(p(t+1)) - ln(p(t)))/d(t), where p(t) is the value of the index on the tth trading day and d(t) is the time between the tth and (t+1)st trading days -- usually 1. Since p(t+1) = p(t)exp{r(t)*d(t)/100}, r(t) is the continuous-time return rate for the period between the tth and (t+1)st trading days.
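Readers who want to experiment with this can compute the returns and their leading digits in a few lines of Python. This is only a sketch of ours: the short list of prices is a made-up placeholder for a real daily index series, and we take d(t) = 1 throughout.

import math
from collections import Counter

def leading_digit(x):
    """First non-zero digit of a positive number."""
    x = abs(x)
    # Scale x into [1, 10) and take the integer part.
    return int(x / 10 ** math.floor(math.log10(x)))

def one_day_returns(prices):
    """r(t) = 100*(ln p(t+1) - ln p(t)), taking d(t) = 1 trading day."""
    return [100 * (math.log(p1) - math.log(p0))
            for p0, p1 in zip(prices, prices[1:])]

# Replace 'prices' with a real daily index series.
prices = [100.0, 100.8, 100.3, 101.9, 101.2, 102.5]
digits = Counter(leading_digit(r) for r in one_day_returns(prices) if r != 0)
print(digits)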

Ley finds that the leading digits of r(t) fit the Benford distribution remarkably well for both the Dow-Jones and the Standard and Poor's index.
He obtained the following distributions for the leading digits for the one-day return rate for the Dow-Jones from January 1900 to June 1993 and the S&P from January 1926 to June 1993.

 Digit  Dow Jones  S&P  Benford
 1  .289  .292  .301
 2  .168  .170  .176
 3  .124  .134  .125
 4  .100  .099  .097
 5  .085  .078  .079
 6  .072  .071  .067
 7  .062  .056  .058
 8  .053  .054  .051
 9  .047  .047  .046


As you can see, in both cases the approximation is very good.

Despite this apparent good fit, in both indices a chi-squared test would reject Benford's distribution. Ley attributes this to the very large power of the test that comes with so large a sample. He remarks that if you take just the last ten years this does not happen and says:

If one takes models as mere approximations to reality, not as perfect data-generating mechanisms, then this can only be viewed as a weakness in the Neyman-Pearson theory (hypothesis testing).

Ley's data is available from http://econwpa.wustl.edu.

Comment: This example is discussed in an article "First Significant Digit Patterns From Mixtures of Uniform Distributions" by Ricardo J. Rodriguez in The American Statistician, Vol. 58, No. 1, Feb. 2004. Here it is shown that this data is better fit by another law called "Stigler's law."

DISCUSSION QUESTIONS:

(1) Would you expect leading digits of the Dow-Jones values themselves to have Benford's distribution?

(2) Do you agree with Ley's remark about the weakness in the theory of hypothesis testing?


Digit Analysis Using Benford's Law: Tests & Statistics for Auditors,
Global Audit Publications, Vancouver, BC
Mark J. Nigrini

It is often rewarding though less often enjoyable to read a technical book. But this book is both technical and a pleasure to read. It tells the story of Mark Nigrini's crusade to convince the world of accountants that Benford's distribution for the leading digit of "natural" data can be useful in detecting fraud.

Nigrini tells us how he learned about Benford's distribution, how he came to write his PhD thesis on the use of this distribution to detect fraud, and how, in the last twenty years, he has developed and applied methods for using digital analysis for fraud detection. While Benford's distribution is at the heart of this analysis, Nigrini incorporates other kinds of digital irregularities in his analysis. For example, when Dartmouth reimburses us for travel, we do not have to provide receipts for items that are less than $25. So, most of our meals end up costing $24.50. Nigrini's digital analysis would have no trouble detecting this fraud.

The leading digit L of a positive number is the first non-zero digit in its decimal representation. So the leading digits of .0023, 2.23, and .234 are all 2. The second digits of these numbers are 3, 2, and 3 respectively.

As our quote for this Chance News suggests, the famous astronomer Simon Newcomb was led, by looking at tables of logarithms, to the belief that the leading digits of data are not typically uniformly distributed. In his 1881 article, Newcomb (1) (see the end of the article for references) gave an argument to show that the mantissas of the logarithms of the numbers should be uniformly distributed, which leads to

      Prob(L = j) = log(j+1) - log(j) for j = 1,2,..,9

where the logarithm is to the base 10.

This gives the distribution

 First digit  1  2  3  4  5  6  7  8  9
 Probability  .301  .176  .125  .097  .079  .067  .058  .051  .046

The physicist Frank Benford(2) independently discovered this distribution and showed that the distributions of the leading digits of a number of "natural" data sets reasonably fit it.

Here are some data whose leading digits are reasonably approximated by the Benford distribution:

 Digit  Benford  Dow Jones 1900-1993  River areas  Newspapers  County populations  Electricity consumption
 1  .301  .289  .310  .30  .316  .316
 2  .175  .168  .164  .18  .170  .167
 3  .125  .124  .107  .12  .134  .116
 4  .097  .100  .113  .10  .083  .087
 5  .079  .085  .072  .08  .073  .085
 6  .067  .072  .086  .06  .067  .064
 7  .058  .062  .055  .06  .055  .057
 8  .051  .053  .042  .05  .056  .050
 9  .046  .047  .052  .05  .046  .057

The Dow Jones numbers are from an article by Ley(3). The river and newspaper data are from Benford's article. The county populations are the populations of the 3,141 counties reported in the 1990 census, analyzed by Nigrini in this book. The electricity consumption data appears in an excellent survey article by Raimi(4) and represents the electricity consumption of 1243 users in one month in the Solomon Islands.

In keeping with his desire to keep the mathematics informal, Nigrini justifies the Benford distribution in terms of simple examples. For example, he considers a town that starts with population 10,000 and increases 10 percent per year. Then (ignoring compounding) the population increasing from 10,000 to 20,000 is a 100 percent increase and so takes about 10 years. But a population change from 50,000 to 60,000 is only a 20 percent increase, so it takes only about 2 years. Thus the city will have a population with leading digit 1 about five times longer than it has leading digit 5.

We found Newcomb's argument a little mysterious and so will describe an argument that justifies the Benford distribution as the only distribution that is invariant under a change of scale, i.e. multiplying the data by a constant c. This was first proved by Pinkham(5) and extended by Hill(6).

To know the leading digit, the second digit, etc. we need only know the data modulo powers of 10. This is the same as saying we need only know the mantissas of their logarithms. We represent the mantissas as points on the unit circle. Following Hill we assume there is some chance process producing the data which in turn determines a probability measure for their mantissas.

For a data point to have leading digit L = j it must have a mantissa between log(j) and log(j+1). Thus

P(L = j) = P(log(j) <= mantissa < log(j+1))

Since multiplying the numbers by c amounts to adding log(c) to their mantissas, this means that the probability measure for the mantissas should be invariant under a rotation of the circle. It is well known that the only measure with this property is the uniform measure. Thus

P(L = j) = log(j+1) - log(j)
which is Benford's distribution for the leading digit.
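This invariance is easy to see numerically. The following Python sketch (ours, not from the book) generates data whose log-mantissas are uniform, so the leading digits are approximately Benford, and checks that multiplying the data by several arbitrary constants leaves the leading-digit distribution essentially unchanged:

import math
import random
from collections import Counter

random.seed(1)

def leading_digit(x):
    """First non-zero digit of a positive number."""
    x = abs(x)
    return int(x / 10 ** math.floor(math.log10(x)))

# Data whose log-mantissas are uniform, so the leading digits are approximately Benford.
data = [10 ** random.random() for _ in range(100_000)]

for c in (1, 2.54, 1 / 3, 17.3):          # arbitrary changes of unit
    counts = Counter(leading_digit(c * x) for x in data)
    dist = [round(counts[d] / len(data), 3) for d in range(1, 10)]
    print(f"c = {c:6.3f}:", dist)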

Nigrini's book contains a wide variety of case studies, ranging from his students' projects analyzing data from their relatives' mom-and-pop stores to his own digital analysis of data from major corporations. Along the way there is an interesting discussion on the best way to test the fit of a distribution when you have many thousands of data elements.

Here is an example Nigrini gives from his student projects. For his project a student, Sam, used data from his brother's small store in Halifax, Nova Scotia. Throughout the day family members would ring up sales in the normal fashion. At night before closing, Sam's brother would go downstairs and ring up fictitious sales on a similar register so that the basement total was less than the upstairs total, to evade income and sales taxes. On one day there were 433 authentic sales totaling $4,038.32. On the fake register printout used for tax purposes there were 245 sales totaling $1,947.29. The leading digit distributions from the two registers' printouts and, for comparison, the Benford distribution are:

 Digit  Benford  Actual sales  Fake sales
 1  .301  .290  .17
 2  .176  .230  .06
 3  .125  .075  .05
 4  .097  .100  .02
 5  .079  .170  .38
 6  .067  .025  .17
 7  .058  .024  .02
 8  .051  .024  .01
 9  .046  .040  .12

The fit to the Benford distribution for the actual sales is not great but the biggest difference, the excess in digits 5, could be explained by the large number of sales of cigarettes which sold for somewhat more than $5 at that time. Thus Nigrini would pass the actual sales data. However, his tests would certainly detect fraud in the fake sales data.

The obvious test to use for deciding if an empirical distribution fits the theoretical Benford distribution would be the chi-square test. However, Nigrini rejects the use of this test for the very large data sets obtained from analyzing data from a major company. The reason is that, when the true distribution is only slightly different from the theoretical Benford distribution, the large sample would lead to rejection of the Benford distribution even though there may be no fraud. This is like the observation that with enough tosses of a coin it is possible to reject just about any coin as a fair coin since coins are never exactly fair.

For this reason Nigrini recommends a test that he calls the Mean Absolute Deviation (MAD) test. This test computes the average of the 9 absolute differences between the empirical proportion of each digit and the proportion predicted by the Benford distribution. Based on his experience, Nigrini suggests the following guidelines for measuring conformity of the first digits to Benford using MAD:

MAD:    0 to .004  (close conformity)
MAD: .004 to .008  (acceptable conformity)
MAD: .008 to .012  (marginally acceptable conformity)
MAD:  greater than .012 (nonconformity)

Thus a single deviation of more than 5% would rule out close conformity and more than 10% would suggest nonconformity. Nigrini gives a graph of the fit for a data set with 250,000 data entries from a large Canadian oil company. There are noticeable deviations for at least half of the digits. However, the deviations are small and the MAD was computed at .0036 indicating close conformity. A chi-square test would undoubtedly have rejected the Benford distribution.
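The MAD statistic itself takes only a couple of lines of Python. Here is a sketch of ours, applied to the Dow-Jones first-digit proportions from the table above; it gives a MAD of about .005, which falls in the "acceptable conformity" range of the guidelines:

import math

def mean_absolute_deviation(observed, benford):
    """Average of the absolute differences between observed and Benford proportions."""
    return sum(abs(o - b) for o, b in zip(observed, benford)) / len(benford)

benford = [math.log10(1 + 1 / d) for d in range(1, 10)]

# First-digit proportions of the Dow-Jones one-day returns from the table above.
dow_jones = [0.289, 0.168, 0.124, 0.100, 0.085, 0.072, 0.062, 0.053, 0.047]
print(round(mean_absolute_deviation(dow_jones, benford), 4))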

Nigrini emphasizes that deviation from Benford's distribution and other digital irregularities do not by themselves demonstrate fraudulent behavior. There may be good reasons for these deviations. They only suggest that the investigator might want to look further into how the data is collected, both for evidence of fraud and also for more efficient ways to run the business.

While we have assumed that natural data is the result of chance outcomes, it has not been necessary to know the probability distribution that produces them. However, we would hope that some of our standard distributions would be appropriate for fitting natural data. Both Hill and Nigrini remark that it would be interesting to know which standard distributions produce data with the leading digits having at least approximately a Benford distribution.

In a recent paper Leemis, Schmeiser, and Evans(7) have looked at this problem for survival distributions. These distributions include, among others, the well-known exponential, Gamma, Weibull and lognormal distributions. All these distributions, for some parameter values, produce data with leading digits that reasonably fit Benford's distribution. However, since the fit is sensitive to the parameter values, the authors also warn that relying completely on the Benford distribution to detect fraud could lead to a significant number of false positives.

Mark Nigrini was one of our year 2000 Chance Lecturers, so you can see the movie before or after you read the book.

References for this review:

(1) Simon Newcomb, (1881), Note on the frequencies of the different digits in natural numbers. Amer. J. Math 4, 39-40.

(2) Benford, Frank,(1938), The law of anomalous numbers. Proc. Amer. Phil. Soc. 78, 551-72.

(3) Ley, Eduardo, (1996), On the peculiar distribution of the U.S. stock Indices. The American Statistician, 1996, 50, 311-313.

(4) Raimi, R. (1976), The first digit problem, Amer. Math. Monthly 83, 521-38.

(5) Pinkham, Roger, (1961), On the distribution of first significant digits, Ann. Math. Statist., 32, 1223-1230.

(6) Hill, Theodore P., (1995), The Significant-Digit Phenomenon. American Mathematical Monthly, Vol. 102, No. 4. pp. 322-327.

(7) Leemis, L., Schmeiser, B., and Evans, D., (2000), Survival distributions satisfying Benford's law, The American Statistician, Vol. 54, No. 4. Available from Larry Leemis's home page.

Google numbers.
Sample project for Dartmouth Chance Course
Gregory Leibon

Greg Leibon taught the Dartmouth Chance Course this winter and made this sample project for his students. Since we have discussed Benford's Law a number of times, we thought readers might also enjoy his project. Recall that Benford's Law says that the first (leading) digit in "natural data" should have approximately the distribution given by the base 10 logarithm of (1 + 1/d) for d = 1,2,...,9. Thus the leading digit is 1 with probability log(2) = .301, 2 with probability log(1.5) = .176, etc. The complete distribution is:

 First digit  Benford probability
 1  30.1%
 2  17.6%
 3  12.5%
 4  9.7%
 5  7.9%
 6  6.7%
 7  5.8%
 8  5.1%
 9  4.6%

For his project, Greg wanted to see if "natural numbers" on the web satisfied Benford's Law. He writes:

I wanted to understand numbers on the World Wide Web in which real live people were actually interested. In particular, I did not want to accidentally include numbers from data sets intended only for data mining purposes. To accomplish this, I included a piece of text in my search. I wanted to choose a natural piece of text, hence (for lack of a better idea) I used the word “nature”. Thus, my Google Numbers are numbers that occur on a web page that also includes the word “nature”.


I wanted my search to produce robust but reasonable numbers of results. This is because I wanted to leave myself in a position to actually examine the resulting hits in order to achieve a sense for how the numbers were derived.

A little experimenting led Greg to the conclusion that searches for six-digit numbers and the word "nature" resulted in a reasonable number of hits. So he chose nine random five digit numbers and for each of these he added all possible leading digits. His first five-digit number was x = 13527 giving him the 9 six-digit numbers 113527, 213527, 313527, ...,913527. He then searched for each of these numbers and the word "nature" in Google and recorded the number of hits. Here is what he found:

 

 x = 13527  occurrences
 113527  136
 213527  44
 313527  35
 413527  30
 513527  27
 613527  15
 713527  9
 813527  13
 913527  8
 

He repeated this for his 8 other random five-digit numbers and combined the results to obtain:

 Leading digit  Count  Empirical percent  Benford
 1  645  31.65%  30.1%
 2  342  16.78  17.6
 3  262  12.86  12.5
 4  181  8.88  9.7
 5  164  8.05  7.9
 6  143  7.02  6.7
 7  115  5.64  5.8
 8  105  5.15  5.1
 9  81  3.97  4.6
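One simple way to quantify the fit is a chi-square goodness-of-fit computation on the combined counts. Here is a short Python sketch of ours using the counts in the table above:

import math

benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
counts = [645, 342, 262, 181, 164, 143, 115, 105, 81]   # Greg's combined counts
n = sum(counts)

chi_square = sum((obs - n * p) ** 2 / (n * p) for obs, p in zip(counts, benford))
print(f"chi-square statistic = {chi_square:.2f}")
# For comparison, the 5% critical value for 8 degrees of freedom is about 15.5.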

This is a remarkably good fit. Here is his graphical comparison:

[Figure: bar chart comparing the empirical leading-digit percentages with the Benford probabilities]

Greg wondered if he was just lucky or if there was some explanation for such a good fit. Looking for an explanation, he found that many of the numbers he observed could be considered the result of a growth process. As an example of such a growth process, consider the money you have in the bank that is continuously compounded. Then it is easy to check that the percent of time your money has leading digit k for k = 1,2,3,...,9 fits the Benford distribution. Greg remarks:

Hence, we would expect Google numbers to have a Benford distribution if they satisfied two criteria: first that every Google Number behaves like money with interest continuously compounded, and, second that the probability that a Google number is posted on the web is proportional to how long that quantity is meaningful.
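The claim about continuously compounded money is easy to check by simulation. The following Python sketch is ours; the 5 percent interest rate and the $100 starting balance are arbitrary choices. It samples the balance at random times over one factor-of-10 growth cycle and tabulates the leading digits:

import math
import random
from collections import Counter

random.seed(2)

def leading_digit(x):
    """First non-zero digit of a positive number."""
    return int(x / 10 ** math.floor(math.log10(x)))

rate = 0.05                        # hypothetical continuous interest rate
cycle = math.log(10) / rate        # time for the balance to grow by a factor of 10
samples = 200_000

counts = Counter(leading_digit(100 * math.exp(rate * random.uniform(0, cycle)))
                 for _ in range(samples))

print("digit  fraction of time  Benford")
for d in range(1, 10):
    print(f"{d}      {counts[d] / samples:.3f}            {math.log10(1 + 1 / d):.3f}")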

We gave Greg an A on his project but you should read it here yourself. You can also see what Greg did in the Chance Course and some student projects here.

DISCUSSION QUESTION:

Repeat Greg's experiment replacing "nature" by a different word. Do you get similar results?


ESTIMATING THE NUMBER OF SPECIES

Hidden truths
New Scientist, 23 May, 1998, p 28
Robert Matthews

This article deals with the topic of missing data. Two major subtopics are discussed; one is the capture-recapture method, and the other is bias in clinical studies. As an example of the capture-recapture method, consider the following scenario. Suppose that we are trying to estimate the number of people who live in a certain area. We can pick a random sample from the area and consider the people in the sample to be 'captured'. Then we can pick another random sample from the same area and count how many of the captured people are picked the second time (i.e. 'recaptured'). Suppose that we pick a sample of 100 people the first time, and then we find that, in our second sample of size 100, 10 people have been recaptured. This tells us that the captured set is about 1/10 the size of the population, leading to an estimate of 1000 for the population size. This method has been applied in ecology, national defense, and public health, to estimate sizes of populations that are hard to count directly.
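In code, the estimate used in this example (often called the Lincoln-Petersen estimate) is a one-liner. Here is a minimal sketch of ours:

def capture_recapture_estimate(n_first, n_second, n_recaptured):
    """The fraction of marked individuals in the second sample should match
    the fraction of marked individuals in the whole population."""
    return n_first * n_second / n_recaptured

# The article's example: two samples of 100 people with 10 recaptures.
print(capture_recapture_estimate(100, 100, 10))   # 1000.0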

The second subtopic concerns missing data. Suppose, for example, that a drug has been developed to treat a certain condition. Before it can be introduced into the marketplace, it must be tested to see if it is safe and if it works. It is fairly clear that if one tests a drug sufficiently often, it is probable that at least one of the tests will show that the drug is effective. Unfortunately, there may be many other studies of the same drug that do not show this, and these latter studies are less likely to be published than the former one. The question in this case is whether the fact that these negative studies exist can be discerned from the studies that have been published.

It turns out that the answer to this question is 'sometimes'. One can use what is known as a 'funnel plot' to help determine whether there is missing data. One begins with the data from many small studies concerning the drug's efficacy. Since these studies are small, they tend to vary widely in their predictions concerning the drug. Larger studies are needed to establish statistically significant results. If one compares the results of the published larger studies with the smaller studies, two possibilities exist. The first one is that, while the larger studies tend to cluster more closely to a certain figure than the smaller ones, there is no bias in the larger studies with respect to this figure. The second possibility is that the larger studies are skewed (presumably in the direction that shows the drug is beneficial). This latter situation suggests that some larger studies have been conducted and have remained unpublished because they do not show that the drug is effective at a statistically significant level.

Matthias Egger and some colleagues at Bristol University have used funnel plots in this way to show that, in at least one quarter of 75 published medical studies, there were significant signs of missing data. These results have given more ammunition to medical scientists who have called for access to all study results.

DISCUSSION QUESTION:

(1) The Census Bureau plans to use the capture-recapture method to determine the undercount in the census 2000. This is done by carrying out two surveys (Mass Enumeration & PES Sample Enumeration) of a population of N blocks at about the same time. The unknown total population is

M = M11 + M10 + M01 + M00

where M11 is the number enumerated on both occasions; M10 the number enumerated on occasion one but not on occasion two, M01 the number enumerated on occasion two but not one, and M00 the number not enumerated on either occasion. Everything but M00 is known here. Assume the two surveys are independent and each individual has the same probability of being counted. Then you can estimate M00 by

M00' = M01M10/M11

and from this obtain an estimate of M as

M' = M11 + M10 + M01 + M00'.

How realistic do you think the assumptions are in this situation?

(2) Matthews mentions the following problem: Jones, a butterfly researcher, captures butterflies for a period of time m. He captures 5 different species with 4,1,10,20,2 representatives of these 5 species. Jones asks: if I were to return the next day and catch butterflies for a length of time n, how many new species of butterflies should I expect to get?

This is a famous and very difficult estimation problem. There is one elegant solution due to I.J. Good (Biometrika, Vol. 40, 1953, pp. 237-264) which works sometimes. Assume that butterflies of species s enter the net according to a Poisson process with rate v which depends on s. Then the distribution of the number of butterflies of species s caught in time t is Poisson with mean tv. Thus, before he starts his research, the probability that Jones catches no s butterfly on the first day but at least one s butterfly on the second day is

e^(-vm)*(1-e^(-vn)).

Using the infinite series expansion for e^x in the second term of this product permits us to write this product as:

P(X = 1)(n/m) - P(X = 2)(n/m)^2 + P(X = 3)(n/m)^3 - ...

where X is the number of s butterflies caught on the first day. Summing these equivalent expressions over all species shows that

Expected number of new species caught on the second day =

r(1)(n/m) - r(2)(n/m)^2 + r(3)(n/m)^3 - ...

where r(j) is the expected number of species represented j times in the first day's catch.

(3) Assume that n = m. Estimate the number of new species of butterflies that Jones should expect to capture on the second day.

Thisted and Efron (Biometrika, 1987, 74, 3, pp. 445-55) used this method to determine if a newly found poem of 429 words had about the right number of new words to be consistent with being written by Shakespeare. They interpreted a species as a word in Shakespeare's total works, discovered or not discovered. They took the time m to be 884,647, the number of words in existing Shakespeare works, and n to be 429, the number of words in the new poem. Using the above result they got an estimate of about 7 for the number of new words that should occur in a new work of 429 words, and since the new poem had 9 such new words they concluded that this was consistent with the poem being a work of Shakespeare.

(4) What do you think about applying the Poisson model to this kind of problem?

(5) It would be interesting to check this model on bird data. A natural thing to try would be the Christmas Bird Count (CBC). I'm told the data is available.


In his column, Unconventional Wisdom, Richard Morin included an account of the sea monster study that we mentioned in Chance News. As we read his account it occurred to us that this was a good topic for a Chance profile on sampling methods related to the capture-recapture method. We wrote the beginning of such a module and are including it in this Chance News with the hope of getting some suggestions for improvement or additions. We plan to include the simulation programs mentioned. Morin's discussion is such a nice introduction that we include it as he wrote it.

Unconventional Wisdom
Washington Post, 7 March, 1999

Statistics, the King's Deer And Monsters Of the Sea

How many species of sea monsters are left to be discovered? The correct answer is 47, give or take a couple, claims Charles Paxton of Oxford University.

Paxton isn't a prophet. He's a marine biologist who has created a statistical model that, he claims, predicts how many species of creatures measuring more than two meters--about six feet--still swim, slither or crawl undiscovered in the world's oceans.

Of course his prediction may be off a bit, he allows in a recent issue of the Journal of the Marine Biological Association of the United Kingdom. It also may just be possible that there's nothing new of any size lurking in the depths.

But Paxton doesn't think so. Buoyed by the results of his statistical model, he's confident that there are probably many more marine mammoths out there--including, perhaps, "a couple of new totally weird sharks"--and maybe even another Moby Dick. "Field observations suggest one or two species of odontocetes [toothed whales] probably await capture and formal description," he reported in his article.

His confidence is based on the results of a statistical technique frequently used by ecologists to estimate the diversity of populations, based on only a small sampling. It's derived from the work of the English mathematician and biologist Sir Ronald Aylmer Fisher.

But Paxton's work is reminiscent of a tale, perhaps apocryphal, told about the famed French mathematician Simon Denis Poisson (1781-1840), whose achievements were so notable that he's honored with a plaque on the first level of the Eiffel Tower. The late Stephen Withey of the University of Michigan liked to tell the story to skeptical graduate students (including your Unconventional Wiz) in trying to prove that it really is possible to count the uncountable.

In the early 1800s, the king of France summoned Poisson. His highness wanted to know how many deer there were in the Royal Forest. It seemed that the royal gamekeeper had a sharp eye but a dull mind: He could easily recognize individual deer after seeing them just once (yeah, we didn't believe it, either--but it made the story better). When could the poor rustic stop counting the deer, Poisson was asked. And how certain could the king be that his count was accurate?

Poisson came up with this solution: He told the gamekeeper to search the forest every day and tally every new deer he found. He cautioned the man not to add to the count when he came across any deer he had previously seen. Of course, there were lots of first-time sightings on the first day. The cumulative total increased only modestly on the second day. Every day thereafter, the total grew by smaller and smaller amounts.

At the end of each day, Poisson dutifully plotted each updated total on a graph. Then he connected the dots, creating a line that grew at a progressively slower rate and tended to reach what mathematicians call a "fixed upper limit"--the point at which one could safely predict that the chances of finding a previously undiscovered deer would be overwhelmingly small.

Roughly speaking, what Poisson is said to have done with the king's deer, Paxton has done with creatures of the sea. Paxton consulted the 10th edition of Carolus Linnaeus's "Systema Naturae," first published in 1758, and identified all saltwater animals more than two meters long. He found about 100 species. Then he searched subsequent scientific literature through 1995 and recorded the discovery by year of new species that met his "monster" test.

He found that, by 1995, the overall number of really big sea creatures had reached 217. But the rate of discovery had dropped dramatically (we're currently discovering one new big marine animal on average every 5.3 years, Paxton reported). He then graphed the data between 1830 and 1995 and estimated its upper limit, which suggests there are probably about 47 more species of whoppers still eluding us in the world's oceans.

In reading the Poisson story, the first thing that came to our mind was that the remarkable ability of the gamekeeper to distinguish every deer he has seen suggests that the King might also have used the capture-recapture method used by the Census Bureau in the undercount problem in this way. The King asks the gamekeeper to make a serious effort to see as many deer as he could on a specific day. Think of the deer he saw the first day as tagged. Then on the next day he could make an even greater effort to see as many as possible. Suppose the gamekeeper sees 35 deer on the first day and 60 on the second day, noting that 20 of these 60 deer he also saw on the first day. Then, assuming there are N deer walking around the forest more or less at random, the proportion of tagged deer in the second day's sample, 20/60, should be approximately the proportion of tagged deer in the forest, 35/N. Thus the King can estimate the number of deer in the forest to be 3*35 = 105. Thinking about the validity of the assumptions made in this hypothetical problem suggests problems that the Census Bureau must adjust for in their real world use of the capture-recapture method in the Census 2000.

Let's turn now to solving the problems that the King and Paxton posed. In the deer problem, it might be reasonable to assume that each deer has the same probability of being seen in a day. However, for the monster fish problem the probability of seeing a particular species of large fish in a given year would surely depend on the species. Thus to have a solution that will apply to both problems we will let the probability of seeing a particular deer differ from deer to deer.

Poisson suggested plotting the numbers that the gamekeeper saw each day for a series, say, of 50 days. When you do this you get an increasing set of points that suggest that a curve could be fitted to them that looks a bit like a parabola. This curve would be expected to level off when all the deer had been seen and the value at which it levels off would be our estimate for the total number of deer in the forest. Our problem is to determine the curve that best fits this data.

Assume there are M deer in the forest, which we label 1, 2, 3, ..., M. Let's assume that on a given day the gamekeeper sees deer j with probability p(j). Then the probability that he has not seen the ith deer after n days is (1- p(i))^n and so the probability that he has seen this deer after n days is 1- (1-p(i))^n. Thus the expected number of deer the gamekeeper has seen after n days is the sum of 1 - (1-p(i))^n for i = 1 to M. If we could estimate the probabilities p(i) this would provide a natural curve to fit to the data given by the gamekeeper.

The original deer problem would have all the p(j)'s the same, say p. In this case our expected curve is M(1-(1-p)^n). Of course, we do not know M and p. We can choose the curve that best fits the data by considering a range of M and p values and choosing the pair that minimizes the sum of the squares of the differences between the observed and predicted values. We carried out this procedure by simulation assuming M = 100 and p = .05 and the gamekeeper reported for 50 days. The best fit curve predicted the values M = 100 and p = .05 quite accurately.
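Here is a Python sketch of the kind of simulation and grid-search fit we have in mind. It is our own illustration, not the program we actually used, and the grid of M and p values searched is an arbitrary choice:

import random

random.seed(3)
M_true, p_true, days = 100, 0.05, 50

# Simulate the cumulative number of distinct deer seen by the end of each day.
seen, cumulative = set(), []
for _ in range(days):
    for deer in range(M_true):
        if random.random() < p_true:
            seen.add(deer)
    cumulative.append(len(seen))

# Fit the expected curve M*(1 - (1-p)^n) by least squares over a grid of (M, p).
def sum_of_squares(M, p):
    return sum((c - M * (1 - (1 - p) ** n)) ** 2
               for n, c in enumerate(cumulative, start=1))

best_M, best_p = min(((M, p / 1000) for M in range(50, 201) for p in range(10, 101)),
                     key=lambda mp: sum_of_squares(*mp))
print("estimated M and p:", best_M, best_p)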

For the species problem the p(j)'s are different, and so the problem is more difficult. Now we have M different values to estimate and fewer data points than that. Here researchers take advantage of the fact that the predicted curves look like parabolas. They replace this complicated expected value curve by the much simpler hyperbola of the form y = sx/(b+x), where x is the time observed and y the total number of species seen by this time; this curve levels off at s as x grows. Now they again have only two parameters, s and b, to estimate and again we can do this by choosing s and b to minimize the sum of the squares of the differences between observed and predicted values. To see how this works we simulated this experiment. We assumed that there are 100 different species and chose their probabilities of being seen in a given year at random between 0 and .1. We had the program estimate the best fit curve and estimate the total number of species from this curve. Using 100 species, we found that the resulting estimates were not too bad. However, the resulting hyperbolic curve did not fit the data very well. Even if it did fit the observed points, different models could fit just as well but give quite different limits. Curve fitting without an underlying model is a risky business.
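The following Python sketch shows what such a simulation and fit can look like. Again this is our own illustration rather than the program we used, and the least-squares fit is done by a crude grid search:

import random

random.seed(4)
n_species, years = 100, 50
probs = [random.uniform(0, 0.1) for _ in range(n_species)]   # chance of being seen in a year

# Cumulative number of distinct species seen by the end of each year.
seen, cumulative = set(), []
for _ in range(years):
    for s, p in enumerate(probs):
        if random.random() < p:
            seen.add(s)
    cumulative.append(len(seen))

# Fit y = s*x/(b + x) by least squares; the fitted s estimates the total number of species.
def sum_of_squares(s, b):
    return sum((y - s * x / (b + x)) ** 2
               for x, y in enumerate(cumulative, start=1))

best_s, best_b = min(((s, b) for s in range(50, 301, 2) for b in range(1, 101)),
                     key=lambda sb: sum_of_squares(*sb))
print("the fitted curve levels off at about", best_s, "species")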

The work of Fisher referred to by Richard Morin was for a slightly different problem. His problem can be described as follows: the King now wants to know how many species of butterflies there are in his forest. He asks the gamekeeper to go out one day with his net and see how many species of butterflies he catches. The gamekeeper does this and reports that he caught 32 butterflies of 5 different species, with numbers (12,7,5,3,5) for the different species. Fisher asked: how many new species would the gamekeeper expect to find on a second day of catching butterflies?

I. J. Good later gave an elegant new solution of Fisher's problem. Here is Good's argument.

It is natural to assume that the number of butterflies of a particular species that enter the gamekeeper's net has a Poisson distribution, since typically there are a large number of butterflies of a particular species and a small probability that any one would enter the gamekeeper's net. Let m be the mean number of butterflies of species s caught in a day. Then the probability that no butterfly of species s is caught on the first day is e^(-m) and the probability that at least one such butterfly of species s is captured on the second day is 1-e^(-m). Therefore the probability that no butterfly of species s is captured on the first day and at least one is captured on the second day is:

e^(-m)(1-e^(-m))

Using the series expansion for e^(x) in the second term of this product we can write this probability as:

e^(-m)(m - m^2/2! + m^3/3! - ...)

Now Good assumes that the mean m for a specific species is itself a chance quantity with unknown density f(m). Integrating our last expression with respect to f(m) we find that the probability that species s is not seen on the first day and at least one is captured on the second day is:

P(X = 1) - P(X = 2)+ P(X = 3) - ...

where X is the number of times species s is represented in the first day's capture. Summing this expression over all species we find that:

The expected number of new species found on the second day is

e(1) - e(2) + e(3) - ...

where e(j) is the expected number of species with j representatives on the first day. The e(j)'s can be estimated by the number of species represented j times on the first day's capture. Doing this we obtain an estimate for the number of new species we will find on the second day.

We have assumed that the sampling time was the same on the two days. If it is different, then our final sequence becomes

we(1) - w^2e(2) + w^3e(3) - ...

where w is the ratio of the time spent sampling on day 2 to the time spent on day 1.

The same procedure can be used to find an expression for the expected number of species in the second sample that occurred k times in the first sample. The result is

wC(k+1,k)e(1+k) - w^2C(k+2,2)e(2+k) + w^3C(k+3,3)e(3+k) - ...

where C(n,j) is the number of ways to choose j elements from a set of size n.

Thisted and Efron used this method to determine if a newly found poem of 429 words, thought to be written by Shakespeare, was in fact written by him.

They interpreted a species as a word in Shakespeare's total works. The sample on the first day corresponds to the 884,647 words in known Shakespeare works. The words in the new poem constituted the second sample. They then used Good's method to estimate the number of words in the poem that were used k times in the previous works. Then they compared these estimated values to the actual values.

In this example, w = 429/884647 is sufficiently small that only the first term wC(k+1,k)e(1+k) in our series need be used.

From this formula, we see that the expected number of words in the poem that were not used at all in previous works is we(1). There were 14,376 words in Shakespeare's original works that were used exactly once, so we estimate e(1) = 14,376. This gives an estimate of we(1) = (429/884647)*14376, or about 7, for the number of new words in the poem (the actual number was 9). In the same way we estimate that there should be 4 words that were used once in the previous works (the actual number was 7), and 3 words that should have been used twice (the actual number was 5). The authors concluded that this fit suggested that the poem was written by Shakespeare.
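The arithmetic for the new-word estimate is short enough to show in a few lines of Python (a sketch of ours; only the first term of the series is kept, as explained above):

from math import comb

w = 429 / 884647     # ratio of the poem's length to the length of the known works
e1 = 14376           # number of words used exactly once in the known works

# Expected number of words in the poem that never appear in the known works.
# Only the first term w*C(k+1,k)*e(k+1) with k = 0 matters, since w is tiny.
new_words = w * comb(1, 0) * e1
print(round(new_words, 1))   # roughly 7; the poem actually contained 9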

They made similar estimations for other Elizabethan poems by Shakespeare and other authors to see how well this method identified a work of Shakespeare. Their results showed that their method did distinguish works of Shakespeare from those of Donne, Marlowe and Jonson.

The use of simulation in these models highlights the fact that in statistics we are usually trying to estimate a quantity given incomplete information. If we set a lot of gamekeepers to gather the data we would typically get slightly different answers using the data from the different gamekeepers. We can see how much variation to expect by simulating this experiment.

While this example would take a couple of days to discuss in a Chance course, it uses nothing more difficult than the coin-tossing probability model and the Poisson approximation of this model.


David Rutherford sent us remarks on the Economist article on estimating the number of large salt-water species. He writes:

On the estimation of the number of as yet undiscovered salt-water species with length or width of at least 2 meters (6.6 feet), the author does not state that we are sampling only from that portion of the sea which trawlers fish (the bottom of sea trenches is relatively undersampled, for example, while the top portions of the sea are relatively oversampled). This is equivalent to sampling only the top half of the chocolate box. Here the analogy breaks down however, because the conditions at the bottom of the sea are markedly different (in terms of pressure, temperature and light) than conditions at the top of the sea, whereas the conditions in the chocolate box are fairly uniform. Presumably this qualifier appears in the original paper, either by Fisher or Paxton (ie 47 species remaining in that part of the sea which is or has in the past been sampled).

David Rutherford
Melbourne, Australia

Editor's comments: Fisher and his colleagues (R. A. Fisher, A.S. Corbet, C.B. Williams, Journal of Animal Ecology, 1943, Vol. 12, pp. 42-48) state that their estimates apply only to a region for which the sample is representative. So, for example, estimates for the number of species of butterflies based on a sample at the bottom of a mountain would not apply to the number in an area of higher elevation. Likewise, the estimates for one season need not apply to another. Paxton does not discuss this issue directly, but after remarking that his model assumes a constant sampling effort, which may well not be satisfied, he writes:

The validity of the analysis presented here would also be in doubt if many new species are split from existing species by information gained by molecular techniques or new methods of biological sampling found large numbers of new species of large marine fauna in as yet unexplored habitats.


OSCAR WINNERS LIVE LONGER

Study: Oscar winners tend to live longer.
Nando Times, 14 May 2001
Michael Rubinkam
Available at Nando Times until 27 May 2001

Survival in Academy Award-winning actors and actresses.
Annals of Internal Medicine, Vol. 134, No 10, 15 May 2001
Donald A Redelmeier, Sheldon M. Singh

The abstract for this study as presented in the Annals of Internal Medicine described the study as follows:

Background: Social status is an important predictor
of poor health. Most studies of this issue have focused
on the lower echelons of society.

Objective: To determine whether the increase in status
from winning an academy award is associated with
long-term mortality among actors and actresses.

Design: Retrospective cohort analysis.

Setting: Academy of Motion Picture Arts and Sciences.

Participants: All actors and actresses ever nominated
for an academy award in a leading or a supporting role
were identified (n = 762). For each, another cast member
of the same sex who was in the same film and was born
in the same era was identified (n = 887). Measurements:
Life expectancy and all-cause mortality rates.

Results: All 1649 performers were analyzed; the median
duration of follow-up time from birth was 66 years, and
772 deaths occurred (primarily from ischemic heart disease
and malignant disease). Life expectancy was 3.9 years
longer for Academy Award winners than for other, less
recognized performers (79.7 vs. 75.8 years; P = 0.003).
This difference was equal to a 28% relative reduction in
death rates (95% CI, 10% to 42%). Adjustment for birth
year, sex, and ethnicity yielded similar results, as did
adjustments for birth country, possible name change,
age at release of first film, and total films in career.
Additional wins were associated with a 22% relative
reduction in death rates (CI, 5% to 35%), whereas
additional films and additional nominations were not
associated with a significant reduction in death rates.

Conclusion: The association of high status with increased
longevity that prevails in the public also extends to celebrities,
contributes to a large survival advantage, and is partially
explained by factors related to success.

Those who win more than one academy award are counted only once, which explains the difference between the number nominated (762) and the number of controls (887).

As seen above, Oscar winners have a life expectancy 3.9 years greater (79.7 vs. 75.8 years) than that of the matched controls from the same movies who were not nominated for Oscars. The researchers report that the winners have about the same advantage over those who were nominated but did not get an Oscar -- 79.7 vs. 76.

The researchers comment that this increase of about 4 years in longevity is equal to the estimated societal consequence of curing all cancers in all people for all time. They do not have any simple explanation for this increase in longevity, but they suggest the following possible explanations: Oscar winners are under greater scrutiny, which may lead to a more controlled life to maintain their image; they may have managers who have a vested interest in their reputation and enforce high standards of behavior; and they may have more resources and so can avoid stress and have access to special privileges that others do not.

The usual explanations for the longer life expectancy of the rich over the poor (better schooling, better health care, etc.) do not seem to apply here.

The Nando Times remarks:

Examples of long-lived Oscar-winners abound. John
Gielgud who won for "Arthur," was 96 when he died
last year. George Burns ("The Sunshine Boys") lived
to 100. Leading lady Greer Garson ("Mrs. Miniver")
reached 92. So did Helen Hayes ("The Sin of Madelon
Claudet," "Airport"). Katharine Hepburn - with a record
four Academy Awards - turned 94 on Saturday.

DISCUSSION QUESTIONS:

(1) What reasons can you think of for this increased longevity?

(2) In discussing the limitations of their study the authors say that they should have had more biographical information about those in the study. What would they have looked for if they did?


We usually do not go into the technical details of a study that we discuss in Chance News but decided that it might be fun to try this. We chose the study "Survival in academy award-winning actors and actresses," Annals of Internal Medicine, Vol. 134, No. 10, by Donald A. Redelmeier and Sheldon M. Singh, discussed in Chance News 10.05. Redelmeier was also the lead author of the interesting study on the danger of using a cell phone while driving, discussed in Chance News 6.03 and 6.10 and in an article "Using a car phone like driving drunk?" in Chance Magazine, Spring 1997.

Recall that for their Oscar study the authors identified all the actors and actresses who had been nominated for an award for a leading or supporting role since the Oscar awards were started 72 years ago. For each of them, they identified another cast member of the same gender in the same film and born in the same era. This provided a group of 887 actors to be used as a control group for their study of Oscar winners. From those nominated there were 235 Oscar winners. The authors wanted to determine if Oscar winners tend to live longer than comparable actors who were not winners. Thus the key question is: how do you decide whether there is a significant difference in the life expectancy of members of two different groups?

A similar problem arises in a medical trial in which one group of patients is given a new treatment and a second group is given a placebo or a standard treatment, and the researchers are interested in the expected time until a particular "end event" occurs, such as death, the disappearance of a tumor, the occurrence of a heart attack, etc. The test that is generally used for such studies, and was used in the Oscar study, is called the "Kaplan-Meier survival test". This test was developed by Kaplan and Meier in 1958 ("Nonparametric Estimation from Incomplete Observations," Journal of the American Statistical Association, Vol. 53, No. 282, 457-481). The importance of this test is suggested by the fact that the Science Citation Index shows that over 22,000 papers have cited the 1958 Kaplan and Meier paper since 1974. We are willing to bet that no reader can, without help from the Internet or a friend, guess a scientific paper that has a larger number of citations.

A good description of how the Kaplan-Meier test is carried out can be found in Chapter 12 of the British Medical Journal's on-line statistics book "Statistics at Square One."

The Kaplan-Meier test requires that we construct a life table for the Oscar winners and the control group. We start by reminding our readers how life tables are constructed for the US population. The most recent US life tables can be found in the CDC's National Vital Statistics Reports, Volume 48, Number 18 (http://www.cdc.gov/nchs/products/pubs/pubd/nvsr/48/lifetables98.htm) and are based on 1998 data. Life tables are given by sex and race and also for the total US population. The following table is from the first 10 rows of the life table for the 1998 US population.

Table 1. From the 1998 US Population Life Table.

The first column indicates the first 10 age intervals.

The second column gives the proportion q(x) dying in each age interval, determined as the number who died in this interval in 1998 divided by the US 1998 population at the midpoint of the age interval.

The third column starts with a cohort of 100,000 at birth and gives the number expected still to be alive at the beginning of each age interval. To compute these numbers we start with l(1) = 100000 and then use the recursion relation l(x+1) = l(x)(1- q(x)) for x >= 1. (Note that 1-q(x) is the proportion of those alive at time x that survive at least one more year.) For example, l(2) = 100000(1- .00721) = 99279, l(3) = 99279(1-.00055) = 99,225, etc. The quantity l(x) can be interpreted as the probability that a newborn child will live to year x. For any year t greater than or equal to x, the quantity l(t)/l(x) can be interpreted as the probability that a person who has lived to year x will live to year t.

To determine the life expectancy of a newborn baby we need only sum l(x) for all x. (Recall that for a discrete random variable X the expected value of X can be computed by the sum over all x of Prob(X >= x).) To find the life expectancy for a person who has reached age x we add the values of l(t)/l(x) for t greater than or equal to x. From the table we see that the life expectancy for a person at birth is 76.7 while for a person who has survived 9 years it is 69.4, making a total life expectancy of 79.4. Thus there is a 2.7 year bonus for having survived 9 years. You can view the entire Life Table here and check your own bonus. We have a 10-year bonus for surviving so long but, alas, only a 10.7 year additional expected lifetime.
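The calculations behind a life table are simple enough to sketch in a few lines of Python. In the sketch below (ours), only the first two q(x) values come from the table; the rest are made-up placeholders, so the printed life expectancy is not the real 1998 figure.

def life_table(q):
    """Given q[x], the proportion dying during year x, return the list l with
    l[x] the expected number, out of 100,000 newborns, alive at the start of year x."""
    l = [100_000.0]
    for qx in q:
        l.append(l[-1] * (1 - qx))
    return l

def life_expectancy_at(x, l):
    """Add the survival probabilities l(t)/l(x) for t >= x, as described in the text."""
    return sum(lt / l[x] for lt in l[x:])

# The first two q values are from the table; the remaining ones are placeholders.
q = [0.00721, 0.00055] + [0.001] * 98
l = life_table(q)
print(round(l[1]), round(l[2]))             # about 99,279 and 99,225, as in the text
print(round(life_expectancy_at(0, l), 1))   # life expectancy at birth for this made-up table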

A survival curve is a plot of the probability l(x)/100000 of living x or more years as a function of age. Using the full life table we obtain the following survival curve for the US population.

Figure 1 Survival curve for US population 1998

For a discussion of the technical problems in producing and interpreting life tables, see the article "A method for constructing complete annual U.S. life tables" by R. N. Anderson.

We return now to the Oscar study. In studies like this and others where the Kaplan-Meier test is used, we have incomplete information about the length of time to the end event -- death in the case of the Oscar study. Some of the Oscar winners will have died by the time the study is completed but others will not have died by this time. Still others may have been lost by the researchers after being followed for some time. For those who have not died we know only that they lived to a certain age, and at this point we say that they have "left the study." The Kaplan-Meier procedure allows us to use this information, along with the information about those who have died, in making life tables.

Dr. Redelmeier provided us with the data needed to carry out the Kaplan-Meier test. In Table 2 we show the first 10 entries in his data set for the members of the control group:

0 means died in year x
1 means lost to the study in year x

Table 2: How the data is presented.

The first column gives the number of years that it is known the actor survived. For the first actor this was 56 years and the 0 in column 2 indicates that this actor died in his 56th year. The second actor survived 78 years and the 1 in column 2 indicates that this actor was lost to the study in his 79th year.

Recall that in the actual study there were 235 Oscar winners and 887 controls. To discuss how the Kaplan-Meier test works, using a manageable number of subjects, we chose a random sample of 30 Oscar winners and a random sample of 100 controls from the authors' data set.

We first want to determine a survival curve for each of the two sample groups. Recall that the key information for determining a survival curve for the US population was the estimated probability that a person who has lived to the beginning of year x survives this year. We use the same kind of quantity, which we will again call q(x), to construct the survival curve when we have incomplete information. This is done by constructing Table 3. In column 2 of Table 3 we have listed, in increasing order, the years that the 30 Oscar winners died or were lost to the study.

In column 3 we put a 0 if the actor died and a 1 if the actor was lost to the study. Note that in some years there was more than one actor. For example there were three actors who either died or were lost to the study in their 54th year -- one died and two were lost to the study.

In column 4 we put the number n(x) of Oscar winners known to be alive at the beginning of year x.

In column 5 we put the number of Oscar winners d(x) who died in year x. Then (n(x)-d(x))/n(x) is the proportion q(x) of those alive at the beginning of year x who survived this year. These values appear in column 6.

Finally we use q(x) to estimate the probability that an Oscar winner will live at least to the beginning of year x. As in the case of the traditional life table, l(0) = 1 and, for values greater than 0, l(x) is calculated by the recursion equation l(x) = l(x-1)*q(x). The values of l(x) appear in the last column.

Case  Year x (died or left study)  Status (0 = died, 1 = lost to study)  n(x) alive at start of year x  d(x) deaths in year x  q(x) proportion surviving year x  l(x) proportion surviving at least x years
0           1
1 27 1 30 0 1 1.000
2 37 1 29 0 1 1.000
3 38 1 28 0 1 1.000
4 45 1 27 0 1 1.000
5 50 1 26 0 1 1.000
6 51 1 25 0 1 1.000
7 54 1 24 1 23/24 0.958
8 54 1        
9 54 0        
10 57 1 21 1 20/21 0.913
11 57 0        
12 61 0 19 2 17/19 0.817
13 61 0        
14 63 1 17 0 1 0.817
15 65 0 16 1 15/16 0.766
16 69 1 15 0 1 0.766
17 73 1 14 0 1 0.766
18 73 1        
19 74 1 12 0   0.766
20 75 1 11 0   0.766
21 77 0 10 2 8/10 0.612
22 77 0       0.612
23 78 0 8 1 7/8 0.536
24 81 0 7 1 6/7 0.459
25 82 1 6 0   0.459
26 84 0 5 1 4/5 0.367
27 85 0 4 1 3/4 0.276
28 92 1 3     0.276
29 93 0 2 1 1/2 0.138
30 96 1 1 0 1 0.138

Table 3 Determining the survival probabilities.

Plotting l(x) we obtain the following survival curve for the Oscar winners.

Figure 2 The Survival curve for the sample of Oscar winners

Making a similar table for the sample control group and plotting the survival curve we obtain the survival curve for this group:

Figure 3 The Survival curve for the sample control group

Putting the two together for comparison we obtain:

Figure 4 Survival curve for the sample of 30 Oscar winners and 100 controls

The upper curve is the survival curve for the winners and the lower curve is for the controls. These curves suggest that the winners do tend to live longer than the controls. We can get another indication of this by computing the expected lifetime for members of each group. As for the US population, we simply sum l(t) over all years t. Doing this we find that the expected lifetime for an Oscar winner is 78.7 and for members of the control group it is 74.7, again indicating that the Oscar winners live longer on average than the controls.
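Readers who would like to reproduce this kind of calculation could start from a sketch like the following. It is our own Python illustration, not the program we used; the first two records match the examples described for Table 2, and the remaining records are made up.

def km_survival(records):
    """Kaplan-Meier survival proportions from (age, status) pairs, where
    status 0 means died in that year and 1 means lost to the study.
    Returns a list of (age, l(age)) pairs at the ages where something happened."""
    records = sorted(records)
    curve, l, i = [], 1.0, 0
    while i < len(records):
        age, at_risk, deaths = records[i][0], len(records) - i, 0
        while i < len(records) and records[i][0] == age:
            if records[i][1] == 0:
                deaths += 1
            i += 1
        l *= (at_risk - deaths) / at_risk      # proportion surviving this year
        curve.append((age, l))
    return curve

def expected_lifetime(curve, max_age=110):
    """Sum the survival step function over whole years, as in the text."""
    total, l, j = 0.0, 1.0, 0
    for age in range(1, max_age + 1):
        while j < len(curve) and curve[j][0] <= age:
            l = curve[j][1]
            j += 1
        total += l
    return total

# The first two records match the examples in Table 2; the rest are made up.
sample = [(56, 0), (78, 1), (62, 0), (85, 1), (91, 0), (70, 0), (77, 1)]
curve = km_survival(sample)
print(curve)
print(round(expected_lifetime(curve), 1))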

We wrote a program for the above calculations. We then used this program to compute the survival curves for the original study with 235 Oscar winners and 887 controls. Putting the two curves together for comparison we obtain:

Figure 5 Survival curves for all 235 Oscar Winners and 887 controls

Again the top curve is the survival curve for the Oscar winners and the bottom curve is for the control group. Using all the data we find that the expected lifetime for an Oscar winner is 79.7 years and for the controls it is 75.8 years, which reflects the same difference we saw in our sample.

We must now tackle the question of how the authors concluded that the approximately four-year difference found in the study was significant, i.e., could not be accounted for by chance. For this, the authors used a test called the "log-rank test".

We say an event happened in age year x if at least one subject died or was lost to the study during this year. For each group and each age year we count the number still being followed at the beginning of the year and the number of deaths during the year. For example, in our sample we find that in age year 61 we were still following 19 Oscar winners, 2 of whom died in this year. For the control group we were still following 65 controls, one of whom died this year. Thus there was a total of 84 subjects still being followed at the beginning of age year 61 and a total of 3 deaths during this year. Now, under the hypothesis that there is no difference between the Oscar winners and the controls, these 3 deaths should be randomly chosen from the 84 people still being followed. Thus we can imagine an urn with 84 balls, 19 marked O for Oscar winner and 65 marked C for control. Then father death chooses three balls at random from this urn to determine the deaths. The number of deaths chosen from the Oscar winners group then has a hypergeometric distribution. The probability that any particular death is an Oscar winner is 19/84, so the expected number of Oscar winner deaths in year 61 is 3*(19/84) = .679. The observed number was 2. We also need to calculate the variance of the number of deaths among the Oscar winners, which means we need the variance of the hypergeometric distribution.

Assume that you have n balls in an urn, k of which are red and n-k of which are black. Then if you draw m balls at random without replacement, the expected number of red balls is

                          e = m*(k/n)

and the variance is

                          v = m*(k/n)*((n-k)/n)*((n-m)/(n-1)).

In our example the red balls are Oscar winners, so the variance for the number of Oscar winners who died in the 61st year is

                          v = 3*(19/84)*(65/84)*(81/83) = .512.
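
These numbers are easy to check against the hypergeometric distribution directly; here is a quick sketch using scipy.stats.hypergeom (our choice of tool for the check, not part of the study).

    # Check of the age-61 expected value and variance: 84 subjects in all,
    # 19 of them Oscar winners, and 3 deaths "drawn" at random.
    from scipy.stats import hypergeom

    deaths_among_winners = hypergeom(84, 19, 3)   # (total, # winners, # deaths)
    print(deaths_among_winners.mean())            # about .679, the value e above
    print(deaths_among_winners.var())             # about .512, the value v above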

Then to carry out the log-rank test we do the above calculations for each age year for one particular group; we chose the Oscar winners. Let O be the sum over all years of the observed numbers of Oscar winners who died, E the sum of the expected values, and V the sum of the variances. Then the statistic

                          S = (O - E)/sqrt(V)

will, by the central limit theorem, be approximately standard normal, so S^2 will have approximately a chi-square distribution with 1 degree of freedom. Note that in summing the variances we are assuming that the observed numbers of Oscar winners who die in different years are independent. This is reasonable because we are conditioning on the number in each group that we are still following and the total number of deaths in a given year; these determine the distribution of the number of Oscar winners who die in that year under our assumption that there is no difference between the two groups.
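
Here is a minimal sketch of this computation in Python (again not the program we used for the numbers below, just an illustration of the method). It assumes that for each age year in which a death occurred we have already tallied, for each group, the number still being followed and the number of deaths; the illustrative call at the end uses only the age-61 counts worked out above, so a real run would pass one tuple for every age year in the data.

    from math import sqrt

    def log_rank_statistic(per_year_counts):
        """per_year_counts: one tuple (n_oscar, d_oscar, n_control, d_control)
        for every age year in which at least one subject died."""
        O = E = V = 0.0
        for n_o, d_o, n_c, d_c in per_year_counts:
            n = n_o + n_c                    # total still being followed
            d = d_o + d_c                    # total deaths this year
            O += d_o                         # observed Oscar-winner deaths
            E += d * n_o / n                 # expected deaths under the null hypothesis
            if n > 1:                        # hypergeometric variance of the Oscar deaths
                V += d * (n_o / n) * (n_c / n) * (n - d) / (n - 1)
        S = (O - E) / sqrt(V)
        return S * S                         # approximately chi-square, 1 degree of freedom

    # Illustrative call with only the age-61 counts (19 winners, 2 deaths;
    # 65 controls, 1 death); it prints about 3.4 for this single year.
    print(log_rank_statistic([(19, 2, 65, 1)]))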

Using our program to carry out these calculations for the data for this study we found S^2 = 9.1246. The probability of finding a chi-squared value greater than this is .0025 so this indicates a significant difference between the Oscar winners group and the control group.
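
The quoted tail probability can be checked with one line (using scipy, our choice of tool for the check):

    # Upper-tail probability of a chi-square(1) variable exceeding 9.1246.
    from scipy.stats import chi2
    print(chi2.sf(9.1246, df=1))   # about .0025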

To check our program we carried out the calculations with the Kaplan-Meier procedure in the SAS statistical package, as the authors did. SAS yielded the same survival curves as our program, and for the significance tests SAS reported:

Table 4. Significance tests produced by SAS.

Thus the log-rank test agreed with our calculation. SAS also provided two other tests that might have been used, both of which would also result in rejecting the hypothesis that there was no difference between the two groups. This ends our saga.


WEATHER FORECASTING

Here is an account by Dan Rockmore of another chance exploration carried out recently by Dan and Laurie. This is a sequel to a previous exploration in which they visited local weather forecaster Mark Breen (see Chance News 8.04).

Whither the weather balloon?

Recently, our quest to understand the process of weather prediction led us to take a trip to visit the National Weather Service (NWS) Forecast Office in Gray, Maine. Our goal was to witness statistics in action, in the form of the launch of a real, live weather balloon.

Some time ago, our friend Mark Breen, the local Vermont Public Radio forecaster had shown us how some of his forecasting depended on the regional forecast generated from this NWS office. In turn, their own forecast used the data generated by weather balloons which are launched every day, at seven o'clock in the morning and seven o'clock in the evening at about 70 sites around the country. We couldn't understand how we had never seen one of these balloons. This was clearly a job for the Chance News action team, so off we went to witness a launch, ask some questions and shed some light on the mystery of the missing balloons.

We arrive in nearby Portland, Maine on the evening before our scheduled morning meeting at NWS, just in time to take in a Portland SeaDogs baseball game and have a great dinner at the Fore Street restaurant (order the Loup de Mer if it is on the menu!). We awake at 5:30 AM the next day and are on the road by 6 heading for the weather station. Weather prediction is clearly a coffee-intensive activity.

We wind our way over the local highways on a drizzly and appropriately gray morning in Gray. Our final destination is a medium-sized reddish brick, official-looking building, which is the regional weather prediction center of the National Oceanic and Atmospheric Administration (NOAA). You can see pictures of the office at

http://www.seis.com/~nws/officepix.html.

For a brief history of the office see

http://www.seis.com/~nws/historyPWM.html.

You'll be interested to discover that the National Weather Service traces its origins to President Ulysses S. Grant, who gave responsibility for its creation to the Secretary of War.

We ring the bell and are let into the center. Weather prediction is a 24 by 7 job and some night-shift scientists are around, as well as the scientist in charge of the sacred balloon launch, Art Lester, one of the hydro-meteorological technicians. In fact Art has already been there for about an hour, since it is his responsibility to prepare the balloon at around 6 AM for its 7 AM launch. Art gives us a tour of the operations room, which is basically a computer room where computers are running weather models and weather maps are displayed on every monitor. We ask a bunch of questions about the pictures. The scientists can't help but look and sound like weather forecasters as they respond. Their hands sweep across the monitors as they trace out fronts moving this way and that, lows evolving into highs, and temperature isotherms. Other maps show clouds of precipitation and locate lightning strikes. The weather prediction models are being run in preparation for the construction of the morning regional forecast, which is put together by our host John Jensenius, the Warning Coordination Meteorologist, who arrives at about 6:40.

John takes us aside to show us a weather balloon up close, along with the instrument that it carries into the sky, a radiosonde. The balloon itself is a silky yellowish-brown sac which will be filled with helium and released into the air, and it should rise about 17 miles. The accompanying radiosonde carries sensing instruments encased in styrofoam. The instruments measure temperature, humidity, and barometric pressure at different altitudes. Wind speed and direction are also inferred by tracking the signal and thereby monitoring the movement of the balloon and radiosonde. The readings are sent back via radio waves -- separate frequencies are reserved for the different variables. In fact, as a way of checking that the radiosonde is on-line, the measurements are "played" as a set of tones of different frequencies. This gives new meaning to songs like "You Are My Sunshine" and "Thunder and Lightning".

Launch time is approaching. Everything checks out at the office and now it's time to launch! We go out and get into the car for a quick drive up to the launch site -- it looks like a little observatory, really just a largish tall garage. We go inside and there is the balloon, tied to the table, with the radiosonde tied to it. The garage door is opened and final preparations are made, checking again that the radiosonde is secured and transmitting. The balloon is walked outside and, with no ceremony at all, released, and it speeds into the sky. Its rate of ascent makes it clear to us why we have never seen one floating lazily in the sky: with the low ceiling on this cloudy day it is out of sight in ten seconds, and we imagine that even on a clear day it would be invisible in less than 30 seconds. As it rushes up to the clouds, the highly malleable skin flattens out in the early morning wind, looking more like a large lumpy pillow (UFO?) than a balloon, and soon it is gone into the low clouds on this rainy morning. The radiosonde continues to transmit its song of the weather.

The readings will continue to be sent for about 2 hours. After this the balloon usually bursts (this was tested on the ground) and begins to descend rapidly to earth, often ending up in the ocean or a tree somewhere, which is probably why we have yet to come across one in our daily wanderings. In general, those launched on the east coast are rarely found, although in other places, like the landlocked Midwest with its wide-open plains, they are found relatively often. The radiosonde has a little notice on it, assuring anyone who finds it that the equipment is perfectly safe and asking that it be mailed back (for free) to the regional office whose address is on the label. Each launch costs about 200 dollars, including the money spent on the scientists' time. It is believed that this process will be automated in the near future, and that ultimately (possibly as soon as ten years from now), as satellite imaging gets better, there may very well come a day when the balloons are unnecessary.

We get back into the car to return to the main building for our debriefing. We continue our discussion with John about the role that the forecast plays and the many, many tools now available for predicting the weather. John will be using the radiosonde data, as well as the most recent model output and satellite data (animated as brief movies made by looping the pictures taken over time by the satellites), to prepare the day's area forecast discussion, which is posted on the web (see

http://www.seis.com/~nws/mesnhs.html

for the appropriate link as well as other related links). Comparison of the model predictions with incoming weather data helps to generate the forecast. We remember that the regional forecast is used by our own Mark Breen on VPR to help create his local forecast.

John explains that there have been great improvements in the 3-, 4-, and 5-day forecasts, but that beyond that accurate predictions remain elusive. Once again we discuss the age-old bugbear, "What does probability of precipitation mean?" as well as its corollary, "If a 20 percent chance of rain is predicted, and it rains, is the forecaster correct?" Here is our e-exchange; two short code sketches illustrating John's answers follow it:

Dan: Given some initial conditions, the model either outputs precipitation, or no precipitation - the algorithm then goes back and checks to see historically what percent of time these initial conditions actually did produce precipitation. Is that right?

John: Yes. The machine-generated probabilities are based on equations developed with forward stepwise multiple linear regression, based on a historical sample of past observed data (the predictand) and past model forecasts (the predictors). The regression can select from a considerable number of predictors, but usually the equations are limited to between 10 and 12 predictors. For the Probability of Precipitation (PoP), the most important predictors are generally the model-forecast relative humidity, model-forecast precipitation occurrence, model-forecast precipitation amount, and model-forecast vertical motion. Some of these are usually transformed into binary and grid-binary formats in addition to being offered in the raw formats. In addition, the model forecasts are space-smoothed to help incorporate some of the uncertainty inherent in the computer-generated forecasts.

Dan: If the weatherman predicts a 20% chance of rain, and it rains, is he right?

John: Only two forecasts can be absolutely wrong. A forecast of a 0% probability of rain when it rains and a forecast of 100% probability when it doesn't rain. All other probabilities must be evaluated in terms of the forecast reliability. To evaluate the reliability, you must gather a sufficient sample of cases of each probability (for example 20%). Then you must determine the observed relative frequency of precipitation for those specific cases. If the observed relative frequency closely matches the forecast probability, then the forecast is reliable. By looking at the observed relative frequencies for each of the forecast probabilities, you can see whether the forecasts are reliable over the entire forecast range of probabilities. One other factor to consider is the ability of a forecast system to discriminate between the rain/no rain cases. To simply always forecast the climatic observed relative frequency of precipitation will give you reliable forecasts given a sufficiently large sample, but will do nothing to discriminate between the rain/no rain cases. Both reliability and discrimination factor into the Brier Score.
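
The kind of regression John describes can be sketched in a few lines of Python. The toy below is ours, not NWS code: the data are random placeholders and scikit-learn's SequentialFeatureSelector stands in for the classical forward stepwise screening of predictors.

    # Toy forward-stepwise linear regression for a probability of
    # precipitation, in the spirit of John's description. All data here
    # are random placeholders, not real model output.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.feature_selection import SequentialFeatureSelector

    rng = np.random.default_rng(0)
    X = rng.random((500, 6))    # columns stand in for model-forecast predictors
    rained = (X[:, 0] + X[:, 1] + 0.3 * rng.random(500) > 1.2).astype(float)

    # Forward selection keeps only the most useful predictors.
    selector = SequentialFeatureSelector(LinearRegression(),
                                         n_features_to_select=2,
                                         direction="forward").fit(X, rained)
    model = LinearRegression().fit(selector.transform(X), rained)
    pop = model.predict(selector.transform(X)).clip(0, 1)   # crude PoP values
    print(pop[:5])

And here is an equally rough sketch of how the reliability and the Brier score that John mentions can be computed from a record of forecast probabilities and outcomes (again with made-up numbers):

    # Reliability and Brier score for a set of probability forecasts.
    from collections import defaultdict

    forecasts = [0.2, 0.2, 0.2, 0.2, 0.2, 0.6, 0.6, 0.6, 0.9, 0.9]  # forecast PoP
    rained =    [0,   0,   1,   0,   0,   1,   0,   1,   1,   1]    # 1 = it rained

    # Reliability: for each forecast probability, compare it with the observed
    # relative frequency of rain on the occasions that probability was issued.
    cases = defaultdict(list)
    for p, outcome in zip(forecasts, rained):
        cases[p].append(outcome)
    for p, outcomes in sorted(cases.items()):
        print(f"forecast {p:.1f}: observed frequency {sum(outcomes) / len(outcomes):.2f}")

    # Brier score: mean squared difference between forecast and outcome
    # (lower is better); it rewards both reliability and discrimination.
    brier = sum((p - o) ** 2 for p, o in zip(forecasts, rained)) / len(forecasts)
    print(f"Brier score: {brier:.3f}")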

Finally it's time to leave. We've tracked down the elusive weather balloon, yet unbelievably the joys of discovery are not yet over. As we make our way back to the interstate, we decide to stop for a proper breakfast at a roadside diner, Stones Grove. The six pick-up trucks parked in front are a good omen. We are not disappointed as we enjoy the best home fries we've ever tasted. The diner is full of regulars, and we join in as the whole gang sings Happy Birthday to Tom the cook, who receives an 8-pound torque wrench, a sweater, a pineapple upside-down cake, and a gift certificate to L.L. Bean. As we climb back in the car and head home, we congratulate each other on a perfect trip. Now what were the chances of that?!

DISCUSSION QUESTION:

Read how the Brier Score is defined in an article by Harold Brooks at:

http://www.nssl.noaa.gov/~brooks/media/okcmed.html

Read how Peter Doyle would measure the validity of a weather forecaster at:

http://math.dartmouth.edu/~doyle/docs/email/forecast.txt

What is the difference between these two approaches? Which do you think is better? Compare the two methods as applied to the data given in Harold Brooks' article.


Copyright (c) 2004 Laurie Snell
This work is freely redistributable under the terms of the GNU General Public License published by the Free Software Foundation. This work comes with ABSOLUTELY NO WARRANTY.

BEST OF CHANCE NEWS
APRIL 17 2004