Best of Chance News 13.03

Best of Chance News
APRIL 17, 2004

Prepared by J. Laurie Snell, Bill Peterson, Jeanne Albert, Charles Grinstead, and Myles McLeod with help from Fuxing Hou and Joan Snell. We are now using a listserv to send out notices that a new Chance News has been posted on the Chance Website. You can sign on or off or change your address here. This listserv is used only for this posting and not for comments on Chance News. We do appreciate comments and suggestions for new articles. Please send these to jlsnell@dartmouth.edu. Chance News is based on current news articles described briefly in Chance News Lite.

The current and previous issues of Chance News and other materials for teaching a Chance course are available from the Chance web site.

Chance News is distributed under the GNU General Public License (so-called 'copyleft'). See the end of the newsletter for details.

Contents of Best of Chance News

(1) Benford's Law

(2) Estimating the number of species

(3) Oscar winners live longer.

(4) Weather forecasting.


 

BENFORD'S LAW

He's got their number: Scholar uses math to foil financial fraud.
Wall Street Journal, 10 July 1995
Lee Berton


Mark Nigrini, who teaches accounting at St. Mary's University in Halifax, wrote his PhD thesis on "The detection of income evasion through an analysis of digital distributions". He has persuaded business and government people to use Benford's law to test suspicious financial records such as bookkeeping entries, checks, and tax returns. The article states that Benford's law "lays out the statistical frequency with which the numbers 1 through 9 appear in any set of random numbers". Actually, Benford's law states that the distribution of the leading digit in data sets is typically not equi-distributed but rather given by the distribution p(k) = log(k+1) - log(k) for k = 1,2,...,9. (The leading digit of .0034 is 3, that of 243 is 2, etc.) Numerous explanations for this have been given, but perhaps the most persuasive is that Benford's distribution is the unique distribution for the leading digits that is not changed by a change of units, i.e. by multiplying the data by a constant c.

(For a recent discussion of Benford's distribution and further references see: Theodore P. Hill, The significant digit phenomenon, "The American Mathematical Monthly", April 1995.)

Nigrini's idea is that, if we are honest, the numbers in our tax returns and on our checks should satisfy Benford's law, and if they do not there may be some skullduggery.

The article states that "Mr. Nigrini has also lent his expertise to federal and state tax authorities, officials in Denmark and the Netherlands and to several companies. He has even put President Clinton's tax returns to the Benford's Law test. When he analyzed the president's returns for the past 13 years he found that 'the returns by Clinton follow Benford's Law quite closely'".

DISCUSSION QUESTIONS:

1. Would you expect Benford's distribution to apply to the number of hits a baseball player gets in a year? To the prices of stocks on a given day? To the populations of cities in the United States?

2. Mr. Smith is quite wealthy and makes over a hundred charitable contributions each year. Do you think the distribution of the leading digits of these numbers would have a Benford distribution if he is honest? If he cheats, why might they not have a Benford distribution?

3. Compute the first 100 powers of 2 and show that the leading digits have a Benford distribution.
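For readers who want to try question 3, here is a short Python sketch of ours (not part of the article) that tabulates the leading digits of the first 100 powers of 2 and compares them with the Benford probabilities:

import math
from collections import Counter

# Leading digits of 2^1, 2^2, ..., 2^100.
counts = Counter(int(str(2 ** n)[0]) for n in range(1, 101))

print("digit  observed  Benford")
for d in range(1, 10):
    benford = math.log10(d + 1) - math.log10(d)
    print(f"{d}      {counts[d] / 100:.3f}     {benford:.3f}")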


On the peculiar distribution of the U.S. stock indexes' digits.
American Statistician, Vol. 50, No. 4, Nov. 1996, pp. 311-313
Eduardo Ley


As we have discussed in previous issues of Chance News (see Chance News 4.10), Benford's distribution for leading digits is supposed to fit "natural data." The Benford distribution assigns probabilities log((i+1)/i) to digits i = 1,2,..,9. Ley asks if this distribution fits the leading digits of the one-day return on stock indexes, defined as r(t) = 100*(ln(p(t+1)) - ln(p(t)))/d(t), where p(t) is the value of the index on the tth trading day and d(t) is the time between the tth and (t+1)st trading days -- usually 1. Since p(t+1) = p(t)exp{r(t)*d(t)/100}, r(t) is the continuous-time return rate for the period between the tth and (t+1)st trading days.
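Readers who want to experiment with this can compute the returns and their leading digits in a few lines of Python. This is only a sketch of ours: the short list of prices is a made-up placeholder for a real daily index series, and we take d(t) = 1 throughout.

import math
from collections import Counter

def leading_digit(x):
    """First non-zero digit of a positive number."""
    x = abs(x)
    # Scale x into [1, 10) and take the integer part.
    return int(x / 10 ** math.floor(math.log10(x)))

def one_day_returns(prices):
    """r(t) = 100*(ln p(t+1) - ln p(t)), taking d(t) = 1 trading day."""
    return [100 * (math.log(p1) - math.log(p0))
            for p0, p1 in zip(prices, prices[1:])]

# Replace 'prices' with a real daily index series.
prices = [100.0, 100.8, 100.3, 101.9, 101.2, 102.5]
digits = Counter(leading_digit(r) for r in one_day_returns(prices) if r != 0)
print(digits)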

Ley finds that the leading digits of r(t) fit the Benford distribution remarkably well for both the Dow-Jones and the Standard and Poor's index.
He obtained the following distributions for the leading digits for the one-day return rate for the Dow-Jones from January 1900 to June 1993 and the S&P from January 1926 to June 1993.

 Digit  Dow Jones  S&P  Benford
 1  .289  .292  .301
 2  .168  .170  .176
 3  .124  .134  .125
 4  .100  .099  .097
 5  .085  .078  .079
 6  .072  .071  .067
 7  .062  .056  .058
 8  .053  .054  .051
 9  .047  .047  .046


As you can see, in both cases the approximation is very good.

Despite this apparent good fit, in both indices a chi-squared test would reject Benford's distribution. Ley attributes this to the very large power of the test that comes with so large a sample. He remarks that if you take just the last ten years this does not happen and says:

If one takes models as mere approximations to reality, not as perfect data-generating mechanisms, then this can only be viewed as a weakness in the Neyman-Pearson theory (hypothesis testing).

Ley's data is available from http://econwpa.wustl.edu.

Comment: This example is discussed in an article "First Significant Digit Patterns From Mixtures of Uniform Distributions" by Ricardo J. Rodriguez in The American Statistician, Vol. 58, No. 1, Feb. 2004. Here it is shown that this data is better fit by another law called "Stigler's law."

DISCUSSION QUESTIONS:

(1) Would you expect leading digits of the Dow-Jones values themselves to have Benford's distribution?

(2) Do you agree with Ley's remark about the weakness in the theory of hypothesis testing?


Digit Analysis Using Benford's Law: Tests & Statistics for Auditors,
Global Audit Publications, Vancouver, BC
Mark J. Nigrini

It is often rewarding though less often enjoyable to read a technical book. But this book is both technical and a pleasure to read. It tells the story of Mark Nigrini's crusade to convince the world of accountants that Benford's distribution for the leading digit of "natural" data can be useful in detecting fraud.

Nigrini tells us how he learned about Benford's distribution, how he came to write his PhD thesis on the use of this distribution to detect fraud, and how, in the last twenty years, he has developed and applied methods for using digital analysis for fraud detection. While Benford's distribution is at the heart of this analysis, Nigrini incorporates other kinds of digital irregularities in his analysis. For example, when Dartmouth reimburses us for travel, we do not have to provide receipts for items that are less than $25. So, most of our meals end up costing $24.50. Nigrini's digital analysis would have no trouble detecting this fraud.

The leading digit L of a positive number is the first non-zero digit in its decimal representation. So the leading digits of .0023, 2.23, and .234 are all 2. The second digits of these numbers are 3, 2, and 3 respectively.

As our quote for this Chance News suggests, the famous astronomer Simon Newcomb was led, by looking at tables of logarithms, to the belief that the leading digits of data are not typically uniformly distributed. In his 1881 article, Newcomb (1) (see the end of the article for references) gave an argument to show that the mantissas of the logarithms of the numbers should be uniformly distributed, which leads to

      Prob(L = j) = log(j+1) - log(j) for j = 1,2,..,9

where the logarithm is to the base 10.

This gives the distribution

 First digit  1  2  3  4  5  6  7  8  9
 Probability  .301  .176  .125  .097  .079  .067  .058  .051  .046

The physicist Frank Benford(2) independently discovered this distribution and showed that the distributions of the leading digits of a number of "natural" data sets reasonably fit it.

Here are some data whose leading digits are reasonably approximated by the Benford distribution:

 Digit  Benford  Dow Jones 1900-1993  River areas  Newspapers  County populations  Electricity consumption
 1  .301  .289  .310  .30  .316  .316
 2  .175  .168  .164  .18  .170  .167
 3  .125  .124  .107  .12  .134  .116
 4  .097  .100  .113  .10  .083  .087
 5  .079  .085  .072  .08  .073  .085
 6  .067  .072  .086  .06  .067  .064
 7  .058  .062  .055  .06  .055  .057
 8  .051  .053  .042  .05  .056  .050
 9  .046  .047  .052  .05  .046  .057

The Dow Jones numbers are from an article by Ley(3). The river and newspaper data are from Benford's article. The county populations are the populations of the 3,141 counties reported in the 1990 census, analyzed by Nigrini in this book. The electricity consumption data appears in an excellent survey article by Raimi(4) and represents the electricity consumption of 1243 users in one month in the Solomon Islands.

In keeping with his desire to keep the mathematics informal, Nigrini justifies the Benford distribution in terms of simple examples. For example, he considers a town that starts with population 10,000 and increases 10 percent per year. Then (ignoring compounding) the population increasing from 10,000 to 20,000 is a 100 percent increase and so takes about 10 years. But a population change from 50,000 to 60,000 is only a 20 percent increase, so it takes only about 2 years. Thus the city will have a population with leading digit 1 about five times longer than it has leading digit 5.

We found Newcomb's argument a little mysterious and so will describe an argument that justifies the Benford distribution as the only distribution that is invariant under a change of scale, i.e. multiplying the data by a constant c. This was first proved by Pinkham(5) and extended by Hill(6).

To know the leading digit, the second digit, etc. we need only know the data modulo powers of 10. This is the same as saying we need only know the mantissas of their logarithms. We represent the mantissas as points on the unit circle. Following Hill we assume there is some chance process producing the data which in turn determines a probability measure for their mantissas.

For a data point to have leading digit L = j it must have a mantissa between log(j) and log(j+1). Thus

P(L = j) = P(log(j) <= mantissa < log(j+1))

Since multiplying the numbers by c amounts to adding log(c) to their mantissas, this means that the probability measure for the mantissas should be invariant under a rotation of the circle. It is well known that the only measure with this property is the uniform measure. Thus

P(L = j) = log(j+1) - log(j)
which is Benford's distribution for the leading digit.
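This invariance is easy to see numerically. The following Python sketch (ours, not from the book) generates data whose log-mantissas are uniform, so the leading digits are approximately Benford, and checks that multiplying the data by several arbitrary constants leaves the leading-digit distribution essentially unchanged:

import math
import random
from collections import Counter

random.seed(1)

def leading_digit(x):
    """First non-zero digit of a positive number."""
    x = abs(x)
    return int(x / 10 ** math.floor(math.log10(x)))

# Data whose log-mantissas are uniform, so the leading digits are approximately Benford.
data = [10 ** random.random() for _ in range(100_000)]

for c in (1, 2.54, 1 / 3, 17.3):          # arbitrary changes of unit
    counts = Counter(leading_digit(c * x) for x in data)
    dist = [round(counts[d] / len(data), 3) for d in range(1, 10)]
    print(f"c = {c:6.3f}:", dist)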

Nigrini's book contains a wide variety of case studies, ranging from his students' projects analyzing data from their relatives' mom-and-pop stores to his own digital analysis of data from major corporations. Along the way there is an interesting discussion on the best way to test the fit of a distribution when you have many thousands of data elements.

Here is an example Nigrini gives from his student projects. For his project a student, Sam, used data from his brother's small store in Halifax, Nova Scotia. Throughout the day family members would ring up sales in the normal fashion. At night before closing, Sam's brother would go downstairs and ring up fictitious sales on a similar register so that the basement total was less than the upstairs total, to evade income and sales taxes. On one day there were 433 authentic sales totaling $4,038.32. On the fake register printout used for tax purposes there were 245 sales totaling $1,947.29. The leading digit distributions from the two registers' printouts and, for comparison, the Benford distribution are:

 Digit  Benford  Actual sales  Fake sales
 1  .301  .290  .17
 2  .176  .230  .06
 3  .125  .075  .05
 4  .097  .100  .02
 5  .079  .170  .38
 6  .067  .025  .17
 7  .058  .024  .02
 8  .051  .024  .01
 9  .046  .040  .12

The fit to the Benford distribution for the actual sales is not great but the biggest difference, the excess in digits 5, could be explained by the large number of sales of cigarettes which sold for somewhat more than $5 at that time. Thus Nigrini would pass the actual sales data. However, his tests would certainly detect fraud in the fake sales data.

The obvious test to use for deciding if an empirical distribution fits the theoretical Benford distribution would be the chi-square test. However, Nigrini rejects the use of this test for the very large data sets obtained from analyzing data from a major company. The reason is that, when the true distribution is only slightly different from the theoretical Benford distribution, the large sample would lead to rejection of the Benford distribution even though there may be no fraud. This is like the observation that with enough tosses of a coin it is possible to reject just about any coin as a fair coin since coins are never exactly fair.

For this reason Nigrini recommends a test that he calls the Mean Absolute Deviation (MAD) test. This test computes the average of the 9 absolute differences between the empirical proportion of each digit and the proportion predicted by the Benford distribution. Based on his experience, Nigrini suggests the following guidelines for measuring conformity of the first digits to Benford using MAD:

MAD:    0 to .004  (close conformity)
MAD: .004 to .008  (acceptable conformity)
MAD: .008 to .012  (marginally acceptable conformity)
MAD:  greater than .012 (nonconformity)

Thus a single deviation of more than 5% would rule out close conformity and more than 10% would suggest nonconformity. Nigrini gives a graph of the fit for a data set with 250,000 data entries from a large Canadian oil company. There are noticeable deviations for at least half of the digits. However, the deviations are small and the MAD was computed at .0036 indicating close conformity. A chi-square test would undoubtedly have rejected the Benford distribution.
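The MAD statistic itself takes only a couple of lines of Python. Here is a sketch of ours, applied to the Dow-Jones first-digit proportions from the table above; it gives a MAD of about .005, which falls in the "acceptable conformity" range of the guidelines:

import math

def mean_absolute_deviation(observed, benford):
    """Average of the absolute differences between observed and Benford proportions."""
    return sum(abs(o - b) for o, b in zip(observed, benford)) / len(benford)

benford = [math.log10(1 + 1 / d) for d in range(1, 10)]

# First-digit proportions of the Dow-Jones one-day returns from the table above.
dow_jones = [0.289, 0.168, 0.124, 0.100, 0.085, 0.072, 0.062, 0.053, 0.047]
print(round(mean_absolute_deviation(dow_jones, benford), 4))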

Nigrini emphasizes that deviation from Benford's distribution and other digital irregularities do not by themselves demonstrate fraudulent behavior. There may be good reasons for these deviations. They only suggest that the investigator might want to look further into how the data is collected, both for evidence of fraud and also for more efficient ways to run the business.

While we have assumed that natural data is the result of chance outcomes, it has not been necessary to know the probability distribution that produces them. However, we would hope that some of our standard distributions would be appropriate for fitting natural data. Both Hill and Nigrini remark that it would be interesting to know which standard distributions produce data with the leading digits having at least approximately a Benford distribution.

In a recent paper Leemis, Schmeiser, and Evans(7) have looked at this problem for survival distributions. These distributions include, among others, the well-known exponential, Gamma, Weibull and lognormal distributions. All these distributions, for some parameter values, produce data with leading digits that reasonably fit Benford's distribution. However, since the fit is sensitive to the parameter values, the authors also warn that relying completely on the Benford distribution to detect fraud could lead to a significant number of false positives.

Mark Nigrini was one of our year 2000 Chance Lecturers, so you can see the movie before or after you read the book.

References for this review:

(1) Simon Newcomb, (1881), Note on the frequencies of the different digits in natural numbers. Amer. J. Math 4, 39-40.

(2) Benford, Frank,(1938), The law of anomalous numbers. Proc. Amer. Phil. Soc. 78, 551-72.

(3) Ley, Eduardo, (1996), On the peculiar distribution of the U.S. stock Indices. The American Statistician, 1996, 50, 311-313.

(4) Raimi, R. (1976), The first digit problem, Amer. Math. Monthly 83, 521-38.

(5) Pinkham, Roger, (1961), On the distribution of first significant digits, Ann. Math. Statist., 32, 1223-1230.

(6) Hill, Theodore P., (1995), The Significant-Digit Phenomenon. American Mathematical Monthly, Vol. 102, No. 4. pp. 322-327.

(7) Leemis, L., Schmeiser, B., and Evans, D., (2000), Survival distributions satisfying Benford's law, The American Statistician, Vol. 54, No. 4. Available from Larry Leemis's home page.

Google numbers.
Sample project for Dartmouth Chance Course
Gregory Leibon

Greg Leibon taught the Dartmouth Chance Course this winter and made this sample project for his students. Since we have discussed Benford's Law a number of times, we thought readers might also enjoy his project. Recall that Benford's Law says that the first (leading) digit in "natural data" should have approximately the distribution given by the base 10 logarithm of (1 + 1/d) for d = 1,2,...,9. Thus the leading digit is 1 with probability log(2) = .301, 2 with probability log(1.5) = .176, etc. The complete distribution is:

 First digit  Benford probability
 1  30.1%
 2  17.6%
 3  12.5%
 4  9.7%
 5  7.9%
 6  6.7%
 7  5.8%
 8  5.1%
 9  4.6%

For his project, Greg wanted to see if "natural numbers" on the web satisfied Benford's Law. He writes:

I wanted to understand numbers on the World Wide Web in which real live people were actually interested. In particular, I did not want to accidentally include numbers from data sets intended only for data mining purposes. To accomplish this, I included a piece of text in my search. I wanted to choose a natural piece of text, hence (for lack of a better idea) I used the word “nature”. Thus, my Google Numbers are numbers that occur on a web page that also includes the word “nature”.


I wanted my search to produce robust but reasonable numbers of results. This is because I wanted to leave myself in a position to actually examine the resulting hits in order to achieve a sense for how the numbers were derived.

A little experimenting led Greg to the conclusion that searches for six-digit numbers and the word "nature" resulted in a reasonable number of hits. So he chose nine random five digit numbers and for each of these he added all possible leading digits. His first five-digit number was x = 13527 giving him the 9 six-digit numbers 113527, 213527, 313527, ...,913527. He then searched for each of these numbers and the word "nature" in Google and recorded the number of hits. Here is what he found:

 

 x = 13527  occurrences
 113527  136
 213527  44
 313527  35
 413527  30
 513527  27
 613527  15
 713527  9
 813527  13
 913527  8
 

He repeated this for his 8 other random five-digit numbers and combined the results to obtain:

 Leading digit  Count  Empirical percent  Benford
 1  645  31.65%  30.1%
 2  342  16.78  17.6
 3  262  12.86  12.5
 4  181  8.88  9.7
 5  164  8.05  7.9
 6  143  7.02  6.7
 7  115  5.64  5.8
 8  105  5.15  5.1
 9  81  3.97  4.6
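One simple way to quantify the fit is a chi-square goodness-of-fit computation on the combined counts. Here is a short Python sketch of ours using the counts in the table above:

import math

benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
counts = [645, 342, 262, 181, 164, 143, 115, 105, 81]   # Greg's combined counts
n = sum(counts)

chi_square = sum((obs - n * p) ** 2 / (n * p) for obs, p in zip(counts, benford))
print(f"chi-square statistic = {chi_square:.2f}")
# For comparison, the 5% critical value for 8 degrees of freedom is about 15.5.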

This is a remarkably good fit. Here is his graphical comparison:

[Figure: bar chart comparing the empirical leading-digit percentages with the Benford probabilities]

Greg wondered if he was just lucky or if there was some explanation for such a good fit. Looking for an explanation, he found that many of the numbers he observed could be considered the result of a growth process. As an example of such a growth process, consider the money you have in the bank that is continuously compounded. Then it is easy to check that the percent of time your money has leading digit k for k = 1,2,3,...,9 fits the Benford distribution. Greg remarks:

Hence, we would expect Google numbers to have a Benford distribution if they satisfied two criteria: first that every Google Number behaves like money with interest continuously compounded, and, second that the probability that a Google number is posted on the web is proportional to how long that quantity is meaningful.
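The claim about continuously compounded money is easy to check by simulation. The following Python sketch is ours; the 5 percent interest rate and the $100 starting balance are arbitrary choices. It samples the balance at random times over one factor-of-10 growth cycle and tabulates the leading digits:

import math
import random
from collections import Counter

random.seed(2)

def leading_digit(x):
    """First non-zero digit of a positive number."""
    return int(x / 10 ** math.floor(math.log10(x)))

rate = 0.05                        # hypothetical continuous interest rate
cycle = math.log(10) / rate        # time for the balance to grow by a factor of 10
samples = 200_000

counts = Counter(leading_digit(100 * math.exp(rate * random.uniform(0, cycle)))
                 for _ in range(samples))

print("digit  fraction of time  Benford")
for d in range(1, 10):
    print(f"{d}      {counts[d] / samples:.3f}            {math.log10(1 + 1 / d):.3f}")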

We gave Greg an A on his project but you should read it here yourself. You can also see what Greg did in the Chance Course and some student projects here.

DISCUSSION QUESTION:

Repeat Greg's experiment replacing "nature" by a different word. Do you get similar results?


ESTIMATING THE NUMBER OF SPECIES

Hidden truths
New Scientist, 23 May, 1998, p 28
Robert Matthews

This article deals with the topic of missing data. Two major subtopics are discussed; one is the capture-recapture method, and the other is bias in clinical studies. As an example of the capture-recapture method, consider the following scenario. Suppose that we are trying to estimate the number of people who live in a certain area. We can pick a random sample from the area and consider the people in the sample to be 'captured'. Then we can pick another random sample from the same area and count how many of the captured people are picked the second time (i.e. 'recaptured'). Suppose that we pick a sample of 100 people the first time, and then we find that, in our second sample of size 100, 10 people have been recaptured. This tells us that the captured set is about 1/10 the size of the population, leading to an estimate of 1000 for the population size. This method has been applied in ecology, national defense, and public health, to estimate sizes of populations that are hard to count directly.
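In code, the estimate used in this example (often called the Lincoln-Petersen estimate) is a one-liner. Here is a minimal sketch of ours:

def capture_recapture_estimate(n_first, n_second, n_recaptured):
    """The fraction of marked individuals in the second sample should match
    the fraction of marked individuals in the whole population."""
    return n_first * n_second / n_recaptured

# The article's example: two samples of 100 people with 10 recaptures.
print(capture_recapture_estimate(100, 100, 10))   # 1000.0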

The second subtopic concerns missing data. Suppose, for example, that a drug has been developed to treat a certain condition. Before it can be introduced into the marketplace, it must be tested to see if it is safe and if it works. It is fairly clear that if one tests a drug sufficiently often, it is probable that at least one of the tests will show that the drug is effective. Unfortunately, there may be many other studies of the same drug that do not show this, and these latter studies are less likely to be published than the former one. The question in this case is whether the fact that these negative studies exist can be discerned from the studies that have been published.

It turns out that the answer to this question is 'sometimes'. One can use what is known as a 'funnel plot' to help determine whether there is missing data. One begins with the data from many small studies concerning the drug's efficacy. Since these studies are small, they tend to vary widely in their predictions concerning the drug. Larger studies are needed to establish statistically significant results. If one compares the results of the published larger studies with the smaller studies, two possibilities exist. The first one is that, while the larger studies tend to cluster more closely to a certain figure than the smaller ones, there is no bias in the larger studies with respect to this figure. The second possibility is that the larger studies are skewed (presumably in the direction that shows the drug is beneficial). This latter situation suggests that some larger studies have been conducted and have remained unpublished because they do not show that the drug is effective at a statistically significant level.

Matthias Egger and some colleagues at Bristol University have used funnel plots in this way to show that, in at least one quarter of 75 published medical studies, there were significant signs of missing data. These results have given more ammunition to medical scientists who have called for access to all study results.

DISCUSSION QUESTION:

(1) The Census Bureau plans to use the capture-recapture method to determine the undercount in the census 2000. This is done by carrying out two surveys (Mass Enumeration & PES Sample Enumeration) of a population of N blocks at about the same time. The unknown total population is

M = M11 + M10 + M01 + M00

where M11 is the number enumerated on both occasions; M10 the number enumerated on occasion one but not on occasion two, M01 the number enumerated on occasion two but not one, and M00 the number not enumerated on either occasion. Everything but M00 is known here. Assume the two surveys are independent and each individual has the same probability of being counted. Then you can estimate M00 by

M00' = M01M10/M11

and from this obtain an estimate of M as

M' = M11 + M10 + M01 + M00'.

How realistic do you think the assumptions are in this situation?

(2) Matthews mentions the following problem: Jones, a butterfly researcher, captures butterflies for a period of time m. He captures 5 different species with 4,1,10,20,2 representatives of these 5 species. Jones asks: if I were to return the next day and catch butterflies for a length of time n, how many new species of butterflies should I expect to get?

This is a famous and very difficult estimation problem. There is one elegant solution due to I.J. Good (Biometrika, Vol. 40, 1953, pp. 237-264) which works sometimes. Assume that butterflies of species s enter the net according to a Poisson process with rate v which depends on s. Then the distribution of the number of butterflies of species s caught in time t is Poisson with mean tv. Thus, before he starts his research, the probability that Jones catches no s butterfly on the first day but at least one s butterfly on the second day is

e^(-vm)*(1-e^(-vn)).

Using the infinite series expansion for e^x in the second term of this product permits us to write this product as:

P(X = 1)(n/m) - P(X = 2)(n/m)^2 + P(X = 3)(n/m)^3 - ...

where X is the number of s butterflies caught on the first day. Summing these equivalent expressions over all species shows that

Expected number of new species caught on the second day =

r(1)(n/m) - r(2)(n/m)^2 + r(3)(n/m)^3 - ...

where r(j) is the expected number of species represented j times in the first day's catch.

(3) Assume that n = m. Estimate the number of new species of butterflies that Jones should expect to capture on the second day.

Thisted and Efron (Biometrika, 1987, 74, 3, pp. 445-55) used this method to determine if a newly found poem of 429 words had about the right number of new words to be consistent with being written by Shakespeare. They interpreted a species as a word in Shakespeare's total works, discovered or not discovered. They took the time m to be 884,647, the number of words in existing Shakespeare works, and n to be 429, the number of words in the new poem. Using the above result they got an estimate of about 7 for the number of new words that should occur in a new work of 429 words, and since the new poem had 9 such new words they concluded that this was consistent with the poem being a work of Shakespeare.

(4) What do you think about applying the Poisson model to this kind of problem?

(5) It would be interesting to check this model on bird data. A natural thing to try would be the Christmas Bird Count (CBC). I'm told the data is available.


In his column, Unconventional Wisdom, Richard Morin included an account of the sea monster study that we mentioned in Chance News. As we read his account it occurred to us that this was a good topic for a Chance profile on sampling methods related to the capture-recapture method. We wrote the beginning of such a module and are including it in this Chance News with the hope of getting some suggestions for improvement or additions. We plan to include the simulation programs mentioned. Morin's discussion is such a nice introduction that we include it as he wrote it.

Unconventional Wisdom
Washington Post, 7 March, 1999

Statistics, the King's Deer And Monsters Of the Sea

How many species of sea monsters are left to be discovered? The correct answer is 47, give or take a couple, claims Charles Paxton of Oxford University.

Paxton isn't a prophet. He's a marine biologist who has created a statistical model that, he claims, predicts how many species of creatures measuring more than two meters--about six feet--still swim, slither or crawl undiscovered in the world's oceans.

Of course his prediction may be off a bit, he allows in a recent issue of the Journal of the Marine Biological Association of the United Kingdom. It also may just be possible that there's nothing new of any size lurking in the depths.

But Paxton doesn't think so. Buoyed by the results of his statistical model, he's confident that there are probably many more marine mammoths out there--including, perhaps, "a couple of new totally weird sharks"--and maybe even another Moby Dick. "Field observations suggest one or two species of odontocetes [toothed whales] probably await capture and formal description," he reported in his article.

His confidence is based on the results of a statistical technique frequently used by ecologists to estimate the diversity of populations, based on only a small sampling. It's derived from the work of the English mathematician and biologist Sir Ronald Aylmer Fisher.

But Paxton's work is reminiscent of a tale, perhaps apocryphal, told about the famed French mathematician Simon Denis Poisson (1781-1840), whose achievements were so notable that he's honored with a plaque on the first level of the Eiffel Tower. The late Stephen Withey of the University of Michigan liked to tell the story to skeptical graduate students (including your Unconventional Wiz) in trying to prove that it really is possible to count the uncountable.

In the early 1800s, the king of France summoned Poisson. His highness wanted to know how many deer there were in the Royal Forest. It seemed that the royal gamekeeper had a sharp eye but a dull mind: He could easily recognize individual deer after seeing them just once (yeah, we didn't believe it, either--but it made the story better). When could the poor rustic stop counting the deer, Poisson was asked. And how certain could the king be that his count was accurate?

Poisson came up with this solution: He told the gamekeeper to search the forest every day and tally every new deer he found. He cautioned the man not to add to the count when he came across any deer he had previously seen. Of course, there were lots of first-time sightings on the first day. The cumulative total increased only modestly on the second day. Every day thereafter, the total grew by smaller and smaller amounts.

At the end of each day, Poisson dutifully plotted each updated total on a graph. Then he connected the dots, creating a line that grew at a progressively slower rate and tended to reach what mathematicians call a "fixed upper limit"--the point at which one could safely predict that the chances of finding a previously undiscovered deer would be overwhelmingly small.

Roughly speaking, what Poisson is said to have done with the king's deer, Paxton has done with creatures of the sea. Paxton consulted the 10th edition of Carolus Linnaeus's "Systema Naturae," first published in 1758, and identified all saltwater animals more than two meters long. He found about 100 species. Then he searched subsequent scientific literature through 1995 and recorded the discovery by year of new species that met his "monster" test.

He found that, by 1995, the overall number of really big sea creatures had reached 217. But the rate of discovery had dropped dramatically (we're currently discovering one new big marine animal on average every 5.3 years, Paxton reported). He then graphed the data between 1830 and 1995 and estimated its upper limit, which suggests there are probably about 47 more species of whoppers still eluding us in the world's oceans.

In reading the Poisson story, the first thing that came to our mind was that the remarkable ability of the gamekeeper to distinguish every deer he has seen suggests that the King might also have used the capture-recapture method used by the Census Bureau in the undercount problem in this way. The King asks the gamekeeper to make a serious effort to see as many deer as he could on a specific day. Think of the deer he saw the first day as tagged. Then on the next day he could make an even greater effort to see as many as possible. Suppose the gamekeeper sees 35 deer on the first day and 60 on the second day, noting that 20 of these 60 deer he also saw on the first day. Then, assuming there are N deer walking around the forest more or less at random, the proportion of tagged deer in the second day's sample, 20/60, should be approximately the proportion of tagged deer in the forest, 35/N. Thus the King can estimate the number of deer in the forest to be 3*35 = 105. Thinking about the validity of the assumptions made in this hypothetical problem suggests problems that the Census Bureau must adjust for in their real world use of the capture-recapture method in the Census 2000.

Let's turn now to solving the problems that the King and Paxton posed. In the deer problem, it might be reasonable to assume that each deer has the same probability of being seen in a day. However, for the monster fish problem the probability of seeing a particular species of large fish in a given year would surely depend on the species. Thus to have a solution that will apply to both problems we will let the probability of seeing a particular deer differ from deer to deer.

Poisson suggested plotting the numbers that the gamekeeper saw each day for a series, say, of 50 days. When you do this you get an increasing set of points that suggest that a curve could be fitted to them that looks a bit like a parabola. This curve would be expected to level off when all the deer had been seen and the value at which it levels off would be our estimate for the total number of deer in the forest. Our problem is to determine the curve that best fits this data.

Assume there are M deer in the forest, which we label 1, 2, 3, ..., M. Let's assume that on a given day the gamekeeper sees deer j with probability p(j). Then the probability that he has not seen the ith deer after n days is (1- p(i))^n and so the probability that he has seen this deer after n days is 1- (1-p(i))^n. Thus the expected number of deer the gamekeeper has seen after n days is the sum of 1 - (1-p(i))^n for i = 1 to M. If we could estimate the probabilities p(i) this would provide a natural curve to fit to the data given by the gamekeeper.

The original deer problem would have all the p(j)'s the same, say p. In this case our expected curve is M(1-(1-p)^n). Of course, we do not know M and p. We can choose the curve that best fits the data by considering a range of M and p values and choosing the pair that minimizes the sum of the squares of the differences between the observed and predicted values. We carried out this procedure by simulation assuming M = 100 and p = .05 and the gamekeeper reported for 50 days. The best fit curve predicted the values M = 100 and p = .05 quite accurately.
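Here is a Python sketch of the kind of simulation and grid-search fit we have in mind. It is our own illustration, not the program we actually used, and the grid of M and p values searched is an arbitrary choice:

import random

random.seed(3)
M_true, p_true, days = 100, 0.05, 50

# Simulate the cumulative number of distinct deer seen by the end of each day.
seen, cumulative = set(), []
for _ in range(days):
    for deer in range(M_true):
        if random.random() < p_true:
            seen.add(deer)
    cumulative.append(len(seen))

# Fit the expected curve M*(1 - (1-p)^n) by least squares over a grid of (M, p).
def sum_of_squares(M, p):
    return sum((c - M * (1 - (1 - p) ** n)) ** 2
               for n, c in enumerate(cumulative, start=1))

best_M, best_p = min(((M, p / 1000) for M in range(50, 201) for p in range(10, 101)),
                     key=lambda mp: sum_of_squares(*mp))
print("estimated M and p:", best_M, best_p)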

For the species problem the p(j)'s are different, and so the problem is more difficult. Now we have M different values to estimate and fewer data points than that. Here researchers take advantage of the fact that the predicted curves look like parabolas. They replace this complicated expected value curve by the much simpler hyperbola of the form y = sx/(b+x), where x is the time observed and y the total number of species seen by this time; this curve levels off at s as x grows. Now they again have only two parameters, s and b, to estimate and again we can do this by choosing s and b to minimize the sum of the squares of the differences between observed and predicted values. To see how this works we simulated this experiment. We assumed that there are 100 different species and chose their probabilities of being seen in a given year at random between 0 and .1. We had the program estimate the best fit curve and estimate the total number of species from this curve. Using 100 species, we found that the resulting estimates were not too bad. However, the resulting hyperbolic curve did not fit the data very well. Even if it did fit the observed points, different models could fit just as well but give quite different limits. Curve fitting without an underlying model is a risky business.
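The following Python sketch shows what such a simulation and fit can look like. Again this is our own illustration rather than the program we used, and the least-squares fit is done by a crude grid search:

import random

random.seed(4)
n_species, years = 100, 50
probs = [random.uniform(0, 0.1) for _ in range(n_species)]   # chance of being seen in a year

# Cumulative number of distinct species seen by the end of each year.
seen, cumulative = set(), []
for _ in range(years):
    for s, p in enumerate(probs):
        if random.random() < p:
            seen.add(s)
    cumulative.append(len(seen))

# Fit y = s*x/(b + x) by least squares; the fitted s estimates the total number of species.
def sum_of_squares(s, b):
    return sum((y - s * x / (b + x)) ** 2
               for x, y in enumerate(cumulative, start=1))

best_s, best_b = min(((s, b) for s in range(50, 301, 2) for b in range(1, 101)),
                     key=lambda sb: sum_of_squares(*sb))
print("the fitted curve levels off at about", best_s, "species")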

The work of Fisher referred to by Richard Morin was for a slightly different problem. His problem can be described as follows: the King now wants to know how many species of butterflies there are in his forest. He asks the gamekeeper to go out one day with his net and see how many species of butterflies he catches. The gamekeeper does this and reports that he caught 32 butterflies of 5 different species, with numbers (12,7,5,3,5) for the different species. Fisher asked: how many new species would the gamekeeper expect to find on a second day of catching butterflies?

I. J. Good later gave an elegant new solution of Fisher's problem. Here is Good's argument.

It is natural to assume that the number of butterflies of a particular species that enter the gamekeeper's net has a Poisson distribution, since typically there are a large number of butterflies of a particular species and a small probability that any one would enter the gamekeeper's net. Let m be the mean number of butterflies of species s caught in a day. Then the probability that no butterfly of species s is caught on the first day is e^(-m) and the probability that at least one such butterfly of species s is captured on the second day is 1-e^(-m). Therefore the probability that no butterfly of species s is captured on the first day and at least one is captured on the second day is:

e^(-m)(1-e^(-m))

Using the series expansion for e^(x) in the second term of this product we can write this probability as:

e^(-m)(m - m^2/2! + m^3/3! - ...)

Now Good assumes that the mean m for a specific species is itself a chance quantity with unknown density f(m). Integrating our last expression with respect to f(m) we find that the probability that species s is not seen on the first day and at least one is captured on the second day is:

P(X = 1) - P(X = 2)+ P(X = 3) - ...

where X is the number of times species s is represented in the first day's capture. Summing this expression over all species we find that:

The expected number of new species found on the second day is

e(1) - e(2) + e(3) - ...

where e(j) is the expected number of species with j representatives on the first day. The e(j)'s can be estimated by the number of species represented j times on the first day's capture. Doing this we obtain an estimate for the number of new species we will find on the second day.

We have assumed that the sampling time was the same on the two days. If it is different, then our final sequence becomes

we(1) - w^2e(2) + w^3e(3) - ...

where w is the ratio of the time spent sampling on day 2 to the time spent on day 1.

The same procedure can be used to find an expression for the expected number of species in the second sample that occurred k times in the first sample. The result is

wC(k+1,k)e(1+k) - w^2C(k+2,2)e(2+k) + w^3C(k+3,3)e(3+k) - ...

where C(n,j) is the number of ways to choose j elements from a set of size n.

Thisted and Efron used this method to determine if a newly found poem of 429 words, thought to be written by Shakespeare, was in fact written by him.

They interpreted a species as a word in Shakespeare's total works. The sample on the first day corresponds to the 884,647 words in known Shakespeare works. The words in the new poem constituted the second sample. They then used Good's method to estimate the number of words in the poem that were used k times in the previous works. Then they compared these estimated values to the actual values.

In this example, w = 429/884647 is sufficiently small that only the first term wC(k+1,k)e(1+k) in our series need be used.

From this formula, we see that the expected number of words in the poem that were not used at all in previous works is we(1). There were 14,376 words in Shakespeare's original works that were used exactly once, so we estimate e(1) = 14,376. This gives an estimate of we(1) = (429/884647)*14376, or about 7, for the number of new words in the poem (the actual number was 9). In the same way we estimate that there should be 4 words that were used once in the previous works (the actual number was 7), and 3 words that should have been used twice (the actual number was 5). The authors concluded that this fit suggested that the poem was written by Shakespeare.
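The arithmetic for the new-word estimate is short enough to show in a few lines of Python (a sketch of ours; only the first term of the series is kept, as explained above):

from math import comb

w = 429 / 884647     # ratio of the poem's length to the length of the known works
e1 = 14376           # number of words used exactly once in the known works

# Expected number of words in the poem that never appear in the known works.
# Only the first term w*C(k+1,k)*e(k+1) with k = 0 matters, since w is tiny.
new_words = w * comb(1, 0) * e1
print(round(new_words, 1))   # roughly 7; the poem actually contained 9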

They made similar estimations for other Elizabethan poems by Shakespeare and other authors to see how well this method identified a work of Shakespeare. Their results showed that their method did distinguish works of Shakespeare from those of Donne, Marlowe and Jonson.

The use of simulation in these models highlights the fact that in statistics we are usually trying to estimate a quantity given incomplete information. If we set a lot of gamekeepers to gather the data we would typically get slightly different answers using the data from the different gamekeepers. We can see how much variation to expect by simulating this experiment.

While this example would take a couple of days to discuss in a Chance course, it uses nothing more difficult than the coin-tossing probability model and the Poisson approximation of this model.


David Rutherford sent us remarks on the Economist article on estimating the number of large salt-water species. He writes:

On the estimation of the number of as yet undiscovered salt-water species with length or width of at least 2 meters (6.6 feet), the author does not state that we are sampling only from that portion of the sea which trawlers fish (the bottom of sea trenches is relatively undersampled, for example, while the top portions of the sea are relatively oversampled). This is equivalent to sampling only the top half of the chocolate box. Here the analogy breaks down however, because the conditions at the bottom of the sea are markedly different (in terms of pressure, temperature and light) than conditions at the top of the sea, whereas the conditions in the chocolate box are fairly uniform. Presumably this qualifier appears in the original paper, either by Fisher or Paxton (ie 47 species remaining in that part of the sea which is or has in the past been sampled).

David Rutherford
Melbourne, Australia

Editor's comments: Fisher and his colleagues (R. A. Fisher, A.S. Corbet, C.B. Williams, Journal of Animal Ecology, 1943, Vol. 12, pp. 42-48) state that their estimates apply only to a region for which the sample is representative. So, for example, estimates for the number of species of butterflies based on a sample at the bottom of a mountain would not apply to the number in an area of higher elevation. Likewise, the estimates for one season need not apply to another. Paxton does not discuss this issue directly, but after remarking that his model assumes a constant sampling effort, which may well not be satisfied, he writes:

The validity of the analysis presented here would also be in doubt if many new species are split from existing species by information gained by molecular techniques or new methods of biological sampling found large numbers of new species of large marine fauna in as yet unexplored habitats.


OSCAR WINNERS LIVE LONGER

Study: Oscar winners tend to live longer.
Nando Times, 14 May 2001
Michael Rubinkam
Available at Nando Times until 27 May 2001

Survival in Academy Award-winning actors and actresses.
Annals of Internal Medicine, Vol. 134, No 10, 15 May 2001
Donald A Redelmeier, Sheldon M. Singh

The abstract for this study as presented in the Annals of Internal Medicine described the study as follows:

Background: Social status is an important predictor
of poor health. Most studies of this issue have focused
on the lower echelons of society.

Objective: To determine whether the increase in status
from winning an academy award is associated with
long-term mortality among actors and actresses.

Design: Retrospective cohort analysis.

Setting: Academy of Motion Picture Arts and Sciences.

Participants: All actors and actresses ever nominated
for an academy award in a leading or a supporting role
were identified (n = 762). For each, another cast member
of the same sex who was in the same film and was born
in the same era was identified (n = 887). Measurements:
Life expectancy and all-cause mortality rates.

Results: All 1649 performers were analyzed; the median
duration of follow-up time from birth was 66 years, and
772 deaths occurred (primarily from ischemic heart disease
and malignant disease). Life expectancy was 3.9 years
longer for Academy Award winners than for other, less
recognized performers (79.7 vs. 75.8 years; P = 0.003).
This difference was equal to a 28% relative reduction in
death rates (95% CI, 10% to 42%). Adjustment for birth
year, sex, and ethnicity yielded similar results, as did
adjustments for birth country, possible name change,
age at release of first film, and total films in career.
Additional wins were associated with a 22% relative
reduction in death rates (CI, 5% to 35%), whereas
additional films and additional nominations were not
associated with a significant reduction in death rates.

Conclusion: The association of high status with increased
longevity that prevails in the public also extends to celebrities,
contributes to a large survival advantage, and is partially
explained by factors related to success.

Those who win more than one academy award are counted only once, which explains the difference between the number nominated (762) and the number of controls (887).

As seen above, Oscar winners have a life expectancy 3.9 years greater (79.7 vs. 75.8 years) than that of the matched controls from the same movies who were not nominated for Oscars. The researchers report that the winners have about the same advantage over those who were nominated but did not get an Oscar -- 79.7 vs. 76.

The researchers comment that this increase of about 4 years in longevity is equal to the estimated societal consequence of curing all cancers in all people for all time. They do not have any simple explanation for this increase in longevity, but they suggest the following possible explanations: Oscar winners are under greater scrutiny, which may lead to a more controlled life to maintain their image; they may have managers who have a vested interest in their reputation and enforce high standards of behavior; and they may have more resources and so can avoid stress and have access to special privileges that others do not.

The usual explanations for the longer life expectancy of the rich over the poor (better schooling, better health care, etc.) do not seem to apply here.

The Nando Times remarks:

Examples of long-lived Oscar-winners abound. John
Gielgud who won for "Arthur," was 96 when he died
last year. George Burns ("The Sunshine Boys") lived
to 100. Leading lady Greer Garson ("Mrs. Miniver")
reached 92. So did Helen Hayes ("The Sin of Madelon
Claudet," "Airport"). Katharine Hepburn - with a record
four Academy Awards - turned 94 on Saturday.

DISCUSSION QUESTIONS:

(1) What reasons can you think of for this increased longevity?

(2) In discussing the limitations of their study the authors say that they should have had more biographical information about those in the study. What would they have looked for if they did?


We usually do not go into the technical details of a study that we discuss in Chance News but decided that it might be fun to try this. We chose the study "Survival in academy award-winning actors and actresses," Annals of Internal Medicine, Vol. 134, No. 10, by Donald A. Redelmeier and Sheldon M. Singh, discussed in Chance News 10.05. Redelmeier was also the lead author of the interesting study on the danger of using a cell phone while driving, discussed in Chance News 6.03 and 6.10 and in an article "Using a car phone like driving drunk?" in Chance Magazine, Spring 1997.

Recall that for their Oscar study the authors identified all the actors and actresses who had been nominated for an award for a leading or supporting role since the Oscar awards were started 72 years ago. For each of them, they identified another cast member of the same gender in the same film and born in the same era. This provided a group of 887 actors to be used as a control group for their study of Oscar winners. From those nominated there were 235 Oscar winners. The authors wanted to determine if Oscar winners tend to live longer than comparable actors who were not winners. Thus the key question is: how do you decide whether there is a significant difference in the life expectancy of members of two different groups?

A similar problem arises in a medical trial in which one group of patients is given a new treatment and a second group is given a placebo or a standard treatment, and the researchers are interested in the expected time until a particular "end event" occurs, such as death, the disappearance of a tumor, the occurrence of a heart attack, etc. The test that is generally used for such studies, and was used in the Oscar study, is called the "Kaplan-Meier survival test". This test was developed by Kaplan and Meier in 1958 ("Nonparametric Estimation from Incomplete Observations," Journal of the American Statistical Association, Vol. 53, No. 282, 457-481). The importance of this test is suggested by the fact that the Science Citation Index shows that over 22,000 papers have cited the 1958 Kaplan and Meier paper since 1974. We are willing to bet that no reader can, without help from the Internet or a friend, guess a scientific paper that has a larger number of citations.

A good description of how the Kaplan-Meier test is carried out can be found in Chapter 12 of the British Medical Journal's on-line statistics book "Statistics at Square One."

The Kaplan-Meier test requires that we construct a life table for the Oscar winners and the control group. We start by reminding our readers how life tables are constructed for the US population. The most recent US life tables can be found in the CDC's National Vital Statistics Reports, Volume 48, Number 18 (http://www.cdc.gov/nchs/products/pubs/pubd/nvsr/48/lifetables98.htm) and are based on 1998 data. Life tables are given by sex and race and also for the total US population. The following table is from the first 10 rows of the life table for the 1998 US population.

Table 1. From the 1998 US Population Life Table.

The first column indicates the first 10 age intervals.

The second column gives the proportion q(x) dying in each age interval, determined as the number who died in this interval in 1998 divided by the US 1998 population at the midpoint of the age interval.

The third column starts with a cohort of 100,000 at birth and gives the number expected still to be alive at the beginning of each age interval. To compute these numbers we start with l(1) = 100000 and then use the recursion relation l(x+1) = l(x)(1- q(x)) for x >= 1. (Note that 1-q(x) is the proportion of those alive at time x that survive at least one more year.) For example, l(2) = 100000(1- .00721) = 99279, l(3) = 99279(1-.00055) = 99,225, etc. The quantity l(x) can be interpreted as the probability that a newborn child will live to year x. For any year t greater than or equal to x, the quantity l(t)/l(x) can be interpreted as the probability that a person who has lived to year x will live to year t.

To determine the life expectancy of a newborn baby we need only sum l(x) for all x. (Recall that for a discrete random variable X the expected value of X can be computed by the sum over all x of Prob(X >= x).) To find the life expectancy for a person who has reached age x we add the values of l(t)/l(x) for t greater than or equal to x. From the table we see that the life expectancy for a person at birth is 76.7 while for a person who has survived 9 years it is 69.4, making a total life expectancy of 79.4. Thus there is a 2.7 year bonus for having survived 9 years. You can view the entire Life Table here and check your own bonus. We have a 10-year bonus for surviving so long but, alas, only a 10.7 year additional expected lifetime.
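The calculations behind a life table are simple enough to sketch in a few lines of Python. In the sketch below (ours), only the first two q(x) values come from the table; the rest are made-up placeholders, so the printed life expectancy is not the real 1998 figure.

def life_table(q):
    """Given q[x], the proportion dying during year x, return the list l with
    l[x] the expected number, out of 100,000 newborns, alive at the start of year x."""
    l = [100_000.0]
    for qx in q:
        l.append(l[-1] * (1 - qx))
    return l

def life_expectancy_at(x, l):
    """Add the survival probabilities l(t)/l(x) for t >= x, as described in the text."""
    return sum(lt / l[x] for lt in l[x:])

# The first two q values are from the table; the remaining ones are placeholders.
q = [0.00721, 0.00055] + [0.001] * 98
l = life_table(q)
print(round(l[1]), round(l[2]))             # about 99,279 and 99,225, as in the text
print(round(life_expectancy_at(0, l), 1))   # life expectancy at birth for this made-up table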

A survival curve is a plot of the probability l(x)/100000 of living x or more years as a function of age. Using the full life table we obtain the following survival curve for the US population.

Figure 1 Survival curve for US population 1998

For a discussion of the technical problems in producing and interpreting life tables, see the article "A method for constructing complete annual U.S. life tables" by R. N. Anderson.

We return now to the Oscar study. In studies like this and others where the Kaplan-Meier test is used, we have incomplete information about the length of time to the end event -- death in the case of the Oscar study. Some of the Oscar winners will have died by the time the study is completed but others will not have died by this time. Still others may have been lost by the researchers after being followed for some time. For those who have not died we know only that they lived to a certain age, and at this point we say that they have "left the study." The Kaplan-Meier procedure allows us to use this information, along with the information about those who have died, in making life tables.

Dr. Redelmeier provided us with the data needed to carry out the Kaplan-Meier test. In Table 2 we show the first 10 entries in his data set for the members of the control group:

0 means died in year x
1 means lost to the study in year x

Table 2: How the data is presented.

The first column gives the number of years that it is known the actor survived. For the first actor this was 56 years and the 0 in column 2 indicates that this actor died in his 56th year. The second actor survived 78 years and the 1 in column 2 indicates that this actor was lost to the study in his 79th year.

Recall that in the actual study there were 235 Oscar winners and 887 controls. To discuss how the Kaplan-Meier test works, using a manageable number of subjects, we chose a random sample of 30 Oscar winners and a random sample of 100 controls from the authors' data set.

We first want to determine a survival curve for each of the two sample groups. Recall that the key information for determining a survival curve for the US population was the estimated probability that a person who has lived to the beginning of year x survives this year. We use the same kind of quantity, which we will again call q(x), to construct the survival curve when we have incomplete information. This is done by constructing Table 3. In column 2 of Table 3 we have listed, in increasing order, the years that the 30 Oscar winners died or were lost to the study.

In column 3 we put a 0 if the actor died and a 1 if the actor was lost to the study. Note that in some years there was more than one actor. For example there were three actors who either died or were lost to the study in their 54th year -- one died and two were lost to the study.

In column 4 we put the number n(x) of Oscar winners known to be alive at the beginning of year x.

In column 5 we put the number of Oscar winners d(x) who died in year x. Then (n(x)-d(x))/n(x) is the proportion q(x) of those alive at the beginning of year x who survived this year. These values appear in column 6.

Finally we use q(x) to estimate the probability that an Oscar winner will live at least to the beginning of year x. As in the case of the traditional life table, l(0) = 1 and, for values greater than 0, l(x) is calculated by the recursion equation l(x) = l(x-1)*q(x). The values of l(x) appear in the last column.

Case  Year x (died or left study)  Status (0 = died, 1 = lost to study)  n(x) alive at start of year x  d(x) deaths in year x  q(x) proportion surviving year x  l(x) proportion surviving at least x years
0           1
1 27 1 30 0 1 1.000
2 37 1 29 0 1 1.000
3 38 1 28 0 1 1.000
4 45 1 27 0 1 1.000
5 50 1 26 0 1 1.000
6 51 1 25 0 1 1.000
7 54 1 24 1 23/24 0.958
8 54 1        
9 54 0        
10 57 1 21 1 20/21 0.913
11 57 0        
12 61 0 19 2 17/19 0.817
13 61 0        
14 63 1 17 0 1 0.817
15 65 0 16 1 15/16 0.766
16 69 1 15 0 1 0.766
17 73 1 14 0 1 0.766
18 73 1        
19 74 1 12 0   0.766
20 75 1 11 0   0.766
21 77 0 10 2 8/10 0.612
22 77 0       0.612
23 78 0 8 1 7/8 0.536
24 81 0 7 1 6/7 0.459
25 82 1 6 0   0.459
26 84 0 5 1 4/5 0.367
27 85 0 4 1 3/4 0.276
28 92 1 3     0.276
29 93 0 2 1 1/2 0.138
30 96 1 1 0 1 0.138

Table 3 Determining the survival probabilities.

Plotting l(x) we obtain the following survival curve for the Oscar winners.

Figure 2 The Survival curve for the sample of Oscar winners

Making a similar table for the sample control group and plotting the survival curve we obtain the survival curve for this group:

Figure 3 The Survival curve for the sample control group

Putting the two together for comparison we obtain:

Figure 4 Survival curve for the sample of 30 Oscar winners and 100 controls

The upper curve is the survival curve for the winners and the lower curve is for the controls. These curves suggest that the winners do tend to live longer than the controls. We can get another indication of this by computing the expected lifetime for members of each group. As for the US population, we simply sum l(t) over all years t. Doing this we find that the expected lifetime for an Oscar winner is 78.7 and for members of the control group it is 74.7, again indicating that the Oscar winners live longer on average than the controls.
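Readers who would like to reproduce this kind of calculation could start from a sketch like the following. It is our own Python illustration, not the program we used; the first two records match the examples described for Table 2, and the remaining records are made up.

def km_survival(records):
    """Kaplan-Meier survival proportions from (age, status) pairs, where
    status 0 means died in that year and 1 means lost to the study.
    Returns a list of (age, l(age)) pairs at the ages where something happened."""
    records = sorted(records)
    curve, l, i = [], 1.0, 0
    while i < len(records):
        age, at_risk, deaths = records[i][0], len(records) - i, 0
        while i < len(records) and records[i][0] == age:
            if records[i][1] == 0:
                deaths += 1
            i += 1
        l *= (at_risk - deaths) / at_risk      # proportion surviving this year
        curve.append((age, l))
    return curve

def expected_lifetime(curve, max_age=110):
    """Sum the survival step function over whole years, as in the text."""
    total, l, j = 0.0, 1.0, 0
    for age in range(1, max_age + 1):
        while j < len(curve) and curve[j][0] <= age:
            l = curve[j][1]
            j += 1
        total += l
    return total

# The first two records match the examples in Table 2; the rest are made up.
sample = [(56, 0), (78, 1), (62, 0), (85, 1), (91, 0), (70, 0), (77, 1)]
curve = km_survival(sample)
print(curve)
print(round(expected_lifetime(curve), 1))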

We wrote a program for the above calculations. We then used this program to compute the survival curves for the original study with 235 Oscar winners and 887 controls. Putting the two curves together for comparison we obtain:

Figure 5 Survival curves for all 235 Oscar Winners and 887 controls

Again the top curve is the survival curve for the Oscar winners and the bottom curve is for the control group. Using all the data we find that the expected lifetime for an Oscar winner is 79.7 years and for the controls it is 75.8 years, which reflects the same difference we saw in our sample.

We must now tackle the question of how the authors concluded that the approximately four-year difference found in the study was significant, i.e., could not be accounted for by chance. For this, the authors used a test called the "log-rank test".

We say an event happened in age year x if at least one subject died or was lost to the study during this year. For each group and each age year we count the number still being followed at the beginning of the year and the number of deaths during the year. For example, in our sample we find that in age year 61 we were still following 19 Oscar winners, 2 of whom died in this year. For the control group we were still following 65 controls, one of whom died this year. Thus there was a total of 84 subjects still being followed at the beginning of age year 61 and a total of 3 deaths during this year. Now, under the hypothesis that there is no difference between the Oscar winners and the controls, these 3 deaths should be randomly chosen from the 84 people still being followed. Thus we can imagine an urn with 84 balls, 19 marked O for Oscar winner and 65 marked C for control. Then father death chooses three balls at random from this urn to determine the deaths. The number of deaths chosen from the Oscar winners group then has a hypergeometric distribution. The probability that any particular death is an Oscar winner is 19/84, so the expected number of Oscar winner deaths in year 61 is 3*(19/84) = .679. The observed number was 2. We also need to calculate the variance of the number of deaths among the Oscar winners, which means we need the variance of the hypergeometric distribution.

Assume that you have n balls in an urn, k of which are red and n-k of which are black. Then if you draw m balls at random without replacement, the expected number of red balls is

                          e = m*(k/n)

and the variance is

                          v = m*(k/n)*((n-k)/n)*((n-m)/(n-1)).

In our example the red balls are Oscar winners, so the variance for the number of Oscar winners who died in the 61st year is

                          v = 3*(19/84)*(65/84)*(81/83) = .512.
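
These numbers are easy to check against the hypergeometric distribution directly; here is a quick sketch using scipy.stats.hypergeom (our choice of tool for the check, not part of the study).

    # Check of the age-61 expected value and variance: 84 subjects in all,
    # 19 of them Oscar winners, and 3 deaths "drawn" at random.
    from scipy.stats import hypergeom

    deaths_among_winners = hypergeom(84, 19, 3)   # (total, # winners, # deaths)
    print(deaths_among_winners.mean())            # about .679, the value e above
    print(deaths_among_winners.var())             # about .512, the value v above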

Then to carry out the log-rank test we do the above calculations for each age year for one particular group; we chose the Oscar winners. Let O be the sum over all years of the observed numbers of Oscar winners who died, E the sum of the expected values, and V the sum of the variances. Then the statistic

                          S = (O - E)/sqrt(V)

will, by the central limit theorem, be approximately standard normal, so S^2 will have approximately a chi-square distribution with 1 degree of freedom. Note that in summing the variances we are assuming that the observed numbers of Oscar winners who die in different years are independent. This is reasonable because we are conditioning on the number in each group that we are still following and the total number of deaths in a given year; these determine the distribution of the number of Oscar winners who die in that year under our assumption that there is no difference between the two groups.
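
Here is a minimal sketch of this computation in Python (again not the program we used for the numbers below, just an illustration of the method). It assumes that for each age year in which a death occurred we have already tallied, for each group, the number still being followed and the number of deaths; the illustrative call at the end uses only the age-61 counts worked out above, so a real run would pass one tuple for every age year in the data.

    from math import sqrt

    def log_rank_statistic(per_year_counts):
        """per_year_counts: one tuple (n_oscar, d_oscar, n_control, d_control)
        for every age year in which at least one subject died."""
        O = E = V = 0.0
        for n_o, d_o, n_c, d_c in per_year_counts:
            n = n_o + n_c                    # total still being followed
            d = d_o + d_c                    # total deaths this year
            O += d_o                         # observed Oscar-winner deaths
            E += d * n_o / n                 # expected deaths under the null hypothesis
            if n > 1:                        # hypergeometric variance of the Oscar deaths
                V += d * (n_o / n) * (n_c / n) * (n - d) / (n - 1)
        S = (O - E) / sqrt(V)
        return S * S                         # approximately chi-square, 1 degree of freedom

    # Illustrative call with only the age-61 counts (19 winners, 2 deaths;
    # 65 controls, 1 death); it prints about 3.4 for this single year.
    print(log_rank_statistic([(19, 2, 65, 1)]))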

Using our program to carry out these calculations for the data for this study we found S^2 = 9.1246. The probability of finding a chi-squared value greater than this is .0025 so this indicates a significant difference between the Oscar winners group and the control group.
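
The quoted tail probability can be checked with one line (using scipy, our choice of tool for the check):

    # Upper-tail probability of a chi-square(1) variable exceeding 9.1246.
    from scipy.stats import chi2
    print(chi2.sf(9.1246, df=1))   # about .0025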

To check our program we carried out the calculations with the Kaplan-Meier procedure in the SAS statistical package, as the authors did. SAS yielded the same survival curves as our program, and for the significance tests SAS reported:

Table 4. Significance tests produced by SAS.

Thus the log-rank test agreed with our calculation. SAS also provided two other tests that might have been used, both of which would also result in rejecting the hypothesis that there was no difference between the two groups. This ends our saga.


WEATHER FORECASTING

Here is an account by Dan Rockmore of another chance exploration carried out recently by Dan and Laurie. This is a sequel to a previous exploration in which they visited local weather forecaster Mark Breen (see Chance News 8.04).

Whither the weather balloon?

Recently, our quest to understand the process of weather prediction led us to take a trip to visit the National Weather Service (NWS) Forecast Office in Gray, Maine. Our goal was to witness statistics in action, in the form of the launch of a real, live weather balloon.

Some time ago, our friend Mark Breen, the local Vermont Public Radio forecaster had shown us how some of his forecasting depended on the regional forecast generated from this NWS office. In turn, their own forecast used the data generated by weather balloons which are launched every day, at seven o'clock in the morning and seven o'clock in the evening at about 70 sites around the country. We couldn't understand how we had never seen one of these balloons. This was clearly a job for the Chance News action team, so off we went to witness a launch, ask some questions and shed some light on the mystery of the missing balloons.

We arrive in nearby Portland, Maine on the evening before our scheduled morning meeting at NWS, just in time to take in a Portland SeaDogs baseball game and have a great dinner at the Fore Street restaurant (order the Loup de Mer if it is on the menu!). We awake at 5:30 AM the next day and are on the road by 6 heading for the weather station. Weather prediction is clearly a coffee-intensive activity.

We wind our way over the local highways on a drizzly and appropriately gray morning in Gray. Our final destination is a medium-sized reddish brick, official-looking building, which is the regional weather prediction center of the National Oceanic and Atmospheric Administration (NOAA). You can see pictures of the office at

http://www.seis.com/~nws/officepix.html.

For a brief history of the office see

http://www.seis.com/~nws/historyPWM.html.

You'll be interested to discover that the National Weather Service traces its origins to President Ulysses S. Grant, who gave responsibility for its creation to the Secretary of War.

We ring the bell and are let into the center. Weather prediction is a 24 by 7 job and some night-shift scientists are around, as well as the scientist in charge of the sacred balloon launch, Art Lester, one of the hydro-meteorological technicians. In fact Art has already been there for about an hour, since it is his responsibility to prepare the balloon at around 6 AM for its 7 AM launch. Art gives us a tour of the operations room, which is basically a computer room where computers are running weather models and weather maps are displayed on every monitor. We ask a bunch of questions about the pictures. The scientists can't help but look and sound like weather forecasters as they respond. Their hands sweep across the monitors as they trace out fronts moving this way and that, lows evolving into highs, and temperature isotherms. Other maps show clouds of precipitation and locate lightning strikes. The weather prediction models are being run in preparation for the construction of the morning regional forecast, which is put together by our host John Jensenius, the Warning Coordination Meteorologist, who arrives at about 6:40.

John takes us aside to show us a weather balloon up close, along with the instrument that it carries into the sky, a radiosonde. The balloon itself is a silky yellowish-brown sac which will be filled with helium and released into the air, and it should rise about 17 miles. The accompanying radiosonde carries sensing instruments encased in styrofoam. The instruments measure temperature, humidity, and barometric pressure at different altitudes. Wind speed and direction are also inferred by tracking the signal and thereby monitoring the movement of the balloon and radiosonde. The readings are sent back via radio waves -- separate frequencies are reserved for the different variables. In fact, as a way of checking that the radiosonde is on-line, the measurements are "played" as a set of tones of different frequencies. This gives new meaning to songs like "You Are My Sunshine" and "Thunder and Lightning".

Launch time is approaching. Everything checks out at the office and now it's time to launch! We go out and get into the car for a quick drive up to the launch site -- it looks like a little observatory, really just a largish tall garage. We go inside and there is the balloon, tied to the table, with the radiosonde tied to it. The garage door is opened and final preparations are made, checking again that the radiosonde is secured and transmitting. The balloon is walked outside and, with no ceremony at all, released, and it speeds into the sky. Its rate of ascent makes it clear to us why we have never seen one floating lazily in the sky: with the low ceiling on this cloudy day it is out of sight in ten seconds, and we imagine that even on a clear day it would be invisible in less than 30 seconds. As it rushes up to the clouds, the highly malleable skin flattens out in the early morning wind, looking more like a large lumpy pillow (UFO?) than a balloon, and soon it is gone into the low clouds on this rainy morning. The radiosonde continues to transmit its song of the weather.

The readings will continue to be sent for about 2 hours. After this the balloon usually bursts (this was tested on the ground) and begins to descend rapidly to earth, often ending up in the ocean or a tree somewhere, which is probably why we have yet to come across one in our daily wanderings. In general, those launched on the east coast are rarely found, although in other places, like the landlocked Midwest with its wide-open plains, they are found relatively often. The radiosonde has a little notice on it, assuring anyone who finds it that the equipment is perfectly safe and asking that it be mailed back (for free) to the regional office whose address is on the label. Each launch costs about 200 dollars, including the money spent on the scientists' time. It is believed that this process will be automated in the near future, and that ultimately (possibly as soon as ten years from now), as satellite imaging gets better, there may very well come a day when the balloons are unnecessary.

We get back into the car to return to the main building for our debriefing. We continue our discussion with John about the role that the forecast plays and the many, many tools now available for predicting the weather. John will be using the radiosonde data, as well as the most recent model output and satellite data (animated as brief movies made by looping the pictures taken over time by the satellites), to prepare the day's area forecast discussion, which is posted on the web (see

http://www.seis.com/~nws/mesnhs.html

for the appropriate link as well as other related links). Comparison of the model predictions with incoming weather data helps to generate the forecast. We remember that the regional forecast is used by our own Mark Breen on VPR to help create his local forecast.

John explains that there have been great improvements in the 3-, 4-, and 5-day forecasts, but that beyond that accurate predictions remain elusive. Once again we discuss the age-old bugbear, "What does probability of precipitation mean?" as well as its corollary, "If a 20 percent chance of rain is predicted, and it rains, is the forecaster correct?" Here is our e-exchange; two short code sketches illustrating John's answers follow it:

Dan: Given some initial conditions, the model either outputs precipitation, or no precipitation - the algorithm then goes back and checks to see historically what percent of time these initial conditions actually did produce precipitation. Is that right?

John: Yes. The machine-generated probabilities are based on equations developed with forward stepwise multiple linear regression, based on a historical sample of past observed data (the predictand) and past model forecasts (the predictors). The regression can select from a considerable number of predictors, but usually the equations are limited to between 10 and 12 predictors. For the Probability of Precipitation (PoP), the most important predictors are generally the model-forecast relative humidity, model-forecast precipitation occurrence, model-forecast precipitation amount, and model-forecast vertical motion. Some of these are usually transformed into binary and grid-binary formats in addition to being offered in the raw formats. In addition, the model forecasts are space-smoothed to help incorporate some of the uncertainty inherent in the computer-generated forecasts.

Dan: If the weatherman predicts a 20% chance of rain, and it rains, is he right?

John: Only two forecasts can be absolutely wrong. A forecast of a 0% probability of rain when it rains and a forecast of 100% probability when it doesn't rain. All other probabilities must be evaluated in terms of the forecast reliability. To evaluate the reliability, you must gather a sufficient sample of cases of each probability (for example 20%). Then you must determine the observed relative frequency of precipitation for those specific cases. If the observed relative frequency closely matches the forecast probability, then the forecast is reliable. By looking at the observed relative frequencies for each of the forecast probabilities, you can see whether the forecasts are reliable over the entire forecast range of probabilities. One other factor to consider is the ability of a forecast system to discriminate between the rain/no rain cases. To simply always forecast the climatic observed relative frequency of precipitation will give you reliable forecasts given a sufficiently large sample, but will do nothing to discriminate between the rain/no rain cases. Both reliability and discrimination factor into the Brier Score.
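
The kind of regression John describes can be sketched in a few lines of Python. The toy below is ours, not NWS code: the data are random placeholders and scikit-learn's SequentialFeatureSelector stands in for the classical forward stepwise screening of predictors.

    # Toy forward-stepwise linear regression for a probability of
    # precipitation, in the spirit of John's description. All data here
    # are random placeholders, not real model output.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.feature_selection import SequentialFeatureSelector

    rng = np.random.default_rng(0)
    X = rng.random((500, 6))    # columns stand in for model-forecast predictors
    rained = (X[:, 0] + X[:, 1] + 0.3 * rng.random(500) > 1.2).astype(float)

    # Forward selection keeps only the most useful predictors.
    selector = SequentialFeatureSelector(LinearRegression(),
                                         n_features_to_select=2,
                                         direction="forward").fit(X, rained)
    model = LinearRegression().fit(selector.transform(X), rained)
    pop = model.predict(selector.transform(X)).clip(0, 1)   # crude PoP values
    print(pop[:5])

And here is an equally rough sketch of how the reliability and the Brier score that John mentions can be computed from a record of forecast probabilities and outcomes (again with made-up numbers):

    # Reliability and Brier score for a set of probability forecasts.
    from collections import defaultdict

    forecasts = [0.2, 0.2, 0.2, 0.2, 0.2, 0.6, 0.6, 0.6, 0.9, 0.9]  # forecast PoP
    rained =    [0,   0,   1,   0,   0,   1,   0,   1,   1,   1]    # 1 = it rained

    # Reliability: for each forecast probability, compare it with the observed
    # relative frequency of rain on the occasions that probability was issued.
    cases = defaultdict(list)
    for p, outcome in zip(forecasts, rained):
        cases[p].append(outcome)
    for p, outcomes in sorted(cases.items()):
        print(f"forecast {p:.1f}: observed frequency {sum(outcomes) / len(outcomes):.2f}")

    # Brier score: mean squared difference between forecast and outcome
    # (lower is better); it rewards both reliability and discrimination.
    brier = sum((p - o) ** 2 for p, o in zip(forecasts, rained)) / len(forecasts)
    print(f"Brier score: {brier:.3f}")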

Finally it's time to leave. We've tracked down the elusive weather balloon, yet unbelievably the joys of discovery are not yet over. As we make our way back to the interstate, we decide to stop for a proper breakfast at a roadside diner, Stones Grove. The six pick-up trucks parked in front are a good omen. We are not disappointed as we enjoy the best home fries we've ever tasted. The diner is full of regulars, and we join in as the whole gang sings Happy Birthday to Tom the cook, who receives an 8-pound torque wrench, a sweater, a pineapple upside-down cake, and a gift certificate to L.L. Bean. As we climb back in the car and head home, we congratulate each other on a perfect trip. Now what were the chances of that?!

DISCUSSION QUESTION:

Read how the Brier Score is defined in an article by Harold Brooks at:

http://www.nssl.noaa.gov/~brooks/media/okcmed.html

Read how Peter Doyle would measure the validity of a weather forecaster at:

http://math.dartmouth.edu/~doyle/docs/email/forecast.txt

What is the difference between these two approaches? Which do you think is better? Compare the two methods as applied to the data given in Harold Brooks' article.


Copyright (c) 2004 Laurie Snell
This work is freely redistributable under the terms of the GNU General Public License published by the Free Software Foundation. This work comes with ABSOLUTELY NO WARRANTY.

BEST OF CHANCE NEWS
APRIL 17 2004