CHANCE News 9.02

(January 3, 2000 to February 4, 2000)


Prepared by J. Laurie Snell, Bill Peterson and Charles Grinstead, with help from Fuxing Hou and Joan Snell.

Please send comments and suggestions for articles to

Back issues of Chance News and other materials for teaching a Chance course are available from the Chance web site:

Chance News is distributed under the GNU General Public License (so-called 'copyleft'). See the end of the newsletter for details.

Chance News is best read using Courier 12pt font.


I think we finally have a poll without a margin of error.

Presidential candidate John McCain


Contents of Chance News 9.02


Note: If you would like to have a CD-ROM of the Chance Lectures available on the Chance web site please send a request to jlsnell@dartmouth.edu with the address where it should be sent. There is no charge.

Norton Starr passed on the following item from RSS News, V. 27, No. 5, Jan. 2000, page 12:
a p-value (the probability over repeated sampling, of observing data as extreme as, or more than, what would be observed if an impossible hypothesis were true)...

The Psychologist May 1999


We have put a new Chance Video on the Chance web site:

To Sample or Not to Sample?
Why is That the Question for Census 2000?

by Stephen Fienberg, Carnegie Mellon University.

This talk is based on the new book:

Who Counts? The Politics of Census-Taking in Contemporary America Margo J. Anderson and Stephen E. Fienberg
Russell Sage Foundation, New York, 1999. 320 pp $32.50
ISBN 0-87154-256-0

We reviewed this book in the previous Chance News (9.01). There is another interesting review in Science (Jan. 14, 2000; 287:239-240) by Thomas R. Belin, who worked for the Census Bureau during the 1990 census. Belin remarks that the authors deserve particular credit for their discussion of the effect on Census 2000 of the recent change that allows respondents to identify themselves as belonging to one or more ethnic categories. He remarks that these authors appear to be the first to address this issue. There is much more of interest in this review and also in the video!

Rich Brown suggested "A Curmudgeon Teaches Statistics".

This web site is maintained by John Marden who teaches statistics at the University of Illinois. Marden keeps a daily diary of his thoughts on teaching statistics with emphasis on his Stats 100 course. On the day we looked at his site (January 26) we read:

The web is certainly helpful in making statistics relevant. Yesterday in my class I went to the CNN site, to the article relating male baldness to heart problems. It fit right in with the topic of the day: observational studies. Does the observed association between baldness and heart problems prove causation? Will getting one's head shaved provoke a heart attack? No. (Although Samson may disagree.) The students did a good job of coming up with possible third factors to explain the association: diet, stress, hormones. The last was the suggestion in the article: High levels of testosterone are associated with baldness and high cholesterol.
Clicking on the previous days of his calendar gives you his earlier thoughts on teaching statistics. On January 24 Marden discussed the decision by the Harris Poll to conduct all its polls on the web. He notes an interesting article: Pollster sheds old ways, by Lakshmi Chaudhry, Wired News, 24 Jan. 2000, which discusses two different approaches to conducting polls on the internet: those of Harris Interactive and Intersurvey.

Harris samples from a database of 5 million users who have agreed to answer surveys and then adjusts this internet sample to make it more representative. Intersurvey chooses smaller random samples from the U.S. population and provides these households with the necessary hardware to carry out internet surveys.

On January 22 we find Marden's version of the famous Titanic activity. This activity is discussed in The "Unusual Episode" Data Revisited by Robert J. MacG. Dawson (1995) Journal of Statistics Education vol. 3 no. 3.

Going back to January 3 we find that Marden used Chance News to give his feelings about sharing work using the GNU license or other open source schemes.

Paul Gazin at the Moses Brown School in Providence, Rhode Island wrote to the AP stat discussion list that he gives students extra credit (4 pts. on the next test grade; one extra credit per test) if they can bring in a newspaper or magazine article that uses statistics in a misleading or erroneous fashion and if they can explain the problem coherently. Gazin writes:
Here's a gem a student brought in yesterday from an article on drunk driving in our local paper (Fewer dying on R.I. roads, Providence Journal, Dec 24, 1999, Jonathan Saltzman).
Forty-two percent of all fatalities occurred on Friday, Saturday, and Sunday, apparently because of increased drinking on the weekends.
Of course 42% is remarkably close to 3/7, which is about 43%.

Gazin says that this remark is followed by:
More than a third of the fatalities occurred in Providence, Warwick and Cranston, the most populous area in the state.
In fact, well over half of the people in Rhode Island live in one of those three cities!
Another reader, John Becherer, wrote a letter to the editor also pointing out that 3/7 is pretty close to 42%. But then he goes on to say that, since far fewer commuter miles are driven on the weekend, the number of accidents per mile driven is probably greater then, for the reason suggested. Becherer writes:
I just hate to see poor facts tortured and twisted with no regard for the obvious -- no matter how good the reason. As Twain said, "There are lies, damn lies and statistics."

Alas, we all know that this quote did not originate with Twain and, as John Biddy tells us, it probably did not even originate with Disraeli (See Chance News 2.03).

The examples reminded Bob Hayden of the Dilbert cartoon (17 April 1996):

Secretary: Oh my! This is shocking!
Boss: What?
Secretary: 40% of all sick days taken by our staff are Fridays and Mondays!
Boss: What kind of idiot do they think I am?
Secretary: Not an idiot savant, they can do math.


What do you think the Boss meant by his statement "What kind of idiot do they think I am?"

Poll results show vulnerability in calling primaries accurately.
The New York Times, 3 Feb. 2000, A24
Michael R. Kagay

This article reports that, while the polls predicted the winners in the New Hampshire primary, they did not do so well in predicting the margins of victory. The article mentions two examples: the USA Today/CNN/Gallup poll and the CBS News poll. From the Gallup web site we learn that the last USA Today/CNN/Gallup tracking poll was carried out Jan. 30-31. For this poll a sample of likely Republican voters was asked:

Suppose the Republican primary election for president were being held today. If you had to choose among the following candidates, which candidate would you vote for? [RANDOM ORDER: Gary Bauer, George W. Bush, Steve Forbes, John McCain, Alan Keyes]

(If Unsure) As of today, to which Republican candidate do you lean most?

A sample of likely Democratic voters was asked a similar question. The results of the poll were:

Likely Republican voters             Likely Democratic voters

McCain           44%                 Gore              54%                
Bush             32                  Bradley           42
Forbes           13                  Other/undecided    4
Keyes             7                  N = 697
Bauer             1
Hatch             -
Other/undecided   3
N = 888

The margin of error was given as plus or minus 4 percentage points in each case.

The article reports that the CBS News tracking poll, in its final interviewing on Friday, Saturday, and Sunday, found McCain with only a 4-point lead and Gore with a 16-point lead.

The final results were:

McCain          49%                  Gore               50%
Bush            30                   Bradley            46
Forbes          13
Keyes            6
Bauer            1

The article observes that the polls consistently underestimated McCain's margin of victory (which turned out to be 19 points) and overestimated Gore's margin of victory (which turned out to be 4 points).


(1) The article attributes the poor estimates of the difference between the leading candidates to late-deciding voters and the influence of independents. How do you think they know this?

(2) Another newspaper stated that the margin of error for the difference should be 8%. Is this correct? See Chance News 6.06.
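Regarding question (2): one standard approximation (a sketch on our part, not necessarily the derivation in Chance News 6.06) treats the two leading candidates' shares as coming from a single multinomial sample of size n, which gives Var(p1 - p2) = (p1 + p2 - (p1 - p2)^2)/n for the difference:

```python
from math import sqrt

def moe_difference(p1, p2, n, z=1.96):
    """Approximate 95% margin of error for the difference p1 - p2 of two
    proportions estimated from the same multinomial sample of size n."""
    variance = (p1 + p2 - (p1 - p2) ** 2) / n
    return z * sqrt(variance)

# Gallup's final Republican numbers: McCain 44%, Bush 32%, n = 888.
moe = moe_difference(0.44, 0.32, 888)
print(round(100 * moe, 1))  # about 5.7 percentage points
```

By this reckoning the margin of error for the McCain-Bush difference is closer to 6 points than to 8, so simply doubling the individual 4-point margin overstates the uncertainty in the difference.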

No Pat Answer on the PAT.
Sports Illustrated, 27 Dec. 1999, p. 36

In Football, 6 plus 2 often equals 6.
The New York Times, 9 Jan. 2000, Sec. 4 p. 2
David Leonhardt

These articles report on the work of Harold Sackrowitz, a Rutgers statistician, dealing with the question of whether a football team should go for one or two points after a touchdown. The articles do not provide very many details, but we will try to show now why this problem is not completely trivial. The first strategy to consider is the one that maximizes the expected number of points scored by the team. Since the probability of successfully completing a try for two points is well less than 1/2 (about 37%), while the probability of making the single point is very close to 1 (about 98%), the expected values are 2 x .37 = .74 points versus 1 x .98 = .98 points, so this strategy would dictate always trying for one point.

However, anyone who follows the game of football knows that the score and the amount of time left in the game should affect this decision. For example, if after scoring the touchdown, the team is behind by five points, and it is late in the game, so that one more score is all that they can expect to make, then it may make sense for them to try for two points. This is because if they make a single point, they are still behind by four points, requiring their next score to be a touchdown, while if they make two points, they can tie the game on their next score with a field goal (assuming the other team has not scored in the meantime).

Of course, there are scenarios that will make the leading team try for two instead of one. For example, if the leading team has just scored a touchdown, and is now ahead by one point, and it is late in the game, they should probably try for two points, since, if they miss, they are not appreciably worse off than before; while if they make the two points, then the other team can only tie, and not win, the game with a late field goal.

In the Sports Illustrated article we read:
Sackrowitz has devised a formula that takes into account the score; the number of possessions left for both teams (six per quarter is the NFL average); the possible result of each of these possessions (touchdown, field goal or no score: safety is ignored because it's probabilistically insignificant); and the probability of each of these results based on the team's previous performance. The formula then reveals whether a team's chances of winning are greater if it goes for one or two. He lays this all out in an easy to read table.

We will be able to read the details of the construction of the table in a forthcoming article in Chance Magazine. But here is Sackrowitz's table showing that this is a subtle problem.

Sackrowitz explains the table as:

The table below gives the (probability based) optimal strategy for two 1998 "league average" teams playing against one another. That is, they, roughly, each score 2.41 touchdowns per game, make 1.475 field goals per game, and convert 39% of their 2 point conversion attempts. It assumes that all 1 point conversion attempts are successful. This was true for 60% of the teams in 1998. A typical NFL game will have approximately 6 possessions in each quarter.

STRATEGY TABLE: Gives conversion (1 or 2 pt) to be attempted. Point difference is immediately after TD but before any conversion attempt. A minus sign means you are behind. Blank indicates it doesn't matter.

You need to be using courier 12pt font to view this table.

 point                total number of possessions remaining 
 difference           in the game for both teams
             0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15

    -15                  2  2  2  2  1  1  1  1  1  1  1  1
    -14                  1  1  1  1  1  1  1  1  1  1  1  1
    -13                  2  2  2  2  2  2  1  2  1  1  1  1
    -12                        2  2  2  2  2  2  2  1  1  1
    -11                  1  1  1  1  1  1  1  1  1  1  1  1
    -10            2  2  2  2  1  1  1  1  1  1  1  1  1  1
     -9                  2  2  2  2  2  2  1  1  1  1  1  1
     -8            2  2  2  2  2  1  1  1  1  1  1  1  1  1
     -7         1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
     -6                  1  1  1  1  1  1  1  1  1  1  1  1
     -5            2  2  2  2  2  2  2  2  2  2  2  2  2  1
     -4            1  1  1  1  1  1  1  1  1  1  1  1  1  1
     -3            1  1  1  1  1  1  1  1  1  1  1  1  1  1
     -2      2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
     -1      1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
      0      1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
      1         2  2  2  2  2  2  2  1  1  1  1  1  1  1  1
      2         1  1  1  2  1  2  1  2  1  2  1  1  1  1  1
      3         1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
      4            1  1  1  1  1  1  1  1  1  1  1  1  1  1
      5         2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
      6         1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
      7         1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
      8         1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
      9               1  1  1  1  1  1  1  1  1  1  1  1  1
     10               1  1  1  1  1  1  1  1  1  1  1  1  1
     11               1  1  1  1  1  1  1  1  1  1  1  1  1
     12               2  2  2  2  2  2  2  2  2  2  2  1  1


(1) Why is the strategy so different, say when you are losing by 6 points as compared to losing by 5 points?

(2) Why is the strategy for when you are 2 points ahead so complicated?

(3) What considerations other than expected value might a coach want to take into account?

Ask Marilyn.
Parade Magazine, 2 January 2000, 12
Marilyn vos Savant

A reader asks:

A friend has signed up to do a tandem jump from a plane. She'll be parachuting strapped to her jump instructor. The way I see it, each time a person jumps from a plane, the odds of crashing due to a parachute failure increase. The instructor has jumped more than 1000 times; this will be my friend's first jump. Therefore, I believe her odds of a problem are increased to his. Others say that his odds are decreased to hers. Still others say it's somewhere in the middle. What are your thoughts on this?

In her response, Marilyn draws a distinction between "random parachute failure" and human error. She asserts that the odds of parachute failure are the same for each jump, rather than increasing as the reader suggests. On the other hand, she feels that the chance of human error should actually decrease with experience, so on any individual jump the instructor is better off than a novice would be.

Marilyn adds that tandem jumps are more complicated. The instructor would be safer jumping alone, but he is making things safer for the novice by jumping tandem.


(1) What misconception led the reader to think that the instructor's odds of a crash were going up over time?

(2) Do you think Marilyn's response will clarify this issue for the reader? If not, how would you respond?

Chimpanzee remembers five numbers.
Boston Globe, 6 Jan. 2000, A3
Associated Press

Numerical memory span in a chimpanzee
Nature, Jan. 6, 2000, Volume 403 Number 6765 p 39
Nobuyuki Kawai and Tetsuro Matsuzawa

In Chance News 7.10, we reviewed several articles reporting that monkeys possess some numerical skills. Now a new Japanese study, published in the journal Nature, has demonstrated that a chimpanzee can correctly recall the order of a sequence of random numbers.

The chimpanzee, whose name is Ai, had previously learned a sorting task. When five numbers were drawn at random from {0,1,...9} and presented scattered on a computer screen, Ai could correctly sequence them from smallest to largest. In the new experiment, once Ai touched the first number, the remaining four were concealed by white squares on the screen. The task was then to touch the four squares in proper order. This requires memorizing the order of the original numbers. Ai was successful in 65% of the trials with five numbers. With four numbers, she succeeded 90% of the time.

The researchers noted that an average pre-school child can readily do this task with five numbers. Adults can succeed with seven.

You can see Ai in action by going to Ai's Home Page. See if you can tell how many she gets right!


(1) The article states that Ai's results are "far better than chance." What are the chances in each case?

(2) Do you think the task performed here is equivalent to remembering five numbers, as suggested by the headline? If not, can you suggest an experiment that would be?
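For question (1), one natural chance baseline (our assumption about the chance model; the Nature paper may define it differently) is that, after the first and lowest number is touched, a blind guess hits the one correct ordering of the k remaining masked squares with probability 1/k!:

```python
from math import factorial

def chance_of_correct_order(total_numbers):
    """Probability of touching the masked squares in the correct order by
    pure guessing, after the first (lowest) number has been touched."""
    masked = total_numbers - 1
    return 1 / factorial(masked)

print(round(chance_of_correct_order(5), 4))  # 0.0417, i.e. 1/24
print(round(chance_of_correct_order(4), 4))  # 0.1667, i.e. 1/6
```

So Ai's 65% success rate with five numbers and 90% with four compare with chance baselines of only about 4% and 17%.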

The placebo prescription.
The New York Times Magazine, 9 Jan. 2000, p 34
Margaret Talbot

This is a fascinating article that should be read in its entirety. We will attempt to summarize it here.

This article reports on many issues dealing with placebos in medicine. Placebos have been shown to have a positive effect both in cases involving surgical procedures and in cases involving prescription drugs. As an example of the former, a surgeon in Houston had 10 patients scheduled for surgery to relieve the arthritis pain in their knees. In a double-blind experiment, the patients were told that some of them would be given the standard surgical treatment, which involves scraping and rinsing the knee joint, others would just have their knee joint rinsed, and others would not have any surgical procedure, other than the making of incisions, performed upon them. (The numbers of procedures performed of these three types were 2, 3, and 5, respectively.) Six months after surgery, all 10 of the patients reported much less pain.

In the case of prescription drugs, the article gives several examples of cases in which the improvement of the patients seems to be largely due to the placebo effect. In one case, a study involving a genetically engineered heart drug found that, while both the patients who received the drug and the ones who received placebos showed improvements in their conditions, the ones who received the placebos showed more improvement than did the others. In another case, both the placebo-takers and the drug-takers of a food-allergy drug had a 75% positive response. This last example not only shows that perhaps this particular drug is not very effective, but it also shows that the placebo effect can be a very significant one.

Results of placebos are not limited to subjective sensations of the patients; physiological changes can result as well. As an example, physicians have successfully removed warts by painting them with an inert dye, telling the patients that after the color wore off, the warts would be gone. In another study, people who had just had their wisdom teeth extracted were helped just as much by fake ultrasound as real ultrasound, as long as both the patient and the therapist thought that the ultrasound was real.

One of the main questions of the article can now be stated: If placebos are so effective, does it make sense, and is it ethical, to include them as part of a physician's arsenal? From a statistical point of view, the answer to this question might very well be yes, but it is less clear from the point of view of a physician. Western medicine is largely founded on logic and objective reality, and placebos do not fit in well with this view of the world.

Perhaps one way to convince physicians that placebos should be accepted as a possible form of treatment is to understand the reasons that placebos work. Of course, this problem has been worked on for a long time, without any clear resolution. The argument that is most often cited is that the mind exerts a great deal of control over the workings of the body. This statement is probably beyond reproach, but it does little to explain how this control works.

Does the placebo need to be administered in a double-blind, or even a single-blind, manner? Here again are some surprising results. In one study, 15 adult "neurotics" were given an inert pill identified as such. Fourteen of the patients reported improvements in their conditions after one week.

The article concludes with a long discussion about the possibility that the real secret of the placebo effect may be the care shown by the doctor for the patient. Several examples, such as the following one, are given to support this thesis. In 1987, a British doctor responded in one of two ways to each of 200 of his patients who came to see him after feeling "under the weather." He told some of them that he knew what the problem was, and that they would feel better in a few days. He told the others that he couldn't be sure what was wrong with them, and so he wouldn't say when they would feel better. Two weeks later, the percentages of those who were better in the two groups were 64% and 39%, respectively.

Unfortunately, in this day of HMO's, empathetic, caring treatment by physicians is hard to come by. The next step in this debate might be for someone to create a study that tries to compute, in dollars and cents, the value of the positive effects of such treatments. If it could be shown that patients would require fewer visits to a caring physician than to a harried one, the executives at the HMO headquarters might listen.


In the first example above, the patients were told that they were part of the described study. In particular, they knew that there was a positive probability that they would have no surgical procedure, except for the incisions, performed on them. The fact that, until the incisions were actually made, neither the patient nor the doctor knew whether the procedure would be performed, makes this experiment a double-blind one. Suppose that the patients are never even told that they are part of such a study? (Is this called triple-blind?) Might this increase the placebo effect? Would such a study be called unethical?

Paradox in game theory: losing strategy that wins.
The New York Times, 25 Jan. 2000, F5
Sandra Blakeslee

In the last Chance News (9.01) we discussed two simple coin-tossing games, each of which is unfavorable, but such that if, over a sequence of plays, you randomly choose which of the two games to play, you have a favorable game. This has been called Parrondo's paradox after its discoverer, Juan Parrondo, who teaches physics at the Complutense University in Madrid. The article states:
The paradox is inspired by the mechanical properties of ratchets -- the familiar saw-tooth tools used in automobile jacks and in self-winding watches.

The article we reviewed in the last Chance News was from Nature and written by Derek Abbott and Gregory Harmer. These authors carried out experiments to verify and explain how Parrondo's paradox works.
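It is easy to watch the paradox happen in a short simulation. The parameters below (a bias of epsilon = 0.005, and a "bad" coin in game B whenever capital is a multiple of 3) are our understanding of the standard Harmer-Abbott formulation; the specific numbers should be treated as illustrative.

```python
import random

EPS = 0.005  # small bias that makes each game unfavorable on its own

def play_a():
    """Game A: a single slightly unfair coin."""
    return 1 if random.random() < 0.5 - EPS else -1

def play_b(capital):
    """Game B: a very bad coin when capital is a multiple of 3,
    a fairly good coin otherwise."""
    p = (0.10 - EPS) if capital % 3 == 0 else (0.75 - EPS)
    return 1 if random.random() < p else -1

def final_capital(strategy, steps=500):
    """Capital after `steps` plays of game A, game B, or a random mix."""
    capital = 0
    for _ in range(steps):
        if strategy == "A":
            capital += play_a()
        elif strategy == "B":
            capital += play_b(capital)
        else:  # "AB": flip a fair coin to pick which game to play
            capital += play_a() if random.random() < 0.5 else play_b(capital)
    return capital

random.seed(1)
trials = 2000
for s in ("A", "B", "AB"):
    avg = sum(final_capital(s) for _ in range(trials)) / trials
    # games A and B lose on average, while the random mixture gains
    print(s, round(avg, 1))
```

The mixture wins because random switching spends less time on the "bad" coin of game B than game B does on its own.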

In this article Abbott is quoted as saying:

You see ratchets everywhere in life. Any child knows that when you shake a bag of mixed nuts, the Brazil nuts rise to the top. This is because smaller nuts block downward movement of larger nuts.
The electronic version of this article on the New York Times web site includes another quote from Abbott:
The paradox may shed light on social interactions and voting behaviors. For example, President Clinton, who at first denied having a sexual affair with Monica S. Lewinsky (game A) saw his popularity rise when he admitted that he had lied (game B.) The added scandal created more good for Mr. Clinton.
but this example was not included in the paper edition.

In this article we also read:
Sergei Maslov, a physicist at Brookhaven National Laboratory, recently showed that if an investor simultaneously shared capital between two losing stock portfolios, capital would increase rather than decrease. "It's mind-boggling," Dr. Maslov said. "You can turn two minuses into a plus." But so far, he said, it is too early to apply his model to the real stock market because of its complexity.
For other more serious applications of ratchets see:

Speedy sorting.
New Scientist 13 Nov. 1999
Philip Cohen

An application of ratchets to separating fragments of DNA.

Body works.
New Scientist, 13 Dec. 1997
Dean Astumian

This article discusses a program to use ratchets to make a machine that mimics the way muscles move. The article includes a discussion of Feynman's explanation for why you cannot drive a one-way machine by using random thermal movement of gas molecules alone. (See The Feynman Lectures on Physics, Vol 1, p46-1, Addison-Wesley, 1963)

Quantum clockwork.
New Scientist, 22 Jan. 2000

Some new ideas how ratchets might be used in quantum computing.

(1) According to the article:
The paradox is illustrated by two games played with coins weighted on one side so that they will not fall by chance to heads or tails.
J. L. Doob once remarked:
If you believe Newton's laws have anything to do with what happens when you toss a coin, then having one side of a coin heavier than the other should not change the probability of getting heads. (See Chance News 4.02: Games of chance)

Why do you think he said this? Is he correct? Try this experiment and let us know the answer.

(2) Doob also writes:

Perhaps Bradley will win by losing in both Iowa and New Hampshire. Someone should tell him about that paradox.

What do you think about that?

(3) Who is this Doob guy?

25% of children are affected by alcoholism, a study says.
New York Times, 1 Jan. 2000, A22
Associated Press

Data presented in the current issue of The American Journal of Public Health indicate that, while growing up, one out of four US children experience living with a parent or other adult who has a drinking problem. These children are at increased risk for emotional and behavioral problems, as well as for later alcohol problems of their own.

The analysis is based on a 1992 survey of 42,800 adults, described here as the most recent data available. An estimated 10 million children were exposed to family alcohol problems, and over 28 million lived with adults who either were alcoholics or had abused alcohol at some point in their lives. The article reports that the author of the study, epidemiologist Bridget Grant, "concluded that the children's actual exposure fell between those two extremes," and that this reasoning led to the 25% figure.


(1) In what sense are the two figures quoted "extremes?" Where does the 25% figure come from?

(2) It appears that data from one survey were used to compute exposure throughout childhood. How can this be done? What difficulties might you anticipate?

(3) Dr. Enoch Gordis of the National Institutes of Health said the analysis based on the 1992 data was still valid, because alcoholism rates have stayed constant. How would this be known? How do you think "alcohol abuse" is defined? Is its rate also constant?

Go the medical route if herb doesn't relieve depression.
Boston Globe, 10 January 2000, C4
Judy Foreman

Special report. St. John's wort: less than meets the eye. Globe analysis shows popular herbal antidepressant varies widely in content, quality.
Boston Globe, 10 January 2000, S4
Judy Foreman

How the Globe did its testing.
Boston Globe, 10 January 2000, C4
Judy Foreman

In recent years, the herb St. John's wort has been widely marketed as a non-prescription antidepressant. While many people claim to have experienced benefits, the first article here points out that these could be attributable to the placebo effect. More convincing evidence may be available next year, when a study by the National Institute of Mental Health is due to be completed. This study, directed by researchers at Duke University, involves 336 patients across the country. It will compare St. John's wort with the antidepressant drug Zoloft and a placebo.

Various preparations of St. John's wort are sold under different brand names. In 1996, German researchers conducted a meta-analysis of 23 short-term studies involving different brands. They found that for mild to moderate depression, St. John's wort performed as well as prescription medications, and better than placebo. The Duke study will use Kira, a German brand which has been the most studied to date. It will also enforce a consistent protocol. While the smaller studies all appear to indicate positive results, one member of the Duke team pointed out that none of them had a completely satisfactory methodology.

The last two articles address concerns about what consumers actually get with a purchase of St. John's wort. The FDA does not certify the health claims of herbal products. Also, being a natural product, the herb is sensitive to factors like temperature, so the potency of the preparations can degrade on the shelf. A Globe investigative team set out to test store-bought samples of six leading brands. At a Cambridge, Massachusetts CVS drugstore, the team purchased Natrol, NatureMade, Nature's Resource, Quanterra, YourLife and the CVS store brand; they also purchased a seventh product, Herbalife, which is sold only through distributors.

Two labs were chosen to analyze the samples: a chemical testing company called PhytoChem Technologies in Chelmsford, Massachusetts; and an herbal testing company called Paracelsian, Inc. in Ithaca, New York. Both labs received a complete set of seven coded bottles, each bottle containing several pills from one of the brands. Each set also contained an eighth bottle of "dummy" pills supplied by the Massachusetts College of Pharmacy. The coding blinded the analysts to the brand names.

PhytoChem performed a spectrophotometry test to measure the amount of the chemical hypericin in each sample. This chemical is believed to be the "active ingredient" in the herb. The lab also performed a chromatography test, which compares light absorption patterns of the samples to known patterns for St. John's wort. The lab found that only Nature's Resource lived up to its labeling claim of 0.3% hypericin content, which is considered the industry standard. Four others were close to the label claim: Natrol had 0.28%, NatureMade had 0.27%, and Herbalife and YourLife both had 0.25%. Quanterra, which makes no claim, turned out to contain almost no hypericin!

Paracelsian compared each sample to Prozac and Zoloft, and to the Perika brand of St. John's wort. The first two are prescription drugs which treat depression by inhibiting the absorption of the neurotransmitter serotonin in brain cells. Perika has been shown to have similar effects in rat studies. Each St. John's wort preparation was put in a test tube with rat brain cells, and the cells' ability to absorb serotonin was then measured. The process was repeated with the neurotransmitter dopamine, which is affected to a lesser degree by the drugs. Only two preparations, Quanterra and NatureMade, passed Paracelsian's "BioFIT" test for their ability to inhibit the absorption of both serotonin and dopamine.

The last article includes a discussion of the strengths and weaknesses of the investigation. Cited as strengths were choosing two labs, blinding the labs to the brand names and to the presence of a dummy preparation, and buying the herbal preparations at a store as consumers would. Among the weaknesses cited were not testing the preparations on humans, omitting some brands (for example, the Kira brand being used in the Duke study was not tested), and not confirming the results with other laboratories. Finally, some scientists now think that hyperforin, not hypericin, may be the relevant chemical in the herb.


(1) Why is it good to buy the drugs at a store rather than requesting samples from the manufacturer? Can you think of any drawbacks?

(2) The PhytoChem and Paracelsian results seem to favor different products. The articles don't say much about this. What do you make of it?

(3) The second article noted that both PhytoChem and Paracelsian have commercial interests in testing herbal products. Paracelsian hopes its "BioFIT" testing will become a labeling standard. Should we be worried about this?

Tall men finish 1st in the mating game, research suggests.
Boston Globe, 13 Jan. 2000, A3
Rick Callahan (Associated Press)

Based on examination of the medical records of 3200 Polish men from 25 to 60 years old, a team of British and Polish scientists reports that taller men are more likely to marry and tend to have more children than shorter men.

The data were collected in Wroclaw, Poland from 1983 to 1989, and were obtained from military service records. Of the 4400 men represented, the study excluded those identified as abnormally short or tall. The average height of the 3200 men in the remaining sample was 5 feet 6 inches. The analysis showed that bachelors were 1 inch shorter on average than married men, even after accounting for the fact that men's heights in general have increased over the last several decades. Men without children were 1.2 inches shorter on average than men who had children.

When the sample was broken out by age, it was found that tall men in their twenties, thirties and forties had more children than short men, but the effect was not observed for men in their fifties. Robin Dunbar of the University of Liverpool attributed this to World War II, in which many men who would now be in this age group lost their lives, thus reducing the opportunity for women to select mates by height. Dunbar added that there is a body of research suggesting that women in many cultures prefer taller mates.

The article notes that, while this is the first study to directly link height to reproductive success, other studies have shown that taller men tend to have higher incomes and social standing.


(1) The lead-in sentence of the article says: "If it seemed as if the tall guys got all the girls in high school, it wasn't your imagination..." But does the study really apply to dating in American high schools? To what populations do you think these findings are reasonably generalizable? Is it important that the data are from military records?

(2) Does it surprise you that the study ruled out 1200 of the original 4400 records as outliers? What effect might this have on the interpretation?

(3) Suppose that the height data followed a normal curve (why might this not be true?), and that outliers were excluded symmetrically from each tail. How many standard deviations would the study's threshold for outliers represent?
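One way to approach question (3) is a back-of-the-envelope calculation under the question's own (hypothetical) assumptions: if the 1200 excluded records were split evenly between the two tails of a normal curve, each tail holds 600/4400, or about 13.6%, of the sample. The sketch below, which is our illustration and not part of the study, converts that tail area to a cutoff in standard deviations:

```python
# Sketch for discussion question (3). Assumptions (ours, not the
# study's): heights are normal and the 1200 exclusions are split
# evenly between the two tails.
from statistics import NormalDist

excluded = 1200
total = 4400
tail = (excluded / total) / 2        # fraction trimmed from each tail
z = NormalDist().inv_cdf(1 - tail)   # cutoff in standard deviations

print(f"each tail: {tail:.3f}; cutoff: about {z:.2f} standard deviations")
```

Under these assumptions the cutoff comes out to roughly 1.1 standard deviations, which suggests the excluded men were not very extreme outliers at all, a point worth raising in question (2) as well.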

New test-taking skill: working the system.
Los Angeles Times, 9 January, 2000, A1
Kenneth R. Weiss

The article states that:
The number of students who get extra time to complete the SAT because of a claimed learning disability has soared by more than 50% in recent years, with the bulk of the growth coming from exclusive private schools and public schools in mostly wealthy white suburbs.
This is based on the Times analysis of data provided them by the College Board.

Only 1.9% of the students who take the SAT get special consideration in taking the test. However, at 20 prominent Northeastern private schools, nearly 10% get special consideration. In contrast, among the 1,439 students from 10 inner-city schools who took the SAT, not a single one was given extra time or other special consideration.

That the extra time is helpful is suggested by the fact that, among students not given extra time, 20 to 25% do not complete all the questions in the two-hour test. The article states that research has shown that the gain from extra time can be as high as 100 points, which can make the difference between acceptance and rejection at selective colleges and universities.

Five years ago the College Board set up a panel of psychologists and other experts to review requests for special consideration for students who have had no history of disabilities or testing accommodation in school. Last year the panel refused 82% of the 670 appeals.

Parents have responded by shopping around to find a psychologist who will support a need for special consideration on tests and then going directly to the high school, sometimes accompanied by a disability advocate. The school is not in a position to challenge the psychologist's recommendation and generally accepts the request. The student then does not have to go before the review board.


Some suggest that the problem is not that the wealthy are faking the need for extra time but that the need for extra time for poorer students is not being recognized. What do you think of this suggestion? How could it be tested?

Jeanne Albert suggested and wrote the next review.

The teacher factor.
New York Times, Jan. 9, 2000, "Q and A" Education Life, 4A, p.16
Marilyn Marks

This article describes the research of Dr. William Sanders, a University of Tennessee statistician, on the degree to which teachers impact student achievement. Based on his results, since 1992 the state of Tennessee has required all elementary schools to perform his analysis each year in every classroom, and has required that each teacher's performance be reported to the state. Currently he is assisting districts across the country in developing similar studies.

Dr. Sanders has analyzed six million student records and evaluated the performance of more than 30,000 elementary school teachers. Each teacher is assigned to one of five effectiveness categories by examining the degree to which students improve or worsen after taking the teacher's class.

Most of the article (the "Q & A") is presented as an interview with Dr. Sanders. His response to "What specifically have you found about the importance of teachers" begins, "Teachers are clearly the most important factor affecting student achievement." He goes on to list some of the other factors examined (e.g. class size, location/size of school, per-pupil expenditures, ethnic make-up of school/classroom), but concludes that "teacher effectiveness is 10 to 20 times as significant as the effects of other things. It surprised me, frankly."

Sanders also compares the effects of several years in a row of more effective versus less effective teaching. According to his research, if two students left second grade at the same achievement level, and one had teachers in "the top 20th percentile" for the next three years, while the other had teachers at the bottom, "on average" their (relative) scores on the fifth-grade math test were enormously different: 96th percentile for the first student, 44th percentile for the second.


(1) What do you think Sanders means when he says that "teacher effectiveness is 10 to 20 times as significant as the effects of other things."?

(2) The phrases "student achievement" and "teacher effectiveness" are never defined, but Marks does imply that teacher effectiveness depends on student achievement, which in turn appears to be determined through standardized testing. Discuss how the phenomenon of "teaching to the test" might be relevant here.

In the last two issues of Chance News we discussed the following item suggested for the Forsooth column:

Studies estimate that one in six adolescent girls has an eating disorder, but that adolescent gymnasts are more than twelve times as likely as non-gymnasts to have serious eating problems.
Paragraph 2, page 160, The Rites of Men:
Manhood, Politics, and the Culture of Sport,
Varda Burstyn, Univ. of Toronto Pr., 1999.

Reader David Budescu did not think there was anything wrong with this. We agreed, and provided an example showing that the percentages mentioned in the quote were possible. But our example had two thirds of the girls being gymnasts, which is not very realistic! Budescu commented that to make our example work we would have to assume a larger proportion of gymnasts than seemed reasonable. Daniel Swearingen and his colleagues Glen Behrendt and Jim Roche gave an argument showing that, to match the given statistics, at least 1/11 of the girls would have to be gymnasts. Here is a slightly simplified version of their argument:

In what follows, all people referred to are assumed to be teenage females, and the word 'disorder' will refer to a certain type of eating disorder. We are given that gymnasts are 12 times as likely as non-gymnasts to have the disorder. We are also told that 1/6 of the population have the disorder. Let p1 and p2 represent the proportions of gymnasts and non-gymnasts who have the disorder. If only a small fraction of the population are gymnasts, then p2 is fairly close to 1/6. But since p1 = 12p2, this would force p1 to be around 2, which is impossible for a proportion. Thus, the gymnasts must form an appreciable fraction of the population.

The larger the gymnast population is, the smaller p1 has to be to satisfy the fact that 1/6 of the population suffers from the disorder.

Let the fraction of the population which are gymnasts be denoted by g. How small can g be? The above argument shows that this fraction can be made smallest by letting p1 = 1. From this it follows that p2 = 1/12. We now have the equation

(g)(1) + (1-g)(1/12) = 1/6,

which is easy to solve for g. We obtain g = 1/11. Here is a table that achieves this minimum value.

                     Disorder   No disorder   Total
      Gymnast            6           0           6
      Not gymnast        5          55          60
      Total             11          55          66
Swearingen remarks:
This still requires that 1/11 of all adolescent females are gymnasts, a proportion which seems rather high. Moreover, this minimal solution requires that all adolescent female gymnasts have eating disorders. The requirements that the proportion of gymnasts not be too large and that the proportion of gymnasts with eating disorders not be too large are at odds!

At this point I think that Mr. Budescu's query remains unanswered, and I'd be interested in seeing Prof. Burstyn's data that led to the original claims.
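The algebra above is easy to check with a short script (our illustration, using exact rational arithmetic so no rounding creeps in): setting p1 to its maximum value of 1 forces p2 = 1/12, and solving g*p1 + (1-g)*p2 = 1/6 for g recovers the minimum of 1/11, which the table's counts satisfy.

```python
# Check the minimal-gymnast-fraction argument with exact fractions.
from fractions import Fraction

# p1 = proportion of gymnasts with the disorder, set to its maximum, 1.
# Since p1 = 12*p2, this gives p2 = 1/12 for non-gymnasts.
p1 = Fraction(1)
p2 = p1 / 12

# Solve g*p1 + (1-g)*p2 = 1/6 for g:
#   g*(p1 - p2) = 1/6 - p2   =>   g = (1/6 - p2) / (p1 - p2)
g = (Fraction(1, 6) - p2) / (p1 - p2)
assert g == Fraction(1, 11)

# Verify the table: 6 of 6 gymnasts and 5 of 60 non-gymnasts
# have the disorder, out of 66 girls in all.
assert Fraction(6, 6) == p1                    # all gymnasts disordered
assert Fraction(5, 60) == p2                   # 1/12 of non-gymnasts
assert Fraction(6 + 5, 66) == Fraction(1, 6)   # 1/6 of the population
assert Fraction(6, 6) == 12 * Fraction(5, 60)  # 12-to-1 ratio of rates

print("minimum gymnast fraction g =", g)       # prints g = 1/11
```

Using Fraction rather than floating point means each constraint is checked exactly, so the 1/11 bound is confirmed rather than merely approximated.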
As a reference for the infamous quotation, Burstyn gives: Trish Wilson, What price glory?, Sojourner, 22(9), April 1997. The only statistical statement Wilson gives is:
The prevalence of eating disorders among female athletes is reported to be somewhere between 15 and 62 percent. Even at the lower estimate this is much higher than the 1 to 3 percent in the general population.

The 15% and 62% figures appear to have come from three very similar studies on pathogenic weight-control behaviors of female athletes that appeared in the journal "Physician and Sportsmedicine": 14(1) Jan. 86, 15(5) May 87, and 16(9) Sept. 88. Each of the three articles asked the participants to fill out the same survey, which included a question asking whether the girls had used any of the following pathogenic methods to lose weight: self-induced vomiting, diet pills, fasting, diuretics, fluid restriction, laxatives.

One study gave this survey to 487 girls at a competitive swimming camp and reported that 15.4% used at least one of these methods. A second study gave the survey to 182 college athletes and reported that 32% used at least one of these methods. The third study gave the survey to 42 gymnasts and reported that 26 (62%) used at least one of these methods.

We could not find where the "one in six adolescent girls has an eating disorder" figure comes from, though it is mentioned occasionally in newspaper articles. One problem is that different researchers use different criteria for eating disorders. Here is a statement of what eating disorders are from "Current concepts: eating disorders", Becker et al., The New England Journal of Medicine, 340(14), April 8, 1999.

Eating disorders affect an estimated 5 million Americans every year. These illnesses -- anorexia nervosa, bulimia nervosa, binge-eating disorder, and their variants -- are characterized by a serious disturbance in eating, such as restriction of intake or bingeing, as well as distress or excessive concern about body shape or body weight.

The article states that an estimated 3 percent of young women have these disorders, and probably twice that number have clinically important variants.

There have been two recent studies of eating disorders among athletes: one surveyed 522 elite female athletes in the Norwegian Confederation of Sports, and the other 1445 varsity student athletes from Division 1 NCAA member institutions. These studies used larger populations and were more careful to specify when an athlete has an eating disorder, as opposed to merely being at risk for such a disorder. For a discussion of these studies see: Athletes and Eating Disorders: The National Collegiate Athletic Association Study, Johnson, Powers and Dick, International Journal of Eating Disorders 26: 179-188, 1999. In a Newsday article about this study (07/28/1997) it is stated that 19.7 percent of the gymnasts warranted clinical attention and 71.7 percent were at risk.


(1) Why do you think there is so much variability in the three surveys of swimmers, athletes, and gymnasts? What problems do you see with such surveys?

(2) Do you think that the famous quote deserves a Forsooth?

In the last chance news (9.01) we discussed an article in the New York Times that referred to a 1984 book, "Normal Accidents: Living With High-Risk Technologies," in which the author, Charles Perrow, argued that disasters such as the near meltdown of the Three Mile Island nuclear reactor and the explosion of the Space Shuttle Challenger should not be thought merely to be the result of "human error". Perrow suggests that the emergence of more and more intricate and interconnected systems causes such accidents, hence they should be thought of as "normal accidents".

A competing view of systems is called the "high-reliability" view. People who subscribe to this view believe that safety can be enhanced by building many backup systems. The normal accident theorists argue that the existence of complicated backup systems can actually increase the likelihood of an accident. We suggested that the experience of airplane accidents might better be explained by the high-reliability view. The article stated:

Mr. Perrow sees the planet on New Year's Day as a giant laboratory for the influential theory he formulated in the early 1980's about the nature of catastrophic accidents. "This is going to be very exciting, to see just how interdependent we are," he said of the date change.

We provided the following two discussion questions:

(1) Do you agree that airplane experience seems to support the "high-reliability" theory? What do you think Charles Perrow would say about this?

(2) What do you think Charles Perrow learned from the Y2K experience?

Well, Charles Perrow sent us his answers to these questions:

Airplane safety:

First, regarding airline safety, my book argues that airlines are very safe because of these characteristics: repeated trials of risky activities (take off and land 2 to 10 times a day, every day); near optimum conditions for investigating accidents (black boxes, weather info, radio contact); an organizationally rich environment (many interested parties checking on each other's attributions of cause of accident, such as ALPA (the pilots' union), controllers and their union, engine manufacturers, airframe manufacturers, FAA, NTSB, widows, etc.); a system used by elites (who don't live next to chemical plants or travel in rusty ships); and decoupling of highways in the sky with a central monitoring station and constant communication (when air traffic control introduced highways and separations, accidents dropped precipitously, because it decoupled what had been a tightly coupled system). The "glass" cockpit innovations have helped, of course, but have been offset by allowing planes to fly faster and higher and in worse weather. I think some of these differ from High Reliability Theory's emphasis upon safety goals and training. Air transport is an "error-avoiding" system, and it would be hard to change that, while marine transport is an "error-inducing" system and will probably remain so, for systemic reasons.

And what about Y2K?

The first puzzle is that there were very few failures; the second is that the failures were discrete ones -- there were no significant interactive failures; the third is that both well remediated and minimally remediated systems had only a few discrete failures. The lesson from the first two puzzles is that our electronic systems are both more robust -- few discrete failures -- and less interdependent -- few interactive failures -- than we thought. Of course, if electric power had failed we would probably have seen a lot of interaction of failures that would have defeated safety systems and challenged our ability to extemporize and innovate. But electric power systems are not heavily digital and were said to be well remediated.

We were alert, of course; most serious accidents occur at night or in shift changes. And experts expect Y2K failures to be stretched out over several months (under conditions of reduced alertness). But that the failures of the first day were so few and discrete was still a surprise. Since the event (rollover of date-dependent chips to a 00 year date) was unprecedented, we had no way to judge how devious dated chips would be in the multitude of systems and usages. The best explanation is probably remediation. A lot of it was done in the electronically rich countries. That must have helped.

While robustness of system may explain the puzzle of few failures, remediation would help explain the second puzzle of few interactive failures: with few discrete failures, there is less chance for interactive failures, no matter how tightly connected the systems are. Regarding the third puzzle, the similar performance of both electronically sparse and electronically rich nations, I suspect it was remediation that made the big difference. At least some was done in the electronically sparse countries; in the last months of the year, the financial firms especially, and some of the key manufacturing firms, insisted upon remediation in those cases where there was interdependency. Though there was less remediation and testing in the sparse countries, there was also less initial interdependency. Discrete failures can be more easily fixed in systems that are not highly interdependent. Interdependency would be so low that discrete failures remained discrete, and of course electronically sparse nations would have fewer systems at risk of even discrete failures. If you do not rely upon spy satellites to warn you of terrorist attacks upon key infrastructures, such as power sources or communication networks, the failure of spy satellites is not a problem for you. In electronically rich nations, however, had there been even a minor terrorist attack (e.g. on the Seattle space needle by a terrorist coming from Canada at the time the satellites were blinded -- as they were for a few hours) the nation could have over-reacted, especially with the unexpected interaction of a hard-liner suddenly taking over Russia. At the height of the cold war a spy plane wandered into Soviet territory just when the British and French started the Suez war, provoking and misleading the Soviets.

There is plenty of evidence of discrete failures in the electronically rich nations, indicating that there obviously was a Y2K problem. But the evidence also shows that the failures were both very few in number, and also discrete and fixable. This suggests one large error people like me made: there were not enough discrete errors to provoke interactive complexity. Perhaps remediation worked in the electronically rich nations and interdependency was too low for errors to have much effect in the sparse nations. But it is also possible that I and others grossly overestimated the degree of interactive complexity itself. Normal Accident Theory assumes that normal accidents will be inevitable, given enough running time for systems, but quite rare. Even more rare is the possibility of serious damage; catastrophes require a variety of conditions to come together in just the right way to kill hundreds with one blow. It is the rare accident where we can not say "we were lucky it wasn't worse." Bhopal, for example, had maximum killing power; Chernobyl minimal, Three Mile Island none. Normal Accident Theory suggested to me that we would need quite a few Y2K-like events to produce major accidents on day one, so I thought we were unlikely to have a major catastrophe, but I still predicted more disturbances than we had. My son e-mailed me that he was looking for a good recipe for crow to send me, perhaps shake an'bake. - Chick Perrow 1/10/00


Chance News
Copyright © 2000 Laurie Snell

This work is freely redistributable under the terms of the GNU General Public License as published by the Free Software Foundation. This work comes with ABSOLUTELY NO WARRANTY.

