!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

CHANCE News 6.08

(8 June 1997 to 8 July 1997)


!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Prepared by J. Laurie Snell and Bill Peterson, with help from Fuxing Hou and Joan Snell, as part of the Chance Course Project supported by the National Science Foundation.

Please send comments and suggestions for articles to jlsnell@dartmouth.edu.

Back issues of Chance News and other materials for teaching a Chance course are available from the Chance web site:

http://www.geom.umn.edu/locate/chance

============================================================

Judge a Statistics book by its exercises, and you cannot go far wrong.
George Cobb
=============================================================





>>>>>==============>
Since the issue of sampling in Census 2000 is constantly in the news, Milt Eisner suggested that we include some references giving the technical details of the sampling methods that are going to be used in the next census. We have done so, but our write-up is rather long, so we have put it toward the end of this newsletter under "Preparing for the 2000 Census."
<<<========<<




>>>>>==============>
In the last Chance News (6.07) we commented on an Associated Press article in the Boston Globe describing a study that identified behaviors associated with suicide attempts by adolescents. One of the behaviors mentioned in the article was:

Substance abuse before last sexual activity.

John Finn wrote that this is a typical example of the misuse of language in statistical reporting in the news. He remarked:

I am willing to bet, the questionnaire given to the high school students did not ask them whether they had "abused" alcohol, marijuana, or other drugs before sexual activity, but only whether they had "used" any of them. For the people making the report to then characterize this as "abuse" seems hardly scientific to me, being instead a moral judgment.

We were not able to find the original questionnaire. However, the article appeared in the journal "Pediatrics" and on their web site we found the author's abstract. Here the behavior in question appears as:

Substance use before last sexual activity

rather than

Substance abuse before last sexual activity

so John would surely have won his bet. In his letter, he gives a number of other interesting examples of such "reporting abuses." You will see this letter at the end of this newsletter.
<<<========<<




>>>>>==============>
Maya Bar-Hillel wrote that the June 22, 1996 issue of the "Economist" had an article on gender differences in science. This article mentioned a classic study by Paludi and Bauer related to the report of the Swedish study in the last Chance News. In this study, 180 men and 180 women reviewers were asked to review comparable papers and rate them on a scale of 1 to 5, with 1 being the best rating. One-third of the papers were purported to be written by a John T. McKay, one-third by a Joan T. McKay, and one-third by a J. T. McKay.

Papers believed to be written by a woman, Joan T. McKay, were given an average rating of 3.0 by both men and women reviewers. Papers believed to be written by a man, John T. McKay, received considerably better ratings: a 1.9 average rating from the male reviewers and a 2.3 rating from the women reviewers. This study and the current state of women in academia are discussed in an article by Lynne Billard in the Dec. 1995 issue of the Australian Mathematical Society Gazette.

Maya asked for the reference for the Paludi and Bauer study. It is:

Paludi, M.A. and Bauer, W.D. (1983) Goldberg Revisited: What's in an Author's Name? Sex Roles, v.9 pp.387-390.
<<<========<<




>>>>>==============>
Lucio Bragagnolo wrote:

I found your address while asking a question of www.findout.com. My question was the following:

What's the mathematical formula with which I can find out how many random-chosen people must I gather to be reasonably sure to have at least one person born in every day of the year, from January 1 to December 31? The year of birth doesn't count.

And the answer was the following:

As a consumer-information service, we at FINDOUT are not qualified to definitively answer a question such as yours, though we can point you toward sources much more able to do so. In particular, we recommend that you take the time to visit the Chance Database maintained by J. Laurie Snell at Dartmouth University. Dedicated to the subject of probability and the like, the database strives to make available educational and informative materials relating to questions similar your own: questions of "chance." Accordingly, we recommend simply e-mailing Snell your question directly -- the problem could be one she's dealt with before!

Of course we had to maintain the honor of chance!

This is a well-known problem usually referred to as the "coupon problem." In our day we asked how many boxes of Wheaties we had to buy to get a complete set of baseball pictures. (We naively assumed the pictures were randomly distributed in the boxes.) The solution to this problem can be found in Feller's "An Introduction to Probability Theory and Its Applications." In the third edition it is problem 7 of Chapter IV, section 6. Using the birthday interpretation, this solution is as follows:

Assume that you ask r people and that A(i) is the event that none of the r people has the ith birthday. Then the probability that you fail to get all possible birthdays is the probability of the event A(1) or A(2) or ... or A(365). The familiar inclusion-exclusion method gives:

   P(you fail to get all 365 birthdays) =

   choose(365,1)(1-1/365)^r - choose(365,2)(1-2/365)^r +
   choose(365,3)(1-3/365)^r - ... - choose(365,364)(1-364/365)^r

where choose(n,k) is the number of ways to choose k objects out of n.
Taking 1 minus this probability gives the probability of getting all 365 birthdays when we ask r people. From this we find that, for a favorable bet that we get all 365 birthdays, we need to ask 2,287 people. For a 99% chance of success we need to ask 3,828 people. Since there are about 4000 Dartmouth students, we can sing Happy Birthday at Dartmouth every day of the year and be reasonably confident that we are singing it for at least one student.
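
Readers who want to check these two thresholds can do so numerically. What follows is only a minimal Python sketch, not the calculation we used. Rather than evaluating the alternating inclusion-exclusion sum directly (which can lose precision in floating point), it swaps in a different technique: it tracks, one person at a time, the probability distribution of the number of distinct birthdays seen so far. The function name people_needed is our own invention, and we assume 365 equally likely birthdays with no leap day.

```python
def people_needed(target, days=365):
    """Smallest r such that P(all `days` birthdays appear among r people) >= target.

    Assumes birthdays are independent and uniform over `days` days.
    """
    # dist[k] = probability that exactly k distinct birthdays have been seen so far
    dist = [0.0] * (days + 1)
    dist[0] = 1.0
    r = 0
    while dist[days] < target:
        new = [0.0] * (days + 1)
        for k, p in enumerate(dist):
            if p == 0.0:
                continue
            # The next person repeats one of the k birthdays already seen ...
            new[k] += p * k / days
            # ... or brings a new one (impossible once every day is covered).
            if k < days:
                new[k + 1] += p * (days - k) / days
        dist = new
        r += 1
    return r


if __name__ == "__main__":
    print(people_needed(0.5))   # favorable (better than even) bet
    print(people_needed(0.99))  # 99% chance of success
```

Under these assumptions the two printed values should agree with the 2,287 and 3,828 quoted above.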

The expected time to get all the birthdays is easier to compute. When you have i different birthdays, the number of people you have to ask to get a new birthday has a geometric distribution with parameter (365-i)/365, and hence mean 365/(365-i). Adding these means for i = 0, 1, ..., 364, we see that the expected time to get them all is:

        365*(1 + 1/2 + 1/3 + ... + 1/365), which is about 2365.

The current issue of Teaching Statistics (Summer 1997, Vol. 19, No. 2) has an article "Modelling the Coupon Collector's Problem" by Wendy Maull and John Berry. They describe an activity centered around students rolling a die until the first time they get all six numbers.
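
Both the expected-time formula and the die-rolling activity are easy to try on a computer. Here is a minimal Python sketch under the assumption that all outcomes are equally likely. The function names expected_draws and simulate_draws, the seed, and the number of trials are our own choices for illustration; they do not come from the Teaching Statistics article.

```python
import random


def expected_draws(n):
    """Expected draws to see all n equally likely outcomes: n*(1 + 1/2 + ... + 1/n)."""
    return n * sum(1.0 / k for k in range(1, n + 1))


def simulate_draws(n, rng):
    """Draw uniformly from n outcomes until every outcome has appeared; return the count."""
    seen = set()
    draws = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        draws += 1
    return draws


if __name__ == "__main__":
    rng = random.Random(1997)  # arbitrary seed, only for reproducibility
    trials = 20000
    average = sum(simulate_draws(6, rng) for _ in range(trials)) / trials
    print("die, formula:       ", expected_draws(6))
    print("die, simulated:     ", average)
    print("birthdays, formula: ", expected_draws(365))
```

For a die the formula gives 6*(1 + 1/2 + ... + 1/6) = 14.7 rolls, and the simulated average should land close to that.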

DISCUSSION QUESTIONS:

(1) Snell bet Williams that every birthday would be represented in the current Dartmouth student body. He got ten to one odds, but Williams insisted that leap year be included. Was this a favorable bet?

(2) What is the expected time required to get about half the 365 possible birthdays? Having these, how much longer will it take you to get the second half?
<<<========<<




>>>>>==============>
Paul Bugl mentioned that the Illinois lotto game on June 18, 1983 resulted in 78 people splitting the jackpot. The winning combination was 7-13-14-21-28-35. Paul wondered if there had been a lottery jackpot shared by more than 78 winners. The best we could find was the 133 jackpot winners who shared a 16,293,830 pound jackpot in the UK National Lottery on January 14, 1995. The winning numbers were 7, 17, 23, 32, 38, 42. These two events again show the public's love of the number 7 and its multiples.

DISCUSSION QUESTIONS:

(1) In the UK lottery, 69.8 million tickets were sold. If buyers chose their numbers at random, what would have been the probability of getting 133 or more winners? (For this lottery you choose 6 numbers between 1 and 49 and, to win the jackpot, you must get all six correct not counting order.)

(2) In the Illinois lotto game in 1983 you had to get six numbers correct, chosen from 1 to 40, not counting order. While we do not know the number of tickets sold on June 18, 1983, the total winnings for lotto in 1983 were $26,877,512 and it was played 46 times during this year. Which of the two events described above do you think would be less likely to occur if buyers chose their numbers at random?

(3) In the Illinois Lotto game, the jackpot was advertised to be $1.75 million. In fact, the 78 winners were paid a total of only $744,000, with each winner getting a cash payment of about $9,500. Why did they not get the entire $1.75 million as advertised? Two of the winners sued. Do you think they won?
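
For readers who want to set up question (1) numerically, here is one possible sketch in Python. It assumes that every ticket is an independent, uniformly random choice of 6 numbers from 49, and it replaces the exact binomial count of winners with a Poisson approximation. The function name poisson_tail and the number of terms summed are our own choices.

```python
import math


def poisson_tail(mean, k_min, terms=500):
    """Approximate P(X >= k_min) for X ~ Poisson(mean) by summing `terms` terms."""
    total = 0.0
    for k in range(k_min, k_min + terms):
        # log of exp(-mean) * mean^k / k!, computed in logs to avoid overflow in k!
        log_p = -mean + k * math.log(mean) - math.lgamma(k + 1)
        total += math.exp(log_p)
    return total


if __name__ == "__main__":
    combinations = math.comb(49, 6)        # number of possible UK tickets
    tickets = 69.8e6                       # tickets sold, as given in question (1)
    mean_winners = tickets / combinations  # expected winners if choices are random
    print(combinations, mean_winners)
    print(poisson_tail(mean_winners, 133))
```

The same function can be reused for question (2) once you have settled on an estimate of the number of Illinois tickets sold.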
<<<========<<




>>>>>==============>
Statistician builds what may be a better data mousetrap.
The New York Times, 8 July 1997, C8
Karen Freeman

This article discusses a new statistical method developed by Gary King of the government department at Harvard. The new technique is described as a way to estimate the behavior of individual members of groups when only aggregate data are available.

Students have long been warned that ecological correlations based on rates or averages tend to overstate the strength of an association. The article suggests that this warning was also responsible for researchers not developing, until now, methods that do work to make inferences about relations between individual members, given only aggregate data.
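
The warning about ecological correlations can be illustrated with a small simulation. The Python sketch below is not King's method and uses no real data: the number of groups, the group sizes, the slope 0.5 and the noise levels are all hypothetical values we picked so that individuals vary a lot within groups while the groups differ only modestly from one another.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate(groups=50, size=200, seed=0):
    """Return (individual-level correlation, correlation of the group averages)."""
    rng = random.Random(seed)
    all_x, all_y, mean_x, mean_y = [], [], [], []
    for _ in range(groups):
        center = rng.gauss(0, 1)  # groups differ in their typical x (hypothetical)
        xs = [center + rng.gauss(0, 3) for _ in range(size)]
        ys = [0.5 * x + rng.gauss(0, 3) for x in xs]  # noisy individual relation
        all_x += xs
        all_y += ys
        mean_x.append(sum(xs) / size)
        mean_y.append(sum(ys) / size)
    return pearson(all_x, all_y), pearson(mean_x, mean_y)


if __name__ == "__main__":
    individual, ecological = simulate()
    print("correlation among individuals:   ", individual)
    print("correlation among group averages:", ecological)
```

Averaging within groups removes most of the individual noise, so the correlation between group averages comes out much stronger than the correlation among the individuals themselves.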

We are not told what King's method is. However, we are told that it is being widely accepted by political scientists and is beginning to be used in Federal voting rights cases. Here is a situation where it could be used. The census estimates the population of a state and this in turn gives the state the right to a certain number of districts with roughly equal populations. The state can determine the boundaries of the districts. The issue might be: has the state done this to deprive a particular group of their right to representation in Congress?

For example, if 30% of the population is black and blacks are known to favor Democratic candidates, it might be claimed that the district lines were drawn so that the blacks could not determine the vote in any district. To show this, you would have to know something about the voting behavior of the blacks in each district. From previous elections, you know how the Republican and Democratic vote split by district and, from the census, you know the total black and white population by district. From this, King's method would allow you to estimate the black vote within districts.

King recently testified as an expert witness in a voting rights case in a Federal District Court and was able to show, by his method, that the vote split by racial division varied significantly over a series of elections.

You can learn about details of King's method from reprints available at his web site.
<<<========<<




>>>>>==============>
Sometimes mother nature knows best.
The New York Times, 20 March, 1997, A25
Susan Love

Letter to the Editor.
The New York Times, 27 March, 1997, A28
Meir Stampfer

The estrogen question: how wrong is Dr. Susan Love?
The New Yorker, 9 June 1997
Malcolm Gladwell

The Mail.
The New Yorker, 14 July 1997
Letter from Susan Love with reply by Malcolm Gladwell

Susan Love is a well-known breast surgeon and associate professor of clinical surgery at U.C.L.A. She is a leading figure in the battle against breast cancer and has written two enormously successful books: Dr. Susan Love's Breast Book, published in 1990, and this year's Dr. Susan Love's Hormone Book.

In her March 20 article in the New York Times, Love attacks the pharmaceutical companies for trying to make menopause into a disease and to encourage women to take replacement hormones for life. She feels that short-term use of these hormones may be justified for women who have had hysterectomies or who have symptoms such as hot flashes and insomnia as they approach menopause. However, she applies her "nature knows best" argument to argue against healthy women using replacement hormones to prevent diseases.

She asserts that the studies showing that replacement hormones prevent osteoporosis are confusing, and she finds the evidence for protection against heart disease less convincing than the evidence that prolonged use of estrogen causes breast cancer.

Love writes:

Pharmaceutical companies defend their products by pointing out that one in three women dies of heart disease, while one in eight women gets breast cancer. Although this is true, it is important to note that in women younger than age 75 there are actually three times as many deaths from breast cancer as there are from heart disease.

In his New Yorker article Gladwell says that this statistic is central to Love's argument but claims that she has her numbers backward. He writes:

In women younger than seventy-five, there are actually more than three times as many deaths from heart disease as from breast cancer. Even the general idea behind this argument -- that heart disease is more of a problem for older women and breast cancer is more of a problem for younger women -- is wrong. In every menopausal and postmenopausal age category, more women die of heart attacks than die of breast cancer.

Gladwell then goes on to give specific statistics to back up these statements. Love's apparent mistake was pointed out in a letter to the Times by Harvard epidemiologist Meir Stampfer, who thought she had just mixed up the two categories in reading the government's mortality rates. However, Gladwell claims that, in a meeting with Love after this letter appeared, she defended her figures, and he quotes her as saying:

Most women at fifty know someone who has died of breast cancer. Most women at fifty don't know someone who has had heart disease. That's because under seventy-five there are three times as many deaths from breast cancer as from heart disease.

It does seem to us that women under 50 tend to know other women who died of breast cancer but only exceptionally know women who died of heart disease. Therefore we decided to check whether Gladwell was really correct with his statistics. The required data can be found at the National Center for Health Statistics, which can be accessed from the CDC web page (www.cdc.gov). Official mortality rates are given in deaths per 100,000 for very specific diseases. The decision of which diseases to consider as heart disease is, perhaps, rather subjective, but we followed the choice of the NCHS in a related study. (Strokes are not considered heart disease.)

 
Mortality rates (deaths per 100,000) by age group:

    age           breast cancer          heart disease

   25-34              2.7                     6
   35-44             15.2                    17.1
   45-54             41.6                    57.2
   55-64             69.8                   195.7
   65-74            105.6                   566.2
   75-84            145.9                  1741.3
   85-100           195.5                  6252.6

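
One way to read the table is to look at the ratio of the two rates in each age group. The short Python sketch below does just that, using the numbers copied from the table; the name heart_to_breast_ratios is our own, and we read the first and fourth age groups as 25-34 and 55-64. Note that the rates alone do not give the total number of deaths under age 75; for that one would also need the number of women in each age group, which is not shown here.

```python
# (age group, breast cancer rate, heart disease rate), in deaths per 100,000,
# copied from the table above. Age labels for rows 1 and 4 are our reading.
RATES = [
    ("25-34", 2.7, 6.0),
    ("35-44", 15.2, 17.1),
    ("45-54", 41.6, 57.2),
    ("55-64", 69.8, 195.7),
    ("65-74", 105.6, 566.2),
    ("75-84", 145.9, 1741.3),
    ("85-100", 195.5, 6252.6),
]


def heart_to_breast_ratios(rates):
    """Return {age group: heart disease rate / breast cancer rate}."""
    return {age: heart / breast for age, breast, heart in rates}


if __name__ == "__main__":
    for age, ratio in heart_to_breast_ratios(RATES).items():
        print(f"{age:>7}: heart disease rate is {ratio:.1f} times the breast cancer rate")
```

In every age group listed, the heart disease rate is the larger of the two, and the gap widens with age.
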
The results agree with those given by Gladwell. However, the story does not end here. Love replied in a letter to the New Yorker (14 July) that she got her numbers from a Nurses' Health Study for which Stampfer (who wrote the letter to the editors saying she was wrong) was one of the investigators. She writes:

According to the data published by Stampfer and his colleagues, (in the Nurses' Study article) there are more deaths from breast cancer than from heart disease over the eighteen years of follow-up. The exact figures vary depending on whether you are looking at smokers (most premature heart disease is in smokers) or non-smokers. In the healthy-fifty-year-old nonsmoker category, there are three times as many deaths from breast cancer as from heart disease. I propose that this is the population closest to the perimenopausal woman contemplating taking estrogen.

Malcolm Gladwell notes that women in the Nurses' Health Study have a far lower incidence of heart disease than American women as a whole and remarks:

What Love is doing is a bit like arguing that we can find out whether American students know enough calculus by testing the freshman class at M.I.T. and then touring the country pretending that your results---that no one needs to take more math---speak for all students.

Love uses the difference between the nurses and the general population to support her argument that women are better off changing their life style than using estrogen to protect against heart disease. She writes:

The fact that breast-deaths are about the same but the heart-disease deaths are lower than those of the general population demonstrates that heart disease can be prevented by a life-style approach while breast cancer cannot.

DISCUSSION QUESTIONS:

(1) Whom do you find more convincing, Love or Gladwell? Why?

(2) Why do you think Gladwell thinks that no one needs to take more math than calculus?

(3) We purchased Dr. Susan Love's Hormone Book and found the following statistics: On page xvi, we are told that only 30% of women who are given prescriptions for hormone replacement therapy ever fill these, and for those who do, more than half stop taking the pills within a year. Later, on page 36, it is reported that between 1/6 and 1/4 of all post-menopausal women take Premarin, the particular hormone replacement drug so delicately described in Love's article as horse urine extract. What do you make of this?
<<<========<<




>>>>>==============>
Hormone use helps women, a study finds.
The New York Times, 19 June, 1997
Jane E. Brody

This article discusses a study reported in the 19 June 1997 issue of the New England Journal of Medicine aimed at determining the benefits and risks of hormone replacement therapy. The study was based on the well-known Nurses' Health Study. This study was started with 121,700 women participating in 1976 and continued until 1992. These women were questioned and examined every two years.

The study was a "case control" study. It included those women in the Nurses' Health Study who were past menopause with no history of cardiovascular disease or cancer. The 3637 who died during 1976-94 were the "cases". For each case, the researchers selected 10 "controls" from women who met the above criteria and matched the case in age, age at menopause, and type of menopause.

After adjustment for confounding factors, the study found that, in the first decade of hormone use, the chance of dying was 37% lower for those on hormone therapy, primarily because of fewer deaths from heart disease among this group. This reduction dropped to 20% after 10 or more years of use because of the increased risk of breast cancer. The risk of breast cancer mortality for those on estrogen therapy for more than 10 years increased by 43 percent.

DISCUSSION QUESTION:

Susan Love argues that nurses who take hormone replacement are more health conscious generally and thus also do more to prevent other diseases such as heart disease. Do you think this is a valid argument?
<<<========<<




>>>>>==============>
An Electronic Companion to Statistics.
by George Cobb, Jeffrey A. Witmer, and Jonathan D. Cryer
with the assistance of Peter L. Renz and Kristopher Jennings
Cogito Learning Media, New York, 1997
1-800-WE-THINK
think@cogitomedia.com
Suggested retail price: $29.95

The last two years have seen the development of some wonderful resources to supplement standard texts for teaching a basic statistics course, including: activities from Richard Scheaffer and Allan Rossman, case studies from Samprit Chatterjee, a multi-media CD-ROM from Paul Velleman and now an electronic companion from George Cobb et al. These are all unique contributions---very different from each other---and produced by statisticians who have thought long and hard about what statistics is all about and how it should be taught.

Cobb's book and CD-ROM are meant to accompany any of the standard statistics texts, giving students a chance to check how they are doing and whether they really understand what they have learned. In order of topics, they follow books such as those of David Moore and others who "follow the modern distinction between exploration-and-description and inference". The 13 units, reviewing standard topics, can be easily changed to fit other models.

On the CD-ROM each unit is introduced with a short video from "Against All Odds" providing a real world example of the topic. Then there is a brief discussion of each topic and an opportunity for students to self-test their knowledge of this topic. The self-testing is livened up by "drag and drop" answers to fill-in-the-blanks and true-false questions. Students who are stuck can click on an important term to be reminded what it means or click on a "hint" button to get a suggestion on how to start thinking about the question. The exercises ask questions about actual studies and reports of studies in the media. They are beautifully designed to be sure the student really understands the topic. George and his colleagues have made sure that this project passes the "Cobb test" (see our quote for this issue).

One might think that the accompanying workbook, which treats the same topics using only the written word, would pale by comparison with the CD-ROM. But George is not called the "intellectual" of the statistics reform group for nothing. By the use of words alone George reminds the students what the topics mean and how they relate, and gives students new ways to think about difficult topics. For example, when reviewing the concept of probability distribution, he tells the student: It helps to try to become comfortable with four different variations on the one basic idea:

Here is another succinct comment that will be more than a review to most students:

Correlations summarize balloons. If your plot isn't balloon shaped, don't use a correlation.

The relations between the different statistical concepts are illustrated both in the workbook and on the CD-ROM by diagrams called concept maps. Thinking in terms of concept maps led George to depart slightly from tradition and put the topic of time series between describing distributions and describing relations. This is natural since a time series is just a relation between two variables, one of which is time.

Students who work their way through this review will have learned that, by putting it all together, they have mastered a powerful tool for understanding important real-life problems.

Well, once again we have to make a disclaimer. Like Paul Velleman, George Cobb was a Dartmouth undergraduate and a student in our probability course some years ago. We give him an A+ on his latest project.

DISCUSSION QUESTIONS:

(1) In testing the topic "Producing Data" on the CD-ROM we were given a random selection of 15 questions from a set of n. We repeated the test and got nine of the same questions. How should we estimate n from this?

(2) Why do you think George says that we should not use correlation on data that is not balloon shaped?
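
Question (1) is a capture-recapture problem in disguise, and one possible starting point is sketched below in Python. It assumes that each test is a uniform random sample of 15 distinct questions from the pool of n and that the two tests are drawn independently; we do not know how the CD-ROM actually selects questions. The function names lincoln_petersen and likelihood are our own.

```python
from fractions import Fraction
from math import comb


def lincoln_petersen(first, second, overlap):
    """Capture-recapture estimate of the pool size: first * second / overlap."""
    return Fraction(first * second, overlap)


def likelihood(n, first=15, second=15, overlap=9):
    """P(exactly `overlap` repeated questions) if the pool holds n questions."""
    if n < first + second - overlap:
        return Fraction(0)  # pool too small to produce this many distinct questions
    ways = comb(first, overlap) * comb(n - first, second - overlap)
    return Fraction(ways, comb(n, second))


if __name__ == "__main__":
    print("simple estimate of n:", lincoln_petersen(15, 15, 9))
    for n in range(21, 31):
        print(n, float(likelihood(n)))
```

Printing the likelihood for a range of pool sizes shows which values of n make nine repeats most plausible, and how flat the likelihood is near its maximum.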
<<<========<<




>>>>>==============>
Parental origin of chromosome may determine social graces.
The New York Times, 12 June 1997, A18
Natalie Angier

A study published in the journal "Nature" reports preliminary evidence that genetic factors found on the X chromosome may help explain the generally greater social skills of girls as compared to boys. The study focused on children with Turner's syndrome, a rare hereditary condition in which a girl inherits only one X chromosome. (Girls typically get one X chromosome from each parent; Turner's occurs in about 1 girl in 3000 born each year.) Girls who inherited their single X chromosome from their fathers were found to exhibit greater social skills than those who inherited the chromosome from their mothers. For example, they had an easier time making friends, had more awareness of others' feelings, and had better relationships with teachers and families.

There were 78 girls in the study, 55 of whom carried a maternal X chromosome and 23 a paternal X. Parents were asked such questions as whether the child could be characterized as "very demanding of people's time", "unaware of acceptable social behavior" or "difficult to reason with when upset." The researchers also administered cognitive and behavioral tests, comparing the results with normal males and females.

Attempting to explain the differences found, the researchers speculate that there is a gene or gene cluster on the X chromosome which is chemically "imprinted" to behave differently depending on whether it comes from the father or mother. They conjecture that this gene is associated with the development of the portion of the brain responsible for social intelligence, and its function is somehow turned off when the gene is inherited from the mother. Dr. David Skuse, a member of the research team, argues that these findings can be generalized beyond Turner's patients to the population at large. Because boys always inherit a maternal X, this could explain why they suffer behavioral disorders at a higher rate than girls.

Other scientists quoted were more cautious in interpreting the results. Dr. David Page of the Whitehead Institute for Biomedical Research found the behavioral differences between the two groups of Turner's girls "impressive" but called it "quite an intellectual leap to talk about differences between all females and all males."

DISCUSSION QUESTIONS:

(1) The article reports that, among the school-aged girls in the study, 40% of those with maternal X chromosome had received statements from teachers indicating trouble at school, compared with 16% for those with a paternal X chromosome. Dr. Skuse is quoted as saying: "We've gotten to the point where we can make a pretty good guess of where a girl's chromosome came from just by knowing about her social behavior." Do you agree? Try to estimate the chance that a Turner girl who gets such a report carries a maternal X. What else do you need to know?

(2) The article states that girls with Turner's syndrome "are usually of normal intelligence but suffer from a variety of problems including infertility and a need to take hormones at adolescence. They are often short and sometimes have an extra thickness of skin around the neck, lending them a football player's aspect." What effect on social development would you expect such physical manifestations to have? How does this affect the researchers' conclusions?
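
For the estimate requested in question (1), Bayes' rule is the natural tool. The Python sketch below makes two assumptions that the article does not justify: that the study's 55-to-23 split reflects the proportions of maternal and paternal X among Turner girls generally, and that the 40% and 16% figures (reported for the school-aged girls) apply to every girl in each group. The function name posterior_maternal is our own.

```python
def posterior_maternal(n_maternal=55, n_paternal=23,
                       p_report_maternal=0.40, p_report_paternal=0.16):
    """Return (prior, posterior) probability of a maternal X given a trouble report."""
    prior = n_maternal / (n_maternal + n_paternal)
    # Bayes' rule: P(maternal | report) = P(report | maternal) P(maternal) / P(report)
    numerator = prior * p_report_maternal
    denominator = numerator + (1 - prior) * p_report_paternal
    return prior, numerator / denominator


if __name__ == "__main__":
    prior, posterior = posterior_maternal()
    print("chance of maternal X before seeing any report:", round(prior, 3))
    print("chance of maternal X given a trouble report:  ", round(posterior, 3))
```

Comparing the two printed numbers shows how much of the "pretty good guess" comes from the report and how much comes simply from the fact that most girls in the study carried a maternal X.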
<<<========<<




>>>>>==============>
DNA in fingerprints used as identifier.
Boston Globe, 19 June 1997, pA5.
Richard Saltus

Previous issues of CHANCE News have discussed the forensic use of DNA "fingerprinting," but the term was always a figurative reference to traditional fingerprinting. The DNA identified at a crime scene was usually recovered from blood, hair or semen. The present article notes that scientists have now successfully analyzed tiny amounts of DNA found in a human fingerprint. The advance was made possible by the discovery of smaller distinctive units of DNA called tandem repeats, consisting of only a few letters of the nucleotide sequence. The polymerase chain reaction can be used to make millions of copies of a small amount of DNA in a sample.

Henry Lee of the Connecticut State Police Crime Lab reports that he has used the method to identify a suspect from material left on a toothpick at a crime scene. But there is a potential problem: in fingerprint tests researchers have found that the strongest DNA profile does not always belong to the last person to handle the object from which the prints are taken.
<<<========<<




>>>>>==============>
Dealt a bad hand, black men can beat the odds.
Boston Globe, 16 June 1997, pC1.
Judy Foreman

Black men in America are five times as likely as white men to die as a result of high blood pressure. The effects of high blood pressure show up as more heart failure, stroke and kidney disease. Dr. Clarence Grim of the Medical College of Wisconsin theorizes that this may be the result of an "unnatural selection" process. On slave ships, the ancestors of today's black Americans were deprived of salt and water. The survivors may have been those genetically most able to conserve salt in their bodies. But this trait is a distinct disadvantage in modern America, where the standard diet is overloaded with fat and salt, both contributors to high blood pressure.

The article notes that there are steps black men can take to beat the odds, including better diet and exercise, but points out that the fitness movement is a middle class phenomenon. Developing a relationship with a regular health care provider is also cited as beneficial. A recent survey by "Men's Health" magazine found that about 1/3 of all men, black and white, had not seen a doctor in the past year. However, "oversampling" black men turned up important differences in attitude: blacks were much more likely than whites to avoid doctors due to cost and lack of trust.

DISCUSSION QUESTIONS:

(1) "Oversampling" sounds like a recipe for biasing the survey. What do you think this means? How can it improve the results?

(2) The article notes that violence and HIV are key factors depriving blacks of a chance for a long life. Homicide is the leading cause of death for black men 15-24 years old, and the fourth leading cause for black men of all ages, whereas it is not even in the top ten causes of death for white men. HIV infection is the third leading killer of black men; for white men, it's seventh. "But perhaps more telling is this: A male black born in 1995 had a life expectancy of 65.2 years. A white male born in that same year could expect to live 73.4 years." Which description do you find more telling? Explain.

(3) Can you tell whether HIV or homicide is the more important contributor to the decreased life expectancy just cited? What would you need to know to express each of these factors in terms of years lost?
<<<========<<




>>>>>==============>
Good cancer news, on voltage and kids.
Boston Globe, 3 July 1997, pA3
Larry Tye

In 1979, two Denver researchers suggested that there is a link between childhood leukemia and exposure to high-voltage power lines. Numerous studies have been done since, with some finding evidence for a weak link. But a new study by the National Cancer Institute, which appears in the "New England Journal of Medicine," reports that no such link exists.

About 1600 cases of childhood leukemia are diagnosed each year. The present study examined 1258 children, half of whom had leukemia. No increased risk was found, even in children living near the highest voltage lines. An accompanying editorial in NEJM labels the notion of the alleged link with power lines as being "fanciful," and urges that researchers now direct resources to finding the true causes of childhood leukemia.

DISCUSSION QUESTION:

The article notes the present study is considered highly authoritative. Can you tell anything about the design that would make it so? What do you think was going on in the earlier studies?
<<<========<<




>>>>>==============>
Preparing for the 2000 Census.
Interim Report II
National Academy Press
Andrew A. White and Keith P. Rust, Editors

Available free of charge while they last. Contact Agnes Gaskin (agaskin@nas.edu) by email to request a copy. For an official summary of what is in the report see

NAP and Preparing for the 2000 Census: Interim Report II

The Census has been in the news recently because of attempts by certain members of Congress to block the use of sampling. It has also been in the news because of the issue of whether there should be a multi-racial category. The census provides an example of difficult statistical problems tied up with political and legal ramifications. Since it is the basis for federal funding and political representation, this is bound to be the case. This report provides a nice review of the present plans of the Census Bureau to carry out the year 2000 census.

Readers of "Chance Magazine" will already know a lot about how the census is carried out and the resulting political and legal issues from the series of 10 articles that Steven E. Fienberg wrote for this magazine. References to 9 of these articles can be found in the bibliography of his last article, written with Margo Anderson: "An adjusted Census in 1990: The Supreme Court Decides", Chance Vol. 9, No. 3. (The 10th article is in Vol. 4, No. 4)

This report will give you a real sense of the enormous preparation and research that is necessary to carry out a successful census.

To even start the process, the Census Bureau needs to establish maps that include all housing units in the country as well as lists of these housing units linked to the maps. The Bureau is developing methods to continuously update these maps and lists. They have experimented with using a variety of sources of information, such as social security records, income tax returns, and postal service records. Each of these sources has its own notation and ways of storing the information. Privacy issues also arise when information is shared between agencies. After looking at the problems involved in using several different sources, the Census Bureau has decided at this time to use primarily the post office information to update their basic databases.

The first step in the census process is to mail out the short and long forms. About 1 in 6 get a long form. Obviously, a high return rate is essential. The Bureau has found that the number of questions, how they are stated, and how the form itself looks play an important role in the return rate. The Bureau carried out small-scale tests and focus groups in 1992, and new procedures were implemented in a significant 1995 census test. These improvements are expected to increase the response rate.

In the 1990 census, the only way to respond was by mail. For the 2000 census, the Bureau is planning new ways to respond. Those who do not respond in two weeks will be provided a replacement form. Special "Be Counted" forms will be available in public locations for people who did not receive them. Finally, it will be possible to reply by telephone using a toll free number and possibly even by the internet.

In stage I, field workers will follow up non-respondents until they obtain at least a 90% response in each census tract. A tract has an average of 4,000 people and 1,500 housing units. In stage II, a 1 percent random sample will be taken of those who did not respond. Unlike stage I, field workers will follow up those sampled in stage II until everyone in the sample is accounted for. The characteristics and counts from the 1 percent sample will be used to estimate the characteristics and counts in the stage I non-response group. Variations on this design are also being considered.
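
As a rough illustration of how a small sample can stand in for the whole non-response group, the Python sketch below simulates such a group and expands the sample mean to all of its housing units. Everything in it is hypothetical: the household sizes, the size of the group, and the function name `estimate_from_sample` are ours. It shows only simple expansion (sample mean times group size), not the Bureau's actual estimator.

```python
import random


def estimate_from_sample(household_sizes, fraction, seed=0):
    """Estimate total people in a non-response group from a simple random sample.

    household_sizes : number of people in each non-responding housing unit
                      (unknown in practice; known here only because we simulate)
    fraction        : sampling fraction, e.g. 0.01 for a 1 percent sample
    """
    rng = random.Random(seed)
    n_units = len(household_sizes)
    n_sample = max(1, round(n_units * fraction))
    sample = rng.sample(household_sizes, n_sample)
    mean_size = sum(sample) / n_sample
    # Expand the sample mean to every unit in the non-response group.
    return mean_size * n_units


# Hypothetical non-response group: 100,000 housing units, 0 to 6 occupants each.
rng = random.Random(1997)
sizes = [rng.choice([0, 1, 2, 2, 3, 3, 4, 5, 6]) for _ in range(100_000)]

true_total = sum(sizes)
estimated_total = estimate_from_sample(sizes, 0.01)
print(true_total, round(estimated_total))
```

The two printed numbers will generally be close but not equal; the gap is the sampling error that the design accepts in exchange for following up only the sampled units.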

Finally, the Bureau plans to tackle the undercount problem. In the 1990 census it was estimated that about 1.6 percent of the population was not counted, and the assumption is that a similar percentage will not be counted if the census stops after stage II. The Bureau plans to use essentially the same method used in 1990. In that census the undercount estimation took place after the official census was completed, and it was finally decided not to use the results to modify the census count. This time the undercount adjustment will be incorporated into the official census.

The method used for the undercount estimation is the familiar capture-recapture method. After stage II is completed, a sample of 25,000 blocks, containing 750,000 housing units and 1.7 million people, is chosen. (A block is about a city block and is the smallest geographic unit for the census.) Field workers go into each block and make an independent list of the households and an independent enumeration of the people living there. Let E be the number of people enumerated by stage II and P the number in the post-enumeration sample. The results of the two enumerations are then compared. Assume that 98 percent of the people counted in P were also counted in E. If life followed our textbook descriptions of the capture-recapture method, we could estimate the true population by E/.98 = 1.02*E.
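
The textbook calculation can be sketched in a few lines of Python. This is only an illustration under the idealized assumptions of the simple capture-recapture model (independent lists, everyone equally likely to be counted). The function name `lincoln_petersen` and the counts are hypothetical and chosen by us so that 98 percent of P matches E.

```python
def lincoln_petersen(e_count, p_count, matched):
    """Textbook capture-recapture estimate of a population size.

    e_count : people counted by the census enumeration (E)
    p_count : people counted by the independent post-enumeration sample (P)
    matched : people in P who were also found in E
    """
    if matched <= 0:
        raise ValueError("need at least one matched person")
    if matched > min(e_count, p_count):
        raise ValueError("matched cannot exceed either count")
    return e_count * p_count / matched


# Hypothetical numbers: 98 percent of the P list is also on the E list.
E = 1_000_000
P = 50_000
matched = 49_000  # 0.98 * P

estimate = lincoln_petersen(E, P, matched)
print(round(estimate))         # the same as E / 0.98, up to rounding
print(round(estimate / E, 3))  # the adjustment factor applied to E
```

Because matched/P is 0.98 here, the estimate reduces to E/.98, the figure used in the text.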

Of course life is not that simple. The two sets of data have to be studied to determine whether the differences were real differences in the enumeration process or could be accounted for by the fact that the person counted has moved, used a different name, etc. Then there is a basic requirement of the capture-recapture method that every person should have the same probability of being counted. To give this some chance of being true, the entire population will be divided into about a thousand subgroups called poststrata. These are subgroups with similar characteristics based on age, sex, race, location, etc. Estimations will then be made within each poststratum.
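
To see why the grouping matters, here is a minimal Python sketch with two made-up poststrata whose chances of being counted differ. The group labels and counts are invented for illustration, and the per-group formula is the simple textbook capture-recapture ratio, not the Bureau's production estimator.

```python
def capture_recapture(e_count, p_count, matched):
    """Textbook estimate: E * P / (number in P also found in E)."""
    return e_count * p_count / matched


# Hypothetical poststrata: (E, P, matched) for each group.
poststrata = {
    "group A (easy to count)": (900_000, 9_000, 8_910),  # 99% of P matched
    "group B (hard to count)": (100_000, 1_000, 900),    # 90% of P matched
}

# Estimate within each poststratum, then add the estimates up.
stratified_total = sum(
    capture_recapture(e, p, m) for e, p, m in poststrata.values()
)

# Pooling everyone first ignores the different chances of being counted.
pooled_e = sum(e for e, _, _ in poststrata.values())
pooled_p = sum(p for _, p, _ in poststrata.values())
pooled_m = sum(m for _, _, m in poststrata.values())
pooled_total = capture_recapture(pooled_e, pooled_p, pooled_m)

print(round(stratified_total), round(pooled_total))
```

With these invented counts the two totals differ by several hundred people, which is the kind of discrepancy that estimating within poststrata is meant to avoid.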

The details of the real world estimation problem can be found in two popular articles written by researchers involved in the process. These are: Howard Hogan, "The 1990 Post-Enumeration Survey: An Overview", The American Statistician, Nov. 1992, Vol. 46, No. 4, 261-269 and Kirk M. Wolter, "Accounting for America's Uncounted and Miscounted", Science, Vol. 253, 12-15.

In the last census, statisticians were divided on the issue of whether this method of estimating the undercount resulted in a more reliable estimate than the count obtained without it.

One statistician who has the same concern about the 2000 census is David Freedman. You can find his views of how this census is going to be carried out and his concerns about some of the technical statistical problems in the technical report:

"Planning for the census in the year 2000: an update." Authors: Morris L. Eaton, David A. Freedman, Stephen P. Klein, Richard A. Olshen, Kenneth W. Wachter and Donald Ylvisaker.

You can obtain this at Directory of /pub/tech-reports. It is technical report 484.
<<<========<<




>>>>>==============>
Letter from John Finn:

Dear Chance News --

I'm particularly interested in what I call the "semantic" aspect of Chance. When we read statistical statements there's not just the matter of understanding the numbers; we need also to understand the meanings of the terms used.

The phrase "substance abuse" is currently quite popular, and is, I imagine, supposed to sound more technical than "drug abuse". Or maybe it's supposed to be euphemistic.

But what does the term "mean" here? Indeed, just what "is" "substance abuse"? When I hear the phrase the first thing that comes to my mind is something like Portnoy's use of liver. And the term is growing in scope, so that Dartmouth College now has a "substance-free" dorm. When I see that description I wonder if this means some sort of "virtual" dorm that exists only in a computer program, or maybe an otherwise ethereal structure, perhaps intended for only philosophy majors, that is unencumbered by brick, wood, paint, plaster, and other substances. Certainly I have at times attended what I'd call "substance-free" lectures, but I don't think that's what proponents of the phrase have in mind by it.

I'm being facetious, of course, but I think the term deserves mockery, and is itself an abuse of language. What the term seems usually to mean is "any" use whatever of taboo "substances", i.e., alcohol and other recreational drugs. To speak of "abuse" of these "substances" suggests that there's a proper use of them, but that, I am quite sure, is not what those who use the term mean. Here, I am willing to bet, the questionnaire given to the high school students did not ask them whether they had "abused" alcohol, marijuana, or other drugs before sexual activity, but only whether they had "used" any of them. For the people making the report to then characterize this as "abuse" seems hardly scientific to me, being instead a moral judgment.

Certainly the government agencies that deal with "drug abuse" use the term to mean any use whatever of any controlled drug (the correct name for what are usually called "illegal drugs"). I know this from having read a number of their reports.

We Chance participants try, I believe, to be aware of abuses of statistics, and I imagine there are others besides me who feel that the misuse of language in statistical reporting does constitute such an abuse. If, for instance, a report on the behavior of homosexuals refers to them as "deviants" or "perverts", I would tend to mistrust the scientific objectivity of the report (even though it's arguable that "deviant" has an objective meaning, and that homosexuals are deviant in that their behavior deviates, or differs, from that of the majority; still "deviant" has a pejorative sense, and is in fact used as an epithet). And, of course, it wasn't long ago that reports on homosexual behavior would include such terminology, just as it wasn't long ago that reports on the behavior of Blacks or Jews would be colored by their choice of words.

Today, of course, we wouldn't expect to find these semantic abuses in reports on Jews, Blacks, or homosexuals. But reports on recreational drug users are, in my opinion, rife with such abuses. Indeed, in formulating a book on the uses and abuses of statistics in the "war on drugs" I soon realized that in this area the abuses are much more a matter of language than of numbers.

To give a single example of the sort of thing I'm talking about, consider crack cocaine. "Everyone knows" the following:

(1) crack is a highly refined, highly potent derivative of cocaine;

(2) crack is highly addictive; much more so than cocaine;

(3) crack causes violent behavior;

(4) pregnant mothers who smoke crack give birth to "crack babies", who are addicted to it.

(By "everyone knows", I mean that these are the premises of just about any story in the media about crack. These include articles by Christopher Wren, who writes most of the New York Times's longer drug stories. Jeanne Albert wrote Wren, asking nicely that he do a little homework and get his facts straight, but he neither replied to Jeanne nor improved the accuracy of his reporting. His attitude really seems to be that he's not going to be hampered by anything so trivial as facts. I am considering writing Gina Kolata, and asking her to try to get through to him, since she did a very objective story on drugs.)

Let's have a look at these "facts" about crack.

(1) "Crack is a highly refined, highly potent derivative of cocaine." Actually, in the last year or so some of the media finally, after 15 years of sticking to this description, came out and said what crack really is: regular cocaine mixed with baking soda. Mix cocaine and baking soda in warm water, let the water evaporate, and you've got crack. The baking soda causes the cocaine, which is really cocaine hydrochloride, the water-soluble salt form of cocaine, to revert to the free-base form. This has a lower vaporization temperature and is thus more amenable to "smoking" (which is really inhaling the vapors of the heated substance), and affords a more effective delivery of cocaine molecules to the brain than does snorting the salt. Rather than a derivative, it would be more correct to call crack an "anti"-derivative of snortable coke, since making crack simply reverses the last step in the manufacture of the salt. A gram of crack has fewer cocaine molecules than a gram of snortable coke, being diluted with the baking soda. Yet the penalty for possession of a gram of crack is the same as that for possession of 100 grams of the salt.

(2) "Crack is highly addictive; much more so than cocaine." Crack "is" cocaine. This statement is as nonsensical as would be saying that whisky and soda is much more addictive than alcohol. And what does it mean for a substance to be addictive? It used to mean that taking the substance for a length of time and then suddenly stopping would cause definite physical withdrawal symptoms, as is the case with opiates. Indeed, for about 100 years medical science held that while the opiates were addictive in this sense, cocaine certainly was not. Then in 1987 the definition of addictive substance was changed so as to make it applicable to cocaine, and pretty much amounts now to saying that a substance is addictive if users tend not to want to stop using it. Of course, by this definition, salt, butter and ice cream are addictive substances, so in truth the term has come to lose any precise meaning at all. But "addictive substance" still "sounds" like a precisely defined technical term.

I suppose we can assume that saying one drug is "more addictive" than another means that a greater proportion of users of the first become compulsive in their use. About 6 months after crack's debut in the media, after consistently asserting how highly addictive it is, the New York Times, in an article buried deep within its pages, acknowledged that there was no evidence of crack being any more addictive in this sense than any other drug. In fact, the percentage of users who become compulsive is about the same for every recreational drug: 10 to 20 percent. The one exception, I believe, is cigarettes, which have a higher compulsive use rate.

(3) "Crack causes violent behavior." This is entirely anecdotal. No controlled study has shown any correlation between crack use and violence.

(4) "Pregnant mothers who smoke crack give birth to "crack babies", who are addicted to it." A baby can be born addicted to opiates in the sense of going through opiate withdrawal during the first few days after birth. (European doctors, by the way, are not aware of "heroin babies" presenting a significant problem; American doctors feel they do). Cocaine "addiction" is not a matter of physical withdrawal, but merely of wanting more cocaine. It is vacuous semantics to speak of a newborn baby being addicted to cocaine. And crack "is" cocaine -- mixed with baking soda. Babies born to mothers who smoke crack have been found to have health problems. But studies confirming this included mothers who were heavy cigarette smokers, heavy drinkers, poor eaters, and so on, and did not control for these factors.

What can we conclude from this about the accuracy of statistical statements about crack we're likely to see in print or on TV? I leave that for you to decide.

-John Finn

<<<========<<



