Prepared by J. Laurie Snell, Bill Peterson, Jeanne Albert, and Charles Grinstead, with help from Fuxing Hou and Joan Snell.
Please send comments and suggestions for articles to
Back issues of Chance News and other materials for teaching a Chance course are available from the Chance web site:
Chance News is distributed under the GNU General Public License (so-called 'copyleft'). See the end of the newsletter for details.
Chance News is best read with Courier 12pt font and 6.5" margin.
New Yorker cartoon 10/2/2000:
CAPTION: "Hit 'em right after they won the lottery."
(Two men look at a house crushed by a meteor.)
Cartoonist: Gahan Wilson
Contents of Chance News 9.10
If you would like to have a CD-ROM of the Chance Lectures that are available on the Chance web site, send a request to firstname.lastname@example.org with the address where it should be sent. There is no charge.
According to research from Barclays Capital, the FTSE 100 would have achieved more than 50 percent growth in 73 percent of the five year periods between 1990 and June 2000
16 July 2000
When he settled the $9 bill, the young doctor left a tip of $10,000 (pounds 6,538), or a refreshingly generous 1,111 per cent.
7 June 2000
Of the three quarters of seats sold direct, 75 percent are sold online -- a proportion that is set to grow until at least half of all Ryanair's seats are sold online.
Travel Trade Gazette
3 July 2000
We realized how late we were with this Chance News when Norton Starr sent us another month's Forsooth! Here it is.
From RSS News V28 #2, October 2000
The deputy earns roughly 3000 percent less than the foreigner he shadows.
'Five Years in Tibet'
by Alec Le Suer
Mr. Buchanan is registering between 1 and 2 percent in national polls which, given their margin for error, means his support could be zero.
The Times
14 August 2000
Many of our arithmetic senior citizens have been advised by their doctors that swimming is the best exercise to keep them mobile.
(Useful advice for those of us that seize up on mathematical problems?)
The minimum purchase of Premium Bonds is 100 pounds, which gives a 1 in 240 chance of winning a prize each month. So a person would have to hold their investment in the bonds for 12 years to statistically stand a chance of any return.
Yorkshire Evening Post
23 October 1999
We had some difficulty understanding why this last item deserved a Forsooth but we were curious how this bond lottery works.
We found the answer at the U.K. National Savings web site. The bonds referred to are British government bonds called Premium Bonds. These bonds have a unit price of one pound. A minimum purchase is 100 units and you can hold at most 20,000 units. You get no interest but each unit you hold is a ticket for a lottery held every month. Every month one serial number for every 20,000 eligible bonds is chosen to be awarded a prize. The prizes range from 50 pounds to 1 million pounds. The million-pound prize is guaranteed every month and the other prizes are chosen so that the total prizes awarded will amount to a month's interest on all the bonds eligible for the lottery at a 4.25 percent annual rate (interest rate subject to change). No taxes need be paid on the winnings.
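The odds described above can be turned into a chance of winning at least once over a holding period. The following is a minimal sketch, assuming that every unit wins independently in every monthly draw with the same probability (a simplification of the real draw). The function name and the 12-year holding period are ours, chosen only for illustration.

```python
def prob_at_least_one_prize(units, months, odds_per_unit):
    """Chance of winning at least one prize, assuming every unit has an
    independent 1-in-odds_per_unit chance in every monthly draw."""
    p_unit = 1.0 / odds_per_unit
    return 1.0 - (1.0 - p_unit) ** (units * months)

# 100 units (the minimum purchase) at one prize per 20,000 eligible bonds.
one_month = prob_at_least_one_prize(100, 1, 20_000)
twelve_years = prob_at_least_one_prize(100, 12 * 12, 20_000)

print(f"chance of a prize in one month:    {one_month:.4f}")
print(f"chance of a prize within 12 years: {twelve_years:.4f}")
```

Changing `odds_per_unit` shows how sensitive the answer is to the quoted odds, which differ between the newspaper item and the National Savings description.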
DISCUSSION QUESTION: Would you rather play this lottery or a more conventional lottery such as the Powerball lottery?
We are often told that something is 'three (or whatever) times more than' when what is meant is 'three times as much as'. 'Three times more than' must mean 'four times as much as'. More strikingly, '30 per cent more than' is very different from '30 per cent as much as'. The improper usage is confusing because the writer or speaker may intend the correct meaning.
The difference between '300 times more than' and '300 times as much as' may be small, but reduce 300 to three and the difference is far from trivial. This has been irritating and confusing me for years. Is there anything the RSS can do to establish correct practice?
Maurice B Line
DISCUSSION QUESTION: What is the answer to Maurice's question?
Having failed to find a Barney Bissinger in Hershey, Pennsylvania, we asked in a discussion question whether readers believed there was such a person. Several readers wrote that there was indeed. For example, Emil Friedman and David Cruesse both know him, and David writes:
I have known Dr. Barnard Bissinger for a number of years. He retired after a long career working in statistics for the military, and now consults part-time for the medical school in Hershey, PA, and other clients.
I am sure he knew of the envelope paradoxes that his letters illustrated, but that he was trying to see how Marilyn would answer.
Tom Lane sent us references to several of Bissinger's papers and also remarked that his Erdos number is 2.
The October 3, 2000 issue of the Dartmouth student newspaper, known affectionately as "the D", carried a full page ad saying:
The majority of Dartmouth students drink 0, 1, 2, 3 or at most 4 drinks when they party.
Seeing this, Peter Doyle remarked:
But surely that implies that the majority of Dartmouth students drink 4 or more drinks when they party which seems a lot to me.
A call to our local statistician confirmed that indeed they would have chosen 3 for the ad if they could say that the majority of students drink 0,1,2 or 3 drinks when they party. Since they couldn't, Peter was right.
The article describes this shift in strategy that some colleges are using in order to reduce excessive student drinking. In recent years, most colleges and universities have tried to send students alarming messages about the consequences of "binge" drinking (defined as 5 drinks in a row for men or 4 drinks in a row for women), but apparently students continued to drink heavily. Believing that excessive student drinking has been exaggerated and that harping on it may have caused students to perceive that heavy drinking is the norm, a handful of colleges have tried a new approach: emphasizing statistics stating that most students on campus drink in moderation. (A group of 21 national higher-education associations has asked that its members and the news media discontinue using the phrase "binge drinking," calling it inaccurate and counterproductive.)
The strategy was first suggested in 1986 by sociologist H. Wesley Perkins, who had concluded that on surveys students often overestimated how much other students drank, and that the bigger the overestimate, the more they drank themselves. Since then, according to the article,
"several other studies have shown the same gap between perception and reality. One study of 48,000 students on 100 campuses nationwide found that at campuses where most students said they drank once a month, 90 percent presumed that their peers drank weekly or even daily."
In the new approach, which is really a positive rather than a negative marketing strategy, colleges are promoting statistics such as "Zero to 3" (on frisbees distributed at Cornell), and "What's the norm?--Four or fewer" (on footballs at Hobart and William Smith.) According to the article, several colleges and universities have experienced as much as a 20 percent drop in "reported drinking". But Harvard researcher Henry Wechsler, who first called attention to excessive student drinking in 1993, disputes the effectiveness of the new programs. According to a study he published last month, the rate of "binge" drinking on campus has remained the same (44%) since 1993.
(1) The article reports that this new approach was started in 1990 by Northern Illinois University when the University ran an ad stating that "Most students drink five or fewer drinks when they party." Do you think that this implies that the majority of students were binge drinkers at Northern Illinois in 1990?
(2) Do you think that students accurately describe their own drinking behavior? their peers' behavior? What do you think is the best way to get this kind of information?
(3) The article says that, at schools where the social-norms approach has been adopted, perceptions of heavy drinking declined, along with alcohol-related injuries. Why might this have occurred?
The first article was written just before the second debate between Gore and Bush and comments:
Depending on which pollster you believe, the following is true:
A) The race for president is highly "volatile" this year.
B) It's no more volatile than usual.
C) It's actually pretty stable.
B) Bush is ahead.
C) No one's ahead.
The writer observes:
In its polling Oct. 2 to 4, Gallup had Democrat Gore ahead by 11 points, 51 to 40.
In its polling Oct. 5 to 7, Gallup had Republican Bush ahead by 8 points, 49 to 41.
He comments that this is as if 9 percent of the voters changed their mind and wonders what is going on.
Frank Newport, editor in chief of Gallup, says this change is not abnormal and reflects an authentic surge after the candidates' first debate. He also observes that more attention is paid to a difference when it changes who is ahead than when it just increases one candidate's lead.
Others believe that this volatility is unusual and give a variety of reasons for it such as: the candidates don't seem that different, there are no polarizing issues etc.
In the web version of the first article you can see a graphical comparison of four different polls during the period from Oct. 2-7: CNN/USA Today/Gallup, Reuters/MSNBC, Voter.com/Battleground. The latter two suggest less volatility than that observed in the Gallup polls.
The article suggests that differences in polling methods can explain the difference in volatility. For example, Gallup pushes undecided voters to declare a choice and this can cause swings as small things happen to change the minds of the undecided. Most polls seek out likely voters but use different ways to determine who is a likely voter. Differences in the time the poll is conducted can give different results. Some carry out their polls on weekends but others avoid this, believing that they will miss married voters with children.
The aim of the second article is to explain the concept of "margin of error". It gets off to a good start:
When a poll has a margin of error of 3 percentage points, that means there's a 95 percent certainty that the results would differ by no more than plus or minus 3 points from those obtained if the entire voting age population was questioned.
However, the next paragraph has some problems:
Margin of error is sometimes misunderstood. Let's say George W. Bush is up by 5 points. It sounds like this lead well exceeds the 3-point margin of error. But in fact, Bush's support could be off by three points in either direction. So could Al Gore's. So the real range of the poll is anywhere from an 11-point Bush lead to a 1-point Gore lead.
The author is trying to say something about the margin of error for the difference between the two candidates. The writer is assuming, incorrectly, that the candidates' errors are independent. A way to calculate this margin of error is discussed in "Poll Faulting" by Ansolabehere and Belin, Chance Magazine, Vol. 6, No. 1, 1993.
Consider a poll with outcomes: Bush, Gore, and other. Other includes the undecided voters as well as those who favor another candidate. Then assuming a simple random sample, the distribution of the sample is a multinomial distribution with parameters p1,p2,p3 corresponding to the proportions of these groups in the entire population. Let s1,s2,s3 be the sample proportions. Then we are interested in the difference s1-s2. The expected value of s1-s2 is p1-p2 and the variance is
var(s1-s2) = var(s1) + var(s2) - 2cov(s1,s2) = (p1(1-p1) + p2(1-p2)+ 2*p1p2)/n.
Then the standard deviation of the difference between the two candidates is the square root of this expression, and multiplying this by 1.96 gives the margin of error at the 95 percent confidence level. For example, suppose that a poll of 1000 voters gives Bush 46 percent, Gore 41 percent and other 13 percent. Such a poll has about a 3 percent margin of error for either candidate. Using the sample proportions s1 and s2 to estimate p1 and p2, we obtain a 5.8 percent margin of error for the difference between the two candidates. Thus, based on the Oct. 5-7 poll with Bush ahead by 5 points, Gore could be in the lead within the margin of error for this difference.
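The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch of the variance formula given above, assuming a simple random sample. The function names are ours, and the sample proportions stand in for p1 and p2 as in the text.

```python
import math

def margin_single(p, n, z=1.96):
    """Margin of error for one candidate's share from a simple random sample."""
    return z * math.sqrt(p * (1 - p) / n)

def margin_difference(p1, p2, n, z=1.96):
    """Margin of error for the difference s1 - s2 in a multinomial sample.

    Uses var(s1 - s2) = (p1(1-p1) + p2(1-p2) + 2*p1*p2) / n.
    """
    variance = (p1 * (1 - p1) + p2 * (1 - p2) + 2 * p1 * p2) / n
    return z * math.sqrt(variance)

n = 1000
bush, gore = 0.46, 0.41

print(f"margin for Bush alone:     {100 * margin_single(bush, n):.1f} points")
print(f"margin for Gore alone:     {100 * margin_single(gore, n):.1f} points")
print(f"margin for the difference: {100 * margin_difference(bush, gore, n):.1f} points")
```

The positive 2*p1*p2 term comes from the negative covariance between the two sample proportions: a respondent counted for one candidate cannot be counted for the other.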
The authors of the Chance article suggest that a simple estimate for the margin of error for the difference can be obtained by multiplying the margin of error for a single candidate by the square root of 3. In our example this gives 5.5. This rough approximation is typically a little low.
The Chance Magazine article also discusses how one can test if the differences between polls can be explained by chance variation. The authors apply their method to a set of polls during the 1992 presidential race between Clinton, Bush, and Perot. They conclude that the differences between polls were too great to be the result of chance fluctuations.
(1) After remarking in the first article that the 5-point Bush lead is within the margin of error for the poll the author writes:
Does that mean Bush shouldn't be characterized as leading that poll? No, say many experts. The 5-point Bush lead is still the most likely reflection for the race.
What do you think the experts had in mind in saying this?
(2) What would the margin of error be for the difference between two candidates if there were no other candidates and no undecided voters?
(3) In a recent discussion on TV of voter turnout, a commentator stated that politicians obviously did not really want a larger turnout. If they did they would vote to allow a longer period for balloting -- for example 24 hours or even up to a week. What are the pros and cons of allowing a longer time for voting?
This article describes the results of studies conducted by the NAACP and by the Dallas Morning News that had been released the same week. The studies are based on different parts (time periods) of the same collection of traffic data, but both focused on the rates at which African-Americans, Hispanics, and whites were ticketed and searched on Texas roadways. According to the article's lead paragraph:
"Black and Hispanic motorists across Texas are more than twice as likely as non-Hispanic whites to be searched during traffic stops, while black drivers in certain rural areas of the state are also far more likely to be ticketed..."
Although the Dallas Morning News found that statewide, African-Americans and Hispanics received tickets "at rates that were proportional to their driving-age populations", they also concluded that in many rural counties blacks were ticketed nearly twice as often as non-Hispanic whites. In response, the Texas Department of Public Safety said the study was flawed because, as the article states:
It compared the race and number of ticketed drivers with the local population where the stop occurred, but didn't consider that the drivers might be from elsewhere.
The NAACP study found similar differences between the rates at which African-Americans, Hispanics, and non-Hispanic whites are searched after being stopped.
Texas began collecting and compiling information from traffic stops after it had suspended seven police officers in East Texas for racial insensitivity. The state's own data apparently show that non-Hispanic white drivers are stopped more often "than their estimated state wide population", but that blacks and Hispanics are twice as likely to be searched.
(1) The Dallas Morning News apparently compared the percentages of those ticketed who are African-American, Hispanic, or white, to the percentages of the "driving-age population" who are in these groups, and found rates that were "proportional". Is this a meaningful comparison? (See next question...)
(2) Laurie contacted Tom Sager, a professor at the University of Texas who worked with the newspaper to analyze the data, and he said that there were problems with the study. In particular, there was no information concerning the racial composition of drivers actually on the roads in Texas (as opposed to simply the driving-age population), nor was there information on the rates that each racial group violated traffic laws. Professor Sager told Laurie:
I advised the reporter of these [and other] limitations. That was more than six months ago. I am a little surprised that the paper proceeded with the story in view of the problems.
How could the information mentioned above be used? Why do you think that knowing the racial composition of actual drivers is more meaningful than knowing the racial composition of the driving-age population?
(3) The NAACP and Texas itself have apparently determined that non-white drivers, once stopped, are searched at higher rates than white drivers. According to the article, "the state report explained the high number of Hispanics who were searched as a byproduct of the traffic of illegal immigrants and drugs from Mexico." Do you think this is a reasonable argument (that racial "profiling" isn't occurring)? What other information might help you answer this question?
This article and the accompanying simulation programs on the Web consider the problem of using averages to predict future outcomes. The first example that is considered is a $200,000 retirement account that is invested in an S&P 500 index fund. One wants to know how much he or she can withdraw each year from the account so that the account will last 20 years.
The obvious way to proceed is to compute the average rate of return of the S&P 500 index over a long period of time and to use this average as an estimate of the future rate of return. The average rate of return on this index, since its inception in 1953, is 14 percent. If one computes, using a simple formula, the amount one can withdraw each year for 20 years from this account, assuming that it returns 14 percent annually, one arrives at a figure of $32,000.
There are two problems with this approach. The first, which is not addressed in this article but has been discussed by many analysts, is that it is not at all safe to assume that the average rate of return over the next 20 years will match that in the preceding 47 years. It would be much more prudent to assume a smaller rate of return will prevail in the future, for if one is too conservative one ends up with extra money, while if one is too liberal one goes broke before the 20 years are up. (In fact, if the 20-year period represents the beneficiary's life expectancy, then one should probably be conservative here as well and assume the beneficiary will live substantially longer than 20 years.)
The second problem, which the author calls the "Flaw of Averages", is that the variation of the actual rate of return around the average return generally hurts the investor. In the author's words, "If you assume each year's growth at least equals the average of 14 percent, there is no chance of running out of money. But if the growth fluctuates each year but averages 14 percent, you are likely to run out of money."
On the author's web page, one can find a simulation of this example. Although it is not stated what distribution is being used to generate the simulated yearly returns, the simulation is still quite impressive. It shows the account balance, over either the next 20 years or until the account is zeroed out. Most of the time, the account zeroes out well before the 20 years are up.
The sobering thing about this idea is that many people and institutions use averages to determine long-range spending levels. (This reviewer's college makes long-term spending guidelines that are based upon average rates of returns of stocks and bonds.) With today's computers, it is very easy to do a large number of simulations; however, people must first be convinced that simulations are valuable.
It is worth pointing out that there are two ways to define the term "average rate of return." One way is to simply write down the return each year and then take the average of these rates. The other way, which is the way that most investors think, is to determine what constant rate of return would give the same return over the time period in question as that which actually occurred. Thus, for example, if over a two-year period a fund returns 100 percent one year and 0 percent the next, the first average would be 50 percent, while the second average would be 41.4 percent. It turns out that which definition one chooses makes quite a difference in the Flaw of Averages. We asked the author of this article which definition he uses in his simulations, and it turns out that he uses the first of the two definitions.
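The two definitions can be compared directly on the two-year example above. This is a small sketch with function names of our own choosing: the first definition is the arithmetic mean of the yearly rates, and the second is the constant rate that produces the same total growth (the geometric mean).

```python
def arithmetic_average(returns):
    """First definition: the plain average of the yearly rates of return."""
    return sum(returns) / len(returns)

def geometric_average(returns):
    """Second definition: the constant yearly rate giving the same total growth."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth ** (1.0 / len(returns)) - 1.0

# The example from the text: 100 percent one year, 0 percent the next.
returns = [1.00, 0.00]

print(f"first definition:  {100 * arithmetic_average(returns):.1f} percent")
print(f"second definition: {100 * geometric_average(returns):.1f} percent")
```

The second figure is never larger than the first, so a set of yearly returns that meets a 14 percent target under the second definition has done at least as well in total growth as one that meets it only under the first.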
We wrote two programs to simulate the situation. In each case, the yearly rates of return were chosen from a normal distribution, with a certain mean and standard deviation, for a period of 20 years. The lists of yearly returns were discarded if they did not give an average (with the meanings discussed above) within .005 of .14. In each case, we pulled out $30,197 at the end of each year, which is the amount that will zero out the account in the case that the account makes exactly 14 percent each year.
Each program was run 1000 times. Using the first definition of average rate of return, the account zeroed out 64 percent of the time before 20 years was up. However, using the second definition, the account only zeroed out 39 percent of the time. While this probability is still high enough to be of considerable concern, it is still much smaller than in the first case.
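For readers who want to try this kind of experiment, here is a rough sketch in Python. It is not the program used for the figures above: the standard deviation of 0.15, the random seed, and all function names are our assumptions, so the percentages it prints need not match those reported. It does reproduce the $30,197 withdrawal and the accept/reject step on the average return.

```python
import random

def level_withdrawal(balance, rate, years):
    """End-of-year withdrawal that empties the account after `years`
    years if it earns exactly `rate` every year."""
    return balance * rate / (1.0 - (1.0 + rate) ** -years)

def lasts(returns, balance, withdrawal):
    """True if every end-of-year withdrawal can be made in full."""
    for r in returns:
        balance = balance * (1.0 + r) - withdrawal
        if balance < 0:
            return False
    return True

def average(returns, kind):
    """'arithmetic' is the first definition of average return,
    anything else is treated as the second (geometric) definition."""
    if kind == "arithmetic":
        return sum(returns) / len(returns)
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth ** (1.0 / len(returns)) - 1.0

def ruin_fraction(kind, runs=1000, mean=0.14, sd=0.15, target=0.14,
                  tol=0.005, years=20, start=200_000, withdrawal=30_197,
                  seed=0):
    """Fraction of accepted runs in which the account runs dry early.
    sd=0.15 is an ASSUMED value; the text does not state the one used."""
    rng = random.Random(seed)
    accepted = ruined = 0
    while accepted < runs:
        # Clamp at -100 percent so the balance can never go below zero growth.
        returns = [max(rng.gauss(mean, sd), -1.0) for _ in range(years)]
        if abs(average(returns, kind) - target) > tol:
            continue  # discard lists whose average is not close to 14 percent
        accepted += 1
        if not lasts(returns, start, withdrawal):
            ruined += 1
    return ruined / runs

print(f"level withdrawal at 14 percent: {level_withdrawal(200_000, 0.14, 20):,.0f}")
for kind in ("arithmetic", "geometric"):
    print(f"{kind:>10} average: zeroed out in {100 * ruin_fraction(kind):.0f} percent of runs")
```

Raising or lowering `sd` gives a quick way to see how strongly the chance of running out depends on the assumed variability of returns.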
It is also interesting to note that if one takes out $25,000 per year, instead of $30,197, then the probability that the account will last at least 20 years is about 92 percent (using the second definition of average). Also, if one takes out $25,000 per year, the average amount in the account at the end of 20 years, including those years when it is zeroed out early, is $637,000. If one takes out $30,197, the average is $178,000. It must be noted here that the value for the standard deviation certainly affects the probability of zeroing out; as the standard deviation rises, the probability of ruin increases.
DISCUSSION QUESTION: Which definition of the term "average rate of return" do you think makes more sense?
Andrew Bernard at the Dartmouth Tuck School of Business and Meghan Busse at the Yale School of Management posted a paper on the web predicting how many medals each of the countries would win in the 2000 summer Olympics in Sydney. Their predictions were discussed in major newspapers and on NPR both before and after these Olympics.
Bernard gave a talk in the Dartmouth Chance course which you can view under Guest Lectures.
It was fun to hear Bernard's account of how they developed their formula for predicting the number of medals each country would win. We encourage our readers to view this video.
In his talk Andrew explained the way they went about looking for a good set of variables for predicting the number of medals that each country would win. They obtained data for previous summer Olympics and various socio-economic indicators for the years 1960 to 1996. They first considered the prediction that the number of medals would be proportional to the population. Looking at previous years' data showed that this was not a very good method. Taking logarithms of the population improved the fit but it was still poor. The problem here is easily seen from the fact that four countries, China, India, Indonesia, and Bangladesh, have about 43 percent of the world's population but, for example, in 1996 won about 6 percent of the medals.
The authors then considered the gross national product (GNP) as a predictor. The four countries we mentioned, with 43 percent of the population, have only about 5 percent of the world's GNP, which is about the percentage of the medals they typically win. Thus, this looks more promising. The authors next found that GNP per capita was, indeed, a better predictor, and it was even better when they used the logarithm of the GNP per capita.
Looking at previous Olympics suggested that there was a kind of "home team advantage" which on average seemed to increase the percentage of medals won by about 1.5 percent over what they would win when the Olympics were in another country. The authors added a home team advantage variable and other variables to take into account, for example, that countries can "manufacture medals" as happened in the Soviet Union. However, the fit was still not impressive until the addition of the final variable they considered: the number of medals each country won in the previous summer Olympics.
Alas, it appears that, at least for the 2000 Olympics, the whole process could have been simplified by considering only this one variable. To show this, Greg Leibon in his Chance class looked at the 23 countries that won 10 or more medals in 1996 and considered their medal shares. This group of countries won 679 medals in the 1996 Olympics, so their medal shares are the percentages of these 679 medals that each of these 23 countries won. We give below the medal shares they won in the 1996 Olympics, their medal shares predicted for the 2000 Olympics by Bernard and Busse, and the medal shares they actually won in the 2000 Olympics.
|Country||1996 medal share||Bernard-Busse predicted 2000 medal share||Actual 2000 medal share|
|Russian Fed.||9.28||8.87||12.21|
We see from this that the Bernard-Busse predictions are very close to the predictions one would make by simply assuming that the medal shares would be the same within this group as they were in the previous Olympics. The "same as the last Olympics" prediction gives a correlation of .946 with the 2000 results while the Bernard-Busse predictions give a correlation of .949. Thus both methods give about the same very good predictions.
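A comparison of this kind takes only a few lines of code. The sketch below shows how medal shares within a group and the correlation between a prediction and the outcome can be computed. The medal counts are invented placeholders for five imaginary countries, not the Olympic data, and the function names are ours.

```python
import math

def medal_shares(medal_counts):
    """Each country's percentage of the medals won by the whole group."""
    total = sum(medal_counts)
    return [100.0 * count / total for count in medal_counts]

def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# INVENTED counts for illustration only (not real Olympic results).
previous_games = [60, 40, 25, 15, 10]
current_games = [55, 45, 20, 18, 12]

# "Same as last time" prediction: previous shares used as the forecast.
predicted = medal_shares(previous_games)
actual = medal_shares(current_games)

print(f"correlation of 'same as last time' with outcome: {correlation(predicted, actual):.3f}")
```

Replacing the placeholder lists with a model's predicted shares and the observed shares gives the second correlation needed for the comparison.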
Predicting the same as last time has a long history of success in prediction schemes. In his column "A Statistician Reads the Sports Page" in the Summer 2000 issue of Chance Magazine, Scott Berry discusses his model, presented in the Spring 1999 issue of Chance Magazine, for predicting the top 25 home run hitters for the 1999 season following the historic 1998 home run season. Like Bernard and Busse he was trying to make better predictions by adjusting the 1998 totals. He asks how his results compared with those of Mr. MLE, who predicts that each player will hit the same number of home runs as he hit last year. He comments:
I am disappointed that my projections did not beat Mr. MLE -- I still think it will do better in the future and better for the whole league.
Another example of such a simple scheme was described by a weather expert whose job it was to evaluate the performance of weather forecasters. To evaluate their predictions, for example for tomorrow's maximum temperature, he compared the success of a forecaster's predictions with the prediction scheme that takes the average of yesterday's maximum temperature and the historical average for tomorrow's maximum temperature for the region being predicted. If the forecaster did better he was impressed.
In their paper Johnson and Ali also attempt to make predictions both for participation and success at the Olympics. Like Bernard and Busse they use linear regression models based on population, GNP, home team effect, and other socio-economic-political variables. Unlike Bernard and Busse they do not use the previous Olympic results in their prediction, in order to better assess the predictive value of the socio-economic-political variables. Of course their results are not as good but are still pretty impressive. Here is an analysis of the top ten of their predictions in terms of medal shares.
|Country||Johnson-Ali predicted 2000 medal share||Actual 2000 medal share|
The correlation between the predictions and the actual results is .78. A scatter plot reveals that China and Russia are outliers as they were in the Bernard-Busse analysis.
(1) Why do you think that Scott calls the forecaster who predicts the same as last time "Mr. MLE"?
(2) For complicated reasons it is not known before the Olympics exactly how many medals will be won. Bernard and Busse made their predictions on a medal share basis assuming 888 medals would be awarded and Johnson assumed 900. In fact 992 medals were awarded. In evaluating their predictions should we modify them to take this into account? (Bernard said on NPR after the Olympics that they would stick by their original predictions -- perhaps because they were exactly correct on the 97 medals won by the US).
(3) Can you explain why the predictions for Russia and China were so far off?
This article reports that Keith Baumgardner, a tire consultant who is analyzing Firestone tires in connection with lawsuits against the tire-maker and Ford, found that of the 63 cases of tread separations on Explorers that he examined, 27 involved the left rear tire and 18 the right rear tire. The other 18 were on the front wheels or their position was not known.
Ford spokesman Ken Zino said that the company has seen a slight rear trend and was trying to find the cause of this. Another Ford spokesman, Mike Vaughn, said that they were investigating a number of theories:
The location of the Explorer's fuel tank on the left side might put more weight on that wheel.
The rotation of the Explorer's drive shaft might place more force on the left side than the right.
American roads might radiate more heat from the center of the pavement than from the edges, increasing the heat transferred to the driver's side tires.
Vaughn stressed that Ford was in the early stages of the investigation and these theories did not contradict the company's view that the problem involved the tires and not the design of the Explorer.
Emil Friedman pointed out to us that the National Highway Traffic Safety Administration (NHTSA) has provided on their web site information regarding complaints, injuries, and fatalities that have been reported to NHTSA related to Firestone ATX and Wilderness tires under investigation by the agency. As of the middle of September there were slightly over 2200 such complaints. They provide an Excel spreadsheet with the information they have about each complaint. This is a wonderful data set to illustrate all the problems of dealing with a real data set.
To verify Baumgardner's claim, we looked at the complaints where there were tread separations on Explorers with at least one injury. We looked only at the 154 cases where trouble with only one tire was reported and it was clear which tire they were referring to. We found the tires that failed distributed as follows
Thus again the rear tires were the problem most of the time but the difference between the left rear and right rear is not as striking as it was for Baumgardner's cases.
(1) How would you test if the differences between the rear right and rear left tires were significant?
(2) Obviously, the lawyers for Firestone will try to prove that the problem was with Ford's Explorer and the Ford lawyers will try to prove that the problem was with the tires. Consider being a lawyer in each case and explain how you might try to win your case.
This article is based on a more complete discussion.
The article describes how statistics can be used in yes-no diagnostic questions in medicine and elsewhere.
The main points made in this article are well illustrated by the authors' example of an eye doctor trying to determine if a patient has glaucoma. As this writer knows all too well, one of the principal diagnostic tools is the fluid pressure of the eye. The population average pressure is about 22 in the units used. This could be used as a cutoff point to treat patients for glaucoma and we are told that in earlier days this was the case. The authors describe a better way to use population statistics to assist diagnosis.
Of course there are other tests that can be and are used in making a diagnosis of glaucoma. But, for simplicity the authors assume the diagnosis is to be made by this single test.
They propose that we start by finding a large sample population of people whose eye pressure level and glaucoma status are known. If the distributions of the pressures for these two groups do not overlap, then the diagnostic problem is easy. However, typically they will overlap and the question is how do we make a diagnosis for pressures in the interval of overlap?
The authors assume for their glaucoma example that the population curves overlap for pressures between 10 and 40. Then if a patient's pressure is less than 10, they would say the patient does not have glaucoma. If it is over 40 the patient does have glaucoma. The question is how to make a diagnosis when the score is between 10 and 40?
Suppose we choose a cutoff value x between 10 and 40 and we say that all those with pressure greater than x have glaucoma and those with pressure less than x do not have glaucoma. Then some of those with glaucoma will test positive (true positives) and some of those who do not have glaucoma will also test positive (false positives).
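As a minimal sketch of this bookkeeping, the Python fragment below counts true and false positives for a single cutoff. The pressure readings and the function name rates_at_cutoff are invented for illustration. They are not the authors' data, and were chosen only so that the overlap falls between 10 and 40.

```python
# Hypothetical eye-pressure readings, invented for illustration only.
healthy = [12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
glaucoma = [19, 21, 23, 25, 27, 29, 31, 33, 36, 39]


def rates_at_cutoff(healthy, glaucoma, cutoff):
    """Return (false positive probability, true positive probability).

    A patient tests positive when the pressure is greater than the cutoff.
    """
    false_pos = sum(1 for p in healthy if p > cutoff)
    true_pos = sum(1 for p in glaucoma if p > cutoff)
    return false_pos / len(healthy), true_pos / len(glaucoma)


fpr, tpr = rates_at_cutoff(healthy, glaucoma, 20)
print(f"cutoff 20: false positive {fpr:.1f}, true positive {tpr:.1f}")
```

Any other cutoff between 10 and 40 can be passed in the same way to see how the two probabilities change together.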
How do we choose x? We can err in two ways: we can end up treating a patient for glaucoma who does not have glaucoma, or we can end up not treating a patient who does have glaucoma. Since there is no danger in treating a patient who does not have glaucoma but a great danger (possible blindness) in not treating a patient who has glaucoma, we would be more worried about the second kind of error in this example.
To assist further in choosing the cutoff point and in measuring the effectiveness of the test, the authors propose using an ROC curve. ROC stands for "Receiver Operating Characteristic" and comes from signal detection theory developed during World War II. ROC curves were developed to help radar operators "diagnose" whether a blip on the radar screen corresponded to an enemy target, a friendly ship, or just noise.
Suppose we choose a cutoff point x = 20. From the authors' population distributions we see that, of those who have glaucoma, 90 percent would test positive (true positive probability = .9) and 50 percent of those who do not have glaucoma would test positive (false positive probability = .5). We would like a high true positive probability and a low false positive probability. Unfortunately, as we increase (decrease) the cutoff value both of these rates decrease (increase), so there is always a tradeoff -- we improve one at the expense of the other. Here are the values of these three variables for five different threshold values using the authors' population distributions.
|Threshold||false positive probability||true positive probability|
The ROC curve is a parametric curve whose parameter is the threshold. The point on the curve for a given threshold has coordinates
(false positive probability, true positive probability)
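The parametric description above can be sketched in a few lines of Python. The readings below are hypothetical (they are not the authors' population distributions), and roc_points is an illustrative name. Each threshold yields one point of the curve.

```python
# Hypothetical eye-pressure readings, invented for illustration only.
healthy = [12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
glaucoma = [19, 21, 23, 25, 27, 29, 31, 33, 36, 39]


def roc_points(healthy, glaucoma, thresholds):
    """Return one (threshold, false positive prob., true positive prob.) per threshold."""
    points = []
    for t in thresholds:
        # A patient tests positive when the pressure exceeds the threshold.
        fpr = sum(p > t for p in healthy) / len(healthy)
        tpr = sum(p > t for p in glaucoma) / len(glaucoma)
        points.append((t, fpr, tpr))
    return points


for t, fpr, tpr in roc_points(healthy, glaucoma, [10, 20, 25, 30, 40]):
    print(f"threshold {t:2d}: ({fpr:.1f}, {tpr:.1f})")
```

As the threshold rises, both coordinates fall together, so the printed points trace the curve from the upper right corner toward the origin.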
The ROC curve lies in the unit square with vertices (0,0), (1,0), (1,1), (0,1). It starts at (0,0) and ends at (1,1).
If the pressure had nothing to do with glaucoma, both rates would be the same for any cutoff point--we would just be tossing a biased coin to make our decision. The ROC curve would then be the diagonal line going from (0,0) to (1,1). For tests that do have some ability to distinguish between those who do and don't have glaucoma, the ROC curve lies above the diagonal line. Since we want the false positive probability small and the true positive probability large, the closer the curve follows the left-hand border and then the top border of the unit square the better the test. Thus the larger the area under the curve the better.
The area under the ROC curve has the following interesting interpretation: Choose two patients, one randomly chosen from the healthy population and the other randomly chosen from the population with glaucoma. Then the area under the ROC curve is the probability that the test correctly identifies which is the healthy patient, that is, the probability that the glaucoma patient has the higher pressure. Thus the area gives a measure of the discriminatory power of the test. This allows one to compare two different tests: the test with the larger area under the ROC curve is better.
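This interpretation can be checked numerically. The sketch below, which again uses invented readings rather than the authors' data, computes the area in two ways: by the trapezoid rule over the empirical ROC points, and by counting the fraction of (healthy, glaucoma) pairs in which the glaucoma patient has the higher pressure. The function names are illustrative, and counting a tie as one half is an assumption of this sketch.

```python
from itertools import product

# Hypothetical eye-pressure readings, invented for illustration only.
healthy = [12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
glaucoma = [19, 21, 23, 25, 27, 29, 31, 33, 36, 39]


def auc_pairwise(healthy, glaucoma):
    """Fraction of (healthy, glaucoma) pairs ranked correctly; ties count one half."""
    score = 0.0
    for h, g in product(healthy, glaucoma):
        if g > h:
            score += 1.0
        elif g == h:
            score += 0.5
    return score / (len(healthy) * len(glaucoma))


def auc_trapezoid(healthy, glaucoma):
    """Area under the empirical ROC curve by the trapezoid rule."""
    values = sorted(set(healthy) | set(glaucoma))
    # One threshold below every reading gives the point (1, 1);
    # the largest reading as threshold gives the point (0, 0).
    thresholds = [values[0] - 1] + values
    points = sorted(
        (sum(p > t for p in healthy) / len(healthy),
         sum(p > t for p in glaucoma) / len(glaucoma))
        for t in thresholds
    )
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


print(auc_pairwise(healthy, glaucoma), auc_trapezoid(healthy, glaucoma))
```

The two numbers agree up to rounding. Readings that carried no information about glaucoma would give a value near one half, matching the diagonal line.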
Of course this is all much easier to follow with pictures and the original article provides these. The authors also discuss other examples and applications of the ROC curves to diagnostic problems.
DISCUSSION QUESTION:
Do you think your friendly glaucoma doctor has ever heard of an ROC curve?
Note: Chance News Copyright (c) 2000 Laurie Snell This work is freely redistributable under the terms of the GNU General Public License as published by the Free Software Foundation. This work comes with ABSOLUTELY NO WARRANTY.