Chapter 2

Simpson's Paradox



Introduction

One of the most basic and important of all statistical topics is the study of possible relationships between variables.  As we will see in this project, two variables may appear to be related in a certain way, but when a third, "hidden" variable is taken into account, the apparent association disappears, or even reverses direction.  This behavior is known as Simpson's paradox, and underscores the importance of the design of a statistical experiment, and particularly the search for possible confounding variables.


The Berkeley Graduate Admissions Study

In the fall of 1973, an observational study on possible gender bias was conducted at the University of California, Berkeley.  In that year, there were 12,763 applicants for graduate admission; the following two-way table gives the data according to the variables outcome (admitted or denied) and gender (male or female).  The data are from reference [1].

          Admitted   Denied
Male          3738     4704
Female        1494     2827

Of course, it's hard to draw any conclusions about the question of gender bias from this table, because different numbers of men and women applied for graduate admission.  Clearly, we should work with percentages instead.

1.  Construct a two-way table that gives the percentages of men admitted and denied, and the percentages of women admitted and denied. Sketch the corresponding histogram.

From your table in Exercise 1, you should have observed that approximately 44% of men were admitted, but only about 35% of women were admitted.
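One way to check your table from Exercise 1 is a short Python sketch like the one below. The counts come from the two-way table above; the names `counts` and `row_percentages` are illustrative choices, not part of the original study.

```python
# Counts from the two-way table above (all departments, Fall 1973).
counts = {
    "Male": {"Admitted": 3738, "Denied": 4704},
    "Female": {"Admitted": 1494, "Denied": 2827},
}


def row_percentages(table):
    """Convert each row of counts into percentages of that row's total."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: 100 * n / total for col, n in cells.items()}
    return result


percentages = row_percentages(counts)
for gender, cells in percentages.items():
    # Round only for display; the stored values keep full precision.
    print(gender, {outcome: round(p, 1) for outcome, p in cells.items()})
```

Each row is divided by its own total, so the percentages in a row sum to 100 and the two genders can be compared even though different numbers of men and women applied.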

2. Do you believe that there was gender bias in graduate admissions at UC Berkeley in 1973?

3. Can you think of possible causes for the discrepancy in admission rates other than gender bias?

One factor that you probably thought of in the last exercise was the qualification of the applicants.  Naturally, a discrepancy in admission rates could result if the women, as a group, were less qualified than the men, in terms of college grades or standardized tests. In fact, however, there was no significant difference between the qualifications of the men and the women, as groups.

4. In light of this information, answer questions 2 and 3 again. 

Let us now introduce a new variable that may help explain the data.  At UC Berkeley, as in most universities, decisions about graduate admission are made at the department level.  In 1973, UC Berkeley had 101 different graduate departments, but for simplicity, we will consider only the six largest departments (which collectively account for 4486 of the applicants).  The following three-way table presents the admissions data according to the variables department (A, B, C, D, E, F), gender (male, female), and outcome (admitted, denied).  The table is adapted from data in reference [2].

                     Male                 Female
Department   Admitted   Denied     Admitted   Denied
A                 512      313           89       19
B                 313      207           17        8
C                 120      205          202      391
D                 138      279          131      244
E                  53      138           94      299
F                  22      351           24      317

5. Once again, construct the three-way table that gives the percentages of men admitted and denied, and the percentages of women admitted and denied for each department.

6. Construct the two-way table (both with counts and percentages) for the variables outcome and gender. Sketch the corresponding histogram.

7. Construct the two-way table (both with counts and percentages) for the variables outcome and department. Sketch the corresponding histogram.

8. Construct the two-way table (both with counts and percentages) for the variables gender and department. Sketch the corresponding histogram.

A three-way table, such as the one above, contains all of the information of the three two-way tables (and indeed much more information).
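The following Python sketch illustrates this point: each two-way table of counts is obtained by summing the three-way table over the variable that is left out. The counts are those of the six-department table above; the helper `collapse` and the variable names are hypothetical and chosen only for this illustration.

```python
# Three-way counts from the table above: department -> gender -> outcome.
three_way = {
    "A": {"Male": {"Admitted": 512, "Denied": 313},
          "Female": {"Admitted": 89, "Denied": 19}},
    "B": {"Male": {"Admitted": 313, "Denied": 207},
          "Female": {"Admitted": 17, "Denied": 8}},
    "C": {"Male": {"Admitted": 120, "Denied": 205},
          "Female": {"Admitted": 202, "Denied": 391}},
    "D": {"Male": {"Admitted": 138, "Denied": 279},
          "Female": {"Admitted": 131, "Denied": 244}},
    "E": {"Male": {"Admitted": 53, "Denied": 138},
          "Female": {"Admitted": 94, "Denied": 299}},
    "F": {"Male": {"Admitted": 22, "Denied": 351},
          "Female": {"Admitted": 24, "Denied": 317}},
}


def collapse(table, keep):
    """Sum the three-way table over the one variable not named in `keep`.

    `keep` is a pair of names drawn from "department", "gender", "outcome".
    The result maps (label of keep[0], label of keep[1]) to a count.
    """
    totals = {}
    for dept, by_gender in table.items():
        for gender, by_outcome in by_gender.items():
            for outcome, n in by_outcome.items():
                labels = {"department": dept, "gender": gender, "outcome": outcome}
                key = (labels[keep[0]], labels[keep[1]])
                totals[key] = totals.get(key, 0) + n
    return totals


gender_by_outcome = collapse(three_way, ("gender", "outcome"))
department_by_outcome = collapse(three_way, ("department", "outcome"))
department_by_gender = collapse(three_way, ("department", "gender"))

for key, n in gender_by_outcome.items():
    print(key, n)
```

The reverse is not possible: the three two-way tables together do not determine the individual cells of the three-way table, which is why the three-way table carries more information.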

9. Based on your analysis, do you now believe that there was gender bias in graduate admissions at the University of California at Berkeley in 1973?

If anything, it appears that there may have been a slight bias against men.  This study is perhaps the most famous example of Simpson's paradox: an apparent association between two variables can disappear or even reverse when a third variable is taken into consideration.  The paradox results when the association between the two original variables is actually due to the fact that each is strongly related to the third variable. In this context, the new variable is often referred to as a hidden or confounding variable.  Thus, in the case of the Berkeley graduate study, we can see that Simpson's paradox was apparently a result of the following facts:

  1. There is a strong association between the outcome variable and the department variable.  Some departments were relatively easy to get into, while others were much harder to get into.
  2. There is a strong association between the gender variable and the department variable.  In 1973, women tended to apply to the harder departments while men tended to apply to the easier departments.
  3. Because of (1) and (2), there is a strong association between the gender variable and the outcome variable.  In 1973, women were admitted at a significantly lower rate than men.
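The three facts above can be checked numerically with a sketch along the following lines, using the six-department counts from the three-way table. All names (`data`, `pct`, `summary`, and so on) are illustrative assumptions. The last step also shows that each gender's pooled admission rate is the average of its departmental rates, weighted by the share of that gender's applicants in each department, which is how the pooled comparison can differ from the department-by-department comparison.

```python
# dept: (men admitted, men denied, women admitted, women denied),
# copied from the three-way table above.
data = {
    "A": (512, 313, 89, 19),
    "B": (313, 207, 17, 8),
    "C": (120, 205, 202, 391),
    "D": (138, 279, 131, 244),
    "E": (53, 138, 94, 299),
    "F": (22, 351, 24, 317),
}


def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


men_total = sum(ma + md for ma, md, _, _ in data.values())
women_total = sum(fa + fd for _, _, fa, fd in data.values())

summary = {}
for dept, (ma, md, fa, fd) in data.items():
    summary[dept] = {
        # Fact 1: how hard the department is to get into, both genders pooled.
        "dept_rate": pct(ma + fa, ma + md + fa + fd),
        # Admission rates within the department, by gender.
        "men_rate": pct(ma, ma + md),
        "women_rate": pct(fa, fa + fd),
        # Fact 2: share of each gender's applicants who chose this department.
        "men_share": pct(ma + md, men_total),
        "women_share": pct(fa + fd, women_total),
    }

# Fact 3: pooled admission rates over the six departments.
men_overall = pct(sum(v[0] for v in data.values()), men_total)
women_overall = pct(sum(v[2] for v in data.values()), women_total)

# The pooled rate equals the share-weighted average of departmental rates.
men_weighted = sum(s["men_share"] * s["men_rate"] for s in summary.values()) / 100
women_weighted = sum(s["women_share"] * s["women_rate"] for s in summary.values()) / 100

for dept, s in summary.items():
    print(dept, {name: round(value, 1) for name, value in s.items()})
print("pooled:", round(men_overall, 1), round(women_overall, 1))
```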

Simpson's paradox is named for EH Simpson, who studied the phenomenon in a famous 1951 paper (reference [4]).


The South African Health Study

Our second example comes from a South African health study reported by Christopher Morrell in the Journal of Statistics Education (reference [3]).  In the Birth to Ten Study, South African researchers were interested in identifying risk factors related to cardiovascular disease for children living in an urban environment.  The researchers first identified all single births in the Johannesburg/Soweto metropolitan area during a seven-week period from April to June, 1990.  During the first year of the study, the researchers collected information on prenatal care, birth, and development for as many of these babies as possible; this formed the first-year sample.  After five years, the researchers attempted to collect additional data on health, environment, and access to medical care for as many of the children in the first-year sample as possible.  This five-year sample was much smaller than the first-year sample, as you might expect, because some families moved away, others were not interested in participating, and so on.  The difference between the first-year and five-year samples needed to be addressed if the study was to be statistically sound.  The researchers needed to know that the five-year sample and the group of non-participants were similar with respect to the variables that were important for the particular study.  Thus, the next step for the researchers was a comparison of the first-year and five-year samples.

The table below is a two-way table for the first-year sample in terms of the variables medical aid status (whether or not the mother had medical aid, similar to health insurance, at the time of birth) and participation status (whether or not the child participated in the five-year study).

                  Participated   Did not participate
Medical aid                 46                   195
No medical aid             370                   979

10.  Construct the two-way table that gives the percentages of the five-year participants who did and did not have medical aid, and the percentages of the non-participants who did and did not have medical aid. Sketch the corresponding histogram.

Perhaps surprisingly, your table in Exercise 10 should show that the non-participants were more likely to have medical aid than the participants.  
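A short Python sketch for checking this comparison is given below. The counts come from the two-way table above; the names `counts` and `aid_rate` are illustrative.

```python
# Counts from the two-way table above, grouped by participation status.
counts = {
    "Participated": {"Medical aid": 46, "No medical aid": 370},
    "Did not participate": {"Medical aid": 195, "No medical aid": 979},
}

# Percentage of each participation group whose mother had medical aid.
aid_rate = {
    status: 100 * cells["Medical aid"] / sum(cells.values())
    for status, cells in counts.items()
}

for status, rate in aid_rate.items():
    print(status, round(rate, 1))
```

Here the percentages are taken within each participation group (the columns of the table), because the question is whether participants and non-participants differ in medical aid status.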

11. Now that you are familiar with the general problem of confounding variables and Simpson's paradox, can you think of one or more new variables that might help explain the results in Exercise 10?

If you identified race as a possible confounding variable, then you were right.  The following three-way table presents the data in terms of the variables medical aid status and participation status (as defined previously), and race (classified simply as white and black).

                               White                                Black
                  Participated   Did not participate   Participated   Did not participate
Medical aid                 10                   104             36                    91
No medical aid               2                    22            368                   957

12. Construct the three-way table that gives the percentages of the five-year participants who did and did not have medical aid, and the percentages of the non-participants who did and did not have medical aid for each race.

13. Construct the two-way table (both with counts and percentages) for the variables race and participation status. Sketch the corresponding histogram.

14. Construct the two-way table (both with counts and percentages) for the variables race and medical aid status. Sketch the corresponding histogram.

15. Explain the apparent elimination of association between medical aid status and participation status when race is considered.  In particular, address the following questions:

  1. Do you think that whites or blacks tend to have greater access to medical aid?
  2. Which racial group predominated in the original first-year sample?
  3. Which racial group participated to a greater extent in the five-year study?
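After working through Exercises 12 through 15 by hand, you may want to check your figures with a sketch like the one below. It uses the counts from the three-way table above and computes, for each race, the medical aid percentage within each participation group, the overall medical aid percentage, and the participation percentage. The names `data`, `pct`, `aid_rate`, `overall_aid_rate`, and `participation_rate` are assumptions made for this illustration.

```python
# race -> participation status -> (with medical aid, without medical aid),
# copied from the three-way table above.
data = {
    "White": {"Participated": (10, 2), "Did not participate": (104, 22)},
    "Black": {"Participated": (36, 368), "Did not participate": (91, 957)},
}


def pct(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


aid_rate = {}            # % with medical aid, by race and participation status
overall_aid_rate = {}    # % with medical aid, by race
participation_rate = {}  # % who participated in the five-year study, by race
race_totals = {}         # size of each racial group in the first-year sample

for race, by_status in data.items():
    aid_rate[race] = {
        status: pct(aid, aid + no_aid)
        for status, (aid, no_aid) in by_status.items()
    }
    race_total = sum(aid + no_aid for aid, no_aid in by_status.values())
    with_aid = sum(aid for aid, _ in by_status.values())
    participated = sum(by_status["Participated"])
    race_totals[race] = race_total
    overall_aid_rate[race] = pct(with_aid, race_total)
    participation_rate[race] = pct(participated, race_total)

for race in data:
    print(race, race_totals[race],
          {s: round(r, 1) for s, r in aid_rate[race].items()},
          round(overall_aid_rate[race], 1),
          round(participation_rate[race], 1))
```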


Additional Study

In this project, we have discussed the concept of the statistical association between two variables in an informal way.  For a deeper, quantitative understanding of this concept, please see Chapters 13, 14, and 15 in your text.


References and Resources

1. PJ Bickel, EA Hammel, and JW O'Connell.  "Sex Bias in Graduate Admissions: Data from Berkeley," Science, 187 (1975), 398-404.

2. David Freedman, Robert Pisani, and Roger Purves. Statistics, 3rd Edition, WW Norton & Company, New York (1998).

3. Christopher H Morrell.  "Simpson's Paradox: An Example From a Longitudinal Study in South Africa," Journal of Statistics Education, 7 (1999).

4. EH Simpson. "The Interpretation of Interaction in Contingency Tables," Journal of the Royal Statistical Society, Ser. B, 13 (1951), 238-241.

