## Saturday, September 26, 2009

### Vaccine trial statistics

The US military has been conducting a large and expensive AIDS vaccine trial in Thailand. Recently, positive results were announced with much fanfare.

However, the actual results are not that impressive. 74 out of about 8000 in the placebo group became infected, as opposed to 51 out of about 8000 in the vaccine group. An approximate test of significance is to ask: if we flip a fair coin 125 times, what percentage of the time will the heads/tails split be at least as uneven as 74/51? By my calculation, 4.87%. So this is statistically significant at the 5% level, but just barely. Again by my calculations, if there had been one fewer infection in the placebo group this value would have been 5.89%, and if there had been one more infection in the vaccine group it would have been 6.09%; neither would have been significant.
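The coin-flip calculation can be reproduced with a short script; here is a sketch in Python that computes the exact two-sided binomial tail (the log-space pmf just keeps the factorials from overflowing):

```python
from math import exp, lgamma, log

def binom_pmf(k, n, p):
    """Binomial pmf, computed in log space so large n does not overflow."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + k * log(p) + (n - k) * log(1 - p))

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

# 125 total infections; under the null each one is equally likely to land
# in either group, so the split is Binomial(125, 1/2).  A split at least
# as uneven as 51/74 means 51 or fewer on one side, two-sided by symmetry.
p_value = 2 * binom_cdf(51, 125, 0.5)           # the 4.87% in the post

# One fewer placebo infection (51/73 of 124) or one more vaccine
# infection (52/74 of 126) pushes the value past 5%.
p_fewer_placebo = 2 * binom_cdf(51, 124, 0.5)   # about 5.9%
p_more_vaccine = 2 * binom_cdf(52, 126, 0.5)    # about 6.1%
print(p_value, p_fewer_placebo, p_more_vaccine)
```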

This illustrates one of the problems of picking an arbitrary line like 5% and declaring results on one side successes and on the other side failures. Another problem is that there have been many vaccine trials other than this one, all of which failed. Clearly, if you run enough trials, some will appear to succeed (at the 5% level) just by chance. I think there is a fair chance this is an example.

1. I am getting a different probability than you. By my calculation, the probability of a discrepancy that large (and in that direction), assuming the vaccine was no better than a placebo, is about 0.3%. This is calculated as the probability that, if you flip an unfair coin with a 74/8000 chance of landing heads 8000 times, it will come up heads 51 or fewer times. This gives 0.3%. It can be calculated in Excel as =BINOMDIST(51,8000,74/8000,TRUE). 0.3% seems pretty significant to me.
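For readers without Excel, the same one-sided binomial tail as =BINOMDIST can be sketched in Python:

```python
from math import exp, lgamma, log

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), in log space to handle n = 8000."""
    def pmf(i):
        log_choose = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
        return exp(log_choose + i * log(p) + (n - i) * log(1 - p))
    return sum(pmf(i) for i in range(k + 1))

# Equivalent of Excel's =BINOMDIST(51, 8000, 74/8000, TRUE): the chance of
# 51 or fewer infections if the true rate matched the placebo group's.
p_one_sided = binom_cdf(51, 8000, 74 / 8000)
print(p_one_sided)   # about 0.003
```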

2. I agree with James' analysis, because what Jonathan's calculation shows is a confidence-interval-like measure; one also has to take into account the chance of the other group (placebo) being no better than the vaccine. If I apply the simple Gaussian test of a standardized random variable, (error1 - error2)/sigma, where sigma is the usual square root of the trial-normalized binomial variance, I get 0.0389 (3.89%), but it is known that the Gaussian is not very good at low error values.
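That Gaussian figure can be reproduced as a standard two-proportion z-test; here is a sketch using the pooled rate for sigma (with rates this small, the unpooled version gives nearly the same answer):

```python
from math import erf, sqrt

n = 8000
p1, p2 = 74 / n, 51 / n
p_pool = (74 + 51) / (2 * n)                 # pooled infection rate
sigma = sqrt(2 * p_pool * (1 - p_pool) / n)  # sd of the difference in rates
z = (p1 - p2) / sigma

def normal_sf(x):
    """Upper-tail probability of the standard normal."""
    return 0.5 * (1 - erf(x / sqrt(2)))

p_two_sided = 2 * normal_sf(z)
print(z, p_two_sided)   # z about 2.07, p about 0.0389
```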

3. Regarding my calculation, I am using a two-sided test, which makes a factor-of-two difference in the result. I think a two-sided test is appropriate because, as the linked NYT article noted, some of the vaccine trials were halted early because the vaccine appeared to be harmful. So it is not a given that it is harmless.

For Jonathan's calculation, I think it would be better to use the average probability (51+74)/16000 for both groups and then find the chance of a difference of 23 or more.
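That suggestion — treat both groups as Binomial(8000, 125/16000) and ask how often the counts differ by 23 or more — can be sketched by direct convolution of two truncated binomial pmfs:

```python
from math import exp, lgamma, log

N, P = 8000, (51 + 74) / 16000   # pooled infection rate for both groups
CUTOFF = 200   # pmf is negligible this far above the mean (about 62.5)

def binom_pmf(k):
    """Binomial(N, P) pmf, in log space to avoid overflow."""
    log_choose = lgamma(N + 1) - lgamma(k + 1) - lgamma(N - k + 1)
    return exp(log_choose + k * log(P) + (N - k) * log(1 - P))

pmf = [binom_pmf(k) for k in range(CUTOFF)]

# P(|X - Y| >= 23) for independent X, Y ~ Binomial(8000, pooled rate)
p_diff = sum(pmf[i] * pmf[j]
             for i in range(CUTOFF) for j in range(CUTOFF)
             if abs(i - j) >= 23)
print(p_diff)
```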

4. What bothers me about this way of estimating the efficacy of a vaccine is that the result is a convolution of the rate of exposure to a (more or less) rare disease and the rate of the vaccine being effective. In other words, what was estimated were the joint probabilities P(exposure,result|vaccine) and P(exposure,result|placebo). I guess the goal should be to estimate P(result|exposure,vaccine) vs. P(result|exposure,placebo), no?

5. oTUHV...: Given that whether you're exposed is independent of whether you got the vaccine, P(exposure,result|vaccine) = P(exposure)*P(result|exposure,vaccine). So the two results differ by a constant factor. Also, more generally, whenever you estimate P(A|B) you can always break down A into some sub-events C, D, E, etc. But that doesn't make confidence tests like the one we are discussing any more mathematically complicated; it just means that it is harder to get a statistically significant sample than it would be if you were able to isolate the sub-event you're actually interested in (in this case, result given exposure).

6. James: I agree that my method is a bit biased, but I believe your adjustment overcompensates. Using your adjustment, you are effectively assuming that the vaccine has no effect, and thus that the best estimate for P(result|no effective treatment) is (51+74)/16000. Why is this better than my assumption, which is that we don't know, a priori, whether the vaccine has an effect, so we have to disregard the results from the group where it was used when estimating P(result|no effective treatment)?

7. Jonathan: Yes, the conditional probability P(result|exposure,group) can be obtained easily by dividing the measured joint probability estimate by P(exposure), but the two estimates of P(exposure) from the respective groups (placebo and vaccine) will have an error. Now this error plays a role in the hypothesis test we are considering here, correct?

8. oT, I believe the general rule is that random noise in the experiment, such as different exposure rates in the control and treatment groups or errors in the AIDS testing, will make it harder to detect a real effect but should not increase the chance of a false positive.

9. Jonathan, significance testing is conventionally done in terms of a null hypothesis. In this case the null hypothesis would be that the vaccine has the same effect as the placebo. You then compute the chance of obtaining the actual results (or more extreme ones) assuming the null hypothesis. If that chance is small, you reject the null hypothesis. So it is conventional to assume the vaccine has no effect when computing the significance.

Note this setup implicitly assumes you have no strong prior opinion about whether the null hypothesis is true or false. If there is other evidence it should be incorporated in some way.

10. oT, I forgot to add that it is conceivable that the vaccine is "working" by decreasing sex drive (and therefore exposure) or by making infection harder to detect.

11. Thanks. I haven't thought about these issues long enough and I may be committing some fallacy here, but here's my thought: Suppose the exposure rate was exactly the same in both groups, say 300 out of 8000. We would measure 51 and 74 infections. We then apply the standard line of thought: assume H0 is true and calculate the odds of obtaining this result by chance. Now suppose the exposure rate is normal random with a mean of 300 and a non-zero variance. We again measure 51 and 74 infections, but this time we are faced with explaining the measured result by twofold chance: the randomness of infection and of exposure. Is the overall variance in this case larger than in the first case? If so, the result would appear even less significant to me. Maybe I am really trapped in my thinking...

12. I don't believe the variance is affected. A Bernoulli-distributed variable (in this case, whether someone gets infected) has a variance that is independent of how many sub-variables that Bernoulli-distributed variable is a function of. It is p*(1-p). So in this case, assuming a 74/8000 rate, the variance is (74/8000)*(1-74/8000). It doesn't matter how many necessary conditions there are for an infection to take place.

Instead of breaking down the infection variable into 1) an exposure variable and 2) an infection-given-exposure variable, we could go even further and break the infection down into progressively more granular stages of "exposure": Perhaps, in order to get AIDS without treatment, you need to a) have your bodily fluid come into contact with someone else's bodily fluid (probability = 0.9, with nonzero variance), b) have the person whose bodily fluid you came in contact with be AIDS-infected (probability = 0.1, with nonzero variance), c) have at least 1/2 an ounce of the other person's bodily fluid enter your body (probability = 0.5, with nonzero variance), and d) have a compromised immune system at the time of fluid exchange (probability = 0.2056, with nonzero variance). The product of all of these mean probabilities is 74/8000.

Does the fact that we now have 4 variables, each with nonzero variance, all affecting whether an unvaccinated person gets AIDS, change the problem at all? I don't believe so. All that matters is that p = 74/8000.
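This claim is easy to check by simulation; here is a sketch in Python, where the composite event is the conjunction of the four hypothetical stages above (the stage probabilities are the illustrative ones from the comment, chosen to multiply to 74/8000):

```python
import random

random.seed(0)  # reproducible

# Hypothetical stages from the comment; their product is 74/8000.
stages = [0.9, 0.1, 0.5, 0.2056]
p = 1.0
for s in stages:
    p *= s

trials = 200_000
infections = sum(all(random.random() < s for s in stages)
                 for _ in range(trials))
rate = infections / trials

# The composite event is still Bernoulli(p): its variance is p*(1-p)
# no matter how many independent stages it factors into.
print(rate, p, p * (1 - p))
```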

13. oT, if you knew the actual exposure in both groups was 300, then I calculate the chance of obtaining a 51-74 split or one more uneven to be 2.67%. This is like starting with 600 balls (300 white and 300 black) and drawing 125 without replacement. The approximation I used in the post essentially assumed drawing with replacement (or a very large number of balls, which amounts to the same thing). With just 300 balls of each color, uneven splits are less likely because as you draw more balls of one color there are fewer of that color left, and the next balls are more likely to be the other color. With 8000 of each color I expected this to make little difference, and recomputing exactly I find for 8000 in each group the chance of a split of 51-74 or more uneven to be 4.78%, just a bit less than the 4.87% limit as the number of balls goes to infinity.
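Both without-replacement numbers follow from the hypergeometric distribution; a sketch using exact integer arithmetic via math.comb (two-sided by the symmetry of equal groups):

```python
from math import comb

def hyper_tail(n_white, n_black, draws, k):
    """P(at most k white balls among `draws` drawn without replacement)."""
    total = comb(n_white + n_black, draws)
    return sum(comb(n_white, i) * comb(n_black, draws - i)
               for i in range(k + 1)) / total

# 300 exposures per group: 600 balls, draw the 125 infections, chance of
# a 51/74 split or one more uneven (doubled for the two-sided version).
p_300 = 2 * hyper_tail(300, 300, 125, 51)      # about 2.67%

# 8000 per group: close to the with-replacement coin-flip answer.
p_8000 = 2 * hyper_tail(8000, 8000, 125, 51)   # about 4.78%
print(p_300, p_8000)
```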

14. While everyone is embroiled in the mathematical significance of this test, it is more of a concern that they used a "placebo" on some of the victims, err, test subjects. Scientists do need to find some way of measuring the effectiveness of drugs, but the group receiving the placebo in this case is likely to die without the real medication, assuming it has some effect. For people from a poor country without adequate, affordable medication, infection means a death sentence. Is a Thai citizen's life worth so much less than an American's?

15. Mary, drug testing in third world countries can be problematic but I don't think there is much reason for concern in this case. If you look at the website linked in the post it describes the trial protocol. Since this was a test for a vaccine everyone in the trial was believed to be healthy and uninfected at the start of the trial. They were all given instruction in how to avoid being infected by AIDS. They were then tested every six months during the trial. If they became infected they were given free treatment. So even the people receiving the placebo appear to have been better off than if they were not part of the trial.

16. I thought about this a bit more and have a clearer understanding now, particularly of why random noise works to the advantage of H0. Experiments with as many variables as possible normalized (such as exposure) can generally reveal more detail; however, positive results in unnormalized experiments are valid once "detected." Thanks for all your responses.