## Acupuncture, the P-Value Fallacy, and Honesty

Credibility alert: the following post contains assertions and speculations by yours truly that are subject to, er, different interpretations by those who actually know what the hell they’re talking about when it comes to statistics. With hat in hand, I thank reader BKsea for calling attention to some of them. I have changed some of the wording—competently, I hope—so as not to poison the minds of less wary readers, but my original faux pas are immortalized in BKsea’s comment.

### Lies, Damned Lies, and…

A few days ago my colleague, Dr. Harriet Hall, posted an article about acupuncture treatment for chronic prostatitis/chronic pelvic pain syndrome. She discussed a study that had been performed in Malaysia and reported in the *American Journal of Medicine*. According to the investigators,

After 10 weeks of treatment, acupuncture proved almost twice as likely as sham treatment to improve CP/CPPS symptoms. Participants receiving acupuncture were 2.4-fold more likely to experience long-term benefit than were participants receiving sham acupuncture.

The primary endpoint was to be “a 6-point decrease in NIH-CSPI total score from baseline to week 10.” At week 10, 32 of 44 subjects (73%) in the acupuncture group had experienced such a decrease, compared to 21 of 45 subjects (47%) in the sham acupuncture group. Although the authors didn’t report these statistics per se, a simple “two-proportion Z-test” (Minitab) yields the following:

| Sample | X | N | Sample p |
|---|---|---|---|
| 1 | 32 | 44 | 0.727273 |
| 2 | 21 | 45 | 0.466667 |

Difference = p (1) – p (2)

Estimate for difference: 0.260606

95% CI for difference: (0.0642303, 0.456982)

Test for difference = 0 (vs not = 0): Z = 2.60 P-Value = 0.009

Fisher’s exact test: P-Value = 0.017
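For readers without Minitab, here is a minimal sketch of the same two-proportion Z-test in plain Python. It assumes the unpooled (Wald) standard error, which is consistent with the Z and confidence interval printed above; the function name `two_proportion_z` is my own invention, and Fisher's exact test is not reproduced.

```python
import math

def two_proportion_z(x1, n1, x2, n2, z_crit=1.959964):
    """Unpooled (Wald) two-proportion z-test with a 95% CI for the difference."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    # Unpooled standard error of the difference between the two proportions
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = diff / se
    # Two-sided P-value under the standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    ci = (diff - z_crit * se, diff + z_crit * se)
    return diff, ci, z, p_value

# Week-10 responders: 32 of 44 (acupuncture) vs 21 of 45 (sham)
diff, ci, z, p_value = two_proportion_z(32, 44, 21, 45)
print(f"Estimate for difference: {diff:.6f}")
print(f"95% CI for difference:   ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"Z = {z:.2f}, P-value = {p_value:.3f}")
```

Nothing here goes beyond the arithmetic of the test; it only shows where the numbers come from.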

…

Wow! A P-value of 0.009! That’s some serious statistical significance. Even Fisher’s more conservative “exact test” is substantially less than the 0.05 that we’ve come to associate with “rejecting the null hypothesis,” which in this case is that there was no difference in the proportion of subjects who had experienced a 6-point decrease in NIH-CSPI scores at 10 weeks. Surely there is a big difference between getting “real” acupuncture and getting sham acupuncture if you’ve got chronic prostatitis/chronic pelvic pain syndrome, and this study proves it!

Well, maybe there is a big difference and maybe there isn’t, but this study definitely does not prove that there is. Almost two years ago I posted a series about Bayesian inference. The first post discussed two articles by Steven Goodman of Johns Hopkins:

I won’t repeat everything from that post; rather, I’ll try to amplify the central problem of “frequentist statistics” (the kind that we’re all used to), and of the P-value in particular. Goodman explained that it is logically impossible for frequentist tools to “both control long-term error rates and judge whether conclusions from individual experiments [are] true.” In the first article he quoted Neyman and Pearson, the creators of the hypothesis test:

…no test based upon a theory of probability can by itself provide any valuable evidence of the truth or falsehood of a hypothesis.

But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not often be wrong.

Goodman continued:

It is hard to overstate the importance of this passage. In it, Neyman and Pearson outline the price that must be paid to enjoy the purported benefits of objectivity: We must abandon our ability to measure evidence, or judge truth, in an individual experiment. In practice, this meant reporting only whether or not the results were statistically significant and acting in accordance with that verdict.

…the question is whether we can use a single number, a probability, to represent both the strength of the evidence against the null hypothesis and the frequency of false-positive error under the null hypothesis. If so, then Neyman and Pearson must have erred when they said that we could not both control long-term error rates and judge whether conclusions from individual experiments were true. But they were not wrong; it is not logically possible.

The P Value Fallacy

The idea that the P value can play both of these roles is based on a fallacy: that an event can be viewed simultaneously both from a long-run and a short-run perspective. In the long-run perspective, which is error-based and deductive, we group the observed result together with other outcomes that might have occurred in hypothetical repetitions of the experiment. In the “short run” perspective, which is evidential and inductive, we try to evaluate the meaning of the observed result from a single experiment. If we could combine these perspectives, it would mean that inductive ends (drawing scientific conclusions) could be served with purely deductive methods (objective probability calculations).

These views are not reconcilable because a given result (the short run) can legitimately be included in many different long runs…

It is hard to overstate the importance of *that* passage. When applied to the acupuncture study under consideration, what it means is that the observed difference between the two proportions, 26%, is only one among many “outcomes that might have occurred in hypothetical repetitions of the experiment.” Look at the Minitab line above that reads “95% CI for difference: (0.0642303, 0.456982).” CI stands for Confidence Interval: in the words of BKsea, “in 95% of repetitions, the 95% CI (which would be different each time) would contain the true value.” *We don’t know where, in the 95% CI generated by this trial (between 6.4% and 45.7%), the true difference lies; we don’t even know that the true difference lies within that interval at all* (if we did, it would be a 100% Confidence Interval)! Put a different way, there is little reason to believe that the “center” of the Confidence Interval generated by this study, 26%, is the true proportion difference. 26% is merely a result that “can legitimately be included in many different long runs…”
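The long-run meaning of the interval can be illustrated with a small simulation. This is only a sketch under stated assumptions: the "true" response rates of 60% and 50% are hypothetical values chosen for illustration (nobody knows the real ones), the group sizes match the trial, and the interval is the same Wald interval used in the Minitab output.

```python
import math
import random

def wald_ci(x1, n1, x2, n2, z_crit=1.959964):
    """95% Wald interval for the difference between two observed proportions."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    d = p1 - p2
    return d - z_crit * se, d + z_crit * se

def coverage(true_p1, true_p2, n1=44, n2=45, reps=20000, seed=1):
    """Fraction of simulated trials whose interval contains the true difference."""
    rng = random.Random(seed)
    true_diff = true_p1 - true_p2
    hits = 0
    for _ in range(reps):
        # Simulate one hypothetical repetition of the trial
        x1 = sum(rng.random() < true_p1 for _ in range(n1))
        x2 = sum(rng.random() < true_p2 for _ in range(n2))
        lo, hi = wald_ci(x1, n1, x2, n2)
        if lo <= true_diff <= hi:
            hits += 1
    return hits / reps

# Hypothetical true response rates, NOT estimates from the study
rate = coverage(0.60, 0.50)
print(f"Share of intervals containing the true difference: {rate:.3f}")
```

Each simulated trial produces a different interval, and roughly 95% of them contain the assumed true difference. The simulation says nothing about whether any one particular interval, such as the one from this study, is among the lucky ones.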

Hence, the P-value fallacy. It is that point—26%—that is used to calculate the P-value, with no basis other than its being as good an estimate, in a sense, as any: it was observed here, so it can’t be impossible; when looked at from the point of view of whatever the true proportion difference is, it has a 95% chance of being within two standard deviations of that value, as do all other possible outcomes (which is why we can say with ’95% confidence’ that the true proportion is within two standard deviations of 26%). You can see that the CI is a better way to report the statistic based on the data, because it doesn’t “privilege” any point within it (even if many people don’t know that), but the CI will also steer us away from being wrong only in “the long run of experience.” CIs, of course, will be different for each observed outcome. The P-value should not be used at all.

Now let’s look at a graph from the acupuncture report:

Hmmm. I dunno about you, but at first glance what I see are two curves that are pretty similar. They differ “significantly” at only two of the six observation times: week 10 and week 34. Why would there be a difference at 10 weeks (when the treatments ended), no difference at weeks 14 and 22, and then suddenly a difference again? Is it plausible that the delayed reappearance of the difference is a treatment effect? The “error bars” don’t even represent what you’re used to seeing: the 95% CI. Here they represent one standard deviation, not two, and thus only about a 68% CI. Not very convincing, eh?

OK, I’m gonna give this report a little benefit of the doubt. The graph shown here is of *mean* scores for each group at each time (lower scores are better). That is different from the question of how many subjects benefited in each group, because there could have been a few in the sham group who did especially well and a few in the ‘true’ group who did especially poorly. It is bothersome, though, that this is the only graph in the report, and that the raw data were not reported. Do you find it odd that the number of ‘responders’ in each group diminished over time, even as the mean scores continued to improve?

Just for fun, let’s see what we get if we use the Bayes Factor instead of the P-value as a measure of evidence in this trial. Now we’ll go back to the primary endpoint, not the mean scores. Look at Goodman’s second article:

Minimum Bayes factor = $e^{-Z^2/2}$

At 10 weeks, according to our statistics package, *Z* = 2.60. Thus the Bayes factor is 0.034, which in Bayes reasoning is “moderate to strong” evidence against the null hypothesis. Not bad, but hardly the “P = 0.009” that we have been raised on and most of us still cling to. The Bayes factor, of course, is used together with the Prior Probability to arrive at a posterior probability. If you look on p. 1008 of Goodman’s second paper, you’ll see that as strong as this evidence appears at first glance, it would take a prior probability estimate of the acupuncture hypothesis being true of close to 50% to result in a posterior probability (of the null being true) of 5%, our good-ol’ P-value benchmark. Some might be willing to give it that much; I’m not.
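The arithmetic behind those numbers can be sketched in a few lines. The sketch assumes only the minimum Bayes factor formula quoted above and the standard odds form of Bayes' theorem (posterior odds = prior odds × Bayes factor); the function names and the list of priors are illustrative choices of mine, not Goodman's.

```python
import math

def min_bayes_factor(z):
    """Minimum Bayes factor for the null, exp(-Z^2 / 2)."""
    return math.exp(-z * z / 2)

def posterior_null(prior_null, bf):
    """Posterior probability of the null, given its prior probability and a Bayes factor."""
    prior_odds = prior_null / (1 - prior_null)
    post_odds = prior_odds * bf
    return post_odds / (1 + post_odds)

def prior_null_for_target(target_post_null, bf):
    """Prior probability of the null that yields the requested posterior probability."""
    post_odds = target_post_null / (1 - target_post_null)
    prior_odds = post_odds / bf
    return prior_odds / (1 + prior_odds)

bf = min_bayes_factor(2.60)
print(f"Minimum Bayes factor: {bf:.3f}")

# Illustrative priors for the null hypothesis (hypothetical values)
for prior in (0.50, 0.75, 0.90, 0.99):
    print(f"prior P(null) = {prior:.2f} -> posterior P(null) = {posterior_null(prior, bf):.3f}")

needed = prior_null_for_target(0.05, bf)
print(f"Prior P(acupuncture hypothesis) needed for a 5% posterior null: {1 - needed:.2f}")
```

With the minimum Bayes factor, the prior needed for a 5% posterior comes out in the neighborhood of 40% for the acupuncture hypothesis, the same ballpark as the figure quoted above. A skeptic who starts at a prior of 1% for the acupuncture hypothesis ends up with a posterior for the null that is still well above one half.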

### Now for the Hard Part

This has been a slog and I’m sure there are only 2-3 people reading at this point. Nevertheless, here’s a plug for previous discussions of topics that came up in the comments following Harriet’s piece about this study:

Science, Reason, Ethics, and Modern Medicine, Part 5: Penultimate Words

Science, Reason, Ethics, and Modern Medicine, Part 4: is “CAM” the only Alternative?

Science, Reason, Ethics, and Modern Medicine, Part 3

Science, Reason, Ethics, and Modern Medicine, Part 2: the Tortured Logic of David Katz

Science, Reason, Ethics, and Modern Medicine Part 1

Superb. At least from my mathematically-challenged point of view. Thanks a lot!

Hello

Great post. Should one interpret from the graph that this is a condition that gets better on its own? Is that what explains why the “sham” patients seem to get better, or is it placebo effect?

If that is the case the blip at the 10th week for the “sham” treatment is probably just a statistical fluctuation. Notice there is also a bit of a blip in the acupuncture curve as well…

Thanks

Gordon

Excellent post. I think the most frustrating thing about this topic is that this fallacy is not by any means exclusive to CAM research, it’s pervasive in almost all medical research.

I suppose the only major difference is that the prior plausibilities of most CAM studies are far lower than the rest of medical research, on average.

The problem here isn’t with the p-value, but with the test being used. We should be considering an equivalence test, to determine if the difference between the two treatments is *clinically* significant (i.e., that the difference is greater than some clinically relevant threshold).

Example: Let u1 be treatment mean for acupuncture, u2 be treatment mean for sham acupuncture. Rejecting Ho:u1-u2=0 only tells you that the two treatments aren’t identical. So what. Now try Ho:|u1-u2|>t, where t is a threshold beyond which we can claim clinical significance.

There’s an even more crucial flaw in the paper’s statistics. It’s stated that the primary endpoint was week 10, but *when* was that determined? It’s arguably logical since that was the duration of treatment, but it’s arguably even more logical to look a bit after that. Similarly, why (and WHEN) 6? Why not simply look at whether the response was different?

I have a sneaking suspicion that 10 and 6 were selected as the primary endpoint only after looking at Figure 3. 5 weeks, 5 points, 15 weeks, 7 points, etc. – none of the other possible comparisons would have been significant. Which means that they were actually doing multiple comparisons, without controlling for them.

When the selected measure is the *only* one that would produce a significant result, it’s a huge red flag. If the authors very diligently address and explain it, that can mitigate the problem – but apparently they did not. Failing to discuss the issue is *prima facie* evidence of either incompetence or malfeasance.

Further to Scott’s point: what was the clinical trial registry number?

Secondly. You can’t analyse this crap with Fisher’s or a two proportion Z test. Probably needs something more like a GEE perhaps? And that’s assuming they didn’t fiddle the primary outcome which sounds a bit convoluted to me (hence the need for a clinical trials registration process)

A very informative piece concerning the meaning of the statistics.

Any comment about the apparent effectiveness of sham acupuncture, in what is normally a very resistant medical condition?

While I completely agree with the end argument here, you really butchered your statistical description:

You say: “CI stands for Confidence Interval: in 95% of hypothetical repetitions, the difference between true and sham acupuncture would be found somewhere between 6.4% and 45.6%. ” Not true! If 95% of repetitions landed in this interval, the difference between sham and acupuncture would be undeniable! What you can say is that in 95% of repetitions the 95% CI (which would be different each time) would contain the true value.

You say: “there is no reason to believe, from this study alone, that the “center” of the Confidence Interval, 26%, is any more likely to be the real (population) proportion than is any other point in that interval” Not true! The true value is more likely to be near the center of the CI than near the fringes.

I also don’t think you should be so dismissive of the P-value. I think you are implying that the P-value (0.009) should not be taken to indicate anything about the quality of the estimated difference (26%), which is correct. However, it does tell us how well the null hypothesis holds up. If we ignore the methodological problems in this study, it tells us that if sham and real acupuncture were equivalent, we would be very unlikely to get this result.

I believe a fairer statement of your point would be to say that the P-value should only be considered as it relates to the null hypothesis. If you want to understand the quality of the estimated value, you need to look at the CI.

@BKsea

Thank you! You are absolutely correct about the confidence interval, and stating it correctly bolsters the case I was trying to make. I knew that somewhere in the recesses of my mind, but it was late…I will correct it with an attribution to you.

I don’t entirely agree with your other points, but no time now; more later.

KA

” The true value is more likely to be near the center of the CI than near the fringes.”

Once you’ve calculated a CI, it’s not correct to make probability statements about the point estimate of a parameter (such as the population mean)—that would imply that the point estimate is a random variable. The point estimate is a constant that either is or isn’t in the interval. All values in a confidence interval are equally plausible.

I am rusty on my stats, I will admit, but amen to this:

I believe a fairer statement of your point would be to say that the P-value should only be considered as it relates to the null hypothesis. If you want to understand the quality of the estimated value, you need to look at the CI.

that same language was used in the original post “highly significant.” It is my understanding that p value does not and cannot show magnitude. It is either significant or not.

And the small sample size of the study is concerning.

Although for the life of me, I cannot understand why people demand a RCT for something like this.

To liesandstats: Don’t want to get into a semantics argument, but the point here is not plausibility, but likelihood. Yes, every point in the CI is equally plausible. However, every point outside the CI is also equally plausible. The point is that there is a high likelihood that the center of a CI is closer to the true mean than the outer limits of the CI are. You are correct that you don’t want to consider the true mean to be a random variable, which my original statement could imply.

OK, I’ll attempt to explain my disagreements with BKsea. First, however, responses to what others have brought up.

@ Gordon and Peter Moran: yes, there is apparently either a fairly robust placebo effect (which the authors of this trial acknowledged) and/or the condition seems to get better on its own for many patients, at least over the short run. Evidence from drug trials is here, here, and here.

@ Scott: I agree about the red flag. I didn’t mention it because I believe that the report, even if accepted at face value, fails to support its own conclusion. In her article, Harriet mentioned a “concern…that studies from Asian countries are prone to the ‘file drawer effect’ where negative studies are filed away rather than submitted for publication. [etc.]” She was, presumably, referring to the findings of this study. Its conclusion about the file drawer effect, however, was pure speculation. It is possible that more direct methods for producing desired results were at work, and may still be.

@ Pattoye: I agree completely with your comment about the frequentist fallacy being pervasive in all of medical research. I still hold a glimmer of hope that the silver lining of the cloud of “CAM” research may eventually be that, by its very absurdity, it will have exposed this more general problem. First discussed on SBM here and in 3 subsequent posts.

@ Edgar: Yes, the p-value can ONLY relate to the null hypothesis, since it assumes the null hypothesis to be true. On the other hand, “statistical significance” is an artificial construct in two different senses: first, it is arbitrary, based on the predetermined “alpha,” or tolerance for “Type I” error. If alpha is set at .01, then an eventual P of .05 will be deemed not “statistically significant.” Nor, for that matter, will a P of .01001. But a P of .01 will be so deemed, just as a P of .05 would be if the alpha had been set at .05, as it usually is for medical intervention trials.

The second sense in which it is an artificial construct is what liesandstats asserted in his first comment, ie, that statistical significance is not the same as clinical (real world) significance. This becomes all the more obvious when you consider that almost any trial can be made to yield a “statistically significant” P value merely by making n, the sample size, large enough. Almost any intervention will differ from the null to some extent, even a very small one, and since the standard error varies inversely with the square root of n, P can be made very small simply by making n very large. (By “simply,” I mean in principle, without considering the financial and logistical difficulties involved).
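A rough sketch of that point, assuming a hypothetical and clinically trivial difference of 51% versus 50% responders that is observed exactly at every sample size (real samples would wobble around it):

```python
import math

def two_sided_p(p1, p2, n_per_group):
    """Two-sided P-value of an unpooled two-proportion z-test with equal group sizes."""
    se = math.sqrt(p1 * (1 - p1) / n_per_group + p2 * (1 - p2) / n_per_group)
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical response rates: a one-percentage-point difference
sizes = (50, 500, 5000, 50000)
p_values = [two_sided_p(0.51, 0.50, n) for n in sizes]
for n, p in zip(sizes, p_values):
    print(f"n per group = {n:>6}: P = {p:.4f}")
```

The difference never changes, yet the P-value falls from far above 0.05 to far below it purely because n grows.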

Now to BKsea: My original statement, asserting that the “center” of (this) Confidence Interval is no more likely to be the real (population) proportion than is any other point in the interval was probably an overstatement, and I have hedged in my restatement. Nevertheless, I think that your assertion that “The true value is more likely to be near the center of the CI than near the fringes” is also an overstatement in the following sense. While it is true for the family of CIs representing the entire, hypothetical sampling distribution—the “long run” (that is what the Central Limit Theorem tells us, after all), it is not necessarily true for any one sample—the “short run.”

Since we have only one sample, we can’t possibly know whether that sample’s mean is near or far from the population mean. It is thus hazardous to imply or to assume, without other evidence, that this sample’s mean is a reasonable approximation of the population mean. Nevertheless, this is what is routinely done with frequentist inference, and it is done by citing the P-value—of which I (therefore) continue to be dismissive.

I think of the long run vs. short run issue here—and I admit to being on shaky mathematical grounds—as analogous to the long run vs. short run issue in coin tosses. Thus the P-value fallacy is analogous, in an inverse sort of way, to the gambler’s fallacy (for those who don’t know: while it’s true that in the long run, tosses of a fair coin will yield 50% heads, it is not true that any deviation from that outcome can be used to predict the short run, ie, the very next toss: even if I’ve tossed 8 heads in a row, there is still exactly a 50% chance that the next toss will come up heads).

Here are some quotations from a Bayesian textbook* that, I believe, support my opinions:

“When we take our random sample and calculate (y-bar), there is nothing left to attach a probability to. The actual interval either contains the true value or it does not. Only we don’t know which is true…Our confidence comes from the sampling distribution of the statistic. It does not come from the actual sample values we used to calculate the endpoints of the confidence interval…No probability statements can be made about the actual calculated interval…Scientists often take the confidence interval given by the frequentist statistician and misinterpret it as a probability interval for the parameter given the data. The statistician knows that this interpretation is not the correct one but lets the scientist make the misinterpretation. The correct interpretation is scientifically useless.” (pp. 197-8)

*Bolstad WM. Introduction to Bayesian Statistics. 2004. Hoboken, NJ: John Wiley & Sons, Inc.

@BKsea: The CI is a “random variable” of sorts (i.e., a random interval). The coverage probability is therefore defined for the entire interval. In other words, there’s no probability distribution across the CI, and therefore no one value is more probable of being close to the true value than any other. (As to semantics, we use the term probability here because the interval is the random quantity, whereas likelihood is reserved to probabilities associated with the parameter.)

“No probability statements can be made about the actual calculated interval.” Probabilities can only be taken with respect to random variables (or intervals). This doesn’t change the coverage probability, or confidence, and data will determine its size (because it will determine your sample variance, for example). The correct interpretation is not scientifically useless, but it does show the author’s bias. I prefer how Andrew Gelman, a notable applied Bayesian statistician, once said that he uses whatever works best for the given task.

As to the Bayes factor, and before anyone gets too excited, I recommend reading the paper Bayes Factors: What They Are and What They Are Not, by Lavine and Schervish (link). “Just because the data increase the support for a hypothesis H relative to its complement does not necessarily make H more likely than its complement, it only makes H more likely than it was a priori.”

In the case of a simple hypothesis, the P value can provide the same measure, as it is a monotone function of the Bayes factor. For a more complicated hypothesis, the authors have shown a logical flaw in using Bayes factors as measures of support (because Bayes factors are not monotone in the hypothesis). The point being that neither p-values nor Bayes factors are perfect, and their use and interpretation should be done with care.

I’m not sure about your underlying point, as it seems to imply that p values are not useful because they aren’t 100% accurate, which seems off. They are based on 95% confidence intervals, and with that in mind, they are very accurate and mathematically sound.

I do agree that the data presented does not support acupuncture as a cure for prostatitis. It does show a statistical difference between groups at one time point, but at all other time points they are similar. This single area of statistical difference is likely an alpha error (finding significance where there is none), and would not be there again if the study were repeated. Increased frequency of alpha errors is a risk of analyzing data at multiple time points, as at an arbitrary alpha of 0.05 you have a 5% chance of a false positive at each analysis point. We see this in this study.
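The multiple-time-point risk can be put into rough numbers. This sketch assumes the comparisons are independent, which repeated measurements on the same subjects are not, so it should be read as an illustration of the direction of the problem and not as an exact error rate for this study. The count of six comes from the six observation times on the graph.

```python
def familywise_error(alpha, k):
    """Chance of at least one false positive across k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

def bonferroni_alpha(alpha, k):
    """Per-test threshold that keeps the familywise error rate at or below alpha."""
    return alpha / k

k = 6  # observation times shown on the graph
fwer = familywise_error(0.05, k)
print(f"Chance of at least one false positive in {k} tests: {fwer:.3f}")
print(f"Bonferroni per-test threshold: {bonferroni_alpha(0.05, k):.4f}")
```

Under that independence assumption, six looks at alpha = 0.05 give better than a one-in-four chance of at least one spurious "significant" difference.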

If the authors were to conclude from this data that the intervention worked, that would be poor science indeed.

@ liesandstats:

Of course. But this is formally acknowledged only by Bayesian inference, not by frequentist inference.

It can provide the same measure, but it only does so when it is exceedingly small. My mention of the minimum Bayes factor in the text above was to show the difference between the two when derived from the same data, and the difference was striking. This is why, in Goodman’s second paper linked above, he advocated the Bayes factor–which is entirely objective–as a measure of evidence preferable to the P-value, which typically overstates the extent to which the alternative hypothesis accounts for the data (compared to the null).

@ Nicholas Fogelson:

P-values are not based on 95% CIs; it’s the other way around: 95% CIs are calculated from P-values (and variances; more precisely, CIs are calculated from sample means that are also translated to P-values, as explained below). The simplest way to understand the underlying point is to observe that although we can be “95% confident” that a 95% CI generated from the data in a trial contains the true population mean (or proportion), we don’t know where, in that interval, the true mean lies. That’s because the entire exercise is based on imaginary, infinite repetitions of the same trial that never take place (we’ve only got one trial). The 95% CI is mathematically sound because the “real” 95% CI, ie, that of the normal or near-normal curve representing the frequencies of all possible means of such repetitions, with the true population mean at its center, must, by definition, contain 95% of all possible outcomes. Since that’s true, a 95% CI generated by any outcome (including our 95% CI, with our sample mean at its center) must have a 95% chance of containing the true (population) mean.

But does that make us 95% confident that our sample mean is the true population mean? Hell no! We don’t know what the true mean is, only that we are 95% confident that it exists within the 95% CI generated by our sample. That is a far cry from being 95% confident that *our* sample mean is the true mean. Yet the way P-values are used implies exactly that: it is our sample mean, not the true population mean, that is the basis for the P-value, which is nothing more than where our sample mean sits on the null curve.

Thus: CIs good; P-values bad.

I agree with the rest of your comment. Yup, the authors did conclude from these data that the intervention worked:

If anyone is still following this thread, please know that I have just made small but important changes to the comment above, rendering it more technically correct.

KA

I see what you are saying, though we are both saying the same thing in a different way. A p value represents a 95% confidence interval for repetition of the study – not the result of this one iteration. If the p is < 0.05 then there is a 95% chance that if the study were repeated you would get the same binary result for your study. I do not look at this as p values being bad or good. They are what they are. I agree that confidence intervals provide more useful information most of the time. If people do not understand what they are, then that is bad for them, not for the p values.

I think the biggest problem with p is that people fixate on 0.05, when 0.06 is not much different, nor is 0.07. A 0.07 is more likely to be a power problem / false negative than a true negative. People don't get that sometimes.

You’re trying to connect p-values to Bayes factors, but that doesn’t change what the Bayes factor is, or how it should be interpreted (and contrary to what many think, frequentist vs Bayesian is a false dichotomy). Many people misunderstand Bayes factors, which is why I gave the interpretation above (for emphasis).

It can be argued that the Bayes factor is an objective measure of the evidence, in how it will change your beliefs, but in general it is not correct to interpret it as support for one hypothesis over another. To measure support for a hypothesis, and therefore compare it to the p-value, you need to consider prior odds.

For example, you call the Bayes factor “a measure of evidence”, which should be interpreted as a measure of how the evidence changes your beliefs, but you then use it as support against the null. This only works in the context of two simple hypotheses, and it relates directly to the p-value. In general, however, you don’t have an inference property in a Bayesian context without considering prior information.

“It can provide the same measure, but it only does so when it is exceedingly small.”

Celsius and Fahrenheit are related by a (strictly) monotone function (or one-one), and therefore provide the “same measure” of temperature–as long as you know how to interpret them. The same goes for the p-value and min Bayes factor for two simple hypotheses (although note that I’m not implying they are related by a linear function). By the logic presented in favour of the min Bayes factor, for two simple hypotheses, we should conclude that Celsius is a better measure of temperature (but this is arbitrary).

Table 2 in Goodman’s second paper is a perfect example of the one-one relationship between the p-value and min Bayes factor in this context. The point here is that you don’t need to calculate the min Bayes factor at all. You only need the categorization for “strength of evidence” provided–which are only rules of thumb, as there is no natural metric for interpreting the Bayes factor.
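The one-to-one relationship can be sketched directly. The code assumes a two-sided P-value from a normal test statistic and the minimum Bayes factor formula exp(-Z²/2) from the post; the helper name `min_bf_from_p` and the list of P-values are illustrative choices, not a reproduction of Goodman's table.

```python
import math
from statistics import NormalDist

def min_bf_from_p(p_two_sided):
    """Minimum Bayes factor implied by a two-sided P-value from a normal test statistic."""
    # Convert the two-sided P-value back to |Z|
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)
    return math.exp(-z * z / 2)

p_values = [0.10, 0.05, 0.03, 0.01, 0.001]
bfs = [min_bf_from_p(p) for p in p_values]
for p, bf in zip(p_values, bfs):
    print(f"P = {p:<5} -> minimum Bayes factor = {bf:.3f} (about 1/{1 / bf:.0f})")
```

Smaller P-values always map to smaller minimum Bayes factors, so in this simple setting either number can be recovered from the other. The mapping is far from linear, though: P = 0.05 corresponds to a minimum Bayes factor of roughly 0.15.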

Beyond the case of two simple hypotheses, interpreting the Bayes factor is more complicated, as is connecting it to the p-value, and should not be considered as support for a hypothesis. What Lavine and Schervish have shown is that p-values and Bayes factors suffer the same logical flaw–they are not coherent measures of support (they are not monotone in the hypothesis). This doesn’t mean they aren’t useful, only that you have to be careful in how you interpret them.