## The Plausibility Problem

From the very outset, the founders of Science Based Medicine have emphasized the importance of plausibility in the critical evaluation of scientific claims in medicine. What exactly does “plausibility” mean, and how should we apply it in science? My simple definition of plausibility would be “the likelihood that a premise is true.” The application in science is a little more complicated.

Consciously or unconsciously, we all consider plausibility in interpreting events in our lives. For example, if one of your coworkers showed up late for work and grumbled about a traffic jam, you would likely accept his story without question. If, instead, the same coworker attributed his tardiness to an alien abduction, you would not be so charitable. In each case, he has provided the same level of evidence: his anecdotal account. You are likely to accept one story and reject the other because of a perceived difference in the plausibility. The skeptic’s mantra “Extraordinary Claims Require Extraordinary Evidence” expresses this concept in a qualitative way.

Evidence-based medicine has traditionally ignored plausibility when interpreting the evidence for a medical intervention. Science-based medicine, as envisioned by the creators of this blog, includes plausibility when making these judgements.

Since experimental research employs rigorous controls and statistical criteria, you might assume that plausibility is not an issue; however, this is not entirely true. An article by John Ioannidis entitled “Why Most Published Research Findings Are False” is cited frequently as a reference for the impact of plausibility on the interpretation of research results. The article enumerates numerous factors leading to erroneous research conclusions. Most of them have been dealt with on this blog at one time or another. To me, the most eye-opening aspect of the paper was its quantitative approach to the influence of plausibility in interpreting positive research findings. I was never taught this approach in medical school, or in any other venue. When it comes to implausible hypotheses, the traditional P-value can be very misleading.

As good as Ioannidis’ article is, it is not easy reading for the statistically or mathematically challenged. What I attempt to do in this post is to demonstrate the importance of plausibility in graphic format, without a lot of complex math. If you can grasp the concepts in this post, you will have an understanding that many researchers, and consumers of research, lack.

**Research vs Reality**

When scientists design research studies, they hope to obtain a result that leads them to a better understanding of reality. In medicine, studies are frequently undertaken to investigate the effectiveness of a treatment strategy.

The result of such a study can intersect with reality in 4 ways:

- **True Positive** (TP): The treatment works, and the study demonstrates this effectiveness.
- **False Negative** (FN): The treatment works, but the study fails to demonstrate this effectiveness.
- **False Positive** (FP): The treatment does not work, but the study shows it to be effective.
- **True Negative** (TN): The treatment does not work, and the study suggests it is not effective.

We can display the possible study results in a simple grid:

We would hope that all our results would be True Positives or True Negatives. If that were the case, all research studies would be conclusive. Unfortunately, science is never that simple. There are ways to reduce, but not eliminate, False Negative and False Positive results. As we go through the elements of a hypothesis and study design, we will modify the table above so that the area of each of the 4 cells in the grid is proportional to the likelihood of each particular outcome.

Let’s start with a very basic study design. We are exploring a new treatment for disease X. We would consider the study positive if the treatment is associated with improved outcomes at some pre-specified level of statistical significance.

Now let’s look into the study details that can influence the relative likelihood of the 4 outcomes: TP, FN, FP, and TN.

**Study Power**

In designing a study, it is valuable to look at hypothetical results for a treatment that is effective, and for a treatment that is not effective.

We can first consider the possible results if, in reality, our treatment is effective. In this case we will obtain one of two possible results: **True Positive** or **False Negative**. These can be shown in a simple table.

There are design considerations that can increase the probability of obtaining a true positive result. The likelihood that a trial of an effective treatment will yield a TP result is called the study **Power** (also known as **Sensitivity**). Power can be expressed as a number between 0 and 1 (if you are more comfortable working in percentages, just multiply the Power by 100%). The higher the power, the more likely it is that we will confirm the effectiveness of our treatment. We can modify the table to include the **Power** of the study. The heights of the TP and FN bars represent the relative likelihood of each of the 2 possible outcomes.

The table above shows a study designed to find a TP result 80% of the time for an effective treatment. We would say that this study has a Power of 0.8. In medical research, 0.8 is considered quite a powerful study, and would usually require a large number of patients (a large sample size).
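Where do those sample sizes come from? As a rough sketch (assuming, for illustration, a two-sample comparison with known variance and a true effect of half a standard deviation; the post itself does not specify a design), power can be approximated with the normal distribution:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def two_sample_power(effect_size, n_per_group, z_crit=1.959964):
    """Approximate power of a two-sided, two-sample z-test at p = .05.

    effect_size is the true group difference in units of the (known)
    standard deviation; the tiny contribution from the opposite tail
    is ignored.
    """
    noncentrality = effect_size * sqrt(n_per_group / 2.0)
    return norm_cdf(noncentrality - z_crit)

# A moderate effect (half a standard deviation) needs roughly 63
# patients per group to reach the conventional power of 0.8:
print(round(two_sample_power(0.5, 63), 2))   # -> 0.8
```

Because the required sample size grows with the inverse square of the effect size, halving the effect you want to detect roughly quadruples the number of patients needed, which is why powerful studies of modest effects are large and expensive.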

**Specificity and P values**

In the previous section, we looked at possible outcomes for an effective treatment. Now we will look at possible outcomes for an ineffective treatment. If our treatment is ineffective, we will obtain one of 2 possible outcomes: **True Negative** or **False Positive**.

Researchers are careful to avoid false positive results. The parameter that defines the likelihood that a study will obtain a negative result if the drug is ineffective is known as the **Specificity**. Like power, specificity is expressed as a number between 0 and 1 (or between 0% and 100%). A .95 specific study will obtain a **True Negative** result for an ineffective treatment 95% of the time, and will obtain a **False Positive** result 5% of the time.

It is more common for researchers to talk about the false positive rate than about the specificity: researchers usually refer to the critical P-value rather than the specificity. The P-value and specificity convey exactly the same information, and each can readily be derived from the other: **Specificity = 1 – P**

When designing a study, the investigators select a critical P value. In order to accept a result favoring the study treatment, the P value must be at or below this predefined critical P value. For most studies in medicine, the critical P value is .05 (specificity .95). This means that if they are studying an ineffective treatment, there is a 1 in 20 chance that they will declare this ineffective treatment effective. The lower the critical P value, the less likely a study is to find a false positive, but the tradeoff is that it becomes harder to find a true positive. The table below depicts a study with a critical P value of .05. The relative size of the boxes represents the likelihood of FP and TN results when the study premise is false.
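The 1-in-20 figure can be illustrated by simulation. This sketch (an illustration, not part of the original post) runs many trials of an ineffective treatment, drawing both groups from the same distribution, and counts how often a z-test nonetheless declares significance at the .05 level:

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)

def null_study(n=50):
    """One trial of an ineffective treatment: both arms are drawn from
    the same distribution, so any apparent effect is pure noise."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    se = sqrt(stdev(a) ** 2 / n + stdev(b) ** 2 / n)
    z = (mean(a) - mean(b)) / se
    return abs(z) > 1.96        # "significant" at the .05 level

trials = 10_000
false_positives = sum(null_study() for _ in range(trials))
print(false_positives / trials)   # close to 0.05
```

Run enough null studies and about one in twenty comes up “positive”; no amount of statistical rigor in a single trial can push that rate to zero.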

Now let’s combine the possibilities for effective and ineffective treatments into a single table.

In the table above, the cells representing the TP, FN, FP, and TN outcomes are of approximately equal size. Below is a more realistic representation of a study with a Power of .8 and a critical P of .05.

Now we have done our study, and analyzed the results. We have achieved our target P value of .05. What then is the likelihood that this positive study is a true positive? Since our P value is .05, we can be 95% sure that our study is correct, right? WRONG! Remember, the P value tells us something about the False Positives. **The P value alone tells us nothing about the chances of a True Positive result! **This is a very critical concept, and one that is poorly understood by many people.

Since our study power was .8, then we can be 80% sure that our result is true, right? Unfortunately, no. The power tells us something about true positives relative to false negatives. We already know that our study is positive; what we want to know is whether it is a True Positive or a False Positive.

**Plausibility, Prior Probability, and Positive Predictive Value**

We need a parameter that can help us distinguish between a true positive and a false positive result. Such a parameter is known as the **Positive Predictive Value** (PPV). The PPV considers both the True Positives and False Positives, and calculates the probability that our positive result is a True Positive. As with the other parameters, PPV is expressed as a number between 0 and 1, and calculated by the expression:

PPV = TP / (TP+FP)

But there is one more critical piece of information necessary to calculate the PPV of our study.

The **Prior Probability** is the likelihood, prior to beginning the study, that our premise is true. Prior Probability is a quantitative assessment of plausibility. Like sensitivity and specificity, prior probability can be expressed as a number between 0 and 1. A prior probability of 0 means that there is zero chance that the premise is true. A prior probability of 1 means that the truth of the premise is a certainty. In our table, we can display prior probability as the relative widths of the columns labeled “Treatment effective” and “Treatment ineffective”.

For the study of our drug, let’s assume the drug has a prior probability of .5, meaning that going into our study, there is a 50/50 chance that the drug is effective. This is probably a reasonable estimate for a drug with a valid preclinical foundation and promising phase 2 data. In the table, a prior probability of 50% is reflected by the fact that the columns for “Treatment Effective” and “Treatment Ineffective” are of equal width.

Just to establish myself as Über Geek among SBM contributors, I have created a Positive Predictive Value Calculator widget. The values are pre-populated for the problem above, but you can enter any values you wish (provided they are between 0 and 1). Remember: Sensitivity = Power; and Specificity = 1 – P.
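For readers who prefer a few lines of code to a widget, the arithmetic behind such a calculator can be sketched like this (an illustration of the calculation, not the actual widget code):

```python
def positive_predictive_value(prior, power, critical_p):
    """Probability that a positive study result is a True Positive.

    prior: prior probability that the premise is true (0..1)
    power: sensitivity of the study (0..1)
    critical_p: the critical P value, i.e. 1 - specificity
    """
    tp = prior * power             # chance of a True Positive result
    fp = (1 - prior) * critical_p  # chance of a False Positive result
    return tp / (tp + fp)

# The pre-populated example: prior .5, power .8, critical P .05
print(round(positive_predictive_value(0.5, 0.8, 0.05), 2))   # -> 0.94
```

With a 50/50 prior, a positive result from a powerful study is quite convincing; the interesting behavior appears as the prior shrinks.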

In truth we can rarely, if ever, calculate an accurate prior probability for any given premise. Even though we cannot come up with a truly accurate prior probability, it is instructive to explore the interaction between prior probability and positive predictive value.

When it comes to low prior plausibility, homeopathy is the go-to treatment; a status it has earned for very good reason. For homeopathy to work, numerous principles of chemistry and physics would have to be wrong. Not just wrong, but spectacularly wrong. Established principles of physiology and pharmacology would also have to be wrong. Most of us would estimate the prior probability of any clinical claim of homeopathy for any indication as infinitesimal (irony intended).

Let’s assign a very generous prior probability of .05 (a 5% likelihood that homeopathy is effective). If we factor this into a study with a power of .8 and a critical P value of .05, we come up with a table that looks like this (note: not drawn to exact scale, to preserve readability).

By adding prior probability, we have changed the width of the columns to reflect the difference in probability of the various outcomes. Remember, the cells in the left column represent the possible outcomes for an effective treatment, and the cells in the right column represent the possible outcomes for an ineffective treatment. The area of each cell represents the relative likelihood of that result, given the parameters specified. A study with the parameters shown above, if it is fair and free of bias, has a low chance of finding any positive result. If the result is positive, however, the areas of the TP and FP cells are approximately the same, meaning that the two outcomes have a similar likelihood.

The positive predictive value of this study is reflected by the area of TP divided by the area of TP+FP. If we calculate the PPV of this study, we come up with a value of 0.46. So given a powerful study and a VERY charitable prior probability for homeopathy, a positive clinical trial result for homeopathy has a positive predictive value a little worse than a coin toss. If you lower the prior probability to a more accurate .01 (1%), the positive predictive value becomes .14.
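Both figures are easy to verify with the same PPV arithmetic:

```python
# Generous prior of .05, power .8, critical P .05:
tp = 0.05 * 0.8     # probability of a True Positive result
fp = 0.95 * 0.05    # probability of a False Positive result
print(round(tp / (tp + fp), 2))   # -> 0.46

# A more accurate prior of .01:
tp = 0.01 * 0.8
fp = 0.99 * 0.05
print(round(tp / (tp + fp), 2))   # -> 0.14
```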

**The Big Picture**

Everything we have discussed can be summarized in the figure below, which depicts the relationship between Positive Predictive Value and Prior Probability for research studies of various powers. Notice that at the left side of the curve, where prior probabilities are very low, the positive predictive value approaches zero. Even for a powerful study design, when the prior plausibility is low, the positive predictive value is low. For less powerful studies, the positive predictive value is even weaker. We do not need to agree on an exact value for the prior plausibility of homeopathy to see that any single positive clinical trial is virtually meaningless.

*The horizontal axis represents the Prior Probability. The vertical axis represents Positive Predictive Value. The blue, green, and red curves represent results for studies with powers of .8, .5, and .25, respectively. All results assume a critical P value of .05.*

The take-home message of the graph above is this: even for a well designed, powerful study, if the premise is highly unlikely, a positive result does not give us convincing evidence that the premise is true. For studies with weaker power, the results are even less persuasive.
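The curves in the figure can be tabulated with the same arithmetic; this sketch simply applies the PPV formula across a range of priors for the three powers shown:

```python
def ppv(prior, power, alpha=0.05):
    """Positive predictive value at a critical P value of alpha."""
    tp = prior * power
    return tp / (tp + (1 - prior) * alpha)

print("prior  power=.8  power=.5  power=.25")
for prior in (0.01, 0.05, 0.25, 0.50, 0.90):
    row = "  ".join(f"{ppv(prior, p):8.2f}" for p in (0.8, 0.5, 0.25))
    print(f"{prior:5.2f} {row}")
```

At a prior of .01, even the powerful design yields a PPV of only about .14; at a prior of .5, all three designs give a PPV above .8.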

So why do extraordinary claims require extraordinary evidence? Because for implausible claims, ordinary evidence is highly unreliable. A single positive study with a P value of .05 is ordinary evidence. For a very implausible hypothesis, a result of this sort is quite likely to be a false positive.

**Posted in:**
Basic Science, Clinical Trials

Very nicely explained. I was particularly glad to see the probabilities being correctly described in terms of the probability an experiment will yield result X given that reality is Y. So many scientists view it as the probability reality is Y given a particular experimental result X. When of course reality isn’t a random variable, but a fixed (albeit unknown) fact.

Good post. Reality is complicated by the fact that no treatment is either “effective” or “ineffective,” there are degrees. In oncology, for example, a treatment may extend survival by zero months, or by some positive number of months, or even by a negative number. We might pick some number from that spectrum and design the trial so that we have 80% power for that length of extension (or longer), but the fact is that there is some extension for which the power is near 100%, another for which the power is 50%, etc. The number we pick is totally arbitrary.

We also have to let this line of thinking inform our interpretation of results. Think of our prior belief about the length of survival in an oncology trial. Suppose the mode of that distribution is at 6 months, because that is the median survival that we expect in controls. That is, our prior belief is that the treatment is no different from control. If we then observe a median survival of 8 months, our posterior estimate needs to be between 6 (our prior) and 8 (our observation) whether we achieved statistical significance or not. We need to learn not to simply believe the null hypothesis just because we failed to reach a p-value of .05.

Excellent and very informative post. You have certainly expanded my understanding of the p-value. I have one typographical correction to make which I think is important. Several of the charts list the FP values as 0.5, although the article states that it is .05. Just thought I would mention it.

Nicely done.

These concepts are actually taught in med school, just in a slightly different context – that of the diagnostic value of tests.

Basically, if a test for a disease comes up positive, but the prevalence of the disease is very very low (pre-test probability, or plausibility here), then it’s more likely that it is a false positive test than that you’ve found that one incredibly rare case of the disease.
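In code, this diagnostic-test version of the same arithmetic might look like the sketch below (the 95%-sensitive/95%-specific test and the 1-in-1,000 prevalence are illustrative assumptions, not numbers from the comment):

```python
def post_test_probability(prevalence, sensitivity, specificity):
    """Probability of disease given a positive test (Bayes' theorem)."""
    tp = prevalence * sensitivity              # diseased and test-positive
    fp = (1 - prevalence) * (1 - specificity)  # healthy but test-positive
    return tp / (tp + fp)

# A very good test (95% sensitive, 95% specific) for a disease with a
# prevalence of 1 in 1,000:
print(round(post_test_probability(0.001, 0.95, 0.95), 3))   # -> 0.019
```

Even with an excellent test, a positive result in a low-prevalence population is overwhelmingly likely to be a false positive, exactly mirroring the clinical-trial case in the post.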

p.s. for some reason, Bayes theorem often confounds people, which is too bad, because it has applications all over the place.

Lol about Bayes theorem–I think lawyers certainly don’t want people to learn to evaluate evidence in light of what is already known. That would be the undoing of our jury system. In setting prior probability of guilt, how should a juror think about the so-called “presumption of innocence?”

Law is not science. For example it might be statistically valid to consider the crime rate among a defendant’s ethnic group or economic class in considering how likely he is to be guilty of a crime, but it is not legally acceptable, based upon standards of justice as opposed to statistics. Our principles of justice compel us to evaluate individual human beings as just that, not as members of a class. Hence, we are obliged to judge people based upon a uniform low “prior probability” of guilt — the presumption of innocence.

Several decades ago, I got tangentially involved in a similar calculation. The issue arose during an illicit drug court case. In this case, there was plenty of information on the percent of the public using that particular drug. It was low, maybe 1/100 or less. We also knew that the drug test result was positive; otherwise, the person wouldn’t have even been on trial.

So the question at hand was a conditional probability: Given a positive test result, what was the probability of a false positive? Like the example in that article, it came out to be about 50%.

So, that’s what we told the lawyer. I never did hear how this played out in court.

Thanks for the clarification. The juror has to “judge” a “low” prior probability. If it’s low enough, it would take a LOT of evidence to move it very much.

“For the study of our drug, lets assume the drug has a prior probability of .5, meaning that going into our study, there is a 50/50 chance that the drug is effective. This is probably a reasonable estimate for a drug with a valid preclinical foundation, and promising phase 2 data”

I read somewhere the FDA only approves about 20% of drugs submitted. Is that because they give some drugs submitted a lower prior probability, or is it because some work but have a high risk of side effects?

Art Malernee dvm

“I read somewhere the FDA only approves about 20% of drugs submitted.”

It’s much lower than that, but my statement was specifically about drugs that make it to phase 3 trials. Those have about a 50% success rate. Some fail due to lack of efficacy, and some due to safety concerns.

http://web.ebscohost.com/ehost/pdfviewer/pdfviewer?sid=ed4446fe-d234-45da-a4bd-4eb39cb17eba%40sessionmgr15&vid=2&hid=17

EBM argued that prior plausibility was generally unavailable by quantitative measures, and so represented subjective bias. David Weinberg’s discussion gives an excellent example of why subjective prior probabilities do not have to be that accurate to be useful for interpretation of experimental results. The plausibility of homeopathy is magnitudes less than .05. Even though .05 is magnitudes greater than the probability of homeopathy (think closer to 0.00005), use of the subjective overestimate would lead to better decisions from data than exhibited by many EBM advocates, including the Cochrane collaboration.

Your framing of the argument can also be used to show that if the investigator refuses to consider the prior probability in the decision, then by default they have assigned a prior probability of .5. The strongest argument for the Bayesian description of probability is that the prior probability will affect the outcome whether or not it is included in the estimation of positive predictive value. Failure to include it in the calculation is the same as assigning it a prior probability of .5.

“if the investigator refuses to consider the prior probability in the decision, then by default they have assigned a prior probability of .5”

Good point.

But in many cases of crime there is all kinds of evidence of a probabilistic nature, for example faint traces of DNA or DNA that is degraded. In the first case the DNA may be the result of casual contact; in the second case the probability that it was someone else may not be one in a zillion, but merely 1 in 500. If the defendant was picked from a database of millions of people on the basis of DNA found at the crime scene, then the ‘one in a zillion’ argument also weakens considerably.

Also ‘eye witnesses’ may be mistaken. If the evidence used to convict a defendant were so absolutely sure that calculating the odds would not make much of a difference, then how come so many innocents are convicted of serious crimes?

One of the reasons for false convictions is that some defendants have confessed, usually during a police interrogation. In the good old times of actual witch hunts a confession was all that was needed. And torture was also legal. But even nowadays there is a possibility of false confession.

In fact, probabilistic arguments play a big role in courts, I think, but in a very primitive form, namely where ‘almost sure’ is taken to mean ‘sure’, and where improbable (at least in the mind of the police, the prosecution, the judge and the jurors) is equated to impossible, even though a prudent judgement would have to consider the ratio of two small probabilities.

Going back to the case of diseases. Suppose that someone who is not in any known risk group is tested for HIV, not because he or she has any reason to fear infection, but for example because a life insurance company requires an HIV test.

Now the person tests positive. What can we conclude? The answer is: not much. We are dealing with a very low prior probability (nobody knows how low, but maybe 1:10,000) and a very low rate of false positives (including administrative errors through mixing up samples or through contamination, maybe 1:10,000). The posterior odds are then basically the ratio of two unknown numbers that are each only 0.0001 or less.

I find these calculations somewhat confusing. May I try to present a somewhat simpler calculation, that doesn’t need an online calculator?

As before

power = TP/(TP+FN)

it’s easier not to use the specificity (a tongue-twister for me) but its complement, the p-value:

p-value = FP/(FP+TN)

I abbreviate P = TP+FN, the total number of items/persons that are in reality positive in the total population tested.

Similarly, N = FP+TN, the total number of negatives.

I’ll come to the meaning of the “total population” later.

So we have

power = TP/P ; p-value = FP/N

So the quotient of both is

power/p-value = (TP/FP) / (P/N)

rearranging terms

(TP/FP) = (P/N) * (power/p-value)

the three bracketed terms have their own name:

P/N = prior odds

TP/FP = posterior odds

power/p-value = Likelihood Ratio or LR.

So:

posterior odds = prior odds times likelihood ratio

This is actually the Bayes formula in easier form.

Example 1. power = 0.80, p-value = 0.05, so the likelihood ratio is 0.80/0.05 = 80/5 = 16

If prior to the test the probability of ‘P’ was 0.5, then the prior odds are 0.5/0.5 = 1

Hence the posterior odds equal 16 (ratio of positive to negative); if you want to convert the odds 16:1 into a probability, you get of course 16/17.

Generally working with odds (ratio between probability of yes to probability of no) is easier in this context.

If the posterior odds are converted to probability you get the PPV.

Example 2. (Imaginary homeopathy as in the text). The prior probability is 0.05, so the prior odds are 0.05/0.95 = 1/19. The posterior odds are (with the same power and p-value as before) evidently 16/19, and expressed as a probability 16/(16+19) = 16/35 = 0.457… (that’s the 0.46 of the text).
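The odds form is easy to check in a few lines; this sketch reproduces Examples 1 and 2:

```python
def posterior_odds(prior_prob, power, p_value):
    """Posterior odds = prior odds x likelihood ratio (power / p-value)."""
    prior_odds = prior_prob / (1 - prior_prob)
    return prior_odds * (power / p_value)

def odds_to_prob(odds):
    """Convert odds back to a probability (here, the PPV)."""
    return odds / (1 + odds)

# Example 1: prior .5 -> prior odds 1; LR = .8/.05 = 16
print(round(posterior_odds(0.5, 0.8, 0.05), 2))                  # -> 16.0
# Example 2: prior .05 -> prior odds 1/19; posterior odds 16/19
print(round(odds_to_prob(posterior_odds(0.05, 0.8, 0.05)), 2))   # -> 0.46
```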

Example 3. (Real homeopathy) The prior probability that homeopathy is correct equals the probability that the total of chemistry and physics and pharmacology and biology and medicine of the last two centuries has a serious and large gap, hence is 10^-n where n is incalculably large. Now we do an experiment with p=10^-100 and a power somewhere between 0.001 and 1. The posterior odds are 10^-m, where m is only 100 or maybe only 97 less than the humungous number n. Evidently m is still humungous.

Objection: these prior odds express of course that we consider it unthinkable that homeopathy works. However I cheated a little. As a staunch frequentist I don’t like to call anything a probability that cannot be somehow measured in practice or deduced from practical measurements consisting of repeating a chance experiment (throwing dice or coins for example). We evidently cannot rerun history 10 to the power humungous times and count how often a scientific development like our own one manages to leave an enormous gap. Or maybe we should rerun the whole history of the cosmos and see in how many universes homeopathy works for whatever life forms develop in them. So assigning prior odds to homeopathy is a kind of sleight of hand that I don’t really like.

Now what should we think when a homeopathy experiment comes out that strong (it happens!)? I conclude that the experimenter must have made a gross error or committed active fraud. That a homeopath does something wrong is far more likely than an accidental outcome like that from an impeccable experiment, and Hume’s principle of the least miracle applies. (That principle says that in case of a report of a miracle you must consider what is more probable: the miracle itself or the possibility that the reporter or witness is mistaken, or just is lying.)

Now about testing medicines. We don’t know much about the prior odds. A reasonable guess might be obtained if we know how often that company has produced a comparable medicine that came to that stage and then finally was approved and not withdrawn later. But if the medicine is based on a totally new idea, then these odds may not apply.

Back to the ordinary testing of treatments. I simplify a bit further and I assume that the power is 0.50, i.e. a 50% chance that if the treatment really is somewhat useful the test will succeed. The likelihood ratio is 0.5/0.05 = 10.

So each independent test multiplies the prior odds by 10. Many such successful tests together multiply by as many factors of 10. So even when the original prior odds were low, the final posterior odds would be very high, say 1000 (i.e. 1000 to 1 odds that it works at least a bit).

I once read an utterly ridiculous book by Steven D. Unwin. He assigned to the existence of God prior odds of 1 (evidently he hadn’t been doing experiments to check that universes with and without God are roughly equally probable) and then he proceeded to convert various arguments pro and con into LRs equal to 1, 2, 1/2, 10 and 1/10. Multiplying all these numbers drawn from thin air (why 1/10 rather than 0 for the argument of unnecessary suffering?) he got posterior odds of 2 (all the time pretending that it was advanced mathematics with complicated formulas) and then claimed he had proved that the probability of God was 2/3.
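The repeated-testing arithmetic in the first part of this comment (each positive, independent test multiplies the odds by the likelihood ratio) can be sketched as:

```python
def odds_after_replications(prior_odds, likelihood_ratio, n_tests):
    """Each independent positive test multiplies the odds by the LR."""
    return prior_odds * likelihood_ratio ** n_tests

# Power .5 and critical P .05 give a likelihood ratio of 10 per test,
# so even sceptical prior odds of 1:10 reach about 1000:1 after four
# positive replications:
print(round(odds_after_replications(0.1, 10, 4)))   # -> 1000
```

This is why independent replication, rather than any single positive trial, is what moves a hypothesis from implausible to credible.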

For me it was the ultimate proof that Bayes can lead to humbug.

@Jan Willem Nienhuys

“For me it was the ultimate proof that Bayes can lead to humbug.”

Humbug in, humbug out.

It’s ironic and unfortunate if you really are a “staunch frequentist”: you’ve correctly recognised that the article’s PPV calculation is (almost) a Bayesian OR calculation and you’ve demonstrated that your inferential intuition is good (unlike most people who’ve been exposed to Frequentism) by identifying the correct interpretation of a homeopathy experiment. Perhaps the stumbling block is the physics thing. You say:

“I don’t like to call anything a probability that cannot be somehow measured in practice or deduced from practical measurements consisting of repeating a chance experiment (throwing dice or coins for example).”

but in that case you probably shouldn’t like to call anything at all a probability! [See Jaynes, esp. chapter 10, for the reasons.]

I included dice and coins, because the symmetry of these objects is a better guarantee for their behavior in chance experiments than actual experiments.

And there is a difference between something that at least in practice can be counted experimentally on the one hand and priors for which there is not even in principle a way to do any kind of experiment or counting at all on the other hand.

“Chance experiment”? It’s not really clear exactly what you’re trying to say but you do seem to be making the artificial and inappropriate distinction between classes of physical objects/experiments which is behind the “mind projection fallacies” which Jaynes warns against and which readers of ‘dice’, ‘coin’ and ‘playing card’ filled probability and statistics theory textbooks are understandably susceptible to. The already accumulated physical evidence – the stuff used to construct prior probabilities – leads me to believe that the symmetry of a typical homeopathic sugar pill and ordinary sugar pill pair is *more* exact than the symmetry of a typical ‘two-sided’ coin. That in turn leads me to construct a semi-quantitative prior for “homeopathic remedy works” which is closer to zero than the one I construct for “heads up more/less frequent than tails up”. I can test and count the former hypothesis just as well as the latter – although of course trying to use an RCT to do so would be absurd and futile cargo cult science.

“an RCT” –> “a CT (clinical trial)”

For homeopathy, you can count the total number of clinical trials that have been performed, and the number of times the results of a clinical trial have led to a change in a principle of physics or chemistry. That is your prior probability for a clinical trial of homeopathy.

JMB wrote:

How do you figure?

David, thank you for this very clear explanation of prior plausibility. We touched on this subject a little last semester in my first year of med school, but this really helped clarify the subject. I know we will be talking about this more in second year, and am bookmarking your post for future reference and to share with classmates. Thanks again!

In David Weinberg’s discussion he presented carefully drawn diagrams that represent certain basic concepts. The basic concepts of true positive, true negative, false positive, and false negative can be applied in both the frequentist view and the Bayesian view. I believe the basic divergence between the two approaches in the above discussion occurs when prior probabilities are introduced in the calculation of the true positive, false positive, true negative, and false negative fractions. If you compare the calculation of the fractions based on the power of the test and the selected p value, you will notice the fractions are the same after a prior probability of .5 is used in the calculation. That is a demonstration that if you ignore the step of incorporating the prior probability in a statistical decision, then by default you have assigned a prior probability of 0.5.

There are many other ways to describe the difference between the frequentist view and the Bayesian view.
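This cancellation can be checked numerically: with a prior of .5 the two columns of the grid have equal width, so the PPV computed from the four cell fractions is identical to the “prior-free” ratio power/(power + critical P):

```python
def cell_fractions(prior, power, alpha):
    """Joint probabilities of the four study outcomes."""
    return {
        "TP": prior * power,
        "FN": prior * (1 - power),
        "FP": (1 - prior) * alpha,
        "TN": (1 - prior) * (1 - alpha),
    }

cells = cell_fractions(0.5, 0.8, 0.05)
ppv_with_half_prior = cells["TP"] / (cells["TP"] + cells["FP"])

# Ignoring the prior altogether and using only power and critical P:
naive_ppv = 0.8 / (0.8 + 0.05)

print(round(ppv_with_half_prior, 4), round(naive_ppv, 4))   # both 0.9412
```

The factor of .5 multiplies both the TP and FP terms and cancels, which is exactly the sense in which leaving the prior out amounts to silently assuming it is .5.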

I cannot quite follow that. Given the numbers of the example, the test increases the prior odds by a factor of 16 (= 0.8/0.05). A decision might be: ‘I henceforth assume that this treatment works, because posterior odds better than 1:1 are enough for me.’

In that case apparently it is implicitly assumed that the prior odds exceeded 1:16.

Of course a decision might (incorrectly) be: ‘there is only a 5% chance that I’m wrong if I assume that this treatment works; that’s good enough for me, so I decide that the treatment works (i.e. it can be used).’

(The reason it is wrong is that the real outcome was ‘supposing that the treatment did not work, then a similar result would occur in 5% of this type of experiment’. Because about 99% of all scientists render this as ‘there is a chance of 5% that this is due to chance’, this incorrect interpretation sounds very logical to those who think the quoted sentence makes any sense.) (Mutatis mutandis if it is not a treatment that is tested for efficacy, but a person that is tested for a disease.)

In this case one can translate the decision into: posterior odds of 19:1 are reason enough for deciding to accept the test result.

This translates into the implicit assumption that the prior odds exceeded 19:16, i.e. 1:0.842.
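The odds arithmetic in this argument is straightforward to verify; a minimal sketch using the likelihood ratio 16 = 0.8/0.05 from the example:

```python
# Posterior odds = prior odds × likelihood ratio.
# With power 0.8 and alpha 0.05, a positive result multiplies the odds by 16.
lr = 0.8 / 0.05   # likelihood ratio, = 16

# To end up with posterior odds of at least 1:1, the prior odds
# must have been at least 1/16.
prior_needed_even = 1 / lr
print(round(prior_needed_even, 4))  # 0.0625, i.e. prior odds of 1:16

# To end up with posterior odds of at least 19:1 (95% posterior probability),
# the prior odds must have been at least 19/16.
prior_needed_95 = 19 / lr
print(round(prior_needed_95, 4))  # 1.1875, i.e. odds of about 1:0.842
```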

If a person decides on a course of action based on the outcome of certain chance experiments, then this is usually done because certain limits are exceeded.

Simply the information that the prior odds are ignored is not enough to tell you what the person’s limits were for taking a certain course of action. Only if you know these limits can you try to translate them into implicit assumptions about prior odds. I don’t see how one can arrive at the blanket implicit assumption ‘the prior odds are 1:1’ for any decision that ignores prior odds.

Let’s start with comment 1 by Scott: “So many scientists view it as the probability reality is Y given a particular experimental result X. When of course reality isn’t a random variable, but a fixed (albeit unknown) fact.”

That’s not at all how a Bayesian is taught to think. Say it is before a 100-meter fly race. You don’t actually believe Michael Fred Phelps has a “true” time for completing the race, do you? His time to finish is not “fixed”. Uncertainty makes it perfectly reasonable to end up with a probability function for his time. Consider coin tossing. Is the result of my next toss fixed? What on earth do I gain by thinking that it is?

Tom S, comment 2: “The number we pick is totally arbitrary.” This was about the effect size used to study the power of an experiment. I quibble that “totally” is a bit too strong. What actually happens is that power is a function of effect size. Often there is a minimum effect size below which we are not very interested, and typically you highlight what the power is at that point (as well as some more optimistic points).

This 50% prior stuff: if you have no prior, you have no posterior at all. Why discuss the properties of a thing that does not exist? If we instead are really talking about comparing classic and Bayesian decision making (and I think we are), we are missing some essential gear: the loss function.

Both of which are still examples of the outcome of an experiment being random, while reality is fixed – the probability distribution of Phelps’ results, and that an ideal coin will be 50/50. You’re not measuring a single number in these experiments, but it is still a fixed reality, which your experiment probabilistically gives a particular result for.

Going back to the example of the article, homeopathy either works or it doesn’t. There isn’t a random distribution of whether or not it works. I suppose one could pedantically assert that the distribution is either 1 at “works” or 1 at “doesn’t work,” but that’s trivial. Reality is one or the other; we just don’t know which with absolute certainty. There is only a random distribution about what we know about whether it works.

I’ll add the famous example for coin tossing. The Bayesian goes in with a beta prior, which acts as if they had observed A heads out of A+B tosses before they observed any tosses at all. (de Finetti proved that is your only admissible option, which may surprise you.) Say the classicist places bets on the next toss using the observed proportion of heads so far. It’s like a beta distribution where A and B go to zero, shaped rather like 1/(p*(1-p)). They are nearly certain p is either 0 or 1. After one toss, they would be willing to place any odds on the second toss being the same as the first. Such a strategy is admissible: as a Bayesian betting against them, I cannot adopt a strategy that is sure to make them lose money. However, if you feel such a strategy is stupid, good, since it’s taken a “know nothing” attitude to crazy extremes. Before any tosses the classicist is willing to give equal odds, but after that his decisions are not at all like a Bayesian going in with a prior A=B=10, who would also give even odds on toss #1.
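The contrast between the “know nothing” bettor and a Bayesian with a mild prior can be sketched in a few lines (the beta(10,10) prior is taken from the comment; function names are illustrative):

```python
# Two betting strategies after the first toss lands heads.
# The "classicist" bets the raw observed proportion of heads; the Bayesian
# uses a beta(A, B) prior, whose posterior mean after some tosses is
# (A + heads) / (A + B + tosses).

def bayes_p_heads(a, b, heads, tosses):
    """Posterior mean of P(heads) with a beta(a, b) prior."""
    return (a + heads) / (a + b + tosses)

def raw_p_heads(heads, tosses):
    """Observed proportion of heads (the maximum likelihood estimate)."""
    return heads / tosses

# After one toss (heads):
print(raw_p_heads(1, 1))            # 1.0 -- willing to give any odds on heads
print(round(bayes_p_heads(10, 10, 1, 1), 3))  # 0.524 -- barely moved from even
```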

The de facto priors we deduce classicists to be using for effect sizes are also disturbing. For any given effect size, they are almost sure it is actually larger. Hi Ho.

“There isn’t a random distribution of whether or not it works.” Baloney. We were just talking about what that distribution looks like. Remove random from the quoted sentences. What changed about how we act or learn?

I have a beta prior on the chances of a coin landing heads. I will have a beta posterior after tossing it one or many times. These are distributions. Your claim (I think): they are not random distributions. My claim: you aren’t saying anything that alters the math one whit. It’s just words.

The distribution is either a delta function at “doesn’t work” OR a delta function at “works.” Neither is a random distribution in any meaningful sense. Any given result has a random distribution. Reality does not.

This is really a critical bit of understanding the statistics – it’s very important to realize what is actually variable. What we KNOW has randomness in it. What the underlying truth is, does not.

When you claim that there is a random distribution of whether homeopathy works, you are claiming (by explicit definition) that sometimes when you try it it really does work, and other times it doesn’t. In reality, every single time you try homeopathy it doesn’t work – but sometimes it seems to. Reality has no random distribution, but the test does.

Here’s another example. Let’s say I measure the mass of an electron. My result is 509 +/- 5 keV. Does this mean that the actual mass of the electron is a Gaussian? No, it does not. The actual mass of the electron is a particular number. I just don’t know exactly what it is, and express my uncertainty in the form of that random distribution.

And whether we call it “random” or just a probability distribution function, or any of the other terms for the same thing, matters not a whit. The laws of the universe do not vary from experiment to experiment.

Oh, and to be explicit…

I’ve never taken the trouble to work out exactly where applying such a fundamentally broken concept of where the randomness lies will give wrong answers.

“Belief in the existence of “stochastic processes” in the real world; (i.e. that the property of being “stochastic” rather than “deterministic” is a real physical property of a process, that exists independently of human information) is another example of the Mind Projection Fallacy: attributing one’s own ignorance to Nature instead.” –E.T. Jaynes

JMB wrote:

What I think you are saying is that if the prior probability of the hypothesis is 0.5, then the posterior probability that a positive result is false is approximately equal to the prior probability that a positive result will be false—0.05 in David’s example—and, therefore, that if the investigator ignores the effect of the prior probability of the hypothesis and acts as if the p-value is the posterior probability of a false positive, he is, in effect, behaving as if his prior probability for the hypothesis is 0.5. (We could, of course, make an analogous statement about the probability of a true positive.)

While your observation is interesting, it is not generally true. It only worked out this way because of the values for the power and critical p-value David chose in his example. Plugging David’s values into Bayes’ Theorem,

(.05)(1 – .5) / [(.05)(1 – .5) + (.8)(.5)] = .06 ,

shows that the posterior probability of a false positive is .06, which is indeed close to the prior probability for a false positive of .05.

However, if we redo the calculation using a power of .4,

(.05)(1 – .5) / [(.05)(1 – .5) + (.4)(.5)] = .11 ,

we get a posterior false positive probability of .11, more than double the prior probability of a false positive. Ignoring the effect of the prior probability of the hypothesis in this case, and treating the critical p-value as if it were the probability of the positive result being false, is equivalent to having a prior probability for the hypothesis of 0.7:

(.05)(1 – .7) / [(.05)(1 – .7) + (.4)(.7)] = .05 .

So, it’s data dependent; not a general principle.
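JMB's three calculations can be verified with a few lines of code (a sketch; the function name is illustrative):

```python
# Posterior probability that a positive result is false, via Bayes' theorem:
#   P(false | positive) = alpha*(1-prior) / (alpha*(1-prior) + power*prior)
def posterior_false_positive(alpha, power, prior):
    fp = alpha * (1 - prior)   # rate of false positives
    tp = power * prior         # rate of true positives
    return fp / (fp + tp)

print(round(posterior_false_positive(0.05, 0.8, 0.5), 2))  # 0.06
print(round(posterior_false_positive(0.05, 0.4, 0.5), 2))  # 0.11
print(round(posterior_false_positive(0.05, 0.4, 0.7), 2))  # 0.05
```

The third call shows the equivalence JMB describes: with power 0.4, treating the p-value as the false-positive probability behaves like assuming a prior of 0.7.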


Tom S wrote:

The problem is exactly the opposite: p-values systematically overstate the evidence against the null hypothesis. Therefore, we have to learn to discount the importance of results that are significant at the 0.05 level by only a small margin.

Consider the results of a paired t-test, say from a crossover trial, comparing two treatments on 400 subjects. Say that the result of the test was t=2.18, p-value (2-sided) = 0.03. The web applet at Jeff Rouder’s website reveals that this result actually favors the null hypothesis over the alternative hypothesis by a factor of about two, as indicated by the calculated Bayes factors, in spite of the result being “statistically significant.”

In general, marginally significant p-values provide weak evidence against the null hypothesis or, as seen from my example, actually provide evidence in favor of the null hypothesis. Non-significant p-values provide even less evidence against the null (or more evidence for it).
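The factor-of-two claim can be roughly checked without the applet by using the BIC approximation to the Bayes factor in place of the exact JZS computation behind Rouder's site (a sketch; `bic_bf01` is an illustrative name, and the approximation is only in the same ballpark as the exact result):

```python
import math

def bic_bf01(t, n):
    """BIC-based approximation to the Bayes factor favoring the null
    for a one-sample / paired t-test with n subjects:
    BF01 ~ sqrt(n) * (1 + t^2/(n-1))^(-n/2)."""
    nu = n - 1
    return math.sqrt(n) * (1 + t * t / nu) ** (-n / 2)

print(round(bic_bf01(2.18, 400), 2))  # ~1.87: the "significant" p = 0.03
                                      # result favors the null roughly 2:1
```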

To compare the frequentist approach (without priors) to the Bayesian approach with priors, you have to extend the frequentist approach outside of its realm to consider prior and posterior probabilities. That does not mean that the p-value becomes equivalent to the posterior probability. The p-value becomes one factor in the calculation of the posterior probability. I don’t think you can use the p-value as the posterior calculated probability to derive a different prior probability.

Quite right, JMB. The Bayesian hypothesis testing approach is mangled just as badly (more so, really) in this PPV diagnostic test analogy. It’s just an analogy.

Rork wrote:

This has been addressed by Scott. I’d like to give an example that is worked out somewhat more. I have tossed a ‘die’ whose edge lengths were in the ratio 3:3:2.

I tossed it 200 times (I actually did!). It fell on one of its two 3×3 sides 163 times. Just for brevity, I’ll call such a throw a flat.

Now I think it is plausible that such a die has a precisely computable chance of falling on its flat side. But I am too stupid to compute that chance; I wouldn’t even know how to. It is possible that the chance crucially depends on the initial amount and direction of rotation when the die is released, if after that it doesn’t bounce often enough before it comes to rest.

What is that chance? I can compute that if the ‘real’ chance is about 0.754 I would find ‘163 flats or more in 200 throws’ only 2.5% of the time. Similarly, when the ‘real’ chance would have been 0.866, then ‘163 flats or less in 200 throws’ would have a chance of happening only 2.5% of the time. In this manner I translate the result of the chance experiment ‘163 flats in 200 throws’ into an interval [0.754, 0.866]. Another result, for example 162 flats or 164 flats – which I might obtain just as easily, I guess – would produce another interval. We call an interval like that a 95% confidence interval. It’s a bit of a funny calculation, because the boundaries are not defined in the same way. I think it is roughly correct to say that in 95% of the experiments one does, the calculated confidence interval will contain the real probability of a flat. I think this is not entirely correct in this case, because in most of these kinds of computations it is assumed that in all cases the variances are the same, and here they are not. All the same, the confidence interval is the outcome of a chance experiment. As Scott put it: when you measure the rest mass of an electron and you get a 95% confidence interval of 500 keV to 520 keV, then that interval summarizes your measurement and its degree of randomness. It is not the mass of the electron (as a matter of fact 510.9989… keV) that is the random variable, but all the random effects that influence the measurement. It’s that interval that has a certain probability of covering the real value.
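The endpoint calculations can be reproduced with exact binomial tail sums, using only the Python standard library (a sketch; the 0.754 and 0.866 values are the commenter's, and the function names are illustrative):

```python
from math import comb

def binom_tail_ge(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def binom_tail_le(n, k, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(0, k + 1))

# Lower endpoint: with p = 0.754, "163 flats or more in 200 throws"
# should happen only about 2.5% of the time.
print(round(binom_tail_ge(200, 163, 0.754), 3))

# Upper endpoint: with p = 0.866, "163 flats or less" is similarly rare.
print(round(binom_tail_le(200, 163, 0.866), 3))
```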

What would the Bayesian approach be to the question: what does this 163/200 tell us about the real flat chance? The Bayesian approach needs prior probabilities for all possible ‘flat chances’. That looks like a serious problem, because where do we get those prior probabilities?

Let’s make a table with pattern

real flat chance = p; prior probability that p is correct = x; probability P of getting the result 163/200, assuming p; posterior probability that p is correct, knowing that the result 163/200 has happened, proportional to xP

Now I can produce a P for every p, but in no case do I know x. So I simplify my table and give only p and P.

p= 0.915 P=0.0000004

p= 0.865 P=0.01

p= 0.815 P=0.07 (0.815 = 163/200)

p= 0.765 P=0.02

p= 0.715 P=0.0003

p= 0.665 P=0.0000001

Now if those unknown x (the prior probabilities) don’t vary too wildly, the combined table of the Px would still have a peak near 0.815. If I did 2000 tosses rather than this measly 200, the peak would be even narrower. This illustrates the Bayesian philosophy that if you repeat a chance experiment often enough, the fraction of successes converges to the real chance and the prior really doesn’t matter much. But then Ockham kicks in, who would think that this prior is one of the entia we can easily do without.
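The P column of this table is just the binomial likelihood of 163 flats in 200 throws at each candidate p, which is easy to recompute (a sketch; the function name is illustrative):

```python
from math import comb

def likelihood(p, n=200, k=163):
    """P(exactly k flats in n throws), given flat chance p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Reproduce the table: the likelihood peaks near p = 163/200 = 0.815
# and falls off steeply on either side.
for p in (0.915, 0.865, 0.815, 0.765, 0.715, 0.665):
    print(f"p = {p}: P = {likelihood(p):.7f}")
```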

A prior is OK when it refers to, for example, the prevalence of a disease, or when you have some kind of statistics about it. But when it is merely a thing in the mind, a plausibility based on gut feelings, well, ….

I really don’t know how this whole story would have to be repeated if it were told with 1 toss (‘heads’) of one symmetric coin. It would not give you any information at all.

What use is this information? Well, it tells me that there is no easy way to guess at the probabilities for an A:B:C die. Any simple argument, like the relative area of the sides, or area divided by distance, as indicative of the chance of coming to rest on that side, cannot explain the factor of 8 or 10 in the chance of falling on a single large or small face of my die. Of course such a simple argument might be correct after all, but then the result 163/200 would be an extremely freakish accident.

I would like to point out that homeopaths (especially my compatriot A.L.B. Rutten, who may have parroted it from someone else) think that Bayesian statistics can help them. They estimate their prior not from the history of science in the past 200 years, but from the experience of homeopaths in their consulting rooms. Need I say more?

Jan Willem Nienhuys wrote:

But no one knows the true prevalence of the disease; we only have estimates, so our knowledge of the prevalence is just “a thing in the mind,” too. You use the fact that it is our knowledge that is “random” (the correct term is “uncertain”) or ignore it, depending on the point you’re trying to make.

It is impossible to determine what “it” refers to in your sentence above. My best guess is that you are trying to make the point that the prior would be useless. If so, then you’ve got it completely backward. What is almost completely useless would be the single coin flip. A frequentist approach to estimate the probability p of “heads” for an ordinary coin that uses a very small number of trials fails miserably. For the example of a single flip, the maximum likelihood estimate of p will be either 1, if the coin lands “heads,” or 0, if the coin lands “tails.” Either result is obviously a terrible estimate. Why do we know this? Because we have a lot of prior knowledge about ordinary coins. And our estimate of p will clearly be improved by incorporating this prior knowledge into our estimate, as I now show.

What is a reasonable model for my prior knowledge about p? I doubt that p is precisely .5. An ordinary coin probably has a little bias in it, but I have no idea in which direction it is. This suggests that the distribution I choose to model my prior should be symmetrical, narrow, and centered at .5. Let’s say I’m 95% confident that p is in the interval (.48, .52). This suggests that the beta(1300,1300) distribution would be a reasonable model for my prior beliefs—it is symmetrical, has a mean of .5, and has 95% of its probability between .48 and .52. Additionally, it has the weight of 2,600 coin flips, reflecting the fact that I have a lot of prior knowledge about p, and therefore, if I do an experiment with a small number of trials, the weight of my prior information will dominate the weight of the experimental evidence in my computed estimate of p. In order to overcome the weight of my prior beliefs about p, I would need to conduct a large experiment—that is, gather a lot of new evidence about p. This is exactly how rational inference should work when our prior knowledge is strong.

So, let’s say we conduct our 1-flip experiment to estimate p and it lands “heads.” The frequentist estimate—the maximum likelihood estimate—is p=1. The Bayesian estimate is quite different. Combining our new experimental evidence and our prior knowledge results in the posterior distribution for p being the beta(1301,1300) distribution. Taking the mean of this distribution as our best estimate of p, we estimate p=1301/2601, which is .500 to three significant digits. The Bayesian estimate correctly reflects the fact that flipping a coin one time adds relatively little new information to our knowledge of p.

Frequentist estimate: p=1

Bayesian estimate: p=.5002

You be the judge.
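The beta-posterior arithmetic above can be sketched directly, with no special libraries (variable names are illustrative):

```python
# A beta(a, b) prior updated with h heads in n tosses gives a
# beta(a + h, b + n - h) posterior, whose mean is (a + h) / (a + b + n).
a = b = 1300            # prior "weight" of 2,600 imagined flips
heads, tosses = 1, 1    # the one-flip experiment, landing heads

posterior_mean = (a + heads) / (a + b + tosses)
print(round(posterior_mean, 4))  # 0.5002, versus the ML estimate of 1.0
```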

JMB, you completely misunderstood my post, but, perhaps, I completely misunderstood you, as well.

My weakness is that I don’t look at the issue from a theoretical viewpoint (in which we are trying to compare apples and oranges), but from a practical viewpoint. The practical problem that colors my discussion is how you let published reports (which may report frequentist statistics or Bayesian statistics) inform your practice of medicine. By the very nature of the way I have stated the problem, I am using a Bayesian view of probability. So really I am looking at how to incorporate a report using a frequentist approach into my Bayesian approach. The answer to my dilemma does not answer any theoretical questions. It only leads me to a better strategy of incorporating frequentist statistics into my Bayesian approach.

JMB wrote:

That’s right. More precisely, for a test one often knows the ratios TP:FN and FP:TN but one doesn’t know the ratio P:N (where in the context of the test P=TP+FN and N=TN+FP).

If the subject is sick people and some kind of test, maybe a screening test, then the ratio P:N (or P/(N+P), from which you can compute P/N) is the prevalence. I am quite willing to believe that exact numbers for the prevalence are difficult to get, and they may vary wildly between subgroups of the population.

If the subject is a treatment and the ‘test’ is a clinical trial of some sort, then the ratio P:N (the prior odds) gives the odds that the treatment works, based on your entire knowledge of, for instance, (1) the track record of the company or organisation that has invented the treatment, (2) the plausibility from the point of view of basic science, including the results of in vitro research and what is known about the applicability of in vitro results to clinical use, (3) what is known about earlier attempts to establish the effectiveness of the treatment or of ‘similar’ treatments, and (4) how well the referees of the journal may have checked the paper.

And apart from this one should (regrettably) also consider something that is not covered in probability theory at all. That is (5) the possibility that the ‘test’ is faked. That the persons who did the ‘research’ have been looking at many different ways to make the result look favorable, and selected the best. That the ‘blinding’ was poor. That the ‘placebo’ control was not really comparable. That the results of similar unfavorable research have been hidden from view. That inappropriate methods were used. That confounding factors were overlooked. And so on.

I think that it is difficult to put these items 1-5 into a form that allows for a neat calculation. Trying to do so would produce an illusion of exactness that simply isn’t there. I have always understood that the philosophy of SBM compared to EBM is that more attention should be paid to (2). This blog often discusses so-called CAM, and so (5) should also be given more weight. Personally I am convinced that a lot of research is going on in altmed that wouldn’t stand scrutiny from the (5) point of view.

JMB wrote:

Biomedical research findings are almost always reported using frequentist statistics. A frequentist hypothesis test is conducted, and the results are reported as some combination of an effect estimate (eg, improvement in the active treatment arm minus improvement in the placebo arm), hazard ratio, test statistic (eg, t or chi-squared value), standard error or confidence interval for the effect estimate or hazard ratio, and p-value from the test. In order to incorporate such frequentist test results into rational medical decisions, we must convert them into Bayesian statistics. This is easier to do than you might think.

Our aim is to convert a frequentist hypothesis test into a Bayesian hypothesis test. To do so we need to compute the Bayes factor, which is, in a sense, the Bayesian analogue to the frequentist’s p-value. The Bayes factor, BF, compares the likelihood of the data under two competing hypotheses: the null hypothesis (no effect; eg, the drug does not differ from placebo) vs an alternative hypothesis (eg, the drug is more (or less) effective than placebo). The Bayes factor, which ranges from zero to infinity, tells us how much the study results should alter our prior odds that the null hypothesis is true. Multiplication of the Bayes factor by our prior odds gives us the posterior odds that the null hypothesis is true:

(prior odds) × BF = (posterior odds) .

Bayes factors greater than 1 imply that the data favor the null hypothesis over the alternative; less than 1, that the data favor the alternative hypothesis over the null; and equal to 1, that the data do not favor either hypothesis over the other.

Note that when using the Bayes factor, the prior and posterior values must be expressed as odds, not probabilities. Odds are defined as follows:

Probability / (1 – Probability) .

For example, if we have a prior probability of 0.8, we have prior odds of .8/.2=4. Like the Bayes factor, odds range from 0 to infinity, with odds of 1 being equivalent to a probability of 0.5.

After we calculate our posterior odds, we can, if we wish, convert the result back into a probability as follows:

Probability = Odds / (1 + Odds) .
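These conversions are one-liners; a minimal sketch (the Bayes factor of 0.25 is an invented example, not taken from any study):

```python
def prob_to_odds(p):
    """Convert a probability to odds: p / (1 - p)."""
    return p / (1 - p)

def odds_to_prob(o):
    """Convert odds back to a probability: o / (1 + o)."""
    return o / (1 + o)

# Example: prior probability 0.8 that the null is true -> prior odds 4.
prior_odds = prob_to_odds(0.8)
print(round(prior_odds, 3))  # 4.0

# A study yielding a Bayes factor of 0.25 (data favor the alternative 4:1):
posterior_odds = prior_odds * 0.25
print(round(odds_to_prob(posterior_odds), 3))  # posterior probability 0.5
```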

So the question is, how do we convert frequentist hypothesis test results reported in biomedical research papers into Bayes factors? If the statistical test in the paper is a t-test or a result from linear regression, we can use the appropriate applet at Jeff Rouder’s web site. Or, if the test result is reported as a hazard ratio, we can use the method I’ve described on my website. Don’t be put off by the math. Most of it is there for background purposes. The Bayes factor is calculated using Equation 15, and Equations 12–14 are used to calculate numbers needed by Equation 15 if they are not reported explicitly in the paper. The web page includes a worked example under the heading “Example: Red Meat and Mortality.”

If you have any questions you can post them here or as a comment on my web site.