
Interpreting the Medical Literature

The science in science-based medicine includes all of science, but relies most heavily on the biomedical literature – published studies that collectively represent our scientific medical knowledge. The scientific basis of medicine is only as good as this body of knowledge and the manner in which it is interpreted and put into practice.

We often discuss on this blog how to evaluate individual studies – the need for blinding and randomization, the importance of study size to meaningful statistical analysis, and other features that distinguish a reliable study from a worthless one. This is important, but only half of the equation. We also at times discuss the medical literature as it relates to a specific medical question or set of related questions – does homeopathy work, or are statins beneficial for cholesterol reduction, for example. This requires not only the ability to judge individual studies, but a higher order analysis of the overall pattern of evidence among all relevant studies. Failure to do this, by focusing only on individual studies, results in the failure to see the forest for the trees.

It is this higher order analysis that I wish to discuss in this entry.

The evolution of medical evidence

First, it is important to realize that confident medical judgments or conclusions rarely emerge from single studies – confidence requires a pattern of evidence over many studies. The typical historical course for such evidence is first to begin with clinical observations or plausible hypotheses that stem from established treatments. Based upon this weakest form of evidence, preliminary or pilot studies are performed by interested researchers to see if a new treatment has any potential and is at least relatively safe. If these early studies are encouraging, then larger and larger studies, with tighter designs, are often performed.

In this early phase of research, results are often mixed (some positive, some negative) as researchers explore different ways to use a treatment, different subsets of patients on which to use the treatment, varying doses of medication, or other variables. Outcomes are also variable – should a hard biological outcome be used, or more subjective quality of life outcomes? What about combinations with existing treatments? Is there an additive effect, or is the new treatment an alternative?

It takes time to sort out all the possible variables of even what sounds initially like a simple medical question – does treatment A work for condition X? Often different schools of thought emerge, and they battle it out, each touting their own studies and criticizing the others’. This criticism is healthy, and in the best-case scenario leads to large, well-designed, multi-center, replicable consensus trials – trials that take into consideration all reasonable concerns, where both sides can agree upon the results.

What I just described has often occurred in medical practice – it is what happens when the system works. There are also many false turns and blind alleys. Sometimes fraud will confuse the literature until it is rooted out by later studies, or a vested interest will skew the studies in one direction, delaying a more definitive and accurate result from emerging from the collective literature. But it is important to realize that even when the process works well, it is very messy, and it takes years to play out.

That is why medicine is an inherently conservative profession – we cannot allow medical practice to blow in the wind of every new study or fad. It takes time to achieve confidence by working through all the messy variables.

But there’s more.

Systematic Reviews vs Meta-Analysis

Perhaps the most useful type of medical paper is the systematic review. These take a great deal of time and work, but when well done are invaluable. Perhaps the most significant contribution of the Cochrane Collaboration – an institution of evidence-based medicine – is to provide standards for systematic reviews and to publish many of them.

A good review will look at all high-quality studies, assess their strengths and weaknesses, and look at the overall pattern of results. For example, such a review may find that there are no studies of sufficient quality to make a recommendation – therefore the evidence is still lacking. Or, it may find that the better the study of a certain treatment, the smaller the effect, and that the best studies tend to be negative – a pattern consistent with the lack of a real effect (despite the fact that there may be individual positive studies). Such reviews may also point out inconsistencies in the literature, indicating the need for further research to resolve the conflicts.

Systematic reviews are not without their pitfalls, however. The reviewers still have to decide which studies to include, and this creates the potential to introduce bias in the results.

Even more concerning – systematic reviews are only as good as the literature they analyze, and there are systematic problems with the literature itself that I will discuss further below.

A meta-analysis is not a systematic review. Rather, this is the statistical technique of combining multiple studies into one large study by pooling their data. This technique is fraught with multiple sources of bias – which studies to include and how to standardize different populations and outcomes being the two most significant. It is important to realize that a meta-analysis is not new data, but simply a re-analysis of old data with new variables potentially introduced.
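To make the pooling idea concrete, here is a minimal fixed-effect, inverse-variance sketch. This is a toy illustration with hypothetical effect sizes – not the method of any particular published meta-analysis – and it ignores the random-effects models and heterogeneity assessments that real analyses require:

```python
import math

def fixed_effect_meta(effects, std_errors):
    """Pool per-study effect estimates using inverse-variance weights
    (a simple fixed-effect model: more precise studies count for more)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Hypothetical effect estimates (e.g. mean differences) from three trials,
# with their standard errors; the large middle trial dominates the result.
effect, se = fixed_effect_meta([0.30, 0.10, 0.25], [0.10, 0.05, 0.15])
```

Note that the choices hidden in those two input lists – which studies make it in, and how their differing populations and outcomes are standardized into comparable numbers – are exactly where the sources of bias described above enter.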

A 1997 study published in the NEJM looked at published meta-analyses and compared them to later large well-designed trials, and found that they failed to predict the outcome of the later trials 35% of the time. This is not very good, considering that 50% would be random chance. The probabilities of a false positive and a false negative were roughly equal in this study.

Biases in the Literature and the work of John Ioannidis

The literature itself has higher order structure – meaning that it is not simply a collection of individual studies. There are patterns in which types of studies tend to get published, and therefore added to the literature. The most basic such pattern is called publication bias, or the file-drawer effect. It refers specifically to the tendency to publish positive studies over negative studies. Researchers are more likely to submit positive studies, and journal editors are more likely to accept them. A recent analysis, for example, also showed that pharmaceutical companies are much more likely to submit studies favorable to their drugs for publication than negative studies. Similarly, it is likely that proponents of any treatment or system are more likely to submit for publication studies that support their specialties or favored modalities.

This has a huge effect on the literature. It affects every systematic review and meta-analysis, which typically will analyze only published studies. Cochrane reviews also encourage looking at unpublished data and paper presentations at meetings, but this is only done in the most complete of reviews. If the collection of published studies is skewed toward the positive, then reviews will likewise be skewed positive.
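As a toy illustration (simulated data, not drawn from any real literature), here is what the file-drawer effect does to a naive reading of the published record when the treatment in fact does nothing:

```python
import random
import statistics

random.seed(42)

def run_study(true_effect=0.0, n=30):
    """Simulate one small trial: return the observed mean effect and
    whether it reached nominal two-sided 'significance' (|z| > 1.96)."""
    sample = [random.gauss(true_effect, 1.0) for _ in range(n)]
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    return mean, abs(mean / se) > 1.96

all_effects, published = [], []
for _ in range(2000):
    mean, significant = run_study()      # the true effect is zero
    all_effects.append(mean)
    if significant and mean > 0:         # only "positive" results get written up
        published.append(mean)

full_literature = statistics.mean(all_effects)  # hovers near zero
published_only = statistics.mean(published)     # a substantial spurious "effect"
```

A review that sees only the published subset would conclude there is a real, sizable effect – which is why chasing down unpublished data and conference presentations matters.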

This is further supported by evidence that initial studies of new treatments tend to be positive and have large treatment effects, and as the literature on a particular question evolves there is a distinct trend toward shrinking effect sizes, with some disappearing completely. Taken at face value, what this means is that the medical literature simply cannot be trusted until we get to the mature phase of large, multi-center, randomized, and well-controlled studies – the most definitive study designs.

There are more subtle problems with the medical literature, as discussed by researcher John Ioannidis. In 2005 he published what has become a seminal paper on why most published studies are false. In the summary he writes:

In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true.

The reason for this is actually simple – science is about discovering new things, and most attempts at guessing what will turn out to be true in science are wrong. If most ideas are wrong, then even with a reasonable study design and threshold for statistical significance, you may get more false positives than true positives – in other words, most published studies will be wrong.

It is worth pointing out that this effect gets worse the more implausible the hypothesis. When studying truly speculative or unlikely treatments, the false positive to true positive ratio increases dramatically.
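The arithmetic behind both of these points can be sketched directly. If R is the prior probability that a tested hypothesis is true, α the significance threshold, and (1 − β) the statistical power, then the chance that a “significant” finding reflects a real effect is (1 − β)R / ((1 − β)R + α(1 − R)). The priors below are illustrative assumptions, not Ioannidis’ own numbers:

```python
def ppv(prior, alpha=0.05, power=0.8):
    """Positive predictive value: the probability that a statistically
    significant result reflects a true effect, given the prior probability
    that the hypothesis being tested is true."""
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

# If 1 in 10 tested hypotheses is true, over a third of "positive"
# findings are false, even with good power and the usual p < 0.05:
print(round(ppv(0.10), 2))    # 0.64

# For a highly implausible treatment (say 1 in 1000), nearly all
# positive findings are false positives:
print(round(ppv(0.001), 2))   # 0.02
```

This is also why a single p < 0.05 result for a wildly implausible treatment should move our confidence very little.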

However, it is also worth pointing out that this does not mean that we cannot trust the medical literature at all – after all, Ioannidis is comparing the medical literature to itself. What this really means is that, assuming later large definitive trials to be correct, earlier smaller trials are likely to be false, and more often than not falsely positive. The assumption here is that we can trust mature, well-designed and large trials.

Recently Ioannidis published another paper, this one more of a thought experiment than analysis of the literature. In it he writes:

This essay makes the underlying assumption that scientific information is an economic commodity, and that scientific journals are a medium for its dissemination and exchange. While this exchange system differs from a conventional market in many senses, including the nature of payments, it shares the goal of transferring the commodity (knowledge) from its producers (scientists) to its consumers (other scientists, administrators, physicians, patients, and funding agencies).

This is an interesting idea – thinking of the medical literature in economic terms. I think Ioannidis makes a convincing case that there is some insight to be gained from this approach.  He concludes that positive studies are much more valuable as a commodity than negative studies and therefore this should contribute significantly to the bias in the literature toward not only positive studies, but ones that make bold claims – the very claims that are most likely to later turn out to be incorrect. What this adds up to is yet another reason to be highly suspicious of early studies or bold new claims.

Of course, this analysis is only as good as the stated underlying assumption – is scientific information an economic commodity? One aspect of publication that Ioannidis did not address is peer review. Journals submit papers for review by experts, who have their own reputations to consider and may not gain from someone else publishing a hot new study. High-end journals use this process, and their editorial discretion, ostensibly to filter out bad studies. Yes, they want positive, high-impact publications – but their long-term reputation also depends upon these studies not being routinely overturned. This is analogous to the reputation of a judge who is consistently overturned on appeal.

How these competing interests (the desire for exciting press releases versus wanting published papers to hold up over time) play out likely differs for each journal, depending upon editorial competence and temperament. I do hold out hope, perhaps naively, that the best medical journals value peer review and reliability over headlines.

Conclusion

All of these trends and patterns in the published medical literature tend to point toward the same conclusion – early on in the research of any new idea in medicine, published studies are highly unreliable, with a huge bias toward positive results and exaggerated effect sizes. However, over time the research often works itself out, and highly reliable studies can be achieved (although no question in science is ever closed).

The work of Ioannidis and others also reinforces one of the core premises of SBM – that plausibility is an extremely valuable quality to take into consideration when evaluating any new idea or treatment. The less plausible a treatment, the less reliable early, small, or poorly controlled studies are likely to be, and the greater the bias towards publishing positive outcomes.

There is also a bright spot in all of this – the literature is not hopeless, and all of the problems discussed above have solutions. We need greater transparency in medical research, for example. This could be achieved by having a central registry of all clinical studies involving human subjects that requires the publication online of the study results. This would make all the data (not a biased subset of it) available to researchers for systematic reviews.

Journal editors need to publish more negative studies, and they need to be especially skeptical of new bold claims. Plausibility needs to be factored more prominently into evidence-based medicine – which is precisely what we mean by science-based medicine. More resources should be allocated to high quality systematic reviews.

The medical literature is in many ways our most important tool and resource. Understanding it at its highest level of organization is crucial to reaching reliable medical conclusions that can be applied to practice – which of course is the ultimate purpose of medical research.  Further, it is worth our time and effort to continue to study the literature at this top level with the goal of improving the quality and effectiveness of this critical tool.

Posted in: Clinical Trials, Science and Medicine


8 thoughts on “Interpreting the Medical Literature”

  1. delaneypa says:

    Thanks for the recap; the link “why most published studies are false” is broken.

  2. Meadon says:

    Great piece Steve… How, however, does one incorporate plausibility into one’s analysis in a rigorous manner? Homeopathy is clearly less plausible than, say, acupuncture, but how does one express this numerically, i.e. statistically? For example, Bayes’ theorem (as you no doubt know) discounts evidence for a hypothesis h by the antecedent probability that h is true. So Bayes tells us how to go about rigorously analyzing plausibility in the abstract, but what exact figure do you assign to, say, homeopathy or acupuncture? Or should we take plausibility into account, for lack of a better term, vaguely?

  3. Peter Lipson says:

    This can be problematic. For ideas that are completely implausible, well…

    But, take pulmonary embolism, for example. Over the years, this hard-to-diagnose condition has had a number of evidence-based algorithms applied to it.

    In many studies, clinical criteria (such as Wells’ criteria) were developed based on looking at previous data, and then tested further in other studies.

    E.g., in PIOPED I, prior probability (pre-test prob) was assigned based on clinical criteria, and the likelihood of PE was determined by combining pre-test prob and V/Q scan results, compared with pulmonary angiography. This allowed for a pretty accurate assessment of the prediction tools. IOW, rational clinical predictors were chosen, then tested against a gold standard.

  4. Meadon says:

    Sorry about the typos… I really should re-read my comments.

    Anyway, thanks Peter, that’s interesting – I didn’t know about that method. And I can also see how to deal with utterly implausible claims – e.g. homeopathy – but how about acupuncture? It’s certainly not impossible that sticking needles into skin has some sort of (probably nonspecific) benefit, so what prior probability would you assign? And what about herbal medicines? Clearly it’s plausible that some of them work (more so than acupuncture), so what figure do we assign there?

    Basically… while I agree entirely with the arguments offered here and elsewhere for taking account of plausibility, I worry about practical implementation.

  5. Nick Barrowman says:

    An interesting post. As a statistician who has worked on a number of systematic reviews and meta-analyses, I want to make a few comments.

    A good review will look at all high-quality studies …

    In most systematic reviews, all studies of a specified design (usually randomized controlled trials) are included, regardless of quality. There are several reasons for this, one being that it is very difficult to say just what we mean by quality. (It brings to mind the famous quote from a rather different context, “I can’t define it, but I know it when I see it!”) Various quality scales have been proposed, but there isn’t much consensus on the whole issue. The Cochrane Collaboration now recommends consideration of individual “risk of bias” items. But how can we then talk about “the best studies”? Many systematic reviews assess the quality of studies but then make little use of this information.

    [Meta-analysis] is the statistical technique of combining multiple studies into one large study by pooling their data.

    Note that most meta-analyses use summary statistics from published reports of studies rather than raw data. Occasionally so-called “individual-patient data” meta-analyses are performed, in which raw study data are pooled, but this is the exception rather than the rule.

    In 2005 [John Ioannidis] published what has become a seminal paper on why most published studies are false.

    Certainly Ioannidis’ paper has been widely cited, but not everyone agrees with his analysis. The online responses to his article are worth a look, particularly the response by Steven Goodman and Sander Greenland (who have written a paper disputing some of Ioannidis’ claims). Note that Ioannidis in turn provides a rejoinder. I don’t think the jury is in on this one.

Comments are closed.