Evidence in Medicine: Experimental Studies

Several weeks ago I wrote the first in a brief series of posts discussing the different types of evidence used in medicine. In that post I discussed the role of correlation in determining cause and effect.

In this post I will discuss the basic features of an experimental study, which can serve as a check-list in evaluating the quality of a clinical trial.

Medical studies can be divided into two main categories – pre-clinical or basic science studies, and clinical studies. Basic science studies involve looking at how parts of the biological system work and how they can be manipulated. They typically involve so-called in vitro studies (literally in glass) – using test tubes, petri dishes, genetic sequencers, etc. Or they can involve animal studies.

Clinical trials involve people. They are further divided into two main categories – observational studies and experimental studies. I will be discussing experimental studies in this post – studies in which an intervention is done to study subjects. Observational studies, on the other hand, look at what is happening or what has happened in the world, but do not involve any intervention.

Experimental Studies

The primary advantage of experimental studies is that they allow for the direct control of variables – in the hopes of isolating the variable of interest. Results are therefore capable of being highly reliable, although good clinical experiments are difficult to design and execute. When assessing a clinical trial here are the features to examine.

Prospective vs Retrospective

A prospective trial is one in which the treatment and the outcomes are determined prior to any intervention being done. Experimental trials are almost by definition prospective. A retrospective trial is one in which the data is gathered after the fact – taking patient records, for example, and looking at treatments and outcomes.

In a retrospective study you can try to account for variables, but you cannot control for them. It is therefore much more likely that there are confounding factors and the results are not as reliable. Also, retrospective studies can be biased by the way information is obtained – there can be a bias in the way patients are identified, for example.

Prospective trials are therefore considered superior to retrospective trials, which are at best preliminary in their conclusions.


Not all prospective trials are placebo-controlled, however. A non-controlled trial might identify potential subjects, give them all a treatment, and then see how they do. Such open-label single arm trials cannot control for placebo effects or experimenter biases, and again results should be considered preliminary.

Open or uncontrolled trials are not useless, however. The outcome of subjects in such trials can be compared to historical controls, and a significant result (along with evidence of safety) can be used to justify a larger and more rigorous trial.

Controlled trials have one or more comparison groups in the trial itself – different groups of subjects receive different treatments or no treatment. All subjects can be followed in the same manner. Control groups allow the experimenter to make sure that all the subjects have the same disease or symptoms, that they receive known treatments, and many variables (such as other treatments they may be receiving, severity at inclusion, age, sex, race, etc.) can be accounted for.

Controlling for variables

With controlled trials the experimenter can start to control for variables. If the question is – does treatment A improve outcome in disease X, a controlled prospective trial can attempt to isolate treatment A from other factors that may affect outcome.

One method for controlling variables is stratification – the study protocol can place subjects in different treatment groups so that the groups end up with the same proportion of different sexes, ages, races, and other known variables that may be pertinent. Stratification can control for known or obvious confounding factors.

But of course there can always be unknown confounding factors. The only way to deal with these is through randomization and large study size. If a large number of subjects are randomly assigned (once stratified for age, sex, etc.) into the different treatment groups, then any unknown variables should average out. Of course, this requires sufficient numbers – small studies are always suspect because the groups may be significantly different by chance alone.
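As a rough illustration of how stratification and randomization fit together, here is a minimal Python sketch. The subject records, the stratification keys ("sex" and "age_band"), and the function name are all invented for the example, and real trials use more careful schemes (permuted blocks, concealed allocation), so treat this as a toy.

```python
import random
from collections import defaultdict

def stratified_randomize(subjects, keys, arms=("treatment", "placebo"), seed=0):
    """Assign subjects to arms at random, balancing the arms within each stratum."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    strata = defaultdict(list)
    for s in subjects:
        strata[tuple(s[k] for k in keys)].append(s)

    assignment = {}
    for members in strata.values():
        rng.shuffle(members)  # random order within the stratum
        for i, s in enumerate(members):
            # alternate arms down the shuffled list so each stratum splits evenly
            assignment[s["id"]] = arms[i % len(arms)]
    return assignment

# Hypothetical subjects: 40 people, evenly split by sex and age band.
subjects = [
    {"id": i, "sex": "F" if i % 2 else "M", "age_band": "65+" if i % 4 < 2 else "<65"}
    for i in range(40)
]
assignment = stratified_randomize(subjects, keys=("sex", "age_band"))
```

Each of the four strata ends up split evenly between the arms, which handles the known variables. Which individual lands in which arm is left to chance, and that is what is supposed to average out the unknown ones.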

Randomization is important because when patients select their own treatments this opens the door for selection bias. For example, sicker patients may opt for more aggressive therapy. They will do worse because they were sicker to begin with, making the more aggressive therapy look less effective.


A randomized prospective trial can control for many variables, but the only way to control for placebo effects and the bias of the experimenters is with blinding – meaning that participants don’t know who is getting the real treatment and who is getting a different treatment or a placebo.

A single-blind study is one in which subjects do not know which treatment they are getting. A double-blind study is one in which the experimenter does not know either – until the study is done and the “code is broken.”

When subjects are blinded, placebo effects should be the same. It is often difficult, however, to fully blind subjects. Medications may have obvious side effects, and subjects who experience the side effects know they are getting active medication.

Physical interventions, like acupuncture, surgery, massage, or physical therapy, are difficult to impossible to blind. A person knows if they have been massaged or not. For these studies creative blinding techniques may need to be used. Or, “sham” procedures can be used for placebos.

Studies may also assess how successful the blinding was – by asking subjects if they think they received the placebo or the treatment.

Experimenters also need to be blinded to eliminate placebo and biasing effects. This is easy for drug trials, but may be impossible for physical intervention trials. However, a study can be partially double-blinded if there is a blinded evaluator – an experimenter whose only involvement with the study is to assess the subjects, while carefully avoiding any information that would clue them in as to which treatment arm each subject was in.

But the best studies are ones in which everyone involved is completely blinded until the results are completely in.

Outcome measures

Deciding how to determine if an intervention “works” is not always trivial. Outcome measures need to be a good and reliable marker of the disease or syndrome you are following. For example, in a diabetes study, do you follow HgA1C, random glucose checks, glucose tolerance tests, end-organ damage, need for medication, or some other biological marker?

In addition to being a good marker for what you are studying, the outcome should be meaningful. Do we care if a cholesterol lowering drug lowers total cholesterol, or if it prevents heart attacks and strokes? And if it prevents heart events, does it prolong survival, or does it just reduce angina without affecting survival?

Outcomes also need to be free of confounding. For example, early stroke trials looked at stroke incidence, which may seem reasonable. However, if more subjects on a treatment died of heart attacks, they would not be around to have a stroke, so the treatment reduces stroke but only by allowing more heart attacks. So stroke-free survival is a better outcome to follow.
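A few lines of arithmetic make the stroke example concrete. The counts below are invented purely for illustration, and the sketch assumes no subject has both a stroke and a fatal heart attack.

```python
def summarize_arm(n_subjects, strokes, fatal_heart_attacks):
    """Return (stroke incidence, stroke-free survival) for one trial arm.

    Simplifying assumption: no subject has both events.
    """
    stroke_incidence = strokes / n_subjects
    stroke_free_survival = (n_subjects - strokes - fatal_heart_attacks) / n_subjects
    return stroke_incidence, stroke_free_survival

# Invented counts, 1,000 subjects per arm.
control = summarize_arm(1000, strokes=100, fatal_heart_attacks=50)
treated = summarize_arm(1000, strokes=80, fatal_heart_attacks=120)

print("stroke incidence:     control %.2f vs treated %.2f" % (control[0], treated[0]))
print("stroke-free survival: control %.2f vs treated %.2f" % (control[1], treated[1]))
```

With these made-up numbers the treated arm has fewer strokes but also fewer subjects alive and stroke-free, which is exactly the pattern that stroke incidence alone would hide.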

Outcome measures also vary on how objective or subjective they are. Just asking patients how they feel is not a very reliable outcome measure. You can pseudo-quantify this by asking them to put a number on their pain or other symptoms, but it is still a subjective report. Measuring the volume of lesions in the brain, however, is an objective outcome measure, and is therefore more reliable.

Many studies will follow several outcomes – some subjective but important, and others objective and quantifiable, even if they are an indirect marker rather than a direct outcome we care about.

Statistical analysis

I won’t go into statistics in any detail, as that is a highly technical area and any reasonable treatment would be much longer than the rest of this post. Here even medical professionals rely upon statistical experts to make sure we get it right.

But it is good to understand the basics (as long as you don’t rely upon basic knowledge – then it is easy to be fooled by fancy statistical tricks).

The most basic concept of clinical trials is statistical significance – is there an effect or correlation that is probably greater than chance? Most studies rely upon the P-value, which is a measure of the chance of the result occurring if the null hypothesis (no effect) is correct. A P-value of 0.05 means (roughly) that, if there were no real effect, there would be a 5% (or 1 in 20) probability of seeing a result at least this large by chance alone. It is not the probability that the null hypothesis is true. A P-value of 0.05 is commonly used as a cutoff for statistical significance, but it is important to realize that with this cutoff 1 in 20 studies of worthless treatments will appear positive due to chance alone. Lower P-values, such as 0.01, are more significant.
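The "1 in 20 studies of worthless treatments" figure can be checked with a small simulation. This is only a sketch under simplifying assumptions of my own choosing: both arms are drawn from the same normal distribution with a known standard deviation of 1, and a simple two-sided z-test stands in for whatever analysis a real trial would use.

```python
import random
from statistics import NormalDist, fmean

def simulate_null_trials(n_trials=2000, n_per_arm=50, alpha=0.05, seed=1):
    """Fraction of trials of a worthless treatment that reach P < alpha."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # standard error of the difference in means when the SD is known to be 1
    se = (2 / n_per_arm) ** 0.5
    positives = 0
    for _ in range(n_trials):
        # both arms come from the same distribution: the treatment does nothing
        treated = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        control = [rng.gauss(0, 1) for _ in range(n_per_arm)]
        z = (fmean(treated) - fmean(control)) / se
        p_value = 2 * (1 - std_normal.cdf(abs(z)))  # two-sided
        if p_value < alpha:
            positives += 1
    return positives / n_trials

false_positive_rate = simulate_null_trials()
print(f"Share of null trials with P < 0.05: {false_positive_rate:.3f}")
```

The printed share should land close to 0.05. Every one of those "positive" trials is a false alarm, since the simulated treatment has no effect at all.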

But P-value isn’t everything. A poorly designed study can result in an impressive P-value. Also, the size of the effect must be considered. You can have a low P-value for a tiny effect (if there are large numbers of subjects in the trial) – the effect may be clinically insignificant, and small effects are more likely to be due to hidden biases or confounders.

Therefore, we generally are only impressed when a clinically large effect also has a low P-value.
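To see how a small effect can still reach a low P-value, here is a hedged sketch using a pooled two-proportion z-test. The event rates (10.0% vs 9.6%) and the sample sizes are hypothetical, and the test is a textbook approximation rather than the analysis any particular trial used.

```python
from statistics import NormalDist

def two_proportion_p_value(events_a, n_a, events_b, n_b):
    """Two-sided P-value from a pooled two-proportion z-test."""
    rate_a, rate_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (rate_a - rate_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical event rates: 10.0% on control vs 9.6% on treatment.
p_large = two_proportion_p_value(10_000, 100_000, 9_600, 100_000)  # 100,000 per arm
p_small = two_proportion_p_value(100, 1_000, 96, 1_000)            # 1,000 per arm
risk_reduction = 10_000 / 100_000 - 9_600 / 100_000                # 0.4 percentage points

print(f"P with 100,000 per arm: {p_large:.4f}")
print(f"P with 1,000 per arm:   {p_small:.4f}")
print(f"Absolute risk reduction: {risk_reduction:.3f}")
```

The absolute difference is the same 0.4 percentage points in both cases. Only the very large trial pushes it below the usual significance cutoff, which says something about the sample size and nothing about whether the effect matters clinically.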

In addition to P-value, the number of subjects in the trial is very important. Even though these are related, the larger the study the more impressive the results, as random fluctuations are less likely to play a role.

One common trick to look out for is multiple analysis.  A study may, for example, look at 10 variables (or one variable at 10 different points in time), and find statistical significance for one, and present that as a positive study. However, this is equivalent to taking 10 chances at that 1 in 20 chance of hitting significance. Proper statistical analysis will account for multiple comparisons.
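The arithmetic behind "10 chances at that 1 in 20 chance" is short enough to write out. The sketch assumes the 10 comparisons are independent, which real outcome measures often are not, and uses the Bonferroni correction as one simple (and conservative) way of accounting for multiple comparisons. The function names are my own.

```python
def chance_of_any_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n independent null tests reaches P < alpha."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test cutoff that keeps the overall false positive rate at or below alpha."""
    return alpha / n_tests

familywise = chance_of_any_false_positive(10)                      # about 0.40
per_test = bonferroni_threshold(10)                                # 0.005
corrected = chance_of_any_false_positive(10, alpha=per_test)       # just under 0.05

print(f"Chance of at least one false positive in 10 tests: {familywise:.2f}")
print(f"Bonferroni per-test cutoff: {per_test:.3f}")
print(f"Overall false positive chance after correction: {corrected:.3f}")
```

So a study that tests 10 things at the 0.05 level has roughly a 40% chance of finding something "significant" even when nothing is going on.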

Other factors to look out for

There are other features that are important to consider in evaluating a clinical trial. What was the dropout rate? If half of the subjects dropped out, that unrandomizes or biases the groups, because dropouts are not random. For example, subjects that do not respond to treatment may drop out, leaving only those who do well.
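Here is a back-of-the-envelope sketch of how non-random dropout can manufacture an apparent benefit. The numbers are invented: both arms have the same true response rate of 30%, but in the treatment arm half of the non-responders are assumed to drop out before the final assessment.

```python
def completer_response_rate(true_response_rate, nonresponder_dropout_rate):
    """Response rate among subjects who finish, if only non-responders drop out."""
    responders = true_response_rate
    remaining_nonresponders = (1 - true_response_rate) * (1 - nonresponder_dropout_rate)
    return responders / (responders + remaining_nonresponders)

# Same true response rate in both arms; only the dropout pattern differs.
control_rate = completer_response_rate(0.30, nonresponder_dropout_rate=0.0)
treated_rate = completer_response_rate(0.30, nonresponder_dropout_rate=0.5)

print(f"Response among completers, control:   {control_rate:.2f}")
print(f"Response among completers, treatment: {treated_rate:.2f}")
```

Among those who complete the study the treatment arm looks noticeably better, even though the treatment did nothing. This is why the dropout rate, and how dropouts were handled in the analysis, belongs on the check-list.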

Not all controls are equal, either. Sometimes the control group is not an inactive placebo but standard care. What if the standard treatment is too effective, or what if it is not effective at all? You need to know what the study treatment is being compared to.


When a new clinical trial is being promoted in the news as evidence for or against a treatment – run down this list. Is it a randomized, controlled, double-blind trial, is the blinding adequate, are the outcome measures objective and relevant, is the effect size robust, how large is the study, what variables are actually being isolated, and what was the dropout rate?

And of course, no one study is ever the definitive last word on a clinical question. Each study must be put in the context of the full scientific literature, which means considering plausibility or prior probability. That is the essence of science-based medicine.

Posted in: Science and Medicine


21 thoughts on “Evidence in Medicine: Experimental Studies”

  1. Zoe237 says:

    Okay, so what exactly is logistic regression? One always reads that the researchers have “adjusted” for confounders – age, race, income, whatever. How reliable is this? And why do some scientists hate the P-value?

    Nice layman's explanation.

  2. Robin says:

    This is a great post for those of us without a science background. Thank you.

  3. snfraser says:

    Oversimplification follows….

    Logistic regression is used when the outcome is binary, typically does or doesn’t develop disease. Statistically adjusting for confounders essentially equates the sample on that confounder so that one can say the influence of sex or age has been removed. A better solution is to improve one’s methodology, but this isn’t always possible. Anyway, the analysis gives you coefficients for each ‘predictor’ (e.g., BMI, sex, presence of a particular risk factor) that are odds ratios. A significant odds ratio over 1 says that there is an increased chance of developing disease, say. Below 1 means less chance. Typically, an odds ratio of 2 or higher is possibly a big deal. One always needs to consider the raw rates of disease however, but that is another story.

    Anyway, some researchers hate p-values because p is a function of sample size. Basically, one can count on finding a significant result if there is a large enough sample. Researchers are therefore (and they should be) more interested in the effect size, as noted in this article. A better alternative to p-values is to report a confidence interval. This is what you often see in the reporting of logistic regression results.

  4. Scott says:

    “One common trick to look out for is multiple analysis.”

    An interesting facet of this is that not all scientists understand the statistics properly, so may not have been adequately careful. Multiple comparisons is one of the easiest places to very honestly get it wrong.

  5. edgar says:

    Multiple logistic regression is a statistical test in which you test for significance of one variable after others are in the model.

  6. twaza says:

    Stephen, thanks for a brilliant concise summary.

    A few suggestions:

    The statistical analysis should provide confidence intervals (usually 95% CIs) as well as p values for comparisons.

    The conclusion should discuss, for each outcome measure, the effect size, its uncertainty (confidence interval), clinical importance, and the risks of bias and error (you gave a nice check list).

  7. edgar says:

    clinical importance,

    I would say as well positive predictive value and negative predictive value, if applicable. This is more clinically relevant than confidence intervals and statistical significance.

  8. rlee says:

    Nice overall summary. However, it is NOT true that P-value ” is a measure of the chance that the null hypothesis (no effect) is correct.” This is a common misconception, but the P-value is actually the probability of getting a result that is the same as or more extreme than the observed value, IF THE NULL HYPOTHESIS IS TRUE. Since the truth of the null hypothesis is assumed, the P-value cannot be a measurement of its probability. See Steven Goodman’s discussion of this issue in Ann Intern Med 1999; 130: 995.

  9. JerryM says:

    apologies, I was in a pedantic mood

    “which can sevre ”

    Observational studies, on the other hand, look at what is happening or what has happened in the world, but does not involve any intervention.


    Many studies will follow several outcomes – some subjective but important, and others objective and quantifiable if an indirect marker rather than a direct outcome we care about. <- an 'is' is missing there somewhere i think

    that a 5% (or 1 in 20) probability that

    to consider is evaluating

    pls delete this post :)

  10. BoulderEric says:

    I have to strongly echo rLee’s point. The distinction rLee makes
    is not a pedantic one. A vast majority of practicing biomedical
    scientists think (erroneously!) that the p-value is the
    probability that a particular result arose due to chance alone,
    and much bad science has arisen from their misconception. It is
    incumbent on every science blogger not to further propagate the

    When you quote a p-value of 0.05, what you are really saying is “either the null hypothesis is incorrect OR there has
    been a low probability (<0.05) statistical fluctuation." The
    question is, which of the disjoint branches actually happened? If
    the opposite of your null hypothesis is some really outlandish
    claim "eating french fries for breakfast every morning drives
    breast cancer into remission", well most of us would agree
    that on the face of it that hypothesis is a lot less than 5% likely
    to be true, thus the more likely branch of the disjunction is
    actually "there has been a low probability statistical
    fluctuation." I explicitly work through the math for two
    examples, one in which the p-value vastly underestimates the
    probability of the null hypothesis being true, and one in which
    it gets it about right, in my two-part blog post,

  11. Ed Whitney says:

    The volume of lesions in the brain may well be more objective than some scale in which patients report their ability to do activities of daily living or the frequency of their headaches, but that does not necessarily make them preferable as outcome measures.

    Researchers like outcome measures that have numerical distributions that lend themselves to statistical hypothesis testing. But for EBM to work it needs to make sure that the outcomes that are convenient for researchers to measure have something to do with what is bothering the patient. The volume of brain lesions may or may not be correlated with symptoms, analgesic use, or other self-reported functional capacity.

    Patient-centered outcomes are the preferred measure of effectiveness in most clinical trials under most circumstances.

  12. I understand the distinction about P-value and changed the text to make it more clear. Thanks for the feedback.

    Ed – I agree, but there is a tradeoff. More objective outcomes may be easier to measure and quantify, but then we need to consider clinical relevance. Quality of life measures are very relevant, but may be hard to measure and quantify.

    Which is why many studies use both kinds of outcomes simultaneously.

    Also – the question is important. If a study is being used to see if an intervention with unknown mechanism can potentially have a benefit (a proof of concept) then objective measures are better. If two established treatments are being compared for overall outcome, then quality of life, cost effectiveness, type measures are better.

  13. Tom S says:

    And, in the best of cases, all the statistical analysis is done in a blinded fashion as well. One might find, for example, that treatment A reduces the risk of death within one year by 10%, relative to treatment B. After you are convinced you have it right, you break the code and see what is A and what is B.

    All the discussion about p-values goes far to answer zoe237’s question as to why people dislike them. If you really articulate what it means, people just say “huh?” One of the problems with p-values, as I see it, is that they make statements about the data, not about the true values that we are trying to determine, such as difference in survival between treatments. And, since they express the probability, assuming the null hypothesis is true, of getting data as deviant OR MORE DEVIANT from expectation than what we actually got, they are a function of data that weren’t even observed! Why should our inference about this trial depend on what we DIDN’T see?

  14. Adam_Y says:

    Well technically there is another statistic that actually validates the p-value called statistical power. It essentially is the odds that you would get a false positive or false negative given a population.

  15. BoulderEric says:

    I have to say that the sentence in the amended post,

    “A P-value of 0.05 means (roughly) that 5% (or 1 in 20) probability that the outcome is due to chance alone, and not a real effect.”

    still makes me squint, for the reasons I discuss in my comment above and in the blog post I linked to. It must be said, though, that almost every working biomedical researcher I’ve ever talked to believes that that is what the p-value means, and many of them are excellent, productive scientists, so you are at the very least in good company!

  16. edgar says:

    “I agree, but there is a tradeoff. More objective outcomes may be easier to measure and quantify, but then we need to consider clinical relevance. Quality of life measures are very relevant, but may be hard to measure and quantify.”

    I think this is where the importance of QUALITATIVE data comes into play…And it is often discounted, but it does matter.

  17. snfraser says:

    As a statistical consultant, I feel your pain, BoulderEric. This is also another reason why researchers hate p-values. In fact, the logic of the statistical hypothesis test is confusing and counterintuitive, and essentially opposite of the research hypothesis. I think this is another source of confusion for researchers. For practitioners the emphasis should be on clinical significance rather than statistical significance, i.e. effect sizes.

  18. Zoe237 says:

    Thanks for the explanations. The p value of bullshit was great. It’s been awhile since my stats classes in college.

  19. Kristen says:

    Thank you for the informative article. These are things I have had a general grasp of, but now I believe I understand the differences more.
