Articles

The Power of Replication – Bem’s Psi Research

I love reading quotes by the likes of Karl Popper in the scientific literature. A recent replication of Bem’s infamous psi research, Feeling the Future, gives us this quote:

Popper (1959/2002) defined a scientifically true effect as that “which can be regularly reproduced by anyone who carries out the appropriate experiment in the way prescribed.”

The paper is the latest replication of Daryl Bem’s 2011 series of 9 experiments in which he claimed consistent evidence for a precognitive effect, or the ability of future events to influence the present. The studies were published in The Journal of Personality and Social Psychology, a prestigious psychology journal. All of the studies followed a similar format, reversing the usual direction of standard psychology experiments to determine if future events can affect past performance.

In the 9th study, for example, subjects were given a list of words in sequence on a computer screen. They were then asked to recall as many of the words as possible. Following that, they were given two practice sessions with half of the words, chosen at random by the computer. The results were then analyzed to see if practicing the words improved subjects’ recall for those words in the past. Bem found that it did, with the largest effect size of any of the 9 studies.

Needless to say, these results were met with widespread skepticism. There are a number of ways to assess an experiment to determine if the results are reliable. You can examine the methods and the data themselves to see if there are any mistakes. You can also replicate the experiment to see if you get the same results.

Replicating Bem

Bem’s studies have not fared well in replication. Earlier this year Ritchie, Wiseman, and French published three independent replications of Bem’s 9th study, all negative. Of further interest, the journal that originally published Bem’s article had declined to publish Ritchie et al.’s paper, claiming that it does not publish replications. This decision (and editorial policy) was widely criticized, as it reflects an undervaluing of replication.

It’s good to see that the journal has relented and agreed to publish a replication. Galak, LeBoeuf, Nelson, and Simmons should be commended, not only for their rigorous replication but also for their excellent article, which hits all the key points of this entire episode.

The researchers replicated experiments 8 and 9 of Bem (they chose these protocols because they were the most objective). They conducted 7 precise replications involving a total of 3,289 subjects (Bem’s studies involved 950 subjects). Six of the seven studies, when analyzed independently, were negative, while the last was slightly statistically significant. However, when the data are taken together, they are dead negative. The authors concluded that their experiments found no evidence for psi.

Lessons from Bem

A Bayesian Analysis

At Science-Based Medicine we have advocated taking a more Bayesian approach to scientific data. This involves considering a claim’s plausibility and prior probability. Bem and others are fairly dismissive of plausibility arguments and feel that scientists should be open to whatever the evidence shows. If we dismiss results because we have already decided the phenomenon is not real, then how will we ever discover new phenomena?

On the other hand, it seems like folly to ignore the results of all prior research and act as if we have no prior knowledge. There is a workable compromise – be open to new phenomena, but put any research results into the context of existing knowledge. What this means is that we make the bar for rigorous evidence proportional to the implausibility of the phenomenon being studied. (Extraordinary claims require extraordinary evidence.)

One specific manifestation of this issue is the nature of the statistical analysis of research outcomes. Some researchers propose that we use a Bayesian analysis of data, which in essence puts the new research data into the context of prior research. A Bayesian approach essentially asks – how much does this new data affect the prior probability that an effect is real?

Wagenmakers et al reanalyzed Bem’s data using a Bayesian analysis and concluded that the data are not sufficient to reject the null hypothesis. They further claim that the currently in vogue P-value analysis tends to overcall positive results. In reply, Bem claims that Wagenmakers used a ridiculously low prior probability in his analysis. In reality it doesn’t matter what you think the prior probability is: the Bayesian analysis showed that Bem’s data have very little effect on the probability that retrocausal cognition is real.
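
To make that arithmetic concrete, here is a minimal sketch in Python of how a Bayes factor updates a prior. The Bayes factor of 1.5 is purely illustrative (it is not a figure from Wagenmakers’ analysis); the point is that a factor near 1 leaves any prior, high or low, essentially where it started.

```python
# A minimal sketch of Bayesian updating. The Bayes factor below is
# illustrative only, not a value taken from Wagenmakers et al.

def update(prior_prob, bayes_factor):
    """Posterior probability that an effect is real, given a prior and
    a Bayes factor BF = P(data | effect) / P(data | no effect)."""
    prior_odds = prior_prob / (1 - prior_prob)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)

# A Bayes factor near 1 barely moves any prior, high or low:
for prior in (0.5, 0.1, 0.001):
    print(f"prior {prior} -> posterior {update(prior, bayes_factor=1.5):.4f}")
```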

Galak et al in the new study also perform a Bayesian analysis of their own data and conclude that this analysis strongly favors the null hypothesis. They also specifically cite Wagenmakers’ support for the Bayesian approach.

Criticisms of Bem

It’s possible for a study to “look good on paper” – meaning that the details that get reported in the published paper may make the study look more rigorous than it actually was. There may also be alternate ways to analyze the data that give a different picture. Ritchie, Wiseman, and French outline several criticisms of Bem’s methods. They mention the Bayesian issue, but also note that an analysis of the data shows an inverse relationship between effect size and subject number – in other words, the fewer the subjects, the greater the effect size. This could imply a process called optional stopping.

This is potentially very problematic. Related to this is the admission by Bem, according to the article, that he peeked at the data as they were coming in. Peeking is frowned upon precisely because it can lead to optional stopping – halting data collection because the results so far are looking positive. This is a subtle way of cherry picking positive data, which is why a predetermined stopping point should be chosen in advance.
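
The effect of peeking is easy to demonstrate. Below is a minimal simulation with made-up parameters (start at 10 subjects, test after every new subject, cap at 50), using a simple z-test as a stand-in for whatever analysis a study actually runs. The null hypothesis is true by construction, yet optional stopping produces “significant” results far more often than the nominal 5%.

```python
# A minimal simulation of optional stopping under a true null hypothesis.
# Parameters (start_n, max_n) are made up for illustration.

import random
from math import sqrt
from statistics import mean, stdev

def peeking_experiment(start_n=10, max_n=50, z_crit=1.96):
    """Collect null data, testing after every new subject and stopping
    the moment the result looks significant."""
    data = [random.gauss(0, 1) for _ in range(start_n)]
    while len(data) < max_n:
        z = mean(data) / (stdev(data) / sqrt(len(data)))
        if abs(z) > z_crit:
            return True  # stopped early on a chance "positive"
        data.append(random.gauss(0, 1))
    return abs(mean(data) / (stdev(data) / sqrt(len(data)))) > z_crit

runs = 10_000
rate = sum(peeking_experiment() for _ in range(runs)) / runs
print(f"False-positive rate with peeking: {rate:.3f}")  # well above 0.05
```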

Another issue raised was the use of multiple comparisons. Researchers can collect lots of data by looking at many variables, and then make many comparisons among those variables. Sometimes they publish only the positive correlations, and may or may not disclose that they even looked at other comparisons. Sometimes they publish all the data, but the statistical analysis treats each comparison independently. In short, if you make 20 comparisons, each with a 1/20 chance of reaching statistical significance, then on average one comparison will be significant by chance alone. You could then declare a real effect. What should happen instead is that the statistical analysis is adjusted to account for the fact that 20 different comparisons were made, which can turn apparently positive results negative.
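
The arithmetic behind that point, along with one standard correction (Bonferroni – the paper itself does not prescribe which adjustment to use), looks like this:

```python
# Back-of-the-envelope arithmetic for the multiple-comparisons point,
# assuming 20 independent tests at the conventional alpha of .05.

alpha, k = 0.05, 20

expected_hits = alpha * k                 # expected false positives: 1.0
p_at_least_one = 1 - (1 - alpha) ** k     # ~0.64 for independent tests
bonferroni_alpha = alpha / k              # corrected per-test threshold

print(f"Expected false positives in {k} tests: {expected_hits}")
print(f"P(at least one false positive): {p_at_least_one:.2f}")
print(f"Bonferroni-corrected alpha: {bonferroni_alpha}")
```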

Finally, there was a serious issue raised with how the data were handled. Subjects occasionally made spelling errors when listing the words they recalled. The result may have been a non-word (like ctt for cat) or another word (like car for cat). Researchers had to go through and correct these misspellings manually.

The authors point out that these corrections were done in a non-blinded fashion, creating the opportunity to fudge the data toward the positive through the choices made during correction. Bem countered that even if you removed the corrected words the results would still be positive, but the practice is still methodologically sloppy and likely still relevant, for reasons I will now get into.

Researcher Degrees of Freedom

As we see, there were many problems with the methods and statistical analysis of Bem’s original paper. Bem argues that each problem was small and by itself would not have changed the results. This argument, however, misses a critical point, made very clear in another recent paper – one that was also cited and discussed in the Galak paper.

Simmons et al published a paper demonstrating how easy it is to achieve false positive results by exploiting (consciously or unconsciously) so-called “researcher degrees of freedom.” In the abstract they write:

“In this article, we accomplish two things. First, we show that despite empirical psychologists’ nominal endorsement of a low rate of false-positive findings (≤ .05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. We present computer simulations and a pair of actual experiments that demonstrate how unacceptably easy it is to accumulate (and report) statistically significant evidence for a false hypothesis.”

In my opinion this is a seminal paper that deserves wide distribution and discussion among skeptics, scientists, and the public. The paper discusses the fact that researchers make many decisions when designing and executing a study, and analyzing and reporting the data. Each individual decision may have only a small effect on the final outcome. Each decision may be made perfectly innocently, and can be reasonably justified.

However, the cumulative effect of these decisions (degrees of freedom) could be to systematically bias the results of a study toward the positive. The power of this effect is potentially huge, and likely results in a significant bias towards positive studies in the published literature.

But even worse, this effect can also be invisible. As the authors point out – each individual decision can seem quite reasonable by itself. The final published paper may not reflect the fact that the researchers, for example, looked at three different statistical methods of analysis before choosing the one that gave the best results.
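
To see how even a single such decision shifts the error rate, here is a toy simulation of my own construction (not a protocol from Simmons et al): measure two correlated outcomes, both truly null, and report whichever one happens to reach significance.

```python
# A toy illustration of one researcher degree of freedom: two correlated
# outcome measures, both truly null; the "researcher" reports whichever
# reaches significance. The simple z-test is an approximation.

import random
from math import sqrt
from statistics import mean, stdev

def significant(data, z_crit=1.96):
    return abs(mean(data) / (stdev(data) / sqrt(len(data)))) > z_crit

def flexible_study(n=30):
    latent = [random.gauss(0, 1) for _ in range(n)]
    dv1 = [x + random.gauss(0, 1) for x in latent]  # two correlated,
    dv2 = [x + random.gauss(0, 1) for x in latent]  # truly null outcomes
    return significant(dv1) or significant(dv2)     # report either "hit"

runs = 10_000
rate = sum(flexible_study() for _ in range(runs)) / runs
print(f"False-positive rate with one flexible choice: {rate:.3f}")
```

One flexible choice only nudges the rate above 5%, but Simmons et al report that combining several such liberties can drive the false-positive rate above 60 percent.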

The authors lay out some fixes for this problem, such as researchers disclosing their methods prior to collecting data (and no peeking). But another check on this kind of bias in research is replication.

The Power of Replication

The degrees of freedom issue is one big reason that replicating studies, especially precise replications, is so important. A precise replication should have no degrees of freedom, because all the choices were already made by the original study. If the effect being researched is real, then the results should still come out positive. If they were the result of exploiting the degrees of freedom, then they should vanish.

There are also the other recognized benefits of replication. The most obvious is that any unrecognized quirky aspects of study execution or researcher biases should average out over multiple replications. For this reason it is critical for replications to be truly independent.

Another often missed reason why replications are important is simply to look at a fresh set of data. It is possible for a researcher, for example, to notice a trend in data that generates a hypothesis. That trend may have been entirely due to random clustering, however. If the data in which the trend was initially observed is used in a study, then the original random clustering can be carried forward, creating the false impression that the hypothesis is confirmed.

Replication involves gathering an entirely new data set, so any prior random patterns would not carry forward. Only if there is a real effect should the new data reflect the same pattern.
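
A small sketch, with hypothetical numbers, shows why: generate pure noise for 10 groups, let the “hypothesis” be whichever group happens to look best, and then try to confirm it on the same data versus a new sample.

```python
# A sketch of why fresh data matter: a hypothesis generated by selecting
# the best-looking group in noise "confirms" on the original data but
# not on a new sample. All numbers here are pure noise by construction.

import random
from statistics import mean

groups = {g: [random.gauss(0, 1) for _ in range(20)] for g in range(10)}

# Hypothesis generated from the data: the group with the highest mean.
best = max(groups, key=lambda g: mean(groups[g]))

# "Confirming" on the same data carries the random cluster forward...
print(f"Effect in original data: {mean(groups[best]):+.2f}")  # typically inflated

# ...while a replication gathers fresh data, where the cluster vanishes.
fresh = [random.gauss(0, 1) for _ in range(20)]
print(f"Effect in fresh data:    {mean(fresh):+.2f}")  # typically near zero
```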

Proper Skepticism

Prominently displayed at the top of the Society for Psychical Research’s website is this quote:

“I shall not commit the fashionable stupidity of regarding everything I cannot explain as a fraud.” – C.G. Jung

Clearly that quote reflects the prevailing attitude among psi researchers toward external skepticism of their claims and research. Every skeptic who has voiced an opinion has likely been met with accusations of being dismissive and closed-minded.

But this is a straw man. Skeptics are open to new discoveries, even paradigm-changing revolutionary ideas. Often I am asked specifically – what would it take to make me accept psi claims? I have given a very specific answer, one that applies to any extraordinary claim within medicine. It would take research simultaneously displaying the following characteristics:

1 – Solid methodology (proper blinding, fresh data set, clearly defined end points, etc.)

2 – Statistically significant results

3 – Absolute magnitude of the effect size that is greater than noise level (a sufficient signal to noise ratio)

4 – Consistent results with independent replication.

Most importantly, it would need to display all four of these characteristics simultaneously. Psi research, like most research into alternative medicine modalities like homeopathy and acupuncture, cannot do that, and that is why I remain skeptical. These are the same criteria that I apply to any claim in science.

In addition, I do think that prior probability should play a role – not in accepting or rejecting any claim a priori, but in setting the threshold for the amount and quality of evidence that will be convincing. This is reasonable – it would take more evidence to convince me that someone hit Bigfoot with their car than that they hit a deer with their car. There is a word for someone who accepts the former claim with a low threshold of evidence.

You can convince me that psi phenomena are real, but it would take evidence that is at least as solid as the evidence that implies that such phenomena are probably not possible.

It is also important to recognize that the evidence for psi is so weak, and of such a nature, that it is reasonable to conclude it is not real even without considering plausibility. But it is probably not a coincidence that we consistently see either poor-quality or negative research in areas that do have very low plausibility.

Conclusion

The least important implication of the recent paper by Galak et al is that it provides further evidence against psi as a real phenomenon, and specifically against the claims of Daryl Bem. Psi is a highly implausible hypothesis that has already been sufficiently refuted, in my opinion, by prior research.

The paper is perhaps a milestone, however, in other important respects:

- It is an admission of sorts by The Journal of Personality and Social Psychology of the importance of precise replication and a reversal of their prior decision not to publish such research.

- The paper highlights the role, and possible superiority, of Bayesian analysis as a method for looking at experimental data.

- It highlights the role of researcher degrees of freedom in generating false positive data, and replication as one solution to this problem.

- This large, rigorous, and negative replication establishes that studies published in peer-reviewed journals with positive and solid-appearing results can still be entirely wrong. It therefore justifies initial skepticism toward any such data, especially when extraordinary claims are involved.

The insights provided by this excellent paper reflect many of the points we have been making at SBM, and should be applied broadly and vigorously to alternative medicine claims.

Posted in: Neuroscience/Mental Health, Science and Medicine


18 thoughts on “The Power of Replication – Bem’s Psi Research”

  1. geo says:

    re: “It highlights the role of researcher degrees of freedom in generating false positive data, and replication as one solution to this problem.”

    Isn’t there some tension between this point, and a desire to create a greater role for the assumptions researchers make about plausibility?

    If those researching a particular area share certain prejudices, and this then leads to the generation of false positive data, then the problem could become self-reinforcing. To some extent, this will depend upon the area being studied, the ease of gathering meaningful objective data, etc… but there are areas where medicine touches upon important moral and political issues, and vested interests will have strong desires for data to be interpreted in particular ways (I mentioned in another comment the biopsychosocial reforms to disability benefit taking place in the UK).

  2. rork says:

    Not fond of the researcher “degrees of freedom” language, since it’s nothing like what the term is really about, and I don’t want the pain of hearing docs say it who don’t know about that reality, which will happen if the term becomes trendy. The meanings of synergy, gene/locus, and predictive have all already died the death at the doctors’ hands. Can you leave us something? Flexibility is a good word.

    We can restart an experiment, try the same experiment multiple times (RT-PCR – it’s easy to do it until you get an error in your favor), ignore some samples that are against us (declaring them to have problems), get a few more data-points to see if it helps, and try 7 different transforms and statistical tests, all while telling ourselves we aren’t lying, somehow.

  3. Janet Camp says:

    Thank you. Another tool in my growing kit of poor-quality study detecting. I now have a place to go where all the debunking techniques are archived in one place.

    I think this blog may be more widely read than I thought. A couple of woo-ish acquaintances have stopped writing/calling after I mentioned this blog as my source for rebutting their claims of scientific support for things such as acupuncture or even (!) homeopathy. The price of using one’s real name. Oh well–I can only hope they will keep reading.

  4. phayes says:

    “Replication involves gathering an entirely new data set, so any prior random patterns would not carry forward. Only if there is a real effect should the new data reflect the same pattern.”

    A real effect but not necessarily the hypothesised real effect.

    “All experiments in psychology are not of this type, however. For example, there have been many experiments running rats through all kinds of mazes, and so on–with little clear result. But in 1937 a man named Young did a very interesting one. He had a long corridor with doors all along one side where the rats came in, and doors along the other side where the food was. He wanted to see if he could train the rats to go in at the third door down from wherever he started them off. No. The rats went immediately to the door where the food had been the time before.

    The question was, how did the rats know, because the corridor was so beautifully built and so uniform, that this was the same door as before? Obviously there was something about the door that was different from the other doors. So he painted the doors very carefully, arranging the textures on the faces of the doors exactly the same. Still the rats could tell. Then he thought maybe the rats were smelling the food, so he used chemicals to change the smell after each run. Still the rats could tell. Then he realized the rats might be able to tell by seeing the lights and the arrangement in the laboratory like any commonsense person. So he covered the corridor, and still the rats could tell.

    He finally found that they could tell by the way the floor sounded when they ran over it. And he could only fix that by putting his corridor in sand. So he covered one after another of all possible clues and finally was able to fool the rats so that they had to learn to go in the third door. If he relaxed any of his conditions, the rats could tell.” http://www.lhup.edu/~DSIMANEK/cargocul.htm

    IMHO, the failure of many medical scientists and psychologists to recognise the pathologies and the irrationality in the frequentist methods and ‘system’ of inference is an error which pales into insignificance next to the failure to understand why claims to have established retrocausality etc. would need something more than just impressively significant and replicable results before they could be taken seriously.

  5. cervantes says:

    - This large, rigorous, and negative replication establishes that studies published in peer-reviewed journals with positive and solid-appearing results can still be entirely wrong. It therefore justifies initial skepticism toward any such data, especially when extraordinary claims are involved.

    Indeed. Here is a cautionary tale. You may have heard that obesity, divorce, and even loneliness can spread through social networks. The obesity claim was published in the New England Journal of Medicine. Mathematician Russell Lyons completely debunked it, along with the rest of these claims, which were made by researchers at Harvard. But NEJM refused to publish his rebuttal. He had great difficulty getting it published at all, finally got it into an obscure journal. Christakis and Fowler, who made these claims, have sailed on unscathed, despite the established fact that their entire corpus of work in this area is total bullshit. NEJM has not printed any correction or retraction, and evidently never will.

  6. phayes says:

    That (cervantes’ cautionary tale) reminds me of sTeamTraen’s recent debunking work (in collaboration with Alan Sokal!):

    http://www.badscience.net/forum/viewtopic.php?f=3&t=26673
    http://www.badscience.net/forum/viewtopic.php?f=3&t=26673&start=25#p843471

    which I hope won’t be rejected.

  7. daedalus2u says:

    In reading this, I just realized that the inability of other researchers to “replicate” Bem’s result (submitting a manuscript to a certain journal and getting it published) demonstrates that what they tried to replicate (publishing in that journal) was not repeatable and so was not the result of a “scientific process”.

  8. Harriet Hall says:

    We covered the Bem study at the recent Skeptic’s Toolbox workshop in Eugene, Oregon. Ray Hyman told us he had a lot of experience as a reviewer of articles for psychological journals, and he was amazed that this one got by the peer reviewers and editors. He said he had made a list of 100 flaws in Bem’s paper, any ONE of which was grounds for rejection.

  9. BKsea says:

    I think the FDA drug approval system is a useful model of replication. As I understand it, a drug goes through 3 phases: Phase I demonstrates relative safety, establishes dosage, etc. Then in Phase II, some degree of efficacy is demonstrated. If successful, Phase III is started to see if efficacy is replicated in a large trial. As a result of having to prove efficacy twice, ineffective drugs rarely get approved (of course, some may sneak through or adverse events may be missed, etc., but these are relatively rare).

    In comparison, typical alternative medicine techniques are usually only held to phase I standards (“this has no harmful side effects!”). Occasionally, they are held to phase II standards (“In this small, poorly designed study, we saw an effect, p=0.049”). I don’t know of any alternative medicine technique actually getting the phase III treatment. Of course, if they were successful in Phase III, we’d probably call them medicine.

    This, in my mind, is the fundamental difference between science-based medicine and woo. True medicine has a high bar for evidence. If you want your brand of woo to get respect, you have to play by the same rules.

  10. ConspicuousCarl says:

    BKsea:

    And perhaps most important, those 3 phases all have to test the same hypothesis. You can’t say “Our drug improved maze performance in 3 out of 5 lab rats, and then improved breathing in 7 out of 12 asthmatics, therefore we want to test it on 1500 diabetics because obviously there is something special about this drug.”

  11. qetzal says:

    Actually, FDA’s default standard is “at least two adequate and well-controlled [clinical] studies, each convincing on its own, to establish effectiveness.” (see Section IIA, paragraph 2 here) But that’s not a hard and fast rule.

  12. Pyfagorous says:

    How would your interpretation of Bayesian statistics go about assigning probabilities – by committee? I.e., “We think this idea is very unlikely, therefore we need an alpha of 0.001. But this idea we think is likely, and therefore we only need an alpha of 0.3.”? Is this what you’re suggesting? It sounds like it (more or less); you might as well create an Inquisition.

  13. BillyJoe says:

    Prior probability is assigned on the basis of the results of previous trials, and on the basis of whether or not well-established laws of physics, chemistry, physiology, and biology would need to be wrong in order for the therapy to work.

Comments are closed.