Hampshire College Butchers the Psychology Classics: A Retrospective
Teaching undergrads to replicate classic psych studies in two weeks flat
Prince Wen Hui’s cook
Was cutting up an ox.
Out went a hand,
Down went a shoulder,
He planted a foot,
He pressed with a knee,
The ox fell apart
With a whisper,
The bright cleaver murmured
Like a gentle wind.
Like a sacred dance,
Like “The Mulberry Grove,”
Like ancient harmonies!
“Good work!” the Prince exclaimed,
“Your method is faultless!”
“Method?” said the cook
Laying aside his cleaver,
“What I follow is Tao
Beyond all methods!”
— Chuang Tzu’s “The Dexterous Butcher”, Trans. Thomas Merton
Last semester (S23) I taught a class titled CS-0232: Hampshire College Butchers the Psychology Classics.
The name was inspired by the Yo La Tengo album, Yo La Tengo Is Murdering the Classics. This is an album of recordings from a fundraiser for the station WFMU, where listeners who called in and pledged money could request any song, which Yo La Tengo would then attempt to play live on air, without any time to practice, look up the lyrics, etc. Or possibly the course title was a subconscious reference to a gag from the 2015 Strongbad video, The Ocelot and the Porridge Maiden.
In any case, the fact that I was allowed to teach a course with this title tells you nearly everything you need to know about Hampshire College. Our creativity. Our open-mindedness. Our indomitable spirit. Though the dean did email me to ask about it. “It's an odd course name,” she said.
The course title refers to the fact that, like Yo La Tengo, we would try to re-create some classics. In this case, classic psychology studies. And it refers to the fact that, because we would replicate each study in only a few weeks, I expected that we would butcher most or even all of them.
The name is silly, and that’s intentional. If you say "we're going to replicate some psychology classics", this will make students nervous. They’ll want to do a good job on the replications; they will want to do it right. But if you say "we're going to attempt to replicate some psychology classics, but we are probably going to butcher it", they'll be much more relaxed, and will get more out of the course.
And if they do butcher any of the replication attempts (can't collect a big enough sample size, forget to record an important variable, etc.), that's ok. Failure is a good learning experience, and one aspect of teaching, all too often neglected, is giving students the opportunity to fail safely, with style.
Here’s the course description:
Replication is a cornerstone of science. If you've discovered something, other researchers should be able to try your design and see it for themselves. No one should have to take your word for a finding.
But modern academic science often has no time for replications. This was one of the causes of the Replication Crisis: when psychologists finally checked to see if their work could be replicated by independent teams, they discovered that much of it could not.
In this course, we'll start by reading about the history of replication. Then, working as a class or in teams, we will replicate several classic studies in psychology. Because we will try to replicate several studies in one semester, a breathless pace in the world of research, we will probably butcher some of them. And that's ok. No research experience required.
That description might look weird to you, but it didn’t scare off the students. The course cap was set at 23, and 23 people signed up for this class.
And here’s what I told students on the first day of class:
This approach has three main benefits.
First, you get the full experience of running a psychology study, every part of the process except coming up with the idea. This makes the course a good introduction to research methods. You should come out of this course with the ability to run studies, write research papers, and critically examine the research of others.
Second, replication is a cornerstone of science. By attempting to replicate these studies, we are engaging with the findings in the way that scientists should. If the original findings are real, we should have a shot at replicating them in this class, and we can see if we're able to confirm these classics or not. Scholarship involves trust, but we shouldn't have to take anyone's word for their findings.
Third, I think this will give you a greater insight into the studies you read for your other classes. Hampshire is better than other schools because you get to read primary source research articles for many of your classes, but those papers can still seem very vague and abstract if you don't have any experience with their methods. All those papers were done by real, specific people, in specific places, at specific times. They had to make decisions about how to design their materials and how to recruit subjects and how to write up their findings. Maybe they argued about the results; maybe they made some mistakes! By the end of the semester you will have seen the whole process firsthand, and I hope this will help you see the authors of other scientific papers as your peers, rather than as strange visitors from the literature dimension.
How to Butcher a Psychology Study in Two Weeks Flat
I crunched some numbers and settled on the following schedule.
We spent the first few weeks getting familiar with replications — what they are, why they’re important, what’s up with the replication crisis, and so on.
We then spent about another week learning how to use some basic research tools — Qualtrics, Google Forms, Google Sheets, and the statistical programming language R.
Then we replicated three classic psychology studies, spending two weeks on each replication. This was the main course, so to speak.
Finally, students split up into groups and replicated a classic study of their choice (from a list of ten classic studies I had drawn up) in the three weeks that were left.
If you’ve done any psychology research, then you’re aware that this is an insane pace. Two weeks to come up with a design, create all the materials, recruit participants, analyze data, and write an entire manuscript. In a normal research cycle, each part of this process can take months. I wasn’t sure it would work.
But I was optimistic. I figured that we had a lot of students working together, and that many hands make light work.
I picked studies with simple designs, and studies where most of the original materials were publicly available, so students wouldn’t have to re-invent the wheel.
I knew that many of the designs could be run as online surveys, which can speed things up a lot.
I had three fantastic teaching assistants, all with previous research experience, who I knew I could count on to guide the process. I really couldn’t have done it without them. I don’t know if it’s appropriate to name them here but — you know who you are.
And we made use of 21st century communication techniques — which is to say, I had my TAs set up a Discord server. Most college students are on Discord already, so it was a convenient way for them to make plans, quickly respond to crises, and share memes. I think this helped cut down on friction across every part of the projects.
I also introduced a pretty strict structure for running the studies, which helped us stay on track. In each two-week replication, the homework for the first Monday was to read the original paper we would be replicating. On Monday 1, we would discuss the paper and come up with a design. Homework for Wednesday 1 was to start drafting materials, which had to be finished in class by the end of the day.
Over the weekend, students would collect data, and come back on Monday 2 with the results, which we would discuss and start analyzing in class. Homework was to finish the manuscript. On Wednesday 2, we would discuss the project and they could make final touches based on anything that came up in conversation. The final manuscript was whatever they had managed to put together by the end of class that Wednesday.
We could have gone at a slower pace, but I thought it was important to replicate more than one study. Running multiple studies lets students triangulate: they can see what the studies have in common and what is different. As we all know, you need three points for a triangle.
I also wanted students to get experience with different kinds of studies, and different kinds of replications. So when it came time to pick the studies we would replicate, these were my criteria:
One study where the original effect would replicate. For this, I chose Jacowitz & Kahneman (1995), Measures of Anchoring in Estimation Tasks. People have replicated this study over and over again, and it has one of the largest effect sizes in all of psychology. I was confident that this effect was real and that students would be able to find it, even given only two weeks to do so.
One study where the original effect doesn’t replicate. For this, I chose Caruso, Vohs, Baxter, and Waytz (2013), Mere Exposure to Money Increases Endorsement of Free-Market Systems and Social Inequality. Several attempted replications have found little evidence for the claimed effect, and this is a typical example of the kind of “social priming” study that routinely fails to replicate. I don’t think the claimed effect is real (or at least, it’s nowhere near as large as originally claimed) and I expected that, like the other replication attempts, we would find no evidence for it.
Finally, I wanted at least one study where they had to run participants in person, rather than through an online survey. For this, I chose Study 2 of Bargh, Chen, & Burrows (1996), the famous “elderly priming” study, where people who completed a scrambled-sentence task full of terms related to old people reportedly ambled down the hall at a slower pace when leaving the experiment (because old people walk slower, and they had old people on the brain). I don’t much care if this effect is real or not (though I don’t think it is) — I chose this study because it can’t be run online. To replicate this study, you have to run the replication in person: reserve study rooms, time participants as they walk to the elevator, the whole nine yards. And I chose it because it’s a real classic, both of social psychology and of the replication crisis. It seemed paradigmatic.
Replication 1: ⚓
For the first study, the whole class — me, three TAs, and 23 students — worked on the replication together. Most students started out with no research experience at all, so I figured we would need everyone pulling together to make the study work. And this early in the process, it would be ok if students who were more comfortable with research (for whatever reason) took the lead, and students who were less comfortable just helped out, or even just watched.
I also took a more active role in this project than I would in the later replications. For example, I knew that students were not very familiar with Qualtrics (our survey software) yet, so when we were designing the survey, I put my screen up on the projector and I “drove” the software for them. But I didn’t make any design decisions — I asked them what they wanted the survey to look like (e.g. “do you want demographics questions”) and added elements as they shouted them out, as they debated what should go in the study, how the questions should be phrased, and all the lovely details. They called the shots.
The study we were replicating was Jacowitz & Kahneman (1995), a powerful demonstration of what is called anchoring. In this kind of study, you might be asked, “Are more or less than 100 babies born per day in the United States?” Now, 100 is obviously too low, so of course you say “more”. Then you’re asked, “How many babies do you think are born in the United States each day?” You give your best guess.
Meanwhile, a different group of people are asked, “Are more or less than 50,000 babies born per day in the United States?” This is clearly too high, so they say “less”. Then they are also asked to give their best guess for the actual number.
That number in the first question, 100 babies or 50,000 babies, is called the anchor. The anchoring effect is that people tend to guess higher numbers when the anchor is high, like 50,000 babies, and guess lower numbers when the anchor is low, like 100 babies. This is one of the most robust effects in all of psychology, and it replicates quite readily.
Jacowitz & Kahneman’s original paper included 15 anchoring questions. My students decided to replicate just five of these questions, to help keep the survey relatively short. They also insisted on adding three new anchoring questions relevant to Hampshire College, which were invented and added in a mad dash in the last 15 minutes before the end of class. We ended up with a total of eight target questions.
Students shared the survey around campus, and collected a total of 94 participants before class on Monday. They used a number of techniques to get people to participate, including going from table to table in the dining hall, peer pressuring students one by one to take the survey, and putting highly inventive posters with QR codes all over campus.
Results came in over the weekend, and they successfully replicated the original effect — most questions showed a strong anchoring effect, just like they did for Jacowitz & Kahneman (1995).
The questions that didn’t show strong anchoring were ones the students included deliberately, to probe the boundaries of the effect. They had noticed that one of the questions in the original paper (“Was Lincoln's presidency before or after the 17th / 7th presidency?”; “What number do you think Lincoln's presidency was?”) showed no anchoring effect at all, so they included it to see whether it would also show no anchoring effect in their replication.
They suspected that this question didn’t show an anchoring effect because many people happen to know that Lincoln was exactly the 16th president. If you know the exact answer to a question, an anchor should have essentially no effect, because you’re not guessing.
And even if you are guessing, they thought that a more informed guess should show less anchoring. To test this, the new questions they added to the mix included some items that Hampshire students might already know about, like the number of faculty at the school.
Sure enough, the Lincoln question showed almost no anchoring effect, and neither did their question about the number of professors at Hampshire. “Our hypothesis suggests that prior knowledge has an adverse influence on anchoring effectiveness,” they wrote, “and our analysis corroborated this, where we found that current Hampshire students were less influenced by anchoring [on the question of the number of professors] than other groups. Comparing the two anchor indexes supports this statement: 0.21 for current students (61.7%) versus 0.47 for other groups (38.3%).”
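The “anchor index” the students report is Jacowitz & Kahneman’s anchoring index: the gap between the two groups’ median estimates, expressed as a fraction of the gap between the anchors themselves, so 0 means no anchoring and 1 means estimates track the anchors one-for-one. The class did its analysis in R, but the arithmetic is simple enough to sketch in Python; the `anchoring_index` helper and the estimates below are mine, invented for illustration, not the class’s data:

```python
from statistics import median

def anchoring_index(low_anchor, high_anchor, low_group, high_group):
    """Jacowitz & Kahneman's anchoring index: the difference between
    the median estimates of the two anchor groups, as a fraction of
    the difference between the anchors themselves."""
    return (median(high_group) - median(low_group)) / (high_anchor - low_anchor)

# Hypothetical estimates for "babies born per day in the US",
# with anchors of 100 and 50,000 as in the example above.
low_group = [800, 1_000, 1_500, 2_000, 5_000]          # saw the 100 anchor
high_group = [10_000, 15_000, 20_000, 25_000, 30_000]  # saw the 50,000 anchor

ai = anchoring_index(100, 50_000, low_group, high_group)
print(round(ai, 2))  # 0.37 for these made-up numbers
```

An index near 0.5, like the 0.47 the students found for non-Hampshire respondents, is typical of what Jacowitz & Kahneman reported; an index near zero, like the Lincoln question, means the anchor barely moved anyone.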
When we discussed the final manuscript that Wednesday, I told them I was very disappointed. The replication was great, they didn’t butcher it at all. Not even a little!
Replication 2: 💵
Given the success of the first replication, for replication two I had students split up into three groups, with each group led by one of the three TAs.
This wasn’t the original plan, but it worked really well. In their self-evals at the end of the semester, students mentioned that splitting up into groups for the second replication gave them the chance to practice skills they had merely observed in the first replication. One student said:
The first big group was a little overwhelming, I tried to help with stats and I did some work on the manuscript but the goliaths of the class were so competent it didn't leave a lot of room for the new people like myself to work. I appreciated moving into smaller groups during our replication of money priming because I started to learn what I was good at and I got some footing. I was able to actually apply the new knowledge I had which was pretty cool.
So I think that splitting into smaller groups after the first replication is a good idea, and I would do this again. Making the groups even smaller might be better, but I only had three TAs.
As the student hinted, this replication was all about the money. Specifically, Caruso, Vohs, Baxter, and Waytz (2013), who did a series of five studies where briefly exposing people to pictures of money, or asking them to unscramble phrases related to money (henceforth “money priming”), led them to endorse “free-market systems and social inequality” more than people who did a control task.
The original plan was to replicate Experiment 1 from this paper, but since we were splitting up into groups, I gave them the opportunity to replicate any of the five studies. Despite this, all three groups chose to replicate Experiment 1.
The three groups ran their studies in parallel under the supervision of the TAs, producing three manuscripts: “Money, Money, Money”, “Ain’t it Funny”, and “In a Rich Man's World”. This naming convention was their idea; I prefer Pink Floyd.
“Money, Money, Money” recruited 54 participants, and found no effect of money priming on system justification.
“Ain’t it Funny” recruited 66 participants, and also found no effect, with a trend in the opposite direction as the Caruso et al. original.
“In a Rich Man's World” recruited 51 participants, though due to an overlooked issue in the survey software, they were not able to tell which participants were in which condition, and couldn’t complete their analysis. Here is the entirety of the text in their results section: “lol”
The sample in Caruso et al.’s original study was “thirty adults from a university study pool”. Not to put too fine a point on it, but each of my groups had a larger sample size than the original, almost twice as large on average. Despite this, and despite three largely independent attempts at replication, they did not find the same effect as was reported in the original study.
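For what it’s worth, the basic analysis behind a “no effect” verdict like this is just a two-sample t-test comparing the primed group’s mean score to the control group’s. The class ran these in R, where t.test() computes Welch’s unequal-variances version by default; here is a minimal hand-rolled Python sketch, with hypothetical 1–7 system-justification scores standing in for real data:

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's two-sample t statistic and degrees of freedom
    (no equal-variances assumption)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical system-justification scores on a 1-7 scale,
# invented for illustration (not the class's data).
money_primed = [4.1, 3.8, 4.5, 4.0, 3.6, 4.2]
control = [4.0, 4.3, 3.9, 4.4, 3.7, 4.1]

t, df = welch_t(money_primed, control)
# A t statistic near zero, as here, gives no sign that the
# primed group endorsed the system more than the controls.
```

In practice you would hand this t and df to a t distribution to get a p-value (R’s t.test() or scipy.stats.ttest_ind do all of it in one call); the sketch just shows where the number comes from.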
Replication 3: 🧓🏽
The third replication was, by design, a study that students would have to run in person. Online surveys are fine, but students should know how to run an in-person experiment if they need to. This is a lot more labor intensive than running a study online, so for the third replication the whole class came together again and replicated Experiment 2 of Bargh, Chen, & Burrows (1996).
In the original study, a true classic of social psychology, people completed either a scrambled-sentence task full of words related to elderly stereotypes (like “wrinkled” or “Florida”) or a scrambled-sentence task with neutral words. Then they were directed to walk to the elevator, at which point a confederate secretly timed how long it took them to reach the end of the hall. Bargh et al.’s finding was that people who did the scramble with elderly stereotype words took longer to reach the end of the hall than those who did the scramble with neutral words, the idea being that when you’re “primed” with the concept of being old, you do old person things like walk slow (in their words, you “act in accordance with the trait concepts that participate in that stereotype”).
This design was much harder to replicate than our previous projects, but by this point students knew what they were doing. The TAs found and reserved rooms so that the study could be run in person. A protocol was drawn up — students took this very seriously, insisting that it should be completely double-blind. People volunteered to spend part of their weekend running the experiment (and covered for each other when some people couldn’t make it). Students even bought candy bars with their own money (without telling me beforehand, I didn’t sign off on this) to help recruit participants!
Part of the reason this project hummed along so well is that I started introducing specialization. The whole class worked on the same study, but I put one TA each in charge of 1) designing the study materials and protocol, 2) running the protocol and collecting participant data, and 3) doing the analysis and writing the majority of the manuscript. Students volunteered to work on whichever part they liked, and we ended up with a very even three-way split.
This got a little messy at points — the group in charge of analysis had very little to do in the first week of this study, the group in charge of materials had very little to do the second week, and the group running participants had a huge crunch over the weekend in between. But it did mean that each part of this much more complicated project was fully accounted for.
And once again, they completed the replication almost perfectly, barely butchering it at all.
With only a few days to complete recruitment for an in-person study, they got 27 signups, of whom 14 showed up to the study. Like I told them, participant no-shows are a fact of life. And something always goes wrong in data collection, so in the end, they were only able to use data from 8 participants. Given the time constraints, this is still very impressive.
While eight people is too small of a sample size to give us much confidence in the results, we can at least say this: they found only 0.12 seconds of difference in the walking time between conditions, and it was the control condition that walked slightly slower than the elderly priming condition, the opposite of the original finding.
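To put a number like “0.12 seconds of difference” in context, it helps to express it as a standardized effect size. Here is a small Cohen’s d sketch, again with invented walking times rather than the class’s data:

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(a, b):
    """Cohen's d: the difference between two group means,
    in units of their pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_sd = sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
                     / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled_sd

# Invented hallway walking times in seconds, four per condition.
control = [7.2, 7.0, 7.5, 6.9]         # neutral words
elderly_primed = [7.1, 6.8, 7.4, 6.9]  # elderly-stereotype words

d = cohens_d(control, elderly_primed)  # positive: control walked slower
```

With four people per condition, only a very large true effect would reliably reach significance, so a small difference in the opposite direction is best read as inconclusive, much as the students concluded.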
In the three weeks left at the end of the class, students split up into groups of their choosing and replicated a project on their own.
All class periods from this point forward were dedicated to working on final projects. I approved all the designs and was available to answer questions, but aside from this, these replications were completed with minimal supervision from me and the TAs. Nevertheless they were very successful — here’s a quick look at how a few of them turned out:
One group chose to replicate Carney, Cuddy, and Yap (2010), the original “Power Posing” study. This was quite ambitious, because power posing can’t be run as an online survey; it has to be run in the lab, and several parts of the original study (like the salivary hormone analysis) were beyond the scope of what we could do in class. But this group came up with their own version of the protocol, and ended up with a final sample size of 16 people after exclusions. They found no evidence of a power posing effect, though of course their sample size was too small to detect anything.
Another group submitted a manuscript titled “Memory Wiggle”, a replication based on Experiment 1 of Roediger & McDermott’s 1995 paper, “Creating false memories: Remembering words not presented in lists”. This experiment was originally run in a classroom, but they designed a version of the false-memory paradigm that could be run as an online survey instead, so they could get more signups. They collected 25 participants, and despite the small sample size and notable difference in study design, they successfully created false memories of having seen the words “chair” and “sleep”, replicating Roediger & McDermott’s original results.
The final group I’ll mention replicated Greene et al. (2007), “Cognitive Load Selectively Interferes with Utilitarian Moral Judgment”. The original study was quite long, so they cut it down somewhat, in hopes that more people would be willing to complete a shorter survey. They also had to come up with a new cognitive load manipulation, since the manipulation from Greene et al. was too technically complex for them to implement in just a few weeks. I suggested a simple “remember an 8-digit vs a 3-digit number” manipulation. In the end they recruited 88 participants, 6 more than the original study, and found no difference in the mean reaction time for utilitarian responses between the high-load and low-load conditions (p = .954). Their design was somewhat different than the original study, but the final analysis has reasonable statistical power. If the original effect is real, this suggests that it is not very robust.
In conducting these replications, students created something of real value. Working closely with me and the TAs, they were able to independently confirm and extend the results of Jacowitz & Kahneman (1995). And with almost no support at all, one of the final projects confirmed the findings of Roediger & McDermott (1995), in a conceptual replication with a notably different design. This suggests that both of these effects are extremely robust.
Whether or not the replications that found null results are evidence against the original findings is more of a judgment call. But we can at least note that in many cases, my students’ replications had larger sample sizes than the original studies, and they corroborate other failed replication attempts from other independent teams.
I also can’t overstate the value to their education. It’s one thing to read about a landmark study in a textbook; it’s something else to confirm it (or fail to confirm it!) for yourself. It’s one thing to puzzle over someone’s methods by reading their paper. Why did they do that? But when you replicate it for yourself, it often becomes obvious why the authors designed things the way they did.
Students took these projects and made them entirely their own. The manuscripts were their creatures. Cracking jokes in the methods section, running around campus to chase down participants… one group even asked if they could leave class early and take a field trip to drive around to other colleges in the area and put up more recruitment posters. Naturally I said yes. In the words of one student:
This class had the most people speaking up, asking questions, answering questions, making jokes and being active participants out of all my other classes in the past year. It was one of the most enjoyable classes I’ve ever taken … The model of the class, in my mind, was also successful. We were always doing and making and planning, there was never a dull moment. Each class had a clear agenda and, for the most part, each group did a fantastic job meeting the deadlines and presenting on time. I was genuinely impressed with how much we were able to get done, especially with such an alternative approach to a classroom. But it was a huge success! I learned so much.
It was interesting to see how people with no preconceptions about research, or at least fewer preconceptions than the average academic, went about writing a manuscript.
Their manuscripts are dazzlingly short, which is refreshing. They’re also very retro — their papers read a lot more like Science letters from 1903 than PNAS papers from 2012. I admit I was a little concerned that students didn’t spend more time in their introductions helping the reader understand the context of each replication. On the other hand, I appreciate that there was very little throat-clearing.
Students felt strongly about open materials, and all of the manuscripts included appendices crammed with their study documents. This often included every single question from their survey, verbatim.
I suspect this came from being frustrated with the fact that in some of the papers we tried to replicate, the original materials were very difficult or even downright impossible to find. Students appreciated when papers included any materials as an appendix, and I think they resolved to do the same in their own manuscripts. One student wrote, “I learned that you should always check to see how much material is stated (or linked to) in the original study. This can really mess you up if you want to conduct a faithful survey.”
Open data was not as much of a focus, maybe because most of them did not spend much time working with the data. A small number of students conducted most of the analyses, since only a few of them had any statistics background.
Even the students who had already taken statistics tended to find that their classes had not prepared them to do an original analysis from scratch. To help fill the gap, I offered some optional statistics lectures outside of class, and most of the analyses in their manuscripts came from skills that students learned in my extra lectures. Anyone who wants to lead undergraduates in conducting research should be aware that you will need to give extra support in the analysis.
One thing that we re-discovered together is that pilot testing is not optional. I had forgotten to include pilot testing as a focus of the class (in part because the pace of the replications didn’t leave much time for piloting), but after a few near-misses, we were reminded that the first draft of any study design always contains mistakes, and often these mistakes are so bad that they render a survey or its results unintelligible. So it’s always good to run through the design a couple times and make sure it’s producing intelligible data before recruiting your participants. In their course evaluations, when I asked them what they learned, students mentioned pilot testing a lot! Maybe the best way to learn why a quality check exists is to see it go wrong.
How to Butcher More Classics
This class was much more successful than I expected. I felt good about the main idea, but I also knew that it was a risk. I was open to the possibility that one or two replications might be total trainwrecks. No matter how talented and motivated students might be, two weeks is simply not a lot of time, and research is complicated. There is a lot that can go wrong.
Instead, almost nothing went wrong. All the replications went smoothly, and students with no prior experience conducting studies happily wandered onto the worksite, picked up some tools, and started re-working the masters.
This approach might not always be so blessed. A lot of our success comes down to the luck of having three amazing TAs, and the luck of having a few students in the class with a knack for research, who could steer the first few replications while the rest of the class were still finding their legs. Teaching this again with a new set of students would probably turn out very differently — it’s a very personal course, and depends a lot on the personal qualities of the students and the professor. Even so, it is a proof of concept that this hare-brained idea can be a success.
There are a couple of things I might change next time around. The final projects went well, but students told me that their groups sometimes felt a little aimless. Next time I might assign a TA to each of the final projects so they get a little more supervision.
Though I think these replications were great and I recommend this process as a means to learn about replication, I do think there should have been greater management and supervision over groups (especially final project groups) by either the professor or teaching assistants. That way the onus of managing these projects and ensuring that everyone is doing their fair workload is not on students. Even if students are expected to do the project management aspect themselves, they should not be expected to be solely responsible for things like disputes over who did what work or breakdowns in communication.
I might also insist on the final project groups being somewhat smaller. Four or five people seems ideal; any larger and there starts to be real diffusion of responsibility.
You could specifically assign “stations” to the final project — this person does the analysis, this person designs the materials, this person recruits participants, etc. — but that feels artificial, and I think it’s better to let students self-organize. Another thing I learned is that apparently some students will discover a passion for project management!
I also wonder if this class format only works at Hampshire College, or if you could teach this course anywhere. My friend Adam Mastroianni told me, “I don't know if Harvard students could have done it.” (He taught at Harvard.) Ultimately, this is an empirical question. Someone should uhhhhhhhhhhhh try to replicate it!
Hampshire does have some advantages, and the big one is obvious: no grades. I have no idea how I would assign grades based on the work students did in this class. The course was intentionally chaotic, and it would be very hard to keep track of who deserves “credit” for work on each replication. And furthermore like, credit for what? A successful replication is not the real goal of the course. It would be ok if all the replications were trainwrecks, as long as the experience was educational.
This was a very liberating course style. It really did feel as if all that mattered was the work we did and what we turned in, free of any superficial academic posturing.
If you had to assign grades, I would recommend something very simple, like grading based on attendance/participation — e.g., anyone who shows up and seems engaged most of the time gets an A.
Other than that, I’m not sure how things would go down if you tried this at another school. But I do know that if you show up on the first day of class and say, “Hi students, this semester I am trying something new; I think it is important, but I have no idea if it will work”, that will set the right tone. And if you can sneak the word “butchers” into the course title, you’re golden.