Hindsight is Stats 2020, Part III: Final-First Exams

Jul 31, 2023

[I originally wrote this in August 2020, when I was teaching courses as a PhD student at NYU, and I’m reposting it here for reference. This is Part III; Part I is here and Part II is here.]

Exams were my white whale for this course.

My design goals were clear (see Part II). Someone who knows their stuff should be able to prove what they know and walk out of the class. Students should be encouraged to learn as fast as they can, and they should be rewarded for getting ahead of the class if they want to. And there should be almost no consequences for failure, so that students can experiment without torpedoing their grade.

But exams are famously plagued with problems. Rescheduling exams for students who are sick or have to miss a day. Deciding who is allowed to do make-up exams. The endless questions about exam format — “professor, will this be on the final?” Somehow, we complain about all this but take it for granted. Why not come up with a way to make these problems a thing of the past?

1. Final-First Exams

These days, professors have gotten more comfortable experimenting with exam formats. Lots of exams are open notes, open book, or even take-home. Some classes let you drop your lowest exam score. I’ve even heard of professors giving five exams and dropping your worst two.

Dropping tests is cool, because it fixes some of the classic problems. Have to miss an exam? No problem, just drop that one. No need for make-up exams. If you bomb an exam, just drop it.

This is the right direction, but we can go further. What else can we tinker with, to make exams even better?

I thought back to the cumulative format, and why it doesn’t work for teaching. Why have cumulative exams, then? Doesn’t it just serve to obscure your expectations? My class format was fractal (see Part I), so that students could see what’s coming, know what’s expected of them. Why not use this approach with exams, too?

Dropping one exam isn’t cool. You know what’s cool? Dropping ALL the exams.

I call the format Final-First, because your first exam is a final exam. In fact, every exam is a final exam, meaning every exam covers all of the material covered in the whole course. The exams have nearly identical formats, differing only in the particulars. I swap out the numbers and some of the details on the questions, but once you’ve seen one final, you have a pretty good sense of all of them.

This course was six weeks long, and I gave them a final exam at the end of every week. This means they had a final exam at the end of Week 1, at the end of Week 2, at the end of Week 3, and so on…

Since these were all final exams, I didn’t expect most of them would do very well on the first exam. But that’s ok, because we dropped all their exam scores except for the best one. The exam grade, as it contributed to their grade for the class as a whole, was entirely based on their best exam. Other exam grades didn’t contribute at all.

If a student gets a 90% on the third final, it doesn’t matter how they did on the first two. Why should a student suffer if they get a 10% on the first exam but manage to nail it with a 90% later on? Clearly that student has done a great job and learned all the material we wanted them to, even though they struggled at first. In fact, isn’t that more impressive?
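The grading rule above is simple enough to write down directly. Here is a minimal sketch in Python, not the code used in the course: the function name is made up for illustration, a skipped exam is recorded as None, and treating "took no exams at all" as a zero is an assumption.

```python
from typing import Optional, Sequence


def exam_grade(scores: Sequence[Optional[float]]) -> float:
    """Return the best score among the exams a student actually took.

    A skipped exam is recorded as None and is simply ignored.
    Returning 0.0 when no exam was taken is an assumption of this sketch.
    """
    taken = [s for s in scores if s is not None]
    return max(taken) if taken else 0.0


# Hypothetical student: bombs Exam 1, skips Exam 2, nails Exam 3, skips the rest.
print(exam_grade([10.0, None, 90.0, None, None, None]))  # 90.0
```

Because the rule is a plain maximum, a bad or missed exam can never pull the grade down.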

This format has some great features, which are beautifully in line with my design goals:

Good Incentives: If you understand the material quickly, you should be rewarded. Students who succeed are rewarded with more freedom. No one who has mastered the material should be forced to go through the motions. If you get a grade you’re happy with, you can choose to skip the rest of the exams with no downside.
Safety Net: Each exam offers a new chance to set a minimum threshold for your grade. Once you get an 85 on one exam, you can rest easy that your grade won’t go any lower. With this design there are no consequences for failure. You can bomb (or miss) as many exams as you want without any risk to your final grade.
Low Anxiety: Students who are able to get a good grade on one of the early exams will be able to worry about things other than cramming for the next exam. Maybe they’ll use it to study more, or maybe they’ll just go to the beach. I don’t care. If you can get an 80 on the final exam in week two of a six-week class, you deserve to go to the beach.
Transparency: With this format, there’s no more need for, “what will be on the test?” Once you have taken the first final, you will know (approximately) the format of all the other finals. This has the added benefit of:
Context: Seeing all the material at once will allow you to begin building a tapestry of ideas in your head. You will never be blindsided by new material, things you didn’t realize were expected of you. Once you’ve seen one final exam, you’ve seen them all, and being exposed to all the material early on will help you learn it better.
Feedback: You will be able to tell what skills you have mastered and which you need to work on. This will allow you to spend your study time wisely. Previous exams become a great tool for review. You can go over your performance with the TA or professor and be able to see exactly what you need to work on for the next exam, because the next exam is so similar.

I was really happy with this design. It hit all of my design goals, and it resolves a lot of the classic problems with exams.

Other people liked the idea too. I was on a date with another PhD student and we were talking about teaching, so I told her about this design. She said, “that sounds a bit insane upfront, but not so much when you think about it.”

Now there was nothing to do but try it out. For this class, I made the exam 50% of the final grade. Normally, making a single evaluation a huge chunk of the grade is unfair. But with this format, the exam grade is the best of six evaluations, and besides, the exams test what I really want them to know.

1.1 The Results

Final-First exams worked really, really well.

I was worried that students would be confused by the format, or would be terrified when they failed the first exam, but I actually got very few questions about it. Students seemed to understand what I was trying to do.

It really did solve all the usual exam problems. No one ever asked me for a makeup exam. Only once did I have to clarify what would be on the exam. When students wanted to meet to go over their answers, we were able to make real progress, because it was immediately clear to me what parts of the material they had mastered and what they were still struggling with. In many cases we could look back over two or three different exams and see the same thing tripping them up every time over multiple weeks.

Most people improved steadily over time. The average grade went from 60% on Exam 1 (this was by design; see below) to 85% on Exam 6. Students took the exams pretty freely. Some of them took every exam, but on average they took only 4 of the 6 exams.

A few students actually got their best grade quite early on. On the first final, at the end of the first week of class, the highest grade was an incredible 88% (!!!). This student kept taking exams, though, and was able to eventually beat her record with a 92.5% on Exam 5.

The student who got the second-highest score on Exam 1 got an 84%, again very high for having taken only three classes. This student chose to skip most of the other exams. He did take Exam 5, but only got a 75.5%, so in the end his final grade was actually based on his exam score from the first week of class!

I was a little surprised that more students didn’t try to get a solid grade early on. When I think about this format, one of the most exciting things to me is the idea that you can teach yourself all the material, get ahead of the class, get a great exam grade halfway through, and not have to show up to class anymore. But while a few students got great scores on Exams 3 and 4, that was the exception. It might be different in a semester-long class. Six weeks is just not much time to teach yourself, even if you really commit to it!

These are extreme cases of the safety net working as intended, but the design worked equally well for students with less extreme grades. To my surprise, only 26 of the 39 students took Exam 6, the final final exam. I think this means that by the end of the class, many of them were satisfied enough with their exam grade that they chose not to take this last final. Of those who did take Exam 6, only 18 got a better grade on the final final than on any previous final, which means that 8 people didn’t improve their grade at all on the final final.

The best exam grade in the entire course, a 97.5%, was actually earned on Exam 5. Perhaps unsurprisingly, that student chose not to take Exam 6.

These grades are really impressive, because the exams were not easy. I came in with specific expectations of what a student should know by the end of intro stats. These expectations were reasonable, but they were also pretty high. We expect too little of undergrads, and we underestimate what they are capable of doing and understanding.

I didn’t change my expectations at all during this course. Every student who earned a 90% on an exam met my expectations, and every student who did better than that exceeded my expectations. In my opinion, a good grade means that they mastered the material.

1.2 Student Opinion

Students really liked the exams. Some of the most positive feedback was about this part of the class. Take a look:

“This was one of my favorite aspects of the course because it genuinely did relieve a lot of stress. My biggest fears for this course revolved around completing it and not only doing poorly, but also learning nothing. I think the weekly exams allowed me to continually refresh and apply what we had reviewed without the anxiety of failing the course.”

“I thought the idea of getting graded based on the best exam was exceptional since we learn more as we continue taking the class.”

“To be honest, this is the best [exam] format I’ve ever taken! It really gives me the motivation to study harder each time without getting too stressed out.”

Other comments were much the same. As you’ll notice, the experience students had with the format was exactly the experience I was aiming for. A few other notes of interest were:

“I found myself studying ahead of time to supplement the material I have not learned yet”

“Towards the end it was fine, but the first few were pretty stressful for me.”

The one complaint, which I did see a few times, was that the exams tested them on questions they didn’t recognize and hadn’t seen before. But of course, this was by design, because I wanted to see if they really understood the concepts.

Some students seemed to understand this, with one noting, “[Ethan] helped us prepare as best as we could without actually giving us the answers.” And once again I’ll point to their excellent exam grades as proof that the difference in format wasn’t actually a problem.

2. Exam Design

This format is certainly the most interesting part of the exams. But the design of the exams and the exam questions is worth discussing as well.

The Final-First exam format doesn’t work if you don’t pay close attention to the design of the exams. Exams need to be nearly identical, so that students always know what’s coming on the next one. But they can’t be too similar, or else students will memorize them by rote. You need to keep mixing it up.

I had a plan for the exams going in. As I argued in What You Want from Tests, exams should be used to test the knowledge that students carry around in their heads, the bits that an expert will internalize. That’s what I was aiming for in this class. Research reports would cover their ability to actually do stats, and exams would cover their memory and intuition for the most important concepts.

Then, of course, the whole course was forced online. I immediately knew this meant that exams would de facto be open book, open notes, and really, open Google. So I knew that I would have to pivot away from my original plans. I couldn’t just focus on internalized knowledge.

(I never explicitly told students that the exams were open notes, but I never told them not to look things up either.)

I actually think this ended up improving the exams. I stand by what I said in What You Want from Tests, but it can be more complicated than I imply in that essay.

2.1 Exam Structure

The structure of the exams mirrored the structure of the course — after all, every exam was a final. Each exam was 50 points in total. Of that, 15 points had to do with basic data skills, 15 points went to descriptive statistics, and 15 points were on the use and interpretation of inferential statistics. Just like the course, the exams were divided into these three sub-topics.

The remaining 5 points went to what I called “advanced topics”. These were questions about things I mentioned in lecture but that were slightly outside the scope of the class, more complex questions about the use of core concepts, or questions that tested their intuitions in ways we had hinted at, but hadn’t explicitly discussed.

An interesting consequence was that a student who mastered all the core material, but hadn’t yet achieved that deeper understanding, would only get a 90% on the exam, because the advanced section was the last 10% of the exam grade. A grade of higher than 90% means that a student understood not only all of the material at the expected level, but was making progress into understanding it more completely.

This is why I am so confident that the students who got above a 90% on their exam grade not only met my standards, they exceeded them. That last ten percent came from questions that were, by design, more difficult than an intro stats student should be able to answer.
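The arithmetic behind that 90% ceiling can be checked in a few lines. This is only an illustrative sketch: the point values come from the section breakdown above, while the dictionary and variable names are made up.

```python
# Point allocation per exam, as described in the text.
SECTION_POINTS = {
    "data skills": 15,
    "descriptive statistics": 15,
    "inferential statistics": 15,
    "advanced topics": 5,
}

total_points = sum(SECTION_POINTS.values())
core_points = total_points - SECTION_POINTS["advanced topics"]

# Highest possible percentage for a student who aces every core section
# but earns nothing on the advanced questions.
core_ceiling = 100 * core_points / total_points

print(total_points)  # 50
print(core_ceiling)  # 90.0
```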

2.2 Exam Difficulty

Maybe other teachers already know this, but something I had never realized before was that a teacher has a lot of control over the difficulty curve of an exam. I knew that a professor could make an exam more or less difficult, but I didn’t understand that you have a lot of control over the distribution of scores.

This was particularly important for a class using the Final-First exam format. In this system, most students take a final exam in Week 1, and of course most of them will bomb it. There’s a big difference in morale, however, between bombing an exam with 50% and bombing it with 5%!

I wanted to encourage students to do well. I wanted to make sure they felt like they could succeed from the very beginning. To make this happen, I designed the exam so that it was easy to get a decent score, but hard to get a great score. (For those of you who are statistically inclined, compare item response theory.)

(This is also how I asked Liz to grade the research reports. Make it easy to get a decent grade but hard to get a perfect grade, I said.)

I had already decided that 15 points, or 30% of the exam, was devoted to data skills. This stuff is pretty easy, and so I knew that most students would be getting a good chunk of points from this section right from the start. In the other two sections, I made sure to include a couple easy questions, to keep the baseline grade relatively high.

The fact that the average score on Exam 1 was 60% shows that I was successful. In fact, even in Week 1, the lowest exam grade was only a 40%. That doesn’t sound like much, but considering that we were only 17% of the way through the class, I think it’s pretty good.

I used some other tricks for this as well. One was that the exam was almost entirely multiple-choice. A classic problem with multiple-choice questions is that students always have a decent chance to get the right answer by just guessing. For example, a student guessing on a multiple-choice question with four answers will get the right answer 25% of the time. An exam with nothing but 4-answer multiple-choice questions has a baseline grade of 25%. It’s even worse for an exam that’s all true/false, which has a baseline of 50%. This is why up until 2016, the SAT took off 1/4 of a point for each wrong answer. Since those questions had five answer choices, it meant that a student who did nothing but guess would, on average, get a score of about zero.
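To make the guessing baseline concrete, here is a small sketch of the expected score per question for a student who guesses uniformly at random. The function name and the one-point-per-question scoring are assumptions for illustration. The last line uses five options with a 1/4-point penalty, which is how the old SAT multiple-choice questions were scored.

```python
from fractions import Fraction


def expected_guess_score(n_options: int, penalty: Fraction = Fraction(0)) -> Fraction:
    """Expected points per question under pure random guessing.

    A right answer earns 1 point; a wrong answer costs `penalty` points.
    """
    p_right = Fraction(1, n_options)
    return p_right * 1 - (1 - p_right) * penalty


print(expected_guess_score(4))                  # 1/4
print(expected_guess_score(2))                  # 1/2
print(expected_guess_score(5, Fraction(1, 4)))  # 0
```

Exact fractions are used here so that the zero in the last line is a true zero rather than a floating-point near miss.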

But we can turn this same force to our advantage. To adjust the baseline score, I can change the number of answers I include for my multiple choice questions. This is exactly what I did. For the Data section, which I wanted to be a score-booster, all the multiple choice questions had only a few answers each. For the Advanced section, where I wanted students to earn points only if they really knew their stuff, most of the multiple choice questions had 8 or more response options! And for the other sections, which I wanted to land somewhere in between, I included a mix.
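A rough way to see how option counts shift the baseline for a whole section is sketched below. The option counts are hypothetical (the text only says "a few" for Data and "8 or more" for Advanced), and the sketch assumes every question in a section is worth the same number of points.

```python
def section_baseline(option_counts):
    """Expected fraction of a section's points earned by pure guessing.

    `option_counts` lists the number of answer options for each question.
    Assumes all questions in the section carry equal weight.
    """
    return sum(1 / k for k in option_counts) / len(option_counts)


# Hypothetical option counts, chosen only to illustrate the idea.
data_section = [3, 3, 3, 3]      # "only a few answers each"
advanced_section = [8, 8, 8]     # "8 or more response options"
mixed_section = [3, 4, 5, 8]     # somewhere in between

for name, counts in [("data", data_section),
                     ("mixed", mixed_section),
                     ("advanced", advanced_section)]:
    print(f"{name}: {section_baseline(counts):.1%} from guessing alone")
```

With these made-up numbers, guessing is worth about a third of the Data section but only an eighth of the Advanced section.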

Of course, there are limits to how lenient we want to be. In particular, true/false questions seem too easy — a baseline of 50% just from guessing is way too high. One idea that I really like is True / False / Can’t Tell questions. At a shallow level, these are just true/false questions with three options instead of two. But at a deeper level, this encourages students to engage with the question in a new way. Instead of just determining which answer is right, they have to think about whether they even have enough information to make that call. It literally adds another dimension to the question. This is especially well-suited to statistics, which is all about making informed guesses based on limited information.

I used a similar approach in some of my short answer questions. I’ve noticed that in class, students are often much more comfortable telling you why something is wrong than trying to give you the right answer themselves. I translated this into “What’s wrong with…” questions. Students would be given a short paragraph that described some statistics. For the most part these were perfectly normal paragraphs, but I had always inserted at least one error. For example, sometimes I would say that a variable wasn’t skewed, but I would report a mean and median that were strikingly different (which is the classic sign of a skewed variable). Students would have to pick out the mistake and tell me why it was wrong.

This is a really important skill in real life. A big part of the practice of using stats as a scientist is noticing when something is wrong in an analysis, whether you’re checking your own analysis or looking over someone else’s work.

I included one of these questions in the Data section for almost every exam, since they are a good way to ask about data features like skew and range without just asking students to regurgitate the definitions. I also included a few in the Descriptive Statistics sections, and I think that added some nice variety. You know a student doesn’t understand correlation when you report r = 1.2 and they don’t catch it.
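The two planted errors mentioned above (an impossible correlation, and a "not skewed" claim next to a mean and median that are far apart) are the kind of thing you can also check mechanically. This is a hedged sketch only: the function, the dictionary keys, and especially the 0.2 threshold for "strikingly different" are assumptions, not anything from the course.

```python
def find_errors(report: dict, skew_threshold: float = 0.2) -> list:
    """Return a list of problems spotted in a reported set of statistics.

    `report` is a hypothetical dict with optional keys:
    "r", "mean", "median", "sd", "claims_not_skewed".
    """
    problems = []

    # A Pearson correlation must lie between -1 and 1.
    r = report.get("r")
    if r is not None and not -1.0 <= r <= 1.0:
        problems.append(f"r = {r} is impossible: correlations lie between -1 and 1")

    # A large gap between mean and median (relative to the SD) contradicts
    # a claim that the variable is not skewed. The threshold is arbitrary.
    needed = ("mean", "median", "sd")
    if report.get("claims_not_skewed") and all(k in report for k in needed):
        if report["sd"] > 0:
            gap = abs(report["mean"] - report["median"]) / report["sd"]
            if gap > skew_threshold:
                problems.append("mean and median are too far apart for an unskewed variable")

    return problems


print(find_errors({"r": 1.2}))
```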

I realize now that I never included any of these questions about inferential statistics. This was a mistake, since catching errors in the reporting of tests is something that comes up all the time. If I taught this class again, I would put “What’s wrong with…” questions in all three sections of the exam.

Another way to control exam difficulty is with paired questions. You include two questions about the same topic, but one is easy, and one is harder. For example, in my descriptive statistics sections, I always included two questions where I described some data and asked students what plot or chart they should use to represent that data. By design, the first of these was always pretty easy, and the second was, while not exactly hard, a more sincere test of their understanding.

This has some great features. First, it helps raise their baseline score. A student who understands the idea even a little will usually get the first question right, and this will boost their grade. They essentially get partial credit on that concept, even though the question is multiple choice. (They say you can’t give partial credit on multiple choice questions, but what do they know?) But a student only gets full credit if they can answer the more challenging question. Again we see that the design makes it easy to get a decent grade, but hard to get a perfect grade.

Second, it helps with feedback. For any topic on the exam, if a student gets neither question right, they clearly do not understand the topic at all. If they get the easy one right but not the harder one, they understand the basics but haven’t quite got the whole idea. And if they get both right, it’s clear they understand it at the level I want them to. If they somehow get the hard question right and the easy question wrong, this tells you that they were probably guessing. You can look at the exam and see exactly how students are doing with each of the core skills.
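The four possible outcomes on a question pair map directly onto the four readings described above. A minimal sketch, with a made-up function name and labels paraphrased from the text:

```python
def diagnose(easy_correct: bool, hard_correct: bool) -> str:
    """Interpret a student's results on one easy/hard question pair."""
    if easy_correct and hard_correct:
        return "understands the topic at the intended level"
    if easy_correct:
        return "understands the basics but not the whole idea"
    if hard_correct:
        return "probably guessing"
    return "does not understand the topic yet"


def pair_points(easy_correct: bool, hard_correct: bool) -> int:
    """Points earned on the pair, assuming one point per question."""
    return int(easy_correct) + int(hard_correct)


print(diagnose(True, False), "|", pair_points(True, False), "of 2 points")
```

The second function shows where the "partial credit" comes from: getting only the easy question right earns half the points available for that concept.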

2.3 Difficulty Over the Course of the Class

As important as the difficulty curve within an exam is, it’s also worth mentioning difficulty curves over time. Part of the reason to make an exam easy to pass but hard to ace is that this is good for student morale, while still being an accurate measure of their ability. With a Final-First exam, you also want to worry about difficulty over time.

Students shouldn’t get a good grade on the first final unless they really know their stuff. Early on, exam grades should be pretty low. But if exam grades go down with every exam, or even if they fail to go up, that’s bad for morale. It tells the students that they aren’t learning anything from the class. That shouldn’t be true, and even if it is, you shouldn’t be telling them that!

My recommendation is that your hardest exam should go first, and your easiest exam (while still staying true to what you want them to get out of the class) should go last, with the other exams in between in decreasing order of difficulty. And of course, for the reasons described above, your hardest exam should still be designed so that on average students do decently on it. If the average score on the first final is less than 50%, you’ve probably done something wrong.

One thing that I would like to do someday is create a way to generate exams automatically. These exams are formulaic by design, so it would be relatively easy to write a script that could mix & match components and spit out as many exams as you want. Not only could this make the exams more fair and regular, you could do things like share multiple practice exams with your students.
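As a sketch of what such a script might look like (this is not something that was built for the course), here is a tiny generator that picks question templates per section and swaps in new numbers. The question bank, the templates, and the value ranges are all invented for illustration.

```python
import random

# Hypothetical question bank: a few templates per section.
QUESTION_BANK = {
    "data skills": [
        "A dataset has {n} rows. What type of variable is 'age in years'?",
        "A survey of {n} people records favorite color. What type of variable is this?",
    ],
    "descriptive statistics": [
        "A variable has mean {mean} and median {median}. Is it skewed?",
        "Which plot best shows the distribution of {n} exam scores?",
    ],
    "inferential statistics": [
        "A t-test on {n} participants gives p = {p}. What do you conclude?",
        "A study of {n} people reports p = {p}. Is this significant at .05?",
    ],
}


def generate_exam(seed: int, per_section: int = 1) -> list:
    """Build one exam as a list of (section, question text) pairs.

    The same seed always produces the same exam, so exams are reproducible.
    """
    rng = random.Random(seed)
    exam = []
    for section, templates in QUESTION_BANK.items():
        for template in rng.sample(templates, per_section):
            values = {
                "n": rng.randint(20, 200),
                "mean": rng.randint(40, 60),
                "median": rng.randint(40, 60),
                "p": round(rng.uniform(0.001, 0.2), 3),
            }
            exam.append((section, template.format(**values)))
    return exam


for section, question in generate_exam(seed=1):
    print(f"[{section}] {question}")
```

Changing the seed gives a new exam with the same structure, which is also an easy way to produce practice exams.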

3. Exams Online

As with everything else, I was worried about exams being online. There were the concerns around cheating, as I mentioned in Part I, and also just around giving an exam remotely.

I was wrong. Holding exams online is one of the best things I’ve ever done for a class. It was so easy that I am seriously considering using online exams for in-person classes in the future.

I ended up running all my exams through Qualtrics, a survey software I use in my research. Qualtrics is flexible and it has a lot of nice features that are helpful for exams, but I suspect you could run online exams with other survey platforms.

Exams were run every week. Since my students were located all around the world, and since many of them had jobs or other responsibilities, I opened the exam for a full 24 hours. Lectures were Monday / Tuesday / Wednesday, and every week the exam was open from 5:00pm ET Thursday to 5:00pm ET Friday. Using the survey software, it was easy to leave it open all day and let them drop in whenever they wanted. I also liked how this didn’t cut into class time.

Qualtrics automatically records the time when a session is opened and when it is submitted, so I used that to time their exams. The exam would begin as soon as a student clicked on the link, since that prompted Qualtrics to record the session start. I recommended that they time themselves to ensure that they didn’t go over. We compared their start and their submit times to see if they followed directions. Some of them did go over by a little, but we were lenient, and graded those exams too. To my surprise, no one tried to sneak in a much longer exam session.

After some pilot testing with my sister, I ended up making the exam only 45 minutes long. This isn’t much time, but I figured it would be easy to add time later if I had to. I was worried that students would complain, and fully expected that I would have to bump it up to 60 minutes after the first few exams. But this ended up being unfounded too. I didn’t get any complaints about the exam length — students never mentioned it! — and so I kept it 45 minutes long for the whole course.
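Comparing start and submit times against the 45-minute limit is easy to automate. The sketch below is an assumption-heavy illustration: the timestamp format, the five-minute grace period, and the function names are all made up, and your survey platform's export may look different.

```python
from datetime import datetime, timedelta

TIME_LIMIT = timedelta(minutes=45)  # exam length from the text
GRACE = timedelta(minutes=5)        # hypothetical leniency window


def time_taken(start: str, end: str, fmt: str = "%Y-%m-%d %H:%M:%S") -> timedelta:
    """Duration between a session's start and submit timestamps."""
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


def classify(duration: timedelta) -> str:
    """Label a session as on time, slightly over, or well over the limit."""
    if duration <= TIME_LIMIT:
        return "on time"
    if duration <= TIME_LIMIT + GRACE:
        return "slightly over"
    return "well over"


# Hypothetical session: started at 6:00pm, submitted 47 minutes later.
session = time_taken("2020-07-09 18:00:00", "2020-07-09 18:47:00")
print(session, "->", classify(session))
```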

Short exams also fit my design goals. There’s no need to belabor an examination. As long as it’s accurate, it should be as short as possible. Once again, I imagined how it would be if, through some horrible clerical error, I was forced to take the class myself. I knew I would be able to ace the exam in about 15 minutes, so I wouldn’t be forced to waste more than a tiny amount of time. That’s how it should be.

Running exams online also gave us huge benefits on the backend. Exams were incredibly simple to grade. Once all the scores were in, I would take the exam myself, putting in all the right answers and writing ANSWER KEY as the student name at the end. Then, when Liz downloaded all the responses for grading, she could just use Excel functions to compare each of their answers to the responses I put for the answer key, and automatically assign points that way. There were always a few short-answer questions to grade by hand, but the majority of the grading, for every single student, could be accomplished in just a few minutes.
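The grading in the course was done with Excel functions, but the same answer-key trick can be sketched in Python. Everything below is hypothetical: the field names, the point values, and the sample responses are invented, and short-answer questions would still need to be graded by hand.

```python
def grade(responses: list, points: dict, key_name: str = "ANSWER KEY") -> dict:
    """Score every student by comparing their answers to the answer-key row.

    `responses` is a list of dicts, one per submission, with a "name" field
    and one field per question. `points` maps question id to point value.
    """
    key = next(row for row in responses if row["name"] == key_name)
    scores = {}
    for row in responses:
        if row["name"] == key_name:
            continue
        scores[row["name"]] = sum(
            pts for question, pts in points.items()
            if row.get(question) == key[question]
        )
    return scores


# Hypothetical export: two students plus the instructor's own submission.
responses = [
    {"name": "Student A", "q1": "B", "q2": "True", "q3": "Histogram"},
    {"name": "Student B", "q1": "C", "q2": "Can't Tell", "q3": "Histogram"},
    {"name": "ANSWER KEY", "q1": "B", "q2": "Can't Tell", "q3": "Histogram"},
]
points = {"q1": 2, "q2": 1, "q3": 2}

print(grade(responses, points))  # {'Student A': 4, 'Student B': 3}
```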

And unlike working with scantron or paper forms, there is no headache when it comes to digitizing the results. Answers and scores were in a spreadsheet from the beginning.

It was easy to make answer keys for the same reason. Admittedly I didn’t know this at first — all the credit goes to Liz. It turns out that you can make Qualtrics generate a PDF of all the answers given by a specific person, so all we had to do was get it to spit out the ANSWER KEY responses and, surprise, there was the answer key. Again your mileage may vary, but online systems can be very powerful.

The online format does offer students the opportunity to cheat. But as I already mentioned, I don’t think they did, and I don’t think it would matter either way. There are things you could do to help prevent this, if you were worried, like giving a narrower exam window or putting out multiple versions of the exam to prevent crosstalk, the sorts of things we already do in the classroom. You could make projects a bigger part of their grade. But I think it’s to everyone’s advantage to trust the students.

With a well-designed exam, it will be easier to learn the material than it will be to cheat. The same goes for open notes. If you make a good enough exam, students will actually find it easier to leave their notes closed.

5. What I Didn’t Get To

I got to put almost everything I wanted in this course, but there were a few things I missed.

I’ve always wanted there to be a bigger role for teams, but the teams in this class didn’t work very well. It seems like there should be ways to encourage students to help one another out, reward them for working together. But all the ideas that come to mind, like giving students bonus points for helping their teammates, have obvious problems. So while I want to incentivize teamwork and peer support, I haven’t come up with a way to make it happen yet.

Students would also really benefit from giving and watching presentations. I was able to do this for my RA, and it’s clear to me that she gained a lot from making the presentations and from getting feedback. Criticizing presentations and giving feedback is also good practice for statistical literacy, and it might be less intimidating for the average student.

But it would be difficult to have every student give a presentation. It’s probably impossible for large class sizes, and it doesn’t seem like it would work well online. During the semester, you might be able to do it in recitation, either for extra credit, or in small teams.

But the real problem is that giving a single presentation is like answering a single math problem. It’s just not that much practice. Unless the class size were very small, you probably couldn’t set it up so that every student got to present multiple times. This might be better suited to an advanced course. The breakout room activities, given that they include small and regular “presentations”, might be the best we can do here.

6. Concluding Remarks

I’ve heard a lot about the things you can and can’t do when teaching stats. I’ve heard that you can’t get students to pay attention. That you can’t make them care about the subject. That they’re all cheating on their assignments. That they aren’t smart enough to learn how to use statistical software on their own.

Things are bad in education today, but they’re not bad because of lack of funding, or because students are unmotivated. Things are bad because educators lack vision.

What else do you call it when everyone knows what the problems are, but no one manages to dream up solutions? We have the ability to make education work for us, and nothing special is required, just careful thought and patient experimentation.

In particular, there are huge gains to be had in developing approaches that let students and teachers stress less over the material and waste less time. This may free them to spend more time learning, but it may also free them to have a life outside the classroom. A class with more hours of homework, longer tests, and more fiendish questions is not a better class. In most cases it is a worse one.

What could be better than learning more, with less effort, and in less time? Let us celebrate academic laziness. Perfection comes not when there are no more assignments to add, but when there are no more assignments to take away.

Students have almost no control, of course, but it’s confusing how teachers continue to design classes with backbreaking grading loads for themselves. Just give fewer assignments, shorter assignments, assignments that are easier to grade. You can do this without making your class worse. In fact, you can do it while making your class better.

So many teachers teach classes that they themselves would hate. If you wouldn’t want to take your class, if you wouldn’t find it easy, then what are you doing? It seems unnecessarily cruel. Make your classes enjoyable. If you can’t make them enjoyable, at least make them easy. If you can’t make them easy, at least make sure they’re not a huge pain.

So many teachers are paranoid about students cheating, collaborating, or doing too well on tests. Are you a teacher, or a mall cop? When classes are fair, students don’t cheat. Even when classes are rigged, most students still refuse to cheat. The paranoid approach creates a system where the most honest students are the ones who have the most to lose. I have seen too many honest students fail what should have been an easy class.

It’s August as I’m writing this, and online I have seen many examples of college professors sharing heavy-handed “how to be ok” pages or “COVID” pages that they plan to attach to their syllabi for the fall semester. These pages contain assurances that you can come to the professor with anything, that you can get extra time when you need it, and so on. Professors love these pages because they make them feel like they’re doing something to make a difference. But these promises are hot air and all your students know it. If the structure of your class is cruel, this kind of statement becomes a sick joke. And if the structure of your class is kind, then you don’t need a page at the front of your syllabus trumpeting it. It’s the fundamental rule of communication: show, don’t tell. Put your good intentions in the structure of your class or not at all.

Just make a class that doesn’t suck.

Ambika Gopalan

Aug 2, 2023

This is awesome! Quick question -- I'm wondering how much of your lessons incorporated acknowledging the huge assumptions anyone who uses any kind of stats makes (and the fact that you can't just "check" if they're "true" or "false"). As a practicing statistician and social science researcher I find that it's common to take a very formulaic approach to stats -- just plug in numbers into formulas, accept and reject hypotheses, and you're done! -- which is psychologically pleasing but ultimately a bad approach. Perhaps another area to evolve stats education?
