Hindsight is Stats 2020, Part II: Design Goals & Grades
[I originally wrote this in August 2020, when I was teaching courses as a PhD student at NYU, and I’m reposting it here for reference. This is Part II of III; Part I is here.]
Grades are stupid. But at the end of the day, my university forces me to give everyone a final grade. And you do want to evaluate your students based on something, so they can know what they mastered and how they can still improve.
1. Design Goals
To begin with, I tried to work out my design goals. I started by thinking about the ways that classes normally fail and decided to work backwards from there.
One of the most blatant failures in the education system is when students are forced to take a class that they’ve already taken, or on a subject they already know. So my first goal was that someone who really knows the topic should be able to get a 100 with very little effort. There’s an easy way to check if this works: the course should be designed so that if, as the professor, I were to take it, I would ace it easily.
And not just ace it. Someone who really knows the material should, after demonstrating their knowledge, be able to walk out of the course entirely and never have to come back. Once you know the material, you shouldn’t be forced to waste your time regurgitating it.
A related problem is forcing students to waste time on concepts they already understand or, conversely, moving on to new material before a student is ready.
This is tricky because students really do learn things at different speeds. We can’t tailor the lectures to every student, but we can do things to help. Students should be given freedom to focus on the problems they find challenging. Once a student has mastered something, we should try not to bother them about it.
Similarly, most classes don’t incentivize students to learn things on their own. There’s no point getting ahead of the rest of the class. You’ll just be bored, and it might even hurt you, since it will be taking away from the time you could be using to cram the old material. This is a perverse incentive. If a student is ready to go further on their own, we should let them.
Basically, if a student wants to speedrun my class, who am I to complain? Let them do it.
Another classic way that classes screw up is by making students afraid of failure. With traditional grading, students have no room to experiment with different ways of learning, studying, and understanding. The class format pushes them to obsess about every evaluation, and encourages them to do the minimum amount required to get the grade, to take no risks. If they try something interesting and fail, their GPA plummets. This leads students to obsess over pointless minutiae, like what precisely is on the test, and exactly how to word their answers.
I wanted to save students the time they normally spend thinking about this nonsense. If they choose to spend that saved time studying, so much the better. If they don’t, then all we are losing is their anxiety. Either way, we should reward students for taking risks and attempting to go deeper with the material, not punish them.
In the end I came up with three ways to evaluate student progress.
First, I had a system to replace class participation and attendance, based on small team activities, which counted for 30% of the final grade.
Second, I had students independently analyze two simple datasets of their choice, and write up a report about each. Together the two reports counted for 20% of the final grade.
Third, I invented a new exam format (covered in the next post), which counted for 50% of the final grade.
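To make the weighting concrete, here is a minimal Python sketch of how the three components combine into a final grade. The 30/20/50 weights come from the breakdown above; the function name, the dictionary keys, and the example scores are hypothetical and only there for illustration.

```python
# Weights taken from the post: teams 30%, research reports 20%, exams 50%.
WEIGHTS = {"team": 0.30, "reports": 0.20, "exams": 0.50}


def final_grade(scores, weights=WEIGHTS):
    """Return the weighted average of component scores (each on a 0-100 scale)."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical student: perfect team grade, 90 on reports, 80 on exams.
example = final_grade({"team": 100, "reports": 90, "exams": 80})
print(round(example, 1))
```

With a perfect team grade, this hypothetical student ends up several points above their exam score, which is the kind of padding the team component was meant to provide.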
2. Teams & Breakout Rooms
I really hate attendance.
Taking attendance is undignified. It’s disrespectful of students, who are assumed to be incapable of making informed decisions about their education, and of the professor, who is implicitly supporting that assumption. Sometimes students get sick, have a family emergency, or need to go to the dentist, and they should be able to take care of these things without worrying about their grade. They shouldn’t have to send me an email with a doctor’s note. I don’t like getting those emails—just stay home if you’re sick—and I’m sure students don’t like sending them.
All of this is doubly true of online teaching. All the lectures are recorded. Students can watch and re-watch my presentations as many times as they want. Why should any of us care about them being “in class” when that means almost nothing in a virtual classroom?
When I taught Introduction to Psychology last summer (2019), I tried using a participation-based system. Rather than taking attendance, I had my TA mark down when students spoke in class. The idea was that this would encourage them not just to show up, but to participate in class discussions. I also hoped it would encourage them to do the assigned reading, which we discussed each day.
This didn’t work. Students would speak up even when they had nothing to add, just to get the grade. The quality of discussion suffered for it. Some very shy students didn’t speak at all, and lost points, despite the fact that they were doing great in the class otherwise. It was a huge pain for my TA to keep track of it all. This system didn’t do anything I hoped it would, and I think it was a failure.
We could just chuck attendance altogether. But on the other hand, it’s good to have some kind of incentive for students to show up to class. Recorded lectures are about as good as live ones, but if students show up to class most of the time, they can ask questions and I can get a sense of what they do and don’t understand. It would be good to encourage most of them to be there most of the time. Can we come up with a way to make this happen?
2.1 Enter the Zoom Room
One of the things that everyone learned early on in the pandemic is that video calls suck. Jumping onto a Zoom call is excruciating, and afterwards you feel drained of all will to live. Turning off your camera helps, but not by much.
At first this seemed universal. People speculated that it was something inherent to the Zoom platform. There were theories that the video latency, however subtle, was unnatural and jarring. But over time, I noticed two exceptions. The first was direct calls, with smaller groups. Hanging out with one or two friends over Zoom, while not as much fun as hanging out in person, didn’t make me want to tear my eyes out the way a Zoom call with several people did.
The other exception was playing virtual trivia. Early on in the pandemic, my friend Liz from my PhD cohort set up a virtual trivia night for students in our program. In virtual trivia, we would start off all together in one Zoom room. For each round, teams would be sent off into individual breakout rooms for 10-15 minutes to answer questions. Then we would all come back to the main room for scoring. We’d do this process every round, with a couple of trivia rounds each night.
This was infinitely better than every other group call I had been on, and it wasn’t just that we were a group of PhD students drinking late into the night. The breakout rooms were just as relaxed as being on a small call, and they broke up the evening in a way that made the main room much more fun, even though the full group was pretty large.
When I started thinking about how to run an online class, I knew I would have to include something like this.
(Liz also happened to be my TA for the stats course!)
I had been wanting to incorporate something about teams for a while, and this seemed like the perfect way to do it. Instead of sending teams off for rounds of trivia, I would send them off to do breakout room activities, and call them back to discuss the answers.
These activities took different formats depending on the topic we were covering each day, but most of them worked something like this. I put up a question or a task on the slides, and then sent the students into breakout rooms for about 10 or 15 minutes. When they came back, I randomly chose a couple teams to share their answers.
Getting the correct answer wasn’t the point. If the group provided an answer that engaged with the activity, the group got credit, even if their answer was incorrect. The only way to get no credit was to not engage with the question or to give no answer at all. If I didn’t call on a team, that activity didn’t affect their grade.
This seemed to be the perfect replacement for attendance. At least one member of every group would need to be there every day, while individual members could come and go if they needed to. But part of their individual success would come from helping to make sure that the whole team was successful, so it was still in their interest to show up and help out whenever possible. I didn’t need to keep track of who was there, I just needed to give activities and ask them for their answers. And I didn’t even need to grade their responses, just record if they made an attempt.
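The bookkeeping this implies is light. Below is a rough Python sketch of it, assuming the team grade is simply the share of called-on activities where the team made an attempt. That formula is an assumption for illustration, as are the function names and the choice of two teams per activity.

```python
import random


def pick_teams(team_names, k=2, seed=None):
    """Randomly choose k teams to share their answers for one activity."""
    rng = random.Random(seed)
    return rng.sample(team_names, k)


def team_grade(records):
    """Compute a team's activity grade as a percentage.

    records is a list of (called_on, engaged) pairs, one per activity.
    Activities where the team was not called on are ignored entirely.
    Correctness is never recorded, only whether the team made an attempt.
    Returns None if the team was never called on.
    """
    attempts = [engaged for called_on, engaged in records if called_on]
    if not attempts:
        return None
    return 100.0 * sum(attempts) / len(attempts)


# Hypothetical team: called on twice, engaged once, and skipped once without being called.
print(team_grade([(True, True), (False, False), (True, False)]))
```

Note that the uncalled activity in the example has no effect on the result, matching the rule that an activity only counts if the team is called on.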
I also hoped that this would give them some level of social support for the class — the kind of friendship they would normally get from the students sitting next to them, and people to go to if they needed help or support.
Another benefit was that this broke up the huge lectures into smaller chunks. I had already added intermissions to break the 2.75-hour classes into two sessions of about 1 hour 15 minutes. With breakout room activities, days could end up being four sessions of about 30 minutes each, with activities and an intermission in between. That’s a lot better.
This was also meant to be a grade boost. A whopping 30% of their final grade came from their team grade, and because all you had to do was show up and try to answer the questions, I expected most teams to get 100%. I included this grade boost because I didn’t want them to worry about their final grade too much. This way, they would still have to work to get an excellent grade, but a student who did a decent job wouldn’t have to worry about failure. (As I mentioned earlier, I think that grades are kind of a joke.)
I shared a brief stats experience survey with my students the week before class, and I assigned them to teams based on their responses. I wanted to make sure that each team had a diverse collection of skills — that there was at least one student in every group who was comfortable with public speaking, at least one with decent math skills, and so on. The idea was that every team would have the skills they needed to succeed, and they would all have someone to turn to for help on any subject. I ended up with eight teams of five students each.
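One way to automate this kind of assignment is sketched below in Python. It is not the procedure actually used for the class: the greedy strategy, the skill labels, and the survey format are all assumptions. The idea is to first give every team one student with each required skill, then deal out everyone else to whichever team is currently smallest.

```python
import random


def assign_teams(students, n_teams, required_skills, seed=0):
    """Greedy team assignment that tries to cover each required skill on every team.

    students is a list of dicts like {"name": "s1", "skills": {"math"}}.
    Coverage is best effort: if too few students have a skill, some teams go without.
    """
    rng = random.Random(seed)
    pool = list(students)
    rng.shuffle(pool)
    team_size = -(-len(pool) // n_teams)  # ceiling division
    teams = [[] for _ in range(n_teams)]

    # Pass 1: for each skill, give each team lacking it one student who has it.
    for skill in required_skills:
        for team in teams:
            if any(skill in s["skills"] for s in team) or len(team) >= team_size:
                continue
            match = next((s for s in pool if skill in s["skills"]), None)
            if match is not None:
                team.append(match)
                pool.remove(match)

    # Pass 2: deal the remaining students to the currently smallest team.
    for student in pool:
        min(teams, key=len).append(student)
    return teams


# Hypothetical survey results for 40 students.
students = [
    {"name": f"s{i}",
     "skills": ({"speaking"} if i % 4 == 0 else set()) | ({"math"} if i % 3 == 0 else set())}
    for i in range(40)
]
teams = assign_teams(students, n_teams=8, required_skills=["speaking", "math"])
print([len(t) for t in teams])
```

A greedy pass like this is easy to check by eye, which matters more than optimality for a class of forty.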
2.2 How did Breakout Rooms Work?
The grading worked just as planned. Seven of the eight teams got perfect marks on their breakout room activities. The other group missed one day (none of them showed up) and got about 90% on the team grade. But in general this provided exactly the padding I intended.
Or, almost. In retrospect, 30% was way too much. Students got really good grades anyway, and it wasn’t all thanks to the team grade (remember, more than 50% got an A!). Making the team grade only 20% or even only 10% wouldn’t have changed their grades by very much, because they were all doing so well on other parts of the class. I think it should have counted for less than 30%, because it’s a shame that so much of their grade came from something unrelated to their understanding of the material. I am very happy so many of them got a 95. I just think it would be better for them to get a 95 from nailing the assignments and exams than from showing up and participating! It’s something I would do differently next time.
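A quick back-of-the-envelope check shows why shrinking the team weight barely matters for a student who is already doing well. The Python sketch below uses a made-up student (100 on teams, 95 on reports, 93 on exams) and assumes any weight taken from the team grade is moved to the exams while the reports stay at 20%. Both the scores and that reallocation are assumptions for illustration.

```python
def grade_with_team_weight(team, reports, exams, team_weight, reports_weight=0.20):
    """Final grade when the team component has the given weight.

    Assumption: the reports keep their weight and the exams absorb the rest.
    """
    exams_weight = 1.0 - reports_weight - team_weight
    return team_weight * team + reports_weight * reports + exams_weight * exams


# Hypothetical strong student under three different team weights.
for w in (0.30, 0.20, 0.10):
    print(w, round(grade_with_team_weight(100, 95, 93, team_weight=w), 1))
```

For this hypothetical student, dropping the team weight from 30% to 10% lowers the final grade by about a point and a half.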
The activities worked really well. Lectures can be, let’s face it, pretty boring, and I think having these class exercises helped keep students from falling asleep. There’s also no better way to learn something than doing it yourself, so following each lesson with an exercise was a good idea. And it was nice on my end to take a quick break, wait a few minutes, and see how they had done when they came back.
You do have to be careful with the activities, though. Activities work well if they are a simple problem, something the students couldn’t do when they showed up, but can do now that they’ve seen the day’s lecture. This helps the lesson stick in their memory, and demonstrates why what they just learned is actually useful. Activities can also take a “don’t take my word for it, see for yourself” approach, and I liked this when I was able to use it.
No matter what though, the activities have to be easy. They aren’t a challenge or an exam; they exist to round out the lecture and serve as a teaching aid. It’s ok if students struggle with the details; it can be good for them to get a sense of their own limitations. But if they get stuck, can’t do the activity, or reach a dead end, then they don’t learn anything. The implicit message is that they can’t handle it, and that’s not the right message to send them. They can handle things that you’ve prepared them for; don’t give them assignments you haven’t prepared them for.
Students had mixed opinions of the teams. I got feedback like, “there was zero accountability for the breakout rooms … Most of the time, my teammates wouldn’t show up” and “as the days progressed, my group became unresponsive to the point where I was simply doing the work and presenting it on my own.” A few of them did have positive things to say about the teams, but that was clearly the minority opinion.
Most students liked the breakout room activities, though. “I was able to apply the material and then receive feedback (if called on) instantly. The breakout rooms presented a great opportunity to work through what was being discussed,” one student said. Another wrote, “Breakout rooms really allowed me to understand the application of concepts. I don’t think I would have been able to work through the research reports (or the finals) with as much ease had we not gone through related work individually and then as a class.”
The only complaint I saw about these activities was that I gave students too much time to work on them. I find this confusing, because I assumed students would be happy to have an extra 5-minute break to go and make a sandwich or something. Either way, I mark this idea as another success. It does seem like it helped the concepts and skills really stick with them.
Some students suggested that the activities be designed to more directly prepare them for the exams — basically, to have the activities be examples of the kind of questions that appeared on the exams. I can see why they proposed this, but I don’t like it. The exams are designed to try to see if students can generalize stats concepts to new situations. (And from their grades, it’s clear that by the end they could!) If I give them practice with questions of a similar format, I think that would defeat the purpose.
Obviously then, the problem is the teams, and it’s not clear to me what the solution is. Students suggested that I could have them do the work as a team but then call on individual students for the answers. That’s a little too invasive for my taste. One reason to have teams is to help less confident students — you know, the kind who would hate being called on.
I could imagine making the teams larger, maybe groups of 7-10. With more students, it’s more likely that some of them would show up. I could also make the teams smaller, maybe just 2 or 3 people per team. This would lead to less diffusion of responsibility. In either case, I’m sure there would still be slackers. Students don’t like having slackers on their team, but if everyone is getting a 100% on their team grades anyways, I don’t mind if there are a couple freeloaders. Maybe teaching this in person, if that ever happens, would change the dynamic and solve the whole problem.
If I were to teach this in a classroom rather than online, I would have them do more class activities, but have each activity be smaller/shorter. Sending people to breakout rooms on Zoom is a bit of a commitment. It takes a minute to send them out and to re-orient on coming back, so you want them to get their money’s worth. But teaching in person, it would be better to just give them more diverse tasks. Rather than giving them a 10-minute worksheet, I would do something like throw three histograms up on the board and give them 3 minutes to tell me what values they could and could not reject from each.
3. Research Reports
About a year ago, I wrote an essay called What You Want from Tests, where I outlined two kinds of knowledge that you need in order to master a skill. The first is the sort of thing that every expert carries around inside their head, and this is what I argue you should try to examine with exams and quizzes. The other kind of knowledge is the ability to actually use the skill. Without the ability to use the skill, any knowledge is just trivia. You’re not an expert, you’re just a fan.
Statistics is a skill-based course, so the second kind of knowledge is really important. I didn’t just want my students to memorize a bunch of facts about statistics, I wanted them to learn how to actually use statistics.
A few years ago I was working with an undergraduate who had volunteered to be my research assistant. She was an exceptionally bright and curious student, who always asked remarkably insightful questions. She was also very diligent, and had already taken several stats classes before she started working with me. She had even taken some MA-level stats courses, which is unusual and impressive for an undergrad.
Despite all this, I discovered that she did not really understand stats. She didn’t understand many of the concepts. She had a hard time conducting even basic analyses. Despite her excellent grades, almost nothing from the classes had stuck with her.
I already knew that she was gifted, and I was aware of the shortcomings of the usual stats education approaches, so I reassured her that it was not her fault, and I offered to help her do something about it.
At this point I had already done a lot of thinking about how to do a better job teaching stats, and I realized that people always forget to teach this practical side of the skill, even though the practical side is what actually matters. Now, there’s no mystery about how to teach skills. I learned stats by struggling through real analyses for projects that were actually important to me, and everyone agrees that working on a project you genuinely care about is the best way to pick up a new skill.
But this doesn’t work in every situation. Even for me, it was a struggle, and this sink-or-swim approach is too harsh for the classroom. It’s also inefficient for beginners, because real data is messy and confusing. If students bring in a real problem, the correct approach might be too advanced for an intro class. And scale makes it impossible. Do we expect every student in an intro course to be able to bring in a project they’re thrilled about? They don’t know anything about the topic yet, so they don’t know what a good project would be.
I realized that all these problems could be fixed by using fake datasets. It’s easy enough to generate data, and you can make it look however you want. And unlike a real project, you can introduce concepts one at a time so that the student is always ready for them.
So that summer, I made a bunch of practice datasets for my RA to work with. I wrote a set of R functions that would automatically generate datasets to my specifications. At the start of each day, I would give my RA a short lesson on a stats concept, and then send her a couple datasets. Naturally, most of the datasets would be in some way related to that day’s lesson. She would work on them all morning, prepare some slides, and at noon, before we broke for lunch, she would give us a presentation on what she found out. I let my other RAs give feedback first (giving critique is great training as well), and then I would ask questions and give her feedback.
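For readers who want to try something similar, here is a minimal sketch of a dataset generator. The original functions were written in R and are not reproduced here; this is an independent Python illustration with hypothetical names. It generates two standard-normal variables with a chosen population correlation `rho`, using only the standard library.

```python
import math
import random


def make_correlated_dataset(n, rho, seed=None):
    """Generate n (x, y) pairs whose population correlation is rho."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must be between -1 and 1")
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        # Mixing x with independent noise gives y unit variance and corr(x, y) = rho.
        y = rho * x + math.sqrt(1 - rho ** 2) * noise
        rows.append((x, y))
    return rows


def pearson_r(rows):
    """Sample Pearson correlation of a list of (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    syy = sum((y - mean_y) ** 2 for _, y in rows)
    return sxy / math.sqrt(sxx * syy)


data = make_correlated_dataset(n=5000, rho=0.6, seed=1)
print(round(pearson_r(data), 2))
```

Because the generator takes the correlation as a parameter, the instructor always knows the "right answer" a student's analysis should recover, give or take sampling noise.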
The first datasets were extremely simple, and they gave her no trouble at all. Once she was comfortable with conducting simple analyses on her own, I introduced complications, the sort of wrinkles one would expect to find in a real dataset. First I introduced the concept of statistical power, and gave her some critically underpowered studies, so she could learn to interpret those null results as inconclusive. Then we had a discussion of outliers, when and when not to exclude them, and the datasets for that day included different kinds of outliers. We covered causal inference, interactions, p-hacking, and many other concepts in the same way. The concepts in these lessons were cumulative. Once we had covered outliers, for example, I would sometimes put outliers in the datasets later on.
The datasets at the start of the summer were really easy. The datasets by the end were almost as tricky as real-world data. But at no point did my RA work on anything that was too hard for her. Each new complication was just one step up from something she had already mastered, so she was always prepared to tackle it.
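The underpowered-study lesson is easy to demonstrate by simulation. The Python sketch below estimates power for a two-group comparison by repeatedly generating data with a real effect and counting how often the test rejects. To stay within the standard library it uses a two-sided z-test with a known standard deviation of 1, which is a simplification (a real analysis would more likely use a t-test), and the effect size and sample sizes are made-up values.

```python
import math
import random


def simulated_power(effect_size, n_per_group, n_sims=2000, seed=0):
    """Estimate the power of a two-sided z-test (alpha = 0.05, known sd = 1)."""
    rng = random.Random(seed)
    critical_z = 1.96
    standard_error = math.sqrt(2 / n_per_group)
    rejections = 0
    for _ in range(n_sims):
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        treated = [rng.gauss(effect_size, 1) for _ in range(n_per_group)]
        diff = sum(treated) / n_per_group - sum(control) / n_per_group
        if abs(diff / standard_error) > critical_z:
            rejections += 1
    return rejections / n_sims


# Same true effect, very different chances of detecting it.
print("n = 10 per group: ", simulated_power(0.3, 10))
print("n = 200 per group:", simulated_power(0.3, 200))
```

With only ten observations per group, a real but modest effect is missed most of the time, which is why a null result from such a study is inconclusive rather than evidence of no effect.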
3.1 Class Projects
I knew I wanted to do something similar for my class, to give them the same kind of practice with the practical side of things. In particular, I like this approach because for each dataset, you have to figure out what statistical test to run on the data. This is one of the stats skills you use most often in the real world, and it’s often the first question you ask when thinking about an analysis. Yet somehow, intro stats classes almost never teach this skill. At best, students get handed an extremely confusing flowchart. I knew I could do better.
Unfortunately the approach I used with my RA doesn’t exactly scale. I couldn’t give them the same kind of step-by-step training. I couldn’t have them all give a presentation on every dataset, and of course, many students are terrified of presenting to begin with.
Still, I figured I could come up with something that captured most of the benefits. I took several of the simpler datasets that I had made for my RA and I put them in a folder on the class website. Rather than having to analyze all of them, students were required to pick two of these datasets and write a research report about each of them. They could do these two reports at any point during the class, but since they weren’t taught how to do most analyses until about halfway through, I expected most of them to do these assignments during the second half of the course.
Students are taught to write long. This is a bad habit, especially when working with such simple datasets. I limited research reports to a maximum of one page, including any graphs and/or tables. Students should learn to be concise, and besides, I didn’t want Liz to have to sift through dozens of extra pages when grading.
Each research report was 10% of the final grade, so these assignments were 20% of their grade in total. They were free to analyze the data however they wanted, but in particular we thought that R, SPSS, and Excel/Google Sheets were good choices, so I included one session for each of those approaches in the lectures.
This wasn’t much training, to be sure. A lot of people might have seen this as a big risk — you’re expecting them to use R or SPSS with barely more than an hour of training each? But I wasn’t worried about it. Somehow I knew that they were up to the task.
Originally, I was planning to let students do up to two additional research reports for extra credit. But in the week before class, one of the students suggested that instead of doing research reports for extra credit, we could let them re-do research reports that they weren’t satisfied with. This basically translated to “do 4 research reports, get your grade from the best two”.
I liked this for a couple of reasons. First, it let them make mistakes on early research reports without huge consequences, which was one of my design goals for the class. Second, students who were struggling would be encouraged to do additional reports, which would give them the extra practice they need, while students who didn’t need additional help wouldn’t be bothered.
I implemented this change, with the requirement that the do-overs would have to be on new datasets. Students would get feedback from Liz about how to do better, but they would have to apply those lessons in a new context. I limited them to two of these do-overs at most. I wanted them to be able to learn from their mistakes, but also I didn’t want each of them doing 10 reports.
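In code, the "do up to four, keep the best two" rule is only a few lines. This Python sketch uses hypothetical names, and it assumes that a missing required report counts as zero, which the original policy does not specify.

```python
def reports_component(scores, required=2, max_reports=4):
    """Average of the best `required` report scores (0-100 scale).

    Assumption: a required report that was never submitted counts as 0.
    """
    if len(scores) > max_reports:
        raise ValueError(f"at most {max_reports} reports are allowed")
    best = sorted(scores, reverse=True)[:required]
    best += [0.0] * (required - len(best))
    return sum(best) / required


# A student who improved after feedback: two early attempts, two do-overs.
print(reports_component([75, 82, 94, 91]))
# A student who was happy with the first two and stopped there.
print(reports_component([98, 99]))
```

Since each report is worth 10% of the final grade, the averaged score feeds into the final grade with a weight of 20%.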
The research reports were not really about the grades. They weren’t so much intended as evaluations. Really, they were more like practice, or lessons. What I really wanted them to get out of the research reports was, “I can do this and it’s not scary”, because I think it will help set them up to be confident when using these skills in real life (and on the Exams). It wasn’t about challenging or testing them, it was about giving them the opportunity to try things for themselves.
About halfway through the course, one student emailed me to ask for more guidance on how to format the reports. At the very least, she said, I should give them an example of what one would look like. I told her:
This assignment is designed to mimic what doing analysis is like in the real world. Data is emailed to you in a confusing format, and the file is poorly organized. The people who have hired you to conduct the analysis don’t know exactly what they want and can’t tell you what kind of test to conduct; after all, that’s what they hired you for. I’m trying to give you a controlled version of this experience — not nearly so confusing as real life, but where you are asked to exercise your judgment and the knowledge we’ve covered in class. Giving you any more guidance on how to conduct the analysis or write the report would defeat the purpose of the assignment.
To this student’s credit, she totally understood my point and ended up getting a 98 on both research reports.
A final reason to like the research reports is that they capture my “walk out of class once you’ve mastered the material” goal. If you already took stats but you were for some reason forced to take my class, or if you decide to teach yourself all the material in the first week, then you can just throw together two one-page reports, get an A+ on both of them, and forget about this part of the class entirely.
3.2 How did they do?
Students really surprised me on the research reports. When I first looked at the grades, I thought that maybe Liz had been too lenient. Almost all of them had gotten A’s! But when I looked closer, I saw that the students had earned them. The reports weren’t perfect, but they showed serious critical thinking and really creative engagement with the datasets. All very impressive for a subject they had been studying for less than six weeks!
When I looked back, I saw that on their first submissions, many students had gotten B’s and C’s. Liz wasn’t being too lenient at all. In fact, her feedback was intensely detailed. But this helped the students enormously. It’s clear that the students took that feedback and turned it around for their do-overs, and that’s what ended up earning them those A’s.
Some students, I was happy to see, didn’t need the do-overs. One student did her first two, got a 98 and a 99, and unsurprisingly, chose not to submit any more. Another student, who had said in class that she was terrible at math, gave it a shot and to her great surprise earned a 93 and a 90. She decided that was good enough for her, and didn’t send in another. The system works.
I especially liked how diverse the reports were. Students used all sorts of weird charts and phrased their results in all sorts of unusual ways. Not wrong per se, just the sort of thing an expert would never do. I think this demonstrates genuine understanding. Rather than just copying someone else’s approach, they had come up with their own, often slightly bizarre perspective, and then applied it. That’s what mastery looks like, folks.
How about the software? Some of them came to me or to Liz for help, but honestly, not as many as you might expect. For the most part they seem to have taught themselves.
When I was looking through the reports, I saw that most of them chose to use R for their research reports, and almost all of them did a solid job of it. This was a big surprise, but it’s very encouraging.
In conversations about how to teach stats, I’ve often heard, “It would be great if we could teach the students R or Python. But you just can’t teach the average student a programming language in only one semester. It would take up too much of the lecture, and there would be too many questions for the TAs to handle. We should stick to SPSS worksheets and formulas for now, that’s the sort of thing that students can deal with.” I’m happy to have evidence that, in my opinion, proves this entirely false. Apparently students can learn the basics of R with almost no instruction, and in less than six weeks, as long as you give them the right environment for it.
(And I later heard from the person who TA’d statistics during the semester that this happened:)
I’m pretty happy with the research reports. Is there anything I would do differently next time? Well, one thing Liz pointed out to me is that while I gave them 24 different options to choose from, most of the reports people submitted were analyses of the same 4 or 5 datasets. These were some of the most straightforward datasets, and most of them involved analyses of correlation between two variables.
Now, as I said before, the research reports are not really about challenging students. I’m fine with them doing two easy reports, since doing any independent report at all is great for intro stats. But conducting correlation tests both times does slightly defeat the purpose of doing two reports.
A better system would be to break the research reports up into different bundles. Bundle A could be the easy ones and Bundle B could be more challenging. Or Bundle A could include all the correlation tests and Bundle B could include the others, so that every student would have to use at least two different tests. You could maybe include a Bundle C of advanced datasets. These could either give you extra points just for attempting them, or they could be strictly for extra credit. In any case, adding some more structure to the research reports would probably improve them.