Tempo experiment analysis: the bad

2010-10-09 05:45

The previous post discussed good things about the tempo analysis so far. This one talks about bad stuff.

I had a great idea about three weeks into the experiment (aka "too late") -- if I recorded the time that the metronome started, I could figure out how long somebody watched the metronome before starting to tap. I knew that the timer only gave me the number of seconds since flash started to run (i.e. when somebody viewed my webpage), but I had previously decided that this wasn't a problem, because I could easily adjust for the overall offset by looking at the tap-times. I was so focused on the tempo detection that I didn't think about other interesting data that I could gather.

Whoops. That was a nice opportunity missed. :(

I'm not certain whether to regret the lack of tracking individual players. I'm not takling about identifying specific persons, but just knowing that (30? 200?) people played the game; that person X played 80 games, while person Y played 3 games; etc.

Technologically this would have taken a day or two (either browser cookies or flash cookies; I've never dealt with them before, but it can't be all that hard). Not a big deal. I could have even simply recorded their IP addresses! I mean, the web server logs record those all the time; I wouldn't have had to do any extra programming to get those. Granted, the data would be a bit fuzzy if people played the game at home and work, or on a desktop and their mobile phone, or having multiple people playing at the same house... but even that kind of rough idea could be helpful.

However, it probably would have added about 10 hours of extra red tape. I had enough trouble getting the university ethics committee to accept the game as it is (i.e. an anonymous flash game, but without signed consent forms); the extra scrutiny / extra forms / extra emails that would be required to add a simple web tracking cookie (of the kind that virtually every website uses these days!) would be significant.

Also, I couldn't honestly argue that this information would be necessary for my research. I mean, yes, tracking individual users will be necessary for the rhythm grading -- but this experiment was about tempo detection, and all I need to know is "a human produced this series of taps, and judged the automatic tempo detection to be ok/not ok". Knowing that a particular set of tap-games all came from the same person would be neat, but not actually useful for the tempo detection.

However, by far the worst problem was the grading. A large mouthful of crow for me here. :(

This was a last-minute addition -- the first version of the experiment didn't have any grading at all! But when I tried it out on my first trial group (i.e. my Mom :), the feedback was quite negative: it wasn't fun, there wasn't any real reason to play the game, etc. (I also heard that the "relative mode" was too difficult to understand, so I dropped that entire type of tempo detection. Three weeks of work down the drain! Lesson learned: do some user testing as soon as possible.)

So I quickly hacked on the rhythm grading that I used for my Masters' project. I added the warning:

"[The rhythm grade] is an approximate grade for how well your rhythms fit into your tempo. This grade calculation is known to be incorrect in some circumstances..."

but the grade would give people something to focus on.

And focus on it they did. :|

In retrospect, I shouldn't have been surprised at the amount of interest in getting high grades -- I'm highly competitive, and my published paper that gets the most citations is all about music education with games (in particular, score-giving games!). So I really have no excuse for not expecting people to focus on their rhythm grade, instead of the tempo detection.

But wait, it gets worse! After I added the grading, I did another test run with a larger group of family and friends. Two people commented that the grading seemed overly generous, but a very quick test ("good taps -- ok, 98%. tap randomly -- 0%. tap with one tap in the wrong place -- 38%. seems ok!") didn't reveal anything horribly wrong. And since I was already two weeks into my experiment time (the ethics committee approval is for a specific range of dates, and I didn't start user testing until the beginning of this range. I should have asked for more time; I'm certain I would have gotten it!), I went ahead and sent it to the wider audience (including this blog). I mean, I'm using the same grading as my Masters, and I tested that one quite a bit, right?

Well... it was almost the same as the earlier grading. The difference is that the Masters version acted on frame sizes -- it looked at recorded audio, divided them into frames of 512 samples, then looked for claps in the RMS of those frames. Each frame is 512 samples / 44100 samples/second = 11.6 milliseconds long. The flash timer gave me seconds.

Now, I didn't forget to convert between milliseconds and seconds. But I did forget about the frame size. So the total "amount of error" in a tap-game was 11.6 times too small. And the grade was simply 100.0 - total_errors.

Ouch. What effect did that have? Well, compare these two graphs:

Ouch.

If you missed it, look at the scale of the X axis. The scores range from 96% to 100%. They were supposed to go from 60% to 100%.

NB: these are the scores for tap-games with absolutely no ambiguity. No missing taps, no extra taps, no incorrect rhythms, etc. The other 481 tap-games had much lower scores, due to penalities for missing/extra/incorrect taps. It wasn't quite a complete disaster -- if I were using the same type of grading algorithm that Rock Band and Guitar Hero seem to use ("if the event was within X seconds of the correct time, get full points; otherwise, get 0 points"), then all of those tap-games would probably have received 100%.

But even if it wasn't a complete disaster, it was certainly a huge screw-up. It doesn't affect the scientific outcome -- I have the raw data; I can test all my tempo detection stuff on those taps. And I can also experiment with different grading algorithms to see what kind of distributions they'd give.

However, it made the game less fun (or less educational) to the participants. I suggested that, in return for your participation in this experiment, you could see a reasonable grade of your rhythmic accuracy, for your amusement or education. That didn't happen, and I apologize.

On to the ugly, or back to the good.