
Tempo experiment analysis: the ugly

Previous posts discussed good and bad things about the tempo experiment. This one talks about the ugly stuff.

The fundamental "ugly" thing is that I haven't finished the improved tempo detection algorithm; I had really thought it would be done by now. I started playing with preliminary data halfway through the experiment, and I really thought that I could get everything ready and just make some final tweaks once I had the complete data.

Two things interfered with this. First, our lab has been working on a simulation of a violin -- we have a set of equations (now implemented as a series of C++ objects) which describe the motion of a violin (mainly the strings) when plucked or bowed. You can write a series of instructions (like "put a finger 0.206299 of the string-length away from the nut (i.e. a high second finger), and bow on the D string with a force of 0.5 Newtons, with an upbow moving at 0.3 meters per second"), and then the computer generates the resulting audio. I even added Python bindings with SWIG! :) Very exciting, very fun... but sadly, very much not helping the tempo detection. :(

The second problem is that I noticed patterns in the data (well, "noticed" is too strong a word -- you'd have to be blind not to see the almost-normality!). That prompted a huge review of statistics. I'd forgotten literally everything from the second-year STAT 270 course I took back in 1998. I knew there was something called a normal curve, which had a bulge in the middle... but I didn't remember that it was also called a Gaussian curve, or what variance was (other than "something to do with the range of data"). So I've had a merry romp through kurtosis, the KS test, the Shapiro-Wilk test, R, fitdistr(), Q-Q plots, scipy.stats, Cook's distance, the hat matrix, studentizing, etc. I worked in a combination of R, scipy, gnuplot, and matplotlib. I didn't actually use RPy, although I should have.
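If you're curious, the scipy end of that romp boils down to something like this -- a minimal sketch on made-up data standing in for the real residuals:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Made-up stand-in for the tap-time residuals of one batch of tap-games (seconds).
np.random.seed(0)
residuals = np.random.normal(0.0, 0.03, 500)

# Shapiro-Wilk test for normality.
w_stat, p_shapiro = stats.shapiro(residuals)

# Kolmogorov-Smirnov test against a normal with the sample's own mean and std.
d_stat, p_ks = stats.kstest(residuals, 'norm',
                            args=(residuals.mean(), residuals.std()))
print("Shapiro-Wilk p = %.3f, KS p = %.3f" % (p_shapiro, p_ks))

# Q-Q plot against the normal distribution.
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()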

But after going around in circles at least three times, I think I'm still back where I started. I mean, yes, there is insufficient reason to reject a normal distribution for most of the tap-games... but for some games, an obviously good tempo produces residuals which can't reasonably be described as coming from a normal distribution, while some other tap-games with an obviously bad tempo have tap-errors which appear to be quite normal. I think the whole statistical approach was a red herring. :(

On the plus side, I now have some practical experience in analyzing experimental data. I kind of wish I had done this in lab courses in the 1st and 2nd years of my first degree... granted, philosophy doesn't tend to have many labs, but I spent a lot of my time taking courses outside my major anyway. I even did first-year physics and chemistry, but (regrettably) I only did the lecture courses, not the lab courses.

Well, that's life in a highly interdisciplinary area like music technology. Or at least, that's life in a highly interdisciplinary area if nobody is around to collaborate... I mean, in this field, you either find experts in other fields to cover your weak spots, or you have to learn everything yourself. So far I've been doing everything myself, which is not particularly ideal. :| I guess that once I'm a professor, it'll help me supervise students on the more experimental side of music technology.

Come to think of it, I had exactly the same problem with the violin equations -- I'd never dealt with linear density or the second moment of area before, nor had I done any serious digital signal processing programming. Simply doing things like correctly translating a mathematical equation into code like:

Y1[n-1] = -((w_n + r_n*r_n / w_n) * sin(w_n*dt)) * exp(-r_n*dt);

was a good, albeit time-consuming, experience. And dealing with all the off-by-one errors between math indexing and C array indexing was fun... until I finally declared a standard and stuck with it. (I decided to start all for loops with n=1, but do all array indexing with [n-1]. Using n as a loop counter seems weird to my CS background, but this is for modal synthesis, and all the papers I was working from used "sum over all n", and keeping your code syntax as close to the math as possible definitely reduces the chance of silly errors!)
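Here's the same idea sketched in Python, just to show the convention -- the frequencies and damping values are made up, and this is only the one coefficient line quoted above, not the whole synthesis:

import numpy as np

dt = 1.0 / 44100.0                            # time step at 44.1 kHz
N = 4                                         # number of modes (arbitrary for this sketch)
w = 2 * np.pi * 440.0 * np.arange(1, N + 1)   # made-up modal frequencies (rad/s)
r = 2.0 * np.arange(1, N + 1)                 # made-up damping coefficients
Y1 = np.zeros(N)

# The loop counter n matches the papers' "sum over all n"; arrays use [n-1].
for n in range(1, N + 1):
    w_n = w[n - 1]
    r_n = r[n - 1]
    Y1[n - 1] = -((w_n + r_n * r_n / w_n) * np.sin(w_n * dt)) * np.exp(-r_n * dt)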


Anyway, on to some pretty pictures. Although it would be nice to trust all 971 tap-games (the tap-games where the player either agreed with the tempo, or specified their own), I'm still going through them manually and double-checking them all. So far (567 tap-games examined) I've found 7 tap-games where I don't believe the player's judgement. I'll be asking our local professor emeritus of music to look at those.

So, some data from the doubly-agreed 560 tap-games.

[Figure: histogram of tap-time errors for the 560 doubly-agreed tap-games]

Looks pretty normal, right? At least, it did to me. But none of the normality tests in R and scipy thought it had any reasonable chance of being normal, and upon closer examination I had to admit that the tails were a bit heavier than a normal curve.

But we can still see some interesting things. For example, even in this collection of the "most easily identified" tempi (i.e. the most accurate tap-games), there's still a significant number of taps that are 100 milliseconds away from the correct time. And remember that we're fitting the tempo to give the lowest possible errors -- if we had a set tempo, the errors would be worse. Granted, there aren't all that many taps that are 100 milliseconds away... but 50 milliseconds is certainly within reasonable bounds.

On the question of "why not look at a set tempo" -- due to technical limitations in Flash, this is not possible. Flash 10 (and below) can't generate a stable timer, so the visual metronome was created as an animated GIF. The Flash app doesn't actually know when the GIF starts flashing, so unfortunately I can't make any guess about user tap-times vs. metronome times. I might try to duplicate portions of this experiment using a controlled setup in the lab to investigate that question.

Oh, one thing that definitely is not interesting is that the residuals are evenly split between positive and negative ("have a skewness very close to 0" in stats terms). That comes from the linear regression. All the reasons against measuring taps against a set tempo also apply to measuring them against a set offset (or "intercept" in stats-talk). Now, it would be really interesting to see the distribution of errors measured against the actual metronome times -- i.e. whether people tap early or late -- but again, that requires a controlled lab setup. And a controlled lab setup requires people to be physically present, probably with paid participants, etc.

Hey, I just had a nice thought -- I could set up one of my old netbooks as a rhythm testing platform. It would have a reliable timer, etc. Then whenever I travel places, I take it along with me, and whenever I find a willing victim, I whip it out and they play with it. I'd need ethics approval for a long-term experiment (say, a year?), with me probably finding two or three participants each month... but if each person plays it for 15 minutes, that should give me enough data to see whether people tend to be ahead or behind the actual times.


Anyway, just looking at the residuals isn't enough. Maybe there were just a few tap-games that were wildly incorrect, and those are skewing the data. That's not the case here -- I mean, yes, there were some tap-games that looked pretty wild, but those aren't included in this current data set. Still, it's a reasonable question to ask, and it's easily answered by looking at the "average" amount of error in each tap-game -- in particular, the standard deviation of the residuals, which (since their mean is zero) is the same as the root-mean-squared error (RMSE).
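Concretely, the number plotted below is just the RMS of each tap-game's residuals; a minimal sketch, with made-up tap times and fitted times:

import numpy as np

# Made-up example: actual tap times and the times predicted by the fitted
# tempo/offset for one tap-game (both in seconds).
tap_times = np.array([0.02, 0.51, 1.03, 1.48, 2.01])
fitted_times = np.array([0.00, 0.50, 1.00, 1.50, 2.00])

residuals = tap_times - fitted_times
rmse_ms = np.sqrt(np.mean(residuals ** 2)) * 1000.0   # seconds -> milliseconds
print("RMSE: %.1f ms" % rmse_ms)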

[Figure: histogram of per-tap-game RMSE]

So... it seems that 4 or 5 tap-games had an "average" error of less than 10 milliseconds, but the vast majority of these tap-games had an RMSE between 15 and 60 milliseconds. That's interesting to note -- I wasn't expecting it to be so high, and most people I've mentioned it to have been surprised that there would be so much variation. But even that figure is a low estimate -- I expect the actual range to be much higher once I include the entire dataset.

This is good to know for anybody working on rhythm grading -- if this experiment is an accurate representation of your target group, then you should expect the best people to have an RMSE between 15 and 25 ms. It's interesting to note that 15-25 ms is also the just-noticeable difference for two sharp onsets. If you record a clap (or any abrupt sound), make a copy, and play the copy 15 ms later than the original, most people will only hear one clap. If you play it 25 ms later, most people will hear two claps. I doubt that there's actually a connection between these two things, but it's interesting to note anyway. :)


Anyway, that's what I have so far. I'm going to force myself to take a break for a day or two, then tackle it again with a fresh outlook. Hopefully I'll have finished the improved algorithm next week, so that I can move on to the rhythm grading experiment! (And write this one up for publication, of course.)

Tempo experiment analysis: the bad

The previous post discussed good things about the tempo analysis so far. This one talks about bad stuff.

I had a great idea about three weeks into the experiment (aka "too late") -- if I had recorded the time that the metronome started, I could have figured out how long somebody watched the metronome before starting to tap. I knew that the timer only gave me the number of seconds since Flash started to run (i.e. when somebody viewed my webpage), but I had previously decided that this wasn't a problem, because I could easily adjust for the overall offset by looking at the tap-times. I was so focused on the tempo detection that I didn't think about other interesting data that I could gather.

Whoops. That was a nice opportunity missed. :(


I'm not certain whether to regret the lack of tracking of individual players. I'm not talking about identifying specific persons, but just knowing that (30? 200?) people played the game; that person X played 80 games, while person Y played 3 games; etc.

Technologically this would have taken a day or two (either browser cookies or Flash cookies; I've never dealt with them before, but it can't be all that hard). Not a big deal. I could even have simply recorded IP addresses! I mean, the web server logs record those all the time; I wouldn't have had to do any extra programming to get them. Granted, the data would be a bit fuzzy if people played the game at home and at work, or on a desktop and a mobile phone, or if multiple people played in the same house... but even that kind of rough idea could be helpful.

However, it probably would have added about 10 hours of extra red tape. I had enough trouble getting the university ethics committee to accept the game as it is (i.e. an anonymous Flash game, but without signed consent forms); the extra scrutiny / extra forms / extra emails required to add a simple web tracking cookie (of the kind that virtually every website uses these days!) would have been significant.

Also, I couldn't honestly argue that this information would be necessary for my research. I mean, yes, tracking individual users will be necessary for the rhythm grading -- but this experiment was about tempo detection, and all I need to know is "a human produced this series of taps, and judged the automatic tempo detection to be ok/not ok". Knowing that a particular set of tap-games all came from the same person would be neat, but not actually useful for the tempo detection.


However, by far the worst problem was the grading. A large mouthful of crow for me here. :(

This was a last-minute addition -- the first version of the experiment didn't have any grading at all! But when I tried it out on my first trial group (i.e. my Mom :), the feedback was quite negative: it wasn't fun, there wasn't any real reason to play the game, etc. (I also heard that the "relative mode" was too difficult to understand, so I dropped that entire type of tempo detection. Three weeks of work down the drain! Lesson learned: do some user testing as soon as possible.)

So I quickly hacked in the rhythm grading that I used for my Masters project. I added the warning:

"[The rhythm grade] is an approximate grade for how well your rhythms fit into your tempo. This grade calculation is known to be incorrect in some circumstances..."

but the grade would give people something to focus on.

And focus on it they did. :|

In retrospect, I shouldn't have been surprised at the amount of interest in getting high grades -- I'm highly competitive myself, and my most-cited published paper is all about music education with games (in particular, score-giving games!). So I really have no excuse for not expecting people to focus on their rhythm grade instead of the tempo detection.


But wait, it gets worse! After I added the grading, I did another test run with a larger group of family and friends. Two people commented that the grading seemed overly generous, but a very quick test ("good taps -- ok, 98%. tap randomly -- 0%. tap with one tap in the wrong place -- 38%. seems ok!") didn't reveal anything horribly wrong. And since I was already two weeks into my experiment time (the ethics committee approval is for a specific range of dates, and I didn't start user testing until the beginning of that range; I should have asked for more time, and I'm certain I would have gotten it!), I went ahead and sent it to the wider audience (including this blog). I mean, I was using the same grading as my Masters, and I had tested that one quite a bit, right?

Well... it was almost the same as the earlier grading. The difference is that the Masters version acted on audio frames -- it looked at recorded audio, divided it into frames of 512 samples, then looked for claps in the RMS of those frames. Each frame is 512 samples / 44100 samples per second = 11.6 milliseconds long. The Flash timer gave me seconds.

Now, I didn't forget to convert between milliseconds and seconds. But I did forget about the frame size. So the total "amount of error" in a tap-game was 11.6 times too small. And the grade was simply 100.0 - total_errors.
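To make the effect concrete, here's a toy illustration -- the specific error value is made up, but the factor of ~11.6 between what the formula expected and what it actually received is the real bug:

FRAME_FACTOR = 512 / 44100.0 * 1000.0   # one analysis frame is ~11.6 ms

def grade(total_error):
    # The grade really was this simple: 100 minus the total error.
    return 100.0 - total_error

intended_error = 40.0                        # a fairly sloppy tap-game, on the intended scale
buggy_error = intended_error / FRAME_FACTOR  # what the Flash version actually summed up

print("intended grade: %.1f%%" % grade(intended_error))   # 60.0%
print("buggy grade:    %.1f%%" % grade(buggy_error))       # 96.6%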

Ouch. What effect did that have? Well, compare these two graphs:

[Figures: two histograms of the rhythm grades -- note the X-axis scales]

Ouch.

If you missed it, look at the scale of the X axis. The scores range from 96% to 100%. They were supposed to go from 60% to 100%.

NB: these are the scores for tap-games with absolutely no ambiguity. No missing taps, no extra taps, no incorrect rhythms, etc. The other 481 tap-games had much lower scores, due to penalties for missing/extra/incorrect taps. It wasn't quite a complete disaster -- if I were using the same type of grading algorithm that Rock Band and Guitar Hero seem to use ("if the event was within X seconds of the correct time, get full points; otherwise, get 0 points"), then all of those tap-games would probably have received 100%.
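For comparison, that kind of all-or-nothing grading is basically the sketch below; the 50 ms tolerance is my arbitrary guess, not whatever those games actually use:

def threshold_grade(tap_errors_ms, tolerance_ms=50.0):
    # Full credit for any tap within the tolerance, nothing otherwise.
    hits = sum(1 for err in tap_errors_ms if abs(err) <= tolerance_ms)
    return 100.0 * hits / len(tap_errors_ms)

print(threshold_grade([3.0, -12.0, 25.0, -40.0]))   # 100.0 -- every tap "close enough"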

But even if it wasn't a complete disaster, it was certainly a huge screw-up. It doesn't affect the scientific outcome -- I have the raw data; I can test all my tempo detection stuff on those taps. And I can also experiment with different grading algorithms to see what kind of distributions they'd give.

However, it made the game less fun (or less educational) for the participants. I had suggested that, in return for your participation in this experiment, you would see a reasonable grade of your rhythmic accuracy, for your amusement or education. That didn't happen, and I apologize.


On to the ugly, or back to the good.

Tempo experiment analysis: the good

Sorry for the delay in this analysis; sickness and wanting to have everything finished got in the way. It's not completely finished, but I have enough interesting things to discuss for now. This blog post deals with things that went well.

The best thing that happened was that you guys actually played the game! :) I got much, much more data than I was expecting -- in the end, there were 1041 tap-games played, with a total of 14,342 taps. I was hoping for 200 or so tap-games, so this gives me five times as much data as I was hoping for in my wildest dreams!

Of those tap-games, 882 resulted in the player accepting the computer's tempo, 89 resulted in the player giving their own tempo, and 70 resulted in a disagreement about tempo but no other tempo given. 635 tap-games used an audio metronome, while 361 used a visual metronome, and 45 used no metronome at all.


Most tempi (plural of "tempo"; no, it's not "tempos") were very close to the metronome -- looking at tap-games with a user-agreed or user-specified tempo, and excluding tap-games with no metronome, we find that 82.1% of them had a difference of less than 5%, and 41.7% had a difference of less than 1%.

[Figure: distribution of the difference between tapped tempo and metronome tempo]

The distribution of tap-games played per level shows an unsurprising majority at level 1, but a surprising amount of interest in level 9. I'm guessing that this is because the game was too easy (see the next blog post).

[Figure: number of tap-games played per level]

Returning to the main point of this experiment: 560 tap-games had their tempo detected trivially with linear regression via ordinary least squares. The player tapped the correct rhythm, with no note more than a tenth of a quarter-note beat away from its strictly metronomic position. I call those the "boring" tap-games... I mean, if the obvious solution works, then it's no fun! :)
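For those "boring" games, the fit really is just a straight line through (beat position, tap time) pairs; here's a minimal sketch of the idea -- the beat positions and tap times below are made up, and the real code has to cope with much messier input:

import numpy as np
from scipy import stats

# Made-up example: the intended beat position of each tap (in quarter notes)
# and the time at which the player actually tapped (in seconds).
beats = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
taps = np.array([0.02, 0.51, 1.03, 1.49, 2.02, 2.53, 3.01, 3.52])

# Ordinary least squares: tap_time ~ slope * beat + intercept.
slope, intercept, r_value, p_value, stderr = stats.linregress(beats, taps)
tempo_bpm = 60.0 / slope

residuals = taps - (slope * beats + intercept)
print("estimated tempo: %.1f BPM" % tempo_bpm)
print("worst tap: %.3f beats away" % np.max(np.abs(residuals) / slope))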

Happily, this leaves 481 "interesting" tap-games which require more complicated solutions. I'm still working on this part; my latest algorithm correctly identifies most of the tempi, but there's still a bunch left, and I think my current approach isn't going to pan out. I need to take a break for a day and return to it with a fresh mind.

Oh, on the topic of "boring" exercises -- judging from the data I've been looking at manually, I estimate that approximately two thirds of the tap-games can be handled with simple linear regression. The missing hundred-odd tap-games (the gap between "two thirds of 1041" and the 560 mentioned above) come from wanting to avoid false positives. I'm being paranoid about false positives; I would rather have the computer spend more time analyzing the taps than produce an incorrect tempo estimate.

On to the bad, or even the ugly.