Convolution is magical!

Digital Signal Processing can be so awesome it's scary. That's true of any branch of mathematics, of course... but since I just reached this point with convolution to aid the violin physical model, that's what I'll talk about. With audio examples!

We've been working on a program that simulates a violin (called a "physical model" in the jargon, although it's not physical, and only scientists use the word "model" in that fashion). We have a series of equations that describe how the violin reacts when you pluck it, bow it, put finger(s) on string(s), etc. We're using a lot of approximations and known-to-be-inaccurate physics in this program; we're not trying to push the boundaries of science in this area. We just want something that behaves (and thus sounds) vaguely like a violin, so that people can hear the difference that various bowings make.

In more precise terms, we give the computer a series of instructions about the physical actions of a violinist, and the computer does them. For example:

time  action  extra
0.0   finger  D string  from-nut 0.109
0.0   bow     D string  from-bridge 0.12  -0.1 m/s  0.4 N
This starts playing an E on the D string with a slow upbow (0.1 meters per second) with moderate pressure (0.4 Newtons).
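
(For the programmers: inside the program, each of those lines just becomes a little event record. The sketch below is purely illustrative -- the names and layout are made up for this post, not copied from our actual code.)

#include <string>
#include <vector>

// Illustrative sketch only -- the real program uses different (and messier) names.
// Each line of the instruction file becomes one of these events.
struct PhysicalAction {
    double time;          // seconds from the start of the piece
    std::string action;   // "finger", "bow", ...
    int string_number;    // 0 = G, 1 = D, 2 = A, 3 = E
    double position;      // finger: fraction from the nut; bow: fraction from the bridge
    double bow_velocity;  // m/s, negative for an upbow (bow actions only)
    double bow_force;     // Newtons (bow actions only)
};

// The two events from the example above:
std::vector<PhysicalAction> score = {
    {0.0, "finger", 1, 0.109,  0.0, 0.0},
    {0.0, "bow",    1, 0.12,  -0.1, 0.4},
};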

After summing the forces that the four violin strings exert onto the bridge, we get the following sound:

twinkle-plain.wav.mp3
Why does this sound so bad? Well, the approximations we're making will hurt the sound quality to some extent, but the biggest reason is that I'm not doing anything fancy in terms of the playing (yet). I mean, even the most accurate physical simulation can produce horrible noises. Don't believe me? Give a violin to my father. A real violin is the most accurate physical "simulation" imaginable, but it won't produce a nice sound unless it's played by a skilled violinist!

But I digress. Right now I'm excited about convolution of audio with an impulse response.


One neat trick I learned while doing my music degree at UVic was that you can make audio sound like it's played in a room by taking its convolution with the impulse response of that room. Want to pretend that you're singing karaoke in the Sistine Chapel, or the Met opera hall? All you need to do is go to those places and record a clap. That's it -- a single clap ("an impulse") is enough to make any piece of audio sound like it was played in that environment.

When I heard that, I was like "woah!" [sic -- I'm trying to write like a UVic music undergraduate].

But it didn't really sink in, because I hadn't actually tried doing it myself. Like so many parts of mathematics, it doesn't make sense until you do it yourself. (which makes this whole blog post rather pointless, of course)

The really cool thing, though, is that this applies to small things, not just big rooms. In particular, it applies to the violin body.


So my supervisor and I went down to the basement, wandered past all the liquid nitrogen (or whatever was in those tanks), and hopped into the anechoic chamber. After plugging my supervisor's eee 901 netbook into the microphone that somebody had left in the room, I hit the violin a few times with a dirty tea-spoon that we grabbed from our lounge.

Yeah. In the Centre for Music Technology, we're just that hardcore. A week earlier, we ended up measuring the weight of the violin bow using a flat piece of metal (the leg of a speaker stand) as a lever, a ball-point pen as the fulcrum, and a slightly-drunk pint of milk as the counter-balance to the bow.

We went back upstairs, isolated one of those taps, and filtered it at 80 Hz to get rid of some background noise. (hey, there's noise everywhere, even in an anechoic chamber!)
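
(In case "filtered it at 80 Hz" sounds mysterious: it's just a high-pass filter to strip out low-frequency rumble. Here's a minimal sketch of a first-order high-pass filter at that cutoff -- the textbook RC version, not necessarily the exact filter we used.)

#include <vector>
#include <cstddef>

// Minimal first-order (RC-style) high-pass filter sketch.  Not necessarily
// the filter we actually used -- just to show that "filter at 80 Hz" is a
// one-liner per sample.
std::vector<double> highpass(const std::vector<double>& x,
                             double cutoff_hz, double sample_rate)
{
    const double two_pi = 6.283185307179586;
    const double rc = 1.0 / (two_pi * cutoff_hz);
    const double dt = 1.0 / sample_rate;
    const double alpha = rc / (rc + dt);

    std::vector<double> y(x.size(), 0.0);
    if (x.empty()) return y;
    y[0] = x[0];
    for (size_t n = 1; n < x.size(); ++n) {
        // y[n] = alpha * (y[n-1] + x[n] - x[n-1])
        y[n] = alpha * (y[n - 1] + x[n] - x[n - 1]);
    }
    return y;
}

// e.g. impulse = highpass(impulse, 80.0, 44100.0);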

Here's the resulting sound:

impulse-2048

If you think you missed something, nope -- that tiny click is all there is. It's 2048 samples long (slightly longer than 46 milliseconds), and that's actually way longer than it needs to be. Now we just need to convolve those two pieces of audio together.


The C++ code that does the magical convolution is simplicity itself:

double ViolinInstrument::body_impulse(double sample)
{
    // Store the newest bridge-force sample in the ring buffer.
    body_ringbuffer[body_write_index] = sample;

    // Multiply-and-add: each stored sample times the matching kernel
    // (impulse response) sample, all summed together.
    double result = 0.0;
    unsigned int bi = body_read_index;
    for (unsigned int ki = 0; ki < PC_KERNEL_SIZE; ki++) {
        result += body_ringbuffer[bi] * pc_kernel[ki];
        bi++;
        if (bi >= PC_KERNEL_SIZE) {
            bi -= PC_KERNEL_SIZE;   // wrap around the ring buffer
        }
    }

    // Advance the read/write indices for the next sample.
    update_body_pointers();
    return result;
}

... ok, if you're not a programmer, that probably doesn't look simple. The only important thing to note is that we're multiplying two numbers together, repeating that process, and adding all the results together. And also note that the two numbers we're multiplying together are simply the raw samples.

(if you know the math and you're worried about the reverse+shift... don't be! That's done by the physics. The second function isn't actually the recorded "bonk" noise; the "bonk" noise itself is already the reverse+shifted version of the function)
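
(If the ring-buffer bookkeeping above obscures the maths, here's the same multiply-and-add written as the textbook convolution sum, y[n] = sum over k of h[k]*x[n-k]. This is just for illustration -- it shows where the "reverse and shift" lives -- and isn't the code we actually run, since that has to work one sample at a time.)

#include <vector>
#include <cstddef>

// Textbook direct-form convolution, for illustration only.
// y[n] = sum over k of h[k] * x[n-k]  -- the "x[n-k]" is the reverse+shift.
std::vector<double> convolve(const std::vector<double>& x,   // dry audio
                             const std::vector<double>& h)   // impulse response
{
    std::vector<double> y(x.size() + h.size() - 1, 0.0);
    for (size_t n = 0; n < y.size(); ++n) {
        for (size_t k = 0; k < h.size(); ++k) {
            if (n >= k && (n - k) < x.size()) {
                y[n] += h[k] * x[n - k];
            }
        }
    }
    return y;
}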

The end result of the multiply+add is this:

twinkle-convolution.wav.mp3
Sounds much more like a violin, right? I think the mic placement wasn't ideal, and it was a really cheap violin, but we can make more recordings later with other violins. The important thing is that the basic system is working... and that multiplying in a simple tea-spoon-bonk can magically change the sound in this way.


Oh, and I absolutely cannot resist adding a link to my favorite (and sadly no longer updated) web-comic, talking about convolution: a magical superpower. (the clown is the supervisor -- this will be obvious to anybody in academia)

Posted at 2010-10-14 02:01 | Permanent link | Comments

Tempo experiment analysis: the ugly

Previous posts discussed good and bad things about the tempo experiment. This one talks about the ugly stuff.

The fundamental "ugly" thing is that I haven't finished the improved tempo detection algorithm; I had really thought it would be done by now. I started playing with preliminary data halfway through the experiment, and I really thought that I could get everything ready, and just make some final tweaks once I had the complete data.

Two things interfered with this. First, our lab has been working on a simulation of a violin -- we have a set of equations (now in a series of C++ objects) which describe the motion of a violin (mainly the strings) when plucked or bowed. You can write a series of instructions (like "put a finger 0.206299 of the string-distance away from the nut (i.e. a high second finger), then bow the D string with a force of 0.5 Newtons, with an upbow moving at 0.3 meters per second"), and the computer generates the resulting audio. I even added python bindings with SWIG! :) Very exciting, very fun... but sadly, very much not helping the tempo detection. :(

The second problem is that I noticed patterns in the data (well, "noticed" is too strong a word -- you'd have to be blind not to see the almost-normality!). That prompted a huge review of statistics. I'd forgotten literally everything from the second-year STAT 270 course I took back in 1998. I knew there was something called a normal curve, which had a bulge in the middle... but I didn't remember that it was also called a Gaussian curve, or what variance was (other than "something to do with the range of data"). So I've had a merry romp through kurtosis, the KS test, the Shapiro-Wilk test, R, fitdistr(), Q-Q plots, scipy.stats, Cook's distance, the hat matrix, studentizing, etc. I worked in a combination of R, scipy, gnuplot, and matplotlib. I didn't actually use Rpy, although I should have.

But after going around in circles at least three times, I think I'm still back where I started. I mean, yes, there is insufficient reason to reject a normal distribution for most of the tap-games... but for some games, an obviously good tempo produces residuals which can't reasonably be described as coming from a normal distribution, while some other tap-games with an obviously bad tempo have tap-errors which appear to be quite normal. I think the whole statistical approach was a red herring. :(

On the plus side, I now have some practical experience in analyzing experimental data. I kind-of wish I had done this in lab courses in the 1st and 2nd years of my first degree... granted, philosophy doesn't tend to have many labs, but I spent a lot of my time taking courses outside my major anyway. I even did first-year physics and chemistry, but (regrettably) I only did the lecture courses, not the lab courses.

Well, that's life in a highly interdisciplinary area like music technology. Or at least, that's life in a highly interdisciplinary area if nobody is around to collaborate... I mean, in this field, you either find experts in other fields to cover your weak spots, or you have to learn everything yourself. So far I've been doing everything myself, which is not particularly ideal. :| I guess that once I'm a professor, it'll help me supervise students in the more experimental side of music technology.

Come to think of it, I had exactly the same problem with the violin equations -- I had never dealt with linear density or the second moment of area before, nor had I done any serious digital signal processing programming before. Simply doing things like correctly translating a mathematical equation into code, like:

Y1[n-1] = -((w_n + r_n*r_n / w_n) * sin(w_n*dt)) * exp(-r_n*dt);
was a good, albeit time-consuming, experience. And dealing with all the off-by-one errors from math-indexing vs. C-array-indexing was fun... until I finally declared a standard and stuck with it. (I decided to start all for loops with n=1, but do all array indexing with [n-1]. Using n as a loop counter seems weird to my CS background, but this is for modal synthesis, and all the papers I was working from used "sum over all n", and keeping your code syntax as close to the math as possible definitely reduces the chance of silly errors!).
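
Here's an invented illustration of that convention -- the names and the mode update are made up for this post, not lifted from the real synthesis code:

#include <cmath>

// Invented illustration of the "loop from n=1, index with [n-1]" convention;
// the actual modal-synthesis update in our code is more involved.
void update_modes(double* Y1, const double* w, const double* r,
                  int num_modes, double dt)
{
    // The papers say "for all modes n = 1..N", so the loop counter matches
    // the math, and only the array subscripts carry the -1.
    for (int n = 1; n <= num_modes; n++) {
        Y1[n-1] = -((w[n-1] + r[n-1]*r[n-1] / w[n-1]) * std::sin(w[n-1]*dt))
                  * std::exp(-r[n-1]*dt);
    }
}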



Anyway, on to some pretty pictures. Although it would be nice to trust 971 tap-games (the tap-games where the player either agreed with the tempo, or specified their own), I'm still going through these manually and double-checking them all. So far (567 tap-games examined) I've found 7 tap-games where I don't believe the player's judgement. I'll be asking our local professor emeritus of music to look at those.

So, some data from the doubly-agreed 560 tap-games.

Looks pretty normal, right? At least, it did to me. But none of the normality tests in R and scipy thought it had any reasonable chance of being normal, and upon closer examination I had to admit that the tails were a bit heavier than a normal curve.

But we can still see some interesting things. For example, even in this collection of the "most easily identified" tempi (i.e. the most accurate tap-games), there's still a significant number of taps that are 100 milliseconds away from the correct time. And remember that we're fitting the tempo to give the lowest possible errors -- if we had a set tempo, the errors would be worse. Granted, there aren't all that many taps that are 100 milliseconds away... but 50 milliseconds is certainly within reasonable bounds.

On the question of "why not look at a set tempo" -- due to technical limitations in flash, this is not possible. Flash 10 (and below) can't generate a stable timer, so the visual metronome was created as an animated gif. The flash app doesn't actually know when the gif starts flashing, so unfortunately I can't make any guess about user-tap-times vs. metronome-times. I might try to duplicate portions of this experiment using a controlled setup in the lab to investigate that question.

Oh, one thing that definitely is not interesting is that the residuals are evenly split between positive and negative ("have a skewness very close to 0" in stats terms). That comes straight from the linear regression. All the reasons against measuring these against a set tempo also apply to measuring them against a set offset (or "intercept" in stats-talk). Now, it would be really interesting to see the distribution of errors measured against the actual metronome times... but again, that requires a controlled lab setup. And a controlled lab setup requires people to be physically present, probably with paid participants, etc.

Hey, I just had a nice thought -- I could set up one of my old netbooks as a rhythm testing platform. It would have a reliable timer, etc. Then whenever I travel places, I take it along with me, and whenever I find a willing victim, I whip it out and they play with it. I'd need ethics approval for a long-term experiment (say, a year?), with me probably finding two or three participants each month... but if each person plays it for 15 minutes, that should give me enough data to see whether people tend to be ahead or behind the actual times.


Anyway, just looking at the residuals isn't enough. Maybe there were just a few tap-games that were wildly incorrect, and that's skewing the data. That's not the case here -- I mean, yes, there were some tap-games that looked pretty wild, but those aren't included in this current data set. But it's a reasonable question to ask, and easily answered by looking at the "average" amount of error per tap-game. In particular, the standard deviation (which, since these residuals have essentially zero mean, is the same thing as the root-mean-squared error, or RMSE).
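
For the record, the "average" error I keep quoting is computed like the sketch below (where the residuals are the signed tap errors left over after the tempo fit):

#include <cmath>
#include <vector>

// Root-mean-squared error of the tap residuals.  Since the residuals come
// from a linear regression, their mean is essentially zero, so this is also
// their standard deviation.
double rmse(const std::vector<double>& residuals)
{
    if (residuals.empty()) return 0.0;
    double sum_sq = 0.0;
    for (double e : residuals) {
        sum_sq += e * e;
    }
    return std::sqrt(sum_sq / residuals.size());
}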

So... it seems that 4 or 5 tap-games had an "average" error of less than 10 milliseconds, but the vast majority of these tap-games had an RMSE between 15 and 60 milliseconds. That's interesting to note -- I wasn't expecting it to be so high, and most people I've mentioned it to have been surprised that there would be so much variation. But even that figure is a low estimate -- I expect the actual range to be much higher once I include the entire dataset.

This is good to know for anybody working on rhythm grading -- if this experiment is an accurate representation of your target group, then you should expect the best people to have an RMSE between 15 and 25 ms. It's interesting to note that 15-25ms is the just noticeable difference of two sharp onsets. If you record a clap (or any abrupt sound), make a copy, and play the delayed copy 15 ms later than the first, most people will only hear one clap. If you play it 25 ms later, most people will hear two claps. I doubt that there's actually a connection between these two things, but it's interesting to note anyway. :)


Anyway, that's what I have so far. I'm going to force myself to take a break for a day or two, then tackle it again with a fresh outlook. Hopefully I'll have finished the improved algorithm next week, so that I can move on to the rhythm grading experiment! (and write this one up for publication, of course)

Posted at 2010-10-10 01:23 | Permanent link | Comments

Tempo experiment analysis: the bad

The previous post discussed good things about the tempo analysis so far. This one talks about bad stuff.

I had a great idea about three weeks into the experiment (aka "too late") -- if I recorded the time that the metronome started, I could figure out how long somebody watched the metronome before starting to tap. I knew that the timer only gave me the number of seconds since flash started to run (i.e. when somebody viewed my webpage), but I had previously decided that this wasn't a problem, because I could easily adjust for the overall offset by looking at the tap-times. I was so focused on the tempo detection that I didn't think about other interesting data that I could gather.

Whoops. That was a nice opportunity missed. :(


I'm not certain whether to regret the lack of tracking of individual players. I'm not talking about identifying specific people, but just knowing that (30? 200?) people played the game; that person X played 80 games, while person Y played 3 games; etc.

Technologically this would have taken a day or two (either browser cookies or flash cookies; I've never dealt with them before, but it can't be all that hard). Not a big deal. I could have even simply recorded their IP addresses! I mean, the web server logs record those all the time; I wouldn't have had to do any extra programming to get those. Granted, the data would be a bit fuzzy if people played the game at home and at work, or on a desktop and their mobile phone, or if multiple people played in the same house... but even that kind of rough idea could be helpful.

However, it probably would have added about 10 hours of extra red tape. I had enough trouble getting the university ethics committee to accept the game as it is (i.e. an anonymous flash game, but without signed consent forms); the extra scrutiny / extra forms / extra emails that would be required to add a simple web tracking cookie (of the kind that virtually every website uses these days!) would be significant.

Also, I couldn't honestly argue that this information would be necessary for my research. I mean, yes, tracking individual users will be necessary for the rhythm grading -- but this experiment was about tempo detection, and all I need to know is "a human produced this series of taps, and judged the automatic tempo detection to be ok/not ok". Knowing that a particular set of tap-games all came from the same person would be neat, but not actually useful for the tempo detection.


However, by far the worst problem was the grading. A large mouthful of crow for me here. :(

This was a last-minute addition -- the first version of the experiment didn't have any grading at all! But when I tried it out on my first trial group (i.e. my Mom :), the feedback was quite negative: it wasn't fun, there wasn't any real reason to play the game, etc. (I also heard that the "relative mode" was too difficult to understand, so I dropped that entire type of tempo detection. Three weeks of work down the drain! Lesson learned: do some user testing as soon as possible.)

So I quickly hacked on the rhythm grading that I used for my Masters' project. I added the warning:

"[The rhythm grade] is an approximate grade for how well your rhythms fit into your tempo. This grade calculation is known to be incorrect in some circumstances..."
but the grade would give people something to focus on.

And focus on it they did. :|

In retrospect, I shouldn't have been surprised at the amount of interest in getting high grades -- I'm highly competitive, and my published paper that gets the most citations is all about music education with games (in particular, score-giving games!). So I really have no excuse for not expecting people to focus on their rhythm grade, instead of the tempo detection.


But wait, it gets worse! After I added the grading, I did another test run with a larger group of family and friends. Two people commented that the grading seemed overly generous, but a very quick test ("good taps -- ok, 98%. tap randomly -- 0%. tap with one tap in the wrong place -- 38%. seems ok!") didn't reveal anything horribly wrong. And since I was already two weeks into my experiment time (the ethics committee approval is for a specific range of dates, and I didn't start user testing until the beginning of this range. I should have asked for more time; I'm certain I would have gotten it!), I went ahead and sent it to the wider audience (including this blog). I mean, I'm using the same grading as my Masters, and I tested that one quite a bit, right?

Well... it was almost the same as the earlier grading. The difference is that the Masters version acted on frames -- it took recorded audio, divided it into frames of 512 samples, then looked for claps in the RMS of those frames. Each frame is 512 samples / 44100 samples/second = 11.6 milliseconds long. The flash timer gave me seconds.

Now, I didn't forget to convert between milliseconds and seconds. But I did forget about the frame size. So the total "amount of error" in a tap-game was 11.6 times too small. And the grade was simply 100.0 - total_errors.
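
In code terms, the mistake boiled down to something like this -- a reconstruction of the shape of the bug, not the literal grading code:

// Not the literal grading code -- just the shape of the mistake.
// The grade simply subtracts the accumulated error from 100:
double grade_from(double total_errors)
{
    return 100.0 - total_errors;
}

// One frame in the Masters version was 512 samples:
const double FRAME_MS = 512.0 / 44100.0 * 1000.0;   // about 11.6 ms

// The flash port accumulated the tap errors without that frame-size factor,
// so the total fed into grade_from() came out roughly 11.6 times smaller than
// the grading formula was designed for -- hence everyone scoring 96-100%.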

Ouch. What effect did that have? Well, compare these two graphs:

Ouch.

If you missed it, look at the scale of the X axis. The scores range from 96% to 100%. They were supposed to go from 60% to 100%.

NB: these are the scores for tap-games with absolutely no ambiguity. No missing taps, no extra taps, no incorrect rhythms, etc. The other 481 tap-games had much lower scores, due to penalties for missing/extra/incorrect taps. It wasn't quite a complete disaster -- if I were using the same type of grading algorithm that Rock Band and Guitar Hero seem to use ("if the event was within X seconds of the correct time, get full points; otherwise, get 0 points"), then all of those tap-games would probably have received 100%.

But even if it wasn't a complete disaster, it was certainly a huge screw-up. It doesn't affect the scientific outcome -- I have the raw data; I can test all my tempo detection stuff on those taps. And I can also experiment with different grading algorithms to see what kind of distributions they'd give.

However, it made the game less fun (or less educational) to the participants. I suggested that, in return for your participation in this experiment, you could see a reasonable grade of your rhythmic accuracy, for your amusement or education. That didn't happen, and I apologize.


On to the ugly, or back to the good.

Posted at 2010-10-09 05:45 | Permanent link | Comments

Tempo experiment analysis: the good

Sorry for the delay in posting this analysis; sickness and wanting to have everything finished interfered. It's not completely finished, but I have enough interesting things to discuss for now. This blog post deals with things that went well.

The best thing that happened was that you guys actually played the game! :) I got much, much more data than I was expecting -- in the end, there were 1041 tapping games played, with a total of 14,342 taps. I was hoping for 200 or so tap-games, so this gives me five times as much data as I was hoping for in my wildest dreams!

Of those tap-games, 882 resulted in the player accepting the computer's tempo, 89 resulted in the player giving their own tempo, and 70 resulted in a disagreement about tempo but no other tempo given. 635 tap-games used an audio metronome, while 361 used a visual metronome, and 45 used no metronome at all.


Most tempi (plural of "tempo"; no, it's not "tempos") were very close to the metronome -- looking at tap-games with user-agreed or user-specified tempi, and excluding tap-games with no metronome, we find that 82.1% of them had a difference of less than 5%, and 41.7% had a difference of less than 1%.


The distribution of tap-games played per level shows an unsurprising majority of level 1, but a surprising amount of interest in level 9. I'm guessing that this is because the game was too easy (see next blog post).


Returning to the main point of this experiment, 560 tap-games have their tempo detected trivially with linear regression via ordinary least squares. The player tapped the correct rhythm, with no note more than a tenth of a quarter-note beat away from the strictly-metronomic position. I call those the "boring" tap-games... I mean, if the obvious solution works, then it's no fun! :)
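
For the non-statisticians: "linear regression via ordinary least squares" here just means fitting a straight line through (beat number, tap time) pairs; the slope of that line is the beat duration, and 60 divided by the slope is the tempo. A minimal sketch (which assumes we already know which beat each tap was aimed at -- exactly what makes these games "boring"):

#include <vector>
#include <cstddef>

// Minimal OLS fit of tap times against beat numbers; returns the tempo in
// beats per minute.  Assumes the beat number of each tap is already known.
double fit_tempo_bpm(const std::vector<double>& beat,   // e.g. 0, 1, 2, 2.5, 3 ...
                     const std::vector<double>& time)   // tap times in seconds
{
    const size_t n = beat.size();
    double mean_b = 0.0, mean_t = 0.0;
    for (size_t i = 0; i < n; ++i) { mean_b += beat[i]; mean_t += time[i]; }
    mean_b /= n;
    mean_t /= n;

    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < n; ++i) {
        num += (beat[i] - mean_b) * (time[i] - mean_t);
        den += (beat[i] - mean_b) * (beat[i] - mean_b);
    }
    const double seconds_per_beat = num / den;   // slope of the fitted line
    return 60.0 / seconds_per_beat;
}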

Happily, this leaves 481 "interesting" tap-games which require more complicated solutions. I'm still working on this part; my latest algorithm correctly identifies most of the tempi, but there's still a bunch left, and I think my current approach isn't going to pan out. I need to take a break for a day and return to it with a fresh mind.

Oh, on the topic of "boring" exercises -- judging from the data I've been looking at manually, I estimate that approximately two thirds of the tap-games can be handled with simple linear regression. The missing hundred tap-games (between "two thirds of 1041" and the "560" mentioned above) come from wanting to avoid false positives. I'm being paranoid about false positives; I would rather have the computer spend more time analyzing the taps instead of producing an incorrect tempo estimate.

On to the bad, or even the ugly.

Posted at 2010-10-08 23:42 | Permanent link | Comments

Tempo experiment over

The online tempo experiment is now over; I am no longer collecting information. You are welcome to continue playing the game if you find it helpful, though! More information below.

Thanks so much for participating! We collected a total of 1041 games played. Of these, 882 games had the human player agree with the calculated tempo. In 89 games, the player disagreed with the calculated tempo and gave their own tempo estimation, and 70 games had the player disagree with the tempo without giving their own tempo estimation.

My task now is to examine all these games; my goal is to have "3-agent agreement". For the "agreed with tempo" games, I need to check that I agree that the tempo is good -- this means that the player, me, and the computer all agree on the tempo. For both categories of "disagreed with tempo", I need to figure out what the real tempo should be, then tweak my algorithm so that the computer produces that tempo -- without ruining any of the detected tempi of the "agreed" examples. And if there's any disagreement between me and the player, I'll have to show the specific examples to our resident professor emeritus of music so that he can give a tie-breaking vote. And if he disagrees with both of us... well, there will probably be so few of those games that I can just discuss them individually in a special section of the paper. (I doubt there'll be any of those, though)

A few people asked me privately about getting 100%, so I took a quick glance at the games. The very best game had an average error of 1.25 milliseconds (this was on level 5), while the best game on level 1 had an average error of 12.6 milliseconds. Mathematicians everywhere just winced at my use of the word "average", so let me clarify that those were the RMSE. The mean squared errors were 1.5*10^-6 and 1.5*10^-4, respectively.

Those games were very much the exception, however -- most "great" exercises had an "average" error of 30-40 milliseconds. This is just based on me skimming through a list of hundreds of numbers, though... a detailed (and non-subjective :) analysis will be coming in the next few days.

Posted at 2010-10-02 22:13 | Permanent link | Comments
