Artifastring 1.0
I'm happy to announce the release of Artifastring 1.0, a highly optimized physical simulation of a violin for sound synthesis. http://percivalmusic.ca/artifastring/
This is an implementation of modal bowed string physical modeling as described by Matthias Demoucron; references are given on the project's main page. The code is available under the GPL v3 or higher.
I had a blast optimising this stuff; actually, I had far too much fun working on tiny speed improvements. Surprisingly, making the code multithreaded didn't help: the extra overhead of the mutexes and condition variables nuked the benefit of modeling each violin string separately for any realistic length of discretized input.
I won't get into everything I did, but here are the four main points:

As suggested in a conference paper, if a violin string wasn't vibrating and had no input, I didn't bother calculating new values for it; I just returned 0.0. (This model doesn't include sympathetic vibrations, so there was absolutely no change in the output from doing this.) I also detected if the string was barely vibrating (i.e. producing a force of less than 1e-6 on the bridge, with each modal velocity below 1e-4), and then "turned off" that string. This step did change the output, but only very slightly. I picked those values experimentally: I plucked the string, then looked for the time when I really couldn't hear the sound any more. Admittedly, I was using a pair of £10 headphones... if I had a more expensive sound setup (better headphones or speakers, a larger room, a room with less background noise, etc.), these values might be too low. If that were the case, it's easy enough to change those constants.

Caching any values more complicated than an addition or a multiplication. In particular, the results of any trig functions were cached; the amount of extra memory required is trivial (something like 20 kB?).
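The caching boils down to a lookup table filled once at setup time. A sketch (hypothetical names, not the shipped code):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

const int NUM_MODES = 56;

// Precompute sin(w*i) for every mode once; sin() is never called
// again in the audio loop.  NUM_MODES+1 doubles is under 500 bytes
// per table, so even many tables stay far below 20 kB.
std::vector<double> make_sin_table(double w) {
    std::vector<double> table(NUM_MODES + 1);
    for (int i = 0; i <= NUM_MODES; i++) {
        table[i] = std::sin(w * i);
    }
    return table;
}
```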

The code calculates the eigenvectors of the string for each mode. This is a series of calculations of the form
y_i = sin(x*i), for i = 0, 1, 2, ... N (where N = 56 in my case)
There's a well-known optimization for this: to calculate y_n = sin(n*W + B), use y_n = 2*cos(W)*y_(n-1) - y_(n-2).
I could improve on that, though. In my case, there's no B, and I know that sin(0) = 0. So I only need to calculate sin(W) and cos(W).
But wait, it gets better! The quoted webpage includes this code:
double y[3]; double keep[N];
for (int i = 0; i < N; i++) {
    y[2] = p*y[1] - y[0];
    keep[i] = sqrt_two_div_l * y[2];
    y[0] = y[1];
    y[1] = y[2];
}
but this involves unnecessary copying. We can avoid that by using a ring buffer of size 2:
double y[2]; double keep[N]; unsigned int latest = 0;
for (int i = 0; i < N; i++) {
    y[latest] = p*y[!latest] - y[latest];
    keep[i] = sqrt_two_div_l * y[latest];
    latest = !latest;
}
Hmm... it just occurred to me that I could avoid using y entirely if I postponed the multiplication by sqrt_two_div_l until after I'd finished calculating the entire array: go through the array once calculating the sin() values, then go through it again multiplying everything by sqrt_two_div_l. I could also get rid of the "latest" variable and just use i. Hmm... :)
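Putting those last two ideas together (fill keep[] directly with the recurrence, then apply the constant factor in a second sweep), the loop might look like this. This is a sketch of that refinement, not the shipped code, and it assumes y_0 = sin(0) = 0, storing sin(w*(i+1)) at keep[i]:

```cpp
#include <cassert>
#include <cmath>

const int N = 56;

// Fills keep[i] with sqrt_two_div_l * sin(w*(i+1)) for i = 0..N-1,
// using the recurrence y_n = 2*cos(w)*y_(n-1) - y_(n-2).
// Only one sin() and one cos() call in total.
void eigenvectors(double w, double sqrt_two_div_l, double keep[N]) {
    const double p = 2.0 * std::cos(w);
    double prev2 = 0.0;          // sin(0)
    double prev1 = std::sin(w);  // sin(w)
    keep[0] = prev1;
    // First pass: plain sin values via the recurrence, no temporaries.
    for (int i = 1; i < N; i++) {
        double next = p * prev1 - prev2;
        keep[i] = next;
        prev2 = prev1;
        prev1 = next;
    }
    // Second pass: apply the constant factor in one sweep.
    for (int i = 0; i < N; i++) {
        keep[i] *= sqrt_two_div_l;
    }
}
```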

Fast ring buffer pointer calculation: the traditional way is to use modulo:
i = (i + 1) % BUFFER_SIZE;
But modulo is a division, and division is slow. Another way is to use a comparison:
i++; if (i == BUFFER_SIZE) { i = 0; }
That's better, but it still costs a compare and a possible branch on every sample. To improve it even more (courtesy of my brother): if you always use a buffer size which is a power of 2, you can just use an AND!
i++; i &= BUFFER_SIZE_MINUS_ONE;
Of course, BUFFER_SIZE and BUFFER_SIZE_MINUS_ONE are defined at compile time.
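The trick works because for a power-of-two size, i % BUFFER_SIZE and i & (BUFFER_SIZE - 1) give the same result for any non-negative i. A self-contained sketch (hypothetical names):

```cpp
#include <cassert>

// Must be a power of two for the AND trick to be valid.
const unsigned int BUFFER_SIZE = 8;
const unsigned int BUFFER_SIZE_MINUS_ONE = BUFFER_SIZE - 1;

// Advance a ring-buffer index with no division and no branch.
unsigned int advance(unsigned int i) {
    i++;
    i &= BUFFER_SIZE_MINUS_ONE;  // same as i % BUFFER_SIZE, but just an AND
    return i;
}
```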
Sadly, none of this optimization is actually useful. Ok, the "inactive string" trick cut the processing time by more than half, but the other tricks only gave 2-5% improvements in speed. One of the first lessons for people doing computer science is that constant factors in running time don't matter: the real improvements are made at the level of algorithms. If your algorithm does asymptotically fewer operations than the next person's (say, O(log n) instead of O(n)), then it doesn't matter if you wrote your code in Perl and he used hand-crafted assembly; your code is guaranteed to be faster for any sufficiently large problem.