Graphing commits per developer

2010-08-05 08:01

I gave a talk at Rencontres Mondiales du Logiciel Libre, or the Libre Software Meeting 2010, over three weeks ago. This post has the scripts that I used to create the graphs of commits per developer. These scripts aren't particularly inspired, but there's been some interest in them, so in the spirit of open source and just plain good science, here they are. :)

Here's the ultimate goal: a graph showing commits per developer.

More context is in the previous post about the talk, Sustainable Development in F/OSS.

First, we need a program to extract statistics. I used gitstats, and heartily recommend it to others.

Second, we need to know the range of dates to use. I used 6-month intervals. I wimped out here -- I'm sure that it's possible to make git tell you this automatically, but instead I just skimmed through the logs in gitk and manually picked out the first commit in each date range. The result is this:

2005-1  cccee40bb3a13b6c230fa98a8ca61f7c526d5f66
2005-7  b28d02e3cbb81b050cf4c6bba3260a331d4f7d3f
2006-1  16b713b9b9616d70c4a1b12506952098c01c5706
2006-7  eaed6aa8f16502d8159530eaf3ed4d56dbf8fef8
2007-1  ad33cc4b37628e047e4c8c3b3d42d93cb7525d0f
2007-7  699feb0fbe8448a94ca5d9de43dd126d34fd5341
2008-1  57221c9ee7a758169c9e2fe4805f0ed3598f50d5
2008-7  b57384106e4e54f4a22a93af88204f86f37d078b
2009-1  015ac40ff340255ea4ac76fbe5b5eab3c2a40adc
2009-7  bee6f18b78c002187d9197b6a334ff61c13a4ef9
2010-1  9429c53e9e4a1f88b32f4897fdc261ed9bd6684f
2010-7  ec376e079a0dc586ba3fe17113993525f2be69c2

(I used a tab between the columns, but that won't come out in the html)
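The manual skim through gitk could probably be automated. Here is a minimal sketch, assuming that "git rev-list" with --since/--until picks out the same commits a human would; that is not guaranteed, since git filters on committer dates and is fairly loose about how it reads a bare date. The function names and the write_dates_file helper are made up for illustration.

```python
import subprocess

def half_year_starts(first_year, last_year):
    """Return (label, ISO date) pairs for 1 January and 1 July of each year."""
    starts = []
    for year in range(first_year, last_year + 1):
        for month in (1, 7):
            starts.append(("%d-%d" % (year, month),
                           "%04d-%02d-01" % (year, month)))
    return starts

def first_commit_since(repo_dir, since, until, branch="HEAD"):
    """Return the oldest commit git reports between the two dates, or None."""
    out = subprocess.check_output(
        ["git", "rev-list", "--reverse",
         "--since=" + since, "--until=" + until, branch],
        cwd=repo_dir)
    hashes = out.decode().split()
    # --reverse lists oldest first, so the first hash starts the interval
    return hashes[0] if hashes else None

def write_dates_file(repo_dir, first_year, last_year, filename):
    """Write "label<TAB>commit" lines, one per 6-month interval."""
    starts = half_year_starts(first_year, last_year)
    # one extra boundary closes the final interval
    starts.append(("end", "%04d-01-01" % (last_year + 1)))
    with open(filename, "w") as out:
        for (label, since), (_, until) in zip(starts, starts[1:]):
            commit = first_commit_since(repo_dir, since, until)
            if commit:
                out.write("%s\t%s\n" % (label, commit))
```

Calling write_dates_file with the repository path, 2005, 2010 and "lily-dates.txt" would produce a file in the same two-column layout as the list above, although the hashes it picks may differ from the hand-picked ones.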

Third, I ran the script below to run gitstats for each range of dates/commits:

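#!/usr/bin/env python
# make-gitstats.py
# public domain if that matters; this is utterly trivial
import os

# os.system() runs the command through a shell, which expands $HOME
GITDIR = "$HOME/src/lilypond"
COMMITS_FILENAME = "lily-dates.txt"

# one "label<TAB>commit" line per date range; skip blank lines
commits = [line for line in open(COMMITS_FILENAME) if line.strip()]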

def getRange(begin, end):
        out_filename = begin.split('\t')[0] + '_' + end.split('\t')[0]
        commit_begin = begin.split('\t')[1].rstrip()
        commit_end   = end.split('\t')[1].rstrip()
        cmd = "gitstats"
        cmd += ' -c commit_begin='+commit_begin
        cmd += ' -c commit_end='+commit_end
        cmd += ' ' + GITDIR
        cmd += ' ' + out_filename
        print(cmd)
        os.system(cmd)

# whole range
getRange(commits[0], commits[-1])

for i in range(len(commits)-1):
        getRange( commits[i], commits[i+1] )

Fourth, I wimped out again. The previous step created a bunch of directories containing info about each date range, nicely formatted in HTML. Instead of grabbing the info I wanted directly with Python, I copied the table of authors (from the web page) into OpenOffice. I split up the commit field (which contains both the raw number and the percentage of the total, like "3656(26.20%)") with:

=VALUE(LEFT(B2;FIND("(";B2)-2))
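For anyone who would rather skip the spreadsheet, the same split could probably be done in Python. This is only a sketch: parse_commit_count is a hypothetical name, and it assumes each cell looks like the "3656(26.20%)" example, with or without a space before the parenthesis.

```python
import re

# leading digits, optional whitespace, then the opening parenthesis
_COUNT_RE = re.compile(r"^\s*(\d+)\s*\(")

def parse_commit_count(cell):
    """Return the raw commit number from a cell such as "3656(26.20%)"."""
    match = _COUNT_RE.match(cell)
    if match is None:
        raise ValueError("unexpected commit cell: %r" % (cell,))
    return int(match.group(1))
```

With this, parse_commit_count("3656(26.20%)") gives 3656, and anything that doesn't fit the pattern raises an error instead of silently producing a wrong number.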

Then I copied the columns around and exported it as a csv, ending up with something like this:

"2005-1 to 2005-7"   "2005-7 to 2006-1"   ...
626                  517                  ...
237                  77                   ...
123                  67                   ...

(again, there might be whitespace tab/space issues in this display)
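The copy-the-columns-around step could also be scripted. Below is a rough sketch, assuming the commit counts for each date range are already in Python lists; ranked_columns and write_combined are made-up names. Each column is sorted so that row N holds the Nth-ranked developer for that range, and shorter columns are padded with empty fields. How gnuplot treats those empty fields depends on its datafile settings, which this sketch does not address.

```python
def ranked_columns(counts_by_range):
    """Take (label, [commit counts]) pairs in date order.

    Returns the header row and the data rows, one row per developer rank.
    """
    header = [label for label, _ in counts_by_range]
    columns = [sorted(counts, reverse=True) for _, counts in counts_by_range]
    depth = max(len(col) for col in columns) if columns else 0
    rows = []
    for rank in range(depth):
        # pad ranges that had fewer developers with an empty field
        rows.append([col[rank] if rank < len(col) else "" for col in columns])
    return header, rows

def write_combined(filename, counts_by_range):
    """Write tab-separated columns with quoted column headers."""
    header, rows = ranked_columns(counts_by_range)
    with open(filename, "w") as out:
        out.write("\t".join('"%s"' % label for label in header) + "\n")
        for row in rows:
            out.write("\t".join(str(value) for value in row) + "\n")
```

For example, passing [("2005-1 to 2005-7", [237, 626, 123]), ("2005-7 to 2006-1", [77, 517, 67])] gives rows starting 626 and 517, then 237 and 77, then 123 and 67, matching the layout above.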

Fifth, it's gnuplot time! Again, I could automate more of this file, but it worked for my purposes.

# combined.plot
set term png enhanced font '/usr/share/fonts/truetype/ttf-dejavu/DejaVuSans.ttf' 12

set title "Git commits to LilyPond, 2005 - 2010 in 6 month intervals"
set xlabel "Rank of developer"
set ylabel "Number of commits"

set key autotitle columnhead
set yrange [6:]
#set xrange [:20]

set style line 1  lt 1  lw 2 lc rgb "#0000ff"
set style line 2  lt 2  lw 2 lc rgb "#0033cc"
set style line 3  lt 3  lw 2 lc rgb "#006699"
set style line 4  lt 4  lw 2 lc rgb "#009966"
set style line 5  lt 5  lw 2 lc rgb "#00cc33"
set style line 6  lt 6  lw 2 lc rgb "#00ff00"
set style line 7  lt 7  lw 2 lc rgb "#33cc00"
set style line 8  lt 8  lw 2 lc rgb "#669900"
set style line 9  lt 9  lw 2 lc rgb "#996600"
set style line 10 lt 10 lw 2 lc rgb "#cc3300"
set style line 11 lt 11 lw 2 lc rgb "#ff0000"

set output "combined-normal.png"
plot "combined.csv" using 1 ls 1 w lines, \
        "combined.csv" using 2 ls 2 w lines, \
        "combined.csv" using 3 ls 3 w lines, \
        "combined.csv" using 4 ls 4 w lines, \
        "combined.csv" using 5 ls 5 w lines, \
        "combined.csv" using 6 ls 6 w lines, \
        "combined.csv" using 7 ls 7 w lines, \
        "combined.csv" using 8 ls 8 w lines, \
        "combined.csv" using 9 ls 9 w lines, \
        "combined.csv" using 10 ls 10 w lines, \
        "combined.csv" using 11 ls 11 w lines

set output "combined-log.png"
set log y
plot "combined.csv" using 1 ls 1 w lines, \
        "combined.csv" using 2 ls 2 w lines, \
        "combined.csv" using 3 ls 3 w lines, \
        "combined.csv" using 4 ls 4 w lines, \
        "combined.csv" using 5 ls 5 w lines, \
        "combined.csv" using 6 ls 6 w lines, \
        "combined.csv" using 7 ls 7 w lines, \
        "combined.csv" using 8 ls 8 w lines, \
        "combined.csv" using 9 ls 9 w lines, \
        "combined.csv" using 10 ls 10 w lines, \
        "combined.csv" using 11 ls 11 w lines

If you want to use any of this, go ahead! I declare it to be public domain.

Caution: as I mentioned in my talk, the raw number of commits per developer is a vague measure. It doesn't include emails (whether organizing the project or reviewing patches), the amount of work behind each patch, and so on. For example, in the range of Jan 2010 - July 2010, I'd say that the two programmers who did the most work (in terms of the bugfixes and new features they wrote, and also in terms of reviewing other people's patches) were #5 and #6 on the "commits per developer" list.

I've considered starting up a project to give a better measure of a project's "health" -- taking into account emails, bugs fixed, reviewing activity... in the case of lilypond, all that info wouldn't be too hard to gather, and it would certainly be a fun task to develop algorithms to deal with that data.

However, if I want my PhD to be finished ASAP -- and I definitely do want this done so that I can start on my postdoc life -- then I really can't afford to branch off onto that type of work. At least, not alone. If anybody else is interested in working on this project, I'd definitely consider joining that team. :)