April 12th, 2004

  • evan

graphjoin revisited

I thought I'd take another stab at this graphjoin problem. I went to the local cafe only to find I had a really old copy of the code on my laptop.
So I rewrote it, but in C this time, in the way the author of the paper suggested (using an adjacency matrix instead of an adjacency list).

I have friends data for: barbee brad bram dl ellenlouise erik evan faith gaal lisa meena mobley patrick petit_chou toast whitaker xaotica. I figured that'd have some interesting subgroups because they're different circles of friends of mine.
To run on my 400mhz Powerbook: ./friends 23.79s user 0.14s system 92% cpu 25.800 total
Way slow.

gprof says:
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 93.33    216.86   216.86     1101     0.20     0.21  matrix_step
  3.71    225.47     8.62     1100     0.01     0.01  merge
  1.60    229.18     3.71   730718     0.00     0.00  matrix_dq
  0.92    231.32     2.14  1948423     0.00     0.00  lookup

Which is not at all what I expected. matrix_step is the main iteration of the algorithm: it basically iterates matrix_dq over all edges and picks the largest one.
And matrix_dq (which is called a whole lot of times) is the part that moves into floating point and does some divides... but it's not the bottleneck. Instead, it's the iteration over the (rather sparse) graph. But I guess with n people (1105 in this case), you merge somewhere around n times (I end up with four groups), so it's O(n^3) iterations.

Then I rewrote it the way I did it in OCaml, except I could mutate the lists: 0.35s. Kick ass.
  • evan

graphjoin results

Brad asked if I got any interesting results.

When I run it over me + all my LJfriends - all people in that set who only have one link into it (1848 users total), it finds:
two huge groups (not too interesting),
one medium-sized one: Collapse ) that appears to be primarily LJ support and people from school, and then finally two small groups:
*** 313: lptran puzzlefite limenaide almondsilk rcyang supercat1102 ktnguyen qmdau
*** 328: diapholom hajime kayeyem xian nocte _generica jehilia soulwatcher bbt

The first one is my ex-roommate, his ex-girlfriend, and a bunch of his friends.
The second one includes two people I recognize from a group of friends in Australia. I don't know any of the others.

I think the problem is that by picking me and all of my one-hop friends, the nodes are unevenly distributed around me. There's also a low "q", the value the algorithm tries to maximize to indicate clustering strength. I wonder if throwing more data at it will make it worse or better...?

Anyway, if you'd like to play with it, the code (in C this time!) is up in my arch repository. I even put up simple instructions for checking things out for people who've never used arch.