evan_tech

03:10 am, 24 Dec 04

ljrank

So you take your friends graph as a, um, graph, and then you run PageRank over it. You expect to find the most "valuable" LJ users, where valuable is defined in the typical recursive fashion as those who're read by the most valuable users.
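
(In code, the iteration is roughly the following. This is a minimal Python sketch of plain PageRank power iteration, not my actual pr program; the 0.85 damping factor is the textbook value from the PageRank paper, and the function and variable names are just for illustration.)

def pagerank(graph, damping=0.85, max_steps=50):
    # graph maps each user to the list of users on their friends list.
    # Assume every user that appears anywhere is also a key in graph
    # (possibly with an empty friends list).
    n = len(graph)
    rank = dict.fromkeys(graph, 1.0 / n)
    for step in range(max_steps):
        # Everyone starts each step with the undamped "teleport" mass...
        new = dict.fromkeys(graph, (1.0 - damping) / n)
        for user, friends in graph.items():
            if not friends:
                continue  # dangling node; its mass just leaks in this sketch
            # ...plus an equal share of each reader's damped rank: listing
            # someone as a friend pushes rank from the reader to the read.
            share = damping * rank[user] / len(friends)
            for friend in friends:
                new[friend] += share
        # Stop once the ranks stop moving between steps.
        delta = sum(abs(new[u] - rank[u]) for u in graph)
        rank = new
        if delta < 1e-6:
            break
    return rank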

I've been thinking about this for years; I remember tossing around the idea of producing a "degree of bradness" as a sort of popularity contest.

Anyway, I've had some LJ friends list exports sitting around for a while, so I hacked it up tonight.
% ./pr                  
Loading 1853: 10% [bettyray] 20% [dariusk] 
30% [fleur] 40% [iseebi] 50% [lavish] 60% [mordainlove] 
70% [polyethylene] 80% [shelbydee] 90% [thelonious] 
100% [zoolthelung] 
Loaded 86197 users total.
00:0.838194 01:273.662964 02:21.830791 03:163.136047 
04:13.124767 05:97.411224 06:8.424023 07:58.278320 
08:5.468758 09:34.933289 10:3.806503 11:20.983643 
12:2.911346 13:12.634644 14:2.156403 15:7.624634 
16:1.564331 17:4.617676 18:1.115997 19:2.799561 
20:0.787308 21:1.706665 22:0.551987 23:1.040894 
24:0.383224 25:0.642273 26:0.265137 27:0.395569 
28:0.182526 29:0.242188 30:0.125854 31:0.152222 
32:0.087189 33:0.091492 34:0.060013 35:0.055664 
36:0.041595 37:0.037964 38:0.027573 39:0.027344 
40:0.018799 41:0.017044 42:0.013931 43:0.011719 
44:0.008804 45:0.008606 46:0.006775 47:0.005219 
48:0.005386 49:0.003326 
50 steps to converge.
                evan: 74.869820
                brad: 135.409088
               meenk: 27.871605
          postmortal: 11.896143
             patrick: 40.594044
             graydon: 7.771641
                 jwz: 237.453430
As you'd expect, jwz is the winner of my test set.

I have 1853 friends lists in my data set, mostly centered around me, which is definitely a source of bias. But 1853 nodes is a pretty large chunk of the graph (86,000+ users are connected via those lists!). The scoring is initialized with me getting a rank of 1, which is another source of bias. The users printed at the end were picked at random.
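
(Concretely, that seeded initialization looks something like this; again just a sketch, not the pr program itself:)

# All of the initial rank mass starts on my own node, which is what
# skews the scores toward my corner of the graph.
rank = dict.fromkeys(graph, 0.0)
rank["evan"] = 1.0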

This only takes a second or two to compute. I figure that ultimately, 5M LJ users * 4 bytes of float rank per user * 2 sets of ranks (prev, cur) would only be 40MB to grind over. The actual set of edges (which are read while grinding) is much larger, though. (I really ought to be able to quote the expected edge count in a power-law graph like LJ off the top of my head, but I'm not quite there yet.) If this gets outside the bounds of the RAM on this machine I'll probably stop there, for I fear my knowledge of things at work would sneak in. But so far this is solely based on my understanding from reading a few papers...
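
(Spelled out, that back-of-the-envelope estimate is:)

# Rough memory bound for the rank vectors alone; the 5 million user
# count is just the round figure from above, not a measured number.
users = 5000000       # all of LJ, roughly
bytes_per_rank = 4    # one 32-bit float per user
rank_sets = 2         # prev and cur
print(users * bytes_per_rank * rank_sets)  # 40000000 bytes, i.e. 40MB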