evan_tech

Previous Entry Share Next Entry
11:35 am, 1 Dec 03

lj analyses

  • The LiveJournal interests graph is odd: each user is linked to a bunch of interests, and each interest is linked to a bunch of users (in this representation, all links are bidirectional). No users are directly linked and no interests are directly linked.

    Two subset interpretations:
    1. Make the nodes exclusively users and the links to other users based on shared interests. Now you can find a "what do [a] and [b] have in common?" based on shortest paths.

    2. Make the nodes exclusively interests and the links to other interests based on users who include both interests. (This graph may be more useful than the one above if the number of interests is below 20k or so, because then we can do full graph computations on a normal computer.) Make the link cost a function (inverse) of the number of users who include both interests. Now you can find the diameter (and the maximally-distant links) as the "most dissimilar" interests; users who intentionally pick dissimilar interests with the intention of messing up the graph will only add a high-cost link.

    Figuring out who in general has "interests in common" with a given user is a different problem. I recall some stuff from my stats class about relative information (a common interest like "music" is much less meaningful than a specific one like "debian") but I don't remember the details of how it works. :(

  • The shortest-path problem between two users or interests is much easier to solve if you're only computing for two specific users. A breadth-first search from both ends at the same time ought to be pretty quick; I imagine this is how LJ Connect actually works.

  • I'd really like to run some imaginary "flow" algorithm and figure out who is the "most popular" (high in-degree in friends links isn't as meaningful as "valuable" links-- think Google's pagerank), but I don't know anything about how those work.