29 Jan 05

lj data

I'm a big fan of GNU R. I fantasize about getting good enough at this sort of thing that I can cut up data in minutes. I'm certainly getting faster, but I have far to go. Here's a bit of a transcript and commentary of some R hackin' on my LJ data along with some graphs.

# First, get daily counts out of sqlite:
printf 'year\tmonth\tday\tcount\n' > table
sqlite3 journal.db 'select year, month, day, count(*) from entry group by year, month, day' \
    | sed -e 's/|/\t/g' >> table

# and read it into R:
data = read.delim('table')
attach(data)
# we now have vectors "year", "month", "day", and "count".

# create a vector of date objects:
dates = as.Date(paste(year, month, day), "%Y %m %d")

# unfortunately, this vector only has dates that have nonzero counts.
# so first create a vector of all dates in the range we care about:
alldates = seq(min(dates), max(dates), by='1 day') 

# then merge these with the existing list, creating holes where
# we didn't already have a count:
d = merge(data.frame(date=dates, count=count),
    data.frame(date=alldates), by='date', all.y=TRUE)
# by='date' means "join on the 'date' column",
# and all.y=TRUE means "put in holes when the second list
# has a value that the first doesn't".

# now fill in the holes ("NA" values) with zeros:
d$count[is.na(d$count)] = 0
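
Here's a minimal sketch of the same merge-and-fill trick on toy data, so you can see the holes appear and then get zeroed. The dates and counts are made up for illustration; they aren't from my journal.

```r
# Toy data (hypothetical): posts on 3 of 5 consecutive days.
dates <- as.Date(c("2005-01-01", "2005-01-02", "2005-01-05"))
count <- c(2, 1, 4)

# Every date in the range, including the days with no posts.
alldates <- seq(min(dates), max(dates), by = "1 day")

# Keep every row of the second frame; unmatched dates get count = NA.
d <- merge(data.frame(date = dates, count = count),
           data.frame(date = alldates), by = "date", all.y = TRUE)

# Replace the NA holes with zeros.
d$count[is.na(d$count)] <- 0

print(d)   # five rows, counts 2 1 0 0 4
```

merge() sorts on the join column by default, so the result comes back in date order.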

# get the weekdays of these dates into an ordered factor
# (weekdays() returns names in the current locale; the levels below assume English):
wd = ordered(weekdays(d$date, abbreviate=TRUE),
  levels=c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'))

# plot weekdays versus entry counts per day.
boxplot(d$count ~ wd, ylim=c(0,15), main="Posts per weekday")

# as you'd expect, weekends are different.
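
If you want numbers to go with the boxplot, tapply() will give a per-weekday mean. This is only a sketch on invented data (one post on weekdays, three on weekends), and it builds the factor from format(date, "%u") instead of weekdays() so that it doesn't depend on the locale. The variable names here are my own choices for the example.

```r
# Hypothetical toy data: two full weeks starting on a Monday.
date <- seq(as.Date("2005-01-03"), by = "1 day", length.out = 14)

# "%u" gives the ISO weekday number: 1 = Monday ... 7 = Sunday.
dow <- as.integer(format(date, "%u"))

# Invented counts: 3 posts on Sat/Sun, 1 post otherwise.
count <- ifelse(dow >= 6, 3, 1)

# Ordered factor with explicit labels, independent of locale.
wd <- factor(dow, levels = 1:7,
             labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"),
             ordered = TRUE)

# Mean posts per day, grouped by weekday.
means <- tapply(count, wd, mean)
print(means)
```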

# plot average posts per day over a 60-day window.
plot(d$date, filter(d$count, rep(1/60,60)), type='l', main='Posts per Day',
    ylab='average posts per day', xlab='year')
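
To see what filter() is doing there, here's a small sketch with a 3-point window on made-up numbers. By default it computes a centered moving average, and positions where the window runs off either end come out as NA. I've written stats::filter explicitly in case some other package has masked the name.

```r
# Made-up series that alternates between 0 and 3.
x <- c(0, 3, 0, 3, 0, 3)

# Equal weights of 1/3 over a 3-point window: a centered moving average.
smoothed <- stats::filter(x, rep(1/3, 3))

print(as.numeric(smoothed))   # NA 1 2 1 2 NA
```

The same thing happens with the 60-day window: roughly the first and last month of the series are NA, so the plotted line starts and stops short of the date range.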