The idea is to design algorithms that work optimally with the processor cache, without taking the cache size as a parameter to the algorithm. Asymptotic runtime still matters, but using the cache effectively also matters and oftentimes dominates in practice. Cutting to the spoilers: the strategy is divide-and-conquer, recursively reducing to subproblems until a subproblem (and all of its smaller recursions) fits in the cache. For example, with matrix multiply, you repeatedly split into submatrices along the biggest dimension, and eventually the three submatrices involved (source1, source2, destination) all fit in the cache, at which point the remaining work incurs no cache misses.
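As a sketch of that recursion (my own toy Python, not the paper's code): split the largest of the three dimensions in half and recurse, with a plain triple loop as the base case. The `leaf` cutoff is arbitrary; the recursion itself provides the cache behavior, and the cutoff only keeps call overhead sane.

```python
def rec_matmul(A, B, C, i0, i1, j0, j1, k0, k1, leaf=16):
    """Accumulate A[i0:i1, k0:k1] @ B[k0:k1, j0:j1] into C[i0:i1, j0:j1],
    splitting the largest dimension in half until the subproblem is small.
    Matrices are plain lists of lists; C must be pre-initialized."""
    m, n, k = i1 - i0, j1 - j0, k1 - k0
    if max(m, n, k) <= leaf:
        # Base case: ordinary triple loop on a small block.
        for i in range(i0, i1):
            for j in range(j0, j1):
                for p in range(k0, k1):
                    C[i][j] += A[i][p] * B[p][j]
    elif m >= n and m >= k:
        mid = i0 + m // 2  # split the rows of A and C
        rec_matmul(A, B, C, i0, mid, j0, j1, k0, k1, leaf)
        rec_matmul(A, B, C, mid, i1, j0, j1, k0, k1, leaf)
    elif n >= k:
        mid = j0 + n // 2  # split the columns of B and C
        rec_matmul(A, B, C, i0, i1, j0, mid, k0, k1, leaf)
        rec_matmul(A, B, C, i0, i1, mid, j1, k0, k1, leaf)
    else:
        mid = k0 + k // 2  # split the shared inner dimension
        rec_matmul(A, B, C, i0, i1, j0, j1, k0, mid, leaf)
        rec_matmul(A, B, C, i0, i1, j0, j1, mid, k1, leaf)
```

Every level of the recursion does the same arithmetic as the triple loop, just reordered; once a subproblem's three blocks fit in cache, everything beneath it runs out of cache for free.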
They casually toss out towards the end that they're the group that made FFTW, and that this is in fact the technique FFTW uses. (Right near the end, in the paper's last column, they describe the FFTW strategy: at compile time they code-generate a manual expansion of the recursion near the leaves to reduce call overhead, which then lets the C compiler do the optimizations, like register allocation, for them.) This makes me very inclined to think they know what they're talking about.
Another cool thing there I hadn't seen before [not that I know anything about this stuff] is their "bit-interleaved" matrix layout. See Figure 2 on the third page. It's a recursive memory layout for matrices that lends itself nicely to this sort of problem, with total locality within local blocks of numbers. What's the limitation? It's hard to compute the memory location of a given (row, column) position.
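To make that indexing cost concrete (my own sketch, assuming a square power-of-two matrix stored along a Z-order curve, which is one standard way to realize this kind of recursive layout): the element's offset comes from interleaving the bits of its row and column, a bit at a time, rather than from one multiply-add as in row-major order.

```python
def morton_index(row, col, bits=16):
    """Offset of (row, col) in a bit-interleaved (Z-order) layout of a
    2**bits by 2**bits matrix: interleave the bits of row and col, with
    row bits in the odd positions and col bits in the even positions.
    Each recursive quadrant of the matrix ends up contiguous in memory."""
    idx = 0
    for b in range(bits):
        idx |= ((row >> b) & 1) << (2 * b + 1)
        idx |= ((col >> b) & 1) << (2 * b)
    return idx
```

Compare `row * N + col` for row-major: one multiply and one add versus a loop over bit positions (or a table lookup / bit-twiddling trick in a tuned implementation), which is exactly the random-access penalty the layout trades away for locality.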
But to be honest, I'm still skeptical, in large part because I can't follow the equations as well as I'd like. I haven't done much complexity analysis. Take this easy one: they say an s-by-s submatrix needs Θ(s + s²/L) cache lines, where L is the cache line size... is that extra +s to account for each row of the submatrix potentially not being contiguous? Now that I've written that out I think I've convinced myself it's true, but that example is pulled from their introduction, and it gets more complicated quickly. Despite my ignorance, a lot of the complexity analysis feels like playing fast and loose, especially in section six, where they justify assuming the cache is fully associative and has perfect knowledge of the future by proving that a real cache has effectively the same performance(!).
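Here's the toy count that convinced me (my own sketch, assuming a row-major N-by-N matrix of unit-size elements with L elements per cache line): each of the s rows of the submatrix starts at its own offset, so even a tiny row can straddle an extra line, and that per-row overhead is where the +s comes from; the s²/L term is just the elements themselves.

```python
def lines_touched(N, s, L, r0=0, c0=0):
    """Count the distinct cache lines touched by the s-by-s submatrix
    whose top-left corner is (r0, c0) inside a row-major N-by-N matrix
    of unit-size elements, with L elements per cache line."""
    lines = set()
    for r in range(r0, r0 + s):
        for c in range(c0, c0 + s):
            lines.add((r * N + c) // L)
    return len(lines)
```

With N=1024 and L=16, an aligned 8x8 submatrix touches 8 lines (one per row, the +s term dominating), shifting it to column 12 makes every row straddle a boundary for 16 lines, while an aligned 64x64 submatrix touches 256 = s²/L lines, where the element count dominates. Both terms are real, hence Θ(s + s²/L).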
The other reason I'm skeptical is that their benchmarks at the end compare these algorithms to naive iterative algorithms, something that wasn't apparent to me at first. Why not compare them to their cache-aware equivalents? Clearly an algorithm tuned to the cache size will outperform these, and while it's nice to see that their approach beats the naive one, the interesting (to me) comparison is how much you lose by using this approach instead of the known-good tuned approaches. (It also makes me curious about how hard it is to really tune something for today's memory hierarchies; it would seem [without thinking about it too hard] that different tweaks would trade off locality at different cache levels, and it'd be hard to balance...)