09:02 pm, 11 Apr 05
c++
What's faster:
1. A conditional branch:
if (runtime_constant_condition) { a(); } else { b(); }
2. Calling a virtual function:
a_or_b();
that does something different depending on the subclass. (As I understand it, an indirect jump?)
In either case, I believe option 2 is much more likely to trash your cache twice, since you're going from an object-local dereference (wherever the object lives) over to its vtable (somewhere in the data sections, .rodata or .data), then hopping off to the actual function (God knows where in .text). But I'm mostly talking out my ass there (see above).
* Feel free to substitute anyone else who knows the cycle cost of various operations on x86. I imagine this is a rather small degree of freedom.
I don't know much about compilers or hardware, but my guess is that branch predictors make (1) fast and hard to benchmark. I have no idea about (2).
1:
    load runtime_constant_condition, reg0
    compare reg0, 1
    jump-not-equal b
a:  call a()
    jump c
b:  call b()
c:
2:
    load object address, reg0
    load (vptr)reg0, reg1            ; the object's vtable pointer
    load (vtable offset)reg1, reg2   ; the function's address
    jump reg2
So worst case for 1 is a cache miss, a branch mispredict, and the static cost of an unconditional jump. I'm ignoring the cache miss for the target's instructions (a or b) because you pay that cost whether you have a function pointer or an absolute address. Worst case for 2 is two cache misses: one to load the object's vtable pointer, and another to get the vtable entry.
Now for the second case, you're very likely to have the object's address in cache, or at least in a register. For the first case, it's hard to say how likely you are to have some global memory in cache that's holding this variable.
So if we ignore the caches, it comes down to which pipeline stage checks the branch target buffers (BTBs) versus which stage handles branch prediction. If you're executing this code a lot, the targets are likely to be stored in the BTBs. The particulars vary greatly from architecture to architecture. I believe that at least on the P4, the indirect jump will actually complete a while before the branch prediction unit can generate an answer to feed to the instruction fetcher. This certainly favors 2.
For such a simple case as this, there probably is no clear winner, and the difference in clock cycles will be relatively small. As soon as you're switching on more than one value of runtime_constant_condition, things start to look brighter for the second case, however. So I'm sorry to say this answer was not as satisfying as I had originally hoped it would be.
The first case wins if the code is not frequently run; the second case wins if it is. This assumes the behavior of a Pentium 4 or a late-generation Alpha processor, and I assume the Opteron as well, but I don't know for certain.
I'd say you are right if you are comparing:
for (;;) {
    if_based();
}
and
for (;;) {
    a->vfunc();
}
But if you were applying vfunc() to an array of A's through a base pointer (of course, the compiler doesn't make a vtable access at all if it knows a is of type A instead of type Base), then regardless of whether they are all the same concrete type, you would absolutely get a cache miss on every access of the vtable address (unless, of course, every single object's vtable entry for this function is in cache because you just did this operation).
So I guess what I'm saying is that it depends greatly on how many instances of the object we are calling vfunc() on, and where runtime_constant comes from. Which is the greater assumption: that runtime_constant is in cache, or that every vtable address of every instance of A that vfunc() is called on is in cache?
The more important question to me, as an engineer, is the one I posed earlier: which is more expensive? The fact that a virtual function is even on the table as a solution implies to me that there are future needs beyond what this a and b can satisfy with their implementations of vfunc(). In that case, prefer the virtual function approach: it is more maintainable and more readable, and therefore less expensive. This, of course, assumes you can accept the cost of a cache miss in your application.