# TSC and high precision time measuring
this is a gist for high accuracy performance benchmarking, using CPUs invariant cycle counter.
- Just show me the code
- RDTSC
- out-of-order execution and barriers
- example: what’s wrong with chrono
- Other architectures (TODO)
- Copyright notice
# Just show me the code
I made a handy single header C/C++ wrapper for rdtsc
. The code is shared under
EUPL-1.2 license, Available at
https://git.sr.ht/~shrik3/tsc_clock.git
|
|
NOTE: add constrains and compiler attributes for better optimization, e.g.
__attribute__((always_inline)) static inline rdtsc() {/*SNIPS*/}
NOTE: CPUID
instruction takes eax
to chose from different leaf functions.
I’m not sure if some are more costly than others, so I simply cleared eax
to
chose the basic leaf in the barrier_after()
variant.
NOTE: this example from linux kernel
uses lfence
or mfence
as barrier, and guards only before rdtsc
, not after.
Which suggests that the rdtscp
instruction along may suffice.
Read below for explanations (hopefully).
# RDTSC
- Real Time Clock (RTC)
- battery powered hardware clock on the mainboard.
- Invariant Time Stamp Counter (TSC)
- the invariant TSC runs at constant speed regardless of C state or CPU frequencies
- TSC Frequency
- the invariant CPU frequency can be calculated from several
CPUID
values. The libcpuid can already do it for you. e.g.$ cpuid_tool --clock-rdtsc > 2592 # unit in Mega Herz (one Million Hz)
- HW support
- check CPUID for features, e.g.
tsc, rdtscp, constant_tsc, nonstop_tsc
- Instruction
rdtsc
&rdtscp
- both instructions reads the 64 bit tsc value into
edx:eax
. Instructionrdtscp
additionally reads the TSC aux value intoecx
, we ignore this value in the examples but don’t forget to clobberrcx
. - Serialization
rdtsc
has no serialization what-so-ever, so out-of-order execution may break your measurement.rdtscp
has serialization before, but not after:rdtscp
must wait until all instructions before it complete. But instructions after it may still be executed out-of-order, i.e. executed before the TSC value is read out. For explicit serialization, we could useCPUID
instruction before or afterrdtsc(p)
.- Wrap-around issue?
- The time-stamp counter is a model-specific 64-bit counter that is reset to zero each time the processor is reset. If not reset, the counter will increment ~9.5 x 1016 times per year when the processor is operating at a clock rate of 3GHz. At this clock frequency, it would take over 190 years for the counter to wrap around.
# out-of-order execution and barriers
The existential question to ask is what exactly are you measuring when
instructions are executed out-of-order? (note, RDTSC
denotes the moment the
TSC value is read out (either into reg, or written to memory)
I. Ideally we want this | II. But it's fine to measure this | III. what actually happens is this bullshit |
|
|
|
I. Again, you to measure this |
IV. with |
V. with only |
|
|
|
you want to measure execution of inlined foo()
. But what actually happens is
III. where foo()
instructions may be reordered outside of your measurement
intervals, and other instructions may be included. The measured interval between
RDTSC
readings does not represent the execution time of foo()
A stronger configuration is V. where we replace rdtsc
with rdtscp
, which has
an internal barrier that waits until all preceding instructions are complete.
However it doesn’t prevent the sequential instructions to be re-ordered before
reading tsc value:
The RDTSCP instruction waits until all previous instructions have been executed before reading the counter. However, subsequent instructions may begin execution before the read operation is performed1.
With configuration IV, enforcing serialization with CPUID
barrier on both
sides ensures that we are measuring the runtime of foo
with the tradeoff of
total runtime.
question: does function call make a difference? If foo()
is not inline,
foo()#1~4
and other part of the program are delimited by jmp/call
and
ret
. Does function calls (or jumps) act as serialization?
# example: what’s wrong with chrono
timeval
have microsecond unit but you don’t necessarily get microsecond
precision from gettimeofday()
say you have pseudo code like this: the
small_workload()
is only several instructions; you measure each execution and
accumulate the runtime. Even ignoring the overhead of taking timestamps, the
result is still very off.
start_all = gettimeoftheday();
accumulate = 0;
// loop for a looooooot of times
while(condition) {
start_split = gettimeoftheday();
small_workload();
end_split = gettimeoftheday();
accumulate += ( end_split - start_split);
}
end_all = gettimeoftheday();
time_all = end_all - start_all;
print(accumulate, time_all); // they should be roughly the same, right?
That’s because the execution time of small_workload
is comparable (even
smaller than) to the gettimeofday()
resolution. The result is that each of
the “split” measurements are rounded up or down by one resolution unit. See
man clock_getres(3p)
. You could configure linux to use tsc as clock
source2, but syscall (even with vDSO) is heavier that it could have been;
It’s obvious that you should minimize the probing overhead.
# Other architectures (TODO)
# Copyright notice
This page borrows contents from other resources123. This page is shared under CC BY-SA-NC, any re-distrubution MUST NOT strip this note or the attributions and the attribution (reference) list.
The Torpey article2 is CC BY-NC-ND, which means No-Derivatives, therefore I will only vaguely refer to their points without directly reusing their words.
-
How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures, white paper, Gabriele Paoloni. https://cis.temple.edu/~qzeng/cis3207-spring18/files/ia-32-ia-64-benchmark-code-execution-paper.pdf ↩︎ ↩︎
-
Measuring Latency in Linux, Bill Torpey. http://btorpey.github.io/blog/2014/02/18/clock-sources-in-linux/ ↩︎ ↩︎ ↩︎
-
http://oliveryang.net/2015/09/pitfalls-of-TSC-usage/, Oliver Yang, NOTE: The content reuse needs to include the original link. ↩︎