# TSC and high precision time measuring

this is a gist for high accuracy performance benchmarking, using CPUs invariant cycle counter.

Just show me the code
RDTSC
out-of-order execution and barriers
example: what’s wrong with chrono
Other architectures (TODO)
Copyright notice

# Just show me the code

I made a handy single header C/C++ wrapper for rdtsc. The code is shared under EUPL-1.2 license, Available at https://git.sr.ht/~shrik3/tsc_clock.git

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32


uint64_t rdtsc_naked()
{
	uint32_t low, high;
	asm volatile("RDTSC\n\t" : "=a"(low), "=d"(high));
	return ((uint64_t)low) | (((uint64_t)high) << 32);
}

uint64_t rdtscp()
{
	uint32_t low, high;
	asm volatile("RDTSCP\n\t"
	             : "=a"(low), "=d"(high)
	             :
	             : "%rcx" /*rdtscp writes ecx, so clobber it*/
	);
	return ((uint64_t)low) | (((uint64_t)high) << 32);
}

uint64_t rdtscp_barrier_after()
{
	uint32_t low, high;
	asm volatile("RDTSCP\n\t"
	             "mov %%edx, %0\n\t"
	             "mov %%eax, %1\n\t"
	             "xor %%eax, %%eax\n\t"
	             "CPUID\n\t"
	             : "=r"(high), "=r"(low)
	             :
	             : "%rax", "%rbx", "%rcx", "%rdx");

	return ((uint64_t)low) | (((uint64_t)high) << 32);
}

NOTE: add constrains and compiler attributes for better optimization, e.g.

__attribute__((always_inline)) static inline rdtsc() {/*SNIPS*/}

NOTE: CPUID instruction takes eax to chose from different leaf functions. I’m not sure if some are more costly than others, so I simply cleared eax to chose the basic leaf in the barrier_after() variant.

NOTE: this example from linux kernel uses lfence or mfence as barrier, and guards only before rdtsc, not after. Which suggests that the rdtscp instruction along may suffice.

Read below for explanations (hopefully).

# RDTSC

Real Time Clock (RTC)

battery powered hardware clock on the mainboard.

Invariant Time Stamp Counter (TSC)

the invariant TSC runs at constant speed regardless of C state or CPU frequencies

TSC Frequency

the invariant CPU frequency can be calculated from several CPUID values. The libcpuid can already do it for you. e.g.

$ cpuid_tool --clock-rdtsc
> 2592 # unit in Mega Herz (one Million Hz)

HW support

check CPUID for features, e.g. tsc, rdtscp, constant_tsc, nonstop_tsc

Instruction rdtsc & rdtscp

both instructions reads the 64 bit tsc value into edx:eax. Instruction rdtscp additionally reads the TSC aux value into ecx, we ignore this value in the examples but don’t forget to clobber rcx.

Serialization

rdtsc has no serialization what-so-ever, so out-of-order execution may break your measurement. rdtscp has serialization before, but not after: rdtscp must wait until all instructions before it complete. But instructions after it may still be executed out-of-order, i.e. executed before the TSC value is read out. For explicit serialization, we could use CPUID instruction before or after rdtsc(p).

Wrap-around issue?

The time-stamp counter is a model-specific 64-bit counter that is reset to zero each time the processor is reset. If not reset, the counter will increment ~9.5 x 1016 times per year when the processor is operating at a clock rate of 3GHz. At this clock frequency, it would take over 190 years for the counter to wrap around.

# out-of-order execution and barriers

The existential question to ask is what exactly are you measuring when instructions are executed out-of-order? (note, RDTSC denotes the moment the TSC value is read out (either into reg, or written to memory)

I. Ideally we want this	II. But it's fine to measure this	III. what actually happens is this bullshit
`instr 1 instr 2 instr 3 instr 4 RDTSC ---+ foo()#1 \| foo()#2 \| foo()#3 \| foo()#4 \| RDTSC ---+ instr 5 instr 6 instr 7 instr 8`	`instr 1 instr 3 instr 4 instr 2 RDTSC ---+ foo()#4 \| foo()#1 \| foo()#3 \| foo()#2 \| RDTSC ---+ instr 6 instr 7 instr 8 instr 5`	`instr 1 instr 4 foo()#4 instr 2 RDTSC ---+ foo()#1 \| instr 6 \| instr 7 \| ???? foo()#2 \| instr 8 \| instr 3 \| RDTSC ---+ instr 5 foo()#3`
I. Again, you to measure this	IV. with `rdtscp` + barrier	V. with only `rdtscp`
`instr 1 instr 2 instr 3 instr 4 RDTSC ---+ foo()#1 \| foo()#2 \| foo()#3 \| foo()#4 \| RDTSC ---+ instr 5 instr 6 instr 7 instr 8`	`instr 1 instr 2 instr 3 instr 4 RDTSCP---+-----BARRIER--+ CPUID----\|-----BARRIER \| foo()#4 \| \| foo()#2 \| \| measured foo()#1 \| \| foo()#3 \| \| RDTSCP---+-----BARRIER--+ CPUID----------BARRIER instr 5 instr 6 instr 7 instr 8`	`instr 4 instr 1 instr 2 instr 3 foo()#1 <---- OUT OF ORDER RDTSCP---+ foo()#2 \| foo()#3 \| measured foo()#4 \| instr 5 <+--- OUT OF ORDER RDTSCP---+ instr 7 instr 8 instr 6`

you want to measure execution of inlined foo(). But what actually happens is III. where foo() instructions may be reordered outside of your measurement intervals, and other instructions may be included. The measured interval between RDTSC readings does not represent the execution time of foo()

A stronger configuration is V. where we replace rdtsc with rdtscp, which has an internal barrier that waits until all preceding instructions are complete. However it doesn’t prevent the sequential instructions to be re-ordered before reading tsc value:

The RDTSCP instruction waits until all previous instructions have been executed before reading the counter. However, subsequent instructions may begin execution before the read operation is performed¹.

With configuration IV, enforcing serialization with CPUID barrier on both sides ensures that we are measuring the runtime of foo with the tradeoff of total runtime.

question: does function call make a difference? If foo() is not inline, foo()#1~4 and other part of the program are delimited by jmp/call and ret. Does function calls (or jumps) act as serialization?

# example: what’s wrong with chrono

timeval have microsecond unit but you don’t necessarily get microsecond precision from gettimeofday() say you have pseudo code like this: the small_workload() is only several instructions; you measure each execution and accumulate the runtime. Even ignoring the overhead of taking timestamps, the result is still very off.

start_all = gettimeoftheday();
accumulate = 0;

// loop for a looooooot of times
while(condition) {
    start_split = gettimeoftheday();
    small_workload();
    end_split = gettimeoftheday();
    accumulate += ( end_split - start_split);
}

end_all = gettimeoftheday();
time_all = end_all - start_all;
print(accumulate, time_all); // they should be roughly the same, right?

That’s because the execution time of small_workload is comparable (even smaller than) to the gettimeofday() resolution. The result is that each of the “split” measurements are rounded up or down by one resolution unit. See man clock_getres(3p). You could configure linux to use tsc as clock source², but syscall (even with vDSO) is heavier that it could have been; It’s obvious that you should minimize the probing overhead.

# Other architectures (TODO)

# Copyright notice

This page borrows contents from other resources¹²³. This page is shared under CC BY-SA-NC, any re-distrubution MUST NOT strip this note or the attributions and the attribution (reference) list.

The Torpey article² is CC BY-NC-ND, which means No-Derivatives, therefore I will only vaguely refer to their points without directly reusing their words.

How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures, white paper, Gabriele Paoloni. https://cis.temple.edu/~qzeng/cis3207-spring18/files/ia-32-ia-64-benchmark-code-execution-paper.pdf ↩︎ ↩︎
Measuring Latency in Linux, Bill Torpey. http://btorpey.github.io/blog/2014/02/18/clock-sources-in-linux/ ↩︎ ↩︎ ↩︎
http://oliveryang.net/2015/09/pitfalls-of-TSC-usage/, Oliver Yang, NOTE: The content reuse needs to include the original link. ↩︎