how fast are things in Linux (WIP)
let’s benchmark the speed / latency of various things under linux.
Disclaimer: these are only some fun experiments to give a rough idea of the magnitude of things. I don’t claim to provide any accurate or representative numbers.
# testing environment
processor: Intel(R) Core(TM) i5-6440HQ CPU @ 2.60GHz
kernel: Linux 6.6.63-1-lts #1 SMP PREEMPT_DYNAMIC x86_64 GNU/Linux
system: archlinux
glibc: 2.40
mitigations:
I have a rather old CPU from 2016, which has many HW vulnerabilities like Spectre
and Meltdown. The mitigations slow down the system (especially syscalls) quite
significantly (see the appendix for the full list). Keep this in mind when reading
the numbers.
clock source:
For measurements I’m using the CPU’s invariant TSC.
My CPU’s TSC ticks at 2592 MHz; this is similar to counting CPU cycles, but not
necessarily the same count.
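The snippets below use an RDTSC() helper that isn’t shown in the post; a minimal sketch, assuming GCC or Clang on x86-64, could look like this (the file name rdtsc.h is my own choice):
// rdtsc.h -- minimal sketch of the RDTSC() helper used in the snippets below
// (assumes GCC/Clang on x86-64; __rdtsc() is provided by <x86intrin.h>)
#include <x86intrin.h>
#define RDTSC() ((long long)__rdtsc())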
# what’s before main()
This is actually pretty complicated: when running a program, the parent process
calls exec, and the newly forked process doesn’t start directly from main();
the libc needs to prepare various things for it first. But for a quick measurement,
let’s see how long an exec() takes.
In the example below we do a fork() in the main process, then in the forked
process we do an execve(). The TSC difference is roughly the time spent
calling fork(), execve(), and running up to the main() function of the spawned
program. Over dozens of runs, this number ranges from 1574520 to 5118116, which
translates into 607 to 1974 microseconds (µs). Note that in this case you
must have the constant_tsc CPU feature, because the parent and child process
won’t necessarily run on the same core.
// main.c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include "rdtsc.h"          // hypothetical RDTSC() helper, see the clock-source section

int main() {
    long long tsc_0;
    pid_t pid;

    tsc_0 = RDTSC();                    // timestamp right before fork()
    pid = fork();
    if (pid == -1) {
        printf("fork error\n");
        exit(-1);
    } else if (pid > 0) {
        // parent: report the timestamp and wait for the child
        printf("host counter before fork: %lld\n", tsc_0);
        int status;
        waitpid(pid, &status, 0);
    } else {
        // child: replace the process image with the test binary
        // (argv and envp are left NULL here for brevity)
        execve("test", NULL, NULL);
    }
    return 0;
}
// test.c
#include <stdio.h>
#include "rdtsc.h"          // hypothetical RDTSC() helper, see the clock-source section

int main() {
    long long tsc_1;

    tsc_1 = RDTSC();                    // timestamp at the start of the spawned program's main()
    printf("child_process in main: %lld\n", tsc_1);
    return 0;
}
The difference between tsc_0 and tsc_1 is roughly the exec() overhead.
# print hello world
The speed of printing depends on many things, but what about the trivial printf("hello world\n")?
Results over 100 runs:
loop | total time | total ticks | avg. ticks per iter | avg. time per iter |
---|---|---|---|---|
printf() | 0.000265 s | 687078 | 6870 | 2690 ns |
btw, compiler optimization doesn’t change anything here.
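For reference, the measurement loop is roughly the following. This is only a sketch, not the exact harness: the loop count and the tick-to-nanosecond conversion via the 2592 MHz TSC are assumptions, and rdtsc.h is the hypothetical helper from the clock-source section.
// bench_printf.c -- sketch of the printf() timing loop
#include <stdio.h>
#include "rdtsc.h"          // hypothetical RDTSC() helper

#define N 100

int main() {
    long long start = RDTSC();
    for (int i = 0; i < N; i++)
        printf("hello world\n");
    long long ticks = RDTSC() - start;

    // TSC @ 2592 MHz -> 2592 ticks per microsecond
    fprintf(stderr, "total: %lld ticks, avg: %lld ticks (~%lld ns) per iter\n",
            ticks, ticks / N, (ticks / N) * 1000 / 2592);
    return 0;
}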
# syscall
For a rough idea: a normal 4-core system might see somewhere between 1k and 100k syscalls per second.
$ sudo perf stat -e raw_syscalls:sys_enter -a -I 1000 sleep 10
on my (pretty busy) 4C/8G production server:
# time counts unit events
1.000774756 26,478 raw_syscalls:sys_enter
2.002340928 29,731 raw_syscalls:sys_enter
3.006551494 28,774 raw_syscalls:sys_enter
4.007948189 73,816 raw_syscalls:sys_enter
5.009343398 31,859 raw_syscalls:sys_enter
6.010853823 52,696 raw_syscalls:sys_enter
7.012298835 55,095 raw_syscalls:sys_enter
8.013842981 21,614 raw_syscalls:sys_enter
9.015450515 26,814 raw_syscalls:sys_enter
10.006519753 52,751 raw_syscalls:sys_enter
on my laptop (I’m randomly clicking in the browser):
# time counts unit events
1.001079680 18,676 raw_syscalls:sys_enter
2.007895894 28,518 raw_syscalls:sys_enter
3.014606980 19,019 raw_syscalls:sys_enter
4.021105434 37,456 raw_syscalls:sys_enter
5.026087966 53,223 raw_syscalls:sys_enter
6.054707533 43,798 raw_syscalls:sys_enter
7.056551079 20,397 raw_syscalls:sys_enter
8.058208397 19,185 raw_syscalls:sys_enter
9.060285037 47,732 raw_syscalls:sys_enter
10.005679016 21,985 raw_syscalls:sys_enter
syscall latency: the runtime of a getpid() loop, with the same loop minus the syscall as a reference:
Results (each loop runs 1,000,000 iterations):
loop | total time | total ticks | avg. ticks per iter | avg. time per iter |
---|---|---|---|---|
getpid() | 0.909797 s | 2358194840 | 2358 | 910 ns |
empty loop | 0.002529 s | 6556390 | 6 | 2.31 ns |
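The two loops look roughly like this (a sketch under the same assumptions as before; rdtsc.h is the hypothetical helper from the clock-source section):
// bench_getpid.c -- sketch of the getpid() loop and the empty reference loop
#include <stdio.h>
#include <unistd.h>
#include "rdtsc.h"          // hypothetical RDTSC() helper

#define N 1000000

int main() {
    volatile long long sink = 0;    // keeps the loops from being optimized away

    long long t0 = RDTSC();
    for (int i = 0; i < N; i++)
        sink += getpid();           // one syscall per iteration
    long long getpid_ticks = RDTSC() - t0;

    t0 = RDTSC();
    for (int i = 0; i < N; i++)
        sink += i;                  // same loop shape, no syscall
    long long empty_ticks = RDTSC() - t0;

    printf("getpid(): %lld ticks total, %lld ticks per iter\n", getpid_ticks, getpid_ticks / N);
    printf("empty   : %lld ticks total, %lld ticks per iter\n", empty_ticks, empty_ticks / N);
    return 0;
}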
# signal delivery
TODO latency measurement between sending a signal (e.g. with kill()) and the signal handler code running in another process.
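One possible way to measure this, as a rough sketch only: both timestamps share the same time base thanks to constant_tsc, the sleep-based synchronization is crude, and rdtsc.h is the hypothetical helper from the clock-source section.
// signal_latency.c -- sketch: tsc at kill() in the parent vs. tsc in the child's handler
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include "rdtsc.h"          // hypothetical RDTSC() helper

static void handler(int sig) {
    (void)sig;
    // printf() is not async-signal-safe; good enough for a rough experiment
    printf("handler entered at tsc: %lld\n", RDTSC());
    _exit(0);
}

int main() {
    pid_t pid = fork();
    if (pid == 0) {                     // child: install handler, then wait for the signal
        signal(SIGUSR1, handler);
        for (;;)
            pause();
    }
    sleep(1);                           // crude: give the child time to install its handler
    printf("parent sends SIGUSR1 at tsc: %lld\n", RDTSC());
    kill(pid, SIGUSR1);
    waitpid(pid, NULL, 0);              // the difference of the two tsc values ~ delivery latency
    return 0;
}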
# pthread context switch
TODO latency measurement of pthreads’ cooperative context switching with sched_yield()
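One way this could be measured, as a sketch: pin two threads to the same CPU and let them ping-pong with sched_yield(). The thread count, N, and the pinning to CPU 0 are my assumptions; rdtsc.h is the hypothetical helper from the clock-source section. Build with -pthread.
// pthread_yield_bench.c -- sketch: two threads on one CPU ping-pong via sched_yield()
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include "rdtsc.h"          // hypothetical RDTSC() helper

#define N 1000000

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < N; i++)
        sched_yield();                  // hand the CPU to the other thread
    return NULL;
}

int main() {
    // pin everything to CPU 0 so sched_yield() actually switches between the two threads
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    sched_setaffinity(0, sizeof(set), &set);

    pthread_t a, b;
    long long t0 = RDTSC();
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    long long ticks = RDTSC() - t0;

    // roughly one switch per sched_yield() call across both threads
    printf("%lld ticks total, ~%lld ticks per yield\n", ticks, ticks / (2LL * N));
    return 0;
}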
# linux process scheduling
TODO this is a bit complicated because the Linux scheduler is a black box: we have no idea which task the scheduler decides to switch to next.
# coroutine / fiber context switch
glibc’s swapcontext() invokes the rt_sigprocmask syscall, so each switch costs at least about a microsecond.
There is also implicit latency: constantly swapping processor contexts hurts locality (i.e. cache utilization).
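To see the per-switch cost directly, here is a minimal swapcontext() ping-pong sketch (N and the stack size are arbitrary choices; rdtsc.h is the hypothetical helper from the clock-source section). Running it under strace should show the rt_sigprocmask calls.
// swapcontext_bench.c -- sketch: measure glibc swapcontext() round trips
#include <stdio.h>
#include <ucontext.h>
#include "rdtsc.h"          // hypothetical RDTSC() helper

#define N 100000

static ucontext_t main_ctx, co_ctx;
static char co_stack[64 * 1024];

static void coroutine(void) {
    for (;;)
        swapcontext(&co_ctx, &main_ctx);   // switch back to main
}

int main() {
    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp = co_stack;
    co_ctx.uc_stack.ss_size = sizeof(co_stack);
    co_ctx.uc_link = &main_ctx;
    makecontext(&co_ctx, coroutine, 0);

    long long t0 = RDTSC();
    for (int i = 0; i < N; i++)
        swapcontext(&main_ctx, &co_ctx);   // main -> coroutine -> main
    long long ticks = RDTSC() - t0;

    // each iteration performs two swapcontext() calls
    printf("%lld ticks total, ~%lld ticks per swapcontext\n", ticks, ticks / (2LL * N));
    return 0;
}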
# setjmp / longjmp
# boost::context
# further readings:
- Trace Linux System Calls with Least Impact on Performance in Production, Wenbo Zhang
- presentation on network scalability by Felix von Leitner. Idea: re-create the same benchmarks on “more modern” machines (Felix used a 900 MHz Pentium 3 with 256 MB RAM, Linux kernel 2.4/2.6)
- The C10K problem, Dan Kegel
- the thundering herd problem, Steve Molloy
- Jeff Darcy’s notes on high-performance server design (wayback machine)
# Appendix: testing environment
$ lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i5-6440HQ CPU @ 2.60GHz
CPU family: 6
Model: 94
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: 3
CPU(s) scaling MHz: 26%
CPU max MHz: 3500.0000
CPU min MHz: 800.0000
BogoMIPS: 5202.65
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2
ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc
art arch_perfmon pebs bts rep_good nopl xtopology
nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64
monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16
xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm
abm 3dnowprefetch cpuid_fault epb pti ssbd ibrs ibpb
stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase
tsc_adjust sgx bmi1 avx2 smep bmi2 erms invpcid mpx
rdseed adx smap clflushopt intel_pt xsaveopt xsavec
xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify
hwp_act_window hwp_epp vnmi md_clear flush_l1d
arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 1 MiB (4 instances)
L3: 6 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Vulnerabilities:
Gather data sampling: Vulnerable: No microcode
Itlb multihit: KVM: Mitigation: VMX disabled
L1tf: Mitigation; PTE Inversion; VMX conditional cache
flushes, SMT disabled
Mds: Mitigation; Clear CPU buffers; SMT disabled
Meltdown: Mitigation; PTI
Mmio stale data: Mitigation; Clear CPU buffers; SMT disabled
Reg file data sampling: Not affected
Retbleed: Mitigation; IBRS
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user
pointer sanitization
Spectre v2: Mitigation; IBRS; IBPB conditional; STIBP disabled;
RSB filling; PBRSB-eIBRS Not affected; BHI Not
affected
Srbds: Mitigation; Microcode
Tsx async abort: Mitigation; TSX disabled