how fast are things in Linux (WIP)
let’s benchmark the speed / latency of various things under linux.
Disclaimer: these are only some fun experiments to give a rough idea of the magnitude of things. I don’t claim to provide any accurate or representative numbers.
# testing environment
processor: Intel(R) Core(TM) i5-6440HQ CPU @ 2.60GHz
kernel: Linux 6.6.63-1-lts #1 SMP PREEMPT_DYNAMIC x86_64 GNU/Linux
system: archlinux
glibc: 2.40
mitigations:
I have a rather old CPU from 2016, which has many HW vulnerabilities like Spectre
and Meltdown. The mitigations slow down the system (especially syscalls) quite
significantly (see the appendix for the full list). Keep this in mind when reading
the numbers.
clock source:
For measurements I’m using the CPU’s invariant TSC.
My CPU’s TSC ticks at 2592 MHz; this is similar to counting CPU cycles, but not
necessarily the same count.
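The snippets below use an RDTSC() helper that isn’t shown in the post; a minimal sketch, assuming GCC or Clang on x86-64, could look like this (the file name rdtsc.h is my own choice):
// rdtsc.h -- minimal sketch of the RDTSC() helper used in the snippets below
// (assumes GCC/Clang on x86-64; __rdtsc() is provided by <x86intrin.h>)
#include <x86intrin.h>
#define RDTSC() ((long long)__rdtsc())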
# what’s before main()
This is actually pretty complicated: when running a program, the parent process
calls exec, and the newly forked process doesn’t start directly from main();
the libc needs to prepare various things for it first. But for a quick measurement,
let’s see how long an exec() takes.
In the example below we do a fork() in the main process, then in the forked
process we do an execve(). The TSC difference is roughly the time spent
calling fork(), execve(), and running up to the main() function of the spawned
program. Over dozens of runs, this number ranges from 1574520 to 5118116, which
translates into 607 to 1974 microseconds (µs). Note that in this case you
must have the constant_tsc CPU feature, because the parent and child process
won’t necessarily run on the same core.
// main.c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include "rdtsc.h"          // hypothetical RDTSC() helper, see the clock-source section

int main() {
    long long tsc_0;
    pid_t pid;

    tsc_0 = RDTSC();                    // timestamp right before fork()
    pid = fork();
    if (pid == -1) {
        printf("fork error\n");
        exit(-1);
    } else if (pid > 0) {
        // parent: report the timestamp and wait for the child
        printf("host counter before fork: %lld\n", tsc_0);
        int status;
        waitpid(pid, &status, 0);
    } else {
        // child: replace the process image with the test binary
        // (argv and envp are left NULL here for brevity)
        execve("test", NULL, NULL);
    }
    return 0;
}
// test.c
#include <stdio.h>
#include "rdtsc.h"          // hypothetical RDTSC() helper, see the clock-source section

int main() {
    long long tsc_1;

    tsc_1 = RDTSC();                    // timestamp at the start of the spawned program's main()
    printf("child_process in main: %lld\n", tsc_1);
    return 0;
}
The difference between tsc_0 and tsc_1 is roughly the exec() overhead.
# print hello world
The speed of printing depends on many things, but what about the trivial printf("hello world\n")?
Results over 100 runs:
loop | total time | total ticks | avg. ticks per iter | avg. time per iter |
---|---|---|---|---|
printf() | 0.000265 s | 687078 | 6870 | 2690 ns |
btw, compiler optimization doesn’t change anything here.
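For reference, the measurement loop is roughly the following. This is only a sketch, not the exact harness: the loop count and the tick-to-nanosecond conversion via the 2592 MHz TSC are assumptions, and rdtsc.h is the hypothetical helper from the clock-source section.
// bench_printf.c -- sketch of the printf() timing loop
#include <stdio.h>
#include "rdtsc.h"          // hypothetical RDTSC() helper

#define N 100

int main() {
    long long start = RDTSC();
    for (int i = 0; i < N; i++)
        printf("hello world\n");
    long long ticks = RDTSC() - start;

    // TSC @ 2592 MHz -> 2592 ticks per microsecond
    fprintf(stderr, "total: %lld ticks, avg: %lld ticks (~%lld ns) per iter\n",
            ticks, ticks / N, (ticks / N) * 1000 / 2592);
    return 0;
}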
# syscall
For a rough idea: a normal 4-core system might see somewhere between 1k and 100k syscalls per second.
$ sudo perf stat -e raw_syscalls:sys_enter -a -I 1000 sleep 10
on my (pretty busy) 4C/8G production server:
# time counts unit events
1.000774756 26,478 raw_syscalls:sys_enter
2.002340928 29,731 raw_syscalls:sys_enter
3.006551494 28,774 raw_syscalls:sys_enter
4.007948189 73,816 raw_syscalls:sys_enter
5.009343398 31,859 raw_syscalls:sys_enter
6.010853823 52,696 raw_syscalls:sys_enter
7.012298835 55,095 raw_syscalls:sys_enter
8.013842981 21,614 raw_syscalls:sys_enter
9.015450515 26,814 raw_syscalls:sys_enter
10.006519753 52,751 raw_syscalls:sys_enter
on my laptop (I’m randomly clicking in the browser):
# time counts unit events
1.001079680 18,676 raw_syscalls:sys_enter
2.007895894 28,518 raw_syscalls:sys_enter
3.014606980 19,019 raw_syscalls:sys_enter
4.021105434 37,456 raw_syscalls:sys_enter
5.026087966 53,223 raw_syscalls:sys_enter
6.054707533 43,798 raw_syscalls:sys_enter
7.056551079 20,397 raw_syscalls:sys_enter
8.058208397 19,185 raw_syscalls:sys_enter
9.060285037 47,732 raw_syscalls:sys_enter
10.005679016 21,985 raw_syscalls:sys_enter
syscall latency: the runtime of a getpid() loop, with the same loop minus the syscall as a reference:
Results (each loop runs 1,000,000 iterations):
loop | total time | total ticks | avg. ticks per iter | avg. time per iter |
---|---|---|---|---|
getpid() | 0.909797 s | 2358194840 | 2358 | 910 ns |
empty loop | 0.002529 s | 6556390 | 6 | 2.31 ns |
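The two loops look roughly like this (a sketch under the same assumptions as before; rdtsc.h is the hypothetical helper from the clock-source section):
// bench_getpid.c -- sketch of the getpid() loop and the empty reference loop
#include <stdio.h>
#include <unistd.h>
#include "rdtsc.h"          // hypothetical RDTSC() helper

#define N 1000000

int main() {
    volatile long long sink = 0;    // keeps the loops from being optimized away

    long long t0 = RDTSC();
    for (int i = 0; i < N; i++)
        sink += getpid();           // one syscall per iteration
    long long getpid_ticks = RDTSC() - t0;

    t0 = RDTSC();
    for (int i = 0; i < N; i++)
        sink += i;                  // same loop shape, no syscall
    long long empty_ticks = RDTSC() - t0;

    printf("getpid(): %lld ticks total, %lld ticks per iter\n", getpid_ticks, getpid_ticks / N);
    printf("empty   : %lld ticks total, %lld ticks per iter\n", empty_ticks, empty_ticks / N);
    return 0;
}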
# signal delivery
TODO latency measurement between sending a signal (e.g. with kill()) and the signal handler code running in another process.
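One possible way to measure this, as a rough sketch only: both timestamps share the same time base thanks to constant_tsc, the sleep-based synchronization is crude, and rdtsc.h is the hypothetical helper from the clock-source section.
// signal_latency.c -- sketch: tsc at kill() in the parent vs. tsc in the child's handler
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include "rdtsc.h"          // hypothetical RDTSC() helper

static void handler(int sig) {
    (void)sig;
    // printf() is not async-signal-safe; good enough for a rough experiment
    printf("handler entered at tsc: %lld\n", RDTSC());
    _exit(0);
}

int main() {
    pid_t pid = fork();
    if (pid == 0) {                     // child: install handler, then wait for the signal
        signal(SIGUSR1, handler);
        for (;;)
            pause();
    }
    sleep(1);                           // crude: give the child time to install its handler
    printf("parent sends SIGUSR1 at tsc: %lld\n", RDTSC());
    kill(pid, SIGUSR1);
    waitpid(pid, NULL, 0);              // the difference of the two tsc values ~ delivery latency
    return 0;
}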
# pthread context switch
TODO latency measurement of pthreads’ cooperative context switching with sched_yield()
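One way this could be measured, as a sketch: pin two threads to the same CPU and let them ping-pong with sched_yield(). The thread count, N, and the pinning to CPU 0 are my assumptions; rdtsc.h is the hypothetical helper from the clock-source section. Build with -pthread.
// pthread_yield_bench.c -- sketch: two threads on one CPU ping-pong via sched_yield()
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include "rdtsc.h"          // hypothetical RDTSC() helper

#define N 1000000

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < N; i++)
        sched_yield();                  // hand the CPU to the other thread
    return NULL;
}

int main() {
    // pin everything to CPU 0 so sched_yield() actually switches between the two threads
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    sched_setaffinity(0, sizeof(set), &set);

    pthread_t a, b;
    long long t0 = RDTSC();
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    long long ticks = RDTSC() - t0;

    // roughly one switch per sched_yield() call across both threads
    printf("%lld ticks total, ~%lld ticks per yield\n", ticks, ticks / (2LL * N));
    return 0;
}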
# linux process scheduling
TODO this is a bit complicated because the Linux scheduler is a black box: we have no idea which task the scheduler decides to switch to next.
# coroutine / fiber context switch
glibc’s swapcontext() invokes the rt_sigprocmask syscall, so each switch costs at least about a microsecond.
There is also implicit latency: constantly swapping processor contexts hurts locality (i.e. cache utilization).
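To see the per-switch cost directly, here is a minimal swapcontext() ping-pong sketch (N and the stack size are arbitrary choices; rdtsc.h is the hypothetical helper from the clock-source section). Running it under strace should show the rt_sigprocmask calls.
// swapcontext_bench.c -- sketch: measure glibc swapcontext() round trips
#include <stdio.h>
#include <ucontext.h>
#include "rdtsc.h"          // hypothetical RDTSC() helper

#define N 100000

static ucontext_t main_ctx, co_ctx;
static char co_stack[64 * 1024];

static void coroutine(void) {
    for (;;)
        swapcontext(&co_ctx, &main_ctx);   // switch back to main
}

int main() {
    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp = co_stack;
    co_ctx.uc_stack.ss_size = sizeof(co_stack);
    co_ctx.uc_link = &main_ctx;
    makecontext(&co_ctx, coroutine, 0);

    long long t0 = RDTSC();
    for (int i = 0; i < N; i++)
        swapcontext(&main_ctx, &co_ctx);   // main -> coroutine -> main
    long long ticks = RDTSC() - t0;

    // each iteration performs two swapcontext() calls
    printf("%lld ticks total, ~%lld ticks per swapcontext\n", ticks, ticks / (2LL * N));
    return 0;
}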
# setjmp / longjmp
# boost::context
# further readings:
- Trace Linux System Calls with Least Impact on Performance in Production, Wenbo Zhang
- presentation on network scalability by Felix von Leitner. Idea: re-create the same benchmarks on “more modern” machines (Felix used a 900 MHz Pentium 3 with 256 MB RAM, Linux kernel 2.4/2.6)
- The C10K problem, Dan Kegel
- the thundering herd problem, Steve Molloy
- Jeff Darcy’s notes on high-performance server design (wayback machine)
# Appendix: testing environment
$ lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i5-6440HQ CPU @ 2.60GHz
CPU family: 6
Model: 94
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: 3
CPU(s) scaling MHz: 26%
CPU max MHz: 3500.0000
CPU min MHz: 800.0000
BogoMIPS: 5202.65
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2
ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc
art arch_perfmon pebs bts rep_good nopl xtopology
nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64
monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16
xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm
abm 3dnowprefetch cpuid_fault epb pti ssbd ibrs ibpb
stibp tpr_shadow flexpriority ept vpid ept_ad fsgsbase
tsc_adjust sgx bmi1 avx2 smep bmi2 erms invpcid mpx
rdseed adx smap clflushopt intel_pt xsaveopt xsavec
xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify
hwp_act_window hwp_epp vnmi md_clear flush_l1d
arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 1 MiB (4 instances)
L3: 6 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Vulnerabilities:
Gather data sampling: Vulnerable: No microcode
Itlb multihit: KVM: Mitigation: VMX disabled
L1tf: Mitigation; PTE Inversion; VMX conditional cache
flushes, SMT disabled
Mds: Mitigation; Clear CPU buffers; SMT disabled
Meltdown: Mitigation; PTI
Mmio stale data: Mitigation; Clear CPU buffers; SMT disabled
Reg file data sampling: Not affected
Retbleed: Mitigation; IBRS
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user
pointer sanitization
Spectre v2: Mitigation; IBRS; IBPB conditional; STIBP disabled;
RSB filling; PBRSB-eIBRS Not affected; BHI Not
affected
Srbds: Mitigation; Microcode
Tsx async abort: Mitigation; TSX disabled