Linux research

namespaces

A namespace (NS) "wraps" some global system resource to provide isolation. Linux now supports multiple NS types, see namespaces(7):

namespace    description                                                flag
Mount NS     isolate mount point list                                   CLONE_NEWNS
UTS NS       isolate system identifiers (hostname / NIS domain name)    CLONE_NEWUTS
IPC NS       isolate System V IPC & POSIX MQ objects                    CLONE_NEWIPC
PID NS       isolate PID number space                                   CLONE_NEWPID
Network NS   isolate network resources (network devices, stack, ports)  CLONE_NEWNET
User NS      isolate user ID and group ID number spaces                 CLONE_NEWUSER
Cgroup NS    virtualize (isolate) certain cgroup pathnames              CLONE_NEWCGROUP
Time NS      isolate boot and monotonic clocks                          CLONE_NEWTIME

For each NS:

  • Multiple instances of a NS may exist on the system
  • At system boot, there is only one instance of each NS type (the initial namespace)
  • A process resides in one NS instance (of each NS type)
  • A process inside a NS instance sees only that instance of the resource

Example: the UTS namespace isolates two identifiers returned by uname(2):

  • nodename, the (host)name set by sethostname(2)
  • domainname, the NIS domain name set by setdomainname(2)

Each UTS NS instance has its own nodename and domainname.

Each process has symlink files in /proc/PID/ns, one for every namespace, for example /proc/PID/ns/time. The link content can be read with readlink(1) and has the form ns-type:[magic-inode-#].
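
The link target can be inspected from a shell; the inode number shown below is just an illustrative value:

$ readlink /proc/$$/ns/uts
uts:[4026531838]

Two processes are in the same NS instance if their links show the same inode number.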

Namespaces API

Syscalls for NS:

  • clone(2) - create a new (child) process in new NS(s)
  • unshare(2) - create new NS(s) and move the caller into it/them
  • setns(2) - move the calling process to another (existing) NS instance

There are shell commands as well (from util-linux):

  • unshare(1) - create new NS(s) and execute a command in them
  • nsenter(1) - enter existing NS(s) and execute a command

Creating a new user namespace requires no privileges, but creating any other namespace type requires CAP_SYS_ADMIN. Example:

$ sudo unshare -u bash
# hostname foobar
# hostname
foobar
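
nsenter(1) can then be used to run a command inside the namespaces of an existing process, e.g. the shell started above; the PID is illustrative, -t selects the target process and -u its UTS namespace:

$ sudo nsenter -t 4321 -u hostname
foobar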

User namespaces

User namespaces allow per-namespace mappings of UIDs and GIDs: a process's UIDs and GIDs inside the NS may differ from those outside the NS. A process might have UID 0 inside the NS and a nonzero UID outside. User NSs have a hierarchical relationship: the parent of a user NS is the user NS of the process that created it. The parental relationship determines some rules about how capabilities work. When a new user NS is created, the first process in the NS has all capabilities, but they grant the power of the superuser only inside that user NS.

After creating a user NS, a UID & GID mapping is defined by writing to two files, /proc/PID/uid_map and /proc/PID/gid_map. Records written to these files have the form ID-inside-ns ID-outside-ns length, where ID-inside-ns and length define the range of IDs inside the user NS that are to be mapped, and ID-outside-ns defines the start of the corresponding mapped range in the "outside" user NS.

Example:

$ id
uid=1000(jelle)
$ unshare -U -r bash
usrns$ cat /proc/$$/uid_map
0 1000 1
usrns$ cat /proc/$$/gid_map
0 1000 1
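
The mapping can also be written by hand. A minimal sketch, assuming the unprivileged user has UID/GID 1000 and the PID of the new shell is 4211 (both illustrative); note that gid_map can only be written after disabling setgroups:

$ unshare -U bash                          # shell 1: new user NS, no mappings yet
nobody$ echo $$
4211

$ echo '0 1000 1' > /proc/4211/uid_map     # shell 2: same user, parent user NS
$ echo deny > /proc/4211/setgroups
$ echo '0 1000 1' > /proc/4211/gid_map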

Source:

  • https://man7.org/conf/meetup/understanding-user-namespaces--Google-Munich-Kerrisk-2019-10-25.pdf
  • https://lwn.net/Articles/531114/

containers

  • https://www.redhat.com/sysadmin/podman-inside-container
  • https://developers.redhat.com/blog/2019/01/15/podman-managing-containers-pods

capabilities

cgroups

Sources:

  • https://lwn.net/Articles/604609/
  • https://lwn.net/Articles/679786/

eBPF

BPF (Berkeley Packet Filter), developed in 1992, improved the performance of packet capture tools. In 2013 a major rewrite of BPF was proposed, and it was included in the Linux kernel in 2014. This turned BPF into a general-purpose execution engine that can be used for a variety of things. BPF allows the kernel to run mini programs on system and application events, such as disk I/O. BPF can be considered a virtual machine due to its virtual instruction set, which is executed by the Linux kernel BPF runtime, including an interpreter and a JIT compiler for turning BPF instructions into native instructions. BPF programs must pass a verifier that checks for safety, ensuring they cannot crash the kernel. BPF has three main uses in Linux: networking, observability & security.

Tracing is event-based recording, as used by tools such as strace(1) and tcpdump(8).

Sampling takes a subset of measurements to paint a coarse picture of the target, also known as profiling or creating a profile. For example, sampling every 10 milliseconds has less overhead, but can miss events.

Observability is understanding a system through observation. Tools for this include tracing, sampling and tools based on fixed counters. It does not include benchmark tools, which modify the state of the system. BPF tools are observability tools.

BCC (BPF Compiler Collection) is the first higher-level tracing framework developed for BPF.

bpftrace is a newer front end that provides a special-purpose, high-level language for developing BPF tools. bpftrace is well suited to one-liners, BCC to complex scripts.

Workload characterization defines what workload is being applied.

Dynamic instrumentation (kprobes & uprobes)

A BPF tracing source that can insert instrumentation points into live software. It has zero overhead when not in use, as the software runs unmodified. It is often used to instrument the start and end of kernel / application functions. The downside of dynamic tracing is that functions can be renamed or removed between versions (an interface stability issue).

Example:

probe                          description
kprobe:vfs_read                instrument beginning of kernel vfs_read()
kretprobe:vfs_read             instrument returns of kernel vfs_read()
uprobe:/bin/bash:readline      instrument beginning of readline in /bin/bash
uretprobe:/bin/bash:readline   instrument returns of readline in /bin/bash
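
For example, a bpftrace one-liner (run as root) that counts vfs_read() calls per process name using a kprobe:

bpftrace -e 'kprobe:vfs_read { @[comm] = count(); }'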

Static instrumentation (tracepoints and USDT)

Static instrumentation points are added by developers: tracepoints in the kernel, and user-level statically defined tracing (USDT) probes in userspace programs.

Example:

tracepoint                                 description
tracepoint:syscalls:sys_enter_open         instrument open(2) syscall
usdt:/usr/bin/mysqld:mysqld:query_stat     query_stat probe

Listing all tracepoints matching sys_enter_open:

bpftrace -l 'tracepoint:syscalls:sys_enter_open*'
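
Building on that, a one-liner tracing openat(2) calls and printing the process and file name (a sketch; requires root):

bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'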

Or snoop on exec with execsnoop:

sudo /usr/share/bcc/tools/execsnoop

BPF Technology background

BPF was originally developed to offload packet filtering to kernel space for tcpdump. This provided performance and safety benefits. The classic BPF was very limited: it supported only 2 registers versus 10 in eBPF, and 32-bit register width versus 64-bit; eBPF additionally offers more storage options (512 bytes of stack space plus practically unlimited "map" storage) and supports more event targets. BPF is useful for performance tools as it is built into Linux, efficient and safe. BPF is more flexible than kernel modules: BPF programs are checked via a verifier before running, and they support rich data structures via maps. It is also easier to learn as it doesn't require kernel build artifacts. BPF programs can be compiled once and run everywhere.

BPF programs can be written with LLVM, BCC and bpftrace. BPF instructions can be viewed via bpftool, which can also manipulate BPF objects including programs and maps.

BPF API

A BPF program cannot call arbitrary kernel functions or read arbitrary memory; to accomplish such tasks, "helper" functions such as bpf_probe_read() are provided. Memory access for a BPF program is restricted to its registers and the stack; bpf_probe_read() can read arbitrary memory, but it does some safety checks up front, and it can also read userspace memory.
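
As an illustration, bpftrace's str() builtin is backed by the probe-read helpers; this sketch dereferences the filename pointer passed to do_sys_open() (the exact kernel function name varies between kernel versions):

bpftrace -e 'kprobe:do_sys_open { printf("%s\n", str(arg1)); }'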

BPF Program Types

The program type specifies the type of events that the BPF program attaches to, in the case of observability tools. The verifier uses the program type to restrict which kernel functions can be called and which data structures can be accessed.

BPF lacked concurrency primitives until Linux 5.1 (spin locks), but tracing programs can't use them yet, so per-CPU hashes/maps are used to keep track of event data without running into map overwrites or corruption.

The BPF Type Format (BTF) is a metadata format that encodes debug information describing BPF programs, maps, etc. BTF is becoming a general-purpose format for describing kernel data formats. Without it, tracing tools require kernel headers to be installed to read / understand C structs; otherwise the structs have to be defined in the BPF program itself.
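
On kernels built with CONFIG_DEBUG_INFO_BTF, the kernel's own BTF can be dumped as C definitions instead of relying on installed headers, for example:

bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h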

BPF CO-RE (Compile Once, Run Everywhere)

Allows BPF programs to be compiled to BPF bytecode once and then packaged and run on other systems.

BPF sysfs interface

Since Linux 4.4, BPF programs and maps can be exposed via a virtual file system (conventionally mounted at /sys/fs/bpf). This allows the creation of persistent BPF programs that continue running after the program that loaded them has exited. This is also called "pinning".
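
A sketch of pinning with bpftool; the program id 42 and the path are illustrative, and the BPF filesystem may need to be mounted first:

# mount -t bpf bpf /sys/fs/bpf
# bpftool prog show
# bpftool prog pin id 42 /sys/fs/bpf/myprog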

BPF limitations

  • Cannot call arbitrary kernel functions
  • No infinite loops allowed
  • Stack size limited to MAX_BPF_STACK (512 bytes)

Stack trace walking

Stack traces are used to understand the code paths that led to an event. BPF can record stack traces using frame-pointer-based or ORC-based stack walks.

Frame pointer based

The head of the linked list of stack frames can always be found in a register (RBP on x86_64), and the return address is stored at a known offset (+8) from the saved RBP. The debugger simply walks the linked list starting from RBP. GCC nowadays defaults to omitting the frame pointer and uses RBP as a general-purpose register, which breaks this method.
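
To keep frame-pointer based walks working, code can be rebuilt with the frame pointer preserved (file names are illustrative):

$ gcc -O2 -fno-omit-frame-pointer -o app app.c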

Debuginfo

Usually available via debug packages which contain debug files in DWARF format. Debug files are big and BPF does not support them.

LBR (Last Branch Record)

An Intel processor feature to record branches in a hardware buffer, including function call branches. This has no overhead, but the depth is limited to 4-32 branches depending on the processor, which may not be enough.

ORC (Oops Rewind Capability)

New debug format for stack frames, uses ELF sections (.orc_unwind, .orc_unwind_ip) and has been implemented in the Linux kernel.

Flamegraphs

Flame graphs visualize stack traces (a stack backtrace or call trace). For example:

func_c
func_b
func_a

Where func_a calls func_b, which calls func_c. All the different call trees are recorded, along with how often each code path is taken, for example:

func_e            
func_d            func_c
func_b   func_b   func_b
func_a   func_a   func_a
1        2        7


                       +---------+
                       # func_e  #
                       +---------+
  +------------------+ +---------+
  # func_c           # # func_d  #
  +------------------+ +---------+
+--------------------------------+
# func_b                         #
+--------------------------------+
+--------------------------------+
# func_a                         #
+--------------------------------+

func_c uses 70% cpu time, func_e 10%.
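
A sketch of producing such a flame graph with BCC's profile tool and Brendan Gregg's flamegraph.pl (paths are illustrative; -a annotates frames, -f emits folded output suitable for flamegraph.pl):

$ sudo /usr/share/bcc/tools/profile -af 30 > out.stacks
$ ./flamegraph.pl < out.stacks > out.svg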

Event sources

kprobes

Provide dynamic kernel instrumentation: they can instrument almost any kernel function. When kretprobes are also used, function duration can be recorded as well. kprobes work by saving the target instruction and replacing it with a breakpoint instruction (int3 on x86_64). When instruction flow hits this breakpoint, the breakpoint handler calls the kprobe handler, after which the original instruction is executed. When a kprobe is no longer needed, the breakpoint is replaced by the original instruction. If ftrace already instruments the function, ftrace simply calls the kprobe handler, and when no longer used the ftrace kprobe handler is removed. For kretprobes, a kprobe is added to the function entry; when the function is called, the return address is saved and replaced with a "trampoline" function, kretprobe_trampoline. When the function returns, the CPU passes control to the trampoline, which calls the kretprobe handler. When no longer needed, the kprobe is removed.

This modifies kernel instruction text live, which means some functions are not allowed to be instrumented due to possible recursion. This does not work on ARM64 as kernel text is read only.

BPF can use kprobes via:

  • BCC - attach_kprobe & attach_kretprobe
  • bpftrace - kprobe & kretprobe
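
For example, pairing a kprobe and a kretprobe in bpftrace to get a histogram of vfs_read() latency in nanoseconds (run as root):

bpftrace -e 'kprobe:vfs_read { @start[tid] = nsecs; }
  kretprobe:vfs_read /@start[tid]/ { @ns = hist(nsecs - @start[tid]); delete(@start[tid]); }'
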
uprobes

User-level dynamic instrumentation. The same as kprobes, but file based: when a function in an executable is traced, all processes using that file, now and in the future, are traced.

BPF can use uprobes via:

  • BCC - attach_uprobe & attach_uretprobe
  • bpftrace - uprobe & uretprobe
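
For example, tracing interactive bash input via a uretprobe on readline() (run as root):

bpftrace -e 'uretprobe:/bin/bash:readline { printf("read: %s\n", str(retval)); }'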

tracepoints

Static kernel instrumentation, added by kernel developers and named subsystem:eventname. Tracepoints work by adding, at compile time, a nop instruction (5 bytes on x86_64) that can later be replaced with a jmp. A tracepoint handler trampoline is added to the end of the function, which iterates over an array of registered tracepoint callbacks.

On enabling a tracepoint, the nop is replaced with a jmp to the tracepoint trampoline, an entry is added to the tracepoint's callback array, and RCU (read-copy-update) is synchronized. Disabling drops the array entry and, if it was the last one, replaces the jmp with the nop again.

  • BCC: TRACEPOINT_PROBE
  • bpftrace: tracepoint probe type

BPF raw tracepoints (BPF_RAW_TRACEPOINT) expose tracepoints without the cost of creating the stable arguments, so consumers have to handle the raw arguments. This is a lot faster and gives consumers access to all arguments. The downside is that the raw arguments might change between kernel versions.

USDT (User-level Statically Defined Tracing)

Can be added to software via the systemtap-sdt-dev package or Facebook's folly, which define macros for instrumentation points.
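
If a binary ships USDT probes, they can be listed with bpftrace; the path below is illustrative and only works if probes were compiled in:

bpftrace -l 'usdt:/usr/bin/mysqld:*'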

PMC

Performance monitoring counters, programmable hardware counters on the processor. PMC modes:

  • counting - keep track of the rate of events (the kernel reads the counters).
  • overflow sampling - the PMC sends an interrupt to the kernel for the events it is monitoring.
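
A sketch of counting mode using perf(1), reading a few common PMCs over 5 seconds:

$ perf stat -e cycles,instructions,cache-misses -- sleep 5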

Performance analysis

  • latency - how long to accomplish a request or operation (in ms)
  • rate - an operation or request rate per second
  • throughput - typically data movement in bits or bytes / sec
  • utilization - how busy a resource is over time, as a percentage
  • cost - the price / performance ratio

Workload characterization, understand the applied workload:

  • Who is causing the load? (PID, process, name, UID, IP Address)
  • Why is the load called? (code path, stack trace, flame graph)
  • What is the load? (IOPS, throughput, type)
  • How is the load changing over time? (per-interval summary)

Drill down analysis

Examining a metric, finding ways to decompose it into components, and so forth.

  1. Start by examining the highest level
  2. Examine next level details
  3. Pick the most interesting breakdown or clue
  4. If problem is unsolved, go back to step 2

USE method: for every resource, check Utilization, Saturation, and Errors.

60 second analysis:

  • uptime - quick overview of load averages; the three numbers are exponentially damped moving sum averages with 1-, 5- and 15-minute constants
  • dmesg | tail - shows recent kernel messages, e.g. OOM kills or TCP dropping requests
  • vmstat - virtual memory stats.
    • r - processes running on CPU or waiting for a turn (does not include disk I/O wait): r > CPU count => saturation.
    • free - free memory in KBytes
    • si/so - swap in & out: non-zero means the system is out of memory.
    • us, sy, id, wa and st: CPU time on average across all CPUs: user, system time (kernel), idle, wait I/O and stolen time.
  • mpstat -P ALL 1 - CPU time broken down into stats per CPU. CPU0 => 100% user time => single-threaded bottleneck.
  • pidstat 1 - rolling output of CPU usage per process.
  • iostat -xz 1 - storage device I/O metrics.
    • r/s, w/s - delivered reads, writes to the device
    • await - time spent waiting on I/O completion in ms
    • aqu-sz - average number of requests issued to the device: > 1 can indicate saturation.
    • %util - device utilization (busy %): > 60% usually means poor performance
  • free -m - available memory should not be near zero
  • sar -n DEV 1 - network device metrics
  • sar -n TCP,ETCP 1 - TCP metrics & errors:
    • active/s - number of locally initiated TCP connections / sec
    • passive/s - number of remotely initiated TCP connections / sec
    • retrans/s - number of retransmits / sec

BCC Tool checklist

  • execsnoop - shows new process execution by printing one line of output for every execve(2)
    • look for short lived processes often not seen by normal tools
  • opensnoop - prints one line of output for each open(2)
    • ERR column shows files that failed to open
  • ext4slower - traces common operations from ext4 fs (reads,writes, opens, syncs) and prints those that exceed the limit (10ms)
  • biolatency - traces disk I/O latency (time from device => completion) shown as histogram
  • biosnoop - prints a line of output for each disk I/O with details including latency
  • cachestat - prints a one-line summary every second showing stats from the FS cache
  • tcpconnect - prints one line of output for every active TCP connection (connect)
  • tcpaccept - prints one line of output for every passive TCP connection (accept)
  • tcpretrans - prints one line of output for every TCP retransmit packet
  • runqlat - times how long threads were waiting for their turn on CPU. Longer than expected waits for CPU access can be identified.
  • profile - CPU profiler, a tool to understand which code paths are consuming CPU resources. It takes samples of stack traces at timed intervals and prints a summary of unique stack traces + counts.

Debugger

https://github.com/dylandreimerink/edb

Sources:

  • https://lwn.net/Articles/740157/
  • https://docs.cilium.io/en/v1.8/bpf/
  • https://www.kernel.org/doc/html/latest/bpf/index.html
  • BPF Performance Tools

io_uring

  • https://lwn.net/Articles/776703/
  • https://lwn.net/Articles/847951/
  • https://lwn.net/Articles/803070/
  • https://lwn.net/Articles/815491/
  • https://lwn.net/Articles/858023/
  • https://lwn.net/Articles/810414/