bpf/ebpf
bpftrace
- docs
https://github.com/iovisor/bpftrace https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md http://www.brendangregg.com/BPF/bpftrace-cheat-sheet.html http://www.brendangregg.com/bpf-performance-tools-book.html
- articles
https://lwn.net/Articles/793749/ https://fedoramagazine.org/trace-code-in-fedora-with-bpftrace/
- Probe points
kprobe – kernel function start kretprobe – kernel function return uprobe – user-level function start uretprobe – user-level function return tracepoint – kernel static tracepoints usdt – user-level static tracepoints profile – timed sampling interval – timed output software – kernel software events hardware – processor-level events
- List kprobe/kretprobe, tracepoints, software and hardware probes
bpftrace -l
- List available fields in a tracepoint
bpftrace -lv "t:syscalls:sys_enter_execve"
- The uprobe/uretprobe and usdt probes are userspace probes specific to a given executable.
List probes for an executable
bpftrace -l "uprobe:/bin/bash"
Example to show the return values of readline
bpftrace -e 'uretprobe:/bin/bash:readline { printf("readline: \"%s\"\n", str(retval)); }'
A str() call is necessary to turn the char * pointer to a string.
-
The profile and interval probes fire at fixed time intervals.
-
Example: show the parent process (comm) and processes executed by it
bpftrace -e 't:syscalls:sys_enter_execve { printf("%s called %s\n", comm, str(args->filename)); }'
- Count system calls using maps
Some probes allow wildcards.
In this example, the action block attaches to all tracepoints whose name starts with t:syscalls:sys_enter_, which means all available syscalls.
The bpftrace builtin function count() counts the number of times this function is called. @[] represents a map (an associative array). The key of this map is probe, which is another bpftrace builtin that represents the full probe name.
bpftrace -e 't:syscalls:sys_enter_* { @[probe] = count(); }' bpftrace -e 't:syscalls:sys_enter_* / pid == 1234 / { @[probe] = count(); }'
examples
- Write bytes by process
bpftrace -e 't:syscalls:sys_exit_write /args->ret > 0/ { @[comm] = sum(args->ret); }'
it uses a filter to discard the negative values, which are error codes (/args->ret > 0/) comm represents the process name that called the syscall. sum() builtin function accumulates the number of bytes written for each map entry or process. the write syscall returns the number of written bytes. args->ret provides access to the bytes.
- Read size distribution by process (histogram)
bpftrace -e 't:syscalls:sys_exit_read { @[comm] = hist(args->ret); }'
Histograms are BPF maps, so they must always be attributed to a map (@). In this example, the map key is comm. To generate just one global histogram, attribute the hist() function just to ‘@’ (without any key).
- watch tcp_sendmsg size for ten seconds, and watch tcp_sendmsg errors
net/ipv4/tcp.c
: int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);
This is using a kprobe ("k") on tcp_sendmsg(), and saving a histogram of arg2 (size)
bpftrace -e 'k:tcp_sendmsg { @size = hist(arg2); } interval:s:10 { exit(); }'
Using a kretprobe ("kr"), and I'm frequency counting retval, which is either a negative error code or the size. It don't care about the size, so use a ternary operator to set all positive values to zero.
bpftrace -e 'kr:tcp_sendmsg { @retvals[retval > 0 ? 0 : retval] = count(); } interval:s:10 { exit(); }'
shipped scripts in bpftrace
killsnoop.bt – Trace signals issued by the kill() syscall. tcpconnect.bt – Trace all TCP network connections. pidpersec.bt – Count new procesess (via fork) per second. opensnoop.bt – Trace open() syscalls. vfsstat.bt – Count some VFS calls, with per-second summaries.