bpf/ebpf

bpftrace

https://github.com/iovisor/bpftrace
https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md
http://www.brendangregg.com/BPF/bpftrace-cheat-sheet.html
http://www.brendangregg.com/bpf-performance-tools-book.html
https://lwn.net/Articles/793749/
https://fedoramagazine.org/trace-code-in-fedora-with-bpftrace/

kprobe – kernel function start
kretprobe – kernel function return
uprobe – user-level function start
uretprobe – user-level function return
tracepoint – kernel static tracepoints
usdt – user-level static tracepoints
profile – timed sampling
interval – timed output
software – kernel software events
hardware – processor-level events

List all available probes, or show the arguments of a single tracepoint with -v

bpftrace -l
bpftrace -lv "t:syscalls:sys_enter_execve"

List probes for an executable

bpftrace -l 'uprobe:/bin/bash:*'

Example to show the return values of readline

bpftrace -e 'uretprobe:/bin/bash:readline { printf("readline: \"%s\"\n", str(retval)); }'

A str() call is necessary to turn the char * pointer into a string. The same applies to the filename argument of the execve tracepoint:

bpftrace -e 't:syscalls:sys_enter_execve { printf("%s called %s\n", comm, str(args->filename)); }'

Some probes allow wildcards.

In this example, the action block attaches to all tracepoints whose name starts with t:syscalls:sys_enter_, i.e. the entry point of every available syscall.

The bpftrace builtin count() counts how many times it is called, which here is how many times the probe fires. @[] is a map (an associative array). The key of this map is probe, another bpftrace builtin that holds the full probe name, so the output is a per-syscall count. The second command adds a filter (/ pid == 1234 /) so that only events from process ID 1234 are counted.

bpftrace -e 't:syscalls:sys_enter_* { @[probe] = count(); }'
bpftrace -e 't:syscalls:sys_enter_* / pid == 1234 / { @[probe] = count(); }'
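Map keys can have more than one part. As a minimal sketch, the helper below counts syscalls per (process name, probe) pair and exits by itself after a few seconds. The function name count_syscalls and its seconds parameter are made up for illustration; only the bpftrace program inside is the point.

```shell
# Hypothetical helper: count syscalls per (process name, probe) for N seconds.
# Must be run as root. On exit, bpftrace prints the contents of the map.
count_syscalls() {
    secs="${1:-5}"   # tracing duration in seconds, defaults to 5
    bpftrace -e "t:syscalls:sys_enter_* { @[comm, probe] = count(); }
                 interval:s:${secs} { exit(); }"
}
```

Usage: count_syscalls 10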

examples

bpftrace -e 't:syscalls:sys_exit_write /args->ret > 0/ { @[comm] = sum(args->ret); }'

It uses a filter (/args->ret > 0/) to discard negative values, which are error codes (zero-byte writes are dropped as well). comm is the name of the process that called the syscall. The write syscall returns the number of bytes written, which is available as args->ret. The sum() builtin accumulates those bytes per map key, i.e. per process name.

bpftrace -e 't:syscalls:sys_exit_read { @[comm] = hist(args->ret); }'

Histograms are stored in BPF maps, so hist() must always be assigned to a map (@). In this example, the map key is comm, which gives one histogram per process name. To generate just one global histogram, assign hist() to a bare @ (without any key).
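The keyless form can be sketched as follows. This is an illustration rather than a shipped tool: the function name read_size_hist and its seconds parameter are invented here, and the filter on positive return values is an added assumption so that error codes do not end up in the histogram.

```shell
# Hypothetical helper: one global histogram of successful read() sizes.
# The bare "@" map has no key, so all processes share a single histogram.
# Must be run as root; stops by itself after N seconds (default 10).
read_size_hist() {
    secs="${1:-10}"
    bpftrace -e "t:syscalls:sys_exit_read /args->ret > 0/ { @ = hist(args->ret); }
                 interval:s:${secs} { exit(); }"
}
```

Usage: read_size_hist 30. The histogram buckets are powers of two.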

net/ipv4/tcp.c: int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size);

This uses a kprobe ("k") on tcp_sendmsg() and saves a histogram of arg2, the third argument (size). The interval probe exits after 10 seconds, at which point the map is printed.

bpftrace -e 'k:tcp_sendmsg { @size = hist(arg2); } interval:s:10 { exit(); }'

This uses a kretprobe ("kr") to frequency-count retval, which is either a negative error code or the size. I don't care about the size, so a ternary operator sets all positive values to zero.

bpftrace -e 'kr:tcp_sendmsg { @retvals[retval > 0 ? 0 : retval] = count(); }
        interval:s:10 { exit(); }'

Scripts shipped with bpftrace

    killsnoop.bt – Trace signals issued by the kill() syscall.
    tcpconnect.bt – Trace all TCP network connections.
    pidpersec.bt – Count new processes (via fork) per second.
    opensnoop.bt – Trace open() syscalls.
    vfsstat.bt – Count some VFS calls, with per-second summaries.
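The shipped tools are ordinary bpftrace programs stored in .bt files, so a one-liner can be turned into a script the same way. A minimal sketch, reusing the execve example from above; the file name execsnoop-lite.bt is hypothetical and not one of the shipped tools.

```shell
# Write a small bpftrace script to the current directory.
# The quoted 'EOF' keeps the shell from expanding anything inside the program.
cat > execsnoop-lite.bt <<'EOF'
#!/usr/bin/env bpftrace
// execsnoop-lite.bt: hypothetical example script, not a shipped tool.
BEGIN
{
    printf("Tracing execve()... Hit Ctrl-C to end.\n");
}

tracepoint:syscalls:sys_enter_execve
{
    printf("%s called %s\n", comm, str(args->filename));
}
EOF
chmod +x execsnoop-lite.bt

# Run it as root with either of:
#   bpftrace execsnoop-lite.bt
#   ./execsnoop-lite.bt
```

The BEGIN probe fires once at startup, which is a common place for a banner line.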