The BSD or Berkely Packet Filter is a register-based filter evaluator and network tap invented 1990 by Steven McCanne and Van Jacobson to replace the CMU/Stanford Packet Filter (CSPF) and Sun NIT filter technology with a faster alternative1).
While bpf consists of two components, the filter evaluator and the network tap, we'll ignore the network tap and focus on the filter evaluator instead.
The filter evaluator - the bpf pseudo machine - consists of
the addressing mode:
| code | description |
| #k | the literal value stored in k |
| #len | the length of the packet |
| M[k] | the word at offset k in the scratch memory store |
| [k] | the byte, halfword, or word at byte offset k in the packet |
| [x+k] | the byte, halfword, or word at offset x+k in the packet |
| L | an offset from the current instruction to L |
| #k, Lt, Lf | the offset to Lt if the predicate is true, otherwise the offset to Lf |
| x | the index register |
| 4*([k]&0xf) | four times the value of the low four bits of the byte at offset k in the packet |
source: 2)
And the instructions:
| opcodes | addr modes |
| LOAD INSTRUCTIONS |
| ldb | [k] | [x+k] |
| ldh | [k] | [x+k] |
| ld | #k | #len | M[k] | [k] | [x+k] |
| ldx | #k | #len | M[k] | 4*([k]&0xf) |
| STORE INSTRUCTIONS |
| st | M[k] |
| stx | M[k] |
| ALU INSTRUCTIONS |
| add | #k | x |
| sub | #k | x |
| mul | #k | x |
| div | #k | x |
| and | #k | x |
| or | #k | x |
| lsh | #k | x |
| rsh | #k | x |
| BRANCH INSTRUCTIONS |
| jmp | L |
| jeq | #k, Lt, Lf |
| jgt | #k, Lt, Lf |
| jge | #k, Lt, Lf |
| jset | #k, Lt, Lf |
| RETURN INSTRUCTIONS |
| ret | #k | a |
| MISC INSTRUCTIONS |
| tax | |
| txa | |
source: 3)
and instructions.
These instructions are encoded in a fixed-length format:
| 32bit |
| opcode:16 | jt:8 | jf:8 |
| k:32 |
The fields are:
In c, this is struct bpf_insn
struct bpf_insn {
u_short code;
u_char jt;
u_char jf;
u_long k;
};
4)
A bpf filter program is a bunch of instructions therefore, e.g.:
struct bpf_insn insns[] = {
BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 12),
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ETHERTYPE_REVARP, 0, 3),
BPF_STMT(BPF_LD+BPF_H+BPF_ABS, 20),
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, REVARP_REQUEST, 0, 1),
BPF_STMT(BPF_RET+BPF_K, sizeof(struct ether_arp) +
sizeof(struct ether_header)),
BPF_STMT(BPF_RET+BPF_K, 0),
};
Taken from the BPF manual: Taken from the Reverse ARP Daemon. It accepts only Reverse ARP requests.5)
To simplify interfacing bpf, there is a language which is compiled to bpf: the pcap-filter syntax 6).
This language is parsed via a lex&yacc defined grammar 7) 8) and converted into an internal block formatting 9).
This internal represenation is optional optimized 10) and converted to the bpf filter afterwards.
The optimization process is rather complicated11), and is the prone to problems therefore.
As a recent example, the following filter12) would hit an endless loop in the optimizer.
((ether[0:4] & 0xFFFFFF00) = 0x005056 or (ether[6:4] & 0xFFFFFF00) = 0x005056) and
((ether[0:4] & 0xFFFFFF00) = 0x001B0D or (ether[6:4] & 0xFFFFFF00) = 0x001B0D)
Manual optimization, one would expand the and
((ether[0:4] & 0xFFFFFF00) = 0x005056 and (ether[0:4] & 0xFFFFFF00) = 0x001B0D) or
((ether[0:4] & 0xFFFFFF00) = 0x005056 and (ether[6:4] & 0xFFFFFF00) = 0x001B0D) or
((ether[6:4] & 0xFFFFFF00) = 0x005056 and (ether[0:4] & 0xFFFFFF00) = 0x001B0D) or
((ether[6:4] & 0xFFFFFF00) = 0x005056 and (ether[6:4] & 0xFFFFFF00) = 0x001B0D)
and eliminate invalid combinations afterwards
((ether[0:4] & 0xFFFFFF00) = 0x005056 and (ether[6:4] & 0xFFFFFF00) = 0x001B0D) or
((ether[6:4] & 0xFFFFFF00) = 0x005056 and (ether[0:4] & 0xFFFFFF00) = 0x001B0D)
but, the optimizer in pcap is much more advanced.
As an example for optimizations, let's look on the filter '1+1=2' with optimaziation turned off:
sudo tcpdump -i lo -dO "1+1=2"
(000) ld #0x1
(001) st M[0]
(002) ld #0x1
(003) st M[1]
(004) ldx M[1]
(005) ld M[0]
(006) add x
(007) st M[1]
(008) ld #0x2
(009) st M[2]
(010) ldx M[2]
(011) ld M[1]
(012) sub x
(013) jeq #0x0 jt 14 jf 15
(014) ret #65535
(015) ret #0
and turned on:
sudo tcpdump -i lo -d "1+1=2"
(000) ret #65535
This optimized filter evaluates to true always.
Now, lets look on a different filter, which will evaluate to false everytime it is run '(1+1)*4/8*2=1'
sudo tcpdump -i lo -d "(1+1)*4/8*2=1"
tcpdump: expression rejects all packets
Besides resolving static calculations, the optimaziation process involves variable dependent re-ordering of blocks and other things.
Once the bpf_program is compiled, it can be attached to a socket by passing it to the kernel.
The kernel will check the filter for sanity, to make sure the filter will halt at a given point, so there are no backward branches and the last instruction in the filter is a BPF_RET.
The kernel implements the previously mentioned bpf pseudo machine, applies the filter to packets for the socket and has only the packets passing which match the filter.
The linux kernel is somewhat special, as it translates the bpf filter into a linux socket filter during the sanity check, and supports some non-standard operations, like matching packets per cpu.
To my knowledge there is no independent pcap-filter parser/compiler/optimizer besides the ones shipped with libpcap, so I started playing with antlr to see where I'll get with it.
I choose to convert the grammar.y file with some antlr2.7 code into an antlr grammar, and convert this antlr grammar into an antlr3 grammar later on.
It was the worst possible thing I could have done, and it would have been much faster to write it by hand.
After transforming the LR grammar into an LL manually, parsing things successfully, having some fun with the AST, I noticed antlr does not support python in the current version, last version supported is antlr 3.1 where 3.4 is current, and therefore decided to abandon it.
Given the linux kernel got a jit compiler for bpf filters recently13), I was courious and wanted to benchmark it.
So I grabbed the code from linux kernel, and build a userspace library of it.
To compare against, I decided to create a list of bpf engines:
the bpf-llvm
14) compiler from libtrace
15)
the FreeBSD
16) kernel jit compiler
17)
the linux kernel bpf vm
18)
linux kernel bpf jit for x86_64
19)
-
and compile a library of each of them, so I could interface it easily.
Once I had the engines running, I decided to run the benchmarking from within python using ctypes.
Having basics working, I created a testset of data using scapy, plan was 2GB of random IP|IPv6/TCP|UDP/Payload? data, but as creating 800MByte of it took more than 6 hours, I decided to stick with the 800MByte instead.
Well, I've had no experience with llvm, and the experience I've had by now could not have been much worse.
For the code I gathered from libtrace, the idea is to compile bpf-opcodes.c to bpf-opcodes.llvm.bc using llvm-gcc, and this new created file to bpf-opcodes.llvm.cc via llc, so the .cc file can be compiled into the library.
Starting with llvm-gcc to create bpf-opcodes.llvm.bc from bpf-opcodes.c:
Some time ago llvm-ggc was something standalone, now llvm-gcc is a bash script running gcc with the dragonegg plugin for llvm code generation.
Ubuntu ships a dragonegg plugin compiled with a different gcc version or something, I got some messages complaining about it.
llvm-gcc -std=c99 -c -O0 -emit-llvm -o bpf-opcodes.llvm.bc bpf-opcodes.c
Incompatible plugin version
cc1: error: Fail to initialize plugin /usr/lib/x86_64-linux-gnu/gcc/x86_64-linux-gnu/4.5/plugin/dragonegg.so
Setting dragonegg_disable_version_check=1 fixed this problem.
Next, llc to create bpf-opcodes.llvm.cc from bpf-opcodes.llvm.bc:
llc would get the job done …
llc -march=cpp -cppgen=module bpf-opcodes.llvm.bc -o bpf-opcodes.llvm.cc
llc: bpf-opcodes.llvm.bc:1:1: error: expected top-level entity
ELF...
So, the llvm-gcc -emit-llvm did not work, you'll get ELF instead, passing -S -flto did the job, thats a specialty for the dragonegg version of llvm-gcc.
dragonegg_disable_version_check=1 llvm-gcc -std=c99 -c -O0 -S -flto -o bpf-opcodes.llvm.bc bpf-opcodes.c
Getting the linux kernel jit compiler into userspace and working was not as painful as llvm. I had to come up with some definitions for sk_buff and others, but the jit code was prepared for changing offsets in sk_buff already.
I missed the kernels internal translation of bpf instructions to it's own code, and it took me some time to come over it, additionally the jit compiler code currently has a very subtile bug, and it took me some time to understand the problem , debug and fix it.
The FreeBSD jit code was trivial to port to userspace, they even have the proper guards for __KERNEL in place.
Really charming.
This one was pretty easy, as I knew about the translation thing already, there were little surprisese.
This one was userspace already, did not require and llvm tricks, and I did not have to write any code for it.
To create the data, I had to write a pcap file, so first I looked up the headers required and defined the structures accordingly using python ctypes.
import ctypes
class pcap_file_header(ctypes.Structure):
_fields_ = [("magic",ctypes.c_uint32),
("version_major",ctypes.c_uint16),
("version_minor",ctypes.c_uint16),
("thiszone",ctypes.c_int32),
("sigfigs",ctypes.c_uint32),
("snaplen",ctypes.c_int32),
("linktype",ctypes.c_int32)]
def to_buffer(self):
return buffer(self)[:]
class pcap_timeval(ctypes.Structure):
_fields_ = [("tv_sec",ctypes.c_uint32),
("tv_usec",ctypes.c_uint32)]
def to_buffer(self):
return bytes(self)[:]
class pcap_pkthdr(ctypes.Structure):
_fields_ = [("ts",pcap_timeval),
("caplen",ctypes.c_uint32),
("len",ctypes.c_uint32)]
def to_buffer(self):
return buffer(self)[:]
Next, a class which can read and write pcap files, so we can use to create the data and run the benchmark on the data later on:
from scapy.all import Ether,IP,IPv6,TCP,UDP,fuzz,RandIP,RandIP6,RandMAC,RandString,Raw
class pcap_file(object):
def __init__(self):
self.file = None
def create(self,path, snaplen=40,linktype=113):
self.file = open(path,'wb+')
self.snaplen = snaplen
hdr = pcap_file_header(magic=0xa1b2c3d4, version_major=2, version_minor=4, snaplen=snaplen, linktype=linktype)
self.file.write(hdr.to_buffer())
def open(self,path):
self.file = open(path,"rb")
hdr = pcap_file_header.from_buffer_copy(self.file.read(ctypes.sizeof(pcap_file_header)))
self.snaplen = hdr.snaplen
def close(self):
self.file.close()
self.file = None
def write(self,data,xferlen=None,ts=None):
if xferlen == None:
xferlen = len(data)
if ts == None:
ts = time.time()
ts = pcap_timeval(tv_sec=int(ts),tv_usec=int(ts-int(ts)))
hdr = pcap_pkthdr(ts=ts,caplen=min(self.snaplen,len(data)),len=len(data))
self.file.write(hdr.to_buffer())
if len(data) > self.snaplen:
self.file.write(data[:self.snaplen])
else:
self.file.write(data)
def read(self):
hdr = pcap_pkthdr.from_buffer_copy(self.file.read(ctypes.sizeof(pcap_pkthdr)))
pkt = self.file.read(hdr.caplen)
return pkt
def random(self):
import random
l3 = [(IP,{'src':RandIP(),'dst':RandIP()}),
(IPv6,{'src':RandIP6(),'dst':RandIP6()})]
l4 = [(UDP,{}),
(TCP,{})]
_l3 = l3[random.randint(0,len(l3)-1)]
_l4 = l4[random.randint(0,len(l4)-1)]
pkt = fuzz(Ether(src=RandMAC(),dst=RandMAC())/_l3[0](**_l3[1])/_l4[0](**_l4[1]))
if random.randint(0,100)>20:
pkt = pkt / Raw(load=RandString(size=random.randint(0,64)))
data = pkt.build()
self.write(data)
The ctypes glue to interface the previously created bpf libraries for the different engines is not that exciting, but some things are noteworthy.
For one, ctypes are pretty painful when dealing with variable length arrays in structs.
struct sk_filter
{
// atomic_t refcnt;
unsigned int len; /* Number of filter blocks */
unsigned int (*bpf_func)(const struct sk_buff *skb,
const struct sock_filter *filter);
// struct rcu_head rcu;
struct sock_filter insns[0];
};
Even worse, if you have a function which takes this struct as argument:
struct bpf_jit_t *bpf_jit_compile(struct sk_filter *fp)
Following the manual …
15.17.1.20. Variable-sized data types
Another way to use variable-sized data types with ctypes is to use the dynamic
nature of Python, and (re-)define the data type after the required size is already known,
on a case by case basis.
_sk_filter returns a class with the required number for the array, so the function arguments can be set properly our argument will be accepted as well.
class sock_filter(bpf_insn):
pass
class sk_filter(ctypes.Structure):
_fields_ = [("len", ctypes.c_uint),("bpf_func",ctypes.CFUNCTYPE(ctypes.c_int, ctypes.POINTER(sk_buff),ctypes.POINTER(bpf_jit_t))),("insns",sock_filter * 16)]
def _sk_filter(size):
class sk_filter(ctypes.Structure):
_fields_ = [("len", ctypes.c_uint),("bpf_func",ctypes.CFUNCTYPE(ctypes.c_int, ctypes.POINTER(sk_buff),ctypes.POINTER(bpf_jit_t))),("insns",sock_filter * size)]
return sk_filter
class bpfjitlinux(bpflinux):
jit = ctypes.cdll.LoadLibrary("/opt/bpf/lib/libbpfjitlinux.so")
jit.bpf_jit_compile.restype = ctypes.POINTER(bpf_jit_t)
jit.bpf_jit_compile.argtypes = [ctypes.POINTER(sk_filter)] # will be overwritten for the dynamic array in sk_filter at compile()
def __init__(self, snaplen=40,linktype=12,optimize=0,mask=0):
bpflinux.__init__(self,snaplen,linktype,optimize,mask)
def compile(self,pattern):
bpflinux.compile(self,pattern)
at = _sk_filter(self.program.bf_len)
self.jit.bpf_jit_compile.argtypes = [ctypes.POINTER(at)]
As I was working with bpf_insn already I decided to throw in some lines to graph things properly, using the bpf_program.bf_insns as source.
Given the data is random, I thought about filters.
Initial idea was providing some snippets, and creating all combinations without repetitons, or'ing them, done.
a = ["net 10.0.0.0/8","net 2001:470::/28","port 22","tcp","udp","portrange 0-1024"]
for i in range(len(a)):
for j in itertools.combinations(a,i):
print(' or '.join([" ({}) ".format(n) for n in j]))
Which would result in 63 filter patterns 21)
The 'longest' pattern would look like:
(net 2001:470::/28) or (port 22) or (tcp) or (udp) or (portrange 0-1024)
looking on this pattern, the pattern will match all data via ”(udp) or (tcp)”, and the braces should be random too therefore.
At this point I decided to come up with some handcrafted patterns, as I felt running a myriad of filters for each engine would exceed my cpu cycles anyway.
The first pattern I started with was:
“src net 10.0.0.0/8 or (dst net 127.0.0.1 and port 22) or src net 2001:470::/28”
I'd have loved to show the code generated somehow, but the graphs created for the jit code were just way too large, something which is easy to show is code size.
python benchmark/python/bpf.py bpf --infile /tmp/2G.pcap -p 50000 -o 0 -f "src net 10.0.0.0/8 or (dst net 127.0.0.1 and port 22) or src net 2001:470::/28" code -e jit-linux jit-freebsd
ls -al *.bin | awk ' { print $5" "$8 }' | sort -nr
1928 bpfjitfreebsd.bin
837 bpfjitlinux.bin
The second pattern which convinced me was:
”(tcp and portrange 0-1024) or (udp and portrange 1025-2048)”
This pattern was particular interesting as the linux jit compiler would segfault due to bad generated code for conditional jumps. While it took me some time to get an idea what was going wrong, the fix was trivial, so I notified the author, provided some example and asked him to take care of getting in linux mainline, which happened already22).
After fixing the jit compiler accordingly, here is the size of the generated code:
python benchmark/python/bpf.py bpf --infile /tmp/2G.pcap -p 50000 -o 0 -f "(tcp and portrange 0-1024) or (udp and portrange 1025-2048)" code -e jit-linux jit-freebsd
ls -al *.bin | awk ' { print $5" "$8 }' | sort -nr
4073 bpfjitfreebsd.bin
1697 bpfjitlinux.bin
While I was unable to get any useful information from llvm, it can be said the jit code freebsd generates is about twice the size of the linux jit. Looking on the jit compiler, this is easy to explain, for one linux does more passes to compact the code, but, way more important, linux has predefined functions to access the packet data at a given offset, in freebsd every access to the packet adds a new dump off such function to the machine code.
As I had everything required in python ctypes already, and I figured the overall overhead should be pretty much the same for every engine, I choose ctypes to benchmark.
python benchmark/python/bpf.py bpf -f "(tcp and portrange 0-1024) or (udp and portrange 1025-2048)" -i /tmp/2G.pcap -p -1000 --optimize 0 benchmark --count 10 --engines llvm pcap linux jit-linux jit-freebsd
Given the numbers I got, this was a misconception.
benchmark/python/bpf.py bpf -f "(tcp and portrange 0-1024) or (udp and portrange 1025-2048)" -i /tmp/2G.pcap -p -1 --optimize 1 benchmark --count 10 --engines llvm pcap linux jit-linux jit-freebsd;
| | min | max | total | avg | stddev |
| bpfjitfreebsd | 40.193334 | 40.871748 | 405.600501 | 40.560050 | 0.207135 |
| bpfllvm | 35.406323 | 37.177058 | 359.292445 | 35.929244 | 0.467665 |
| bpfjitlinux | 52.887720 | 54.133430 | 534.270949 | 53.427094 | 0.391791 |
| bpflinux | 51.713143 | 52.517396 | 520.397351 | 52.039735 | 0.253077 |
| bpf | 29.463384 | 30.435716 | 298.915537 | 29.891553 | 0.330827 |
Therefore I wrote the benchmarking in c, and given the numbers I got there, counting fractions of a second did not work out.
The c util does not feature such table output format.
benchmark/c/bpf -r /tmp/2G.pcap -c 10 -p -1 --filter "(tcp and portrange 0-1024) or (udp and portrange 1025-2048)" --engine pcap
engine pcap
snaplen = 128
3.501191 268023
3.471514 268023
3.470367 268023
3.469340 268023
3.467841 268023
3.465518 268023
3.471081 268023
3.472521 268023
3.472745 268023
3.469957 268023
benchmark/c/bpf -r /tmp/2G.pcap -c 10 -p -1 --filter "(tcp and portrange 0-1024) or (udp and portrange 1025-2048)" --engine jit-linux
engine jit-linux
snaplen = 128
1.566204 268023
1.524094 268023
1.524044 268023
1.521646 268023
1.533437 268023
1.525648 268023
1.523268 268023
1.522569 268023
1.526384 268023
1.524632 268023
The jit seemed faster, but counting cycles was the better choice, so I made use of Valgrinds Callgrind to count the cycles.
I'd love to be able to run Cachegrind on jit code too, but I was unable to figure out how to find the jit function in the cachegrind profiling data.
My steps to create the callgrind profiling data, as callgrind slows things down by factor upto 100, I choose to use it with 1000 packets only:
for i in pcap linux jit-linux jit-freebsd llvm;
do valgrind --tool=callgrind --callgrind-out-file=doc/pattern1/$i.callgrind benchmark/c/.libs/bpf -r /tmp/2G.pcap -c 1 -p 1000 --filter "src net 10.0.0.0/8 or (dst net 127.0.0.1 and port 22) or src net 2001:470::/28" --engine $i;
done

pcap overview

linux detail

linux overview

linux detail

llvm overview

llvm detail

linux overview

jit-linux detail

jit-freebsd overview

jit-freebsd detail
Besides from the cycles required to filter the packets, we even got the overhead for creating the filter.
As every engine works on bpf_program (somehow), every engine has the overhead required to create the bpf_program from the pcap-syntax filter.
For llvm, creating the jit code exceeds the cycles required to filter the packets by factor 200, for linux (without jit), the number of cycles to transform the filter are so low (9245), they are not even in the overall graph.
| engine | filter cycles | compile cycles |
| jit-linux | 106468 | 33126+72796 |
| jit-freebsd | 113958 | 48292+72796 |
| llvm | 157394 | 380843640+72796 |
| pcap | 276910 | 72796 |
| linux | 351391 | 9245+72796 |
Same steps to create the callgrind profiling data, just the pattern changed:
for i in pcap linux jit-linux jit-freebsd llvm;
do valgrind --tool=callgrind --callgrind-out-file=doc/pattern1/$i.callgrind benchmark/c/.libs/bpf -r /tmp/2G.pcap -c 1 -p 1000 --filter "(tcp and portrange 0-1024) or (udp and portrange 1025-2048)" --engine $i;
done
Given we saw the kcachegrind images already, and just the numbers will change, just the numbers this time.
| engine | filter cycles | compile cycles |
| jit-linux | 180766 | 49211+84160 |
| jit-freebsd | 198545 | 92231+84160 |
| llvm | 274455 | 652949592+84160 |
| pcap | 485792 | 84160 |
| linux | 630297 | 8245+84160 |
Overall the performance improvement for the linux bpf jit is fine, it is about 3.4 times faster than the standard version, and even the fastest of all jit engines, beating freebsd by about 10%. Interestingly the costs to compile the linux jit are lower than the costs to compile the freebsd jit, even though the linux jit can take more passes compacting the code.
Being fast is great, but being right is substantial.
So I decided to compare the engines next to each other.
Noting down packets where the results for the different engines would mismatch, and there is a clear winner.
While all engines agreed whether a packet matched the filter or not, the llvm engine was different.
Example, the pattern “tcp payload starts with an upper character”, llvm disagreed with everybody else in 1% of the cases:
python benchmark/python/bpf.py bpf --infile /tmp/2G.pcap --packets -1 --filter "tcp[20:1] > 65 and tcp[20:1] < 90" --optimize 1 cmp -e pcap jit-linux linux llvm jit-freebsd
| | bpf | bpfjitfreebsd | bpfjitlinux | bpflinux | bpfllvm |
| bpf | M:654619 F:8009263 | m: 0 f: 0 | m: 0 f: 0 | m: 0 f: 0 | m: 0 f: 81831 |
| bpfjitfreebsd | | M:654619 F:8009263 | m: 0 f: 0 | m: 0 f: 0 | m: 0 f: 81831 |
| bpfjitlinux | | | M:654619 F:8009263 | m: 0 f: 0 | m: 0 f: 81831 |
| bpflinux | | | | M:654619 F:8009263 | m: 0 f: 81831 |
| bpfllvm | | | | | M:572788 F:8091094 |
While I won't keep the code up to date with latest enhancements in libtrace, linux or freebsd kernel, I think it may be a good starting point for others.
Therefore the code is available on git.
The project bpftk consists of all the pieces and snippets I copied together:
the different bpf engines (all build as libraries)
/jit-linux
/jit-freebsd
/linux
/llvm
bpfc compiler copied from netsniff-ng (library)
antlr pcap-syntax grammar and conversion things
benchmark
/benchmark/python - the python script to interface the bpf libraries and have fun with bpf filters, e.g. graphing a filter, dumping the jit code, overall great for prototyping …
benchmark/c - the c code to benchmark for real numbers
I threw in the minimum of automagic, to make sure I can compile it properly myself.
May not work for you though.
2011:12:28:bpf_performance [carnivore news]