bpf performance

about bpf

The BSD or Berkely Packet Filter is a register-based filter evaluator and network tap invented 1990 by Steven McCanne and Van Jacobson to replace the CMU/Stanford Packet Filter (CSPF) and Sun NIT filter technology with a faster alternative1). While bpf consists of two components, the filter evaluator and the network tap, we'll ignore the network tap and focus on the filter evaluator instead.

The filter evaluator - the bpf pseudo machine - consists of

  • an accumulator (A)
  • an index register (x)
  • scratch memory store (M[])
  • an program counter (pc)

the addressing mode:

code description
#k the literal value stored in k
#len the length of the packet
M[k] the word at offset k in the scratch memory store
[k] the byte, halfword, or word at byte offset k in the packet
[x+k] the byte, halfword, or word at offset x+k in the packet
L an offset from the current instruction to L
#k, Lt, Lf the offset to Lt if the predicate is true, otherwise the offset to Lf
x the index register
4*([k]&0xf)four times the value of the low four bits of the byte at offset k in the packet

source: 2)

And the instructions:

opcodes addr modes
ldb [k] [x+k]
ldh [k] [x+k]
ld #k #len M[k] [k] [x+k]
ldx #k #len M[k] 4*([k]&0xf)
st M[k]
stx M[k]
add #k x
sub #k x
mul #k x
div #k x
and #k x
or #k x
lsh #k x
rsh #k x
jmp L
jeq #k, Lt, Lf
jgt #k, Lt, Lf
jge #k, Lt, Lf
jset #k, Lt, Lf
ret #k a

source: 3)

and instructions.

These instructions are encoded in a fixed-length format:


The fields are:

  • opcode - indicationg instruction type and addressing mode
  • jt/jf - relative offsets for the conditional branch instructions
  • k - the generic field

In c, this is struct bpf_insn

struct bpf_insn {
        u_short code;
        u_char  jt;
        u_char  jf;
        u_long k;


A bpf filter program is a bunch of instructions therefore, e.g.:

struct bpf_insn insns[] = {
        BPF_STMT(BPF_RET+BPF_K, sizeof(struct ether_arp) +
                 sizeof(struct ether_header)),
        BPF_STMT(BPF_RET+BPF_K, 0),

Taken from the BPF manual: Taken from the Reverse ARP Daemon. It accepts only Reverse ARP requests.5)

To simplify interfacing bpf, there is a language which is compiled to bpf: the pcap-filter syntax 6). This language is parsed via a lex&yacc defined grammar 7) 8) and converted into an internal block formatting 9). This internal represenation is optional optimized 10) and converted to the bpf filter afterwards.

The optimization process is rather complicated11), and is the prone to problems therefore. As a recent example, the following filter12) would hit an endless loop in the optimizer.

((ether[0:4] & 0xFFFFFF00) = 0x005056 or (ether[6:4] & 0xFFFFFF00) = 0x005056) and 
((ether[0:4] & 0xFFFFFF00) = 0x001B0D or (ether[6:4] & 0xFFFFFF00) = 0x001B0D)

Manual optimization, one would expand the and

((ether[0:4] & 0xFFFFFF00) = 0x005056 and (ether[0:4] & 0xFFFFFF00) = 0x001B0D) or
((ether[0:4] & 0xFFFFFF00) = 0x005056 and (ether[6:4] & 0xFFFFFF00) = 0x001B0D) or
((ether[6:4] & 0xFFFFFF00) = 0x005056 and (ether[0:4] & 0xFFFFFF00) = 0x001B0D) or
((ether[6:4] & 0xFFFFFF00) = 0x005056 and (ether[6:4] & 0xFFFFFF00) = 0x001B0D)

and eliminate invalid combinations afterwards

((ether[0:4] & 0xFFFFFF00) = 0x005056 and (ether[6:4] & 0xFFFFFF00) = 0x001B0D) or
((ether[6:4] & 0xFFFFFF00) = 0x005056 and (ether[0:4] & 0xFFFFFF00) = 0x001B0D)

but, the optimizer in pcap is much more advanced.

As an example for optimizations, let's look on the filter '1+1=2' with optimaziation turned off:

sudo tcpdump -i lo -dO "1+1=2"
(000) ld       #0x1
(001) st       M[0]
(002) ld       #0x1
(003) st       M[1]
(004) ldx      M[1]
(005) ld       M[0]
(006) add      x
(007) st       M[1]
(008) ld       #0x2
(009) st       M[2]
(010) ldx      M[2]
(011) ld       M[1]
(012) sub      x
(013) jeq      #0x0             jt 14	jf 15
(014) ret      #65535
(015) ret      #0

and turned on:

sudo tcpdump -i lo -d "1+1=2"
(000) ret      #65535

This optimized filter evaluates to true always.

Now, lets look on a different filter, which will evaluate to false everytime it is run '(1+1)*4/8*2=1'

sudo tcpdump -i lo -d "(1+1)*4/8*2=1"
tcpdump: expression rejects all packets

Besides resolving static calculations, the optimaziation process involves variable dependent re-ordering of blocks and other things.

Once the bpf_program is compiled, it can be attached to a socket by passing it to the kernel. The kernel will check the filter for sanity, to make sure the filter will halt at a given point, so there are no backward branches and the last instruction in the filter is a BPF_RET. The kernel implements the previously mentioned bpf pseudo machine, applies the filter to packets for the socket and has only the packets passing which match the filter.

The linux kernel is somewhat special, as it translates the bpf filter into a linux socket filter during the sanity check, and supports some non-standard operations, like matching packets per cpu.

To my knowledge there is no independent pcap-filter parser/compiler/optimizer besides the ones shipped with libpcap, so I started playing with antlr to see where I'll get with it. I choose to convert the grammar.y file with some antlr2.7 code into an antlr grammar, and convert this antlr grammar into an antlr3 grammar later on. It was the worst possible thing I could have done, and it would have been much faster to write it by hand. After transforming the LR grammar into an LL manually, parsing things successfully, having some fun with the AST, I noticed antlr does not support python in the current version, last version supported is antlr 3.1 where 3.4 is current, and therefore decided to abandon it.

bpf jit

Given the linux kernel got a jit compiler for bpf filters recently13), I was courious and wanted to benchmark it. So I grabbed the code from linux kernel, and build a userspace library of it.

To compare against, I decided to create a list of bpf engines:

  • the bpf-llvm14) compiler from libtrace 15)
  • the FreeBSD16) kernel jit compiler 17)
  • the linux kernel bpf vm 18)
  • linux kernel bpf jit for x86_64 19)
  • libpcaps bpf_filter 20)

and compile a library of each of them, so I could interface it easily. Once I had the engines running, I decided to run the benchmarking from within python using ctypes.

Having basics working, I created a testset of data using scapy, plan was 2GB of random IP|IPv6/TCP|UDP/Payload? data, but as creating 800MByte of it took more than 6 hours, I decided to stick with the 800MByte instead.

the participants

libtrace llvm's powered bpf-jit

Well, I've had no experience with llvm, and the experience I've had by now could not have been much worse. For the code I gathered from libtrace, the idea is to compile bpf-opcodes.c to bpf-opcodes.llvm.bc using llvm-gcc, and this new created file to bpf-opcodes.llvm.cc via llc, so the .cc file can be compiled into the library. Starting with llvm-gcc to create bpf-opcodes.llvm.bc from bpf-opcodes.c:
Some time ago llvm-ggc was something standalone, now llvm-gcc is a bash script running gcc with the dragonegg plugin for llvm code generation. Ubuntu ships a dragonegg plugin compiled with a different gcc version or something, I got some messages complaining about it.

llvm-gcc -std=c99 -c -O0 -emit-llvm -o bpf-opcodes.llvm.bc bpf-opcodes.c
Incompatible plugin version
cc1: error: Fail to initialize plugin /usr/lib/x86_64-linux-gnu/gcc/x86_64-linux-gnu/4.5/plugin/dragonegg.so

Setting dragonegg_disable_version_check=1 fixed this problem.

Next, llc to create bpf-opcodes.llvm.cc from bpf-opcodes.llvm.bc:
llc would get the job done …

llc -march=cpp -cppgen=module  bpf-opcodes.llvm.bc -o bpf-opcodes.llvm.cc
llc: bpf-opcodes.llvm.bc:1:1: error: expected top-level entity

So, the llvm-gcc -emit-llvm did not work, you'll get ELF instead, passing -S -flto did the job, thats a specialty for the dragonegg version of llvm-gcc.

dragonegg_disable_version_check=1 llvm-gcc -std=c99 -c -O0 -S -flto -o bpf-opcodes.llvm.bc bpf-opcodes.c

Linux x86_64 JIT

Getting the linux kernel jit compiler into userspace and working was not as painful as llvm. I had to come up with some definitions for sk_buff and others, but the jit code was prepared for changing offsets in sk_buff already.
I missed the kernels internal translation of bpf instructions to it's own code, and it took me some time to come over it, additionally the jit compiler code currently has a very subtile bug, and it took me some time to understand the problem , debug and fix it.

Freebsd 9 x86_64 JIT

The FreeBSD jit code was trivial to port to userspace, they even have the proper guards for __KERNEL in place. Really charming.

Linux sk_run_filter

This one was pretty easy, as I knew about the translation thing already, there were little surprisese.

libpcap bpf_filter

This one was userspace already, did not require and llvm tricks, and I did not have to write any code for it.

the benchmarking data

To create the data, I had to write a pcap file, so first I looked up the headers required and defined the structures accordingly using python ctypes.

import ctypes
class pcap_file_header(ctypes.Structure):
	_fields_ = [("magic",ctypes.c_uint32),
	def to_buffer(self):
		return buffer(self)[:]
class pcap_timeval(ctypes.Structure):
	_fields_ = [("tv_sec",ctypes.c_uint32),
	def to_buffer(self):
		return bytes(self)[:]
class pcap_pkthdr(ctypes.Structure):
	_fields_ = [("ts",pcap_timeval),
	def to_buffer(self):
		return buffer(self)[:]

Next, a class which can read and write pcap files, so we can use to create the data and run the benchmark on the data later on:

from scapy.all import Ether,IP,IPv6,TCP,UDP,fuzz,RandIP,RandIP6,RandMAC,RandString,Raw
class pcap_file(object):
	def __init__(self):
		self.file = None
	def create(self,path, snaplen=40,linktype=113):
		self.file = open(path,'wb+')
		self.snaplen = snaplen
		hdr = pcap_file_header(magic=0xa1b2c3d4, version_major=2, version_minor=4, snaplen=snaplen, linktype=linktype)
	def open(self,path):
		self.file = open(path,"rb")
		hdr = pcap_file_header.from_buffer_copy(self.file.read(ctypes.sizeof(pcap_file_header)))
		self.snaplen = hdr.snaplen
	def close(self):
		self.file = None
	def write(self,data,xferlen=None,ts=None):
		if xferlen == None:
			xferlen = len(data)
		if ts == None:
			ts = time.time()
		ts = pcap_timeval(tv_sec=int(ts),tv_usec=int(ts-int(ts)))
		hdr = pcap_pkthdr(ts=ts,caplen=min(self.snaplen,len(data)),len=len(data))
		if len(data) > self.snaplen:
	def read(self):
		hdr = pcap_pkthdr.from_buffer_copy(self.file.read(ctypes.sizeof(pcap_pkthdr)))
		pkt = self.file.read(hdr.caplen)
		return pkt
	def random(self):
		import random
		l3 = [(IP,{'src':RandIP(),'dst':RandIP()}),
		l4 = [(UDP,{}),
		_l3 = l3[random.randint(0,len(l3)-1)]
		_l4 = l4[random.randint(0,len(l4)-1)]
		pkt = fuzz(Ether(src=RandMAC(),dst=RandMAC())/_l3[0](**_l3[1])/_l4[0](**_l4[1]))
		if random.randint(0,100)>20:
			pkt = pkt / Raw(load=RandString(size=random.randint(0,64)))
		data = pkt.build()

the ctypes glue

The ctypes glue to interface the previously created bpf libraries for the different engines is not that exciting, but some things are noteworthy. For one, ctypes are pretty painful when dealing with variable length arrays in structs.

struct sk_filter
//	atomic_t		refcnt;
	unsigned int         	len;	/* Number of filter blocks */
	unsigned int		(*bpf_func)(const struct sk_buff *skb,
					    const struct sock_filter *filter);
//	struct rcu_head		rcu;
	struct sock_filter     	insns[0];

Even worse, if you have a function which takes this struct as argument:

struct bpf_jit_t *bpf_jit_compile(struct sk_filter *fp)

Following the manual … Variable-sized data types
Another way to use variable-sized data types with ctypes is to use the dynamic 
nature of Python, and (re-)define the data type after the required size is already known, 
on a case by case basis.

_sk_filter returns a class with the required number for the array, so the function arguments can be set properly our argument will be accepted as well.

class sock_filter(bpf_insn):
class sk_filter(ctypes.Structure):
	_fields_ = [("len", ctypes.c_uint),("bpf_func",ctypes.CFUNCTYPE(ctypes.c_int, ctypes.POINTER(sk_buff),ctypes.POINTER(bpf_jit_t))),("insns",sock_filter * 16)]
def _sk_filter(size):
	class sk_filter(ctypes.Structure):
		_fields_ = [("len", ctypes.c_uint),("bpf_func",ctypes.CFUNCTYPE(ctypes.c_int, ctypes.POINTER(sk_buff),ctypes.POINTER(bpf_jit_t))),("insns",sock_filter * size)]
	return sk_filter
class bpfjitlinux(bpflinux):
	jit = ctypes.cdll.LoadLibrary("/opt/bpf/lib/libbpfjitlinux.so")
	jit.bpf_jit_compile.restype = ctypes.POINTER(bpf_jit_t)
	jit.bpf_jit_compile.argtypes = [ctypes.POINTER(sk_filter)] # will be overwritten for the dynamic array in sk_filter at compile()
	def __init__(self, snaplen=40,linktype=12,optimize=0,mask=0):
	def compile(self,pattern):
		at = _sk_filter(self.program.bf_len)
		self.jit.bpf_jit_compile.argtypes = [ctypes.POINTER(at)]

As I was working with bpf_insn already I decided to throw in some lines to graph things properly, using the bpf_program.bf_insns as source.

the interesting part

Given the data is random, I thought about filters. Initial idea was providing some snippets, and creating all combinations without repetitons, or'ing them, done.

a = ["net","net 2001:470::/28","port 22","tcp","udp","portrange 0-1024"]
for i in range(len(a)):
	for j in itertools.combinations(a,i):
		print(' or '.join([" ({}) ".format(n) for n in j]))

Which would result in 63 filter patterns 21)

The 'longest' pattern would look like:

 (net 2001:470::/28)  or  (port 22)  or  (tcp)  or  (udp)  or  (portrange 0-1024) 

looking on this pattern, the pattern will match all data via ”(udp) or (tcp)”, and the braces should be random too therefore. At this point I decided to come up with some handcrafted patterns, as I felt running a myriad of filters for each engine would exceed my cpu cycles anyway.

pattern #1

The first pattern I started with was: “src net or (dst net and port 22) or src net 2001:470::/28”

I'd have loved to show the code generated somehow, but the graphs created for the jit code were just way too large, something which is easy to show is code size.

python benchmark/python/bpf.py bpf --infile /tmp/2G.pcap -p 50000 -o 0 -f  "src net or (dst net and port 22) or src net 2001:470::/28"  code -e jit-linux jit-freebsd
ls -al *.bin | awk ' { print $5" "$8 }' | sort -nr
1928 bpfjitfreebsd.bin
837 bpfjitlinux.bin

pattern #2

The second pattern which convinced me was: ”(tcp and portrange 0-1024) or (udp and portrange 1025-2048)”

This pattern was particular interesting as the linux jit compiler would segfault due to bad generated code for conditional jumps. While it took me some time to get an idea what was going wrong, the fix was trivial, so I notified the author, provided some example and asked him to take care of getting in linux mainline, which happened already22).

After fixing the jit compiler accordingly, here is the size of the generated code:

python benchmark/python/bpf.py bpf --infile /tmp/2G.pcap -p 50000 -o 0 -f  "(tcp and portrange 0-1024) or (udp and portrange 1025-2048)"  code -e jit-linux jit-freebsd
ls -al *.bin | awk ' { print $5" "$8 }' | sort -nr
4073 bpfjitfreebsd.bin
1697 bpfjitlinux.bin

While I was unable to get any useful information from llvm, it can be said the jit code freebsd generates is about twice the size of the linux jit. Looking on the jit compiler, this is easy to explain, for one linux does more passes to compact the code, but, way more important, linux has predefined functions to access the packet data at a given offset, in freebsd every access to the packet adds a new dump off such function to the machine code.

beating the clock

As I had everything required in python ctypes already, and I figured the overall overhead should be pretty much the same for every engine, I choose ctypes to benchmark.

python benchmark/python/bpf.py bpf -f "(tcp and portrange 0-1024) or (udp and portrange 1025-2048)" -i /tmp/2G.pcap -p -1000 --optimize 0 benchmark --count 10 --engines llvm pcap linux jit-linux jit-freebsd

Given the numbers I got, this was a misconception.

benchmark/python/bpf.py bpf -f "(tcp and portrange 0-1024) or (udp and portrange 1025-2048)" -i /tmp/2G.pcap -p -1 --optimize 1 benchmark --count 10 --engines llvm pcap linux jit-linux jit-freebsd;
min max total avg stddev
bpfjitfreebsd 40.193334 40.871748 405.600501 40.560050 0.207135
bpfllvm 35.406323 37.177058 359.292445 35.929244 0.467665
bpfjitlinux 52.887720 54.133430 534.270949 53.427094 0.391791
bpflinux 51.713143 52.517396 520.397351 52.039735 0.253077
bpf 29.463384 30.435716 298.915537 29.891553 0.330827

Therefore I wrote the benchmarking in c, and given the numbers I got there, counting fractions of a second did not work out. The c util does not feature such table output format.

benchmark/c/bpf -r /tmp/2G.pcap -c 10 -p -1 --filter "(tcp and portrange 0-1024) or (udp and portrange 1025-2048)" --engine pcap
engine pcap
snaplen = 128
3.501191 268023
3.471514 268023
3.470367 268023
3.469340 268023
3.467841 268023
3.465518 268023
3.471081 268023
3.472521 268023
3.472745 268023
3.469957 268023
benchmark/c/bpf -r /tmp/2G.pcap -c 10 -p -1 --filter "(tcp and portrange 0-1024) or (udp and portrange 1025-2048)" --engine jit-linux
engine jit-linux
snaplen = 128
1.566204 268023
1.524094 268023
1.524044 268023
1.521646 268023
1.533437 268023
1.525648 268023
1.523268 268023
1.522569 268023
1.526384 268023
1.524632 268023

The jit seemed faster, but counting cycles was the better choice, so I made use of Valgrinds Callgrind to count the cycles. I'd love to be able to run Cachegrind on jit code too, but I was unable to figure out how to find the jit function in the cachegrind profiling data.

pattern "src net or (dst net and port 22) or src net 2001:470::/28"

My steps to create the callgrind profiling data, as callgrind slows things down by factor upto 100, I choose to use it with 1000 packets only:

for i in pcap linux jit-linux jit-freebsd llvm; 
	do valgrind --tool=callgrind --callgrind-out-file=doc/pattern1/$i.callgrind benchmark/c/.libs/bpf -r /tmp/2G.pcap -c 1 -p 1000 --filter "src net or (dst net and port 22) or src net 2001:470::/28" --engine $i; 

pcap overview
pcap overview
pcap detail
linux detail


linux overview
linux overview
linux detail
linux detail


llvm overview
llvm overview
llvm detail
llvm detail


jitlinux overview
linux overview
jitlinux detail
jit-linux detail


jit-freebsd overview
jit-freebsd overview
jit-freebsd detail
jit-freebsd detail


Besides from the cycles required to filter the packets, we even got the overhead for creating the filter. As every engine works on bpf_program (somehow), every engine has the overhead required to create the bpf_program from the pcap-syntax filter. For llvm, creating the jit code exceeds the cycles required to filter the packets by factor 200, for linux (without jit), the number of cycles to transform the filter are so low (9245), they are not even in the overall graph.

engine filter cycles compile cycles
jit-linux 106468 33126+72796
jit-freebsd 113958 48292+72796
llvm 157394 380843640+72796
pcap 276910 72796
linux 351391 9245+72796

pattern "(tcp and portrange 0-1024) or (udp and portrange 1025-2048)"

Same steps to create the callgrind profiling data, just the pattern changed:

for i in pcap linux jit-linux jit-freebsd llvm; 
	do valgrind --tool=callgrind --callgrind-out-file=doc/pattern1/$i.callgrind benchmark/c/.libs/bpf -r /tmp/2G.pcap -c 1 -p 1000 --filter "(tcp and portrange 0-1024) or (udp and portrange 1025-2048)" --engine $i; 

Given we saw the kcachegrind images already, and just the numbers will change, just the numbers this time.

engine filter cycles compile cycles
jit-linux 180766 49211+84160
jit-freebsd 198545 92231+84160
llvm 274455 652949592+84160
pcap 485792 84160
linux 630297 8245+84160

Overall the performance improvement for the linux bpf jit is fine, it is about 3.4 times faster than the standard version, and even the fastest of all jit engines, beating freebsd by about 10%. Interestingly the costs to compile the linux jit are lower than the costs to compile the freebsd jit, even though the linux jit can take more passes compacting the code.


Being fast is great, but being right is substantial.
So I decided to compare the engines next to each other. Noting down packets where the results for the different engines would mismatch, and there is a clear winner.
While all engines agreed whether a packet matched the filter or not, the llvm engine was different.

Example, the pattern “tcp payload starts with an upper character”, llvm disagreed with everybody else in 1% of the cases:

python benchmark/python/bpf.py bpf --infile /tmp/2G.pcap --packets -1 --filter "tcp[20:1] > 65 and tcp[20:1] < 90" --optimize 1 cmp -e pcap jit-linux linux llvm jit-freebsd
bpf bpfjitfreebsd bpfjitlinux bpflinux bpfllvm
bpf M:654619 F:8009263 m: 0 f: 0 m: 0 f: 0 m: 0 f: 0 m: 0 f: 81831
bpfjitfreebsd M:654619 F:8009263 m: 0 f: 0 m: 0 f: 0 m: 0 f: 81831
bpfjitlinux M:654619 F:8009263 m: 0 f: 0 m: 0 f: 81831
bpflinux M:654619 F:8009263 m: 0 f: 81831
bpfllvm M:572788 F:8091094


While I won't keep the code up to date with latest enhancements in libtrace, linux or freebsd kernel, I think it may be a good starting point for others.
Therefore the code is available on git. The project bpftk consists of all the pieces and snippets I copied together:

  • the different bpf engines (all build as libraries)
    • /jit-linux
    • /jit-freebsd
    • /linux
    • /llvm
  • bpfc compiler copied from netsniff-ng (library)
    • /bpfc
  • antlr pcap-syntax grammar and conversion things
    • /antlr
  • benchmark
    • /benchmark/python - the python script to interface the bpf libraries and have fun with bpf filters, e.g. graphing a filter, dumping the jit code, overall great for prototyping …
    • benchmark/c - the c code to benchmark for real numbers

I threw in the minimum of automagic, to make sure I can compile it properly myself. May not work for you though.


2011:12:28:bpf_performance [carnivore news]

1 | | 2012/03/05 16:41 | reply

2011/12/28/bpf_performance.txt · Last modified: 2012/01/04 13:45 by common
chimeric.de = chi`s home Creative Commons License Valid CSS Driven by DokuWiki do yourself a favour and use a real browser - get firefox!! Recent changes RSS feed Valid XHTML 1.0