Table of Contents

arcane bugs

dionaea is meant to support SIGHUP, allowing you to rotate logfiles without restarting the service.
Basically it works, but I got noticed there things were going wrong after SIGHUP, events would appear multiple times in the sql logging.

symptoms

For example you'd have multiple downloads of the same file for a single event.

  connection 1828337 smbd tcp accept a.b.c.d:445 <- w.x.y.z:4029
   dcerpc request: uuid '4b324fc8-1670-01d3-1278-5a47bf6ee188' opnum 31
   dcerpc bind: uuid '4b324fc8-1670-01d3-1278-5a47bf6ee188' transfersyntax 8a885d04-1ceb-11c9-9fe8-08002b104860
   offer: http://w.x.y.z:4344/mfbjb
   offer: http://w.x.y.z:4344/mfbjb
   offer: http://w.x.y.z:4344/mfbjb
   download: 87136c488903474630369e232704fa4d http://w.x.y.z:4344/mfbjb
   download: 87136c488903474630369e232704fa4d http://w.x.y.z:4344/mfbjb
   download: 87136c488903474630369e232704fa4d http://w.x.y.z:4344/mfbjb
   download: 87136c488903474630369e232704fa4d http://w.x.y.z:4344/mfbjb
   download: 87136c488903474630369e232704fa4d http://w.x.y.z:4344/mfbjb
   download: 87136c488903474630369e232704fa4d http://w.x.y.z:4344/mfbjb
   profile: [{'return': '0x7df20000', 'args': ['urlmon'], 'call': 'LoadLibraryA'}, {'return': '0', 'args': ['', 'http://w.x.y.z:4344/mfbjb', 'x.', '0', '0'], 'call': 'URLDownloadToFile'}, {'return': '0x00000000', 'args': ['x.'], 'call': 'LoadLibraryA'}, {'return': '0', 'args': ['0'], 'call': 'ExitThread'}]

The offer was logged 3 times, the download 6 times.

more symptoms

I added some debug code, so we could see which incident handlers get deleted during SIGHUP, and it turned out some handlers were not caught by pythons garbage collector.

python message: Import ihandlers exists, reloading
ihandlers debug: deleting <ftp.ftpdownloadhandler object at 0xb71f2f6c>
ihandlers debug: deleting <tftp.tftpdownloadhandler object at 0xb719ef6c>
ihandlers debug: deleting <emu.emuprofilehandler object at 0xb719edac>
ihandlers debug: deleting <cmd.cmdshellhandler object at 0xb719efec>
ihandlers debug: deleting <store.storehandler object at 0xb71a208c>
ihandlers debug: deleting <test.uniquedownloadihandler object at 0xb719ef2c>
ihandlers debug: deleting <surfids.surfidshandler object at 0xb71af3ac>
ihandlers debug: deleting <logsql.logsqlhandler object at 0xb6f9dfec>
incident warning: ihandler_free i 0x82d17c8
incident warning: ihandler_free i 0x82d1770
incident warning: ihandler_free i 0x82c2410
incident warning: ihandler_free i 0x82d4280

Eight handlers should have been free'd, four got free'd.

dr. house

Yesterday we've had the possibility to debug the problem on a running system, as dionaea provides a cli to the python interpreter, this is much better than post-mortem analysis, which basically reduces debugging to grepping logfiles.
I got the idea something was wrong with the reference counting, and remembered reading about objgraph to debug python ref counting problems.
As the dionaea cli was pretty noisy, the first step was to block all traffic to port 445, so no more attacks were coming in, allowing us to analyze a 'frozen' system.
Next thing was objgraph, porting to python3 was trivial thanks to 2to3, afterwards I added some code to store images with random names, so I could create multiple images in a loop, you can grab it here.

At first, we got an overview which handlers failed to unload:

import objgraph
for i in ['ftpdownloadhandler','tftpdownloadhandler','emuprofilehandler','cmdshellhandler','storehandler','uniquedownloadihandler','surfidshandler','logsqlhandler','p0fhandler']:
	print("type %s count %i" % ( i, objgraph.count(i)))
type ftpdownloadhandler count 1
type tftpdownloadhandler count 1
type emuprofilehandler count 2
type cmdshellhandler count 1
type storehandler count 2
type uniquedownloadihandler count 1
type surfidshandler count 2
type logsqlhandler count 2
type p0fhandler count 0

emuprofilehandler, storehandler, surfidshandler, logsqlhandler had more than one instance, so they failed to unload on sighup.

Getting a list of the entities did not help at all:

for i in ['emuprofilehandler','storehandler''surfidshandler','logsqlhandler']:
	print("type %s items %s" % ( i, objgraph.by_type(i)))
type emuprofilehandler items [<emu.emuprofilehandler object at 0xb389c0ec>, <emu.emuprofilehandler object at 0xb719edac>]
type storehandler items [<store.storehandler object at 0xb389c16c>, <store.storehandler object at 0xb71a208c>]
type surfidshandler items [<surfids.surfidshandler object at 0xb389c1ac>, <surfids.surfidshandler object at 0xb71af3ac>]
type logsqlhandler items [<logsql.logsqlhandler object at 0xb3896d6c>, <logsql.logsqlhandler object at 0xb6f9dfec>]

But it allowed drawing graphs of the objects referencing:

for i in ['emuprofilehandler','storehandler','surfidshandler','logsqlhandler']:
	l = objgraph.by_type(i)
	objgraph.show_refs(l)

and the backrefs:

for i in ['emuprofilehandler','storehandler','surfidshandler','logsqlhandler']:
	l = objgraph.by_type(i)
	objgraph.show_backrefs(l)

I received a copy of the images created by objgraph and was rather confused, the objects had no references, so why on earth did python keep them alive?

objgraph backrefs for storehandler
objgraph backrefs for storehandler
objgraph backrefs for logsqlhandler
objgraph backrefs for logsqlhandler

We went for the questionable objects reference counts:

for i in ['ftpdownloadhandler','tftpdownloadhandler','emuprofilehandler','cmdshellhandler','storehandler','uniquedownloadihandler','surfidshandler','logsqlhandler','p0fhandler']:
	l = objgraph.by_type(i)
	for j in l:
		print("obj %s refs %i" % (j, sys.getrefcount(j)))

and things got scary:

obj <ftp.ftpdownloadhandler object at 0xb71fa06c> refs 5
obj <tftp.tftpdownloadhandler object at 0xb389c04c> refs 5
obj <emu.emuprofilehandler object at 0xb389c0ec> refs 5
obj <emu.emuprofilehandler object at 0xb719edac> refs 7
obj <cmd.cmdshellhandler object at 0xb389c14c> refs 5
obj <store.storehandler object at 0xb389c16c> refs 6518
obj <store.storehandler object at 0xb71a208c> refs 10909
obj <test.uniquedownloadihandler object at 0xb389c18c> refs 5
obj <surfids.surfidshandler object at 0xb389c1ac> refs 13031
obj <surfids.surfidshandler object at 0xb71af3ac> refs 17422
obj <logsql.logsqlhandler object at 0xb3896d6c> refs 13032
obj <logsql.logsqlhandler object at 0xb6f9dfec> refs 17424

the logsql handler had more than 13k references(!), even though the images revealed nothing pointing to it.
My idea was memory corruption and I asked him to run the software in valgrind, even though it would not solve the problem, it might help investigating it.

Reusing the gained knowledge about objgraph, I gave it a shot, and was able to reproduce the refcount increases by issuing a download. A successful download would increase the storehandlers and the logsqlhandlers refcount by one.

It was weird, I had a look on the code, the storehandler does not reference anything, so why did the refcount increase for each download?

To improve my cognitive skills I had some pre-christmas-beers in a local pub with some of my workmates. It worked pretty well, when I got home, I had already forgotten the problem.

Today, seeing my notes on the screen, I remembered there was something to take care of, and decided to give it one more shot.
At this point I was able to reproduce it, but had no idea what caused it.
I figured I'd need something to track when the refcount of the PyObject gets increased.
gdb was an option, if I had the address of the python objects, and besides from getting the address of the objects from the logs, objgraph would deliver them too.
For some reason gdb did not intercept SIGINT for dionaea, I think it is related to signalfd and the glib threadpool, so I disabled the sigint event callback and gdb would intercept the SIGINT signal.

I started dionaea, and ran the objgraph code to get the questionable objects address:

import objgraph
l = objgraph.by_type('storehandler')
for j in l:
  print("obj %s refs %i" % (j, sys.getrefcount(j)))
obj <store.storehandler object at 0xc804d0> refs 5

I used gdb to investigate the object and set a breakpoint on the ob_refcnt:

(gdb) p *(PyObject *)0xc804d0
$1 = {ob_refcnt = 1, ob_type = 0xdfdfe0}
(gdb) watch (*(PyObject *)0xc804d0)->ob_refcnt
Hardware watchpoint 1: (*(PyObject *)0xc804d0)->ob_refcnt
(gdb) c
Continuing.

Then I enforced a download:

from ftp import ftp
f = ftp()
f.download(None, 'anonymous','guest','ftp.kernel.org',21, 'welcome.msg', 'binary', "test")

After the download was done, the file was given to the store handler, and I got actions on my breakpoint:

Hardware watchpoint 1: (*(PyObject *)0xc804d0)->ob_refcnt
 
Old value = 1
New value = 2
__pyx_f_6python_c_python_ihandler_cb (__pyx_v_i=0xb63400, __pyx_v_ctx=0xc804d0) at binding.c:9329
9329		__Pyx_DECREF(((PyObject *)__pyx_v_handler));
 
Hardware watchpoint 1: (*(PyObject *)0xc804d0)->ob_refcnt
 
Old value = 2
New value = 3
PyMethod_New (func=0xc7e5a0, self=0xc804d0) at Objects/classobject.c:70
 
Hardware watchpoint 1: (*(PyObject *)0xc804d0)->ob_refcnt
 
Old value = 3
New value = 4
method_call (func=0xc7e5a0, arg=0xc93b10, kw=0x0) at Objects/classobject.c:316
 
Hardware watchpoint 1: (*(PyObject *)0xc804d0)->ob_refcnt
 
Old value = 4
New value = 5
PyEval_EvalCodeEx (co=0xc75c30, globals=<value optimized out>, locals=<value optimized out>, args=0xc83218, argcount=<value optimized out>, kws=<value optimized out>, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x0)
 
Hardware watchpoint 1: (*(PyObject *)0xc804d0)->ob_refcnt
 
Old value = 5
New value = 4
0x00007ffff31cd3e7 in tupledealloc (op=0xc83200) at Objects/tupleobject.c:221
 
Hardware watchpoint 1: (*(PyObject *)0xc804d0)->ob_refcnt
 
Old value = 4
New value = 3
0x00007ffff326ed3c in method_dealloc (im=0x785200) at Objects/classobject.c:175
 
Hardware watchpoint 1: (*(PyObject *)0xc804d0)->ob_refcnt
 
Old value = 3
New value = 2
 
Continuing.
>>> 
>>>

So, there was no memory corruption, there was a simple refcount leak in the python binding. I updated Cython to 0.12, reran everything, problem persisted. The C code Cython generated from the this pyx code would leak a reference on ctx:

cdef void c_python_ihandler_cb (c_incident *i, void *ctx) except *:
	cdef ihandler handler
	cdef incident pi
	handler = <ihandler>ctx
	pi = NEW_C_INCIDENT_CLASS(incident)
	pi.thisptr = i
	INIT_C_INCIDENT_CLASS(pi,pi)
	origin = pi.origin
	if isinstance(origin, bytes):
		origin = origin.decode(u'ascii')
	elif not isinstance(origin, unicode):
		raise ValueError(u"requires text/bytes input, got %s" % type(origin))
	origin = origin.replace(u".",u"_")
	try:
		method = getattr(handler, u"handle_incident_" + origin)
	except:
		handler.handle_incident(pi)
		return
	method(pi)

As the uncrustified version of the generated code is about 500 lines and pretty hard to get, I decided to change path and look for a minimal trigger for the bug instead:

i=incident("dionaea.download.complete")
i.report()

the storehandler showed an refcount increase by one:

obj <store.storehandler object at 0xcff4d0> refs 6

the last 7 minutes

I reduced the cython code to:

cdef void c_python_ihandler_cb (c_incident *i, void *ctx) except *:
	cdef ihandler handler
	cdef incident pi
	handler = <ihandler>ctx
	pi = NEW_C_INCIDENT_CLASS(incident)
	pi.thisptr = i
	INIT_C_INCIDENT_CLASS(pi,pi)
	origin = pi.origin
	if isinstance(origin, bytes):
		origin = origin.decode(u'ascii')
	elif not isinstance(origin, unicode):
		raise ValueError(u"requires text/bytes input, got %s" % type(origin))
	origin = origin.replace(u".",u"_")
	handler.handle_incident(pi)

The bug persisted, I read the Cython generated code, added printfs, found nothing, I even thought about filing a Cython bug, finally removing the except * from the function declaration, fixed it. So, it had to do with exceptions, I had a look on the traceback exception handler, and I was missing 3 Py_XDECREFs there.

My initial trigger, the ftp line, caused an exeception as the connection ref was None, where it should not be None for logsql, this was not on purpose - a better trigger would not have reproduced the symptoms.

Thats it, once again no lupus.