
The git-daemon activity for the dionaea.git repository: pulls and unique hosts per day. Basically, 5-10 users update their software daily.
Often the most complex part of data visualization is the processing required before you can provide the data in a format your visualization software understands.
I chose the git-daemon logs as an example of such a case.
One could have used sshd logs as an example too, but I chose these, as I'm pretty sure there is no parser for the git-daemon logfiles.
If in doubt, I'm pretty confident one could adjust this git-daemon parser to deal with sshd logs too.
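To give an idea of what such preprocessing can look like, here is a minimal sketch that counts connections and unique hosts per day. It is not the parser from the post. The syslog-style line layout, the `CONNECTION` pattern, the `hosts_per_day` helper and the sample lines are all assumptions made for illustration, so adjust the pattern to whatever your git-daemon actually writes.

```python
import re
from collections import defaultdict

# Assumed syslog-style layout, e.g.
#   "Jan 29 10:00:01 box git-daemon[101]: Connection from 192.0.2.10:40000"
# Only IPv4 addresses and hostnames are handled by this pattern.
CONNECTION = re.compile(
    r"^(?P<day>\w{3}\s+\d{1,2}) \S+ \S+ git-daemon\[\d+\]: "
    r"Connection from (?P<host>[^\s:]+):\d+$"
)


def hosts_per_day(lines):
    """Return {day: (connection_count, unique_host_count)}."""
    pulls = defaultdict(int)
    hosts = defaultdict(set)
    for line in lines:
        match = CONNECTION.match(line.strip())
        if match is None:
            continue  # not a connection line, skip it
        day = " ".join(match["day"].split())  # normalise "Jan  5" to "Jan 5"
        pulls[day] += 1
        hosts[day].add(match["host"])
    return {day: (pulls[day], len(hosts[day])) for day in pulls}


# Made-up sample lines using documentation address ranges.
SAMPLE = [
    "Jan 29 10:00:01 box git-daemon[101]: Connection from 192.0.2.10:40000",
    "Jan 29 10:00:01 box git-daemon[101]: Request upload-pack for '/dionaea.git'",
    "Jan 29 18:30:12 box git-daemon[102]: Connection from 192.0.2.10:40511",
    "Jan 30 09:15:44 box git-daemon[103]: Connection from 198.51.100.7:51234",
]

if __name__ == "__main__":
    for day, (count, unique) in hosts_per_day(SAMPLE).items():
        print(day, count, unique)
```

The resulting per-day pairs are the kind of table most plotting tools accept directly. For sshd, one would mostly swap the regular expression.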
If you want to use a native library in Python, but there is no binding, you can 'try' to interface the library with ctypes.
As I wanted to play with BPF, which is part of libpcap, which lacks a Python 3 binding, I decided to try ctypes.
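The basic ctypes pattern is always the same: load the shared library, declare the C prototype, call the function. The sketch below uses libc's `strlen` as a stand-in, since libpcap may not be installed everywhere. It assumes a POSIX system, and `c_strlen` is just a hypothetical helper name. For libpcap, one would load `find_library("pcap")` instead and declare its functions the same way.

```python
import ctypes
import ctypes.util

# Load the C library. If find_library() finds nothing, CDLL(None) falls
# back to the symbols already loaded into the process (POSIX only).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C prototype: size_t strlen(const char *s);
# Without argtypes/restype, ctypes assumes int arguments and an int result.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text):
    """Call the native strlen() on the UTF-8 encoding of text."""
    return libc.strlen(text.encode("utf-8"))


if __name__ == "__main__":
    print(c_strlen("dionaea"))
```

Declaring `argtypes` and `restype` explicitly matters more once pointers and structs are involved, which is the case for most of libpcap.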
What I wanted to do:
We will create images showing the correlation between attacker host, vulnerability, and malware.
Basically, the image will look like this:

I had to cheat to get the image to a valid size …

Presenting data in a human-compatible way is a problem; rumor has it that at this stage of evolution pictures work best.
Therefore, here are some hints on how to create graphs using the dionaea logsql SQLite database.
There is malware downloading files from rapidshare to install on your drive.
Nothing new, I've had shellcode downloading files from rapidshare before:
| first | last | hits | url |
| 2010-01-06 | 2010-01-07 | 2 | hxtp://rapidshare.com/files/331049304/hitman1 |
| 2010-01-08 | 2010-01-10 | 2 | hxtp://rapidshare.com/files/332058885/two |
| 2010-01-12 | 2010-01-12 | 1 | hxtp://rapidshare.com/files/333804484/roo |
| 2010-01-17 | 2010-01-17 | 1 | hxtp://rapidshare.com/files/335701706/uhit |
| 2010-01-20 | 2010-01-20 | 1 | hxtp://rapidshare.com/files/337582552/newtom |
| 2010-01-20 | 2010-01-20 | 1 | hxtp://rapidshare.com/files/337582552/newtom |
| 2010-01-21 | 2010-01-21 | 1 | hxtp://rapidshare.com/files/338398794/tomhas |
| 2010-01-21 | 2010-01-21 | 1 | hxtp://rapidshare.com/files/338403156/farhas |
| 2010-01-25 | 2010-01-25 | 1 | hxtp://rapidshare.com/files/340552045/tomd |
| 2010-01-27 | 2010-01-27 | 1 | hxtp://rapidshare.com/files/341701463/tsa |
| 2010-01-27 | 2010-01-27 | 1 | hxtp://rapidshare.com/files/341737994/xc |
| 2010-01-29 | 2010-01-30 | 2 | hxtp://rapidshare.com/files/342702954/dams |
However, the shellcode downloads the files directly.
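A table like the one above can be produced with a single GROUP BY query. The following is a minimal sketch, assuming a simplified, hypothetical `offers(seen, url)` table. This is not the actual logsql schema, and the timestamps in the sample rows are made up.

```python
import sqlite3

# first/last/hits per URL, restricted to rapidshare file links.
QUERY = """
SELECT date(MIN(seen)) AS first_seen,
       date(MAX(seen)) AS last_seen,
       COUNT(*)        AS hits,
       url
FROM offers
WHERE url LIKE '%rapidshare.com/files/%'
GROUP BY url
ORDER BY first_seen
"""


def rapidshare_offers(conn):
    """Return (first, last, hits, url) rows for rapidshare URLs."""
    return conn.execute(QUERY).fetchall()


def demo_conn():
    """Build an in-memory database with a few made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE offers (seen TEXT, url TEXT)")
    conn.executemany(
        "INSERT INTO offers VALUES (?, ?)",
        [
            ("2010-01-06 10:00:00", "hxtp://rapidshare.com/files/331049304/hitman1"),
            ("2010-01-07 23:10:00", "hxtp://rapidshare.com/files/331049304/hitman1"),
            ("2010-01-12 04:30:00", "hxtp://rapidshare.com/files/333804484/roo"),
            ("2010-01-12 05:00:00", "hxtp://example.invalid/other"),
        ],
    )
    return conn


if __name__ == "__main__":
    for row in rapidshare_offers(demo_conn()):
        print(" | ".join(str(col) for col in row))
```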
As promised, I uploaded virustotal results for *every* file in the paris db.
The packed SQL data is 600k; to use it:
bunzip2 paris-20091207-missionpack_avs.sql.bz2
sqlite3 logsql.sqlite < paris-20091207-missionpack_avs.sql
I can recommend sqliteman for playing with the database.
I hacked a script to retrieve the virustotal results for the files mentioned in the paris database, and store the results in the paris database so I could query them.
Unfortunately dionaea does not submit to virustotal.com (yet), therefore there are signatures missing for 'some' (75%) files.
Afterwards I designed some queries to retrieve stats about different things.
As I was interested in the share of Conficker attacks, I decided to retrieve some numbers from the paris database.
As I don't know which files count as Conficker, I had to rely on av vendor signatures.
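Relying on signatures boils down to a substring match on the scanner results. Here is a minimal sketch, assuming two hypothetical tables: `downloads(md5)` with one row per download, and `av_results(md5, scanner, result)`. These are not the real paris schema, and the scanner names and signatures in the demo are made up. Vendors use different names for the same family (Conficker is also known as Downadup or Kido), so the list of names is a parameter.

```python
import sqlite3


def signature_share(conn, names=("conficker",)):
    """Fraction of downloads whose file has a signature matching any name."""
    clause = " OR ".join("lower(result) LIKE ?" for _ in names)
    params = [f"%{name.lower()}%" for name in names]
    total = conn.execute("SELECT COUNT(*) FROM downloads").fetchone()[0]
    matched = conn.execute(
        "SELECT COUNT(*) FROM downloads WHERE md5 IN "
        f"(SELECT md5 FROM av_results WHERE {clause})",
        params,
    ).fetchone()[0]
    return matched / total if total else 0.0


def demo_conn():
    """In-memory database with made-up hashes and signatures."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE downloads (md5 TEXT)")
    conn.execute("CREATE TABLE av_results (md5 TEXT, scanner TEXT, result TEXT)")
    conn.executemany(
        "INSERT INTO downloads VALUES (?)",
        [("aaa",), ("aaa",), ("aaa",), ("bbb",)],
    )
    conn.executemany(
        "INSERT INTO av_results VALUES (?, ?, ?)",
        [
            ("aaa", "ScannerA", "Worm.Conficker.X"),
            ("aaa", "ScannerB", "Net-Worm.Kido.Y"),
            ("bbb", "ScannerA", "Trojan.Generic"),
        ],
    )
    return conn


if __name__ == "__main__":
    print(signature_share(demo_conn(), names=("conficker", "kido", "downadup")))
```

Files without any virustotal result count as "not matched" here, so with many unscanned files the share is a lower bound.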
Andrew Waite downloaded the sqlite datasets and blogged about his results running his mimic-nepstats.py script. As I was surprised by the time it took for the paris dataset, I had to investigate.
For me, the paris dataset took more than 30 minutes, even after I rewrote some of the queries to make it faster, but he said it was done in about 3 minutes.
So, I gave it a shot, and he was right. It was even faster than the 3 minutes he claimed; I could do it in about 2 minutes.
The only difference I could figure out was that my initial test did not use the anonymized database.
I gave it a shot, and the non-anonymized database was rather sluggish compared to the anonymized db.
The steps to create the anonymized db involved dumping the original db and restoring the dump to a new database.
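The dump-and-restore step can be reproduced from Python with the standard sqlite3 module. This is a minimal sketch, not the procedure used for the published datasets: `dump_and_restore` is a hypothetical helper, and the demo table is made up.

```python
import sqlite3


def dump_and_restore(src, dst):
    """Replay a full SQL dump of connection src into the empty connection dst."""
    script = "\n".join(src.iterdump())  # same text the sqlite3 shell's .dump emits
    dst.executescript(script)
    dst.commit()


def demo():
    """Copy a tiny made-up database and return the rows found in the copy."""
    src = sqlite3.connect(":memory:")
    src.execute(
        "CREATE TABLE connections (connection INTEGER PRIMARY KEY, remote_host TEXT)"
    )
    src.executemany(
        "INSERT INTO connections (remote_host) VALUES (?)",
        [("192.0.2.1",), ("198.51.100.7",)],
    )
    src.commit()
    dst = sqlite3.connect(":memory:")
    dump_and_restore(src, dst)
    return dst.execute(
        "SELECT connection, remote_host FROM connections ORDER BY connection"
    ).fetchall()


if __name__ == "__main__":
    print(demo())
```

For a real file, one would pass `sqlite3.connect("logsql.sqlite")` as the source and a connection to a new, not yet existing file as the destination.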
Microsoft Malware Protection Center recently had a news item about "Do and don’ts for p@$$w0rd$", but they only released some statistics about the data gathered. That's common: raw data is dangerous for the decoys, nobody wants to reveal their honeypot's address, and raw data is pretty large.
But as current technology allows data compression, and we are confident our anonymization protects both decoys and attackers, we decided to release raw data.
We offer two sqlite databases 1),
Please let us know if you post or blog about it, so we can link it here.
A simple mail to nepenthesdev@gmail.com, or the still virgin #dionaea hashtag on twitter will do the trick.
Nepenthes had awful logging: huge logfiles, pretty useless for most people. Some people even started writing parsers for the logfiles to extract and convert the useful information for use in a database.
For dionaea, I decided to stick with awful logging to textfiles, but provide a useful alternative which is easy to set up and maintain, is feature rich, and allows retrieving information in a useful way, so you don't have to grep.
Therefore, SQLite is used to write useful information down to disk in the logsql.py script.
I know, SQLite is not PostgreSQL. PostgreSQL is superior in many ways, but it requires some more steps to set up, whereas SQLite just works out of the box.
SQLite only allows one writer at a time, but as dionaea does not access the database simultaneously, there were no problems with database concurrency.
On the other hand, if it works with SQLite, it will work with PostgreSQL too, all you'll have to do is adjust some things.
There is no fixed definition of useful information, therefore I decided to go for things I want to see for now:
connections
exploits
malware offers
malware downloads
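To illustrate how little is needed to log such events to SQLite, here is a minimal sketch covering two of the items above, connections and malware downloads. The schema, column names and helper functions are hypothetical and much simpler than what logsql.py really does.

```python
import sqlite3
import time

# Hypothetical two-table schema; each download refers to its connection.
SCHEMA = """
CREATE TABLE IF NOT EXISTS connections (
    connection  INTEGER PRIMARY KEY,
    timestamp   REAL NOT NULL,
    remote_host TEXT NOT NULL,
    local_port  INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS downloads (
    download   INTEGER PRIMARY KEY,
    connection INTEGER NOT NULL REFERENCES connections(connection),
    url        TEXT NOT NULL,
    md5_hash   TEXT NOT NULL
);
"""


def open_log(path=":memory:"):
    """Open (or create) the log database and make sure the tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def log_connection(conn, remote_host, local_port, timestamp=None):
    """Insert one connection row and return its id."""
    cur = conn.execute(
        "INSERT INTO connections (timestamp, remote_host, local_port) "
        "VALUES (?, ?, ?)",
        (time.time() if timestamp is None else timestamp, remote_host, local_port),
    )
    conn.commit()
    return cur.lastrowid


def log_download(conn, connection, url, md5_hash):
    """Insert one download row that belongs to an existing connection."""
    cur = conn.execute(
        "INSERT INTO downloads (connection, url, md5_hash) VALUES (?, ?, ?)",
        (connection, url, md5_hash),
    )
    conn.commit()
    return cur.lastrowid


if __name__ == "__main__":
    db = open_log()
    cid = log_connection(db, "192.0.2.1", 445)
    log_download(db, cid, "hxtp://example.invalid/file", "d41d8cd98f00b204e9800998ecf8427e")
    print(db.execute(
        "SELECT remote_host, url FROM downloads JOIN connections USING (connection)"
    ).fetchall())
```

Because everything hangs off the connection id, questions like "which hosts offered which files" become a plain join instead of a grep through textfiles.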