data visualisation

new files
Presenting data in a human compatible way is a problem, rumors say at this stage of evolution pictures work best.
Therefore some hints how to create graphs using the dionaea logsql sqlite database.

sqlite

As dionaea embeds sqlite, it is rather easy to extract the data.

the 'uniq files' query

We want to graph the number of new uniq files a day, I used the following query to make sqlite create the dataset which can be used to graph.

SELECT 
	strftime('%Y-%m-%d',connection_timestamp,'unixepoch','localtime') AS date,
	MAX(downloads.download) AS total,
	COUNT(downloads.download) AS today,
	(
		SELECT 
			COUNT(DISTINCT x.download_md5_hash) 
		FROM 
			downloads AS x 
		NATURAL JOIN 
			connections AS c 
		WHERE 
			strftime('%Y-%m-%d',c.connection_timestamp,'unixepoch','localtime') 
			<= strftime('%Y-%m-%d',connections.connection_timestamp,'unixepoch','localtime')
	) AS uniq,
	(
		SELECT 
			COUNT(*) 
		FROM
			(
				SELECT
					a.download_md5_hash
				FROM 
					downloads AS a
				JOIN 
					connections AS b ON(a.connection = b.connection)
				GROUP BY 
					a.download_md5_hash
				HAVING 
					strftime('%Y-%m-%d',MIN(b.connection_timestamp),'unixepoch','localtime') 
					= strftime('%Y-%m-%d',MAX(connections.connection_timestamp),'unixepoch','localtime')
			)
	)AS uniq_this_day,
	(
		SELECT 
			COUNT(*) 
		FROM
			(
				SELECT
					MIN(a.download) AS download
				FROM 
					downloads AS a
				JOIN 
					connections AS b ON(a.connection = b.connection)
				GROUP BY 
					a.download_md5_hash
				HAVING 
					strftime('%Y-%m-%d',MIN(b.connection_timestamp),'unixepoch','localtime') 
				= strftime('%Y-%m-%d',connections.connection_timestamp,'unixepoch','localtime')
			) AS newdownloads
			NATURAL JOIN downloads
		WHERE 
			download_url LIKE 'logxmpp://%'
	)AS uniq_this_day_via_logxmpp
FROM 
	downloads
NATURAL JOIN 
	connections
GROUP BY
	date
ORDER BY
	date DESC;

uniqfiles.sql

performance

The queries performance is horrible, for each day we run 3 subqueries, 2 of these subqueries have another subquery. The grouping by the generated date-string is suboptimal too, still the query takes about a second on a moderate sized database. Running the query on the paris and berlin databases took ~90s for berlin, and ~679s for paris. When working with really large databases, it might be usefull to create a temp table with the data which is required in the representation which is used.

create the dataset

sqlite3 /opt/dionaea/var/dionaea/logsql.sqlite
.timer on
.output uniqfiles.txt
.read uniqfiles.sql
CPU Time: user 1.850000 sys 0.000000

gnuplot

As the “new” files distribution is more or less random, the graphs look like a saw, which makes them hard to get, therefore I decided to smooth the graphs for these numbers, rendering the graph inexact, but giving a good overview of the tendency.

plot new uniq files

new files

set terminal png size 600,120 nocrop butt font "/usr/share/fonts/truetype/ttf-liberation/LiberationSans-Regular.ttf" 8
set output "newfiles.png"
set xdata time
set timefmt "%Y-%m-%d"
set format x "%b %d"
set ylabel "total"
set y2label "new"
set y2tics
set datafile separator "|"
plot 'uniqfiles.txt' using 1:4 title "uniq" with lines, "" using 1:6 smooth bezier title "new uniq" with lines axes x1y2, "" using 1:5 smooth bezier title "new xmpp" with  lines axes x1y2

plot total files

total files

set terminal png size 800,160 nocrop butt font "/usr/share/fonts/truetype/ttf-liberation/LiberationSans-Regular.ttf" 8
set output "totalfiles.png"
set xdata time
set timefmt "%Y-%m-%d"
set format x "%b %d"
set ylabel "total"
set y2label "downloads"
set y2tics
set datafile separator "|"
plot 'uniqfiles.txt' using 1:2 title "total" with lines, "" using 1:3 smooth bezier title "downloads" with lines axes x1y2

Comments

1

Nice work as usual :D

You always manage to beat me to the punch, had Daviz on my radar for a while for similar jobs but haven't managed to find the time to get anything workable yet.

Andrew Waite
2010/06/06 13:52


2010/06/06/data_visualisation.txt · Last modified: 2010/06/15 01:00 by common
chimeric.de = chi`s home Creative Commons License Valid CSS Driven by DokuWiki do yourself a favour and use a real browser - get firefox!! Recent changes RSS feed Valid XHTML 1.0