sqlite performance

Andrew Waite downloaded the sqlite datasets and blogged about his results running his mimic-nepstats.py script, as I was surprised about the time it took for the paris dataset, I had to investigate. For me, the paris dataset took more than 30minutes, and I even rewrote some of the queries to make it faster, but he said it was done in about 3minutes. So, I gave it a shot, and he was right, it was even faster then the 3 minutes he claimed, I could to it in about ~2minutes.

The only difference I could figure out, my initial test did not use the anonymized database. I gave it a shot, and the not-anonymized database was rather sloppy compared to the anonymized db. The steps to create the anonymized db involved dumping the original db and restoring the dump to a new database. So, I spent some cycles on comparing performance on the original database and a restored dump from the original database. As I've found ANALYZE while digging for reasons in the performance difference, I decided to test ANALYZE impact on sqlite performance.

the benchmark

I think everything but sqliteversion, amount of ram installed and disk speed is unimportant, therefore …

Hardware&Software:

  • hard
    • 8G RAM
    • quadcore intel
    • a single sata drive (~100MB/s read/write)
  • soft
    • ubuntu 9.10/x86_64
    • sqlite 3.6.16 (from apt)

So, I prepared 4 files:

  • o - the original file as gatherd from dionaea
  • oa - like o, with ANALYZE
  • d - dumped o, reimported into empty db
  • da - like a, with ANALYZE

and 3 tasks, two queries for each, one query written by Andrew Waite, and a version by myself, both queries are supposed to give the same results.

For the results, the uppermost row defines the database by name as mentioned previously, aw is Andrew Waite's query to solve the problem, mk is my version.

There are 3 results for each database, minimal value, maximal value and average. Each query was run 10 times, the maximal value usually indicates the time of the first query on the db, the following iterations are faster as the operating system caches the required parts of the database.

If the time taken is less than a second, the format is ss.ms, so you can see the milliseconds, if the time is more than a second, the format is MM:SS, so you can see the minutes and seconds.

In general I use the max time, as we can not rely on having the data cached already.

IMPORTANT all databases contain the same data, use the same database engine, the only difference is in providing ANALYZE informations and the database's past, the o database grew incremental, the d databases were created in a single transaction.

the first date

The first query taken was the query to retrieve the initial date of the first malware download.

This the the query Andrew took to retrieve the date:

SELECT 
	connections.connection_timestamp 
FROM 
	connections, 
	downloads 
WHERE 
	downloads.connection = connections.connection 
ORDER BY 
	connections.connection_timestamp 
LIMIT 1

I decided to join the downloads&connection table naturally, and retrieve the MIN value instead.

SELECT 
	MIN(connections.connection_timestamp) 
FROM 
	connections 
	NATURAL JOIN downloads

Andrews approach to retrieve the first value and limit by one is valid too, and given the ordering of the data even pretty smart.

results

o oa d da
firstdate min max avg min max avg min max avg min max avg
mk 00.00 00.01 00.00 00.00 00.01 00.00 00.00 00.01 00.00 00.00 00.01 00.00
aw 00.00 00.01 00.00 00.00 00.01 00.00 00.00 00.01 00.00 00.00 00.01 00.00

Basically, both querys scale fine on the 4G database, far less than a seconds for each query, and if we'd look at microseconds Andrews query would be faster.

unique ips of attackers causing a successful downloads

Retrieving the number of unique hosts attacking the sensor and causing a successful download is more complex.

This is Andrews query:

SELECT 
	connections.remote_host 
FROM 
	connections, downloads 
WHERE 
	downloads.connection = connections.connection 
GROUP BY 
	connections.remote_host

In fact, he does not query the number of hosts, he queries for all hosts, and counts them by iterating through the result set.

I decided to join naturally, and have the database count instead of iterating through the results set myself.

SELECT 
	COUNT(DISTINCT connections.remote_host) 
FROM 
	connections 
	NATURAL JOIN downloads

results

This results are really surprising:

o oa d da
uips min max avg min max avg min max avg min max avg
mk 00:13 00:52 00:19 00:02 00:47 00:09 00:13 00:19 00:14 00:02 00:08 00:03
aw 00:17 04:18 00:42 00:17 08:35 01:08 00:16 01:47 00:25 00:16 00:33 00:18

Andrews initial query on the o database takes 4:18 minutes, the same query takes 8:35 minutes on the oa database, so adding the ANALYZE information can even harm your performance. On the da database, the same query takes 33 seconds.

My own query scales better, on the orig database it takes 52 seconds, which is an improvement to of faktor 4, on the dump with ANALYZE it takes 8 seconds, an improvement of factor 30 compared to Andrews's query on the original dump.

unique downloads

Again, Andrews query,

SELECT 
	download_md5_hash 
FROM 
	downloads 
GROUP BY 
	download_md5_hash

Again, I decided to have the database count the number of values, instead of counting them myself:

SELECT 
	COUNT(DISTINCT download_md5_hash) 
FROM 
	downloads

results

o oa d da
udown min max avg min max avg min max avg min max avg
mk 00:01 00:50 00:05 00:01 00:48 00:06 00:01 00:01 00:01 00.994 00:01 00:01
aw 00.298 01:06 00:07 00.299 02:07 00:13 00.293 00:20 00:02 00.293 00:02 00.472

There is no really large performance difference within the queries, but there is a really large difference depending on the database used, on the o database the query takes about 60 seconds, where it takes about 1 second on the da database.

more findings

It even takes longer to dump the o database:

time sqlite3 o '.dump' | pv > /dev/null 
2.72GB 0:03:48 [12.2MB/s] [     <=>                       ]

real	3m48.544s
user	1m56.790s
sys	0m18.220s
time sqlite3 da '.dump' | pv > /dev/null 
2.72GB 0:02:01 [22.9MB/s] [     <=>                       ]

real	2m1.508s
user	1m48.570s
sys	0m11.000s

conclusion

As the da database was faster than o or oa in every case, it is a good idea to dump the database and recreate it from the dump, if you plan to query it. The da speedup is upto factor 60, so it is definitely worth the time. Creating da is faster then creating oa, and da is faster in any case, where oa performance may even be worse than o. o's size matched d's size, as well as oa's size matched da's size, so the problem's got to be data fragmentation, o got written step by step, data was added to each table, where d was written per table in a single transaction.

reproduce

I'm sorry you won't be able to reproduce this, as I can't offer the o database for download, but here are the steps and the script required to reproduce if you got an valid candiate for an o database yourself.

creating the dbs

copy o & oa
time cp ../paris/taint/logsql.sqlite.orig o
real	1m24.006s
user	0m0.040s
sys	0m11.420s

time cp ../paris/taint/logsql.sqlite.orig oa

real	1m37.954s
user	0m0.040s
sys	0m10.540s
create d by dumping & restoring o, da by copying d
time sqlite3 o '.dump' | pv | sqlite3 d
2.72GB 0:06:29 [7.15MB/s] [     <=>                       ]
real	12m20.656s
user	11m59.950s
sys	1m1.650s


time cp d da
real	1m24.255s
user	0m0.060s
sys	0m10.890s
ANALYZE oa and da
time sqlite3 da "ANALYZE"
real	2m28.563s
user	0m23.340s
sys	0m5.190s


time sqlite3 oa "ANALYZE"
real	34m25.555s
user	0m26.570s
sys	0m26.190s
final dbs
ls -al
-rw-r--r-- 1 common common 4336886784 2009-12-11 22:20 d
-rw-r--r-- 1 common common 4336889856 2009-12-11 22:30 da
-rw-r--r-- 1 common common 4336886784 2009-12-11 22:05 o
-rw-r--r-- 1 common common 4336889856 2009-12-11 23:04 oa

the script

Maybe somebody wants to run similar tests, therefore here is the 'script'. I generates some tables in dokuwiki syntax at the end, adjust to fit your needs.

#!/opt/dionaea/bin/python3
import sqlite3
import sys
from functools import reduce
from datetime import datetime, timedelta
import random
 
queries = {
	'uniquedownloads': { 
		'aw' : { 'query': 'SELECT download_md5_hash FROM downloads GROUP BY download_md5_hash'},
		'mk' : { 'query': 'SELECT COUNT(DISTINCT download_md5_hash) FROM downloads'}
	},
	'uniqueips': { 
		'aw' : { 'query': 'SELECT connections.remote_host FROM connections, downloads WHERE downloads.connection = connections.connection GROUP BY connections.remote_host'},
		'mk' : { 'query': 'SELECT COUNT(DISTINCT connections.remote_host) FROM connections NATURAL JOIN downloads'}
	},
	'firstdate': { 
		'aw' : { 'query': 'SELECT connections.connection_timestamp FROM connections, downloads WHERE downloads.connection = connections.connection ORDER BY connections.connection_timestamp LIMIT 1'},
		'mk' : { 'query': 'SELECT MIN(connections.connection_timestamp) FROM connections NATURAL JOIN downloads'}
	}
}
 
print(queries)
 
 
dbs = sys.argv[1:]
 
rdbs = list(dbs) # copy 
random.shuffle(rdbs)
 
print(rdbs)
print(dbs)
#exit(0)
for name in queries:
	for gen in queries[name]:
		for db in rdbs:
			conn = sqlite3.connect(db)
			c = conn.cursor()
			for i in range(0,10):
				query = queries[name][gen]['query']
				start = datetime.today()
				print(query)
				c.execute(query)
				j=0
				for row in c:
					j = j+1
				print(j)
				stop = datetime.today()
				diff = stop - start
				print(diff)
				if 'results' not in queries[name][gen]:
					queries[name][gen]['results'] = dict()
				if db not in queries[name][gen]['results']:
					queries[name][gen]['results'][db] = dict()
				if 'all' not in queries[name][gen]['results'][db]:
					queries[name][gen]['results'][db]['all'] = list()
				# first run is not accounted, lets cache the database before taking numbers
				print("i is %i" % (i,))
#				if i >= 1:
				queries[name][gen]['results'][db]['all'].append(diff)
print(queries)
 
for name in queries:
	for gen in queries[name]:
		for db in dbs:
			print( "name %s gen %s db %s" % (name, gen, db))
			queries[name][gen]['results'][db]['max'] = max(queries[name][gen]['results'][db]['all'])
			queries[name][gen]['results'][db]['min'] = min(queries[name][gen]['results'][db]['all'])
			queries[name][gen]['results'][db]['avg'] = reduce(lambda x, y: x+y, queries[name][gen]['results'][db]['all'], timedelta()) // len(queries[name][gen]['results'][db]['all'])
 
# print db names
print("| ", end='')
for db in dbs:
	print("|  %s  ||" % (db,), end='')
print("|")
 
# min max avg
for name in queries:
	print("| " , end='')
	for gen in queries[name]:
		for db in dbs:
			for t in queries[name][gen]['results'][db]:
				if t == 'all':
					continue
				print("|  %s  " % (t,) , end='')
		print("|")
		break
	break
 
 
# names
#for name in queries:
#	print("^ %s | " % (name,), end='')
#	for gen in queries[name]:
#		for db in queries[name][gen]['results']:
#			print("||| ", end='')
#		print("")
#		break
 
def pd(d):
	x = d.seconds*1000000 + d.microseconds
	hours, remainder = divmod(x, 3600 * 1000000)
	minutes, remainder = divmod(remainder, 60 * 1000000)
	seconds, remainder = divmod(remainder, 1000000)
	milliseconds, remainder = divmod(remainder, 1000)
 
	if seconds > 0:
		return '%02i:%02i' % (minutes, seconds)
	return '%02i.%02i' % (seconds, milliseconds)	
 
 
# duration_formatted = '%s:%s:%s' % (hours, minutes, seconds)
 
 
for name in queries:
 
# query name
	print("^ %s | " % (name,), end='')
	for gen in queries[name]:
		for db in dbs:
			print("||| ", end='')
		print("")
		break
# results
	for gen in queries[name]:
		print("|| %s " % (gen,), end='')
		for db in dbs:
#			print("| %9.1f | %9.1f | %9.1f " % (
#									round((queries[name][gen]['results'][db]['min'].seconds*1000000 + queries[name][gen]['results'][db]['min'].microseconds)/1000, 1),
#									round((queries[name][gen]['results'][db]['max'].seconds*1000000 + queries[name][gen]['results'][db]['max'].microseconds)/1000, 1),
#									round((queries[name][gen]['results'][db]['avg'].seconds*1000000 + queries[name][gen]['results'][db]['avg'].microseconds)/1000, 1)
#									),
#					end=''
#				)
			print("| %s | %s | %s " % (
									pd(queries[name][gen]['results'][db]['min']),
									pd(queries[name][gen]['results'][db]['max']),
									pd(queries[name][gen]['results'][db]['avg'])
									),
					end=''
				)
 
		print('|')

Comments



2009/12/12/sqlite_performance.txt · Last modified: 2010/06/15 11:50 by common
chimeric.de = chi`s home Creative Commons License Valid CSS Driven by DokuWiki do yourself a favour and use a real browser - get firefox!! Recent changes RSS feed Valid XHTML 1.0