Andrew Waite downloaded the sqlite datasets and blogged about his results running his mimic-nepstats.py script, as I was surprised about the time it took for the paris dataset, I had to investigate.
For me, the paris dataset took more than 30minutes, and I even rewrote some of the queries to make it faster, but he said it was done in about 3minutes.
So, I gave it a shot, and he was right, it was even faster then the 3 minutes he claimed, I could to it in about ~2minutes.
The only difference I could figure out, my initial test did not use the anonymized database.
I gave it a shot, and the not-anonymized database was rather sloppy compared to the anonymized db.
The steps to create the anonymized db involved dumping the original db and restoring the dump to a new database.
So, I spent some cycles on comparing performance on the original database and a restored dump from the original database.
As I've found ANALYZE while digging for reasons in the performance difference, I decided to test ANALYZE impact on sqlite performance.
I think everything but sqliteversion, amount of ram installed and disk speed is unimportant, therefore …
Hardware&Software:
hard
soft
ubuntu 9.10/x86_64
sqlite 3.6.16 (from apt)
So, I prepared 4 files:
o - the original file as gatherd from dionaea
oa - like o, with ANALYZE
d - dumped o, reimported into empty db
da - like a, with ANALYZE
and 3 tasks, two queries for each, one query written by Andrew Waite, and a version by myself, both queries are supposed to give the same results.
For the results, the uppermost row defines the database by name as mentioned previously, aw is Andrew Waite's query to solve the problem, mk is my version.
There are 3 results for each database, minimal value, maximal value and average.
Each query was run 10 times, the maximal value usually indicates the time of the first query on the db, the following iterations are faster as the operating system caches the required parts of the database.
If the time taken is less than a second, the format is ss.ms, so you can see the milliseconds, if the time is more than a second, the format is MM:SS, so you can see the minutes and seconds.
In general I use the max time, as we can not rely on having the data cached already.
IMPORTANT all databases contain the same data, use the same database engine, the only difference is in providing ANALYZE informations and the database's past, the o database grew incremental, the d databases were created in a single transaction.
The first query taken was the query to retrieve the initial date of the first malware download.
This the the query Andrew took to retrieve the date:
SELECT
connections.connection_timestamp
FROM
connections,
downloads
WHERE
downloads.connection = connections.connection
ORDER BY
connections.connection_timestamp
LIMIT 1
I decided to join the downloads&connection table naturally, and retrieve the MIN value instead.
SELECT
MIN(connections.connection_timestamp)
FROM
connections
NATURAL JOIN downloads
Andrews approach to retrieve the first value and limit by one is valid too, and given the ordering of the data even pretty smart.
| | o | oa | d | da | |
| firstdate | min | max | avg | min | max | avg | min | max | avg | min | max | avg | |
| mk | 00.00 | 00.01 | 00.00 | 00.00 | 00.01 | 00.00 | 00.00 | 00.01 | 00.00 | 00.00 | 00.01 | 00.00 |
| aw | 00.00 | 00.01 | 00.00 | 00.00 | 00.01 | 00.00 | 00.00 | 00.01 | 00.00 | 00.00 | 00.01 | 00.00 |
Basically, both querys scale fine on the 4G database, far less than a seconds for each query, and if we'd look at microseconds Andrews query would be faster.
Retrieving the number of unique hosts attacking the sensor and causing a successful download is more complex.
This is Andrews query:
SELECT
connections.remote_host
FROM
connections, downloads
WHERE
downloads.connection = connections.connection
GROUP BY
connections.remote_host
In fact, he does not query the number of hosts, he queries for all hosts, and counts them by iterating through the result set.
I decided to join naturally, and have the database count instead of iterating through the results set myself.
SELECT
COUNT(DISTINCT connections.remote_host)
FROM
connections
NATURAL JOIN downloads
This results are really surprising:
| | o | oa | d | da | |
| uips | min | max | avg | min | max | avg | min | max | avg | min | max | avg | |
| mk | 00:13 | 00:52 | 00:19 | 00:02 | 00:47 | 00:09 | 00:13 | 00:19 | 00:14 | 00:02 | 00:08 | 00:03 |
| aw | 00:17 | 04:18 | 00:42 | 00:17 | 08:35 | 01:08 | 00:16 | 01:47 | 00:25 | 00:16 | 00:33 | 00:18 |
Andrews initial query on the o database takes 4:18 minutes, the same query takes 8:35 minutes on the oa database, so adding the ANALYZE information can even harm your performance.
On the da database, the same query takes 33 seconds.
My own query scales better, on the orig database it takes 52 seconds, which is an improvement to of faktor 4, on the dump with ANALYZE it takes 8 seconds, an improvement of factor 30 compared to Andrews's query on the original dump.
Again, Andrews query,
SELECT
download_md5_hash
FROM
downloads
GROUP BY
download_md5_hash
Again, I decided to have the database count the number of values, instead of counting them myself:
SELECT
COUNT(DISTINCT download_md5_hash)
FROM
downloads
| | o | oa | d | da | |
| udown | min | max | avg | min | max | avg | min | max | avg | min | max | avg | |
| mk | 00:01 | 00:50 | 00:05 | 00:01 | 00:48 | 00:06 | 00:01 | 00:01 | 00:01 | 00.994 | 00:01 | 00:01 |
| aw | 00.298 | 01:06 | 00:07 | 00.299 | 02:07 | 00:13 | 00.293 | 00:20 | 00:02 | 00.293 | 00:02 | 00.472 |
There is no really large performance difference within the queries, but there is a really large difference depending on the database used, on the o database the query takes about 60 seconds, where it takes about 1 second on the da database.
It even takes longer to dump the o database:
time sqlite3 o '.dump' | pv > /dev/null
2.72GB 0:03:48 [12.2MB/s] [ <=> ]
real 3m48.544s
user 1m56.790s
sys 0m18.220s
time sqlite3 da '.dump' | pv > /dev/null
2.72GB 0:02:01 [22.9MB/s] [ <=> ]
real 2m1.508s
user 1m48.570s
sys 0m11.000s
As the da database was faster than o or oa in every case, it is a good idea to dump the database and recreate it from the dump, if you plan to query it.
The da speedup is upto factor 60, so it is definitely worth the time.
Creating da is faster then creating oa, and da is faster in any case, where oa performance may even be worse than o.
o's size matched d's size, as well as oa's size matched da's size, so the problem's got to be data fragmentation, o got written step by step, data was added to each table, where d was written per table in a single transaction.
I'm sorry you won't be able to reproduce this, as I can't offer the o database for download, but here are the steps and the script required to reproduce if you got an valid candiate for an o database yourself.
time cp ../paris/taint/logsql.sqlite.orig o
real 1m24.006s
user 0m0.040s
sys 0m11.420s
time cp ../paris/taint/logsql.sqlite.orig oa
real 1m37.954s
user 0m0.040s
sys 0m10.540s
time sqlite3 o '.dump' | pv | sqlite3 d
2.72GB 0:06:29 [7.15MB/s] [ <=> ]
real 12m20.656s
user 11m59.950s
sys 1m1.650s
time cp d da
real 1m24.255s
user 0m0.060s
sys 0m10.890s
time sqlite3 da "ANALYZE"
real 2m28.563s
user 0m23.340s
sys 0m5.190s
time sqlite3 oa "ANALYZE"
real 34m25.555s
user 0m26.570s
sys 0m26.190s
ls -al
-rw-r--r-- 1 common common 4336886784 2009-12-11 22:20 d
-rw-r--r-- 1 common common 4336889856 2009-12-11 22:30 da
-rw-r--r-- 1 common common 4336886784 2009-12-11 22:05 o
-rw-r--r-- 1 common common 4336889856 2009-12-11 23:04 oa
Maybe somebody wants to run similar tests, therefore here is the 'script'.
I generates some tables in dokuwiki syntax at the end, adjust to fit your needs.
#!/opt/dionaea/bin/python3
import sqlite3
import sys
from functools import reduce
from datetime import datetime, timedelta
import random
queries = {
'uniquedownloads': {
'aw' : { 'query': 'SELECT download_md5_hash FROM downloads GROUP BY download_md5_hash'},
'mk' : { 'query': 'SELECT COUNT(DISTINCT download_md5_hash) FROM downloads'}
},
'uniqueips': {
'aw' : { 'query': 'SELECT connections.remote_host FROM connections, downloads WHERE downloads.connection = connections.connection GROUP BY connections.remote_host'},
'mk' : { 'query': 'SELECT COUNT(DISTINCT connections.remote_host) FROM connections NATURAL JOIN downloads'}
},
'firstdate': {
'aw' : { 'query': 'SELECT connections.connection_timestamp FROM connections, downloads WHERE downloads.connection = connections.connection ORDER BY connections.connection_timestamp LIMIT 1'},
'mk' : { 'query': 'SELECT MIN(connections.connection_timestamp) FROM connections NATURAL JOIN downloads'}
}
}
print(queries)
dbs = sys.argv[1:]
rdbs = list(dbs) # copy
random.shuffle(rdbs)
print(rdbs)
print(dbs)
#exit(0)
for name in queries:
for gen in queries[name]:
for db in rdbs:
conn = sqlite3.connect(db)
c = conn.cursor()
for i in range(0,10):
query = queries[name][gen]['query']
start = datetime.today()
print(query)
c.execute(query)
j=0
for row in c:
j = j+1
print(j)
stop = datetime.today()
diff = stop - start
print(diff)
if 'results' not in queries[name][gen]:
queries[name][gen]['results'] = dict()
if db not in queries[name][gen]['results']:
queries[name][gen]['results'][db] = dict()
if 'all' not in queries[name][gen]['results'][db]:
queries[name][gen]['results'][db]['all'] = list()
# first run is not accounted, lets cache the database before taking numbers
print("i is %i" % (i,))
# if i >= 1:
queries[name][gen]['results'][db]['all'].append(diff)
print(queries)
for name in queries:
for gen in queries[name]:
for db in dbs:
print( "name %s gen %s db %s" % (name, gen, db))
queries[name][gen]['results'][db]['max'] = max(queries[name][gen]['results'][db]['all'])
queries[name][gen]['results'][db]['min'] = min(queries[name][gen]['results'][db]['all'])
queries[name][gen]['results'][db]['avg'] = reduce(lambda x, y: x+y, queries[name][gen]['results'][db]['all'], timedelta()) // len(queries[name][gen]['results'][db]['all'])
# print db names
print("| ", end='')
for db in dbs:
print("| %s ||" % (db,), end='')
print("|")
# min max avg
for name in queries:
print("| " , end='')
for gen in queries[name]:
for db in dbs:
for t in queries[name][gen]['results'][db]:
if t == 'all':
continue
print("| %s " % (t,) , end='')
print("|")
break
break
# names
#for name in queries:
# print("^ %s | " % (name,), end='')
# for gen in queries[name]:
# for db in queries[name][gen]['results']:
# print("||| ", end='')
# print("")
# break
def pd(d):
x = d.seconds*1000000 + d.microseconds
hours, remainder = divmod(x, 3600 * 1000000)
minutes, remainder = divmod(remainder, 60 * 1000000)
seconds, remainder = divmod(remainder, 1000000)
milliseconds, remainder = divmod(remainder, 1000)
if seconds > 0:
return '%02i:%02i' % (minutes, seconds)
return '%02i.%02i' % (seconds, milliseconds)
# duration_formatted = '%s:%s:%s' % (hours, minutes, seconds)
for name in queries:
# query name
print("^ %s | " % (name,), end='')
for gen in queries[name]:
for db in dbs:
print("||| ", end='')
print("")
break
# results
for gen in queries[name]:
print("|| %s " % (gen,), end='')
for db in dbs:
# print("| %9.1f | %9.1f | %9.1f " % (
# round((queries[name][gen]['results'][db]['min'].seconds*1000000 + queries[name][gen]['results'][db]['min'].microseconds)/1000, 1),
# round((queries[name][gen]['results'][db]['max'].seconds*1000000 + queries[name][gen]['results'][db]['max'].microseconds)/1000, 1),
# round((queries[name][gen]['results'][db]['avg'].seconds*1000000 + queries[name][gen]['results'][db]['avg'].microseconds)/1000, 1)
# ),
# end=''
# )
print("| %s | %s | %s " % (
pd(queries[name][gen]['results'][db]['min']),
pd(queries[name][gen]['results'][db]['max']),
pd(queries[name][gen]['results'][db]['avg'])
),
end=''
)
print('|')