In December and January, I continued working on my master’s thesis while preparing for my exams in January (which I passed without problems).
In a previous blog post, I had indicated that I ran into problems while parsing dates: Qt uses the system locale for this, but on Mac OS X there turned out to be a severe performance problem with that functionality. I solved that by developing QCachingLocale, a class that introduces a caching layer to prevent this performance degradation.
Further parsing {#further-parsing}
Now, parsing the date was of course only one tiny part of the problem: I also had to parse the episodes information embedded in each Episodes log file line (which is trivial), as well as map the IP address to a physical location and an ISP and map the user-agent string to a platform and actual browser.
Finally, we also want to map the episode duration to either duration:slow, duration:acceptable or duration:fast. This is called ‘discretization’: continuous values (in our case: durations) are mapped to discrete values.
QBrowsCap: user-agent string → platform + browser {#QBrowsCap}
After spending a very long time looking for a C++ library to map user-agent strings to their corresponding browser name and version (and platform), I had to give up. No such library existed.
Because it is impossible to write a single, standardized routine that parses this information from the user-agent string, I had to rely on BrowsCap, the Browser Capabilities Project. This is the same dataset the PHP language relies on to identify browsers.
I have developed a C++ library that relies on BrowsCap to do just that. I baptized it “QBrowsCap” because it is specifically optimized for use in Qt-based projects.
QBrowsCap makes it easy to download this dataset and keep it up to date. It parses the browscap.csv file and stores it in a SQLite database, which allows for faster mapping of user-agent strings (BrowsCap relies on ‘globbing’, and SQLite has built-in support for this). To maximize performance, it uses an in-memory hash table. QBrowsCap has also been made thread-safe, so that multiple threads can look up user-agent details concurrently (i.e. by using a MapReduce approach, implemented in C++/Qt with Qt’s QtConcurrent), which allows for greater lookup speeds.
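To illustrate what ‘globbing’ means here, the following is a minimal standalone sketch of wildcard matching in plain C++. It is not QBrowsCap’s code (which, as described above, leaves the matching to SQLite); the function name globMatch is made up for illustration, and only the * and ? wildcards are handled.

```cpp
#include <string>

// Returns true if `text` matches `pattern`, where '*' matches any run of
// characters (including none) and '?' matches exactly one character.
// Hypothetical helper for illustration; character classes are not supported.
bool globMatch(const std::string& pattern, const std::string& text) {
    std::size_t p = 0, t = 0;
    std::size_t starP = std::string::npos; // position of the last '*' seen
    std::size_t starT = 0;                 // text position that '*' resumes from

    while (t < text.size()) {
        if (p < pattern.size() && pattern[p] == '*') {
            // Remember the star and first try to let it match nothing.
            starP = p++;
            starT = t;
        } else if (p < pattern.size() &&
                   (pattern[p] == '?' || pattern[p] == text[t])) {
            ++p;
            ++t;
        } else if (starP != std::string::npos) {
            // Mismatch: backtrack and let the last '*' absorb one more character.
            p = starP + 1;
            t = ++starT;
        } else {
            return false;
        }
    }
    // Any trailing stars can match the empty remainder.
    while (p < pattern.size() && pattern[p] == '*')
        ++p;
    return p == pattern.size();
}
```

With this sketch, an invented pattern such as `Mozilla/4.0 (compatible; MSIE 6.0*Windows NT 5.1*)` matches the Internet Explorer 6 user-agent string used in the sample below.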
Sample result
The user-agent string Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) is mapped to:
("ua:WinXP",
"ua:WinXP:IE",
"ua:WinXP:IE:6",
"ua:WinXP:IE:6:0")
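The hierarchical shape of these items (each item extends the previous one by one level) can be sketched as follows. This is only an assumption about how such items could be generated, not QBrowsCap’s actual code; the function name hierarchicalItems is hypothetical.

```cpp
#include <string>
#include <vector>

// Builds one item per level, each one extending the previous item:
// ("ua", {"WinXP", "IE"}) yields "ua:WinXP" and "ua:WinXP:IE".
// Hypothetical helper for illustration only.
std::vector<std::string> hierarchicalItems(const std::string& prefix,
                                           const std::vector<std::string>& levels) {
    std::vector<std::string> items;
    std::string current = prefix;
    for (const std::string& level : levels) {
        current += ":" + level;
        items.push_back(current);
    }
    return items;
}
```

Calling it with the prefix "ua" and the levels WinXP, IE, 6 and 0 produces the four items shown above. Emitting every level as a separate item means that rules can later be found at any granularity: for a platform as a whole, or for one specific browser version.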
QGeoIP: IP address → location + ISP {#QGeoIP}
Unfortunately, no library was available for Qt to map IP addresses to physical locations either. Fortunately, I did find a C library, which I made easier to use by wrapping it in a Qt-friendly manner. I called the end result “QGeoIP”. QGeoIP also simplifies the build process of the C library by using Qt’s build system instead of an arcane Makefile. QGeoIP uses MaxMind’s libGeoIP.
I encountered one major problem with libGeoIP, though: GeoIP_delete() does not actually free the memory consumed by libGeoIP’s in-memory cache. I tried debugging this, but could not get it to work. Maybe it’s an OS X-specific issue? I did not try it on Linux. Likely related to these memory release issues, it seems to be impossible to make QGeoIP work in a thread-safe manner, which unfortunately rules out concurrent IP to location + ISP mapping using multiple threads.
Sample result
The IP address 218.56.155.59 is mapped to:
("location:AS",
"location:AS:China",
"location:AS:China:Shandong",
"location:isp:China:AS4837 CNCGROUP China169 Backbone")
EpisodesDurationDiscretizer: continuous episode durations → discrete speeds {#EpisodesDurationDiscretizer}
The Episodes timing information contains <episode name>:<episode duration> pairs. It’s far more difficult to apply association rule mining to continuous data than to discrete data; therefore we discretize the continuous data in these pairs: the episode durations. As explained in the introduction, we want to map these continuous episode durations to discrete speeds: duration:slow, duration:acceptable or duration:fast.
To do this, I wrote the EpisodesDurationDiscretizer class (.h/.cpp), which accepts a .csv file that defines the mappings. Such a .csv file looks like this:
domready,fast,150,acceptable,1000,slow
frontend,fast,100,acceptable,1500,slow
headerjs,fast,100,acceptable,1000,slow
footerjs,fast,100,acceptable,1000,slow
css,fast,100,acceptable,500,slow
DrupalBehaviors,fast,100,acceptable,200,slow
tabs,fast,10,acceptable,20,slow
ToThePointShowHideChangelog,fast,10,acceptable,20,slow
As you have probably deduced, the first column contains the episode name and the second column contains the “speed name” for the fastest discretization, which goes from 0 ms to the value in the third column. Each following speed name starts just above the previous upper bound, and the last speed name has no upper bound. As many discretizations as desired can be defined. In our case, there are three discretization levels for each episode. For example, these are the three discretization levels for the domready episode durations:
- “fast”: 0–150 ms
- “acceptable”: 151–1000 ms
- “slow”: 1001–∞ ms
Sample result
The Episodes timing information
css:203,headerjs:94,footerjs:500,domready:843,tabs:110,ToThePointShowHideChangelog:15,DrupalBehaviors:141,frontend:1547
is mapped to
(("episode:css", "duration:acceptable"),
("episode:headerjs", "duration:fast"),
("episode:footerjs", "duration:acceptable"),
("episode:domready", "duration:acceptable"),
("episode:tabs", "duration:slow"),
("episode:ToThePointShowHideChangelog", "duration:acceptable"),
("episode:DrupalBehaviors", "duration:acceptable"),
("episode:frontend", "duration:slow"))
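The mapping from a .csv row to a discretized speed can be sketched in plain C++ as follows. This is a simplified illustration under my own assumptions, not the actual EpisodesDurationDiscretizer class: the names DiscretizationRule, parseRule and discretize are hypothetical, and upper bounds are treated as inclusive, as in the domready example above.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// One rule parsed from a row such as "css,fast,100,acceptable,500,slow".
struct DiscretizationRule {
    std::string episode;
    // (speed name, inclusive upper bound in ms) for every level but the last.
    std::vector<std::pair<std::string, int>> bounded;
    std::string unbounded; // last level: no upper bound
};

// Parses one row: episode name, then (speed, bound) pairs, then a final speed.
DiscretizationRule parseRule(const std::string& csvLine) {
    std::vector<std::string> fields;
    std::stringstream stream(csvLine);
    std::string field;
    while (std::getline(stream, field, ','))
        fields.push_back(field);

    // A valid row has an even number of fields, and at least two.
    assert(fields.size() >= 2 && fields.size() % 2 == 0);

    DiscretizationRule rule;
    rule.episode = fields[0];
    for (std::size_t i = 1; i + 1 < fields.size(); i += 2)
        rule.bounded.emplace_back(fields[i], std::stoi(fields[i + 1]));
    rule.unbounded = fields.back();
    return rule;
}

// Returns the item for the first level whose upper bound covers the duration.
std::string discretize(const DiscretizationRule& rule, int durationMs) {
    for (const auto& level : rule.bounded)
        if (durationMs <= level.second)
            return "duration:" + level.first;
    return "duration:" + rule.unbounded;
}
```

For example, parsing the css row above and discretizing a duration of 203 ms gives duration:acceptable, which agrees with the sample result.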
End result {#end-result}
Now that we can map meaningless strings and numbers to meaningful items, we can apply association rule mining. But more on that in a future blog post.
We will end by looking at the result of a single line as it gets parsed and processed by my master’s thesis code.
We begin with:
"218.56.155.59 [Sunday, 14-Nov-2010 06:27:03 +0100] "?ets=css:203,headerjs:94,footerjs:500,domready:843,tabs:110,ToThePointShowHideChangelog:15,DrupalBehaviors:141,frontend:1547" 200 "http://driverpacks.net/driverpacks/windows/xp/x86/chipset/10.09" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" "driverpacks.net"
This gets parsed and processed into:
("episode:css", "duration:acceptable", "url:http://driverpacks.net/driverpacks/windows/xp/x86/chipset/10.09", "location:AS", "location:AS:China", "location:AS:China:Shandong", "location:isp:China:AS4837 CNCGROUP China169 Backbone", "ua:WinXP", "ua:WinXP:IE", "ua:WinXP:IE:6", "ua:WinXP:IE:6:0")
("episode:headerjs", "duration:fast", "url:http://driverpacks.net/driverpacks/windows/xp/x86/chipset/10.09", "location:AS", "location:AS:China", "location:AS:China:Shandong", "location:isp:China:AS4837 CNCGROUP China169 Backbone", "ua:WinXP", "ua:WinXP:IE", "ua:WinXP:IE:6", "ua:WinXP:IE:6:0")
("episode:footerjs", "duration:acceptable", "url:http://driverpacks.net/driverpacks/windows/xp/x86/chipset/10.09", "location:AS", "location:AS:China", "location:AS:China:Shandong", "location:isp:China:AS4837 CNCGROUP China169 Backbone", "ua:WinXP", "ua:WinXP:IE", "ua:WinXP:IE:6", "ua:WinXP:IE:6:0")
("episode:domready", "duration:acceptable", "url:http://driverpacks.net/driverpacks/windows/xp/x86/chipset/10.09", "location:AS", "location:AS:China", "location:AS:China:Shandong", "location:isp:China:AS4837 CNCGROUP China169 Backbone", "ua:WinXP", "ua:WinXP:IE", "ua:WinXP:IE:6", "ua:WinXP:IE:6:0")
("episode:tabs", "duration:slow", "url:http://driverpacks.net/driverpacks/windows/xp/x86/chipset/10.09", "location:AS", "location:AS:China", "location:AS:China:Shandong", "location:isp:China:AS4837 CNCGROUP China169 Backbone", "ua:WinXP", "ua:WinXP:IE", "ua:WinXP:IE:6", "ua:WinXP:IE:6:0")
("episode:ToThePointShowHideChangelog", "duration:acceptable", "url:http://driverpacks.net/driverpacks/windows/xp/x86/chipset/10.09", "location:AS", "location:AS:China", "location:AS:China:Shandong", "location:isp:China:AS4837 CNCGROUP China169 Backbone", "ua:WinXP", "ua:WinXP:IE", "ua:WinXP:IE:6", "ua:WinXP:IE:6:0")
("episode:DrupalBehaviors", "duration:acceptable", "url:http://driverpacks.net/driverpacks/windows/xp/x86/chipset/10.09", "location:AS", "location:AS:China", "location:AS:China:Shandong", "location:isp:China:AS4837 CNCGROUP China169 Backbone", "ua:WinXP", "ua:WinXP:IE", "ua:WinXP:IE:6", "ua:WinXP:IE:6:0")
("episode:frontend", "duration:slow", "url:http://driverpacks.net/driverpacks/windows/xp/x86/chipset/10.09", "location:AS", "location:AS:China", "location:AS:China:Shandong", "location:isp:China:AS4837 CNCGROUP China169 Backbone", "ua:WinXP", "ua:WinXP:IE", "ua:WinXP:IE:6", "ua:WinXP:IE:6:0")
As you can see, this single Episodes log file line results in eight transactions. The careful reader will have noticed this matches the number of episodes in the original Episodes log file line. More specifically, each episode gets its own transaction, along with its corresponding discretized speed and all request metadata (URL, location, ISP, platform, browser). (Note that this is a simple example; in the actual implementation, the HTTP status code is also included if it’s not a 200 status code, and a ua:isMobile item is included in the transaction if it’s a mobile user agent.)
This is because we want to find associations for specific episodes’ speeds. Hence we need a transaction for each episode with its speed, plus all possible environmental factors that can cause this particular speed. On these resulting transactions, we can then apply association rule mining.
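The way one log line fans out into one transaction per episode can be sketched like this. It is a simplified illustration, not the code of my EpisodesParser; the Transaction alias and the buildTransactions function are hypothetical names.

```cpp
#include <string>
#include <utility>
#include <vector>

// A transaction is simply a list of items (hypothetical alias).
using Transaction = std::vector<std::string>;

// Creates one transaction per (episode item, duration item) pair and appends
// the request metadata (URL, location, ISP, user agent items) to each of them.
std::vector<Transaction> buildTransactions(
        const std::vector<std::pair<std::string, std::string>>& episodes,
        const std::vector<std::string>& metadata) {
    std::vector<Transaction> transactions;
    transactions.reserve(episodes.size());
    for (const auto& episode : episodes) {
        Transaction transaction{episode.first, episode.second};
        transaction.insert(transaction.end(), metadata.begin(), metadata.end());
        transactions.push_back(std::move(transaction));
    }
    return transactions;
}
```

Given the eight episode pairs and the nine metadata items of the example above, this yields eight transactions of eleven items each.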
Conclusion {#conclusion}
Both QBrowsCap and QGeoIP are released under the Unlicense (as is the master’s thesis as a whole), so feel free to use them in your applications and contribute back! They also include unit tests and tiny sample applications that can easily be built with Qt Creator.
When integrated with my master’s thesis’ EpisodesParser, I’m able to achieve over 4,000 parsed & processed lines per second on my 2.66 GHz Core 2 Duo machine, resulting in ±40,000 transactions. Not bad! :)
ua parsing
I believe the best UA parsing code in the world is in Browserscope:
http://www.browserscope.org/ua
I regularly compare this code to other parsers and for a simple UA String to Browser Name parser it outperforms everything else.
BrowserScope's code base is too big
BrowserScope’s code base is too big to be manageable by a single person without prior deep knowledge (that’s me) and especially to port it all to C++. If it were sufficiently documented to make it clear which parts change relatively frequently (because they depend on new user agents appearing) and which parts don’t, then it might have been doable to port it. But in its current state, I’m afraid it’s just too hacky. (And calling Python code from within my C++ code … that would just have been too ugly.)
I did consider it, but evaluated the option as being too unwieldy, convoluted and non-futureproof (BrowsCap updates are easy, BrowserScope’s UA parsing updates have to be done by hand).
When BrowserScope’s UA parsing becomes more feasible, it’ll be very easy to just replace QBrowsCap with QBrowserScopeUA! :)
ua parsing
We are using the boomerang library from Yahoo in our web pages to measure the page load times (currently running on our staging servers). Our Apache logs get collected (near real time) using Flume agents and get stored in an OpenTSDB instance.
We have also written our own framework to parse the UA, GeoIP and ISP info from the log lines.
Looking forward to your datamining work!
I did not know Flume & OpenTSDB
Thank you so much for your excellent comment, Jurgen!
I had never heard of Flume nor OpenTSDB. Both are very interesting.
If I understand it correctly after my quick skim, Flume uses Apache Hadoop to parallelize the workload. In fact, I’m doing something like that: I’m using QtConcurrent to apply a MapReduce approach within my own codebase (i.e. without external dependencies; I updated the blog post to reflect this), but without the “Reduce” step: no reduction is necessary, only regular processing. I had thought about using a full-fledged Apache Hadoop setup, but considered it too much work to add yet another dependency (and learn to install, configure and work with it). It appears that Flume requires less setup, so it may be a viable option in the future.
OpenTSDB appears to be insanely useful and unbelievably powerful. But again, it’d be yet another dependency, which in itself apparently has boatloads of dependencies: it requires a fairly elaborate Java environment. And I personally severely dislike working with Java: I’ve only had bad experiences with it over the past 5 years. If I integrated with OpenTSDB, it seems the scope of my master’s thesis implementation would grow too unwieldy.
Nevertheless, had I known about these right from the beginning, I might have built upon them. Right now, it’s too late to start changing it all.
Is your framework to parse the UA, GeoIP and ISP open source? Did you write it yourself, or did you fork it from another open source project? And possibly most importantly: what are you using it for? For the next site of the K.U. Leuven, maybe?
Looking forward to further feedback from you! :)
RE: I did not know Flume & OpenTSDB
P.S.: Have a look at the following framework for mining your log data: http://mahout.apache.org/ ;-)
Thanks for the intros! And thanks for Apache Mahout…
Thanks for the additional links with introductory videos :) And once more, you stun me with a piece of software I did not know about: I’d never heard about Apache Mahout before… Nor did I find it in my search 1.5 years ago for frameworks that I could leverage.
You can find my current association rule mining code here on GitHub.
Consider this a bug report
Consider this a bug report since I’m too lazy to sign up at GitHub.
QGeoIP.cpp line 76 calls GeoIP_time_zone_by_country_and_region using r->country_code and r->region before checking if r is null on line 78.
In addition, line 120 attempts to open the databases using the GEOIP_MMAP_CACHE flag, which is unavailable on Windows. This results in a segfault at the first lookup attempt, so a different flag or OS checking may be wise.
Your bug report is much
Your bug report is much appreciated!
However, as you can tell by looking at the code on GitHub, it’s been two years since I wrote that code. I’d be happy to apply a patch or merge your fork, but I’m not going to fix this myself.
I did open an issue on GitHub with this exact information though: https://github.com/wimleers/QGeoIP/issues/1. Hopefully that will help or even entice future users of the code :)