
25 July, 2011

I wrote the last blog post about my master's thesis on June 1st, so this final one is long overdue. To the (very few) readers interested in the technical details: I apologize for the long delay in writing about the last part.
That last blog post was about FP-Growth. This one is about FP-Stream. Whereas FP-Growth can analyze static data sets for patterns, FP-Stream is capable of finding patterns over data streams. FP-Stream relies on FP-Growth for significant parts, but it is considerably more advanced. In essence, this phase only adds the capability to mine over a stream of data. While that may not sound like much, the added complexity turns it into a fairly large undertaking.

1 June, 2011

The previous blog post covering my master's thesis was about the libraries I wrote for detecting browsers and locations: QBrowsCap and QGeoIP.
On the very day that post was published, I reached the first implementation milestone: the code was already finding causes of slow page loads, though not yet over exactly specified periods of time, only over each chunk of 4,000 lines read from an Episodes log file. To achieve this, I completed an implementation of the FP-Growth algorithm and then modified it to add support for item constraints.
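
As a rough illustration of mining per chunk with an item constraint, here is a minimal Python sketch. It is not the thesis code (which is C++/Qt) and it is not FP-Growth: it counts itemsets by brute force, which is only workable for tiny inputs. The function names, the `max_size` limit and the example items other than the `duration:*` ones are assumptions made for this sketch.

```python
from collections import Counter
from itertools import combinations, islice


def chunks(lines, size=4000):
    """Yield successive lists of at most `size` lines from an iterable."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def frequent_itemsets(transactions, min_support, required_item, max_size=3):
    """Brute-force stand-in for FP-Growth with an item constraint.

    Counts every itemset of up to `max_size` items that contains
    `required_item`, and keeps those seen at least `min_support` times.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        if required_item not in items:
            # This transaction cannot support any itemset that
            # satisfies the constraint, so skip it entirely.
            continue
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                if required_item in itemset:
                    counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical transactions: one per parsed log line.
example = [
    ["duration:slow", "browser:X", "country:BE"],
    ["duration:slow", "browser:X"],
    ["duration:fast", "browser:X"],
]
patterns = frequent_itemsets(example, min_support=2, required_item="duration:slow")
```

Requiring `duration:slow` in every itemset is what keeps the output focused on causes of slow page loads. A real implementation would apply the same constraint inside FP-Growth rather than enumerating combinations.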

FP-Growth {#FP-Growth}

Thoroughly explaining the FP-Growth algorithm would lead us too far, so I’ll include only a brief explanation below. For details, I refer to the original paper, “Mining frequent patterns without candidate generation” by J. Han, J. Pei, Y. Yin and R. Mao, which is easy to find through Google Scholar.

1 March, 2011

In December and January, I continued working on my master's thesis while simultaneously preparing for my exams in January (which I passed without problems).
In a previous blog post, I had indicated that I ran into problems while parsing dates: Qt uses the system locale for this, but on Mac OS X that functionality turned out to have a severe performance problem. I solved that by developing QCachingLocale, a class that introduces a caching layer to avoid this performance degradation.

Further parsing {#further-parsing}

Now, parsing the date was of course only one tiny part of the problem: I also had to parse the episodes information embedded in each Episodes log file line (which is trivial), map the IP address to a physical location and an ISP, and map the user-agent string to a platform and an actual browser.
Finally, we also want to map the episode duration to one of duration:slow, duration:acceptable or duration:fast. This is called ‘discretization’: continuous values (in our case: durations) are mapped to discrete values.
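
To make the discretization step concrete, here is a minimal Python sketch. The thesis code itself is C++/Qt, and the function name and both thresholds below are assumptions chosen purely for illustration. Only the three labels come from the text above.

```python
def discretize_duration(duration_ms, fast_below_ms=1000, slow_from_ms=4000):
    """Map a continuous episode duration to one of three discrete labels.

    The thresholds are hypothetical defaults, not the values used in
    the thesis.
    """
    if duration_ms < fast_below_ms:
        return "duration:fast"
    if duration_ms >= slow_from_ms:
        return "duration:slow"
    return "duration:acceptable"


labels = [discretize_duration(ms) for ms in (250, 2500, 9000)]
```

With discrete labels like these, a duration becomes just another item in a transaction, next to the browser or the location, so a pattern miner can treat it like any other attribute.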

21 November, 2010

QCachingLocale speeds up Qt’s slow QSystemLocale::query() calls by caching the answers. This seems to be particularly necessary on Mac OS X 10.6.
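
The caching idea can be sketched in a few lines of Python. This is only an illustration of the principle, not QCachingLocale's actual C++ implementation. The class name, the `slow_query` callable and the example query values are all hypothetical.

```python
class CachingLocale:
    """Wrap a slow query function and remember its answers."""

    def __init__(self, slow_query):
        self._slow_query = slow_query  # the expensive lookup being wrapped
        self._cache = {}

    def query(self, query_type, argument=None):
        # Arguments must be hashable so they can serve as a cache key.
        key = (query_type, argument)
        if key not in self._cache:
            # Only the first request for a given key pays the full cost.
            self._cache[key] = self._slow_query(query_type, argument)
        return self._cache[key]


calls = []


def fake_system_query(query_type, argument):
    """Hypothetical stand-in for an expensive system locale lookup."""
    calls.append((query_type, argument))
    return f"{query_type}:{argument}"


locale = CachingLocale(fake_system_query)
first = locale.query("MonthNameLong", 7)
second = locale.query("MonthNameLong", 7)  # served from the cache
```

This only pays off if the same queries repeat, which is the case when parsing thousands of dates in the same format, and it assumes the system locale does not change while the program runs.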

The other day I was working on my master's thesis, specifically on the parser for Episodes log files. I had finished a rough version that parses all fields on an Episodes log line. Unfortunately, performance turned out to be extremely poor: 4.8 seconds for parsing 1000 lines.

After a bit of research, it became clear that the call to QDateTime::fromString() was the cause of the performance issues. Unable to figure it out on my own (I tried for an hour or so), I hopped onto the #qt IRC channel and posted a simple test case that could reproduce the problem: