School year: first master year, first semester.
For the “Project Databases” course, we are assigned one big/complex project to work on, for those majoring in databases. In the academic year 2009–2010, the goal was to build a tool to analyze which habitats should be connected to maximize the survival of the organisms in the habitats. This project was developed in collaboration with a related Belgian biology research institution, as part of an ecology investigation.
More specifically, we would have to develop software capable of doing single-source shortest path analysis (and multiple-source shortest path) to track how squirrels could most safely travel from one habitat to another. Support for multiple cost-distance models was a requirement. We would be working with real-world input data.
Four groups of four people each independently developed the same type of software to help analyze the research data. The basic requirements were the same for all groups, but each group could add additional features. The best (most likely the best performing software) would then be used in the actual research.
Our input data consisted of ESRI ASCII grid files, one of which was a real-world sample (weighing in at 68.4 MB), containing a grid (graph) of ±14.5 million points (nodes).
Our implementation:
- Python, to be able to prototype faster
- Unit tests, to guarantee correctness of the parser and the SSSP (Single-Source Shortest Path) implementation
- C(++) for performance-critical parts (the heavy calculations, but also parsing and drawing)
- Single Source Shortest Path: <10 seconds for 16 million nodes on commodity hardware 1
- Point-to-point, but also point-to-area/area-to-point 2
- Cross-platform
Our implementation was tied for speed when using a single source, but blew the others out of the water for multiple sources. We also had useful features none of the others could match, most notably the ability to work with areas instead of points. Other team members: Arno Barzan, Dominique Fonteyn and Sam Vanderstraeten.
A C++ implementation of Dijkstra’s algorithm, using a binary heap. Several other optimizations were used as well. 2009-era commodity hardware, e.g. a 2.6 GHz Core 2 Duo laptop. ↩
An area is defined as a collection of nodes (points) with the same resistance value. Relies on breadth-first search. ↩