I’ve recently had to undertake spatial data analysis and one of the more challenging tasks was to identify repeat vs unique vehicle journeys. While QGIS is fantastic for visualising routes through heatmaps and density plots, one of my requirements was to provide quantitative metrics around the proportion of repeat vs unique journeys.

To achieve this, I used several open-source Python tools:

- Fiona for importing the journey data stored as Shapefiles, represented as LineStrings
- Rtree for quickly identifying potentially relevant journeys through a geospatial index (to avoid calculating unnecessary and costly intersection operations)
- Shapely for calculating the intersection/difference between each journey and historical journeys

It took me several variations before I discovered an optimal solution, dropping the run time from 1+ day to under an hour. The lessons learned included:

- Preprocessing data to eliminate TopologyException: found non-noded intersection between LINESTRING
- Using an R-Tree to quickly eliminate geometry that is not overlapping the bounds of the geometry of interest
- Subtracting historical journeys (calculating the difference) from the current journey and determining the uniqueness, rather than finding the intersection of the current journey with the (cascaded) union of historical journeys to find “repeatness”
- Adding a tiny buffer to dramatically increase difference calculating speed (the LineString/Polygon difference operation is faster than LineString/LineString operation – at least for complex LineStrings)

Below is some representative code that illustrates the above methodology:

import fiona from shapely.geometry import mapping, shape from rtree import index INPUT = "journeys.shp" BUFFER_SIZE = 1 # Should be appropriate for the coordinate reference system of the INPUT file records = list(fiona.open(INPUT)) # Sort by date to identify "historical" routes records.sort(key=lambda record: record["properties"]["date"]) # Create indexes rtree_idx = index.Index() buffered_shape_cache = {} for pos, record in enumerate(records): shp = shape(record["geometry"]) rtree_idx.insert(pos, shp.bounds) buffered_shape_cache[record["properties"]["id"]] = shp.buffer(BUFFER_SIZE) for pos1, record in enumerate(records): current_geom = shape(record["geometry"]) current_geom_length = current_geom.length # Subtract all historical journeys for pos2 in rtree_idx.intersection(current_geom.bounds): # Only evaluate against historical journeys if pos2 ﹤ pos1: historical_geom = buffered_shape_cache[records[pos2]["properties"]["id"]] current_geom = current_geom.difference(historical_geom) if current_geom.length == 0: break remaining_length = current_geom.length print("id: {0}, length: {1}, remaining: {2}, proportion: {3}\n".format(record["properties"]["id"], current_geom_length, current_geom.length, current_geom.length / current_geom_length))

If the entire data set is too large to load into memory, it can be partitioned (e.g. geometrically halved or quartered) and the results safely combined (based on id) to give an accurate proportion.

## Leave a Reply