Spatial Analysis with Python

I’ve recently had to undertake spatial data analysis and one of the more challenging tasks was to identify repeat vs unique vehicle journeys. While QGIS is fantastic for visualising routes through heatmaps and density plots, one of my requirements was to provide quantitative metrics around the proportion of repeat vs unique journeys.

To achieve this, I used several open-source Python tools:

  • Fiona for importing the journey data stored as Shapefiles, represented as LineStrings
  • Rtree for quickly identifying potentially relevant journeys through a geospatial index (to avoid calculating unnecessary and costly intersection operations)
  • Shapely for calculating the intersection/difference between each journey and historical journeys

It took me several variations before I discovered an optimal solution, dropping the run time from 1+ day to under an hour. The lessons learned included:

  • Preprocessing data to eliminate TopologyException: found non-noded intersection between LINESTRING
  • Using an R-Tree to quickly eliminate geometry that is not overlapping the bounds of the geometry of interest
  • Subtracting historical journeys (calculating the difference) from the current journey and determining the uniqueness, rather than finding the intersection of the current journey with the (cascaded) union of historical journeys to find “repeatness”
  • Adding a tiny buffer to dramatically increase difference calculating speed (the LineString/Polygon difference operation is faster than LineString/LineString operation – at least for complex LineStrings)

Below is some representative code that illustrates the above methodology:

import fiona
from shapely.geometry import mapping, shape
from rtree import index

INPUT = "journeys.shp"
BUFFER_SIZE = 1 # Should be appropriate for the coordinate reference system of the INPUT file

records = list(fiona.open(INPUT))

# Sort by date to identify "historical" routes
records.sort(key=lambda record: record["properties"]["date"])

# Create indexes
rtree_idx = index.Index()
buffered_shape_cache = {}
for pos, record in enumerate(records):
    shp = shape(record["geometry"])
    rtree_idx.insert(pos, shp.bounds)
    buffered_shape_cache[record["properties"]["id"]] = shp.buffer(BUFFER_SIZE)

for pos1, record in enumerate(records):
    current_geom = shape(record["geometry"])
    current_geom_length = current_geom.length

    # Subtract all historical journeys
    for pos2 in rtree_idx.intersection(current_geom.bounds):
        # Only evaluate against historical journeys
        if pos2  pos1:
            historical_geom = buffered_shape_cache[records[pos2]["properties"]["id"]]
            current_geom = current_geom.difference(historical_geom)
        if current_geom.length == 0:
            break

    remaining_length = current_geom.length
    print("id: {0}, length: {1}, remaining: {2}, proportion: {3}\n".format(record["properties"]["id"], current_geom_length, current_geom.length, current_geom.length / current_geom_length))

If the entire data set is too large to load into memory, it can be partitioned (e.g. geometrically halved or quartered) and the results safely combined (based on id) to give an accurate proportion.

Data Category Spheres of Influence

I recently read the article “What is Master Data Management?” by Guy Holmes in the recent (Q2, 2015) PPDM Foundations Journal. There was one statement he made that thought-provokingly captured the definition of master data: “The key difference [from transactional data] is that master data elements tend to repeat in other data sets, and therefore consistency is key to quality”.

Differentiating transactional data from master data is a very common concept within the data management community. I also like to take it one step further and differentiate between master data and reference data, the latter being something that might “repeat” outside of an organisation, and therefore being a candidate for industry/international standardisation.

Continuing Guy’s line of reasoning: if (1) master data is shared across systems within an organization, and (2) reference data is shared across systems within an industry; then what category of data might be shared across systems but also across industries? Metadata?

I sketched a diagram around this thought, which indicates the “sphere of influence” of that data category – pragmatically, its applicability to be used/shared by systems.

data-category-spheres-of-influence

Creating new objects with Ember

I’ve been using Ember on my new projects and while it has a steep learning curve, isn’t as featured as a server-based web framework and documentation is scattered, it allows you to create a clean and maintainable frontend.

The applications I’ve been creating are largely CRUD and I couldn’t find any good guidance on how to structure Ember to create new objects, particularly since the TodoMVC guide had not been updated since Ember 1.10. The TodoMVC guide had previously indicated to create new objects during the create action. But I found that this pattern as-is would leave unsaved objects in the store if the server rejected the POST action and the user navigated back to an index view.

I ended up using an alternate approach of using a form object and instantiating the object in the route. The form object encapsulates client-side validation (rather than inserting this in the controller). As the route created the object, the route is also responsible for tearing down the object if it is not used. This avoids having unsaved records in the local store and not on the server.

Here is an example pattern that I use for “new” routes:

// app/routes/examples/new.js

export default Ember.Route.extend({
  model: function() {
    return this.store.createRecord('example');
  },
  deactivate: function() {
    var model = this.modelFor('examples.new');
    if (model.id === null) {
      this.store.unloadRecord(model);
    }
  }
});

Get database version in Rails

For one of my applications, I recently created a status report page which shows crucial information for debugging, including software versions, service connectivity and filesystem path writability. One of the items was a database version string, and the standard ActiveRecord interface does not support this concept. While version methods usually exist in the connection adapters, they are usually protected/private and have different naming conventions.

As the application can be run against multiple backends as well as on MRI or JRuby, I needed to check what adapter was in use, then query the database:

adapter = ActiveRecord::Base.connection.adapter_name
sql = case adapter
  when 'MSSQL'
    'SELECT @@VERSION'
  when 'MySQL', 'Mysql2', 'PostgreSQL'
    'SELECT VERSION()'
  when 'OracleEnhanced'
    'SELECT * FROM V$VERSION'
  when 'SQLite'
    'SELECT SQLITE_VERSION()'
end
ActiveRecord::Base.connection.select_value(sql)

Combining DataTable’s Ajax and Javascript sources

I often use DataTables as it provides a lot of out-of-the-box functionality for searching, ordering, paginating and use of Ajax data sources. However, using the server-side processing example means the HTML page will load, and then there is a short wait until DataTables fetches the data from the server using Ajax. This can make an application feel slower, since the user has to wait for the page to load, and then wait again for the data to arrive.

While DataTables can load data from a Javascript source, it seems this option is ignored if we want to use server-side processing.

We can eliminate the second Ajax call by embedding the Javascript object into the HTML that would result from the second call. The object should have exactly the same structure that would normally be returned by the server-side processing source, including the recordsFiltered and recordsTotal attributes to give correct counts:

var intialData = {
  "draw": 1,
  "recordsTotal": 57,
  "recordsFiltered": 57,
  "data": [
    [
      "Airi",
      "Satou",
      "Accountant",
      "Tokyo",
      "28th Nov 08",
      "$162,700"
    ],
    [
      "Angelica",
      "Ramos",
      "Chief Executive Officer (CEO)",
      "London",
      "9th Oct 09",
      "$1,200,000"
    ],
...
}

We then need to use the Ajax option as a function and check if we are performing the first or a subsequent render.

$('#example').dataTable({
  processing: true,
  serverSide: true,
  ajax: function(data, callback, settings) {
    if (typeof initialData.done === 'undefined') {
      callback(initialData);
      indexData.done = true;
    } else {
      $.ajax({
          url: "../server_side/scripts/server_processing.php"
        })
        .done(function(data) {
          callback(data);
        });
    }
  }
});

Optimising index view performance

I recently came across a scenario where my index views in Rails were taking a long time to render when there were large data sets. The problem stemmed having a lot of data presented that were from models/tables other than the primary model/table of interest.

For example, an “asset” index view may have the following columns:

  • Name (from asset.name)
  • Type (from asset.type.name)
  • Site (from asset.site.name)
  • Organisation (from asset.site.organisation.name)
  • System (from asset.system.name)
  • Status (from asset.status.name)

The first (naive) approach I used was to query using eager loading:

Asset.includes(:type, {site: :organisation}, :system, :status)
     .references(:type)
     .search_by('types.name', 'Pump')
     .order(:name).page(1).per_page(1000)

Using a data set of around 500,000 items, rack-mini-profiler indicated this took about 6.7 seconds to render the action with 5.6 seconds of that consumed by the SQL query.

When investigating, I noticed this was using a single query with LEFT OUTER JOINs. I then forced ActiveRecord to use separate queries to load the related models by using Rails 4’s preload instead of includes and used joins instead of references to force ActiveRecord to use a more performant INNER JOIN:

Asset.preload(:type, {site: :organisation}, :system, :status)
     .joins(:type)
     .search_by('types.name', 'Pump')
     .order(:name).page(1).per_page(1000)

This dropped the action render time to 1.7 seconds with 0.7 seconds consumed by the six SQL queries.

For my index views, I actually don’t need the entire object graph to be hydrated – I only really needed the names of the related object. So I tried using pluck, which returns an 2-dimensional array of values:

Asset.includes(:type, {site: :organisation}, :system, :status)
     .joins(:type).search_by('types.name', 'Pump')
     .order(:name).page(1).per_page(1000)
     .pluck(:name, 'types.name', 'sites.name', 'organisations.name', 'systems.name', 'statuses.name')

This actually increased the action render time to 2.1 seconds with 1.7 seconds consumed by ActiveRecord and the SQL query. The SQL query now uses LEFT OUTER JOINs again, which as we saw in the original case, is slow. But the differential between the total and SQL had decreased since ActiveRecord models did not have to be created.

While using the second approach would provide a significant speed boost without changing any code, I wondered if we could use a combination of the second and third approaches – use separate SQL queries instead of a single with LEFT OUTER JOINs and simply pass back an array and not have to hydrate ActiveRecord models.

The result: Preload Pluck.

On the same data set, Preload Pluck drops the action render time to 1.5 second with 0.7 seconds consumed by the SQL queries.

YMMV as the results above are for that particular data set, the queries I’ve used and the way my PostgreSQL database is setup.

Calling a SOAP 1.1 Web Service using WS-Security and HTTPS

I recently had to update an .NET application to enable support for calling SOAP 1.1 Web Services using the WS-Security UsernameToken support over HTTPS with a self-signed TLS certificate. In the end, I had to use a custom binding, since there wasn’t a built in one that suited my requirements; for example, basicHttpBinding supports SOAP 1.1 but not HTTPS while wsHttpBinding supports HTTPS but only using SOAP 1.2.

Let’s break it down into steps:

Adding SOAP headers to send the username and password

The simplest way of adding credentials to every request is to add the SOAP header XML in your App.config:

<system.serviceModel>
  <client>
    <endpoint address="http://example.com/MyWebService" binding="basicHttpBinding" contract="IMyWebService">
      <headers>
        <wsse:UsernameToken xmlns:wsse='http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd' >
          <wsse:Username>Username</wsse:Username>
          <wsse:Password>Password</wsse:Password>
        </wsse:UsernameToken>
      </headers>
    </endpoint>
  </client>
</system.serviceModel>

Specify HTTPS and SOAP 1.1

In your App.config:

<system.serviceModel>
  <bindings>
    <customBinding>
      <binding name="wsHttpSoap11">
        <textMessageEncoding messageVersion="Soap11"/>
        <httpsTransport/>
      </binding>
    </customBinding>
  </bindings>
  <client>
    <endpoint address="https://example.com/MyWebService" binding="customBinding" bindingConfiguration="wsHttpSoap11"  contract="IMyWebService">
...
    </endpoint>
  </client>
</system.serviceModel>

At this point, I received the error message “Could not establish trust relationship for the SSL/TLS secure channel with authority” because I was using a self-signed certificate. I had designated the certificate as trusted, and importantly, import it as a Trusted Root Certification Authority. This can be done using the following steps:

  1. Start -> Run -> mmc.exe
  2. File -> Add/Remove Snap-in…
  3. Add Certificates snap-in
  4. Select the Trusted Root Certification Authority folder
  5. Action -> All Tasks -> Import…

Apache2 configuration for Rails

I recently had to stand up a Rails production server that used Apache2 as a reverse proxy. While it was relatively easy proxying requests to the Thin server using ProxyPass, assets would not show up, even with a DocumentRoot specified. It seems that ProxyPass proxies all requests, and the configuration needs to specifically ignore proxying certain requests using the exclamation mark syntax. The configuration looked like:

<VirtualHost *:80>
        ServerName example.com
        DocumentRoot /var/www/app/public
        ProxyPassMatch ^/(assets/|(404|422|500)\.html|favicon\.ico|robots\.txt) !
        ProxyPass / http://localhost:3000/
        ProxyPassReverse / http://localhost:3000/
</VirtualHost>

HTML5 presentation with motion background

I was recently looking for a way to present our church’s worship lyrics on top of a motion background. I naturally turned to PowerPoint, but quickly found out that the only way to achieve a seamless transition between slides is to combine all the lyrics on a single slide and use animations to show and hide the lyric text.

This was a clunky solution, and there is a software market specifically for this use case. However, instead of splurging on new software, I wondered if my simple requirements could be met using a HTML5 web page. The only requirements were that:

  • A full screen, edge-to-edge video could be set as the background. It should automatically start and loop.
  • The keyboard arrow and space keys can be used for navigation between slides.
  • Multi-line text must be centered vertically and horizontally on the screen.

I initially looked into using reveal.js and impress.js. It was unclear if the former already supported video backgrounds or if it was still in development only. The latter seemed extremely complicated for my simple use case.

Looking into browser presentations using jQuery, I came across this article which provided a way to navigate between slides represented as divs. There was additional code to support visual navigation buttons in a footer which I didn’t need, so consequently stripped out. It also used a old style way of adding a jQuery keyboard event handler, so I updated that. I also replaced a simple jQuery hide/show with a fadeOut/fadeIn for a little jazz.

I then looked into the video requirement and found this article which had exactly what I needed to create an edge-to-edge video with autoplay and looping. Relevant code:

<video autoplay loop id="bgvid">
  <source src="motion-background.mp4" type="video/mp4">
</video>

#bgvid {
  position: fixed;
  right: 0; bottom: 0;
  min-width: 100%; min-height: 100%;
  width: auto; height: auto;
  z-index: -100;
  background-size: cover;
}

I then needed a way to both vertically and horizontally center text, and I investigated what Flexbox could provide since I had full control of what browser was used (although Chrome has supported Flexbox for a while). While the CSS was extremely simple, I wasn’t sure why my paragraphs on each slide were displayed inline. It turned out that I needed an extra div, so instead of this structure:

<div class="slide">
  <p>Line 1</p>
  <p>Line 2</p>
  <p>Line 3</p>
</div>

I needed this structure:

<div class="slide">
  <div class="content">
    <p>Line 1</p>
    <p>Line 2</p>
    <p>Line 3</p>
  </div>
</div>

Not having any motion backgrounds on hand, I found that www.motionbackgrounds.co had some decent free videos that I could use to test out the concept.

Combining the above, it all looked pretty good in full screen mode. The only deficiency is the lack of a presenter view, where the presentation can be started full screen on a separate monitor. There appears to be an solution using Firefox, but it didn’t seem robust when I tried it.

The source code and example presentation is available on GitHub. I’ve included a motion background created using Premiere tutorial to avoid copyright issues.