Interrogating Public Police Data with open-source technology and serverless methods

Interrogating Public Police Data with open-source technology and serverless methods

##* Update, the dashboards are currently Online (July 6th).  This was built as an educational tools and to demonstrate the art of the possible.  If you are interested in funding or rebuilding within your own cloud subscription please contact me via @virtual_simon *##

I’ll preface with a few of the resources I’ve used so far in my Azure study journey

I wrote previously about using open source technologies to interrogate public data, having used Docker images and kubernetes to build this solution I wanted to scale and democratise access to this data, so I used the resources I had available to do just that.

Simple Solution Architecture

For the simple solution I’d built above to function for the public, the following additional considerations and elements needed to be provided for;

  • Scalable database for police data
  • Database for Metabase configuration data
  • Scalable container platform
  • Data importing capability
  • Access for the public

Another consideration is that I didn’t want to waste money.  For example as I had a total of two images I wanted to run, that didn’t really lend itself to spinning up a kubernetes cluster or in fact any virtual machines.  To keep the costs down I’d use serverless methods wherever I could.  I was discussing my plans and a mysterious benefactor spoke to me and mentioned that they had an Azure subscription with unused credits that I was welcome to use.

I’m sure that both AWS and GCP have solutions that mirror Azure. However, as I have a benefactor for cost reasons I’d build this in Azure.

Solution

Interrogating Public Police Data with open-source technology and serverless methods

  • Scalable database for police data
    • This need would be met by Azure SQL, utilising a general purpose serverless pricing tier.
  • Database for Metabase configuration data
    • I wanted to separate the configuration data from the publication data so I’ve kept this within as a container instance running a Postgres image.  The configuration data for Metabase is small enough that I can keep this stateless and restore the running config in seconds.
  • Scalable container platform
  • Data importing capability
    • Azure Data Factory can be configured to read data from APIs, CSV and other databases and inject that into a staggering array of solutions. Billed on a pipeline basis.
  • Access for the public
    • ACI is configurable to allow public access on defined tcp ports.

Data Manipulation

The RAW data from the police data site arrives in many thousands of CSV files separated by dataset (stop and search, crimes or outcomes) and regional police authority. As per the previous solution I’ve made use of PowerShell to manipulate this data into single CSV file per data set ready for upload.  First to sort through the files and move them to working directories based upon the dataset CopyFilestoWorkDir.ps1 and the next to merge them into a single file per dataset CSVFileMerge.ps1.

Once the files captured in this way are merged they are uploaded to Azure BLOB storage

Azure Data Factory

Azure Data Factory injects the public police data into the Azure SQL tables. Pipelines exist for CSVs hosted within BLOB storage and API calls to Policing data.  Pipelines can be built and manually triggered or data flows can be configured to orchestrate the way data flows into and out of many systems.

Azure SQL

Due to the size of the public policing data this is stored in a serverless Azure SQL DB. For reference the reported crimes table is over 9 million rows and the reported outcomes 6 million rows.

Azure Data Studio simplifies working with Azure SQL, especially if like me you don’t run windows.

 

Town information is derived from a column ‘LSOA_Name’ where it forms part of this data point.  LSOA stands for Lower Layer Super Output Areas.  LSOAs are built from groups of contiguous Output Areas and are automatically generated to be as consistent in population size as possible, and typically contain from four to six output areas. The Minimum population is 1000 and the mean is 1500.

LSOA is available within the tables for outcome and reported crime. An SQL query copies ‘LSOA_Name’ data filtering numerical characters to a new column named ‘Town’, to allow simple location searching on the outcome and reported crime datasets.  Apart from that no change is made to the raw data.

/* LSOA_name fields also available in [dbo].[DF_recorded-street-crime]*/
UPDATE [dbo].[DF_reported-outcomes]
SET Town = LEFT(LSOA_name, PATINDEX('%[^ A-z]%', LSOA_name)-1);

As the LSOA field contains both alphabetical and numeric characters PATINDEX is used in conjunction with SET and LEFT to populate data based only on the characters A-Z.  I thought that translating the LSOA data to a Town name in this manner makes the data more accessible to the public.

Postgres

Postgres DB runs from an ACI hosted image deployed using a serverless billing model, that allows the solution to flex to 4 cores and 16 GB if required.

 

It hosts a police DB that can be used for test purposes and the Metabase configuration DB.  The container is stateless, the Metabase DB can be recreated on reboot and restored as needed.

Metabase

Metabase is opensource business intelligence software. Perfect for asking questions of publicly available data.  This runs from ACI using a serverless billing model, that like the Postgres DB allows it to flex to 4 cores and 16 GB if needed.  The solution is again stateless all environment variables and configuration load at time of reboot.

"environmentVariables": [
{"name": "MB_DB_TYPE",
 "value": "postgres"},
{ "name": "MB_DB_DBNAME",
 "value": "metabase"},
{ "name": "MB_DB_PORT",
"value": "5432"},
{"name": "MB_DB_USER",
"value": "metabase"},
{"name": "MB_DB_PASS",
"value": "pass"},
{ "name": "MB_DB_HOST",
 "value": "DB location" },
{"name": "JAVA_TOOL_OPTIONS",
"value": "-Xmx12g"
}]

The above is not valid JSON, it is meant to illustrate the environmental variables being passed.

Operation Overview

To replay, the public police data architecture with added solution flow.

 

  1. Azure Data Factory pipelines import data from BLOB and API sources to tables within an Azure SQL DB
  2. Azure SQL DB hosts three Data Factory populated tables for recorded crime, reported outcomes and stop and search data
  3. Metabase configuration data is persisted within Postgres
  4. Administrators and members can build and publish visualisations and dashboards of the data
  5. Metabase hosts these public dashboards

Outcome

Three publicly hosted dashboards to view Reported CrimeReported Outcomes and Stop and Search information across England, Wales and Northern Ireland.

I’ve added images of the example dashboards below, I’ve picked filters at random.

Stop and Search

 

Reported Outcomes

 

Reported Crime

 

Summary

I’ve been able to take the solution that I pulled together and developed locally inside of Docker and kubernetes and port it to the public cloud.  I wanted to scale and democratise access to this data and I think through the links provided in this article I’ve been able to just that.  Well at least until my benefactor runs out of credits on the account.  Although the solution with the architecture does not appear to be that expensive.

 

If there is interest in running this as a full time solution, I will investigate how it can be properly funded.

If you want to know more then you can check out Github or @ me on Twitter.

Thanks

Simon