Interrogating data using modern applications

Interrogating data using modern applications

I love working with data and with visualisations. I am often at my happiest when working with a large data set trying to find the answer to questions or trying to present data in a way that make it more accessible without losing its meaning.  In the UK, government departments are duty bound to publish figures about all manner of things from health and education to the economy and crime.  This data is generally available through the government data website.  Some data is available through specific sites and even allows API integration, the police maintain a data set accessible in this manner, data.police.uk.

Stop and Search data for Maidstone, Kent

Stop and Search data for the Maidstone, Kent area. Displayed using open source technologies

Goal

How could I use modern applications and infrastructure to ask questions of the policing data?

  • Provide a solution to analyse the policing data
  • Use Open Source software where ever possible
  • Use modern methods, if I have to install an operating system I’ve failed!

Data

As mentioned the data is available from data.police.uk. The site provides a comprehensive set of methods for how to interact with this publicly accessible API.  Each of the available data sets for logged street crime, outcomes, stop and search data can be programatically downloaded month on month as  a *.zip file;

https://data.police.uk/data/archive/[year]-[month].zip

Or any of the listed API methods can be used to access the data, for example to list out all stop and search activities undertaken in Avon and Somerset during January 2019, the API is;

https://data.police.uk/api/stops-force?force=avon-and-somerset&date=2019-01

To experiment with the various API calls available postman is a great utility.

To get started I created a custom request via this link, and pulled down all crime data for the last 18 months across all regional police forces.  This brought down a zip file that was just under a GB pre extraction and 3.32 GB post.  After extraction the data was stored by month, with each forces crime, stop and outcome data maintained within, across a total of 2500 files.

I don’t have the patience to sort and import 2500 files, so I needed a way to combine these into one file.

PowerShell File Operations

Copy Files to Working Directory

The first task I wanted to accomplish was sort through all the exported data types, crime, outcomes, stop and locate them together in one working directory.  The files already have a month included in the file name along with the data type, so I had no need to inspect inside the files and could achieve this in a straight forward manner through pattern matching.

#Script to copy data to working locations 
#$source is the path to the inflated PoliceData
$source = "path to source file"

#the following are the desired paths copy Street level, Stopand Search and Outcome inflated files to post pattern matching 
$streetdata = "path to source file/Street"
$searchdata = "path to source file/StopandSearch"
$outcomesdata = "path to source file/Outcomes"

#Pattern matching and copy
$folders = Get-ChildItem -Path $source
foreach ($folder in $folders) {
Get-ChildItem -Path $folder | Where-Object Name -like *street.csv | Copy-Item -Destination $streetdata
    }
foreach ($folder in $folders) {
    Get-ChildItem -Path $folder | Where-Object Name -like *search.csv | Copy-Item -Destination $searchdata
    }
foreach ($folder in $folders) {
    Get-ChildItem -Path $folder | Where-Object Name -like *outcomes.csv | Copy-Item -Destination $outcomesdata
   }

This script scans the location set in the source variable for the folder and filename patterns associated with the Police Data download for Street Crime, Outcomes and Stop and Search.

Based on pattern matching it copies files to a working directory so they can be merged ready for data import to a DB.

Merge all files to a single CSV

I already had some PowerShell I could use to merge each of the crime, outcome and stop data sets into a single CSV file, thanks to the work I did with VMware guestinfo.appinfo, which I could use here without any real change.

#Script to merge csv files 
#useful to create one import set to a DB, PowerBi or similar

$getFirstLine = $true
$timestamp = Get-Date -Format yy-MM-dd-hh-mm

#Specify path to the source and output directories.  This script will look in the source location and merge all found files to a single file in the output directory appending a timestamp to the file name
get-childItem "path to source" | ForEach-Object {
    $filePath = $_

    $lines = Get-Content $filePath  
    $linesToWrite = switch($getFirstLine) {
           $true  {$lines}
           $false {$lines | Select-Object -Skip 1}

    }

    $getFirstLine = $false
    Add-Content "path to output/-merge-$timestamp.csv" $linesToWrite
    }

This script merges all CSV files found within a source location and merges them into a single file ready for import to a DB.

Postgres Database to hold the extracted data

There are approximately 500,000 crimes reported every month across all regional police forces in the UK and the street-crime file created above had over 8.5 million rows of information.  If you have the misfortune to try opening this in excel, let me know how long it takes your machine to crash (excel has an upper row limit of a little over 1 million). Whilst we can extract this public data using the police public API, interrogating and querying this information from a CSV file is impractical. It is for this reason I’ve created and am using a postgres DB to hold the information.

Postgres and Docker

Postgres is available as a Docker image, which means I’ve no need to go looking for a server OS or to install the binaries within my desktop device. Pull the docker image.

docker pull postgres

Run postgres DB inside of a docker container, substituting the password for a secure value

docker run --name postgres -p 5432:5432 -e POSTGRES_PASSWORD=password -d postgres
Postgres DB management

There are a few options for DB management either via a graphical or command line interface, pgcli is available for the command line

brew install pgcli

Connecting via the command line with pgcli

pgcli -h localhost =p 5432 -U postgres -W -d dbname

If you prefer a graphical interface pgAdmin is available for graphical management.  With the tool available make a connection to the Docker hosted postgres DB.

The data I’m working with needs a home, so I created a blank DB to house that data.

CSVKIT – Or how to avoid creating tables by hand…

Csvkit is a suite of utilities for converting to and working with CSV, the king of tabular file formats.

How to install csvkit on a machine with python in place

csvkit can be installed by running the command below, assuming python is installed already alongside pip. Much better information can be found from this documentation

pip install csvkit
How to use it

csvkit includes the following resources;

csvclean csvgrep csvlook csvsql csvcut csvjoin csvpy csvstack csvformat csvjson csvsort csvstat

#For usage instructions of these scripts run --help

My primary usecase is to output the SQL query format required to support importing data held within a CSV file.

./csvsql "path to csv.csv" -d ,
Expected output
CREATE TABLE "tablename" (
 "Crime ID" VARCHAR NOT NULL,
 "Month" VARCHAR NOT NULL,
 "Reported by" VARCHAR NOT NULL,
 "Falls within" VARCHAR NOT NULL,
 "Longitude" DECIMAL,
 "Latitude" DECIMAL,
 "Location" VARCHAR NOT NULL,
 "LSOA code" VARCHAR,
 "LSOA name" VARCHAR,
 "Outcome type" VARCHAR NOT NULL
);

This query can be run against an existing DB to create the required table to support the CSV file.

Leaving a table that looks like this;

Importing Data

With the CSV files and the tables prepared we can import the data into the postgres DB.  Importing via the CLI or GUI is equally straightforward.

pgcli \copy [tablename] to/from [filename] | Copy data between a file and a table.

Metabase

Metabase is an open source business intelligence tool. It lets you ask questions about your data, and displays answers in formats that make sense, whether that’s a bar graph or a detailed table.  Think of it a PowerBI without the need for an Office 365 subscription!

I’m going to run Metabase inside of a local kubernetes cluster.  The Metabase installation for Kubernetes is provided via Helm.

Helm

Install helm via your favourite package manager I’m using a mac so will use homebrew e.g.

brew install helm

Post installation add a stable repository to Helm

helm repo add stable https://kubernetes-charts.storage.googleapis.com
Deploy metabase

By default, Metabase’s backend database (H2) is stored inside container, and will be lost after container restart. To avoid this create and edit the metabase-config.yaml to point to an available database. Further instruction is available here along with an example YAML file.  I’ve got the postgres DB holding the policing data which I can use to hold this configuration data.  The YAML for my installation edits the backend database section to point at my DB;

# Backend database
database:
  # Database type (h2 / mysql / postgres), default: h2
  type: postgres
  # encryptionKey: << YOUR ENCRYPTION KEY >>
  ## Only need when you use mysql / postgres
  host: 192.168.1.221
  port: 5432
  dbname: metabase
  username: postgres
  password: password

I need to create the blank database within postgres for the metabase installation to find and use.  I also need to create a namespace to use on my kubernetes cluster;

kubectl create namespace metabase

To install into that namespace pull configuration from the metabase-config.yaml file, I simply run.

helm install metabase -f metabase-config.yaml stable/metabase -n metabase

Follow the installation progress using watch and when all resources report as ready, use kubectl port-forward to access the metabase application.

On first connection Metabase will walk through creating an account on the containerised service and connecting to your data.  In the details below the IP address is the IP of my workstation, where the docker postgres DB is accessible on tcp port 5432.

Using Metabase

You work with metabase by asking it questions of the data, which can then be saved and added to dashboards alongside filters and other queries.  To build up a question simply click on the data you are interested in to build up the desired filter and then choose how to summarise and display the information.  For example in the query below, I’ve built up a filter on a crime type of ‘Burglary’ in January 2020.

I can then choose to summarise it by the reporting region and visualise the resultant data in a bar chart.

Burglaries by reporting police force in January 2020

Questions can be collated to create dashboards of information for a quick overview of multiple data sets.  For example below I have a dashboard highlighting all crimes reported by month, all crimes by reporting force, All reported outcomes by outcome type and all ‘no suspect’ or ‘unable to prosecute’ crimes by force.

 

Dashboard showing total reported crime by month, by reporting force, reported outcomes and outcomes with no suspect or no prosecution.

All of the presented information is click through and can be interrogated more, filtered in different ways and the data downloaded into common formats.

Summary

20 years ago I deployed a solution to allow the public interrogation of streetworks data, that needed ten or more servers to run Oracle DBs, ArcGIS platforms, Citrix and Web front ends in addition to power, cooling and switches. A decade ago to interrogate this type of data, I would need to deploy a few servers, still one for the backend DB, one for GIS data and another for the business analytics software.  Most recently this type of analysis has moved to the cloud, with services like PowerBI, Tableau, Cloud DBs and subscription mapping data.  However, any barrier to investigating these data sets has been removed, with the continuing rise in containerised computing.

I set myself the task of using modern applications and to stick with open source technologies in order to interrogate publicly available large data sets.  With little more than my laptop, open source software, some recycled scripts, docker and kubernetes. I’ve been able to build a simple repeatable set of methods, that can be used to visualise and interrogate any public data set.

In the UK links to many public data sources can be found here.  Other notes and resources can be found at the following Github repository

Thanks

Simon