Overview
PostGIS is the extension to PostgreSQL that provides for graphical information services objects to be utilized by the database. (PostGIS is pretty similar to what you get from Oracle Spatial.) The geocoding is supported by a PostGIS "extra" called Tiger Geocoder (it is in fact the only "extra"). Tiger is a product of the United States Census Bureau.
For PostgreSQL, the Tiger Geocoder is maintained in two schemas: tiger
contains the code and the skeletal data structures. tiger_data
has the
national and state data which is downloaded from the Census.
Version
Recommended: 10.x
Setup
Ensure that your PG install is clean; perhaps best to re-install
Boxen
# Stop any running PG; may want to do this through boxen-services
pg_ctl -D /opt/boxen/homebrew/var/postgres stop
# Remove the data directory
rm -fR /opt/boxen/homebrew/var/postgres/*
# Remove any old PostGIS and PostgreSQL
brew unlink postgis
brew unlink postgresql
brew uninstall postgis
brew uninstall postgresql
Brew
For Iora with ChirpStrap,
brew services stop postgresql@9.4
brew uninstall postgresql@9.4
rm -fR /usr/local/var/postgresql\@9.4
brew list | grep postgres # expect see nothing
Install PostGIS
brew install postgresql
brew services start postgresql
# to verify your install
psql postgres
\q
# It can be handy to create a database with the name of your user so
# you can just type `psql`:
psql -d postgres -c "create database ${USER}"
brew install postgis
Install the extensions
For the official instructions, see http://postgis.net/docs/postgis_installation.html#install_tiger_geocoder_extension
# Define a name for your database
export DB=icisstaff_development
psql -d $DB
create extension postgis;
create extension postgis_topology;
create extension fuzzystrmatch;
create extension address_standardizer;
create extension postgis_tiger_geocoder;
Verify the address standardizer
select num, street, city, state, zip
from parse_address('1 Devonshire Place, Boston, MA 02109');
should produce a breakdown of street, city, state, zip:
num | street | city | state | zip |
---|---|---|---|---|
1 | Devonshire Place | Boston | MA | 02109 |
Should also be able to . . .
select na.address, na.streetname,na.streettypeabbrev, na.zip
from normalize_address('1 Devonshire Place, Boston, MA 02109') as na;
which produces:
address | streetname | streettypeabbrev | zip |
---|---|---|---|
1 | Devonshire | Pl | 02109 |
Load from the Census
Set up your defaults for generated loader scripts
Make a directory for your Tiger "staging" data (the raw files downloaded from the Census):
mkdir -p $HOME/gisdata/temp
When the loader scripts are generated, Tiger consults the
tiger.loader_variables
and tiger.loader_platform
tables. First set up
tiger.loader_variables
so it knows about the "staging" folder:
psql -d $DB -c "update tiger.loader_variables set staging_fold = '$HOME/gisdata'"
Now you will create a row in tiger.loader_platform
that fits better with
the Brew PostgreSQL setup. The key for these rows is called os
-- create
one where os
is the name of the database.
psql -d $DB -X -c "delete from tiger.loader_platform where os = '$DB'"
psql -d $DB <<EOF
insert into tiger.loader_platform(os, declare_sect, pgbin, wget, unzip_command, psql, path_sep,
loader, environ_set_command, county_process_command)
select '$DB',
E'TMPDIR="\${staging_fold}/temp/"\nUNZIPTOOL=unzip\nWGETTOOL=wget\nPSQL=\'/usr/local/bin/psql -d ${DB} \'\nSHP2PGSQL=shp2pgsql\ncd \${staging_fold}\n',
pgbin, wget, unzip_command, psql, path_sep,
loader, environ_set_command, county_process_command
FROM tiger.loader_platform
WHERE os = 'sh';
EOF
Generate the scripts to load the geocoder data
Generate the national data, on which state data depends:
psql -d $DB -X -c "SELECT Loader_Generate_Nation_Script('$DB')" -tA > $HOME/gisdata/nation_script_load.sh
Generate the data for the states you want -- here, Massachusetts and Minnesota:
psql -d $DB -X -c "SELECT Loader_Generate_Script(ARRAY['MA', 'MN'], '$DB')" -tA > $HOME/gisdata/ma_mn_states_load.sh
It is tempting to try to do this for all states . . .
psql -d $DB -X -c "SELECT Loader_Generate_Script(ARRAY['AL' , 'AK' , 'AS' , 'AZ' , 'AR' , 'CA' , 'CO' , 'MP' , 'CT' , 'DE' , 'DC' , 'FL' , 'GA' , 'GU' , 'HI' , 'ID' , 'IL' , 'IN' , 'IA' , 'KS' , 'KY' , 'LA' , 'ME' , 'MD' , 'MA' , 'MI' , 'MN' , 'MS' , 'MO' , 'MT' , 'NE' , 'NV' , 'NH' , 'NJ' , 'NM' , 'NY' , 'NC' , 'ND' , 'OH' , 'OK' , 'OR' , 'PA' , 'PR' , 'RI' , 'SC' , 'SD' , 'TN' , 'TX' , 'VI' , 'UT' , 'VT' , 'VA' , 'WA' , 'WV' , 'WI' , 'WY'], '$DB')" -tA > $HOME/gisdata/all_states_load.sh
But experience suggests that you should load each state separately, so that if you have to, you can re-run a state:
states=(AL AK AS AZ AR CA CO MP CT DE DC FL GA GU HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA PR RI SC SD TN TX VI UT VT VA WA WV WI WY)
for state in "${states[@]}"
do
psql -d $DB -X -c "SELECT Loader_Generate_Script(ARRAY['${state}'], '$DB')" -tA > $HOME/gisdata/${state}_state_load.sh
done
Make the scripts runnable . . .
chmod +x $HOME/gisdata/*.sh
Generate
Run the loader scripts.
cd $HOME/gisdata
./nation_script_load.sh
./AZ_state_load.sh
./MN_state_load.sh
# etc.
Now find lat/long for an address you know in one of the states
select
g.rating,
st_y(g.geomout) as lat,
st_x(g.geomout) as lon,
(addy).address as stno,
(addy).streetname as street,
(addy).streettypeabbrev as styp,
(addy).location as city,
(addy).stateabbrev as st,
(addy).zip
from geocode(normalize_address('931 Portland Ave, Saint Paul, MN'), 1) as g;
If the data load has run correctly, you should see:
rating | lat | lon | stno | street | styp | city | st | zip |
---|---|---|---|---|---|---|---|---|
9 | 44.9429540910341 | -93.1394249137542 | 931 | Portland | Ave | St. Paul | MN | 55104 |
Sample queries
Here are a few ways to leverage the geocoding.
Basics
The key functions are
-
normalize_address
- http://postgis.net/docs/Normalize_Address.html - Converts a messy address to a uniform one broken down into its parts; can then be used as input to geocode. NOTE: Does not require the downloaded Census data. -
geocode
- http://postgis.net/docs/Geocode.html - Converts an address to mapping data. Can take as input a string or the result fromnormalize_address
. One thing I don't discuss here is that it is possible to constrain the results with a "geometry filter," so that if the input string does not have city and state, you can make the lookup happen within some city and state.-
st_x
andst_y
- http://postgis.net/docs/ST_X.html - Get latitude or longitude from "point" data. (These are actually from postGIS, not the Tiger extension.)
-
-
pprint_addy
- http://postgis.net/docs/Pprint_Addy.html - Pretty print an address fromnormalize_address
. Does not require the downloaded Census data. -
reverse_geocode
- http://postgis.net/docs/Reverse_Geocode.html - Find an address for a lat/long location. (Not discussed here.) -
st_distance_sphere
- http://postgis.net/docs/manual-1.5/ST_Distance_Sphere.html - Compute distance as the bird flies. (Not discussed here.)
The following examples use a sample table like this:
drop table if exists people;
create table people (pid int, first text, last text, address text);
insert into people (pid, first, last, address) values (1, 'John', 'Norman', '931 Portland, Saint Paul, MN');
insert into people (pid, first, last, address) values (2, 'Idan', 'Ben-Arieh', '6 valley rd, Natick MA 01760');
insert into people (pid, first, last, address) values (3, 'Jane', 'Doe', '3048 W San Juan Drive, Tucson AZ 85713');
insert into people (pid, first, last, address) values (4, 'Doug', 'Doucey', '400 W Girley, Prescott, AZ');
First off, many of the PostGIS functions return PostgreSQL "composite types": to destructure, you enclose the returned composite in parentheses, like so:
select
address,
(normalize_address(address)).streetname,
(normalize_address(address)).zip
from people;
Notice that the normalize_address
function is being called twice. Ideally
we would like to call it once. A common pattern you will see is some kind
of renamed query result or subquery:
select address, (na).streetname, (na).zip from
(select
address,
normalize_address(address) as na
from people) people_with_normalized_address;
Another example of destructuring, with geocoding:
select
address,
st_y((g).geomout) as latitude,
st_x((g).geomout) as longitude,
(g).rating,
pprint_addy((g).addy) as normalized_address
from
(select
address,
geocode(normalize_address(address),1) as g
from people) people_with_geocoded_address;
Select the original data, augmenting with additional geo columns
Aside: I imported this data from csv with a neat tool called pgfutter
: https://github.com/lukasmartinelli/pgfutter.
Here the table with the original data is import.patients
with a primary/unique key of patient_demographics_patient_guid
and the original address in patient_demographics_address
.
select import.patients.*,
geocoded.latitude as Latitude,
geocoded.longitude as Longitude,
geocoded.rating as "Geocode Rating",
geocoded.normalized_address as "Normalized Address"
from
import.patients
left join
(select
patient_demographics_patient_guid,
st_y((geo).geomout) as latitude,
st_x((geo).geomout) as longitude,
(geo).rating,
pprint_addy((geo).addy) as normalized_address
from (select
patient_demographics_patient_guid,
(geocode(normalize_address(patient_demographics_address),1)) as geo
from import.patients) addresses) geocoded
on geocoded.patient_demographics_patient_guid = import.patients.patient_demographics_patient_guid;
(On my 2014-era MBP, this takes about 0.16 seconds / row; about 015 seconds
with an index on patient_demographics_patient_guid
-- create index on import.patients (patient_demographics_patient_guid);
.)
Add columns for geocoding, and only geocode when never geocoded
Finally, in this example, we add columns to a table without geocoding.
Then we only geocode rows where the rating
column is null. This would be
a nice overnight task to get geocoded all new rows that have never been
geocoded.
alter table import.patients
add column latitude double precision,
add column longitude double precision,
add column rating integer,
add column normalized_address varchar;
(To get the data types, I used the pg_typeof
column on a query that
returned the different items.)
And here's the query (this is verbatim from the documentation).
Note the use of index 3
-- this is worth eyeballing before running on all
rows.
update import.patients
set (latitude, longitude, rating, normalized_address)
= (
st_y((g.geo).geomout),
st_x((g.geo).geomout),
coalesce((g.geo).rating,-1),
pprint_addy((g.geo).addy)
)
from (select patient_demographics_patient_guid
from import.patients
where rating is null order by patient_demographics_patient_guid limit 3) as a
left join (select patient_demographics_patient_guid, (geocode(patient_demographics_address,1)) as geo
from import.patients As ag
where ag.rating is null order by patient_demographics_patient_guid limit 3) As g on a.patient_demographics_patient_guid = g.patient_demographics_patient_guid
where a.patient_demographics_patient_guid = import.patients.patient_demographics_patient_guid;
(About 0.20 seconds/row without indexing.)
Visualization
Here's a quick way to get a visualization on your laptop.
First, install R.
brew tap homebrew/science
brew install R
Export your geocoded table to a CSV.
psql -c "copy import.patients to '$HOME/Desktop/patients.csv' with (format csv, header)"
In R,
install.packages('ggmap')
library(ggmap)
patients <- read.csv(file="/Users/jgn/Desktop/patients.csv", header=TRUE)
The fastest way to get started with ggmap
is to use the qmplot
function.
My advice is to read the docs and play around.
qmplot(longitude, latitude, data=patients, source='stamen')
qmplot(longitude, latitude, data=patients, source='google', maptype = 'roadmap')
qmplot(longitude, latitude, data=patients, source='google', maptype = 'roadmap', zoom=10)
qmplot(longitude, latitude, data=patients, source='google', maptype = 'roadmap', geom = 'density2d')
qmplot(longitude, latitude, data=patients, source='google', maptype = 'roadmap', alpha=I(0.5), zoom=10, geom='density2d')