Overview

PostGIS is the extension to PostgreSQL that provides for graphical information services objects to be utilized by the database. (PostGIS is pretty similar to what you get from Oracle Spatial.) The geocoding is supported by a PostGIS "extra" called Tiger Geocoder (it is in fact the only "extra"). Tiger is a product of the United States Census Bureau.

For PostgreSQL, the Tiger Geocoder is maintained in two schemas: tiger contains the code and the skeletal data structures. tiger_data has the national and state data which is downloaded from the Census.

Version

Recommended: 10.x

Setup

Ensure that your PG install is clean; perhaps best to re-install

Boxen

# Stop any running PG; may want to do this through boxen-services
pg_ctl -D /opt/boxen/homebrew/var/postgres stop

# Remove the data directory
rm -fR /opt/boxen/homebrew/var/postgres/*

# Remove any old PostGIS and PostgreSQL
brew unlink postgis
brew unlink postgresql
brew uninstall postgis
brew uninstall postgresql

Brew

For Iora with ChirpStrap,

brew services stop postgresql@9.4
brew uninstall postgresql@9.4
rm -fR /usr/local/var/postgresql\@9.4
brew list | grep postgres     # expect see nothing

Install PostGIS

brew install postgresql
brew services start postgresql

# to verify your install
psql postgres
\q

# It can be handy to create a database with the name of your user so
# you can just type `psql`:
psql -d postgres -c "create database ${USER}"

brew install postgis

Install the extensions

For the official instructions, see http://postgis.net/docs/postgis_installation.html#install_tiger_geocoder_extension

# Define a name for your database
export DB=icisstaff_development

psql -d $DB
  create extension postgis;
  create extension postgis_topology;
  create extension fuzzystrmatch;
  create extension address_standardizer;
  create extension postgis_tiger_geocoder;

Verify the address standardizer

select num, street, city, state, zip
 from parse_address('1 Devonshire Place, Boston, MA 02109');

should produce a breakdown of street, city, state, zip:

num street city state zip
1 Devonshire Place Boston MA 02109

Should also be able to . . .

select na.address, na.streetname,na.streettypeabbrev, na.zip
	from normalize_address('1 Devonshire Place, Boston, MA 02109') as na;

which produces:

address streetname streettypeabbrev zip
1 Devonshire Pl 02109

Load from the Census

Set up your defaults for generated loader scripts

Make a directory for your Tiger "staging" data (the raw files downloaded from the Census):

mkdir -p $HOME/gisdata/temp

When the loader scripts are generated, Tiger consults the tiger.loader_variables and tiger.loader_platform tables. First set up tiger.loader_variables so it knows about the "staging" folder:

psql -d $DB -c "update tiger.loader_variables set staging_fold = '$HOME/gisdata'"

Now you will create a row in tiger.loader_platform that fits better with the Brew PostgreSQL setup. The key for these rows is called os -- create one where os is the name of the database.

psql -d $DB -X -c "delete from tiger.loader_platform where os = '$DB'"
psql -d $DB <<EOF
insert into tiger.loader_platform(os, declare_sect, pgbin, wget, unzip_command, psql, path_sep,
		   loader, environ_set_command, county_process_command)
select '$DB',
  E'TMPDIR="\${staging_fold}/temp/"\nUNZIPTOOL=unzip\nWGETTOOL=wget\nPSQL=\'/usr/local/bin/psql -d ${DB} \'\nSHP2PGSQL=shp2pgsql\ncd \${staging_fold}\n',
  pgbin, wget, unzip_command, psql, path_sep,
  loader, environ_set_command, county_process_command
  FROM tiger.loader_platform
  WHERE os = 'sh';
EOF

Generate the scripts to load the geocoder data

Generate the national data, on which state data depends:

psql -d $DB -X -c "SELECT Loader_Generate_Nation_Script('$DB')" -tA > $HOME/gisdata/nation_script_load.sh

Generate the data for the states you want -- here, Massachusetts and Minnesota:

psql -d $DB -X -c "SELECT Loader_Generate_Script(ARRAY['MA', 'MN'], '$DB')" -tA > $HOME/gisdata/ma_mn_states_load.sh

It is tempting to try to do this for all states . . .

psql -d $DB -X -c "SELECT Loader_Generate_Script(ARRAY['AL' , 'AK' , 'AS' , 'AZ' , 'AR' , 'CA' , 'CO' , 'MP' , 'CT' , 'DE' , 'DC' , 'FL' , 'GA' , 'GU' , 'HI' , 'ID' , 'IL' , 'IN' , 'IA' , 'KS' , 'KY' , 'LA' , 'ME' , 'MD' , 'MA' , 'MI' , 'MN' , 'MS' , 'MO' , 'MT' , 'NE' , 'NV' , 'NH' , 'NJ' , 'NM' , 'NY' , 'NC' , 'ND' , 'OH' , 'OK' , 'OR' , 'PA' , 'PR' , 'RI' , 'SC' , 'SD' , 'TN' , 'TX' , 'VI' , 'UT' , 'VT' , 'VA' , 'WA' , 'WV' , 'WI' , 'WY'], '$DB')" -tA > $HOME/gisdata/all_states_load.sh

But experience suggests that you should load each state separately, so that if you have to, you can re-run a state:

states=(AL AK AS AZ AR CA CO MP CT DE DC FL GA GU HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA PR RI SC SD TN TX VI UT VT VA WA WV WI WY)
for state in "${states[@]}"
do
  psql -d $DB -X -c "SELECT Loader_Generate_Script(ARRAY['${state}'], '$DB')" -tA > $HOME/gisdata/${state}_state_load.sh
done

Make the scripts runnable . . .

chmod +x $HOME/gisdata/*.sh

Generate

Run the loader scripts.

cd $HOME/gisdata
./nation_script_load.sh
./AZ_state_load.sh
./MN_state_load.sh
# etc.

Now find lat/long for an address you know in one of the states

select
	g.rating,
	st_y(g.geomout) as lat,
	st_x(g.geomout) as lon,
	(addy).address as stno,
	(addy).streetname as street,
	(addy).streettypeabbrev as styp,
	(addy).location as city,
	(addy).stateabbrev as st,
	(addy).zip
from geocode(normalize_address('931 Portland Ave, Saint Paul, MN'), 1) as g;

If the data load has run correctly, you should see:

rating lat lon stno street styp city st zip
9 44.9429540910341 -93.1394249137542 931 Portland Ave St. Paul MN 55104

Sample queries

Here are a few ways to leverage the geocoding.

Basics

The key functions are

The following examples use a sample table like this:

drop table if exists people;
create table people (pid int, first text, last text, address text);
insert into people (pid, first, last, address) values (1, 'John', 'Norman', '931 Portland, Saint Paul, MN');
insert into people (pid, first, last, address) values (2, 'Idan', 'Ben-Arieh', '6 valley rd, Natick MA 01760');
insert into people (pid, first, last, address) values (3, 'Jane', 'Doe', '3048 W San Juan Drive, Tucson AZ 85713');
insert into people (pid, first, last, address) values (4, 'Doug', 'Doucey', '400 W Girley, Prescott, AZ');

First off, many of the PostGIS functions return PostgreSQL "composite types": to destructure, you enclose the returned composite in parentheses, like so:

select
  address,
  (normalize_address(address)).streetname,
  (normalize_address(address)).zip
from people;

Notice that the normalize_address function is being called twice. Ideally we would like to call it once. A common pattern you will see is some kind of renamed query result or subquery:

select address, (na).streetname, (na).zip from
  (select
    address,
    normalize_address(address) as na
   from people) people_with_normalized_address;

Another example of destructuring, with geocoding:

select
  address,
  st_y((g).geomout) as latitude,
  st_x((g).geomout) as longitude,
  (g).rating,
  pprint_addy((g).addy) as normalized_address
from
  (select
    address,
    geocode(normalize_address(address),1) as g
   from people) people_with_geocoded_address;

Select the original data, augmenting with additional geo columns

Aside: I imported this data from csv with a neat tool called pgfutter: https://github.com/lukasmartinelli/pgfutter.

Here the table with the original data is import.patients with a primary/unique key of patient_demographics_patient_guid and the original address in patient_demographics_address.

select import.patients.*,
  geocoded.latitude as Latitude,
  geocoded.longitude as Longitude,
  geocoded.rating as "Geocode Rating",
  geocoded.normalized_address as "Normalized Address"
  from
    import.patients
      left join
        (select
      	   patient_demographics_patient_guid,
      	   st_y((geo).geomout) as latitude,
      	   st_x((geo).geomout) as longitude,
      	   (geo).rating,
      	   pprint_addy((geo).addy) as normalized_address
      	 from (select
      		  patient_demographics_patient_guid,
      		  (geocode(normalize_address(patient_demographics_address),1)) as geo
      	    from import.patients) addresses) geocoded
      on geocoded.patient_demographics_patient_guid = import.patients.patient_demographics_patient_guid;

(On my 2014-era MBP, this takes about 0.16 seconds / row; about 015 seconds with an index on patient_demographics_patient_guid -- create index on import.patients (patient_demographics_patient_guid);.)

Add columns for geocoding, and only geocode when never geocoded

Finally, in this example, we add columns to a table without geocoding. Then we only geocode rows where the rating column is null. This would be a nice overnight task to get geocoded all new rows that have never been geocoded.

alter table import.patients
  add column latitude double precision,
  add column longitude double precision,
  add column rating integer,
  add column normalized_address varchar;

(To get the data types, I used the pg_typeof column on a query that returned the different items.)

And here's the query (this is verbatim from the documentation).

Note the use of index 3 -- this is worth eyeballing before running on all rows.

update import.patients
  set (latitude, longitude, rating, normalized_address)
    = (
        st_y((g.geo).geomout),
        st_x((g.geo).geomout),
        coalesce((g.geo).rating,-1),
        pprint_addy((g.geo).addy)
      )
from (select patient_demographics_patient_guid
      from import.patients
      where rating is null order by patient_demographics_patient_guid limit 3) as a
    left join (select patient_demographics_patient_guid, (geocode(patient_demographics_address,1)) as geo
    from import.patients As ag
    where ag.rating is null order by patient_demographics_patient_guid limit 3) As g on a.patient_demographics_patient_guid = g.patient_demographics_patient_guid
where a.patient_demographics_patient_guid = import.patients.patient_demographics_patient_guid;

(About 0.20 seconds/row without indexing.)


Visualization

Here's a quick way to get a visualization on your laptop.

First, install R.

brew tap homebrew/science
brew install R

Export your geocoded table to a CSV.

psql -c "copy import.patients to '$HOME/Desktop/patients.csv' with (format csv, header)"

In R,

install.packages('ggmap')
library(ggmap)
patients <- read.csv(file="/Users/jgn/Desktop/patients.csv", header=TRUE)

The fastest way to get started with ggmap is to use the qmplot function. My advice is to read the docs and play around.

qmplot(longitude, latitude, data=patients, source='stamen')
qmplot(longitude, latitude, data=patients, source='google', maptype = 'roadmap')
qmplot(longitude, latitude, data=patients, source='google', maptype = 'roadmap', zoom=10)
qmplot(longitude, latitude, data=patients, source='google', maptype = 'roadmap', geom = 'density2d')
qmplot(longitude, latitude, data=patients, source='google', maptype = 'roadmap', alpha=I(0.5), zoom=10, geom='density2d')