Center for Geographic Analysis Harvard University
Additional navigation

You are here

Geocoding and producing Census FIPS codes for 53 million addresses: A combination approach using ArcGIS and PostgreSQL

For a recent project we had to geocode 53.1 million addresses (in CSV format), and determine the 2000 and 2010 U.S. Census block group FIPS codes for each address. The Dell PC used for all processing was running 64-bit Windows 7 Enterprise SP1 operating system machine with:
i7-2600 CPU @ 3.40GHz *2 and 16 GB of RAM.

First, the CSV file was geocoded using the ESRI Business Analyst 2013 for Desktop USA Geocoding Service address locator, using ArcGIS 10.2 software. The processing took about 106 hours (4.5 days) to run. Results were saved into a file geodatabase that ended up being 12 GB in size. The geocoding rate started out at 10,000 per hour, but then steadily increased to as much as 900,000 per hour, averaging about 500,000 per hour for the duration. 99.4% of the addresses were matched.

Next, determining the census block group 2000 and 2010 FIPS codes for each address was necessary. ArcGIS has several ways to accomplish this, including the tools spatial join, identity, and intersect. The 2000 and 2010 block group polygon boundaries were copied from the ESRI Data and Maps datasets into the file geodatabase holding the geocoded addresses. Each block group file has just over 200,000 polygons. The spatial join, identity, and intersect tools were run on the addresses and 2000 block groups, and each time ArcGIS threw an "Out of Memory" error. Having run into this error before (ArcGIS running out of memory when attempting analyses with datasets numbering in the millions), a simple python script (included in "Python Script" section below) was written to process a subset of the full dataset; each subset represented a single US State. Although this approach was extremely easy to implement, it was not so efficient due to the long processing time of roughly 20 hours of processing time per state.

Therefore we turned to a solution using the PostgreSQL 9.2 enterprise level database able to do spatial analyses using the PostGIS 2.1 extension. Importing the addresses and census block group polygons into PostgreSQL took 30 hours. Processing the block group 2000 and 2010 determination for all 53.1 million address locations using the PostGIS "Intersects" command took 50 hours. See the "PostgreSQL workflow" section below, where each command is listed.

Conclusions:

  • While ArcGIS 10.2 was capable of geocoding 53.1 million addresses, it ran out of memory when trying to perform point in polygon determination both with and without the use of scripting in a file geodatabase format. This is not to say this task is impossible in ArcGIS. Some other things to try would be 1) Using a machine with more RAM, 2) Trying different python scripts in a Hadoop framework, 3) Explore using Python multithreading capability, Connect to the data in PostgresSQL from ArcGIS.
  • PostgreSQL with the PostGIS extension was able to efficiently handle this large data analysis task.
  • Having the access to and ability to use multiple GIS is often beneficial when tackling difficult spatial analyses.

Python Script:

import arcpy
from arcpy import env
env.overwriteOutput = True
env.workspace = r"~\geo_fw1.gdb"
arcpy.MakeFeatureLayer_management("z01_diff", "test_lyr")
arcpy.MakeFeatureLayer_management("blkgrp2000", "blkgrp2000_lyr")
arcpy.MakeFeatureLayer_management("blkgrp2010", "blkgrp2010_lyr")
f = open('workfile_5.txt', 'w')
f.write("ID,"+ "blk2000," + "blk2010\n")
for a in range(1, 662690):
query = "OBJECTID = " + str(a)
arcpy.SelectLayerByAttribute_management("test_lyr", "NEW_SELECTION", query)
arcpy.SelectLayerByLocation_management ("blkgrp2000_lyr", "INTERSECT", "test_lyr")
arcpy.SelectLayerByLocation_management ("blkgrp2010_lyr", "INTERSECT", "test_lyr")
with arcpy.da.SearchCursor("blkgrp2000_lyr", ("FIPS")) as cursor:
for row in cursor:
blk2000 = row[0]
with arcpy.da.SearchCursor("blkgrp2010_lyr", ("FIPS")) as cursor:
for row in cursor:
blk2010 = row[0]
f.write(str(a) + "," + blk2000 + "," + blk2010 + "\n")
f.close()

PostgreSQL workflow:

#The following part is done in terminal
#import addresses, and census block groups from file geodatabase into postgres
ogr2ogr -overwrite -f "PostgreSQL" PG:"host=localhost user=postgres dbname=mypgdb password=postgres" "~/53Milliondata.gdb" "Geocoding_Result"
ogr2ogr -overwrite -f "PostgreSQL" PG:"host=localhost user=postgres dbname=mypgdb password=postgres" "~/Census.gdb" "blkgrp2010"
ogr2ogr -overwrite -f "PostgreSQL" PG:"host=localhost user=postgres dbname=mypgdb password=postgres" "~/Census.gdb" "blkgrp2000"

#connect to the database
#create (spatial) index (optional as importing will automatically build index) to increase performance
CREATE INDEX geocoding_result_gist
ON geocoding_result
USING GIST (wkb_geometry);

#Intersection for 2010 boundary
CREATE TABLE geofips AS
SELECT geocoding_result.*,blkgrp2010.fips
FROM geocoding_result,blkgrp2010
WHERE ST_Intersects(geocoding_result.wkb_geometry,blkgrp2010.wkb_geometry);

#Rename new field "fips" to fips 2010
ALTER TABLE geofipsall RENAME COLUMN fips TO fips2010;

#Intersection for 2000 boundary
CREATE TABLE geofipsall AS
SELECT geofips.*,blkgrp2000.fips
FROM geofips,blkgrp2000
WHERE ST_Intersects(geofips.wkb_geometry,blkgrp2000.wkb_geometry);

#Rename new field "fips" to fips 2000
ALTER TABLE geofipsall RENAME COLUMN fips TO fips2000;

#Add Column fips2010, fips2000 to geocoding_result
ALTER TABLE geocoding_result ADD COLUMN fips2010 varchar(12);
ALTER TABLE geocoding_result ADD COLUMN fips2000 varchar(12);

#Select non-geocoded results
CREATE TABLE Ustatus AS
SELECT * FROM geocoding_result WHERE status='U';

#Union two tables together
CREATE TABLE geocoding_final_result AS (
SELECT * FROM geofipsall
UNION
SELECT * FROM ustatus);

#export result to csv
COPY geocoding_final_result(loc_name,status,x,y,match_addr,altpatid,address,city,state,zip_code,fips2010,fips2000)
TO 'C:\Temp\geocoding_final_result_new.csv' DELIMITER ',' CSV HEADER;

#export a sample result for reviewing
COPY (SELECT loc_name,status,x,y,match_addr,altpatid,address,city,state,zip_code,fips2010,fips2000
FROM geocoding_final_result LIMIT 1000)
TO 'C:\Temp\geocoding_final_result_1000.csv' DELIMITER ',' CSV HEADER;

#end

Geocoding by Jeff Blossom, PostgresSQL by Zhenyang Hua, Python by Giovanni Zambotti.