ArcUser Online
 

October - December 2007

Improving Geocoding for Unusual Road Names
Florida county resolves unique road name issues using the Geocoding Development Kit
By Jay L. Johnson, GISP, Tallahassee-Leon County GIS

Summary

Anyone who works with geocoding quickly discovers that not all addresses are created equal. Unique local road names, unusual name styles, and nonstandardized road types can all cause the ArcGIS geocoding process to produce poor results. Results can be greatly improved by customizing the process to account for local idiosyncrasies. Esri's Geocoding Development Kit (GDK) provides the tools to make these customizations.

Editing a classification file. Use semicolons to comment out street types not used locally, such as JUNCTION, and insert local street-type keywords not included with the Esri default file, such as KNOLL.

Leon County is home to Florida's capital city, and its rich history is reflected in the area's successive inhabitants, from the Apalachee Indians to the Spanish, French, and English who occupied it in turn. Nowhere is this heritage as vivid as in Leon County's legacy of unique local road names. Roads such as Chowkeebin Nene, Calle de Santos Road, Rue de Lafitte, and Old St. Augustine Road are a testament to the imprint these cultures have left on the county. Overlaying this diverse history is the footprint of ongoing development, whose architects often have vivid imaginations when it comes to street naming.

This melting pot of road names, however romantic, poses practical problems when geocoding addresses. There are several distinct types of road names for which the default Esri geocoding process will frequently fail:

  • Road names with directional keywords in the street name field (e.g., North by Northwest Rd).
  • Streets where Saint is abbreviated as ST in the street name field (e.g., Old St Augustine Rd).
  • Road names containing street-type keywords in the street name field rather than in the street-type field (e.g., Hill Lane). This is a particular problem with foreign road names, where the foreign street type has often been mistakenly classified as a part of the road name within the official master address database (e.g., Avenida de la Luna).
  • Road names with nonstandard street types (e.g., Chimney Swift Hollow, where Hollow is a street type).

Esri's geocoding process relies heavily on the recognition of keywords to properly parse an address into different parts called tokens. The objective of the Leon County Geocoding Development Kit (GDK) customizations was to improve this parsing for the addresses within the county's master address database. These known "good" addresses are the official source of parcel site addresses for Leon County and are assigned by Leon County's Growth Management Department as mandated by local ordinance. Theoretically, 100 percent of the addresses in the master address database should geocode properly against a corresponding parcel-based locator because they are both based on the same data.

A secondary objective of the GDK customizations was to improve recognition of frequently misspelled street names. The exotic spellings of some local roads make misspellings a virtual certainty (e.g., Yashuntafun Rd). Esri geocoding uses a soundex algorithm to match input addresses against a locator service. This is not infallible. Certain commonly observed mistakes in source addresses, including the confusion of prefix direction with suffix direction and some common misspellings, are beyond the ability of the soundex algorithm to handle successfully.
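To see why a phonetic match tolerates some misspellings but not others, consider a minimal sketch of the classic American Soundex code. This is an illustration only: the exact variant and scoring used inside ArcGIS are not described in this article, and the function name `soundex` and the misspellings below are made up for the example.

```python
def soundex(name: str) -> str:
    """Return the four-character American Soundex code for a name."""
    # Map each coded consonant to its Soundex digit.
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for letter in group:
            codes[letter] = digit

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]
    previous = codes.get(letters[0], "")
    for letter in letters[1:]:
        digit = codes.get(letter, "")
        # Adjacent letters with the same code are recorded only once.
        if digit and digit != previous:
            result += digit
        # H and W do not separate same-coded letters; vowels do.
        if letter not in "HW":
            previous = digit
    # Pad with zeros and keep the first letter plus three digits.
    return (result + "000")[:4]


# A swapped vowel leaves the code unchanged, so the names still match.
same = soundex("Yashuntafun") == soundex("Yashuntifun")
# A dropped consonant changes the code, so the names no longer match.
different = soundex("Yashuntafun") != soundex("Yashutafun")
print(same, different)
```

Because the code depends only on the first letter and the first few consonants, errors such as a dropped consonant or a wrong first letter fall outside what this kind of matching can absorb.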

Project Methodology

How ArcGIS parsed an address before and after implementing the pattern file edits

The GDK was downloaded from Esri's Developer Network (EDN) Web site. The kit contains everything a user needs to customize the geocoding process, including the Geocoding Rule Base Developer Guide, the geocoding rule bases of the current release of ArcGIS, an interactive standardizer (STANEDIT.EXE) that is used for syntax checking and debugging of standardization pattern rules, and the standardizer pattern rule encryption program (ENCODEPAT.EXE). After installing the GDK, modifications were made to the classification file, us_addr.cls, and the pattern file, us_addr.pat. These files support geocoding locators for several of the most commonly used U.S.-style addresses (U.S. Streets, U.S. One Range, and U.S. One Address locators).

In general, changes to the classification file were made to

  • Remove street-type keywords that are not used as street types in Leon County but which may be present in the street name (e.g., JUNCTION).
  • Add unique local street-type keywords (e.g., KNOLL).
  • Provide limited correction of common misspellings (e.g., Vetrans for Veterans).
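The first two kinds of edit can be sketched in a few lines of Python. This is a rough illustration, not the procedure used in the project: it assumes each active line of the classification file begins with the keyword followed by whitespace-separated fields, and that a leading semicolon comments a line out. The helper name `customize_classification` and the sample lines are hypothetical; consult the Geocoding Rule Base Developer Guide for the actual file layout before editing us_addr.cls.

```python
def customize_classification(lines, remove_keywords, add_entries):
    """Comment out unwanted keywords and append local entries.

    lines           -- text lines of a classification file
    remove_keywords -- keywords to disable by prefixing the line with ';'
    add_entries     -- complete lines to append; the caller is responsible
                       for writing them in the file's real layout
    """
    remove = {keyword.upper() for keyword in remove_keywords}
    edited = []
    for line in lines:
        fields = line.split()
        is_comment = line.lstrip().startswith(";")
        # Assumption: the first field of an active line is the keyword.
        if fields and not is_comment and fields[0].upper() in remove:
            edited.append(";" + line)
        else:
            edited.append(line)
    edited.extend(add_entries)
    return edited


# Hypothetical sample lines; the real file's columns may differ.
sample = [
    "; street types",
    "JUNCTION JCT T",
    "LANE LN T",
]
result = customize_classification(sample, ["junction"], ["KNOLL KNL T"])
for line in result:
    print(line)
```

Commenting a line out rather than deleting it keeps the default entry visible, which makes it easier to restore a keyword later if it turns out to be needed.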

Next, changes to the pattern file provided improvements in road-specific pattern recognition. Certain road names in Leon County are intrinsically baffling to the default Esri geocoding routines (e.g., North by Northwest Rd); the GDK provides users with the tools to create custom patterns to recognize and properly parse addresses containing these confusing road names on a case-by-case basis. Edits are made to the unencrypted version of the pattern file (us_addr.xat). ENCODEPAT.EXE is then used to encrypt that file into the version ArcGIS uses (us_addr.pat).

By making changes in the classification file and the pattern file, it is possible to improve geocoding rates on problematic roads without requiring changes to the master address database. A balancing act is required when making changes in these two files. For example, in Leon County the placement of street-type keywords is not always consistent (e.g., Ride is sometimes used as a street type and sometimes used as part of the street name), so judgment must be used in determining whether to retain Ride as a street-type keyword in the classification file or to remove it. This determination will, in turn, affect which specific road names must be dealt with on a case-by-case basis by adding pattern recognition routines to the pattern file. After the pattern and classification files have been modified and the pattern file has been encrypted, the files are copied over the default versions in the Program Files\ArcGIS\Geocode directory.

Significant Improvements

Unique road names like this one make geocoding addresses challenging.

For benchmark testing, 97,834 addresses were extracted from the Leon County parcel address layer. These addresses were geocoded against a parcel-based locator service using the Esri default classification and pattern files. The geocoding settings were left as default (spelling sensitivity=80, minimum match score=60, ties allowed). Even though all the addresses in the benchmark test had exact string matches to addresses in the locator, 1,131 of the addresses did not geocode successfully. This clearly illustrates the magnitude of improperly parsed addresses in Leon County when using the default ArcGIS classification and pattern files.

Next, these same addresses were geocoded using the customized classification and pattern files. Only 59 of the records did not geocode successfully; the customizations to the classification and pattern files resulted in a 95 percent reduction in the number of unmatched addresses. The remaining unmatched records were primarily addresses containing confusing unit numbers for apartments and condos.
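The figures above can be checked with a few lines of arithmetic. The sketch below uses only the counts reported in the benchmark; the variable and function names are chosen for illustration.

```python
# Counts reported in the benchmark test.
total_addresses = 97_834
unmatched_default = 1_131   # default classification and pattern files
unmatched_custom = 59       # customized classification and pattern files


def match_rate(total, unmatched):
    """Fraction of addresses that geocoded successfully."""
    return (total - unmatched) / total


rate_default = match_rate(total_addresses, unmatched_default)
rate_custom = match_rate(total_addresses, unmatched_custom)

# Relative reduction in the number of unmatched addresses.
reduction = (unmatched_default - unmatched_custom) / unmatched_default

print(f"Match rate, default files:    {rate_default:.2%}")
print(f"Match rate, customized files: {rate_custom:.2%}")
print(f"Reduction in unmatched:       {reduction:.1%}")
```

The reduction works out to roughly 94.8 percent, which rounds to the 95 percent quoted. Note that the match rate itself moves by only about one percentage point, because the default files already matched nearly 99 percent of this clean dataset.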

 
A Quick Test
Determine if your parcel base suffers from geocoding problems.
  1. Build a standard parcel-based locator.
  2. Create a table of valid parcel addresses. Filter out any addresses not beginning with a number.
  3. Geocode the valid parcel address table using the locator.
  4. Evaluate the failure rate. In a large address database, even a 1 percent failure rate can indicate serious issues.
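Steps 2 and 4 of the quick test can be scripted outside ArcGIS. The following is a minimal sketch in plain Python, assuming the geocoding results have been exported with a status code per record, where "U" marks an unmatched address (ArcGIS geocoding result tables use M, T, and U for matched, tied, and unmatched). The function names, house numbers, and status values are made up for illustration.

```python
def starts_with_number(address: str) -> bool:
    """Step 2 filter: keep only addresses that begin with a digit."""
    stripped = address.strip()
    return bool(stripped) and stripped[0].isdigit()


def failure_rate(statuses) -> float:
    """Step 4: share of records whose status is 'U' (unmatched)."""
    statuses = list(statuses)
    if not statuses:
        return 0.0
    unmatched = sum(1 for status in statuses if status == "U")
    return unmatched / len(statuses)


# Hypothetical parcel addresses; house numbers are invented.
addresses = [
    "100 Chowkeebin Nene",
    "Lot 7 Rue de Lafitte",      # dropped: does not begin with a number
    "200 Old St Augustine Rd",
    "",                          # dropped: empty
]
valid = [a for a in addresses if starts_with_number(a)]

# Hypothetical status codes exported after geocoding the valid table.
statuses = ["M", "U", "M", "M"]
rate = failure_rate(statuses)
print(valid)
print(f"{rate:.1%}")
```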

Further testing was conducted to gauge the improvements when geocoding typical user-supplied addresses in a real-world environment. A total of 7,019 addresses pulled from the Leon County Animal Services service request database were geocoded against a standard composite locator (a parcel-based locator with no ties allowed, followed by a street centerline-based locator allowing ties). In this test, the GDK customized files reduced the number of unmatched records by 29 percent.

Conclusion

The Geocoding Development Kit enables GIS professionals to improve geocoding match rates for address datasets containing unique local address styles by modifying the Esri default classification and pattern files. The return on investment of the time required to understand the GDK and implement the necessary local modifications can be easily justified by the reduction of manual matching efforts throughout an entire organization. History and culture leave a unique signature on local road names, no matter what corner of the world you work in. So look around—you are certain to find the GDK can improve your geocoding rates too.

For more information, contact

About the Author

While implementing the GDK, Jay Johnson provided GIS support to the Tallahassee-Leon County (Florida) Interlocal GIS program. He has more than 12 years of professional GIS experience and received his master's degree in GIS from the University of Colorado at Denver, College of Engineering and Applied Science. He recently relocated to Reno, Nevada.

References

Esri's Geocoding Development Kit (visit edn.esri.com and search the Downloads section for "geocoding").
