Adventures in Address Parsing
Published on Sept 12, 2019
What is address parsing?
Address parsing transforms a human-readable address (e.g. “1600 Pennsylvania Avenue NW, Washington, D.C. 20500”) into machine-readable parts (“1600” is the house number, “Pennsylvania” is the street name, “Avenue” is the street type, “NW” is a street directional modifier, “Washington” is the name of the city, “DC” is the region, and “20500” is the postal code).
Why do we need to parse addresses?
Generally, the database will have rows like this:
If the database were to store the entire address as a single line, then it would be difficult to find entries. For example, if someone entered the address exactly correct, except used “Ave.” instead of “Avenue,” then a simple text search would not match. Likewise, if someone omitted the zip code, then there would not be an exact string match. In other words, parsing addresses into components enables normalization of various fields, which allows the database to be much more compact and manageable. Address parsing also gives computers software some hope of correcting user errors.
Theoretically, you can design a geocoding API that requires clients to submit parsed addresses, so they would submit “Washington” as the name of the city, “DC” as the region, “20500” as the postal code, and perhaps “1600 Pennsylvania Avenue NW” as the street address (or perhaps broken up further). However, clients likely have dirty data themselves; for example, clients might ask users to enter data, and some subset of users will likely enter data incorrectly (e.g. combining multiple fields into one). Clients might also be concerned about losing orders or afraid that requiring users to enter data in separate fields is tedious. Most geocoding services include address parsing so each client does not have to enforce data validation or write its own address parsing software.
Some challenges with parsing addresses
Since we decided to focus on addresses in the United States, we could tailor our parsing software to one country’s style of addresses. On the surface, how hard could this piece be? Seems like you just take the first number as the house number, everything after that until the next comma as the street name and type, and use commas to separate the city and state, and optionally pick up the postal code after that. It turns out that address parsing is not so easy; it’s actually filled with special cases.
For example, some places spell out their street address, such as:
One Lyons Street
Dedham, MA 02026
Or, there are some addresses in New York City that have unusual house numbers:
30-09 34th St
Astoria, NY 11103
That hyphen is not a mistake -- I’ve lived in the US practically my entire life and I do not remember seeing such an address. Apparently, the number before the hyphen is the block (30th Avenue at 34th Street), and the number after the hyphen is the street number on the block, which is more apparent when you zoom in close on an OpenStreetMap tile.
Why isn’t the house number just 3009 then? No idea. Maybe earlier New Yorkers just wanted to be different. There’s also an address like 12188-B N Meridian St, Carmel, IN 46032, which is right next to 12188 N Meridian St, Carmel, IN 46032 -- maybe they built an expansion to serve a different purpose. None of these examples seem like that big of a deal, but if your parser is expecting a number instead of a string, it can separate the address in unexpected ways.
There are some addresses that are on highways:
19707 Highway 59 Humble, TX 77338
For context as to why this can be problematic, consider that sometimes, people write addresses with suite numbers without indicating a unit designation (such as Suite or Unit). So, in this case, it’s possible to construe this address as
Street name: Highway
Postal code: 77338
You might wonder who could possibly mistake the original address as suggesting 59 as a suite number. Computers can. Compounding that error, perhaps because people think it’s obvious to humans when a number following the street name is a unit number, they might omit the suite label. This isn’t a trivial problem, since even in August 2019, Nominatim, the search interface for OpenStreetMap, is unable to return any results for that address (yes, it is possible that they lack data, but in this case, it seems more likely that they were unable to parse the address in a way that they could they could find it in their database).
You might think that a simple way of handling this problem is to carve out an exception for the word “Highway” -- there’s no way that there’s an actual normal street named Highway, right?* Right? So, let’s say that you handle that case. It turns out that that address could also be referred to as “19707 US-59 Humble, TX 77338” (sometimes the hyphen is omitted). Alright, maybe we should add “US” as a synonym for “Highway” and normalize out the hyphen, if one exists. You go along your merry way, and you discover some inconsiderate bloke abbreviated Highway as HWY. Ok, another synonym. How about “I” for “Interstate”? Ok, two more synonyms -- how hard can this be? Oh, but the fun is just beginning. Remember how people sometimes omit the unit designation and just write the suite number? Some streets are named I Street (for example, see Washington, DC). So, a parser might incorrectly parse the address for Elkhorn Southbound Rest Area (“20202 I-5, Sacramento, CA 95837”) as 20202 (street number) I (street), suite 5. The hyphen between “I” and “5” indeed helps disambiguate the case, but ultimately, this is an example where address parsing is more accurate when done in the context of real data. Knowing that there is no address 20202 on I Street in Sacramento (even though the street exists) can help an address parser more confidently suggest that the street is I-5, not I Street.
It turns out that there are state highways too. So, CA-1 might also be referred to as Highway 1 in California, which is an entirely different route than US Highway 1, which runs along part of the East Coast. Other synonyms for state highway include “Route”, “Rte”, and “State Route.”
Some addresses require what we call directional modifiers (North, South, East, West). For example, 505 W. Olive Avenue, Sunnyvale, CA is different from 505 E. Olive Avenue, Sunnyvale, CA. Other times, you could omit the directional modifier and still get to the right place (e.g. 10305 Main Street Houston, TX 77025 -- technically, it’s South Main Street, but since there appears to not be a North Main Street nearby, there is no loss of precision when omitting the “South”). To add to the fun, some street names are simply a direction, as in 500 North Street Altamonte Springs, FL.
So far, we have been dealing with city names that are only a single word. Let’s change that up:
1137 Druid Circle Lake Wales, FL 33853
Is the street name “Druid Circle Lake” in the city of Wales, FL or is it “Druid Circle” in the city of Lake Wales, FL? This one is particularly fun because Lake is a valid street type (and has its own abbreviation), so both interpretations could be valid (and yes, in case you were wondering, there are some streets that use more than three words, such as Martin Luther King Boulevard). Fortunately, a quick consultation with your handy-dandy, readily available list of US cities shows that there is no Wales in Florida, only Lake Wales -- another example of how address parsing does better in the context of real data. Wouldn’t it just be easier if users always entered the data correctly?
How Nominatim parses addresses for OpenStreetMap
We have considered a lot of different angles involved with address parsing. Let’s take a closer look at one approach. Nominatim is open-source software used to parse and geocode addresses using OpenStreetMap data. The software does an admirable job for a difficult challenge: handling address formats for countries around the world, while working with user-entered data that is likely not normalized. One thing that is super neat about the software is that you can append
&debug=1 and gain visibility into how they parse addresses. For example, we can see how it handles an address like “1480 Falcon Sunnyvale CA.” First, the software normalizes the query, which apparently means making everything lower case. Then, the software splits the query up into different words, or tokens, in a process called tokenization. Then, we get to the search candidates, and finally, we see the results of each attempt. In this case, it appears that Nominatim went through 21 attempts before it was satisfied. The number of loops might have been less if it found the house number. The automatically generated queries seem complicated because of OpenStreetMap’s schema.
Sounds like too much work?
We include address parsing in our geocoding, so if you had a project where you needed to standardize your addresses, you can use our API to geocode your address and get your address parsed into various components; the latitude and longitude coordinates would be included for free! One downside of this approach is that if you submit an invalid address (e.g. a street that does not actually exist), our API might return suggestions, which might not be what you want. If you want pure address parsing functionality without validation, let us know and we can expose the appropriate API.