Environmental Data Resources, Inc. (EDR) helps people connect with the world’s largest and most accurate database of environmental and historical land use records. During my time at Harvey Mudd College, I worked on this project with Prof. Weiqing Gu and with liaisons from EDR: Zachary Fisk, Matthew Purcell, and Richard White.
ML Model for US Address Parsing
While handling a large number of historical land use records, our team became interested in the problem of matching US address records that refer to the same location. The motivation is that address formats are not consistent, especially between historical and present-day records. Additionally, addresses are not standardized: the same location can appear under a different owner, a different street name, or a different notation.
For instance, these three addresses should be considered the same.
LA UNI SCH DIST, STEVENSON JR 725 S INDIANA ST N/A LOS ANGELES, CA 90023
725 SOUTH INDIANA ST, A SCHOOL LABORATORY, STEVENSON JUNIOR LOS ANGELES, CA 90023
LAUSD/STEVENSON MS 725 S INDIANA ST LOS ANGELES, CA 90023
We decomposed address matching into two subproblems: first segmenting each address into a standardized format, then matching addresses on that standardized format.
My team and I designed a Hidden Markov Model and a Support Vector Machine to automatically parse addresses into a standardized format. We then solved the address matching problem by introducing a distance function on the standardized format. Our team implemented the entire pipeline in Python and the ML models in scikit-learn. We deployed our models to serve all of EDR’s data on Elasticsearch clusters.
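To make the two-step idea concrete, here is a minimal sketch in Python. It is not the project's actual code: a toy regex parser stands in for the learned Hidden Markov Model and Support Vector Machine, and the field names, the abbreviation table, the weights, and the functions standardize and distance are all hypothetical choices made for illustration. City and organization names are deliberately ignored, since separating them reliably is the part that needs a learned model.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; a real one would be far larger.
EXPANSIONS = {"S": "SOUTH", "N": "NORTH", "E": "EAST", "W": "WEST",
              "ST": "STREET", "AVE": "AVENUE", "BLVD": "BOULEVARD"}

# "<house number> <street words> <street type>", e.g. "725 S INDIANA ST".
STREET_RE = re.compile(
    r"\b(\d+)\s+((?:[A-Z]+\s+)*?(?:STREET|ST|AVENUE|AVE|BOULEVARD|BLVD))\b")
# "<state> <5-digit zip>" at the end of the record.
TAIL_RE = re.compile(r"\b([A-Z]{2})\s+(\d{5})\s*$")

# Assumed field weights (they sum to 1), not tuned on any data.
WEIGHTS = {"number": 0.3, "street": 0.4, "state": 0.1, "zip": 0.2}


def standardize(raw):
    """Toy rule-based stand-in for the learned address parser."""
    text = raw.upper()
    street = STREET_RE.search(text)
    tail = TAIL_RE.search(text)
    words = street.group(2).split() if street else []
    return {
        "number": street.group(1) if street else "",
        "street": " ".join(EXPANSIONS.get(w, w) for w in words),
        "state": tail.group(1) if tail else "",
        "zip": tail.group(2) if tail else "",
    }


def distance(a, b):
    """Weighted field-wise distance in [0, 1]; 0 means all fields agree."""
    total = 0.0
    for field, weight in WEIGHTS.items():
        if field == "street":
            # Fuzzy comparison tolerates small spelling differences.
            similarity = SequenceMatcher(None, a[field], b[field]).ratio()
        else:
            similarity = 1.0 if a[field] == b[field] else 0.0
        total += weight * (1.0 - similarity)
    return total


RECORDS = [
    "LA UNI SCH DIST, STEVENSON JR 725 S INDIANA ST N/A LOS ANGELES, CA 90023",
    "725 SOUTH INDIANA ST, A SCHOOL LABORATORY, STEVENSON JUNIOR LOS ANGELES, CA 90023",
    "LAUSD/STEVENSON MS 725 S INDIANA ST LOS ANGELES, CA 90023",
]
parsed = [standardize(record) for record in RECORDS]
```

In this sketch the three example records above reduce to the same standardized fields, so their pairwise distance is zero, while a record with a different house number or zip code gets a positive distance. Matching then becomes a question of choosing a threshold on that distance.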
Last updated: Jan 16, 2022