A dbplyr-based Address Matching Package | R-bloggers

Read original article here

A dbplyr-based Address Matching Package
Posted on October 16, 2022 by Adnan Shroufi in R bloggers | 0 Comments
[This article was first published on R tips – NHS-R Community , and kindly contributed to R-bloggers ]. (You can report issue about the content on this page here )
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Share Tweet
The challenge
Matching address records from one table to another is a common and often repeated task. This is easy when address strings can be matched exactly, although not so easy when they cannot be matched exactly. An overarching issue is that an address string may be spelt (or misspelt) in multiple ways across multiple records. Despite this, we may want to know which records are likely to be same address in another table, even though these addresses do not share the exact same spelling.
The solution
To this end, the NHSBSA Data Science team have created an address matching package called {addressMatchR}. Using this package, we can now standardise all our address matching activities and save time using the same functions for a variety of use cases. With the code and functions openly available, we hope that other NHS organisations can benefit from this package as well.
The details
This package enables two database tables to be matched against each other, with the only requirement being each table has a postcode and address field. The output will be a matched dataset where matches are categorised as being Exact or Non-Exact, with Non-Exact matches being scored, so that the quality of these Non-Exact matches can be considered. You can download the package using the following code:
install.packages("devtools") devtools::install_github("nhsbsa-data-analytics/addressMatchR")
This package has been created and configured to work with database tables, as people or teams often need or want to match addresses in bulk. This could be tens or even hundreds of million records, which may not be feasible within a local R environment. For that reason, all of the functions have been configured to work with {dbplyr}, so the data being matched never ‘leaves’ the database.
For those encountering {dbplyr} for the first time, it is a package which enables users to use remote database tables as if they are in-memory data frames by automatically converting {dplyr} code into SQL. The advantage of this is that {dplyr} functions can be used to query a database and process the output using succinct and easy-to-read code. A disadvantage is that {dbplyr} code sometimes needs to be structured in a way to optimise how it is converted into SQL. More information on {dbplyr} can be found here: A dplyr backend for databases • dbplyr (tidyverse.org)
Connect to a database
The first thing to do when using {dbplyr} is to connect to a database, after which tables within a user’s schema can be queried. These objects are referred to as ‘lazyframes’ rather than dataframes, with the tables still being remote. We first need to establish a database connection, using our database connection details and authentication.
con % addressMatchR::tidy_single_line_address(df = ., col = ADDRESS_TWO)
We can now match the two cleaned address database tables against each other. To explain the format of the output, it is necessary to explain how the matching function itself works.
What happens under the hood?
The key features of the matching algorithm are listed below:
The function identifies two match types, Exact and Non-Exact matches
Exact matches are when the address and postcode are identical across datasets
All records not Exact matched are considered for a Non-Exact match
A Non-Exact match was to identify address strings deemed similar yet not identical
If an address could not be Non-Exact matched it was excluded from the output
Non-Exact matching is conducted on a postcode level. For example, an address from postcode ‘NE1 5DL’ would only be matched against other addresses that shared the same postcode.
The matching algorithm generates a matching score for each lookup-address that an address is matched against. For example, if 30 lookup-addresses share the same postcode, each of these 30 addresses would be scored against the address.
The scoring process splits an address into tokens (words) and then scores every token of an address against all the tokens of the addresses with a shared postcode.
Token-level scoring uses the Jaro-Winkler string similarity algorithm.
However, numerical tokens don’t use Jaro-Winkler and are scored slightly differently and given a higher weighting.
All of the token-level scores are aggregated to give a final score, for every lookup-address an address was matched against.
The best scoring Non-Exact match is then selected. If multiple properties have the same best match score, they are all included as joint best Non-Exact matches.
For the purposes of this blog, the above process has been heavily simplified. For a far more detailed and thorough explanation of the matching process, read section four of this RPubs article , which also describes a use case of the address matching function within an analysis of care home prescribing.
Match the data
The matching function only requires each of the lazyframes to be specified, along with their postcode and address column names. We can store the matched output in a new lazyframe.
results_db

Images Powered by Shutterstock

The Data Daily

A dbplyr-based Address Matching Package | R-bloggers