Data Quality: Part I — Names
Data quality is important. Unfortunately, Virginia's lower courts don't seem to do much to keep their data clean. I'm going to go over some of the inconsistencies I've found, starting with names. Here's an example.
This individual has at least five unique spellings of his name. Middle-name, punctuation, and generational-designation inconsistencies like this are common throughout the court data. Misspellings are also fairly common, like the city name in this example. If you want to group cases by individual, a simple group-by-name approach won't produce great results. I've come up with a more complex approach that seems to work reasonably well.

All criminal case data available at VirginiaCourtData.org now has a person_id field, a unique id assigned to each individual criminal defendant in the dataset. In addition, another set of files can be downloaded that includes all district and circuit criminal cases organized by person_id instead of by most recent hearing date. The rest of this article covers the methodology used to generate the person_id.
To help tell people with similar names apart, I only compared names with the same date of birth and sex. The date of birth doesn't include the year and has its own data quality issues, so it's not perfect, but it's more useful than not. I also limited comparisons to names that start with the same two letters. The name field has a typical structure of Last, First Middle, so assuming that the first two letters of the last name are correct significantly reduces the number of comparisons that need to be done.
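As a sketch of that blocking step (the field layout and helper names here are my own illustration, not the actual script), candidate pairs can be limited by grouping records on sex, the year-less date of birth, and the first two letters of the last name:

```python
from collections import defaultdict

def block_key(name, dob_month_day, sex):
    """Build a blocking key; names are only compared within a key.

    `name` follows the "Last, First Middle" structure, so the first two
    characters belong to the last name. `dob_month_day` is the DOB
    without a year, e.g. "07-04". Field formats are assumptions.
    """
    return (sex, dob_month_day, name[:2].upper())

def build_blocks(records):
    # records: iterable of (name, dob_month_day, sex) tuples
    blocks = defaultdict(list)
    for name, dob, sex in records:
        blocks[block_key(name, dob, sex)].append(name)
    return blocks

blocks = build_blocks([
    ("SMITH, JOHN A", "07-04", "Male"),
    ("SMITH, JON", "07-04", "Male"),
    ("SNYDER, JANE", "07-04", "Female"),
])
# The two SMITH records land in one block and get compared;
# SNYDER is never compared against them.
```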
I saved a small subset of data to a CSV file and set about writing the matching algorithm. I decided to use SeatGeek's fuzzy string matching library, FuzzyWuzzy. After weighing the options, their partial_ratio function seemed to work best; their use case is extremely similar to mine:
Inconsistent substrings are a common problem for us. To get around it, we use a heuristic we call “best partial” when two strings are of noticeably different lengths (such as the case above). If the shorter string is length m, and the longer string is length n, we’re basically interested in the score of the best matching length-m substring.
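FuzzyWuzzy's fuzz.partial_ratio implements that heuristic directly. As a rough stdlib-only stand-in (using difflib, which scores similarly but not identically to FuzzyWuzzy), the idea looks like this:

```python
from difflib import SequenceMatcher

def partial_ratio(a, b):
    """Score the best-matching length-m window of the longer string
    against the shorter string, scaled 0-100. An approximation of
    FuzzyWuzzy's partial_ratio, not the library's exact algorithm."""
    shorter, longer = sorted((a, b), key=len)
    m = len(shorter)
    best = max(
        SequenceMatcher(None, shorter, longer[i:i + m]).ratio()
        for i in range(len(longer) - m + 1)
    )
    return round(best * 100)

# An extra middle initial and suffix don't hurt the score:
partial_ratio("SMITH, JOHN", "SMITH, JOHN A JR")  # 100
```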
In testing, this worked well, but there were still some edge cases I wanted to catch. Matching scores above 90 were almost always true positives and scores below 80 were almost always true negatives, but scores between 80 and 90 were a toss-up. I decided to also match on the Address field and use that score to rule a match in when the name score was ambiguous.
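The resulting decision rule is a simple three-band check. The address cutoff below is my own placeholder, not a value taken from the original script:

```python
def is_match(name_score, addr_score, lo=80, hi=90):
    """Decide whether two records refer to the same person.

    Name scores >= hi are accepted outright and scores < lo rejected;
    in the ambiguous 80-90 band the Address score breaks the tie
    (the 90 address cutoff here is an assumed placeholder).
    """
    if name_score >= hi:
        return True
    if name_score < lo:
        return False
    return addr_score >= 90
```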
The algorithm loops through every name in the list and compares each one to every following name until a match is found.
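Sketched in Python (with a hypothetical is_match predicate standing in for the fuzzy comparison), that loop assigns each name the group of the first earlier name it matches:

```python
def assign_groups(names, is_match):
    """Greedy single-pass grouping: each name joins the group of the
    first earlier name it matches, otherwise it starts a new group."""
    group_ids = []
    for i, name in enumerate(names):
        gid = i  # default: start a new group
        for j in range(i):
            if is_match(names[j], name):
                gid = group_ids[j]
                break  # stop at the first match
        group_ids.append(gid)
    return group_ids

# With a toy predicate that matches on the first eight characters:
same_prefix = lambda a, b: a[:8] == b[:8]
assign_groups(["SMITH, JOHN", "SMITH, JON", "DOE, JANE"], same_prefix)
# [0, 0, 2]
```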
So that’s the matching algorithm. The most challenging part of this exercise was figuring out how to run this algorithm on all 17.5 million criminal cases in the database. I would need to pull cases from two tables, circuit and district, run the algorithm, and put the result back in the database. This may be intuitive for someone who has more database experience, but it took some trial and error for me.
I knew I could chunk the data by gender (2), date of birth (366), and first letter (26), which would only require me to work on about 1,000 cases at a time. In reality, there are far more male cases than female, but it's all in the same ballpark. I figured that working on these small chunks would solve my initial problem of scale. It would take a while to get through all those chunks, but at least I could load the data into memory and get the work done.
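For scale, that works out to 2 × 366 × 26 = 19,032 chunks, so 17.5 million cases averages roughly 900 per chunk. A sketch of the key generator (the labels are illustrative):

```python
from itertools import product

def chunk_keys():
    """Yield every (gender, month-day, first letter) chunk key.
    366 month-day combinations cover leap-day birthdays."""
    days_in_month = [31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    month_days = [
        f"{m:02d}-{d:02d}"
        for m, days in enumerate(days_in_month, start=1)
        for d in range(1, days + 1)
    ]
    letters = [chr(c) for c in range(ord("A"), ord("Z") + 1)]
    yield from product(["Male", "Female"], month_days, letters)

keys = list(chunk_keys())
len(keys)  # 19032
```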
I wrote some code to generate a person_id for each matched group. I used a big int seeded with the gender, DOB, and letter group, leaving the least significant billion values for incrementing. This meant I didn't have to worry about collisions between groups. I added a person_id column to each table and went to work on the database interface code.
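The exact encoding isn't spelled out above, but the scheme can be sketched like this (the digit packing below is my own illustration, not the actual layout):

```python
def person_id_base(sex_code, month, day, letter_index):
    """Pack the chunk identifiers into the high digits of a big int,
    leaving the low billion (10**9) values free for a per-chunk
    counter. sex_code: 0/1; letter_index: 0-25 for A-Z.
    Illustrative layout only."""
    group = ((sex_code * 100 + month) * 100 + day) * 100 + letter_index
    return group * 1_000_000_000

def next_person_id(base, counter):
    # The counter increments within a chunk; it can never collide with
    # another chunk's range as long as it stays below 10**9.
    return base + counter

base = person_id_base(1, 7, 4, 0)  # e.g. sex code 1, July 4th, letter "A"
next_person_id(base, 1)            # first person in this chunk
```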
I was already using SQLAlchemy and Psycopg to connect to the database, so I wrote a couple more functions to fetch the data I wanted to work on, match the names, generate person ids, and then update the rows one at a time.
It worked, but it ran slowly. It was spending a lot of time fetching data, so I added some indexes, which helped significantly. I let it run overnight. The next morning, I saw that it had only finished one month… not good. Most of the time was spent on selects and updates. I needed a new strategy if I wanted this to finish in less than a week.
I found a great blog post that suggested putting the data to update in a temporary table and then issuing an update statement with an inner join. My exports already used the psql terminal to download large amounts of data quickly. I could use the same COPY command to move data out and back in, then issue a single update. I decided to download all cases for a month of DOBs, then chunk them locally before running the name match. I would only need to run 12 COPY TO commands and 12 COPY FROM commands. That would definitely solve my I/O problems, or so I thought.
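The pattern looks roughly like this (the table and column names are assumptions, and the staging table would be bulk-loaded with COPY FROM between the two statements):

```python
# Sketch of the staging-table update pattern; table and column names
# are assumed, not taken from the actual schema.
CREATE_STAGING = """
CREATE TEMP TABLE person_id_staging (
    case_id   bigint PRIMARY KEY,
    person_id bigint NOT NULL
);
"""

# COPY person_id_staging FROM STDIN WITH (FORMAT csv)  -- bulk load here

UPDATE_WITH_JOIN = """
UPDATE circuit_criminal AS c
SET person_id = s.person_id
FROM person_id_staging AS s
WHERE c.id = s.case_id;
"""
```

A single set-based UPDATE ... FROM like this touches each row once, instead of issuing millions of individual UPDATE statements.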
I reimplemented the database pieces to use the psql terminal and to chunk the data locally. I also got a compute-optimized EC2 instance that could churn through the string comparisons more quickly. It downloaded a couple million cases in less than 30 seconds, chunked and matched them all in less than 15 minutes, and uploaded the person ids in just a few seconds. Then I hit another roadblock: the update was taking forever. I checked the RDS console and found that the database was operating at 1,000 write IOPS. I read up on this metric and learned that this was the limit for my instance, and that it would drop to 100 IOPS after 30 minutes. I looked into provisioned IOPS, but the cost was too high for this project, and I didn't want to have to change my database storage every time I wanted to run this script. Back to the drawing board.
After some googling, it finally occurred to me that my temp person_id table should be a permanent junction table. With the update statement removed, I was able to run the script and generate a person_id for all 17.5 million cases in a few hours. I’ve used this junction table in a project and it’s worked really well. I’m not sure it’s the best solution, but it’s one that works.