2018 and 2009 Court Data — And Some Duplicates Removed
I’m excited to announce that 2018 case data has been published to VirginiaCourtData.org.
District court data from 2009 has also been published. It was my hope to collect district court data going back to 2007, but I found that data older than 2009 is not available on the website.
6,133,208 new cases were added, bringing the total number of cases available for bulk download to 33,529,018.
Earlier this year, I got an email from someone who suspected that the data contained duplicate cases. Unfortunately, his suspicion was correct. There were duplicate cases in all four types of courts: criminal and civil, district and circuit. Nearly all of the duplicates were cases from 2016, some of the first cases I collected. My guess is that a bug early on, which I never noticed, caused them. Luckily, the impact was minor: of the 27,395,810 cases posted over the last year, 116,701 (0.426%) were duplicates. For the most part, the affected cases were duplicated once or twice, with the exception of one case that was duplicated a couple thousand times.

The rest of this post is a technical discussion of how I removed the duplicates. I also kept data on exactly which cases were duplicated; please contact me if you need it.
Thanks to Jay Phinizy for discovering this issue and bringing it to my attention.
My first mistake was not creating a unique index on FIPS code and Case Number. The Case Number is a unique identifier within each court, but Case Numbers are not unique across courts, so the FIPS code of the court needs to be taken into account too. Because I didn’t create this constraint on my database, it was possible to scrape the same case twice and create a duplicate database entry.
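As a sketch, assuming a `cases` table with `fips` and `case_number` columns (the actual schema may differ), the missing constraint would look something like this:

```sql
-- Enforce one row per case per court; a second scrape of the
-- same case would now fail instead of creating a duplicate.
CREATE UNIQUE INDEX cases_fips_case_number_idx
    ON cases (fips, case_number);
```

With this index in place, the scraper could use an upsert (`INSERT ... ON CONFLICT`) to update existing rows rather than erroring out.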
Finding the duplicate cases was easy: I grouped all the cases by FIPS and Case Number, then kept only the groups with more than one row. I also pulled the most recent hearing date, so I could determine how recently I had duplicated a case.
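The grouping query would look roughly like the following. For simplicity this sketch assumes a `hearing_date` column directly on the `cases` table; in the real schema the hearing dates likely live in a child table and would need a join.

```sql
-- One row per duplicated case: how many copies exist,
-- and how recent the latest hearing is.
SELECT fips,
       case_number,
       count(*)          AS copies,
       max(hearing_date) AS latest_hearing
FROM cases
GROUP BY fips, case_number
HAVING count(*) > 1;
```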
After finding out how many individual cases were duplicated, I wanted to know how many rows I could expect to delete, i.e. the sum of the rows in each group less the number of groups. I ran the same query as before, but as a subquery, so I could aggregate the results.
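Wrapping the previous query as a subquery, the count of doomed rows might be computed like this (same assumed table and column names as above):

```sql
-- Rows to delete = total rows across duplicate groups,
-- minus one surviving row per group.
SELECT sum(copies) - count(*) AS rows_to_delete
FROM (
    SELECT fips, case_number, count(*) AS copies
    FROM cases
    GROUP BY fips, case_number
    HAVING count(*) > 1
) AS dupes;
```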
Now I could go ahead and delete. I found that Postgres supports a USING clause in DELETE statements that makes this query really simple: for all cases sharing the same FIPS and Case Number, delete every row except the one with the lowest id.
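A sketch of that DELETE, self-joining the assumed `cases` table against itself:

```sql
-- For each (fips, case_number) pair, any row with a higher id
-- than some matching row is deleted; only the lowest id survives.
DELETE FROM cases a
USING cases b
WHERE a.fips = b.fips
  AND a.case_number = b.case_number
  AND a.id > b.id;
```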
I ran the query and waited… and waited… and finally cancelled it after 20 minutes. The query to find the duplicate cases only took a second or two, so there had to be something wrong with the DELETE. I found a specific duplicate case and deleted it under EXPLAIN ANALYZE to see where the time was going.
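Timing a single-row delete that way would look like this (the id is a placeholder, not a real case):

```sql
-- EXPLAIN ANALYZE actually executes the statement and reports
-- per-node timings, including time spent in triggers and
-- foreign-key checks.
EXPLAIN ANALYZE
DELETE FROM cases
WHERE id = 12345;
```

Note that because EXPLAIN ANALYZE really runs the DELETE, it is safest to wrap it in a transaction and ROLLBACK when you only want the timing.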
Unfortunately, I didn’t save the output to show here, but what I found was that ON DELETE CASCADE was enabled, so deleting one case also deleted the related records in my child tables (hearings, services, etc.). The problem was that these tables were indexed only on their own id, not on the case_id that linked them back to the cases table. So every delete took 4 to 5 seconds, because every record in every child table had to be scanned. I added indexes on case_id to all of the child tables and ran the delete again. This time, the entire thing took less than 30 seconds.
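The fix amounts to indexing the foreign-key column in each child table; Postgres does not create these indexes automatically. With illustrative table names:

```sql
-- Let cascaded deletes find child rows via an index lookup
-- instead of a sequential scan of the whole child table.
CREATE INDEX hearings_case_id_idx ON hearings (case_id);
CREATE INDEX services_case_id_idx ON services (case_id);
```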
I repeated this on all four databases to remove the 116,701 duplicate cases.