This blog rolls out version 2.0 of my jhuEmulator.py utility.
The folks at Johns Hopkins University have done an awesome job of maintaining their Ops Dashboard site. They even share a folder of data snapshots as CSV files, captured a minute before midnight, Greenwich time (2359 UTC). However, the data itself is anything but static, and anyone who needs the CSV files more often than once a day, if not in real time, can be frustrated by the snapshot frequency.
Fortunately, the data is all available with real-time updates -- It's in the feature service layers hosted on services1.arcgis.com (ncov_cases and ncov_cases_US). But there's one complication: The file format changed with the 2020-03-23.csv file, so anyone who had tools to read the old format was left in the lurch.
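For readers who haven't queried a hosted feature layer before, here is a minimal sketch of how paged requests against such a layer could be built with the ArcGIS REST `query` operation. It is not the code from jhuEmulator.py: the layer URL below is a placeholder (the `<org-id>` segment and the layer index are assumptions, since this post doesn't spell out the full path), and the page size of 1000 is also an assumption.

```python
from urllib.parse import urlencode

def build_query_url(layer_url, offset=0, page_size=1000):
    """Return a paged ArcGIS REST 'query' URL for one feature layer."""
    params = {
        "where": "1=1",                  # no filter: request every feature
        "outFields": "*",                # all attribute fields
        "returnGeometry": "false",       # attributes only, no shapes
        "resultOffset": offset,          # paging: index of the first record
        "resultRecordCount": page_size,  # paging: records per request (assumed)
        "f": "json",                     # response format
    }
    return layer_url.rstrip("/") + "/query?" + urlencode(params)

# Placeholder layer URL: substitute the real organization ID and layer index.
LAYER = ("https://services1.arcgis.com/<org-id>/arcgis/rest/services/"
         "ncov_cases_US/FeatureServer/0")

# Five pages, one URL each
urls = [build_query_url(LAYER, offset=i * 1000) for i in range(5)]
```

Each URL can then be fetched with any HTTP client and the `features` array of the JSON response appended to a running list until a page comes back short.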
I've written a Python utility which can:
Attached to this blog post is a zipfile of the jhuEmulator.py utility. The usage looks like this:
D:\covid-19>python jhuEmulator.py -h
usage: jhuEmulator.py [-h] [--adminLevel {0,1,2}] [--verbose VERBOSE]
[--interval INTERVAL] [--folder FOLDER]
[--csvFormat CSVFORMAT] [--stopPath STOPPATH]
[--skipTrivial SKIPTRIVIAL] [--usOnly USONLY]
[--confirmedOnly CONFIRMEDONLY]
[--topStates {5,10,15,20,25,30,35,40,45,all}]
{IMMEDIATE,DELAY,ONCE}
Emulate JHU COVID-19 data file (v2.0)
positional arguments:
{IMMEDIATE,DELAY,ONCE}
Execution mode
optional arguments:
-h, --help show this help message and exit
--adminLevel {0,1,2} (default = 2)
--verbose VERBOSE Verbose reporting flag (default = False)
--interval INTERVAL Data retrieval interval (default = 60m)
--folder FOLDER Folder path for data files (default = 'data')
--csvFormat CSVFORMAT
strftime format for data files
--stopPath STOPPATH File that indicates loop execution (default =
'stop.now')
--skipTrivial SKIPTRIVIAL
Defer writing insignificant changes flag (default =
False)
--usOnly USONLY Only export US data (default = False)
--confirmedOnly CONFIRMEDONLY
Only export rows with confirmed cases (default =
False)
--topStates {5,10,15,20,25,30,35,40,45,all}
Display sorted Confirmed/Deaths by US state (default =
0)
Can generate both CSV formats (before/after) 23-Mar-2020
The simplest use is a one-time execution (ONCE) mode:
D:\covid-19>python jhuEmulator.py ONCE
1910: 2369 rows written ( 838061 / 41261 / 175737 )
D:\covid-19\demo>head data\2020-03-31_1910Z.csv
FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
45001,Abbeville,South Carolina,US,2020-03-31 18:31:52,34.223334,-82.461707,3,0,0,0,"Abbeville, South Carolina, US"
22001,Acadia,Louisiana,US,2020-03-31 18:31:52,30.295065,-92.414197,11,1,0,0,"Acadia, Louisiana, US"
51001,Accomack,Virginia,US,2020-03-31 18:31:52,37.767072,-75.632346,7,0,0,0,"Accomack, Virginia, US"
16001,Ada,Idaho,US,2020-03-31 18:31:52,43.452658,-116.241552,163,3,0,0,"Ada, Idaho, US"
19001,Adair,Iowa,US,2020-03-31 18:31:52,41.330756,-94.471059,1,0,0,0,"Adair, Iowa, US"
29001,Adair,Missouri,US,2020-03-31 18:31:52,40.190586,-92.600782,1,0,0,0,"Adair, Missouri, US"
40001,Adair,Oklahoma,US,2020-03-31 18:31:52,35.884942,-94.658593,4,0,0,0,"Adair, Oklahoma, US"
08001,Adams,Colorado,US,2020-03-31 18:31:52,39.874321,-104.336258,152,0,0,0,"Adams, Colorado, US"
17001,Adams,Illinois,US,2020-03-31 18:31:52,39.988156,-91.187868,2,0,0,0,"Adams, Illinois, US"
The default is --adminLevel=2 (new-style), but levels 1 and 0 are also supported (zero is the same format as one, but without any state/province data for the US, Canada, China, or Australia):
D:\covid-19>python jhuEmulator.py ONCE --adminLevel=1
1911: 315 rows written ( 838061 / 41261 / 175737 )
D:\covid-19>head data\2020-03-31_1910Z.csv
Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,Latitude,Longitude
,Italy,2020-03-31T18:31:40,105792,12428,15729,41.871940,12.567380
,Spain,2020-03-31T18:31:40,94417,8269,19259,40.463667,-3.749220
New York,US,2020-03-31T18:31:52,75795,1550,0,42.165726,-74.948051
,Germany,2020-03-31T18:31:40,68180,682,15824,51.165691,10.451526
Hubei,China,2020-03-31T01:06:37,67801,3187,63153,30.975600,112.270700
,France,2020-03-31T18:31:40,52128,3523,9444,46.227600,2.213700
,Iran,2020-03-31T18:31:40,44605,2898,14656,32.427908,53.688046
,United Kingdom,2020-03-31T18:31:40,25150,1789,135,55.378100,-3.436000
New Jersey,US,2020-03-31T18:31:52,17126,198,0,40.298904,-74.521011
D:\covid-19>python jhuEmulator.py ONCE --adminLevel=0
1911: 180 rows written ( 838061 / 41261 / 175737 )
Note that successive executions in the same time window will overwrite the output file!
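The overwrite happens because the file name is built from the timestamp. A small sketch of that effect follows; the format string is an assumption inferred from the file names shown in this post, not copied from the script, and `snapshot_name` is a hypothetical helper.

```python
from datetime import datetime, timezone

# Assumed default for --csvFormat, inferred from names like 2020-03-31_1910Z.csv
CSV_FORMAT = "%Y-%m-%d_%H%MZ.csv"

def snapshot_name(when, fmt=CSV_FORMAT):
    """Format a timestamp (converted to UTC) as a snapshot file name."""
    return when.astimezone(timezone.utc).strftime(fmt)

a = snapshot_name(datetime(2020, 3, 31, 19, 10, 5, tzinfo=timezone.utc))
b = snapshot_name(datetime(2020, 3, 31, 19, 10, 55, tzinfo=timezone.utc))
print(a, a == b)  # 2020-03-31_1910Z.csv True
```

With a minute-resolution format like this one, the "time window" is one minute; a --csvFormat that includes seconds would narrow it.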
The IMMEDIATE and DELAY modes are nearly the same, except that IMMEDIATE mode makes its first snapshot without waiting, while DELAY mode only operates at regular intervals (all with a 1-20 second random delay, to prevent slamming the service with synchronized queries). The next two quote blocks were collected from two sessions running at the same time (in different directories):
D:\Projects\covid-19>python jhuEmulator.py IMMEDIATE --interval=15m --adminLevel=1
1917: 315 rows written ( 838061 / 41261 / 175737 )
1930: 315 rows written ( 838061 / 41261 / 175737 )
1945: 315 rows written ( 838061 / 41261 / 175737 )
2000: 315 rows written ( 846156 / 41494 / 176171 )
2015: 315 rows written ( 846156 / 41494 / 176171 )
2030: 315 rows written ( 846156 / 41494 / 176171 )
2045: 315 rows written ( 846156 / 41494 / 176171 )
2100: 315 rows written ( 846156 / 41494 / 176171 )
2115: 315 rows written ( 846156 / 41494 / 176171 )
2130: 315 rows written ( 850583 / 41654 / 176714 )
2145: 315 rows written ( 850583 / 41654 / 176714 )
2200: 315 rows written ( 850583 / 41654 / 176714 )
2215: 315 rows written ( 850583 / 41654 / 176714 )
2230: 315 rows written ( 855007 / 42032 / 177857 )
D:\covid-19>python jhuEmulator.py DELAY --interval=15m --adminLevel=1 --skipTrivial=True
1930: 315 rows written ( 838061 / 41261 / 175737 )
2000: 315 rows written ( 846156 / 41494 / 176171 )
2130: 315 rows written ( 850583 / 41654 / 176714 )
2230: 315 rows written ( 855007 / 42032 / 177857 )
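The timestamps above land on quarter-hour marks, which suggests the loop waits for the next whole multiple of the interval. A minimal sketch of that scheduling, with the 1-20 second random delay added on top, might look like this (the function names are hypothetical, and aligning to midnight is my assumption about how the boundaries are chosen):

```python
import random
from datetime import datetime, timedelta, timezone

def next_boundary(now, interval_minutes):
    """Return the next clock time that is a whole multiple of the interval past midnight."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    step = timedelta(minutes=interval_minutes)
    elapsed = now - midnight
    # Integer division of two timedeltas gives the number of completed intervals
    return midnight + (elapsed // step + 1) * step

def jittered(when, rng=random):
    """Add a 1-20 second random delay so clients do not query in lockstep."""
    return when + timedelta(seconds=rng.randint(1, 20))

now = datetime(2020, 3, 31, 19, 17, tzinfo=timezone.utc)
print(next_boundary(now, 15))  # 2020-03-31 19:30:00+00:00
```

In this sketch, IMMEDIATE would take one snapshot right away and then sleep until `jittered(next_boundary(...))`, while DELAY would sleep first.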
Note that the --skipTrivial=True flag is the mechanism to skip data export if no significant changes have occurred (a change that only touches the Last_Update field will not write a new file, but a change to any case count field will).
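One way to express that "significant change" test is to compare the previous and current rows with the timestamp column removed. This is only a sketch under the assumption that each snapshot is held as a list of dictionaries in a stable order; `significant_change` is a hypothetical name, not a function from the script.

```python
def significant_change(previous, current, ignore=("Last_Update", "Last Update")):
    """Return True when any field other than the ignored timestamp fields differs."""
    def strip(rows):
        # Drop the ignored columns from every row before comparing
        return [{k: v for k, v in row.items() if k not in ignore} for row in rows]
    return strip(previous) != strip(current)

old = [{"Province_State": "New York", "Last_Update": "2020-03-31 18:31:52", "Confirmed": "75795"}]
touched = [{"Province_State": "New York", "Last_Update": "2020-03-31 19:01:10", "Confirmed": "75795"}]
grown = [{"Province_State": "New York", "Last_Update": "2020-03-31 19:01:10", "Confirmed": "75798"}]

print(significant_change(old, touched))  # False: only the timestamp moved
print(significant_change(old, grown))    # True: a case count changed
```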
Assembling the new-style data format is tricky, because it needs to aggregate thousands of records from two different services (and remove the US duplicates from the ncov_cases service). The verbose mode gives an indication of what's happening:
D:\covid-19>python jhuEmulator.py ONCE --verbose=True
Querying 'ncov_cases_US' service (1/5)...
382.2 KB retrieved (476ms elapsed)
Querying 'ncov_cases_US' service (2/5)...
281.9 KB retrieved (250ms elapsed)
Querying 'ncov_cases_US' service (3/5)...
200.1 KB retrieved (176ms elapsed)
Querying 'ncov_cases_US' service (4/5)...
262.2 KB retrieved (204ms elapsed)
Querying 'ncov_cases_US' service (5/5)...
169.5 KB retrieved (161ms elapsed)
Querying 'ncov_cases' service...
163.6 KB retrieved (169ms elapsed)
Creating datafile '2020-03-31_1936Z.csv'...
2369 rows written ( 838061 / 41261 / 175737 )
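The de-duplication step can be sketched as follows: take every county row from ncov_cases_US, then add only the non-US rows from ncov_cases. This is a simplification under my own assumptions (rows as dictionaries, `merge_feeds` and `totals` as hypothetical helpers); the real script may need to keep special-case rows, such as a lump-sum Recovered record, that this filter would drop.

```python
def merge_feeds(us_counties, global_rows):
    """Combine county-level US rows with the non-US rows of the global feed."""
    rest_of_world = [r for r in global_rows if r["Country_Region"] != "US"]
    return us_counties + rest_of_world

def totals(rows):
    """Return the (Confirmed, Deaths, Recovered) sums printed after 'rows written'."""
    return tuple(sum(int(r[f]) for r in rows) for f in ("Confirmed", "Deaths", "Recovered"))

# Sample rows taken from the listings earlier in this post
us = [{"Admin2": "Abbeville", "Province_State": "South Carolina", "Country_Region": "US",
       "Confirmed": 3, "Deaths": 0, "Recovered": 0}]
world = [{"Province_State": "", "Country_Region": "Italy",
          "Confirmed": 105792, "Deaths": 12428, "Recovered": 15729},
         {"Province_State": "New York", "Country_Region": "US",
          "Confirmed": 75795, "Deaths": 1550, "Recovered": 0}]

merged = merge_feeds(us, world)
print(len(merged), totals(merged))  # 2 (105795, 12428, 15729)
```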
For the US-centric audience, I added a --topStates flag (which must be used with --verbose=True) to print a summary:
D:\covid-19>python jhuEmulator.py ONCE --verbose=True --topStates=15
Querying 'ncov_cases_US' service (1/5)...
390.0 KB retrieved (429ms elapsed)
Querying 'ncov_cases_US' service (2/5)...
286.2 KB retrieved (168ms elapsed)
Querying 'ncov_cases_US' service (3/5)...
200.7 KB retrieved (189ms elapsed)
Querying 'ncov_cases_US' service (4/5)...
268.2 KB retrieved (163ms elapsed)
Querying 'ncov_cases_US' service (5/5)...
172.5 KB retrieved (157ms elapsed)
===== Top 15 States =====
State Confirmed Deaths Counties
New York 75798 1550 56
New Jersey 18696 267 22
California 8077 163 48
Michigan 7615 259 69
Massachusetts 6620 89 14
Florida 6338 77 54
Illinois 5994 99 54
Washington 5305 222 35
Louisiana 5237 239 61
Pennsylvania 4963 63 60
Georgia 3815 111 138
Texas 3726 53 131
Connecticut 3128 69 9
Colorado 2627 51 48
Tennessee 2391 23 82
*Others 25475 462 1247
Querying 'ncov_cases' service...
163.6 KB retrieved (132ms elapsed)
Creating datafile '2020-03-31_2220Z.csv'...
2405 rows written ( 855007 / 42032 / 177857 )
D:\covid-19>python jhuEmulator.py ONCE --verbose=True --topStates=15 --adminLevel=1
Querying 'ncov_cases' service...
163.6 KB retrieved (442ms elapsed)
===== Top 15 States =====
State Confirmed Deaths
New York 75798 1550
New Jersey 18696 267
California 8077 163
Michigan 7615 259
Massachusetts 6620 89
Florida 6338 77
Illinois 5994 99
Washington 5305 222
Louisiana 5237 239
Pennsylvania 4963 63
Georgia 3815 111
Texas 3726 53
Connecticut 3128 69
Colorado 2627 51
Tennessee 2391 23
*Others 25475 462
Creating datafile '2020-03-31_2221Z.csv'...
315 rows written ( 855007 / 42032 / 177857 )
The format difference arises because Admin2 reporting is by county but doesn't populate Recovered and Active, while Admin1 reporting includes Recovered. That would allow Active to be computed, but Recovered is now a lump-sum record in the ncov_cases service, so in practice neither Active nor Recovered is available.
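The top-states summaries shown above amount to a group-by on the state name, a descending sort on Confirmed, and a catch-all row for everything past the cutoff. Here is a minimal sketch for the county-level (Admin2) case; `top_states` is a hypothetical helper and the row layout is assumed to be a list of dictionaries keyed by the CSV column names.

```python
from collections import defaultdict

def top_states(county_rows, n):
    """Sum Confirmed/Deaths per state, rank by Confirmed, lump the rest into '*Others'."""
    sums = defaultdict(lambda: [0, 0, 0])  # confirmed, deaths, county count
    for row in county_rows:
        s = sums[row["Province_State"]]
        s[0] += int(row["Confirmed"])
        s[1] += int(row["Deaths"])
        s[2] += 1
    ranked = sorted(sums.items(), key=lambda kv: kv[1][0], reverse=True)
    top = [(name, *vals) for name, vals in ranked[:n]]
    rest = ranked[n:]
    if rest:
        # One summary row for every state below the cutoff
        top.append(("*Others", *(sum(v[i] for _, v in rest) for i in range(3))))
    return top

# A few county rows from the 2020-03-31_1910Z.csv listing earlier in this post
rows = [
    {"Province_State": "Idaho", "Confirmed": "163", "Deaths": "3"},
    {"Province_State": "Iowa", "Confirmed": "1", "Deaths": "0"},
    {"Province_State": "Missouri", "Confirmed": "1", "Deaths": "0"},
    {"Province_State": "Oklahoma", "Confirmed": "4", "Deaths": "0"},
    {"Province_State": "Colorado", "Confirmed": "152", "Deaths": "0"},
    {"Province_State": "Illinois", "Confirmed": "2", "Deaths": "0"},
]
for line in top_states(rows, 2):
    print(line)
```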
So, how do I know this script populates the same data shared on GitHub? Well, I wrote a validator utility that creates two CovidSummary objects, then iterates through one while searching the other for matching keys, reporting missing rows, data mismatches, and unmatched rows. The output was:
D:\covid-19>python validate.py
========================================================================
0 errors / 3429 lines
Okay, I cheated a bit. Since the daily JHU CSV reports that the Diamond Princess and Grand Princess cruise ships are docked on Null Island, I treated 0.0 degrees latitude/longitude as a wildcard that matches any coordinate. I also only compare coordinates to 5 decimal places, since string comparison doesn't work well with floating-point values. Everything else has aligned perfectly.
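The comparison rules just described can be sketched with plain dictionaries standing in for the CovidSummary objects (whose real interface isn't shown in this post). The function names, the `Lat`/`Long_` field names, and the assumption that both sides carry the same columns are all mine.

```python
def coords_match(a, b, places=5):
    """Compare two coordinates numerically; 0.0 on either side matches anything."""
    a, b = float(a), float(b)
    if a == 0.0 or b == 0.0:
        return True  # Null Island wildcard
    return round(a, places) == round(b, places)

def compare(sum1, sum2, coord_fields=("Lat", "Long_")):
    """Return (missing, mismatched, unmatched) key lists for two {key: row} dicts."""
    missing, mismatched = [], []
    for key, row in sum1.items():
        other = sum2.get(key)
        if other is None:
            missing.append(key)  # in sum1 but not in sum2
            continue
        for field, value in row.items():
            if field in coord_fields:
                ok = coords_match(value, other[field])
            else:
                ok = value == other[field]
            if not ok:
                mismatched.append(key)
                break
    unmatched = [k for k in sum2 if k not in sum1]  # in sum2 but not in sum1
    return missing, mismatched, unmatched

published = {"Abbeville, South Carolina, US":
             {"Lat": "34.223334", "Long_": "-82.461707", "Confirmed": "3"}}
emulated = {"Abbeville, South Carolina, US":
            {"Lat": "34.2233341", "Long_": "0.0", "Confirmed": "3"},
            "Acadia, Louisiana, US":
            {"Lat": "30.295065", "Long_": "-92.414197", "Confirmed": "11"}}

print(compare(published, emulated))  # ([], [], ['Acadia, Louisiana, US'])
```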
UPDATE @ 2100 EST: Unfortunately, sometime today, the US Admin2 jurisdictions without any confirmed cases were deleted from the ncov_cases_US service feed, the FIPS code displayed for the Northern Mariana Islands and US Virgin Islands disappeared, and the US territory without any confirmed cases (American Samoa) also disappeared (along with its FIPS code). I've tweaked the exporter to conform to this, and it runs cleanly now:
D:\covid-19>python validate.py
========================================
sum1 = data2\2020-03-31.csv
sum2 = data2\2020-04-01_0110Z.csv
0 errors / 2434 lines
But I'm expecting Guam to have its FIPS code removed in the coming days.
Since the validation tool implements Python classes to parse and search both old-style and new-style CSV files, I've attached that as well (as validator-v10.zip).
My next task is to write some code to exploit this near real-time data resource, and use it to maintain PostgreSQL tables to produce clones of the ncov_cases* services' data.
Now attached is jhuEmulator-v20.zip -- Changelog:
== Update 01-Apr @ 0100 EST ==
Long-running service execution is great for finding bugs...
Now attached is jhuEmulator-v21.zip -- Changelog:
== Update 01-Apr @ 0920 EST ==
Overnight service execution is great for finding bugs, but errors that arise after several hours are less fun...
Now attached is jhuEmulator-v22.zip -- Changelog:
Also attached is covidValidate_v11.zip -- Changelog:
== Update 01-Apr @ 2030 EST ==
The server glitched, dropping all US data, and I got to exercise some error code that doesn't normally see traffic, which I've now tweaked to be more resilient.
Attached is jhuEmulator-v23.zip -- Changelog:
== Update 08-Apr @ 1600 EST ==
I tweaked the exporter to always write data at UTC midnight, so that the validator has the same Last_Update value for comparison with the JHU published file. I had to improve date parsing in the CovidReport class (the corrected 2020-04-06.csv has a different date format). I'm still seeing some JHU data with incorrect 000xx FIPS codes, and the Admin1 and Admin2 feeds are out of sync for the Northern Mariana Islands, but otherwise looking good.
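Tolerant date parsing of this kind usually comes down to trying a list of known formats in turn. The sketch below is not the CovidReport code: the first two formats are the ones visible in the CSV listings in this post, while the third is a hypothetical extra added purely to illustrate how another variant would slot in.

```python
from datetime import datetime

FORMATS = (
    "%Y-%m-%d %H:%M:%S",  # new-style rows shown in this post
    "%Y-%m-%dT%H:%M:%S",  # old-style rows shown in this post
    "%m/%d/%y %H:%M",     # hypothetical additional format, for illustration only
)

def parse_last_update(text):
    """Try each known format in turn and return the first successful parse."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized Last_Update value: {text!r}")

print(parse_last_update("2020-03-31 18:31:52") == parse_last_update("2020-03-31T18:31:52"))  # True
```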
Attached is jhuEmulator-v24.zip -- Changelog:
Also attached is covidValidate_v12.zip -- Changelog:
== Update 12-Apr @ 2320 EST ==
The JHU daily CSV format changed again, adding five more fields (and moving FIPS later in the display order). I've got the covidValidate.py handling the new format, but am not yet generating it...
Attached is covidValidate_v30.zip -- Changelog:
- V