Python utility to write COVID-19 CSV snapshots from arcgis.com feature services (v2)

03-31-2020 06:20 PM
Esri Esteemed Contributor

This blog rolls out version 2.0 of my jhuEmulator.py utility.

The folks at Johns Hopkins University have done an awesome job of maintaining their Ops Dashboard site.  They even share a folder of data snapshots as CSV files, captured at a minute before midnight, Greenwich time (2359 UTC).  However, the data itself is anything but static, and anyone who needs the CSV files more frequently, if not in real time, can be frustrated by the snapshot frequency.

Fortunately, the data is all available with real-time updates: it's in the feature service layers hosted on services1.arcgis.com (ncov_cases and ncov_cases_US). But there's one complication: the file format changed with the 2020-03-23.csv file, so anyone who had tools to read the old format was left in the lurch.
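As a rough illustration of what asking a feature service layer for its rows involves, here is a minimal sketch that only builds a query URL (it does not contact the server). The `<org-id>` path segment, the layer index, and the helper name `build_query_url` are placeholders I've assumed for illustration; the query parameters are the usual ArcGIS REST ones.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: substitute the real organization path segment.
SERVICE_ROOT = "https://services1.arcgis.com/<org-id>/arcgis/rest/services"


def build_query_url(service, layer=0, offset=0, page_size=1000):
    """Build a paged 'query' URL that asks for every attribute as JSON."""
    params = {
        "where": "1=1",              # no filter: request every row
        "outFields": "*",            # all attribute columns
        "returnGeometry": "false",   # attributes only, no shapes
        "resultOffset": offset,      # first row of this page
        "resultRecordCount": page_size,
        "f": "json",
    }
    return f"{SERVICE_ROOT}/{service}/FeatureServer/{layer}/query?{urlencode(params)}"


# Layer 0 is an assumption here, not a statement about the real services.
url = build_query_url("ncov_cases_US", layer=0, offset=1000)
```

The resulting URL could be fetched with any HTTP client; the response is a JSON document whose rows sit under a `features` list.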

I've written a Python utility which can:

  • Export feature service data in either the old (Province/State,...) or new (FIPS,...) CSV format
  • Export near real-time updates every 2 minutes, or as slowly as once every 12 hours
  • Choose to skip export if the number of confirmed cases, deaths, or recoveries doesn't change in any particular time-slice (sometimes the Last_Update value changes, but the values reported don't)
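To make the two layouts concrete, here is a minimal sketch of writing the same row dictionaries in either format. The header lists are copied from the sample files shown later in this post; the `OLD_TO_NEW` mapping and the `write_rows` helper are my own assumptions for illustration, not the code in jhuEmulator.py.

```python
import csv
import io

NEW_HEADER = ["FIPS", "Admin2", "Province_State", "Country_Region", "Last_Update",
              "Lat", "Long_", "Confirmed", "Deaths", "Recovered", "Active",
              "Combined_Key"]
OLD_HEADER = ["Province/State", "Country/Region", "Last Update",
              "Confirmed", "Deaths", "Recovered", "Latitude", "Longitude"]

# Assumed mapping from old-style column names to new-style field names.
OLD_TO_NEW = {
    "Province/State": "Province_State",
    "Country/Region": "Country_Region",
    "Last Update": "Last_Update",
    "Latitude": "Lat",
    "Longitude": "Long_",
}


def write_rows(rows, stream, old_style=False):
    """Write row dictionaries (keyed by new-style names) in the chosen layout."""
    header = OLD_HEADER if old_style else NEW_HEADER
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        # Columns without a mapping entry already use the new-style name.
        writer.writerow([row.get(OLD_TO_NEW.get(col, col), "") for col in header])


buffer = io.StringIO()
write_rows([{"Province_State": "New York", "Country_Region": "US",
             "Confirmed": "75795", "Deaths": "1550"}], buffer, old_style=True)
```

The csv module quotes any value that contains a comma, which is why Combined_Key appears in double quotes in the new-style files.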

Attached to this blog post is a zipfile of the jhuEmulator.py utility.  The usage looks like this:

D:\covid-19>python jhuEmulator.py -h
usage: jhuEmulator.py [-h] [--adminLevel {0,1,2}] [--verbose VERBOSE]
                      [--interval INTERVAL] [--folder FOLDER]
                      [--csvFormat CSVFORMAT] [--stopPath STOPPATH]
                      [--skipTrivial SKIPTRIVIAL] [--usOnly USONLY]
                      [--confirmedOnly CONFIRMEDONLY]
                      [--topStates {5,10,15,20,25,30,35,40,45,all}]
                      {IMMEDIATE,DELAY,ONCE}

Emulate JHU COVID-19 data file (v2.0)

positional arguments:
  {IMMEDIATE,DELAY,ONCE}
                        Execution mode

optional arguments:
  -h, --help            show this help message and exit
  --adminLevel {0,1,2}  (default = 2)
  --verbose VERBOSE     Verbose reporting flag (default = False)
  --interval INTERVAL   Data retrieval interval (default = 60m)
  --folder FOLDER       Folder path for data files (default = 'data')
  --csvFormat CSVFORMAT
                        strftime format for data files
  --stopPath STOPPATH   File that indicates loop execution (default =
                        'stop.now')
  --skipTrivial SKIPTRIVIAL
                        Defer writing insignificant changes flag (default =
                        False)
  --usOnly USONLY       Only export US data (default = False)
  --confirmedOnly CONFIRMEDONLY
                        Only export rows with confirmed cases (default =
                        False)
  --topStates {5,10,15,20,25,30,35,40,45,all}
                        Display sorted Confirmed/Deaths by US state (default =
                        0)

Can generate both CSV formats (before/after) 23-Mar-2020

The simplest use is a one-time execution (ONCE) mode:

D:\covid-19>python jhuEmulator.py ONCE
  1910:   2369 rows written (  838061 /  41261 /  175737 )

D:\covid-19\demo>head data\2020-03-31_1910Z.csv
FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
45001,Abbeville,South Carolina,US,2020-03-31 18:31:52,34.223334,-82.461707,3,0,0,0,"Abbeville, South Carolina, US"
22001,Acadia,Louisiana,US,2020-03-31 18:31:52,30.295065,-92.414197,11,1,0,0,"Acadia, Louisiana, US"
51001,Accomack,Virginia,US,2020-03-31 18:31:52,37.767072,-75.632346,7,0,0,0,"Accomack, Virginia, US"
16001,Ada,Idaho,US,2020-03-31 18:31:52,43.452658,-116.241552,163,3,0,0,"Ada, Idaho, US"
19001,Adair,Iowa,US,2020-03-31 18:31:52,41.330756,-94.471059,1,0,0,0,"Adair, Iowa, US"
29001,Adair,Missouri,US,2020-03-31 18:31:52,40.190586,-92.600782,1,0,0,0,"Adair, Missouri, US"
40001,Adair,Oklahoma,US,2020-03-31 18:31:52,35.884942,-94.658593,4,0,0,0,"Adair, Oklahoma, US"
08001,Adams,Colorado,US,2020-03-31 18:31:52,39.874321,-104.336258,152,0,0,0,"Adams, Colorado, US"
17001,Adams,Illinois,US,2020-03-31 18:31:52,39.988156,-91.187868,2,0,0,0,"Adams, Illinois, US"

The default is --adminLevel=2 (new-style), but levels 1 and 0 are also supported (zero is the same format as one, but without any state/province data for the US, Canada, China, or Australia):

D:\covid-19>python jhuEmulator.py ONCE --adminLevel=1
  1911:    315 rows written (  838061 /  41261 /  175737 )

D:\covid-19>head data\2020-03-31_1910Z.csv
Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,Latitude,Longitude
,Italy,2020-03-31T18:31:40,105792,12428,15729,41.871940,12.567380
,Spain,2020-03-31T18:31:40,94417,8269,19259,40.463667,-3.749220
New York,US,2020-03-31T18:31:52,75795,1550,0,42.165726,-74.948051
,Germany,2020-03-31T18:31:40,68180,682,15824,51.165691,10.451526
Hubei,China,2020-03-31T01:06:37,67801,3187,63153,30.975600,112.270700
,France,2020-03-31T18:31:40,52128,3523,9444,46.227600,2.213700
,Iran,2020-03-31T18:31:40,44605,2898,14656,32.427908,53.688046
,United Kingdom,2020-03-31T18:31:40,25150,1789,135,55.378100,-3.436000
New Jersey,US,2020-03-31T18:31:52,17126,198,0,40.298904,-74.521011

D:\covid-19>python jhuEmulator.py ONCE --adminLevel=0
  1911:    180 rows written (  838061 /  41261 /  175737 )

Note that successive executions in the same time window will overwrite the output file!
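The overwrite happens because the file name is derived from the UTC time at minute resolution. Here is a small sketch of that effect, assuming a strftime pattern inferred from file names like 2020-03-31_1910Z.csv (the script's actual --csvFormat default isn't shown above, so treat `CSV_FORMAT` and `snapshot_name` as hypothetical):

```python
from datetime import datetime, timezone

# Assumed pattern, inferred from file names such as '2020-03-31_1910Z.csv'.
CSV_FORMAT = "%Y-%m-%d_%H%MZ.csv"


def snapshot_name(when, fmt=CSV_FORMAT):
    """Format a timezone-aware timestamp as a UTC snapshot file name."""
    return when.astimezone(timezone.utc).strftime(fmt)


# Two runs 50 seconds apart, inside the same minute...
first = snapshot_name(datetime(2020, 3, 31, 19, 10, 5, tzinfo=timezone.utc))
second = snapshot_name(datetime(2020, 3, 31, 19, 10, 55, tzinfo=timezone.utc))
# ...map to the same name, so the second run would replace the first file.
```

Adding %S to the pattern would be one way to keep back-to-back runs apart, at the cost of longer file names.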

The IMMEDIATE and DELAY modes are nearly the same, except that IMMEDIATE mode makes a first snapshot right away, while DELAY mode only operates at regular intervals (all with a 1-20 second random delay, to prevent slamming the service with synchronized queries).  The next two quote blocks were collected from two sessions running at the same time (in different directories):

D:\Projects\covid-19>python jhuEmulator.py IMMEDIATE --interval=15m --adminLevel=1
  1917:    315 rows written (  838061 /  41261 /  175737 )
  1930:    315 rows written (  838061 /  41261 /  175737 )
  1945:    315 rows written (  838061 /  41261 /  175737 )
  2000:    315 rows written (  846156 /  41494 /  176171 )
  2015:    315 rows written (  846156 /  41494 /  176171 )
  2030:    315 rows written (  846156 /  41494 /  176171 )
  2045:    315 rows written (  846156 /  41494 /  176171 )
  2100:    315 rows written (  846156 /  41494 /  176171 )
  2115:    315 rows written (  846156 /  41494 /  176171 )
  2130:    315 rows written (  850583 /  41654 /  176714 )
  2145:    315 rows written (  850583 /  41654 /  176714 )
  2200:    315 rows written (  850583 /  41654 /  176714 )
  2215:    315 rows written (  850583 /  41654 /  176714 )
  2230:    315 rows written (  855007 /  42032 /  177857 )

D:\covid-19>python jhuEmulator.py DELAY --interval=15m --adminLevel=1 --skipTrivial=True
  1930:    315 rows written (  838061 /  41261 /  175737 )
  2000:    315 rows written (  846156 /  41494 /  176171 )
  2130:    315 rows written (  850583 /  41654 /  176714 )
  2230:    315 rows written (  855007 /  42032 /  177857 )
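The interval alignment seen in both sessions (1930, 1945, 2000, ...) plus the random delay can be sketched as follows. This is only an illustration of the idea; `next_slot` and `jitter_seconds` are hypothetical names, not functions from jhuEmulator.py.

```python
import random
from datetime import datetime, timedelta, timezone


def next_slot(now, interval_minutes):
    """Return the next clock boundary that is a whole multiple of the interval."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = int((now - midnight).total_seconds())
    step = interval_minutes * 60
    # Integer division finds the current slot; +1 moves to the one after it.
    return midnight + timedelta(seconds=(elapsed // step + 1) * step)


def jitter_seconds(low=1, high=20):
    """Random delay so that many clients don't query at the same instant."""
    return random.randint(low, high)


started = datetime(2020, 3, 31, 19, 17, tzinfo=timezone.utc)
wake_at = next_slot(started, 15) + timedelta(seconds=jitter_seconds())
```

A session started at 1917 with a 15-minute interval would wake shortly after 1930, which matches the second line of the IMMEDIATE transcript.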

Note that the --skipTrivial=True flag is the mechanism to skip data export if no significant changes have occurred (a change to only the Last_Update field will not write a new file, but a change to any case count field will).
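One way to picture that test is to compare only the count columns of each row and ignore the timestamp. The sketch below assumes rows are dictionaries keyed by the new-style field names; `count_signature` and `significant_change` are hypothetical helpers, and the real check in the script may differ.

```python
COUNT_FIELDS = ("Confirmed", "Deaths", "Recovered")


def count_signature(rows):
    """Map each row key to its count values, leaving Last_Update out."""
    return {row["Combined_Key"]: tuple(row[name] for name in COUNT_FIELDS)
            for row in rows}


def significant_change(previous, current):
    """True when there is no earlier snapshot or any count value differs."""
    if previous is None:
        return True
    return count_signature(previous) != count_signature(current)


old = [{"Combined_Key": "Ada, Idaho, US", "Last_Update": "2020-03-31 18:31:52",
        "Confirmed": 163, "Deaths": 3, "Recovered": 0}]
touched = [dict(old[0], Last_Update="2020-03-31 19:02:10")]  # timestamp only
grew = [dict(old[0], Confirmed=170)]                         # count changed
```

Here `significant_change(old, touched)` is False, so nothing would be written, while `significant_change(old, grew)` is True.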

Assembling the new-style data format is tricky, because it needs to aggregate thousands of records from two different services (and remove the US duplicates from the ncov_cases service). The verbose mode gives an indication of what's happening:

D:\covid-19>python jhuEmulator.py ONCE --verbose=True

Querying 'ncov_cases_US' service (1/5)...
   382.2 KB retrieved (476ms elapsed)
Querying 'ncov_cases_US' service (2/5)...
   281.9 KB retrieved (250ms elapsed)
Querying 'ncov_cases_US' service (3/5)...
   200.1 KB retrieved (176ms elapsed)
Querying 'ncov_cases_US' service (4/5)...
   262.2 KB retrieved (204ms elapsed)
Querying 'ncov_cases_US' service (5/5)...
   169.5 KB retrieved (161ms elapsed)
Querying 'ncov_cases' service...
   163.6 KB retrieved (169ms elapsed)
Creating datafile '2020-03-31_1936Z.csv'...
     2369 rows written (  838061 /  41261 /  175737 )
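The multi-query retrieval and the de-duplication step can be sketched roughly like this. It assumes a caller-supplied `fetch_page(offset, page_size)` function that returns a list of row dictionaries (or None on an empty result), and it assumes that dropping every Country_Region == 'US' row from the world feed is the right de-duplication rule; both are simplifications for illustration.

```python
def fetch_all(fetch_page, page_size=1000):
    """Collect rows page by page until a short (or empty) page is returned."""
    rows, offset = [], 0
    while True:
        page = fetch_page(offset, page_size) or []  # treat None as no rows
        rows.extend(page)
        if len(page) < page_size:
            return rows
        offset += page_size


def merge_feeds(us_rows, world_rows):
    """Combine county rows with world rows, dropping the world feed's US rows."""
    non_us = [row for row in world_rows if row.get("Country_Region") != "US"]
    return list(us_rows) + non_us


# Stand-in for a real service call: three county rows, served two at a time.
counties = [{"Admin2": name, "Country_Region": "US"} for name in ("Ada", "Adair", "Adams")]
us = fetch_all(lambda offset, size: counties[offset:offset + size], page_size=2)
world = [{"Country_Region": "Italy"}, {"Country_Region": "US", "Province_State": "Idaho"}]
combined = merge_feeds(us, world)
```

With the stand-in data, two page requests are made and `combined` holds the three county rows plus Italy.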

For the US-centric audience, I added a --topStates flag (which must be used with --verbose=True) to print a summary:

D:\covid-19>python jhuEmulator.py ONCE --verbose=True --topStates=15

Querying 'ncov_cases_US' service (1/5)...
   390.0 KB retrieved (429ms elapsed)
Querying 'ncov_cases_US' service (2/5)...
   286.2 KB retrieved (168ms elapsed)
Querying 'ncov_cases_US' service (3/5)...
   200.7 KB retrieved (189ms elapsed)
Querying 'ncov_cases_US' service (4/5)...
   268.2 KB retrieved (163ms elapsed)
Querying 'ncov_cases_US' service (5/5)...
   172.5 KB retrieved (157ms elapsed)

                      ===== Top 15 States =====

                State     Confirmed   Deaths  Counties
                New York     75798     1550        56
              New Jersey     18696      267        22
              California      8077      163        48
                Michigan      7615      259        69
           Massachusetts      6620       89        14
                 Florida      6338       77        54
                Illinois      5994       99        54
              Washington      5305      222        35
               Louisiana      5237      239        61
            Pennsylvania      4963       63        60
                 Georgia      3815      111       138
                   Texas      3726       53       131
             Connecticut      3128       69         9
                Colorado      2627       51        48
               Tennessee      2391       23        82
                 *Others     25475      462      1247

Querying 'ncov_cases' service...
   163.6 KB retrieved (132ms elapsed)
Creating datafile '2020-03-31_2220Z.csv'...
     2405 rows written (  855007 /  42032 /  177857 )

D:\covid-19>python jhuEmulator.py ONCE --verbose=True --topStates=15 --adminLevel=1

Querying 'ncov_cases' service...
   163.6 KB retrieved (442ms elapsed)

                 ===== Top 15 States =====

                State     Confirmed   Deaths
                New York     75798     1550
              New Jersey     18696      267
              California      8077      163
                Michigan      7615      259
           Massachusetts      6620       89
                 Florida      6338       77
                Illinois      5994       99
              Washington      5305      222
               Louisiana      5237      239
            Pennsylvania      4963       63
                 Georgia      3815      111
                   Texas      3726       53
             Connecticut      3128       69
                Colorado      2627       51
               Tennessee      2391       23
                 *Others     25475      462

Creating datafile '2020-03-31_2221Z.csv'...
      315 rows written (  855007 /  42032 /  177857 )
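A summary like the ones above boils down to grouping US rows by state, sorting by Confirmed, and folding the remainder into an *Others line. Here is a minimal sketch under the assumption that rows are dictionaries with new-style field names; `top_states` is a hypothetical helper, and the county count column is left out for brevity.

```python
from collections import defaultdict


def top_states(rows, limit):
    """Return (state, confirmed, deaths) tuples for the leading US states."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        if row["Country_Region"] != "US":
            continue
        entry = totals[row["Province_State"]]
        entry[0] += int(row["Confirmed"])
        entry[1] += int(row["Deaths"])
    ranked = sorted(totals.items(), key=lambda item: item[1][0], reverse=True)
    summary = [(name, confirmed, deaths) for name, (confirmed, deaths) in ranked[:limit]]
    rest = ranked[limit:]
    if rest:
        # Everything below the cut is reported as a single lump-sum line.
        summary.append(("*Others",
                        sum(values[0] for _, values in rest),
                        sum(values[1] for _, values in rest)))
    return summary


for name, confirmed, deaths in top_states([], 15):
    print(f"{name:>24} {confirmed:>9} {deaths:>8}")
```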

The format difference arises because Admin2 reporting is by county and doesn't populate Recovered and Active, while Admin1 reporting includes Recovered (which would allow Active to be computed). However, Recovered is now a lump-sum record in the ncov_cases service, so in practice Active and Recovered are not available.

So, how do I know this script populates the same data shared on GitHub?  Well, I wrote a validator utility that creates two CovidSummary objects, iterates through one while searching the other for matching keys, and reports missing rows, data mismatches, and unmatched rows. The output was:

D:\covid-19>python validate.py
========================================================================

0 errors / 3429 lines

Okay, I cheated a bit. Since the daily JHU CSV reports that the Diamond Princess and Grand Princess cruise ships are docked on Null Island, I treated 0.0 degrees latitude/longitude as a wildcard that matches any coordinate. I also only compare coordinates to 5 places, since string comparison doesn't work well with floating-point values. Everything else has aligned perfectly.
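Those two comparison rules might look something like this in code. It is a sketch of the idea only; `coords_match` is a hypothetical name and not necessarily how validate.py does it.

```python
def coords_match(a, b, places=5):
    """Compare two coordinate strings numerically, with 0.0 as a wildcard."""
    x, y = float(a), float(b)
    if x == 0.0 or y == 0.0:
        return True  # Null Island placeholder matches any coordinate
    return round(x, places) == round(y, places)


# Strings that differ only in trailing digits still compare equal.
same = coords_match("34.223334", "34.22333400")
# A cruise ship "docked" at 0.0 matches whatever the other file reports.
wildcard = coords_match("0.0", "35.4437")
```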

UPDATE @ 2100 EST: Unfortunately, sometime today, the US Admin2 jurisdictions without any confirmed cases were deleted from the ncov_cases_US service feed, the FIPS code displayed for the Northern Mariana Islands and US Virgin Islands disappeared, and the US territory without any confirmed cases (American Samoa) also disappeared (along with its FIPS code).  I've tweaked the exporter to conform to this, and it runs cleanly now:

D:\covid-19>python validate.py
========================================
sum1 = data2\2020-03-31.csv
sum2 = data2\2020-04-01_0110Z.csv

0 errors / 2434 lines

But I'm expecting Guam to have its FIPS code removed in the coming days.

Since the validation tool implements Python classes to parse and search both old-style and new-style CSV files, I've attached that as well (as validator-v10.zip).

My next task is to write some code to exploit this near real-time data resource, and use it to maintain PostgreSQL tables to produce clones of the ncov_cases* services' data.

Now attached is jhuEmulator-v20.zip -- Changelog:

  • Implemented adminLevel=2 with multiple queries to replicate
    three-level results in GitHub CSV files
  • Added --confirmedOnly flag to skip counties without
    Confirmed cases (obviated by feed change)
  • Added --topStates to report ordered state-wide impact
    (requires --verbose=True and non-zero --adminLevel)
  • Added FIPS lookup for US Possessions (Note: Puerto Rico = None)
  • Ignored US admin1 summary values in admin2 feed (2-digit FIPS)
  • Added validateKey to correct Combined_Key with missing spaces
    after commas
  • Removed FIPS from US possessions and suppressed US possessions
    without confirmed cases (as per 2020-03-31.csv)
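As an illustration of the Combined_Key cleanup mentioned in the changelog, here is one possible normalization. It is not the validateKey code itself: `normalize_key` is a hypothetical helper, and dropping empty leading parts (as in ",Virgin Islands,US") is my own assumption about the desired result.

```python
def normalize_key(key):
    """Rebuild a Combined_Key with a single space after each comma."""
    parts = [part.strip() for part in key.split(",")]
    # Assumption: empty components (e.g. a missing Admin2) are dropped.
    return ", ".join(part for part in parts if part)


fixed = normalize_key("Abbeville,South Carolina,US")
```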

== Update 01-Apr @ 0100 EST ==

Long-running service execution is great for finding bugs...

Now attached is jhuEmulator-v21.zip -- Changelog:

  • Fixed significantChange field comparison error with Admin1
    field names in Admin2 service

== Update 01-Apr @ 0920 EST ==

Overnight service execution is great for finding bugs, but errors that arise after several hours are less fun...

Now attached is jhuEmulator-v22.zip -- Changelog:

  • Fixed significantChange field comparison error for *all*
    Admin1 field names in the Admin2 service
  • Fixed paren alignment defect in error handling in pullData

Also attached is covidValidate_v11.zip -- Changelog:

  • Tightened CovidReport.compare() loop
  • Fixed range check defect in main

== Update 01-Apr @ 2030 EST ==

The server glitched, dropping all US data, and I got to exercise some error code that doesn't normally see traffic, which I've now tweaked to be more resilient.

Attached is jhuEmulator-v23.zip -- Changelog:

  • Fixed TypeError when service result is empty (None)

== Update 08-Apr @ 1600 EST ==

I tweaked the exporter to always write data at UTC midnight, so that the validator has the same Last_Update value for comparison with the JHU published file. I had to improve date parsing in the CovidReport class (the corrected 2020-04-06.csv has a different date format). I'm still seeing some JHU data with incorrect 000xx FIPS codes, and the Admin1 and Admin2 feeds are out of sync for the Northern Mariana Islands, but otherwise looking good.

Attached is jhuEmulator-v24.zip -- Changelog:

  • Tweak to force 0000Z download, even when --skipTrivial is
    enabled (for comparison with GitHub published files)

Also attached is covidValidate_v12.zip -- Changelog:

  • Cleaned up CovidReport.extract() to handle bad formatting
    gracefully and to support the 'm/d/y HH:MM' date used in
    the revised 2020-04-06.csv datafile
  • Loosened the Last_Update comparison code so that second-
    truncated timestamps can compare successfully
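Handling several Last_Update layouts, and comparing timestamps whose seconds were truncated, could be sketched as below. The format list covers the layouts that appear in this post; `parse_last_update` and `same_minute` are hypothetical helpers rather than the CovidReport methods.

```python
from datetime import datetime

# Layouts seen in the files discussed in this post.
FORMATS = (
    "%Y-%m-%d %H:%M:%S",   # new-style: 2020-03-31 18:31:52
    "%Y-%m-%dT%H:%M:%S",   # old-style: 2020-03-31T18:31:40
    "%m/%d/%y %H:%M",      # revised 2020-04-06.csv: m/d/y HH:MM
)


def parse_last_update(text):
    """Try each known layout in turn; return None if nothing fits."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None


def same_minute(a, b):
    """Compare two timestamps while ignoring seconds."""
    return a.replace(second=0, microsecond=0) == b.replace(second=0, microsecond=0)


full = parse_last_update("2020-04-06 23:22:15")
short = parse_last_update("4/6/20 23:22")
```

Returning None instead of raising lets the caller count a badly formatted row as an error and keep going.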

 

== Update 12-Apr @ 2320 EST ==

The JHU daily CSV format changed again, adding five more fields (and moving FIPS later in the display order). I've got the covidValidate.py handling the new format, but am not yet generating it...

Attached is covidValidate_v30.zip -- Changelog:

  • Count, but don't display, Combined_Key mismatches
  • Added parsing (but not yet validation) of fields added
    to 04-12-2020.csv

- V

5 Comments
Esri Esteemed Contributor

Note: covidValidate.py is reporting errors in 2020-04-03_0000Z.csv (vice the JHU 2020-04-02.csv):

========================================
sum1 = data2\2020-03-31.csv
sum2 = data2\2020-04-01_0110Z.csv

0 errors / 2434 lines

========================================
sum1 = data2\2020-04-01.csv
sum2 = data2\2020-04-02_0017Z.csv

0 errors / 2485 lines

========================================
sum1 = data2\2020-04-02.csv
sum2 = data2\2020-04-03_0000Z.csv

sum1 2332 Guam, US                                           not equal to
     2020-04-02 23:25:27| 13.444300| 144.793700|      82|       3|       0|       0|00066
sum2 2333 Guam, US                                            (FIPS)
     2020-04-02 23:25:27| 13.444300| 144.793700|      82|       3|       0|       0|66000

sum1 2365 Puerto Rico, US                                    not equal to
     2020-04-02 23:25:27| 18.220800| -66.590100|     316|      12|       0|       0|00072
sum2 2366 Puerto Rico, US                                     (FIPS)
     2020-04-02 23:25:27| 18.220800| -66.590100|     316|      12|       0|       0|    

sum1 2387 ,Virgin Islands,US                                 not equal to
     2020-04-02 23:25:27| 18.335800| -64.896300|      30|       0|       0|       0|00078
sum2 2388 ,Virgin Islands,US                                  (FIPS)
     2020-04-02 23:25:27| 18.335800| -64.896300|      30|       0|       0|       0|    

3 errors / 2569 lines

but I'm not eager to place incorrect FIPS codes in the script for Guam, Puerto Rico, and the US Virgin Islands (the two-digit state code should be in the first two places of the five-digit code), and I only removed FIPS for US possessions (except Guam) yesterday, so I'm going to wait another day or two before making a new script snapshot.
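For reference, the five-digit layout described here (two-digit state code first, then a three-digit county code) can be sketched as a tiny formatter. `county_fips` is a hypothetical helper, and using county code 0 for a whole territory is an assumption based on the 66000 value in the sum2 output above.

```python
def county_fips(state_code, county_code=0):
    """Format a FIPS code as two state digits followed by three county digits."""
    return f"{int(state_code):02d}{int(county_code):03d}"


guam = county_fips(66)          # state-level code, county part all zeros
adams_co = county_fips(8, 1)    # leading zero preserved for Colorado
```

By this layout Guam comes out as 66000, not the 00066 seen in the JHU file.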

- V

Esri Esteemed Contributor

Okay, so the date format in the 06-Apr JHU file was retroactively changed from "YYYY-MM-DD HH:MM:SS" to "M/D/YY HH:MM" eighteen hours ago.  I'm working on tweaks to the validation parser...

- V

Esri Esteemed Contributor

Okay, yes, the JHU format changed again tonight, adding five more columns (Incident_Rate, People_Tested, People_Hospitalized, UID, and ISO3). The FIPS column has also moved to a later position. I'll start on recreating the format now...

FWIW: The new service is ncov_cases2_v1, and the layer from which to extract Admin2 data is ncov_cases2_v1/FeatureServer/1.

- V

Esri Esteemed Contributor

Heh. So the format change was fleeting, and is now retroactively back to the layout used since 23-Mar, but if it changes again to leverage the new services, I'll be ready.  In fact, I might release the current v3.0 candidate, for those who want access to the new testing and hospitalization data, but I have a data duplication issue to work through first...

- V

Esri Esteemed Contributor

Okay, the People_Tested / People_Hospitalized values are available from the new JHU data files, in the csse_covid_19_daily_reports_us GitHub folder, and exposed in the ncov_cases2_v1 feature service, but only at the Admin1 level for the United States (and territories), in layer 3.

Given the volatility of the data files in GitHub, I've put together a Python tool to fetch the CSV files, both in realtime and retroactively, renaming previous incarnations if an MD5 hash indicates the file changed.  It doesn't work with Python 3 due to byte/str/unicode issues, so I need to clean that up before publishing...
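The hash-and-rename idea can be sketched in Python 3 by treating file content strictly as bytes, which sidesteps the str/unicode mix-ups. This is only an illustration: `md5_of`, `store_snapshot`, and the hash-suffixed naming scheme are assumptions of mine, not the unpublished tool.

```python
import hashlib
import os
import tempfile


def md5_of(data):
    """Hex MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()


def store_snapshot(path, data):
    """Write bytes to path; if the content changed, keep the old copy renamed."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            old = handle.read()
        if md5_of(old) == md5_of(data):
            return False  # unchanged, nothing to do
        root, ext = os.path.splitext(path)
        # Assumed naming scheme: tag the previous incarnation with its hash.
        os.replace(path, f"{root}.{md5_of(old)[:8]}{ext}")
    with open(path, "wb") as handle:
        handle.write(data)
    return True


with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "04-06-2020.csv")
    results = [store_snapshot(target, b"first"),
               store_snapshot(target, b"first"),
               store_snapshot(target, b"revised")]
    kept = sorted(os.listdir(folder))
```

After the three calls, `results` is [True, False, True] and the folder holds the current file plus one renamed earlier version.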

- V

About the Author
Thirty-two years with Esri, with expertise in Software Engineering, Large Database Management and Administration, Real-time Spatial Data Manipulation, Modeling and Simulation, and Spatial Data Manipulation involving Topology, Projection, and Geometric Transformation. Certifications: Esri EGMP (19001), Security+ (SY0-501), Esri EGMP (10.3/10.1), Esri ESDA (10.3), Esri EGMA (10.1) Note: Please do not try to message me directly; just ask a question in an appropriate community.