Python utility to write COVID-19 CSV snapshots from realtime feature service(s)

03-24-2020 01:40 PM
VinceAngelo, Esri Esteemed Contributor

Like many, I've been using the daily data from the Johns Hopkins University Ops Dashboard site, but I've been frustrated by the periodicity and timing -- midnight UTC is all well and good, but the data is already 12+ hours old by the time folks start their workday on the US East Coast.

But there is another route to the data, since it's published as a feature service. So Sunday night I whipped up a Python script that pulls the full feature feed from AGOL and writes an hourly file in the same format. This script is plain-vanilla Python (no arcpy), and has been tested on both Python 2.7.16 and 3.8.2 (a sketch of the core pull follows the sample output below). Execution looks something like this:

D:\covid-19>python jhuEmulator.py DELAY --verbose=True --interval=2h

Sleeping 35.67 minutes...

Querying feature service...
   155.6 KB retrieved (310ms elapsed)
Creating datafile '2020-03-25_0600Z.csv'...
     300 rows written (  422989 /  18916 /  108578 )

Sleeping 2.00 hours...

Querying feature service...
   155.6 KB retrieved (179ms elapsed)
Creating datafile '2020-03-25_0800Z.csv'...
     300 rows written (  423670 /  18923 /  108629 )

Sleeping 2.00 hours...

Querying feature service...
   155.6 KB retrieved (180ms elapsed)
Creating datafile '2020-03-25_1000Z.csv'...
     300 rows written (  425493 /  18963 /  109191 )

Sleeping 2.00 hours...

Querying feature service...
   156.1 KB retrieved (146ms elapsed)
Creating datafile '2020-03-25_1200Z.csv'...
     301 rows written (  435006 /  19625 /  111822 )

Sleeping 2.00 hours...

Exiting on 'stop.now' request

D:\covid-19>dir/b data
2020-03-25_0600Z.csv
2020-03-25_0800Z.csv
2020-03-25_1000Z.csv
2020-03-25_1200Z.csv

D:\covid-19>head data\2020-03-25_1200Z.csv
Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,Latitude,Longitude
,Italy,2020-03-25T07:24:40,69176,6820,8326,41.87194,12.56738
Hubei,China,2020-03-25T02:43:02,67801,3163,60810,30.9756,112.2707
,Spain,2020-03-25T07:24:40,47610,3434,5367,40.463667,-3.74922
,Germany,2020-03-25T07:24:40,34009,172,3532,51.165691,10.451526
,Iran,2020-03-25T07:24:40,27017,2077,9625,32.427908,53.688046
New York,US,2020-03-25T07:24:54,26376,271,0,42.165726,-74.948051
,France,2020-03-25T07:24:40,22304,1100,3281,46.2276,2.2137
,Switzerland,2020-03-25T07:24:40,10171,135,131,46.8182,8.2275
,Korea, South,2020-03-25T07:24:40,9137,126,3730,35.907757,127.766922
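
As promised above, here's a minimal sketch of the core pull side, assuming the public JHU cases service on AGOL; the URL, query parameters, and field handling are illustrative, not the script's exact internals:

# Minimal sketch of the feature pull (Python 3 imports shown, with a
# Python 2 fallback); the service URL and parameters are illustrative.
import json
try:
    from urllib.request import urlopen
    from urllib.parse import urlencode
except ImportError:                      # Python 2
    from urllib import urlencode
    from urllib2 import urlopen

QUERY_URL = ("https://services1.arcgis.com/0MSEUqKaxRlEPj5g/arcgis/rest/"
             "services/ncov_cases/FeatureServer/1/query")

def pull_features():
    """Pull every feature's attributes from the service as JSON."""
    params = urlencode({
        'where': '1=1',             # all rows
        'outFields': '*',           # all columns
        'returnGeometry': 'false',  # Lat/Lon arrive as attribute fields
        'f': 'json',
    })
    reply = urlopen(QUERY_URL + '?' + params).read()
    return json.loads(reply.decode('utf-8'))['features']

Each feature's attributes dict then maps onto one row of the Province/State,Country/Region,... layout shown above.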

Then I worried about system load (One ought not pound the free service that one wants to use into oblivion) and update frequency (Wouldn't it be nice to get updates as they happen?), so I started working on tweaks:

  • Added argparse command-line switches to change execution parameters (vice hacking the source)
  • Added a 0-20 second random delay before the query executes (so synchronized servers don't overload the feature service, yet with plenty of time to finish within the targeted minute; see the sketch after this list)
  • Permitted a directory tree archiving style (creating directories if necessary)
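
The jitter idea in miniature (a sketch; the helper name is mine, not the script's):

# Sketch: sleep to the next interval boundary, plus 0-20 random seconds
# so synchronized hosts don't all hit the service at the same instant.
import random
import time

def seconds_to_next_boundary(interval_seconds):
    """Seconds remaining until the next wall-clock interval boundary."""
    return interval_seconds - (time.time() % interval_seconds)

# e.g. --interval=2h -> 7200 seconds
time.sleep(seconds_to_next_boundary(7200) + random.uniform(0, 20))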

I started working on a way to only write a dataset when there's been a substantive change since the previous timestamp (so the timing could be shortened to, say, 5 minutes, without needlessly transmitting files). But I discovered I can't use the Last_Update field for this, since the timestamp changes even when case counts don't. Then the CSV data format posted on GitHub changed Monday night to include Admin2 (county-level) data in the US (at lower resolution -- no Recovered/Active values). So I'll post this as-is for reference (or for those that now need an alternate source for data in the old format) and figure out how to model the new format...

Note that there are three modes (you need to choose one; a dispatch sketch follows the list):

  • ONCE - Create one file, now, and exit (suitable for a cron job or other batch-oriented invocation at an interval greater than one hour)
  • IMMEDIATE - Create one file now, and then continue file creation based on the interval modulus (default hourly)
  • DELAY - Create first file in next expected time window (if the old job was killed, and you just want to pick up where left off)
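
A sketch of how the modes might dispatch; the helper functions are hypothetical stand-ins for the script's actual internals:

# Sketch of the three execution modes (helper names are hypothetical).
import argparse

def write_snapshot():
    """Pull the feed and write one CSV (stub; see the pull sketch above)."""

def sleep_to_next_window(interval):
    """Sleep until the next interval boundary (stub)."""

def run_on_interval(interval):
    """Loop: sleep to each boundary, then write a snapshot (stub)."""

parser = argparse.ArgumentParser()
parser.add_argument('mode', choices=['ONCE', 'IMMEDIATE', 'DELAY'])
parser.add_argument('--interval', default='1h')
args = parser.parse_args()

if args.mode == 'ONCE':
    write_snapshot()                     # one file, then exit
elif args.mode == 'IMMEDIATE':
    write_snapshot()                     # one file now...
    run_on_interval(args.interval)       # ...then on the interval modulus
else:                                    # DELAY
    sleep_to_next_window(args.interval)  # pick up at the next window
    run_on_interval(args.interval)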

It's possible to place files in a directory tree based on the time format.  For example, if you specify
   --csvFormat="y%Y\m%m\d%d\t%H%MZ.csv"

Then the yYYYY\mMM\dDD folder will be created if necessary, and the "tHHMMZ.csv" data file written into it.
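
The mechanics are roughly this (a sketch; the pattern is the --csvFormat value above, expanded with strftime):

# Sketch: expand the --csvFormat pattern against the current UTC time,
# then create any missing folders before writing the file.
import os
import time

csv_format = r"y%Y\m%m\d%d\t%H%MZ.csv"             # the flag value above
rel_path = time.strftime(csv_format, time.gmtime())
rel_path = rel_path.replace('\\', os.sep)          # portable separators
folder = os.path.dirname(rel_path)                 # e.g. y2020\m03\d25
if folder and not os.path.isdir(folder):
    os.makedirs(folder)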

You can gracefully exit the application by creating a "stop.now" file in the active directory (or creating the file referenced by the optional --stopPath flag).
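The stop-file check can be expressed as an interruptible sleep (a sketch, using the default stop.now name):

# Sketch: sleep in short slices and poll for the stop file, so the app
# can exit gracefully soon after the file appears.
import os
import time

def interruptible_sleep(total_seconds, stop_path='stop.now'):
    """Return True if the stop file appeared during the sleep."""
    deadline = time.time() + total_seconds
    while time.time() < deadline:
        if os.path.exists(stop_path):
            return True
        time.sleep(min(60.0, max(0.0, deadline - time.time())))
    return False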

== Update 25-Mar @ 0800 EST ==

Now attached is jhuEmulator-v11.zip -- Changelog:

  • Parameterized URL assembly for future US feed support
  • Added mapping for missing fields (to None value)
  • Fixed stopPath utilization defect
  • Fixed sys.exit() defect
  • Reduced wait for stopPath detection
  • Expanded interval to even divisions of hour and day
    (2m,3m,4m,5m,6m,10m,12m,15m,30m,1h,2h,3h,4h,6h,12h)
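
Expressed as data, the accepted intervals might look like this (the values come from the changelog above; the dict itself is illustrative):

# Accepted --interval values mapped to seconds; each divides an hour
# or a day evenly, so file timestamps stay on clean boundaries.
VALID_INTERVALS = {
    '2m': 120, '3m': 180, '4m': 240, '5m': 300, '6m': 360,
    '10m': 600, '12m': 720, '15m': 900, '30m': 1800,
    '1h': 3600, '2h': 7200, '3h': 10800, '4h': 14400,
    '6h': 21600, '12h': 43200,
}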

I should probably re-emphasize that this tool can be used to preserve the "old format" feed until your tools have the ability to process the new format with FIPS and Admin2 fields.

== Update 25-Mar @ 0840 EST ==

Now attached is jhuEmulator-v12.zip -- Changelog:

  • Fixed IMMEDIATE computeDelay() defect

Note that (as of v11) creating the stop file will gracefully exit the app within a minute.

== Update 25-Mar @ 1520 EST ==

Now attached is jhuEmulator-v13.zip -- Changelog:

  • Detect low-volume data pulls and skip write operation
  • Fixed defect in byVolume sorting (for/else)
  • Fixed defect in pull error logic
  • 0.4 millisecond/existence test timing compensation
  • Made Interrupt handling in sleep cleaner

Working on nominal change detection and US counties feed support next...

== Update 25-Mar @ 1820 EST ==

Now attached is jhuEmulator-v14.zip -- Changelog:

  • Added substantive change detection (only write if case
    counts or Lat/Lon are different, not if just Last_Update
    changed) - Uses new --skipTrivial flag (default False)
  • Fixed naming defect in main loop
  • Self-initializing record count expectation

The --skipTrivial flag is what I had been looking to achieve in near-realtime feed support. With it, you can set a short interval (though not less than 2 minutes) and have CSV files written only when "substantive change" is detected (a change in row count or in the Confirmed, Deaths, or Recovered field values, mostly; a sketch of the test follows the output below). Executing it with --interval=2m (which is probably way too often -- 5, 6, 10, 12, or 15 would be better) results in output like this:

D:\covid-19>python jhuEmulator.py IMMEDIATE ^
More? --verbose=True --skipTrivial=True --interval=2m

Querying feature service...
   156.6 KB retrieved (232ms elapsed)
Creating datafile '2020-03-25_2207Z.csv'...
     302 rows written (  464026 /  20946 /  113691 )

Sleeping 0.15 minutes...

Querying feature service...
   156.6 KB retrieved (152ms elapsed)
Skipping write of trivial change to '2020-03-25_2208Z.csv'...

Sleeping 2.07 minutes...

Querying feature service...
   156.6 KB retrieved (195ms elapsed)
Skipping write of trivial change to '2020-03-25_2210Z.csv'...

Sleeping 2.16 minutes...

Querying feature service...
   156.6 KB retrieved (176ms elapsed)
Skipping write of trivial change to '2020-03-25_2212Z.csv'...

Sleeping 1.80 minutes...

Querying feature service...
   156.6 KB retrieved (187ms elapsed)
Skipping write of trivial change to '2020-03-25_2214Z.csv'...

Sleeping 2.20 minutes...

Querying feature service...
   157.2 KB retrieved (185ms elapsed)
Creating datafile '2020-03-25_2216Z.csv'...
     303 rows written (  466836 /  21152 /  113769 )

Sleeping 1.88 minutes...

Querying feature service...
   157.2 KB retrieved (179ms elapsed)
Skipping write of trivial change to '2020-03-25_2218Z.csv'...
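
Here's the shape of that substantive-change test (a sketch; the attribute names are common ones for this service, but treat them as illustrative):

# Sketch: reduce each pull to just the values that matter, then compare
# against the previous pull. Last_Update is deliberately excluded, and
# a row-count change shows up as a different-length list.
SIGNIFICANT = ('Confirmed', 'Deaths', 'Recovered', 'Lat', 'Long_')

def snapshot_key(features):
    """A comparable summary of one pull's significant values."""
    return sorted(
        (tuple(f['attributes'].get(name) for name in SIGNIFICANT)
         for f in features),
        key=repr,  # repr-sort tolerates None among the numbers
    )

def is_trivial_change(previous, current):
    return snapshot_key(previous) == snapshot_key(current)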

My next goal is to write data in the new (post 23-Mar) JHU CSV format from the ..._US service. I'll likely set it up to support both global and US-only output.

Please let me know if you use this or encounter problems...

== Update 25-Mar @ 2200 EST ==

Blog title updated.

== Update 26-Mar @ 1800 EST ==

Now attached is jhuEmulator-v15.zip -- Changelog:

  • Added time prefix to non-verbose output
  • Added --adminLevel=[0,1] flag to permit country summaries
  • Added --usOnly support to restrict results to US data
  • Fixed ONCE defect when retrieval fails
  • Fixed CSV quoting bug with Python 3
  • Note: adminLevel=2 is partially functioning, but disabled
    until a work-around for the 1000 feature limit is added

Good news: jhuEmulator has code to export new-style format files

Bad news:   Only 1000 rows can be received per query

I did add the --adminLevel flag, but right now it only functions to summarize the countries that are usually distributed with province/state level data (US, Canada, China, and Australia).
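
The roll-up itself is simple in principle (a sketch with illustrative field names):

# Sketch of the adminLevel=0 summary: collapse province/state rows
# into one row per country by summing the case counts.
from collections import defaultdict

def summarize_by_country(rows):
    totals = defaultdict(lambda: [0, 0, 0])  # Confirmed, Deaths, Recovered
    for row in rows:
        counts = totals[row['Country_Region']]
        counts[0] += row['Confirmed'] or 0
        counts[1] += row['Deaths'] or 0
        counts[2] += row['Recovered'] or 0
    return dict(totals)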

Python 3 users: Note that v1.5 fixes a formatting defect that failed to place double-quotes around values with embedded quotes.
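
The fix amounts to letting the csv module do the quoting (a sketch, echoing the "Korea, South" row from the sample output above):

# csv.writer double-quotes any value with an embedded comma or quote,
# which is the behavior the fix restores under Python 3.
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator='\n')
writer.writerow(['', 'Korea, South', '2020-03-25T07:24:40',
                 9137, 126, 3730, 35.907757, 127.766922])
print(buf.getvalue())
# ,"Korea, South",2020-03-25T07:24:40,9137,126,3730,35.907757,127.766922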

The next snapshot should be v2.0, which will assemble multiple queries into a single file which approximates the new-format CSV style.
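
One way past the 1000-feature cap is standard ArcGIS REST paging, assuming the service supports pagination (a sketch; the URL is illustrative):

# Sketch: page through the service with resultOffset/resultRecordCount
# until a page comes back short, then stitch the pages together.
import json
try:
    from urllib.request import urlopen
    from urllib.parse import urlencode
except ImportError:  # Python 2
    from urllib import urlencode
    from urllib2 import urlopen

QUERY_URL = ("https://services1.arcgis.com/0MSEUqKaxRlEPj5g/arcgis/rest/"
             "services/ncov_cases_US/FeatureServer/1/query")

def pull_all_features(page_size=1000):
    """Accumulate pages until one comes back short of page_size."""
    features, offset = [], 0
    while True:
        params = urlencode({
            'where': '1=1', 'outFields': '*', 'f': 'json',
            'returnGeometry': 'false',
            'resultOffset': offset, 'resultRecordCount': page_size,
        })
        reply = urlopen(QUERY_URL + '?' + params).read()
        page = json.loads(reply.decode('utf-8'))['features']
        features.extend(page)
        if len(page) < page_size:
            return features
        offset += page_size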

== Update 29-Mar @ 2330 EST ==

Now attached is jhuEmulator-v16.zip -- Changelog:

  • Fixed timezone defect in datetime.datetime.fromtimestamp call
    (Last_Update had been captured in local timezone, not UTC)
  • Added date display in non-verbose operation at midnight UTC

I'm still working on the Admin2 format support. It's nearly done, but not ready for release. I discovered an ugly bug in the v15-and-earlier extraction -- the UTC dates in Last_Update were recorded in the local timezone, not UTC -- so I wanted to get this fix out there. BUT I'm now showing far more cases than the JHU Ops Dashboard, so the code that was supposed to accumulate admin2 data into an admin1 summary seems to be double-counting. The same issue exists in v15 and earlier (and v20, right now), so I'll need to figure out what went wrong...
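
For reference, the timezone fix amounts to making the UTC conversion explicit (Python 3 shown; a sketch of the idea, not the script's exact line):

# The service delivers Last_Update as epoch milliseconds; without an
# explicit tz, fromtimestamp() converts into the local zone instead.
import datetime

def last_update_utc(epoch_ms):
    stamp = datetime.datetime.fromtimestamp(
        epoch_ms / 1000.0, tz=datetime.timezone.utc)
    return stamp.strftime('%Y-%m-%dT%H:%M:%S')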

== Update 29-Mar @ 2330 EST ==

Heh. My Chrome session needed a refresh. The total on the website correlates with the extraction feed.

For what it's worth, here's the log of my last 26 hours of data extraction, pulling data every 15 minutes, but only saving if there was non-trivial change:

D:\covid-19>python jhuEmulator.py DELAY --interval=15m --skipTrivial=True
  0245:    312 rows written (  663828 /  30822 /  139451 )
  0300:    312 rows written (  664608 /  30846 /  140156 )
  0400:    312 rows written (  664695 /  30847 /  140156 )
  0515:    312 rows written (  664924 /  30848 /  140222 )
  0630:    312 rows written (  665164 /  30852 /  140225 )
  0745:    312 rows written (  665616 /  30857 /  141746 )
  0845:    312 rows written (  666211 /  30864 /  141789 )
  0945:    312 rows written (  669312 /  30982 /  142100 )
  1100:    312 rows written (  678720 /  31700 /  145609 )
  1200:    312 rows written (  679977 /  31734 /  145625 )
  1315:    312 rows written (  681706 /  31882 /  145696 )
  1430:    312 rows written (  684652 /  32113 /  145696 )
  1530:    312 rows written (  685623 /  32137 /  145706 )
  1645:    312 rows written (  691867 /  32988 /  146613 )
  1815:    312 rows written (  704095 /  33509 /  148824 )
  1900:    312 rows written (  710918 /  33551 /  148900 )
  2000:    312 rows written (  713171 /  33597 /  148995 )
  2130:    312 rows written (  716101 /  33854 /  149071 )
  2230:    312 rows written (  718685 /  33881 /  149076 )
  2345:    312 rows written (  720117 /  33925 /  149082 )
  ----> Mon 2020-03-30 UTC
  0045:    312 rows written (  721584 /  33958 /  149122 )
  0200:    312 rows written (  721817 /  33968 /  151204 )
  0300:    312 rows written (  722289 /  33984 /  151901 )
  0400:    312 rows written (  722435 /  33997 /  151991 )

== Update 31-Mar @ 2120 EST ==

The Admin2 format emulator jhuEmulator-v20.zip is now published in the new blog post: 

     https://community.esri.com/people/vangelo-esristaff/blog/2020/04/01/python-utility-to-write-covid-19... 

There were output changes just today, so v2.0 is likely to have some updates to keep in sync. The most significant change for anyone using the v1.x tools is that the new default file format is the post-03/23 format, but that can be overridden with --adminVersion=1.

- V

About the Author
Thirty-five years with Esri, with expertise in Software Engineering, Large Database Management and Administration, Real-time Spatial Data Manipulation, Modeling and Simulation, and Spatial Data Manipulation involving Topology, Projection, and Geometric Transformation. Certifications: Security+ (SY0-601), Esri EGMP (2201), Esri EGMP (19001), Esri EGMP (10.3/10.1), Esri ESDA (10.3), Esri EGMA (10.1).
Note: Please do not try to message me directly or tag me in questions; just ask a question in an appropriate community, and if I see it, have something to add, and have the time, I'll respond.