
Extracting GDELT with Enterprise 10.7 Web Hooks

JeffreyScarmazzi2 · 07-31-2019 06:51 AM

This article will cover a few of the lessons I’ve learned while building my first web hook. Hopefully an overview of what I learned and a copy of the code (provided at the end) will help you avoid potential headaches, apply this prototype to a similar idea, or come up with something that wouldn’t have crossed my mind. Another great place to get started with web hooks can be found here.

Let’s assume that we have a client who would like to use the Global Database of Events, Language, and Tone (GDELT) data set to anticipate threats against their infrastructure around the world. Essentially, the client wants a dashboard that will help identify situations where the tone in the media changes abruptly and their company appears in the top keywords. To fulfill this business requirement, we can write a program to fetch the location, average tone, total articles, and top keywords within a set distance of all their facilities every 15 minutes (the rate at which GDELT 2.0 provides updates). Historically, this kind of data ingestion has been handled by a Python script plugged into a task scheduler. That approach is perfect for a well-defined process where a set of known inputs is mapped to outputs in a single direction. But what if we are also expected to provide a workflow for the analysts of this organization, who might have a subset of new facilities that need to be tracked outside of the normal execution of this ETL routine? Your first instinct might be to create a web tool and publish it to the Portal from ArcGIS Pro, but starting with the 10.7 release of ArcGIS Enterprise there is a new pattern that allows us to provide this functionality to analysts and expand upon familiar workflows: web hooks.

For this prototype we are just taking a simple ETL process, inserting it into a Flask application, and using web hooks to create a new experience for the end users in Portal. The criterion for firing this web hook is an update operation (e.g. uploading a shapefile) on an item that includes the tag “GDELT”. If this does not seem impressive at first, try to think about other situations where simply tagging an item could kick off a process that is largely hidden from the user. For instance, this same workflow could be applied to content promotion, where someone applies the tag “PROMOTION” to a given item and it is immediately published in the next highest environment (e.g. from development to staging). Or imagine that you have two users doing analysis in the same county, but they are in separate parts of the organization and are not aware of each other. You could develop a web hook that looks at the extent and tags of newly created items and, if it finds two or more people working on what appear to be similar topics in the same area, notifies them that they might want to connect inside a newly created group. What excites me about all of this is not simply that I have a new way to extend the platform as a developer, but rather that this new functionality might give me new ways of engaging users and turning the Portal into a place of discovery.

[Screenshots: the analyst's workflow in Portal, showing the tagged Monitor layer and the generated GDELT content in a web map]

In the screenshots above, our analyst has gone through the workflow that our new web hook supports. They have published a polygon shapefile as a Hosted Feature Layer (Monitor) with the tag “GDELT” and added this to a web map (the green ellipse). In addition, they’ve added the content (EXT CSV <TIME>) that was automatically generated by the web hook (the graduated circles). For this demonstration the user would have to know something about the naming scheme of the automated content in order to add the correct layer. Additionally, they would have to wait an arbitrary amount of time before they could find the new content (this prototype usually takes about a minute to complete). However, it would be trivial to add an email component to our web hook that lets the user know whenever their content has been generated in the Portal. It is also worth mentioning that our web hook does not necessarily need to be manually triggered (i.e. by uploading a shapefile) by the analyst. A developer in the organization could be doing some other automated work against a set of facilities, and if they decided to apply the tag “GDELT”, their program could essentially tie into the web hook functionality for free. Now that we have a general understanding of what we are doing with web hooks, let’s talk about the configuration details and how it was built with Flask.

We need to have our Flask application running in order to configure our web hook. Shown below is a snapshot of the application that we will cover in more detail momentarily. For now it is enough to say that it is a simple Flask application with a single route (/gdelt) that will consume events emitted by Portal.

[Screenshot: the Flask application running in the console]
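If you are new to Flask, the skeleton of such an application looks roughly like the following. This is a minimal sketch; the handler body is simplified relative to index.py, and the dispatch into the ETL logic is only indicated by a comment.

```python
# Minimal Flask skeleton for receiving Portal web hook events.
# The handler body is illustrative; the full logic lives in index.py.
from flask import Flask, request

app = Flask(__name__)

@app.route("/gdelt", methods=["POST"])
def gdelt():
    payload = request.get_json(force=True) or {}
    events = payload.get("events", [])
    if not events:
        # Portal sends an empty event list when the web hook is first registered.
        return "No events to process", 200
    for event in events:
        print("Parsing Event List")
        print(f"E: {event}")
        # dispatch into the ETL logic here
    return "OK", 200

if __name__ == "__main__":
    # Portal requires HTTPS; 'adhoc' generates a self-signed certificate
    # (this requires the pyopenssl package).
    app.run(host="0.0.0.0", port=5000, ssl_context="adhoc")
```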

With our Flask application running, we can go to the sharing endpoint (/sharing/rest/portals/self) of Portal and get started on our web hook. This article will not focus on all of the details for configuring a web hook, but you can learn about the process and the various parameters here. If you are just getting started with web hooks, it is a good idea to bump up the deactivationPolicy value so that your web hook does not deactivate while you are building and debugging. Aside from this change to the configuration properties, you really only need to brush up on the implications for the events you choose (e.g. /items or /users) to consume in your web hook.
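As a rough illustration only, the registration amounts to pointing Portal at our route and choosing which events to subscribe to. The field names below (other than the events list and the deactivationPolicy concept mentioned above) are assumptions, so defer to the documentation linked above for the real schema:

```python
# Illustrative registration values only; field names are assumed,
# not the documented schema.
webhook = {
    "name": "gdelt-hook",                    # hypothetical name
    "url": "https://flask-host:5000/gdelt",  # the route our application exposes
    "events": ["/items"],                    # all item events (vs. e.g. /items/add)
    "deactivationPolicy": {"numberOfFailures": 50},  # field name assumed; raise it while debugging
}
```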

Your application should receive an initial event if the web hook was created successfully. The screenshot above simply shows our application aborting after being passed an empty event list. If you are running into trouble while registering your web hook, ensure that the route on your Flask application has a POST method defined and that the application has some support for SSL (e.g. ssl_context='adhoc'). Notice that we used /items instead of /items/add. At the moment, our web hook will fire in both of the following instances: when a user uploads a shapefile to Portal and when a user updates the description of that newly created Hosted Feature Layer (HFL). Do we really want that to be the case? If we leave it as it is, our end user can update the geometry of the HFL and then update some arbitrary value of the HFL (e.g. the description or the title) in order to generate new content. This is a useful workaround until row-level events are supported in web hooks. The problem is that we have to do a lot more filtering in our Flask application. So do you take everything (/items) and add more logic to filter in the application, or do you focus the process a little more (/items/add) and build multiple routes/applications to handle various events? As you would expect, there is no “right” answer; the important point is that web hooks give you options for experimentation.
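To give a flavor of the first option, the filtering starts with something as simple as inspecting the operation on each incoming event. This is a sketch in the spirit of filter_event in index.py, not the exact implementation:

```python
def filter_event(event):
    """Keep only the 'update' operations our ETL cares about (a sketch)."""
    return event.get("operation") == "update"

# Applied to an incoming payload:
# updates = [e for e in payload.get("events", []) if filter_event(e)]
```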

With our web hook set up and responding to events sent from Portal, we can now upload a shapefile and evaluate the execution process. Before diving into the “logs” (we are just printing to the console like professionals) and learning about how these events are handled, let’s do a quick overview of what our web hook does (a simplified code sketch follows the list):

  • Waits for an ‘update’ event. 
  • When it has an ‘update’ event, ensures the incoming item has ‘GDELT’ as a tag. 
  • Collects the latest CSV from GDELT. 
  • Determines which of these records intersects the new item geometry. 
  • Publishes these intersection features to Portal as a Hosted Feature Layer.
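Stripped of error handling, those steps might look roughly like the sketch below. The GDELT column indices and all helper names are illustrative; the real implementation is split between index.py and the /extractor scripts.

```python
# A simplified sketch of the ETL core, assuming GDELT's lastupdate.txt manifest.
import io
import zipfile

import pandas as pd
import requests

def latest_gdelt_export():
    """Download the most recent GDELT 2.0 events export (refreshed every 15 minutes)."""
    manifest = requests.get("http://data.gdeltproject.org/gdeltv2/lastupdate.txt").text
    export_url = manifest.splitlines()[0].split()[-1]  # first line lists the events CSV
    archive = zipfile.ZipFile(io.BytesIO(requests.get(export_url).content))
    return pd.read_csv(archive.open(archive.namelist()[0]), sep="\t", header=None)

def within_extent(df, extent, lat_col=56, lon_col=57):
    """Crudely clip records to an item's bounding box; the real code intersects
    against the item's actual geometry. Column indices here are assumed."""
    xmin, ymin, xmax, ymax = extent
    return df[df[lon_col].between(xmin, xmax) & df[lat_col].between(ymin, ymax)]

def publish_csv(gis, csv_path, title):
    """Add a CSV to Portal and publish it as a Hosted Feature Layer.
    `gis` is an authenticated arcgis.gis.GIS connection."""
    item = gis.content.add({"title": title, "type": "CSV"}, data=csv_path)
    return item.publish()
```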

The full “logs” will be in the GitHub repository linked at the bottom of this article, but let’s look at an abridged version to see what happens when an event is passed to our web hook. It will be helpful to go through this log and try to see where in index.py the logic is happening.

 

In each step below, “Log” is the application’s console output, and E is a snippet of the event sent from Portal.

Step 1: Initial shapefile upload. Ignored by filter_event.
  Log: Parsing Event List
  E: {'id': 'dd8e3cd7356e49d3a30a96aa4e6cb8a4' ... 'operation': 'publish' ...}

Step 2: Initial shapefile upload. Ignored by filter_event.
  Log: Parsing Event List
  E: {'id': 'dd8e3cd7356e49d3a30a96aa4e6cb8a4' ... 'operation': 'add' ...}

Step 3: Initial shapefile upload. Our app logic begins when the event is an update.
  Log: Parsing Event List
  E: {'id': '27399665e56d461497f07072cad0dae0' ... 'operation': 'update' ...}

Step 4: GDELT tag found. Content generation begins.
  Log: Handling Update Operation ... Connecting to GIS ... Finding Intersections ...

Step 5: While the initial item is processing, a second update event arrives for the same item id.
  Log: Parsing Event List
  E: {'id': '27399665e56d461497f07072cad0dae0' ... 'operation': 'update' ...}

Step 6: App logic has stored the item id from the first call and ignores this new event.
  Log: Handling Update Operation ... Ignoring Feature Currently Processing ...

Step 7: The process from Step 4 has reached the point of publishing new content.
  Log: ... Publishing CSV to HFL ...

Step 8: Result of the process from Step 7.
  Log: Parsing Event List
  E: {'id': '2db6b8ce0f8d425297425e928811520f' ... 'operation': 'add' ...}

Step 9: Another internal Portal call that is treated as an update operation (as in Step 5).
  Log: Parsing Event List
  E: {'id': '27399665e56d461497f07072cad0dae0' ... 'operation': 'update' ...}

Step 10: See Step 6.
  Log: Handling Update Operation ... Ignoring Feature Currently Processing

Step 11: Result of the process from Step 7: an update event for the newly generated content.
  Log: Parsing Event List
  E: {'id': '6636370de0db422da8a2cfaafc2c3f84' ... 'operation': 'update' ...}

Step 12: The event from Step 11 does not proceed because no tag named GDELT is found in get_event_item.
  Log: Handling Update Operation ... Fetching Item for Processing ...

Step 13: Step 4 has finished and the new content has been generated in Portal.
  Log: Elapsed time: 46.83072066307068
  Generated HFL ID: 6636370de0db422da8a2cfaafc2c3f84

The important concept to take away from this play-by-play centers on Steps 6 and 10. Basically, the flow of our web hook needed the following additions to function properly (a sketch of the item-id bookkeeping follows the list):

 

  • Reads a local collection of running item ids from disk. 
  • Waits for an ‘update’ event. 
  • When it has an ‘update’ event, ensures the incoming item has ‘GDELT’ as a tag. 
  • Checks whether the item is in the list of currently running items; if so, aborts the operation.
  • Collects the latest CSV from GDELT.
  • Determines which of these records intersects the new item geometry. 
  • Publishes these intersection features to Portal as a Hosted Feature Layer.
  • Removes the completed item id from the list of running items.
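A sketch of that bookkeeping, using the same local-JSON approach as running.json in the repository (the function names here are illustrative):

```python
import json
from pathlib import Path

RUNNING = Path("running.json")  # the same file shipped in the repository

def _load():
    return set(json.loads(RUNNING.read_text())) if RUNNING.exists() else set()

def _save(ids):
    RUNNING.write_text(json.dumps(sorted(ids)))

def try_claim(item_id):
    """Record an item as in-flight; return False if it is already running.
    Read-then-write on a file is not atomic, hence the sqlite caveat below."""
    ids = _load()
    if item_id in ids:
        return False
    ids.add(item_id)
    _save(ids)
    return True

def release(item_id):
    """Remove a finished (or failed) item so future events can process it."""
    ids = _load()
    ids.discard(item_id)
    _save(ids)
```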

Reading and writing a local JSON file was enough for this particular example, but there are likely much better options (e.g. sqlite) once this is running in production and many events are arriving at a time. If we did not track items in some way, the update event that fired twice for the same item would result in duplicate content. Race conditions and scenarios where multiple events fire for a single item are important to watch for when you are getting started with web hooks, and they present challenges that are not common in regular ETL routines. However, even with our kludgy solution for avoiding multiple calls during publication, there was another issue that we did not handle for this prototype. Take a look at the following error message, which appeared when another shapefile was uploaded with the correct tag:

GetLayers for parameter 0 failed. Error: {"code": 0, "messageCode": "GPEXT_006", "message": "Accessing URL https://scarmazzij.esri.com/server/rest/services/Hosted/Monitor/FeatureServer/0/query failed with error ERROR: code:500, Error performing query operation, Internal server error..", "params": {"url" : "https://scarmazzij.esri.com/server/rest/services/Hosted/Monitor/FeatureServer/0/query","error" : "ERROR: code:500, Error performing query operation, Internal server error."}}


This error was the result of an operation that would have happened at Step 4 in the process above: line 120 in index.py (l_df = dissolve_boundaries(event_layer).query().sdf) executed before the new service was ready to be queried. A robust solution would retry contacting the newly published service some number of times before aborting.
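A sketch of that retry logic, assuming the arcgis package's FeatureLayer (the attempt count and delay are arbitrary placeholders):

```python
import time
from arcgis.features import FeatureLayer

def wait_for_service(url, attempts=5, delay=10):
    """Poll a newly published service until it answers a query, or give up."""
    layer = FeatureLayer(url)
    for _ in range(attempts):
        try:
            layer.query(return_count_only=True)  # cheap readiness probe
            return layer
        except Exception:
            time.sleep(delay)
    raise RuntimeError(f"Service at {url} never became ready")
```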

 

Why have we chosen to write our web hook in Flask over something like Express? For this particular application, using the ArcGIS API for Python to handle the basic overlay and publication operations was valuable. It is quite likely that many of the GIS analysts who create traditional ETL tools in the ArcGIS platform are more comfortable with Python than JavaScript. However, the considerations for bringing the Flask application up to par with a Node.js application (e.g. setting up a WSGI solution) might tip the scales toward building the tool in Express. Esri provides a number of examples in case you are proficient in something like Java or JavaScript. You will also see people using a service like AWS Lambda for their web hooks because it greatly reduces the headaches surrounding deployment and scaling of a given tool. Our prototype is an example of a case where AWS Lambda would not be a natural fit, because we have to manage some state (i.e. which items are currently being processed).

 

This blog was only meant to illustrate some of the issues and ideas I’ve had while building my first web hook in Enterprise. What will you do with web hooks to extend the platform and fulfill business requirements? The following GitHub repository contains everything you should need to replicate this trivial prototype. In this repository you will find the following:

  • index.py
    • The Flask application responsible for responding to events and publishing new content in Portal.
  • config.json
    • There are 3 values you need to update for your local environment: portal_url, username, and password. This implementation is likely not the best, but it should get you running.
  • running.json
    • A place to store running item ids so that we do not duplicate content. An immediate improvement could be made to this application by removing item ids whenever there is an unexpected error.
  • upload_sample/Monitor.zip
    • The shapefile that was used to test this prototype web hook.
  • env/webhook.yml
    • A file that can be used to create the same conda environment I used while building this application.
  • /extractor/*
    • A set of scripts for fetching the latest GDELT 2.0 data.
  • /logs/*
    • Pass.txt and Fail.txt are console outputs from various runs of the application and can be compared against the play-by-play table above.