Create and maintain a Knowledge Graph with one ETL Tool

BruceHarold

Problem Definition
Graph Scenario
The ETL Tool
The Graph in ArcGIS
Discussion

Problem Definition

Knowledge Graph entities (also known as nodes) are linked by relationships; entity ESRI__ID values are foreign keys in the relationship columns ESRI__OriginID and ESRI__DestID. The ID values are derived from the GlobalID data type and are system generated. Because ESRI__ID values must exist on entities before you can use them in relationships, many people think you need to create or maintain a graph in two stages, entities first and relationships second. They end up with two ETL tools to manage, or more if processing is done per combination of entity and relationship. This is unnecessary.

This blog shows how you can maintain a graph using one ETL tool that writes both entities and relationships in one run.

Graph Scenario

Firstly, see the obligatory - and very busy - map! Today's graph subject matter is worldwide airport and flight data for 24 hours either side of ETL tool run time, so about half the flights are in the past and half are scheduled for the near future. Note that the flight paths do not model actual aircraft routes; the graph is only intended to model connectivity. More on use case scenarios is below.

World airports and flights

The ETL Tool

The data is made available by FlightAware from their AeroAPI endpoints, accessed by this ETL tool - available in the blog download. You will need your own API key.

Single pass ETL tool for graph maintenance

Even if you don't have AeroAPI access, download and unzip the blog attachment and install the content as follows (requires ArcGIS Data Interoperability for Pro 3.5+):

Create.fmw - put this workspace source file in a Pro project home folder
- Optionally create an ETL tool using the fmw as the source
LoopingAirportsGetter.fmx - a custom transformer used by Create.fmw
- Put this in your user profile folder C:\Users\<yourusername>\Documents\FME\Transformers

There is a lot of useful material we'll cover in the tools, but to get immediately to the blog's main goal - how to write entities and relationships in one workspace - you'll see in Create.fmw that you first write entities into the graph with a FeatureWriter transformer, then use the FeatureWriter's Summary output port to trigger reading the entities back into the workspace with a FeatureReader transformer - they then have the ESRI__ID values you need. The workspace is not laid out compactly to show this sequencing but look for the transformer named FeatureWriter that writes airports and you'll see a direct connection from its Summary port to the transformer named FeatureReader reading the airports just written. Summary ports output a single non-spatial feature with a few identifying and statistical properties after the write transaction is committed.
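The read-back step matters because relationship rows can only reference entities by their system-generated ESRI__ID. As a rough illustration in plain Python (not FME), the sketch below joins flights to airports that have been read back from the graph. The `icao` and `origin_icao` fields, the flight idents and the sample IDs are all hypothetical; in the workspace the equivalent join is done with transformers.

```python
def index_by_key(entities, key):
    """Map a business key (e.g. an airport code) to the graph's ESRI__ID."""
    return {e[key]: e["ESRI__ID"] for e in entities}


def build_has_departure(flights, airport_ids):
    """Build HasDeparture rows: airport (origin) -> flight (destination).

    Flights whose origin airport is not in the graph are skipped.
    """
    rows = []
    for flight in flights:
        origin_id = airport_ids.get(flight["origin_icao"])
        if origin_id is None:
            continue
        rows.append({
            "ESRI__OriginID": origin_id,
            "ESRI__DestID": flight["ESRI__ID"],
        })
    return rows


# Entities as they might look after being read back from the graph (made-up IDs).
airports = [
    {"icao": "KLAX", "ESRI__ID": "{A1}"},
    {"icao": "KJFK", "ESRI__ID": "{A2}"},
]
flights = [
    {"ident": "XX100", "origin_icao": "KLAX", "ESRI__ID": "{F1}"},
    {"ident": "XX200", "origin_icao": "ZZZZ", "ESRI__ID": "{F2}"},  # unknown airport
]

relationships = build_has_departure(flights, index_by_key(airports, "icao"))
```

The point is only that the lookup cannot be built until the entities exist in the graph, which is why the read is triggered from the writer's Summary port.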

Writing relationships can be done with ordinary Esri Knowledge Graph writers as they have no downstream dependency. If that's all you came for today then no need to read further, but if you enjoy deep dives into ETL then you'll likely learn something in the rest of the post, so read on!

I'm making a graph with data sourced from an API, and one that follows standard modern practice - REST calls return paginated JSON responses, and the whole API has an OpenAPI specification. You can inspect the API at this URL and notice the link to the OpenAPI specification.

Since the OpenAPI specification is available, it can be imported into an OpenAPICaller transformer, which turns HTTP call construction into a form-filling exercise. Here is the first OpenAPICaller in the workspace. Notice I'm asking for 100 pages (1500 records) of airports data and the header includes a tool parameter for the API key and a request to receive a JSON response. The airports schema isn't very wide so 1500 records doesn't overload the HTTP GET response and cause errors, but I am only getting an initial record set, not all airports data.

OpenAPICaller for Airports

The API supports pagination. If a request doesn't return the last records available on the server then a JSON object named next (a URL) is available in the response; sending it as a request returns the next set of pages. This lends itself to a loop to get all data, which is what the LoopingAirportsGetter custom transformer does.
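For readers who think in code, the looping pattern can be sketched in Python roughly as follows. This is not what the custom transformer contains; it assumes the response carries its records under an `airports` key and the follow-up URL under `links.next`, and it takes the HTTP call as an injected function so the sketch runs without an API key. The URLs and airport codes are made up.

```python
def fetch_all_records(first_url, get_page, records_key="airports"):
    """Follow 'next' links until the server reports no further pages.

    get_page(url) must return the decoded JSON body as a dict.
    """
    records = []
    seen = set()
    url = first_url
    while url and url not in seen:
        seen.add(url)  # guard against a link that loops back on itself
        page = get_page(url)
        records.extend(page.get(records_key, []))
        links = page.get("links") or {}  # assumed shape: {"links": {"next": "..."}}
        url = links.get("next")
    return records


# Stand-in for HTTP calls (hypothetical URLs and data).
fake_pages = {
    "/airports": {
        "airports": [{"code": "AAA"}, {"code": "BBB"}],
        "links": {"next": "/airports?cursor=abc"},
    },
    "/airports?cursor=abc": {
        "airports": [{"code": "CCC"}],
        "links": None,  # no more pages
    },
}

all_airports = fetch_all_records("/airports", fake_pages.__getitem__)
```

In a real client, `get_page` would wrap an authenticated GET request and return the parsed JSON.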

LoopingAirportsGetter

As the next URL is built for us we can use a simple HTTPCaller in the custom transformer, not another OpenAPICaller. Now we have all airports data and can write out the entity type.

If you inspect the tool, you'll see that after the airport entities are written into the graph they are read back out (with ESRI__ID values) and another OpenAPICaller gets flights for each airport. This time 50 pages of data are retrieved per call (the schema is wider), but we can have 25 calls in flight at any time because each call pages through one airport's flights rather than one large cursor on the back end.
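The idea of capping concurrent requests can be illustrated outside FME with a small Python sketch. It assumes a thread pool is an acceptable stand-in for the transformer's concurrency setting, and `fake_get_flights` is a hypothetical placeholder for the per-airport HTTP request.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_flights_for_airports(airport_codes, get_flights, max_concurrent=25):
    """Request flights per airport, with at most max_concurrent calls in flight."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # pool.map preserves input order, so results line up with airport_codes.
        results = pool.map(get_flights, airport_codes)
        return dict(zip(airport_codes, results))


def fake_get_flights(code):
    """Stand-in for the per-airport HTTP request (hypothetical data)."""
    return [f"{code}-flight-{n}" for n in range(2)]


flights_by_airport = fetch_flights_for_airports(
    ["KLAX", "KJFK", "EGKK"], fake_get_flights
)
```

Because each airport is an independent request, the calls do not need to wait on one another, which is what makes this kind of parallelism safe here.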

OpenAPICaller for Flights

There are a handful of airports worldwide with more than 50 pages of flights (750 records) over 48 hours and these are retrieved with an HTTPCaller if next is not null from an initial request.

Note the start and end query parameters. These are UTC timestamps in ISO format, generated at tool start time by scripted parameters - so some code crept into my ETL tool! This could be done with transformers.
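A scripted parameter for this kind of window might look something like the sketch below. It is not the code from the workspace; the function name and the exact timestamp format (second precision with a trailing Z) are assumptions.

```python
from datetime import datetime, timedelta, timezone


def flight_window(now=None, hours=24):
    """Return (start, end) ISO 8601 UTC strings, `hours` either side of `now`."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Normalise to UTC and drop sub-second precision.
    now = now.astimezone(timezone.utc).replace(microsecond=0)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    start = now - timedelta(hours=hours)
    end = now + timedelta(hours=hours)
    return start.strftime(fmt), end.strftime(fmt)


# Fixed run time for illustration; omit the argument to use the current time.
start, end = flight_window(datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc))
# start == "2025-01-14T12:00:00Z", end == "2025-01-16T12:00:00Z"
```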

The Graph in ArcGIS

I'll let you surf the tool to inspect the entity & relationship construction logic, but basically the entity types are Airports (points) and Flights (2-point lines). The relationships are: airports have departures on flights (HasDeparture), flights may have connections to other flights (HasConnection), and flights have arrivals at airports (HasArrival). The business logic for connections is that flights are connected if an inbound flight touches down between 1 and 4 hours before the outbound flight takes off. In real life there might be other factors like code-share agreements, but this is just a demo!
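The connection rule is simple enough to express as a small predicate. The Python sketch below makes two assumptions the text does not spell out: that the inbound flight's destination must match the outbound flight's origin, and that the 1 and 4 hour bounds are inclusive. The field names and sample times are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MIN_LAYOVER = timedelta(hours=1)
MAX_LAYOVER = timedelta(hours=4)


def is_connection(inbound, outbound):
    """True if outbound departs 1 to 4 hours after inbound lands at the same airport."""
    if inbound["destination"] != outbound["origin"]:
        return False
    layover = outbound["departs"] - inbound["arrives"]
    return MIN_LAYOVER <= layover <= MAX_LAYOVER


def utc(hour, minute=0):
    """Helper to build sample UTC times on one arbitrary day."""
    return datetime(2025, 1, 15, hour, minute, tzinfo=timezone.utc)


inbound = {"destination": "KJFK", "arrives": utc(10)}
good = {"origin": "KJFK", "departs": utc(12, 30)}       # 2.5 hour layover
too_tight = {"origin": "KJFK", "departs": utc(10, 30)}  # 0.5 hour layover
elsewhere = {"origin": "EGKK", "departs": utc(12, 30)}  # different airport
```

Each pair that passes the test would become one HasConnection relationship from the inbound flight to the outbound flight.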

Here is the graph data model view, airports and flights have relationships to each other and flights have connections with other flights. The Document entity is not used.

FlightAware Graph Data Model

Now let's make an analytic query! Let's say I'm a law enforcement officer and I want to ask airlines and airports to check passenger manifests and recent video footage for a suspected jewel thief who I think left Los Angeles to travel to Berlin, or is about to do so. What airlines, flights and airports make most sense to enquire with? Of course I break out my openCypher skills and use my daily-updated graph!

I'll let you step through the code, but what it does is find the shortest flying time path between Los Angeles and Berlin-Brandenburg, to a maximum of 3 flight legs (one departure followed by up to two connections).

match path = (origin:Airports)-[:HasDeparture|:HasConnection*0..3]->(:Flights)-[:HasArrival]->(destination:Airports)
where origin.name = 'Los Angeles Intl' AND destination.name = 'Berlin-Brandenburg'
with path, nodes(path) as flights
unwind flights as flight
with path, sum(case when flight:Flights then flight.filed_ete else 0 end) as totalDuration
return path, (totalDuration/3600) as totalDuration
order by totalDuration
limit 1

A path is returned from my query:

Los Angeles to Berlin

The route is Los Angeles to Berlin via John F Kennedy Intl and London Gatwick. Here is the path added to a map in ArcGIS Pro:

Los Angeles to Berlin Route

Now I can contact my law enforcement colleagues worldwide with a focused request!

Discussion

This demonstrates the single-tool approach and a lot more besides around using API data and Knowledge Graphs. The ETL workspace in the download is in a ready-to-run state for updating an existing graph, assuming you have an API key and have already built the graph. The tool can be run using ArcGIS Pro's regular tool scheduling feature, say every day. You will not be in this state to begin with, but you'll see some Creator transformers that can be used to run the tool manually in parts, say to create the Airports entities. Work this way by temporarily disabling Creators or other transformers in streams you don't want to run. I created the empty relationships manually, because Knowledge supports specifying the origin and destination for relationships (so it can display the data model).

Comment in this post if you have questions or observations. Have fun with your graph ETL!

SamVernon

Thanks Bruce. Knowledge graphs are great! I used them to build a metadata twin at Watercare. Kieran o'Donnell built a process in FME that hit the end points for ArcGIS Enterprise, Portal and Online, and the same for FME Server. We could then use graphs to show the connections between every featureclass, feature service, map, app, shapefile and everything in between using a graph. It was mesmerising!!

BruceHarold

Hi Sam, I saw Kieran's presentation at the Peak conference - best of show!