I'll try out a new acronym here - B2G - Business To GIS - meaning the delivery of externally managed data into ArcGIS. It is very common for external ("business") systems to offer a REST API you can call to get data in a common format like JSON, but also very common for the server to deliver a dataset in multiple chunks (i.e. "pages"), and for the JSON not to be in a well-defined dialect like GeoJSON - you have to parse it into columns, potentially from struct or array objects.
This post is about conquering the two challenges of pagination and parsing, without coding, using ArcGIS Data Interoperability.
My subject matter data is Federal Communications Commission customer complaints data about unwanted calls, filtered for 2024. If you want to see how it looks as incoming JSON, click here for 1000 random records, but here's the end result as a feature service:
FCC Customer Complaints - unwanted calls
Some research reveals there are several common pagination approaches. The first two - offset-and-limit and page-based - are the simplest (and the only ones I have used): the server does the page-building calculations and the client need only keep a simple counter, sending requests until the data is exhausted.
Here is an example of an offset-and-limit-based API and here is an example of a page-based API.
For an example of a hybrid approach that combines several of these, you can read up on querying hosted feature layers. ArcGIS takes care of building layer queries for you.
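To make the counter idea concrete, here is a minimal Python sketch of page-based pagination; the endpoint URL and parameter names are hypothetical stand-ins for whatever your API documents:

import requests

# Hypothetical page-based endpoint; real APIs name these parameters differently.
URL = "https://api.example.com/records"
PAGE_SIZE = 1000

records = []
page = 1
while True:
    response = requests.get(URL, params={"page": page, "per_page": PAGE_SIZE})
    response.raise_for_status()
    batch = response.json()
    if not batch:  # an empty page means the data is exhausted
        break
    records.extend(batch)
    page += 1

An offset-and-limit API works the same way, except the counter advances by the page size rather than by one.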
How about parsing made simple too? We have to turn records like this into typed columns and geometry!
{
  "issue_type" : "Phone",
  "caller_id_number" : "830-210-2001",
  "state" : "IL",
  "method" : "Wired",
  "advertiser_business_phone_number" : "None",
  "issue_time" : "8:41 am",
  "issue" : "Unwanted Calls",
  "zip" : "60629",
  "type_of_call_or_messge" : "Live Voice",
  "issue_date" : "2024-01-04T00:00:00.000",
  "id" : "3739134",
  "location_1" : {
    "latitude" : "41.781382",
    "human_address" : "{\"address\": \"\", \"city\": \"IL\", \"state\": \"\", \"zip\": \"60629-5219\"}",
    "needs_recoding" : false,
    "longitude" : "-87.732853"
  }
}
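If you were scripting this by hand, each record would need strings cast to numbers and dates, the nested location_1 struct unpacked, and - trickiest of all - the human_address value parsed as JSON embedded inside a string. Here is a rough Python sketch of that work (the field handling is illustrative, not exhaustive):

import json
from datetime import datetime

def parse_record(record):
    location = record["location_1"]
    # human_address is itself a JSON document stored as a string
    address = json.loads(location["human_address"])
    return {
        "id": int(record["id"]),
        "issue_type": record["issue_type"],
        "issue_date": datetime.fromisoformat(record["issue_date"]),
        "state": record["state"],
        "zip": record["zip"],
        "city": address.get("city"),
        # the geometry comes from the struct's string latitude/longitude values
        "latitude": float(location["latitude"]),
        "longitude": float(location["longitude"]),
    }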
No problem! Let's get started.
We're calling a REST API here, and it offers a paginated response. In ArcGIS Data Interoperability that means calling the HTTPCaller transformer in a loop - in this case with offset, limit and order parameters - incrementing the offset value on each call until all data is received and the response is an empty array. You might not have known you can loop in a visual programming environment, but you can, and it's easy. The mint green custom transformer "UnwantedCallsLooper" near top left in my workspace does the job; it is a looping custom transformer.
Parent ETL workspace
The custom transformer is embedded (included) in the Main canvas and lives in its own eponymously named tab. You create custom transformers by selecting one or more ordinary transformers in the Main canvas, then right-clicking to open a context menu that offers a custom transformer creation choice. In my case I selected just a single HTTPCaller transformer.
After creating the custom transformer, a few more editing steps were needed to set up looping.
Below is how my custom transformer looks. At run time, each feature arriving from the Main canvas has an offset attribute used by the HTTPCaller transformer. The limit and order parameters do not change, so they are "hard coded" in the HTTPCaller. The lower stream is the loop-until condition - if the response is not an empty array, the data isn't exhausted, so increment the offset and loop!
Looping custom transformer
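If it helps to see the looper's logic in code, here is a rough Python equivalent (the URL is a placeholder - the real endpoint lives in the HTTPCaller - and only the pattern matters):

import requests

URL = "https://api.example.com/complaints.json"  # placeholder endpoint
LIMIT = 1000  # limit and order stay fixed, like the "hard coded" HTTPCaller values

records = []
offset = 0
while True:
    response = requests.get(
        URL,
        params={"offset": offset, "limit": LIMIT, "order": "id"},
    )
    response.raise_for_status()
    page = response.json()
    if not page:  # empty array - data exhausted, exit the loop
        break
    records.extend(page)
    offset += LIMIT  # increment the offset and loop

A stable order value matters here: without one, the server may shuffle records between calls and pages can overlap or skip.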
Each API response is an array of JSON structs and is contained in an attribute named "_response_body". The array is exploded into separate features with a JSONFragmenter. The JSON Query parameter specifies that we're fragmenting the top-level array, and the other settings ensure delivery of fragments that look like the code block above.
JSONFragmenter
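In code terms the JSONFragmenter is doing something like this sketch, where the sample string stands in for _response_body:

import json

response_body = '[{"id" : "3739134"}, {"id" : "3739135"}]'  # abbreviated _response_body
for fragment in json.loads(response_body):  # fragment the top-level array
    # each array element becomes its own feature, shaped like the record shown earlier
    print(fragment["id"])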
Now we parse each fragment feature with a JSONExtractor.
JSONExtractor
The Extract Queries grid defines the output attribute names and where each value comes from in the fragment. The JSON queries follow a simple syntax you can key in, but there is a trick to getting a picker that lets you navigate the JSON structure and automates query generation. Ahead of the JSONExtractor, temporarily sample a single feature and write it to a file, then set your JSONExtractor input source to this file and you'll get a handy picker for your queries! When you have populated your extract queries, remove the Sampler and reset your JSONExtractor input source to be the document in _response_body.
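For a sense of the syntax, the extract queries look along these lines (recalled from a typical JSONExtractor setup, so verify the exact paths with the picker):

issue_type -> json["issue_type"]
issue_date -> json["issue_date"]
latitude -> json["location_1"]["latitude"]
longitude -> json["location_1"]["longitude"]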
So that is pagination and parsing demystified! In the blog download there are two workspace source (.fmw) files. CreateFC creates a file geodatabase output and is where I figured out the data processing, like fixing data errors that defeated creation of a datetime field. I wanted an initial feature class anyway, to symbolize and share as my target hosted feature service. UpdateFS borrows the data processing steps from CreateFC but contains the looping custom transformer and some logic to detect and apply data changes whenever the ETL tool is run, which is what you'll likely need in production.
Please comment in this post with any questions and observations. Page away!