
Building a Data Driven Organization, Part #14: Total Integration

08-16-2022 06:47 AM
BruceHarold

I'm coining a new term here - Total Integration - meaning the continuous, automated integration of any data and any ArcGIS service. The concept combines data movement with refreshing the web services the data powers. No manual ETL, no disconnected processes after the ETL, no downtime, just continuous Enterprise Service Bus maintenance the way it should be done, in one process.

It has become trite to say you're "excited" to share details on functionality, but I really do feel that using a Notebook as a lightweight scheduler will bring web tool automation to the masses.

My example scenario happens to be about maintaining a portal locator (i.e. a geocode service), a very common foundational service, but read between the lines and picture your full suite of services under continuous management no matter how the data powering them changes.  Of course, you don't have to be maintaining a service; your web tool may be doing any sort of geoprocessing.

San Francisco Locator Data From Socrata

There are a few parts to total integration; the trick is making them work together.  The logical steps are:

  1. Publish a web tool that moves your data and refreshes your target web service (if applicable)
  2. Automate the web tool on a schedule

In my case I use ArcGIS Data Interoperability in the web tool, but that's just an artifact of my data source.  The lightbulb moments for me were realising I could combine ETL with core geoprocessing to include the service refresh in my web tool, and that I could use a hosted ArcGIS Online Notebook as a lightweight scheduler for complete automation.

While I'm usually a no-code guy, there is a little Python involved, but you can relax: I'm sharing the code patterns in the blog download.  My blog scenario is at the high end in its usage of ArcPy, but only because ArcGIS currently lacks a geoprocessing tool that rebuilds a portal locator (note to self: get one coming).  Much of the time you might build your web tool using Data Interoperability called within ModelBuilder, so without any ArcPy code.

So with some ArcPy and using the ArcGIS Python API in a Notebook we can put it all together.  Let's get started!

My source dataset to integrate is the City and County of San Francisco Enterprise Addressing System Addresses with Units.  The data changes daily and I want a portal locator refreshed each day before working hours.  Here is my Data Interoperability workspace, authored in Pro, which does the job interactively:

Refresh Portal Locator

The ETL tool reads a CSV file at a URL provided by Socrata, writes the features to a scratch geodatabase, performs change detection between the latest data and its previous state (maintained in a portal feature service), then writes the changes both to the feature service and to an Edits feature class in the scratch geodatabase.  It also emails me a handy summary of new or changed addresses.  The tool is in the blog download and requires Pro 3.0+ and Data Interoperability.

Notes on the scratch environment:
Since publishing this article I have noticed my web tool occasionally failing at run time due to the scratch geodatabase becoming corrupt.  The fix was to refactor how the scratch environment is used, with a unique name generated at run time.  It is not possible to set the file geodatabase writer to automatically overwrite the target geodatabase because it is locked at run time.

But how do changes get pushed to the portal locator by the ETL tool, you ask?  By using a shutdown script that calls ArcPy in the geoprocessing scratch environment!  Data Interoperability ETL tools are geoprocessing tools and the scratch geodatabase in this case is found by this scripted parameter:

 

import arcpy
scratchGDB = arcpy.CreateScratchName("xx",".gdb","Workspace",arcpy.env.scratchFolder)
arcpy.AddMessage('Using scratch GDB {}'.format(scratchGDB))
return scratchGDB

 

The scratch folder is found similarly:

 

import arcpy
scratchFolder = str(arcpy.env.scratchFolder)
arcpy.AddMessage('Using scratch folder {}'.format(scratchFolder))
return scratchFolder

 

When shared as a web tool the log shows these messages:

Using scratch GDB C:\Users\arcgis\AppData\Local\Temp\scratch\xx0.gdb
Using scratch folder C:\Users\arcgis\AppData\Local\Temp\scratch

Now the powerful bit, using ArcPy in a shutdown script to recreate the locator service.  The locator creation code originated as a copy from a history item after a Create Locator tool run, the rest just from reading the help.

 

#  If there are edits then recreate the portal locator

import arcpy
import datetime
import fme
import os
import pytz

scratchGDB = fme.macroValues['ScratchGDB']
scratchFolder = fme.macroValues['ScratchFolder']
portalURL = fme.macroValues['PortalURL']
serverURL = fme.macroValues['ServerURL']
portalUser = fme.macroValues['PortalUser']
portalPassword = fme.macroValues['PortalPassword']
arcpy.env.workspace = scratchGDB
arcpy.env.overwriteOutput = True

if arcpy.Exists('Edits') and arcpy.Exists('AddressesWithUnits'):
    #  Create or recreate the locator if there are data changes
    arcpy.geocoding.CreateLocator("USA",
    r"{}\AddressesWithUnits PointAddress".format(scratchGDB),
    "'PointAddress.HOUSE_NUMBER AddressesWithUnits.Full_Address_Number';"+
    "'PointAddress.STREET_NAME AddressesWithUnits.Street_Name';"+
    "'PointAddress.STREET_SUFFIX_TYPE AddressesWithUnits.Street_Type';"+
    "'PointAddress.SUB_ADDRESS_UNIT AddressesWithUnits.Unit_Number';"+
    "'PointAddress.NEIGHBORHOOD AddressesWithUnits.Neighborhood';"+
    "'PointAddress.CITY AddressesWithUnits.City';"+
    "'PointAddress.SUBREGION AddressesWithUnits.County';"+
    "'PointAddress.REGION AddressesWithUnits.Region';"+
    "'PointAddress.POSTAL AddressesWithUnits.ZIP_Code';"+
    "'PointAddress.COUNTRY AddressesWithUnits.Country';",
    r"{}\SanFrancisco".format(scratchFolder),
    "ENG",None,None,None,"GLOBAL_HIGH")
    # Create the SD draft and SD
    locator_path = os.path.join(scratchFolder,'SanFrancisco')
    sddraft_file = os.path.join(scratchFolder,'SanFrancisco.sddraft')
    sd_file = os.path.join(scratchFolder,'SanFrancisco.sd')
    service_name = 'SanFrancisco'
    pst = pytz.timezone('US/Pacific')
    sfNow = datetime.datetime.now(pst)
    sfNowStr = sfNow.strftime('%Y:%m:%d %H:%M:%S')
    summary = 'Point Address With Units locator for the City of San Francisco updated at {} PST'.format(sfNowStr)
    summary += '\nBuilt from the Socrata data source: https://data.sfgov.org/Geographic-Locations-and-Boundaries/Addresses-with-Units-Enterprise-Addressing-System/ramy-di5m'
    tags = 'San Francisco,EAS'
    analyze_messages = arcpy.CreateGeocodeSDDraft(
        locator_path,
        sddraft_file,
        service_name,
        copy_data_to_server=True,
        summary=summary,
        tags=tags,
        max_result_size=50,
        max_batch_size=1000,
        suggested_batch_size=150,
        overwrite_existing_service=True)
    # Upload the locator
    if analyze_messages['errors'] == {}:
        arcpy.SignInToPortal(portalURL,portalUser,portalPassword)
        arcpy.server.StageService(sddraft_file,sd_file)
        arcpy.server.UploadServiceDefinition(sd_file,serverURL)

# Clean up the scratch GDB
try:
    arcpy.management.Delete(scratchGDB)
except:
    pass

 

So that's the ETL tool that I published as a web tool.  Before moving on to how we orchestrate the web tool I have a couple of tips on authoring and publication.

  • The ETL tool is parameterless as far as the geoprocessing environment is concerned, because I set all its inputs to not be published.  This works around a behavior with an input CSV file that is at a URL: ArcGIS Enterprise cannot register a URL as a data store or copy the data to the server at run time, so we just hide it by not publishing the parameter.
  • The ETL tool uses a couple of web connections to access the portal and authenticate to an email service.  You get these onto the server by exporting them as XML files from the Tools > Options > Web Connections menu, copying them to the server, logging onto the server as the arcgis service owner, starting Workbench from fmeworkbench.exe, and importing them from the same interface.  Make sure to re-authenticate each connection after importing them.
  • Because we're using the scratch environment on the server, which is shared between processes (this behavior may change), the web tool's geoprocessing service is set to run with 1 instance maximum.  This prevents contention between instances writing to the scratch file geodatabase (multiple jobs will queue).  If you need to publish multiple web tools that write into the server environment, refactor them to use temporary workspaces, as sketched below.
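
Here is a minimal sketch of that temporary-workspace refactoring, assuming each job can work against any file geodatabase it creates itself; the folder prefix and geodatabase name are placeholders, not taken from the published tool:

import tempfile
import arcpy

# Each job gets its own folder and file geodatabase, so concurrent
# instances never contend for the shared scratch workspace
jobFolder = tempfile.mkdtemp(prefix='webtool_job_')
jobGDB = arcpy.management.CreateFileGDB(jobFolder, 'work').getOutput(0)
arcpy.AddMessage('Using per-job workspace {}'.format(jobGDB))

# ... run the tool's geoprocessing against jobGDB ...

# Remove the per-job workspace once the job completes
arcpy.management.Delete(jobGDB)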

At this point we have a web tool that does what we want; now we need to orchestrate it on a schedule.  For this we use a hosted Notebook in ArcGIS Online.  This is the first notebook I ever authored (my no-code background shows), so if I can do it you can too.  The web tool runs on my portal but my notebook runs in Online.  That is no problem for the notebook: the web tool isn't executing in Online, and under the covers the notebook is just authenticating to the portal and sending a submitJob REST request.  Online notebooks know how to connect to portal GIS instances.  Here is the single-cell notebook (Standard runtime):

Online hosted Notebook

How hard is that?  Crazy simple.  The hardest part was getting a certificate from the IT folks that the portal would accept when hearing from ArcGIS Online (my colleague Renato S. gets the credit for that).  All I did was read the help and replace the help examples with my paths and names.  To find the function name to use after importing the web toolbox, I just put the code into IDLE so intellisense could give me the methods.
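
If you want a starting point for your own notebook, here is a minimal sketch of the kind of single cell described above, using the ArcGIS API for Python; the portal URL, credentials, web tool URL and the generated function name (refresh_portal_locator) are placeholder assumptions, not the exact contents of my notebook:

from arcgis.gis import GIS
from arcgis.geoprocessing import import_toolbox

# Authenticate to the ArcGIS Enterprise portal that hosts the web tool
gis = GIS('https://myportal.example.com/portal', 'automation_user', 'password')

# Import the web tool's REST endpoint as a Python module; the generated
# function name follows the tool name (dir(toolbox) or IDLE will show it)
toolbox = import_toolbox('https://myserver.example.com/server/rest/services/RefreshPortalLocator/GPServer', gis=gis)

# Calling the function submits the job; the tool has no published parameters
result = toolbox.refresh_portal_locator()
print(result)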

Then while editing the notebook I went to the Tasks editor and set up my notebook to run weekdays at 6am.  Note that the task editor time picker uses local time values from your browser client.

Now I can let the system(s) run and each day my organization has an up-to-date locator maintained in a portal!

Locator item details plus today's address changes email