Sync hosted feature services to postgres database

06-17-2021 01:50 AM
lightpro
New Contributor III


Does anyone have any experience with syncing hosted feature services with an external PostgreSQL database? We're not using Enterprise and would rather not go through the cost and effort of setting up and administering it. However, we are building a web application that will have non-GIS business data (customer info, order details, etc.) which we need to relate to location data hosted on AGOL.

The only solution we can really think of is to sync/copy the GIS data from AGOL to the postgres database. Are there any potential pitfalls or issues? Is it even possible? I'm guessing we might have trouble maintaining relationships?
14 Replies
John_Spence
Occasional Contributor III

I ran into the same issue a while ago myself.  Attached is a copy of the process I use to lift and insert.  The basic gist is that I use a Pro project to designate the layer I want to deal with and store the SDE connections to be used there.  I can't recall at the moment whether I ran into issues with the service extraction due to the service's max record count limit.

Oh!  I probably should mention that the password for the portal or AGOL user specified needs to be encoded in base64.  Obviously, not secure, but good enough for the casual eye.
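To illustrate the encoding step John describes, here is a minimal sketch. The helper names `encode_pw`/`decode_pw` are hypothetical, and as he notes, base64 is obfuscation, not encryption:

```python
import base64

def encode_pw(plaintext):
    # Obfuscate a password for storage alongside the script.
    # NOT encryption: anyone can reverse this; it only defeats a casual glance.
    return base64.b64encode(plaintext.encode("ascii")).decode("ascii")

def decode_pw(encoded):
    # Reverse of encode_pw; this mirrors the decode step the sync script does.
    return base64.b64decode(encoded).decode("ascii")
```

Store only the encoded string in the script or its config, and decode at runtime just before requesting a token.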

lightpro
New Contributor III

Wow, thanks! I'll need some time to digest it. I assume this requires a server license in order to set up your postgres geodatabase etc.?
John_Spence
Occasional Contributor III

Yes, my example works on the assumption that you have an SDE database (server license needed), but in all honesty you could just change the connection type to whatever you want and go from there.

lightpro
New Contributor III

Ok, that might be the way we go, but I'm working on the assumption that we won't have an SDE database. Depends on budget!
John_Spence
Occasional Contributor III

Gotchya.  Well, you can add non-SDE databases in a Pro project, so you could be good to go there.  There is also the option of scraping the data from the service; again, though, you can run into the max record count limitation.

Assuming you have to authenticate to get at it:

import base64
import json
import time
import urllib.error
import urllib.parse
import urllib.request

def get_token():
    #---------------------------------------------------------------------------
    # Name: Function - get_token
    # Purpose: Generate a short-lived token for the portal/AGOL user.
    #---------------------------------------------------------------------------

    # AGOL_Pass holds the base64-encoded password; decode it back to plain text
    pw = base64.b64decode(AGOL_Pass).decode("ascii")

    url = 'https://www.arcgis.com/sharing/rest/generateToken'
    values = {'f': 'json',
              'username': AGOL_User,
              'password': pw,
              'referer': 'https://www.arcgis.com',
              'expiration': '10'}

    data = urllib.parse.urlencode(values).encode("utf-8")
    req = urllib.request.Request(url)

    response = None
    while response is None:
        try:
            response = urllib.request.urlopen(req, data=data)
        except urllib.error.URLError:
            time.sleep(5)  # network hiccup; wait and retry

    # Decode the JSON payload and pull out the token
    payload_json = json.loads(response.read().decode('utf8'))

    edit_token = payload_json['token']

    return edit_token

You can use something like this to scrape it out.  The script is to count and ID a specific item in my case, but you can swap things around easily enough to just scrape the entire dataset.

def queryCount(scriptName, edit_token):
    #---------------------------------------------------------------------------
    # Name: Function - queryCount
    # Purpose: Count features matching a JobName and return the matching OBJECTID.
    #---------------------------------------------------------------------------

    FS_service = service_StatusURL + '/query/?token={}'.format(edit_token)

    where_statement = 'JobName=\'{}\''.format(scriptName)

    data = urllib.parse.urlencode({'f': 'json',
                                   'where': where_statement,
                                   'returnCountOnly': 'true'}).encode('utf-8')

    req = urllib.request.Request(FS_service)
    response = urllib.request.urlopen(req, data=data)
    response_payload = json.loads(response.read())

    item_count = response_payload['count']

    if item_count > 0:
        response = None
        while response is None:
            try:
                data = urllib.parse.urlencode({'f': 'json',
                                               'where': where_statement,
                                               'outFields': 'OBJECTID'}).encode('utf-8')

                req = urllib.request.Request(FS_service)
                response = urllib.request.urlopen(req, data=data)
                response_payload = json.loads(response.read())

                # Grab the OBJECTID of the (last) matching feature
                for oid_item in response_payload['features']:
                    objectID = oid_item['attributes']['OBJECTID']
            except urllib.error.URLError:
                print('\t...Failure to count. Trying again.')
                time.sleep(30)
                response = None
    else:
        objectID = '0'

    return (item_count, objectID)
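On the max record count limitation mentioned earlier: a hosted feature service caps how many records one query returns, but the REST API's `resultOffset`/`resultRecordCount` parameters let you page through everything. A hedged sketch, not the original poster's code; the `/query` URL and token are assumed to come from the surrounding script:

```python
import json
import urllib.parse
import urllib.request

def fetch_all_features(query_url, token, page_size=1000):
    # Page through a feature service query so the service's max record
    # count never truncates the result. query_url is the layer's /query
    # endpoint; token comes from something like get_token().
    features = []
    offset = 0
    while True:
        params = urllib.parse.urlencode({
            'f': 'json',
            'where': '1=1',
            'outFields': '*',
            'resultOffset': offset,
            'resultRecordCount': page_size,
            'token': token,
        }).encode('utf-8')
        req = urllib.request.Request(query_url)
        with urllib.request.urlopen(req, data=params) as response:
            payload = json.loads(response.read().decode('utf-8'))
        batch = payload.get('features', [])
        features.extend(batch)
        if len(batch) < page_size:  # a short page means we've hit the end
            return features
        offset += page_size
```

Services also advertise their cap as `maxRecordCount` in the layer metadata, so you can set `page_size` from that instead of guessing.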

 

lightpro
New Contributor III

Excellent! Between you and @jcarlson I've got several options! Will take me some time to get my head around it all!

jcarlson
MVP Notable Contributor

We have nightly scripts that sync hosted layers with both postgres and MSSQL servers, and they work great! You need to know a bit of python, but it's definitely possible.

Prep Work

If you don't have one already, develop a query statement that returns the data you want from the database and save it as its own *.sql file. Or maybe you're just copying entire tables? You may be able to skip this.

You'll also want to create the hosted layer first. You can do this by publishing an extract from the database, or by creating an empty layer; either way works, as long as the schema is correct.

The Process

  1. Using pandas, read your SQL query (or table) into a dataframe.
  2. Make any adjustments to the data you need.
  3. Using the ArcGIS API for Python's DataFrame.spatial accessor, edit the hosted layer, supplying the dataframe as the new/updated features.

Considerations

If you just want to keep it simple, you can truncate the hosted layer and just re-add everything fresh. Depending on the size of your table and the frequency of updates, this may or may not be feasible. If you want to go a different route than truncate/append, you'll need some way of identifying unique records in the postgres database and comparing them against the hosted layer.

An easy way to do this is to include the postgres primary key as its own field in the hosted layer. You can query the hosted layer to its own dataframe, and then compare the two dataframes to identify rows which have changed between the postgres table and the hosted layer. The result of the comparison could allow you to subset the postgres dataframe, then submit that subset as your adds/edits, joining in the hosted objectID / globalID based on the shared primary key field.
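The compare-and-subset step can be sketched with a single pandas merge. This is only an illustration under assumptions: `pg_id` is a hypothetical shared primary-key field, both frames share the compared columns, and NaN handling is ignored:

```python
import pandas as pd

def changed_rows(db_df, hosted_df, key="pg_id"):
    # Compare on the columns both frames share; a row in db_df with no
    # exact match in hosted_df is either brand new or has changed values.
    cols = [c for c in db_df.columns if c in hosted_df.columns]
    merged = db_df[cols].merge(hosted_df[cols], how="left", indicator=True)
    changed_keys = merged.loc[merged["_merge"] == "left_only", key]
    return db_df[db_df[key].isin(changed_keys)]
```

The returned subset is what you'd submit as adds/edits, joining in the hosted OBJECTID/GlobalID on the shared key to decide which is which.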

But that can get a bit complicated.

Either way, get that script scheduled to run on a machine that has access to the postgres database and you're set!

We have about 20 hosted layers that all "mirror" an external database or some query therefrom, which for whatever reason cannot be published directly from said database. Our scripts run weeknights with some added exception handling and logging, but 99% of the time they go off without a hitch.
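The exception handling and logging can be as simple as a wrapper around each layer's sync, so a failed nightly run is visible the next morning. A minimal sketch; `run_sync` and `sync_fn` are hypothetical names, not part of the posters' scripts:

```python
import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")

def run_sync(sync_fn, layer_name):
    # sync_fn is whatever does the actual truncate/append for one layer.
    try:
        sync_fn()
        logging.info("%s synced successfully", layer_name)
        return True
    except Exception:
        # logging.exception records the full traceback for the morning review
        logging.exception("%s failed to sync", layer_name)
        return False
```

Point `basicConfig` at a log file on the scheduled machine and one failed layer won't stop the rest of the run.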


Long story short: if you have the patience to write up and debug some python, you can do pretty much anything you want between a hosted layer and an external database.

- Josh Carlson
Kendall County GIS
jcarlson
MVP Notable Contributor
from arcgis.gis import GIS
import pandas

gis = GIS('your-org-url', 'user', 'password')

# Alter list index as needed, if multiple layers, or use '.tables[0]' for non-spatial
hosted_fl = gis.content.get('hosted-layer-itemid').layers[0]

# Importing arcgis registers the .spatial accessor on pandas DataFrames
psql_df = pandas.read_sql('your sql query', 'postgresql+psycopg2://user:pw@host/db')

# Wipe the hosted layer, then push the fresh rows back up
hosted_fl.manager.truncate()

hosted_fl.edit_features(adds=psql_df.spatial.to_featureset())

 

While @John_Spence's script is incredible, and I love a well-documented tool, I thought I'd elaborate on my own process a bit, just in case you wanted to work outside of the Pro environment.

This sort of script will run anywhere so long as the required modules are in the python env, which can be nice if you have, say, a Linux server, or just don't need Pro on your DB machine.

I forgot to ask, but is the layer spatial or non-spatial? Pandas might not pick up a geometry column, but GeoPandas can. It only adds a couple steps, but preserves the geometry.

 

from arcgis.gis import GIS
from arcgis.features import GeoAccessor
import geopandas
from sqlalchemy import create_engine

gis = GIS('your-org-url', 'user', 'password')

hosted_fl = gis.content.get('hosted-layer-itemid').layers[0]

con = create_engine('postgresql+psycopg2://user:pw@host/db')

# GeoPandas reads the PostGIS geometry column that plain pandas would drop
gdf = geopandas.GeoDataFrame.from_postgis('your-query', con, 'geometry-column-name')

# Convert the GeoDataFrame to a spatially enabled DataFrame
sdf = GeoAccessor.from_geodataframe(gdf)

hosted_fl.manager.truncate()

hosted_fl.edit_features(adds=sdf.spatial.to_featureset())

 

- Josh Carlson
Kendall County GIS
John_Spence
Occasional Contributor III

Someone had to go 🐼 here 😁!  I'm actually glad you did, because sadly I usually end up working around Pandas rather than leveraging its power.  The only other option I can think of would be to drop the arcgis component entirely and just run pyodbc.  Scrape and insert, fantastic.
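For that pyodbc route, the scrape-and-insert boils down to building one parameterized INSERT from the service's JSON. A sketch under assumptions: `features_to_insert` is a hypothetical helper, the table and field names are placeholders, and the `?` placeholder style is pyodbc's (psycopg2 would use `%s`):

```python
def features_to_insert(features, table, fields):
    # features: the 'features' list from an ArcGIS REST query response.
    # Returns a parameterized INSERT plus row tuples for cursor.executemany().
    placeholders = ", ".join("?" for _ in fields)
    sql = "INSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(fields), placeholders)
    rows = [tuple(f["attributes"].get(name) for name in fields)
            for f in features]
    return sql, rows
```

With pyodbc you'd then run `cursor.executemany(sql, rows)` and commit.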

@jcarlson ...made my day!  Hope you're having a good day out there.