BLOG
@ShareUser Now that is exactly the right question! GeoParquet certainly has some very attractive properties, such as a rich data type set (including arrays and structs so you can have an implied data model in one table), queryable metadata, support in the S3 API across the major object stores etc. This makes it a candidate to replace the venerable shapefile, if the user community chooses.
02-13-2024 11:18 AM

BLOG
@ShareUser A feature service with behaviors like relationship classes is easily digestible in ArcGIS Pro for mapping and analysis; a remote hive of GeoParquet files is (much) less so, at least for a Pro user.
02-13-2024 11:07 AM

POST
Hi Matt, there are two properties in play: feature operation and table handling. If you are always replacing the table data entirely, then an Insert feature operation with Drop and Create table handling is simple; this will also create the table if it doesn't exist. If the table has relationship classes though, dropping the table will also drop the relates (which Data Interoperability cannot recreate), in which case you need to insert features with Truncate Existing table handling. Most elegant is to also read the existing state of the target table, use a ChangeDetector transformer to figure out the delta, then use the fme_db_operation feature operation to do insert, update and delete operations as required. Be aware though that ChangeDetector is brutally strict; things like date precision matter if you include datetimes in the compare.
02-13-2024 08:21 AM

BLOG
In an earlier post I got my feet wet using DuckDB in a hosted notebook in ArcGIS Online. That post highlighted reading GeoParquet files in an object store to maintain a feature service. That's a powerful workflow, but it is much more likely your daily work involves wrangling common file formats that take way too much time to ingest - like CSV, Excel and JSON. This post is about making your life easier by showing how DuckDB enables SQL over just about any tabular file format and also supports simple conversion to Esri formats. If you have basic SQL and Python skills - as in you can read this sample and the web help - you will be fine.

Here is some live action: Bloomington, Indiana open 311 incidents migrated to a feature class in seconds.

Bloomington 311 Incidents

The notebook (in the blog download) reads a CSV file at a public URL and writes a feature class into the project default geodatabase. Notebooks are a good place to start compared to a script tool because you can work in a cell interactively as you figure out your SQL. Once things are working you might copy the code to a script tool for easy automation on a schedule.

But first you need to install DuckDB! The usual route to adding a Python package to your environment is to clone the arcgispro-py3 environment, then add the package in the Package Manager UI backstage in Pro; you would look for a package named python-duckdb. This errored for me (it happens), so I did a manual install. I cloned my default environment to one I named arcgispro-py3-quack and installed DuckDB by opening the Python Command Prompt from the ArcGIS program group, then running this command ('username' will be yours):

conda install -c conda-forge python-duckdb=1.0.0 -p "C:\Users\username\AppData\Local\ESRI\conda\envs\arcgispro-py3-quack" --yes

Make sure to get the latest DuckDB distribution; at writing it is 1.0.0.

Let's walk through each notebook cell. The first one is pretty simple: the imports and some calls to duckdb to set up the environment.

# Imports, environment
import arcpy
import duckdb
from arcgis.features import GeoAccessor, GeoSeriesAccessor
import os
import requests
from urllib.parse import urlparse
conn = duckdb.connect()
conn.sql("install spatial;load spatial;")
conn.sql("install httpfs;load httpfs;")
conn.sql("set s3_region='us-west-2';")
conn.sql("set enable_object_cache=true;")
arcpy.env.overwriteOutput = True

The second cell makes a pandas dataframe by reading the CSV data while automatically inferring data types - such a big help. You'll notice I code defensively for the situation where the server I'm hitting doesn't support the HTTP range request access DuckDB wants - I download the data instead - which in this case was necessary.

# Read data into duckdb
url = r'https://data.bloomington.in.gov/resource/aw6y-t4ix.csv?$limit=200000'
filename = os.path.basename(urlparse(url).path)
base, extension = os.path.splitext(filename)
try:
    sql = "create or replace view bloomington311_view as select * from read_csv_auto('{}');".format(url)
    conn.sql(sql)
except:
    response = requests.get(url)
    with open(filename, 'w', encoding="utf-8") as f:
        f.write(response.text)
        f.close()
    sql = "create or replace view bloomington311_view as select * from read_csv_auto('{}');".format(filename)
    conn.sql(sql)
# Build the query you want in SQL to rename, cast or otherwise process the data.
# This dataset has a column in WKT already, if your data does not then you'll need
# to make one using the ST_ functions in DuckDB.
sql = """select service_request_id, requested_datetime, updated_datetime, closed_date, status_description, source,
service_name, description, agency_responsible, address, city, state, try_cast (zip as varchar(10)) as zip, sladays,
request_complete_days, sla_diff_days,
geocoded_column as SHAPE from bloomington311_view where SHAPE is not null;"""
df = conn.sql(sql).df()

Note the point about using or creating WKT values for your geometry: the spatial extension for DuckDB has a rich set of spatial functions for this (a minimal sketch appears at the end of this post). Now convert the dataframe to a feature class. You have other conversion options; note that the arcgis Python API has no idea where the dataframe came from - DuckDB will do fine! - so I spatially enable the dataframe and then write it to a geodatabase feature class. There is no wrangling of attribute field properties, they just work (see the SQL above - the classic situation of casting ZIP codes to text came up).

# Write the feature class
df.spatial.set_geometry("SHAPE",sr=4326,inplace=True)
aprx = arcpy.mp.ArcGISProject("CURRENT")
gdb = aprx.defaultGeodatabase
out_fc = arcpy.ValidateTableName(base,gdb)
location = os.path.join(gdb,out_fc)
df.spatial.to_featureclass(location)

Lastly, announce the result:

print("Created feature class {}".format(location))

That's it - fast, simple, smart migration of data with some help from DuckDB! The notebook is in the blog download; let us know how you get on.
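As noted above, if your source data has no WKT column you can build one with the spatial extension's ST_ functions. Here is a minimal sketch, assuming a hypothetical view named my_view with longitude and latitude columns named lon and lat (these names are not part of the Bloomington dataset):

# Minimal sketch - my_view, lon and lat are hypothetical names, not from the notebook above.
# ST_Point(x, y) takes longitude then latitude; ST_AsText returns WKT the GeoAccessor can consume.
sql = """select *,
                ST_AsText(ST_Point(lon, lat)) as SHAPE
         from my_view
         where lon is not null and lat is not null;"""
df = conn.sql(sql).df()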
01-23-2024 01:34 PM

BLOG
Updated 7/25/2024 to support Overture release 2024-07-22.0. Breaking changes include the place category "main" being renamed to "primary" and update_time being removed from the base schema. Refer to the notebook in the download for code changes. This blog may not be maintained.
Every now and then a compelling workflow is enabled by new ideas, and data distribution by cloud native formats is my topic today. Here is my example data:
Places of interest in London, England
What you are looking at is a hosted group layer in ArcGIS Online with 347,595 place of interest points in London, England, with related tables: 347,585 address details, 16,240 brand names, 262,183 website URLs, 458,534 source attributions, 246,398 social media link URLs, and 302,601 phone numbers. Here is a chocolatier in Covent Garden:
Chocolatier in Covent Garden
This being England, you might expect there are tea rooms about the city. Using place category and related tables we can see not just their locations but their address, website, social media links and phone numbers:
Tea Rooms in London
Tea Rooms and related data
This place data is one of the themes from Overture Maps Foundation and is made available under this license. If you surf the Overture website, you'll see it is a collaboration of Amazon, Meta, Microsoft and TomTom as steering members, Esri as a general member, and many other contributor members and is envisaged as a resource for "developers who build map services or use geospatial data". I'm democratizing it a bit more here, giving you a pattern for consuming the data as hosted feature layers in your ArcGIS Online or ArcGIS Enterprise portals.
Let's dig into the details of how to migrate the data to feature services.
The data is made available at Amazon S3 and Azure object stores as Parquet files, with anonymous access. I'll let you read up on the format details elsewhere but Parquet is definitely one star of the cloud native show because it is optimized for querying by attribute column, and in the case of GeoParquet, this includes a spatial column (technically multiple spatial columns if you want to push your luck). As GeoParquet is an emerging format it still has some things that are TODO, like a spatial index (which would let you query by spatial operators), but Overture very thoughtfully include a bounding box property which is simple to query by X and Y.
The second star of the cloud native show is DuckDB. DuckDB enables SQL query of local or remote files like CSV, JSON and of course Parquet (and many more) as if they were local databases. Remote file query is especially powerful: if the hosting object store supports the Amazon S3 REST API, a client that speaks it - as DuckDB does - can send SQL queries over HTTP. Equally powerful is DuckDB's ability to unpack complex data types (arrays and structs) into a rich data model like I'm doing here (base features and 1:M related tables), rather than just flat tables. Only the query result is returned, not the remote file.
The third star of the cloud native show is ArcGIS Online hosted notebooks, which can be used to orchestrate DuckDB transactions and integrate the results into new or refreshed hosted feature layers in Online or Enterprise. The superpower of this combination of Parquet, DuckDB and Python is that global scale data can be queried for a subset of interest, in a desired schema, using industry standard SQL - and the whole process can be automated. This forever deprecates the legacy workflow of downloading superset data files to retrieve the part you want. At writing, the places data resolves to 6 files totaling 5.54GB, not something you want to haul over the wire before you start processing! If you think about it, any file format you have to zip to move around and unzip to query (shapefile, file geodatabase) generates some friction; Parquet avoids this.
The notebook named DuckDBIntegration is what I used to import the places data and is in the blog download. Notebooks aren't easy to share graphically but I'll make a few points.
Firstly, ArcGIS notebooks don't include the DuckDB Python API, so it needs to be installed from PyPI. Here is the code that does the import and also loads the spatial and httpfs extensions needed for DuckDB in this workflow. The DuckDB version at writing is 1.0.0, which is supported in the Esri conda channel for Pro. Note that DuckDB 1.1.0 is also available and has some spatial feature enhancements.
try:
    import duckdb
except:
    !pip install duckdb==1.0.0
    import duckdb
conn = duckdb.connect()
conn.sql("install spatial;load spatial;")
conn.sql("install httpfs;load httpfs;")
conn.sql("set s3_region='us-west-2';")
conn.sql("set enable_object_cache=true;")
Once DuckDB is loaded it is a matter of extracting the layers of interest. I did not model the full Overture places schema, for example omitting alternate place categories.
If you browse the places schema note the YAML tab. Don't be surprised in 2025 if you see YAML powering AI to let you have conversations with your data 😉
As Overture data is of global scale, with hundreds of millions of features for the base Places layer, the extract must be limited to my area of interest. The method I adopted was to use the Division Area theme as my selection layer. This is all the world's political divisions from country to neighborhood scale - I build a selecting geometry from a SQL query on division area subtypes and names. The blog download includes a notebook DivisionArea which extracts the data to a file geodatabase feature class. After some data exploration I determined this SQL where clause would define my area of interest:
country = 'GB' and subtype in ('county','locality') and names.primary in ('City Of London','London','Camden','Tower Hamlets','Islington')
Then in my cell that makes my extraction polygon I use this SQL and DuckDB's spatial operators to construct my selection geometry, both as a Well Known Text polygon and XY bounds variables:
sql = f"""select ST_AsText(ST_GeomFromWKB(geometry)) as wkt
from read_parquet('s3://overturemaps-us-west-2/release/{release}/theme=divisions/type=division_area*/*', filename=true, hive_partitioning=1)
where
country = 'GB' and subtype in ('county','locality') and names.primary in ('City Of London','London','Camden','Tower Hamlets','Islington');"""
wktList = [row[0] for row in conn.sql(sql).fetchall()]
wkt = wktList[0]
for w in wktList[1:]:
    wkt = conn.sql(f"""select ST_AsText(ST_Union(ST_GeomFromText('{wkt}'),ST_GeomFromText('{w}')));""").fetchone()[0]
xmin,ymin,xmax,ymax = conn.sql(f"""with aoi as (select ST_GeomFromText('{wkt}') as geom)
select ST_XMin(geom),ST_YMin(geom),ST_XMax(geom),ST_YMax(geom) from aoi;""").fetchone()
I then make a view of the Places theme which defines the features I want:
sql = f"""create or replace view places_view as select *
from read_parquet('s3://overturemaps-us-west-2/release/{release}/theme=places/type=place*/*',filename=false, hive_partitioning=1)
where bbox.xmin >= {xmin}
and bbox.ymin >= {ymin}
and bbox.xmax <= {xmax}
and bbox.ymax <= {ymax}
and ST_Intersects(ST_GeomFromText('{wkt}'), ST_GeomFromWKB(geometry));"""
conn.sql(sql)
The Places data schema models the expansion tables as arrays of structs, so that each place may have a 1:M relationship to its expansion properties. If this terminology is new to you, it is equivalent to Python lists of dictionaries, but with the property that the dictionaries are guaranteed to have the same key names in each row.
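To make that concrete, here is an illustrative sketch of one place's addresses property viewed as a Python list of dictionaries; the values are made up, but the keys mirror the struct fields the address extraction SQL further down pulls out:

# Illustrative only - hypothetical values; keys match freeform, locality, region, postcode, country.
addresses = [
    {"freeform": "1 Example Street", "locality": "London", "region": None,
     "postcode": "EC1A 1AA", "country": "GB"}
]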
Now that I have my Places view, it is a matter of throwing SQL statements at it to create relations (cursors) to use with ArcPy insert cursors that write each layer. If you inspect the ExtractDivisions notebook you'll see it is possible to use a built-in file geodatabase driver in DuckDB to dump data out quickly and easily (a minimal sketch of that quick-dump route appears after the cursor code below). However, this doesn't control the schema very well, so I take a longer-winded approach to creating the output data, relying on ArcPy to create my output. Here is the code which writes the Places base layer:
arcpy.env.workspace = arcpy.management.CreateFileGDB(out_folder_path=out_folder_path,
                                                     out_name=out_name,
                                                     out_version="CURRENT").getOutput(0)
arcPlaces = arcpy.management.CreateFeatureclass(out_path=arcpy.env.workspace,
                                                out_name="Places",
                                                geometry_type="POINT",
                                                has_m="DISABLED",
                                                has_z="DISABLED",
                                                spatial_reference=sR).getOutput(0)
arcpy.management.AddField(in_table=arcPlaces,field_name="id",field_type="TEXT",field_length=32)
arcpy.management.AddField(in_table=arcPlaces,field_name="name_primary",field_type="TEXT",field_length=1000)
arcpy.management.AddField(in_table=arcPlaces,field_name="category_main",field_type="TEXT",field_length=100)
arcpy.management.AddField(in_table=arcPlaces,field_name="confidence",field_type="FLOAT")
arcpy.management.AddField(in_table=arcPlaces,field_name="update_time",field_type="DATE")
arcpy.management.AddField(in_table=arcPlaces,field_name="version",field_type="SHORT")
sql = f"""select id,
names.primary as name_primary,
categories.main as category_main,
confidence,
try_cast(update_time as datetime) as update_time,
version,
geometry
from places_view;"""
duckPlaces = conn.sql(sql)
with arcpy.da.InsertCursor(arcPlaces,["id","name_primary","category_main","confidence","update_time","version","shape@"]) as iCursor:
    row = duckPlaces.fetchone()
    i = 1
    if row:
        while row:
            if i % 10000 == 0:
                print('Inserted {} Places rows at {}'.format(str(i),getNow()))
            row = list(row)
            row[-1] = arcpy.FromWKB(row[-1])
            iCursor.insertRow(row)
            i+=1
            row = duckPlaces.fetchone()
del iCursor
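For reference, here is a minimal sketch of the quick-dump alternative mentioned above, using the spatial extension's COPY ... WITH (FORMAT GDAL, ...) support; the output name is hypothetical, and writing OpenFileGDB depends on the GDAL build bundled with your DuckDB spatial extension:

# Minimal sketch - 'places_quick.gdb' is a hypothetical output name.
# The spatial extension exposes GDAL drivers through COPY ... WITH (FORMAT GDAL, DRIVER ...).
sql = """copy (
             select id, names.primary as name_primary, ST_GeomFromWKB(geometry) as geom
             from places_view
         ) to 'places_quick.gdb'
         with (FORMAT GDAL, DRIVER 'OpenFileGDB');"""
conn.sql(sql)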
So, the above creates my point layer, then it is a matter of extracting each expansion table, which all follow the same pattern. Here is the code that extracts addresses from the Places view:
sql = """with address as (select id, unnest(addresses, recursive := true) from places_view where addresses is not null)
select id, freeform, locality, region, postcode, country from address;"""
duckAddresses = conn.sql(sql)
arcAddresses = arcpy.management.CreateTable(out_path=arcpy.env.workspace,out_name="Addresses").getOutput(0)
arcpy.management.AddField(in_table=arcAddresses,field_name="id",field_type="TEXT",field_length=32)
arcpy.management.AddField(in_table=arcAddresses,field_name="freeform",field_type="TEXT",field_length=300)
arcpy.management.AddField(in_table=arcAddresses,field_name="locality",field_type="TEXT",field_length=100)
arcpy.management.AddField(in_table=arcAddresses,field_name="region",field_type="TEXT",field_length=50)
arcpy.management.AddField(in_table=arcAddresses,field_name="postcode",field_type="TEXT",field_length=100)
arcpy.management.AddField(in_table=arcAddresses,field_name="country",field_type="TEXT",field_length=2)
with arcpy.da.InsertCursor(arcAddresses,["id","freeform","locality","region","postcode","country"]) as iCursor:
    row = duckAddresses.fetchone()
    i = 1
    if row:
        while row:
            if i % 10000 == 0:
                print('Inserted {} Addresses rows at {}'.format(str(i),getNow()))
            iCursor.insertRow(row)
            i+=1
            row = duckAddresses.fetchone()
del iCursor
Note the fancy "unnest" operator in the SQL. This expands the array of address structs into new rows for each address, inheriting the id field for each expanded row.
Once all expansion tables are extracted I make the relationship classes from Places points to each expansion table:
for destination in ["Addresses","Brands","Websites","Sources","Socials","Phones"]:
    arcpy.management.CreateRelationshipClass(origin_table="Places",
                                             destination_table=destination,
                                             out_relationship_class="Places_{}".format(destination),
                                             relationship_type="SIMPLE",
                                             forward_label=destination,
                                             backward_label="Places",
                                             message_direction="NONE",
                                             cardinality="ONE_TO_MANY",
                                             attributed="NONE",
                                             origin_primary_key="id",
                                             origin_foreign_key="id",
                                             destination_primary_key="",
                                             destination_foreign_key="")
Then attach some metadata to each layer as a good practice, some counts, times and license details:
current_time_utc = datetime.now(pytz.utc)
pst = pytz.timezone('US/Pacific')
current_time_pst = current_time_utc.astimezone(pst)
current_time_pst_formatted = current_time_pst.strftime('%Y-%m-%d %H:%M:%S')
objDict = {obj:0 for obj in ["Places","Addresses","Brands","Phones","Socials","Sources","Websites"]}
for k in objDict.keys():
    objDict[k] = int(arcpy.management.GetCount(k).getOutput(0))
for obj in objDict.keys():
    objCount = objDict[obj]
    new_md = arcpy.metadata.Metadata()
    new_md.title = f"Overture Maps Foundation {obj} in London, release {release}."
    new_md.tags = f"{obj},London,Overture"
    new_md.summary = f"For mapping and analysis of {obj} in ArcGIS."
    new_md.description = f"{objCount} {obj} features in London, Great Britain."
    new_md.credits = f"Esri, {current_time_pst_formatted} Pacific."
    new_md.accessConstraints = f"https://cdla.dev/permissive-2-0/"
    tgt_item_md = arcpy.metadata.Metadata(obj)
    if not tgt_item_md.isReadOnly:
        tgt_item_md.copy(new_md)
        tgt_item_md.save()
    else:
        print(f"{obj} metadata is read only.")
I'll let you inspect the rest of the DuckDBIntegration notebook code yourselves, but all that is left is to zip up the output geodatabase and use it to overwrite the feature service being maintained. The most important point about the feature service is that it must have been created by sharing a file geodatabase item, not shared from Pro. Making the initial file geodatabase item is just a matter of manually running the notebook down to the cell that does the zipping, then downloading the zip file and using it to create the item.
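For orientation, here is a minimal sketch of that zip-and-overwrite step, assuming a hypothetical geodatabase name and target item ID; the overwrite itself uses FeatureLayerCollection.manager.overwrite from the ArcGIS API for Python:

import shutil
from arcgis.gis import GIS
from arcgis.features import FeatureLayerCollection

# Minimal sketch - 'London.gdb' and 'your_item_id' are hypothetical placeholders.
zip_path = shutil.make_archive('/arcgis/home/London', 'zip',
                               root_dir='/arcgis/home', base_dir='London.gdb')
gis = GIS("home")  # in a hosted notebook; use GIS(url, username, password) elsewhere
target_item = gis.content.get('your_item_id')  # the feature layer created from a file geodatabase item
flc = FeatureLayerCollection.fromitem(target_item)
flc.manager.overwrite(zip_path)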
To go into production with this approach, first figure out a query on extent or division area fields that works for you, then plug it into the notebook. You'll need to supply your own credentials of course, and change the target GIS if you're using Enterprise for the output feature service. Then at each Overture release, change the release variable to match Overture and refresh your service.
To give you a feel for performance, look at the download notebook: you'll see it takes about 6 1/2 minutes to run - amazingly fast for the data scale. I am, though, co-located with the S3 region being accessed.
Naturally, if the data schema changes or to support other Overture themes, you'll need to author notebook changes or new notebooks. Do share!
I'm sure you'll agree this is a powerful pattern - I'm just following industry trends here - and it will be fun to see if Parquet is what comes after the shapefile for data sharing. The blog download has the notebook but not a file geodatabase (it's too big for the blog technology), but when you generate your own services don't forget the data license here. Have fun!
01-19-2024 11:27 AM

IDEA
Hello Donald, the Data Interoperability extension for Pro can write TopoJSON. Regards
12-15-2023 06:19 AM

POST
Hello Ashley, it seems you have struck an issue we haven't found a cause for: https://pro.arcgis.com/en/pro-app/latest/tool-reference/tool-errors-and-warnings/160001-170000/tool-errors-and-warnings-160226-160250-160236.htm If you open a support call, the analyst can look at your workflow. Re subscribing: if you subscribe to the whole Data Interoperability board you'll get everything - and thank you for asking.
12-06-2023 11:28 AM

BLOG
One of the hidden gems in ArcGIS Pro is an ability to reach out to external geographic data sources using simple scripted workflows, which you can optionally automate. Here is today's subject matter, an OGC API Features layer mirrored to a hosted feature layer in ArcGIS Online, and the notebook that does the job:

Electricity Transmission Lines

The secret sauce in my cooking is GDAL - the Geospatial Data Abstraction Library. Esri ships GDAL in the standard conda environment in ArcGIS Pro; here is the package description: GDAL is a translator library for raster and vector geospatial data formats that is released under an X/MIT style Open Source license by the Open Source Geospatial Foundation. As a library, it presents a single raster abstract data model and vector abstract data model to the calling application for all supported formats.

While you can go deep with GDAL, in the way I'm using it here it's "copy-paste" ETL with a small T - I didn't do any feature-level transformation. All you have to know is where your source data is and what format it is, and then plug it into the notebook. I count 84 vector formats in the supported vector drivers, so while it isn't ArcGIS Data Interoperability, it's a useful subset. Datasets can be files or at HTTP, FTP or even S3 URLs. My example uses OGC API Features, but this isn't important: GDAL abstracts the concept of a dataset, so you can use the same code for any format. On the topic of OGC API Features, though, the standard supports the concepts of a landing page, collections and items, and you can supply a URL to any of these endpoints, which you can navigate to in a browser. I imported the electricity transmission lines collection from this landing page.

I'll let you walk through the notebook (in the blog download), but in summary it converts external data to a file geodatabase and then overwrites a feature service with the data. To create the feature service in the first place, pause at the cell that creates a zip file, upload and publish it, then use the item ID to set your target. Do not create the feature service from map layers; the system requires a file-based layer definition. Run the notebook on any frequency that suits you, or turn it into a Python script tool and schedule it. Make sure you inject the right format short name ahead of your data path.

There you have it, simple maintenance of a hosted feature service sourced from external data! It isn't real data virtualization, as the data is moved, but it is an easy way to make data available to all.
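To give a flavor of the approach, here is a minimal sketch of the conversion step, assuming a hypothetical OGC API Features landing page URL, collection name and output path, and assuming your GDAL build can write OpenFileGDB (GDAL 3.6+); swap in another driver such as GPKG if it cannot:

from osgeo import gdal

gdal.UseExceptions()

# Minimal sketch - the URL, layer name and output path are hypothetical.
# "OAPIF:" is the GDAL format prefix for OGC API - Features datasets.
src = gdal.OpenEx("OAPIF:https://example.com/ogcapi", gdal.OF_VECTOR)
gdal.VectorTranslate("staging.gdb", src,
                     format="OpenFileGDB",          # needs GDAL 3.6+; use "GPKG" on older builds
                     layers=["transmission_lines"]) # hypothetical collection name
src = None  # close the source dataset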
11-28-2023 11:29 AM

IDEA
Hi everyone, we have a blog coming on this thread but in the meantime the team have a community post up to get feedback, so please visit https://community.esri.com/t5/data-management-questions/generate-schema-report-tool-feedback/m-p/1350612#M44784
11-17-2023 11:15 AM

IDEA
This Idea has been implemented in ArcGIS Pro 3.2. Please see the What's New documentation for more new features in Pro 3.2.
11-15-2023 09:39 AM

IDEA
This Idea has been implemented in ArcGIS Pro 3.2. Please see the What's New documentation for more new features in Pro 3.2.
11-15-2023 09:37 AM

IDEA
The multithreaded nature of Pro, and the fact that Data Interoperability runs in a separate process, made it hard to make connections work. Memory workspace persistence between sessions is coming, but I'm not sure of the time frame.
11-15-2023 06:09 AM

IDEA
Hi Cordula, the nature of Pro makes it hard for us to replicate the interoperability connection experience you are referring to. What you can do is create a toolbox with a model that does a Quick Import, then copies the data into memory and into your map, like this:
11-14-2023 06:54 AM

BLOG
This post shows how to use an ArcGIS Online hosted notebook (runtime v10.0) as a data pump between any data sources on the internet, my specific example being the Snowflake cloud warehouse and a hosted feature service in ArcGIS Online. The systems of record you integrate are limited only by your imagination. When scheduled as a notebook task, this delivers continuous integration, in either or both directions. As a bonus, you can also perform analysis on your data during data movement. Some things I'm sharing that I learned from this exercise are:

- How to add notebook dataset connectivity on the fly from Conda Forge or PyPI
- How to use a notebook's built-in file system as a staging and analysis workspace
- How to leverage a data warehouse's processing power to do work remotely
- How to use SQL to unpack JSON and make relational data
- How to use feature service relationships in queries in Pro
- How to use a couple of new field types, BigInteger* and TimeOnly*
- How to continuously maintain a hosted feature service with the ArcGIS Python API

* Requires notebook runtime version 9+ and ArcGIS Pro 3.2+

My example uses Esri business partner SafeGraph Places data, with their kind permission, for which I am grateful. If you follow the link you'll see that Places model point-of-interest data. Places have geometry (polygon or point) and attributes. Some of the attribution is best modeled as related data, and you can see from the map popup that my data model has four lookup tables for places to expose the brands, place categories, web domains and open hours known for a place.

San Francisco Places

The main topic of this post is using an Online notebook to maintain a hosted feature layer with source data from an internet connection, namely a cloud warehouse, without compromising data preparation or data model flexibility - expanding JSON columns and creating relationship classes is shown. This post does not have a dependency on ArcGIS Data Interoperability, but it does include an ETL tool as one option for creating the initial file geodatabase used to publish the target hosted feature service; otherwise use the supplied XML workspace document to do this. If you do have Data Interoperability, the supplied ETL tool looks like this:

BlogImportPOI.fmw

Import Places

I'm accessing SafeGraph's data from their Snowflake Marketplace entry; access to this is provisioned by SafeGraph. When you connect with a Marketplace entry it becomes an available database in your Snowsight console and through any connecting technology Snowflake offers, like an ODBC connection or the REST API. I could have generated my initial geodatabase by adding a Snowflake database connection in ArcGIS Pro, creating query layers, and using core geoprocessing tools. I just found it easier to make a Spatial ETL tool.

Something we're all going to see more of, even in georelational warehouses like Snowflake, is semi-structured data as JSON documents in variant columns. That is the case for SafeGraph places, where the brand, category, domain and open hours values are inside JSON objects in a 1:M cardinality with the owning place. Retaining JSON in fields would make it very hard to query in ArcGIS. Here's an example value for brand data, an array of dictionaries:

[
  {
    "safegraph_brand_id": "SG_BRAND_1e8cb2c9bf1caabd",
    "safegraph_brand_name": "Bentley Motors"
  },
  {
    "safegraph_brand_id": "SG_BRAND_89ede26d6a5e8208ad28607464dfa1f3",
    "safegraph_brand_name": "Land Rover"
  },
  {
    "safegraph_brand_id": "SG_BRAND_2fa3cf424791add1",
    "safegraph_brand_name": "Aston Martin"
  },
  {
    "safegraph_brand_id": "SG_BRAND_b94705c817b09f287e31606604e526ba",
    "safegraph_brand_name": "Jaguar"
  }
]

Here is another for open hours, a dictionary keyed on day name with arrays of open/close times:

{
"Fri": [
[
"10:00",
"18:00"
]
],
"Mon": [
[
"10:00",
"18:00"
]
],
"Sat": [
[
"10:00",
"17:00"
]
],
"Sun": [
[
"12:00",
"16:00"
]
],
"Thu": [
[
"10:00",
"18:00"
]
],
"Tue": [
[
"10:00",
"18:00"
]
],
"Wed": [
[
"10:00",
"18:00"
]
]
}

It makes sense to unpack this JSON in Snowflake before it gets to your client. The job in each case is to make new rows for each extracted value while retaining the key field in each row. Snowflake provides the FLATTEN table function for the purpose (a minimal illustrative FLATTEN query appears at the end of this post). This is a proprietary SQL extension, but analogs exist in other cloud warehouses. SafeGraph places have a primary key named PLACEKEY, so we retain that in all cases. I'll let you inspect the SQL I used to unpack each JSON document type when you look at the code in the notebook in the blog download, but the tabular results for brands and open hours look like this:

Brands:

Open Hours:

I took this detour through some of the processing to make the point that with a back end as powerful as Snowflake, you should get it to do as much of the processing as possible before your integration. If your integration is against data summaries coming from big data, this is particularly relevant. With the data model I'm using here, reading from Snowflake with a cursor and writing to the file geodatabase in my notebook with another cursor, I get about 10K/sec point features, 7.5K/sec polygon features and 50K/sec table rows throughput - plenty fast enough for the use case.

Now let's get to the core of the matter: how to implement a continuous integration. With continuous integration against feature services, you do not want to break any item properties, like the item ID or item description metadata. The way to do this is with the overwrite method. A dependency of overwrite is that you must have published the service from a file item, not from map content in Pro, and you must supply a new file item of identical schema to effect the overwrite. I'm using a file geodatabase. In fact I used two editions of the geodatabase: first one with all place data for California, so I could get my data model correct (field type, width etc.), then replacing this with one with just Redlands, CA place data, so I could store only a small portal item used for overwrite schema checking. If you download the blog attachment and have the Data Interoperability extension you will see my ETL tool has table handling set to Truncate Existing when writing to my geodatabase. When I was happy with my geodatabase schema and had made some layer files from the full version of the data to capture some symbology and popup behavior, I zipped the small version of the geodatabase and published my initial feature service.

To start constructing my notebook I had to learn a fundamental concept. Notebooks support a local filesystem, and this is where I built my overwrite payload (a zipped file geodatabase); I used the default /arcgis/home folder. You can think of this filesystem as a staging area for your data, where you can optionally do some analysis. In my case I'm not doing any analysis, but I am assembling a data model and populating it. You might not need to do any geoprocessing, say if you're reading from one cloud warehouse and writing directly to another, but if you know ArcPy the possibilities are endless.

Now the secret sauce. If you read up on Online notebooks you'll see the available Python libraries, and that Snowflake is not amongst them. So install it! Here is the cell in my notebook that adds the connectivity I want:

import arcgis
import arcpy
from datetime import datetime
import os
import zipfile
def getNow():
    return str(datetime.utcnow().replace(microsecond=0))
print('Processing starting at {}'.format(getNow()))
gis = arcgis.gis.GIS("https://www.arcgis.com", gisUser, gisPass)
arcpy.SignInToPortal("https://www.arcgis.com", gisUser, gisPass)
os.environ['CRYPTOGRAPHY_OPENSSL_NO_LEGACY'] = '1' # Snowflake dependency
!pip install snowflake-connector-python # pip is faster than conda, or we would use !conda install -y -c conda-forge snowflake-connector-python
import snowflake.connector
print('Snowflake installed at {}'.format(getNow()))
conn = snowflake.connector.connect(
    account=snowAccount,
    user=snowUser,
    password=snowPass,
    warehouse=snowWarehouse,
    database=snowDatabase,
    schema=snowSchema)
print('Imports complete at {}'.format(getNow()))

The lines that provide the integration magic are the os.environ setting, the !pip install, and the import snowflake.connector. I'm making the bold claim that you can integrate with any internet data source because the Python ecosystem is so well endowed and maintained. While Online notebooks come with some basic libraries for Amazon, Azure Storage and Google, there are many more you can reach out for in Conda Forge and PyPI using this technique. Do your research; it may take some trial and error. Online notebooks are Linux containers, so check that a package supports the platform. For example, the pyodbc module isn't available on Linux, so if you're integrating with Azure SQL you'll need the pymssql module. Before committing to an integration, try a stub notebook that tests the import. My notebook uses a Snowflake cursor, but the package also supports working with pandas dataframes if that's how you want to work. If your integration is with ArcGIS Enterprise you would authenticate to that GIS. Remember you can write to the web as well as read from it. A cursor might not be the best method either; for example, in this blog post I upload a Parquet file to Snowflake.

The blog download contains my notebook source, the Spatial ETL workspace to read Snowflake and write file geodatabase, plus a schema-only geodatabase XML document you can import. After connecting to Snowflake the data movement is simple; I'll let you inspect the notebook source. It took some trial and error to get the SQL right for Snowflake, but when you're done you have a powerful tool. You can share your notebook and keep the credentials private using this approach. Snowflake is read, a file geodatabase is staged, then used to overwrite an existing service. Note that, if your data lifecycle (or size) warrants it, in addition to the overwrite method used in this notebook there are also append and upsert methods. Overwrite is an expensive operation for big data, so if your data can be maintained by append and upsert it is worth considering. Say your data size is in the tens of millions of records: you might have one notebook (or cell) that overwrites a service with a small file and another that appends to it with a much larger one.

To go into production with this sample, the technical steps with core ArcGIS are:

- Obtain access to SafeGraph's Snowflake Marketplace
- Create a new file geodatabase in ArcGIS Pro 3.3+
- Import the geodatabase XML document to the file geodatabase
- Zip the file geodatabase and add it as an item to Online
- Share a feature service from the file geodatabase item
- Create a hosted notebook from the ipynb file supplied
- Edit the injected parameters cell and target item ID
- Create a scheduled task for the notebook

Having put it all together you can use the service across ArcGIS. You'll see in the supplied notebook that my service has all places for San Francisco, California; if I want to find where I can go for breakfast on Saturdays between 8AM and 10AM, I would query it like this:

TOP_CATEGORY = 'Restaurants and Other Eating Places' AND PLACEKEY IN (SELECT DISTINCT PLACEKEY FROM OPENHOURS WHERE OPENHOURS.DAY = 'Sat' AND OPENHOURS.OPENS <= TIME '08:00:00' And OPENHOURS.CLOSES >= TIME '10:00:00')

There you have it, a continuous integration powered by an ArcGIS Online hosted notebook! Please comment in this blog space with any questions or suggestions. I hope you all make lots of data pumps; let us know!
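As promised above, here is a minimal illustrative sketch of the FLATTEN pattern used to unpack the brands JSON; the PLACES table and BRANDS variant column names are hypothetical, not the actual SafeGraph schema (see the notebook in the blog download for the real SQL):

# Illustrative only - PLACES and BRANDS are hypothetical names.
# LATERAL FLATTEN emits one row per array element, so each brand row keeps its PLACEKEY.
sql = """
select p.PLACEKEY,
       b.value:safegraph_brand_id::string   as SAFEGRAPH_BRAND_ID,
       b.value:safegraph_brand_name::string as SAFEGRAPH_BRAND_NAME
from PLACES p,
     lateral flatten(input => p.BRANDS) b
where p.BRANDS is not null
"""
cur = conn.cursor()
cur.execute(sql)
for placekey, brand_id, brand_name in cur:
    pass  # e.g. feed rows to an arcpy.da.InsertCursor, as the notebook does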
11-03-2023 12:51 PM

DOC
Thanks for visiting the ArcGIS Data Interoperability documents board; for the news about our 3.2 release please check out the blog here - Part 2!
10-31-2023 11:33 AM