BLOG
|
@StuartSmith1 The Python script that writes item detail metadata should work with any version of Pro that has the arcpy.metadata module, and does not need Data Interoperability. The workflows using the ArcGISOnlineConnector use Data Interoperability at the Pro 3.3 release; they may work at an earlier release, but this has not been tested.
09-24-2024
06:10 AM
|
0
|
0
|
410
|
BLOG
|
A comment on LinkedIn prompts me to make clear what keeps the article's pagination approach simple. You will usually expand the arrays returned from each API call, but do not do this inside your looping custom transformer: your loop continuation test might then require sampling or counting features, both of which need blocking transformers, which in turn force the custom transformer to be externally linked - much harder to manage. I have made that mistake, and it put me off looping until I figured it out! If you inspect the loop continuation test in the UpdateFS workspace you'll see I test that the response is JSON and does not match the regular expression \[\s*\]. The server returns "[ ]" (with a space) for an empty array, hence the regex allowing any number of spaces, including none.
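To make that continuation test concrete, here is a minimal Python sketch of the equivalent check (the _response_body attribute name follows the HTTPCaller convention mentioned in the article; the function is illustrative, not part of the workspace):

import re

def more_pages(response_body: str) -> bool:
    # "[]" or "[ ]" (an empty JSON array) means the data is exhausted; keep looping otherwise
    return re.fullmatch(r"\[\s*\]", response_body.strip()) is None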
09-16-2024
10:20 AM
|
0
|
0
|
725
|
BLOG
|
The Problem
I'll try out a new acronym here - B2G - Business To GIS - meaning the delivery of externally managed data into ArcGIS. It is very common that external ("business") systems offer a REST API you can call to get common format data like JSON, but also very common that the server delivers a dataset in multiple chunks (i.e. "pages"), and that the JSON isn't in a well-defined dialect like GeoJSON - you have to parse it into columns, potentially from struct or array objects.
This post is about conquering the two challenges of pagination and parsing, without coding, using ArcGIS Data Interoperability.
My subject matter data is Federal Communications Commission customer complaints data about unwanted calls, filtered for 2024. If you want to see how it looks as incoming JSON click here for 1000 random records, but here's the end result as a feature service:
FCC Customer Complaints - unwanted calls
Some research reveals several common pagination approaches:
1. Offset and limit parameters defining the starting row position and row count, reading ordered data; this approach may include another parameter specifying which property sets the sort order.
2. Page number and (optionally) page size parameters.
3. Query-based pagination, where a start row is defined by a logical query, implying any next query.
4. Time-based query, a special case of query-based pagination.
5. Cursor-based pagination, where the API provides page names, including any next page, in the result.
6. Combinations of the above.
The first two approaches are the simplest (and the only ones I have used): the server does the page-building calculations and the client need only keep track of a simple counter, sending requests until the data is exhausted.
Here is an example of an offset and limit based API and here is an example of a page based API.
For an example of #6, combining #3 & #4, you can read up on querying hosted feature layers. ArcGIS takes care of building layer queries for you.
How about parsing made simple too? We have to turn records like this into typed columns and geometry!
{
"issue_type" : "Phone",
"caller_id_number" : "830-210-2001",
"state" : "IL",
"method" : "Wired",
"advertiser_business_phone_number" : "None",
"issue_time" : "8:41 am",
"issue" : "Unwanted Calls",
"zip" : "60629",
"type_of_call_or_messge" : "Live Voice",
"issue_date" : "2024-01-04T00:00:00.000",
"id" : "3739134",
"location_1" : {
"latitude" : "41.781382",
"human_address" : "{\"address\": \"\", \"city\": \"IL\", \"state\": \"\", \"zip\": \"60629-5219\"}",
"needs_recoding" : false,
"longitude" : "-87.732853"
}
}
No problem! Let's get started.
Pagination
We're calling a REST API here, and it offers a paginated response. In ArcGIS Data Interoperability that means calling the HTTPCaller transformer in a loop, in this case with offset, limit and order parameters, incrementing the offset value each call, until all data is received and the response is an empty array. You might not have known you can loop in a visual programming environment, but you can, and it's easy. The mint green custom transformer "UnwantedCallsLooper" near top left in my workspace does the job. It is a looping custom transformer.
Parent ETL workspace
The custom transformer is embedded (included) in the Main canvas; it lives in its own eponymously named tab. You create custom transformers by selecting one or more ordinary transformers in the Main canvas; a right-click context menu then offers a custom transformer creation choice. In my case I selected just a single HTTPCaller transformer.
After creating the custom transformer there were more editing steps to set up looping:
Add a loop return connection using a context menu choice in the canvas
Add a test for completion ahead of the loop return
Increment the offset parameter
Below is how my custom transformer looks. At run time each feature arriving from the Main canvas has an offset attribute used by the HTTPCaller transformer. The limit and order parameters do not change so are "hard coded" in the HTTPCaller. The lower stream is the loop-until condition - if the response is not an empty array the data isn't exhausted, so increment the offset and loop!
Looping custom transformer
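If you prefer to see the looping pattern spelled out in code, here is a minimal Python sketch of the same offset/limit loop (the endpoint URL and parameter names are placeholders, not the FCC API's exact ones; the custom transformer does this without any code):

import requests

url = "https://example.com/api/complaints"  # placeholder endpoint
offset, limit = 0, 1000
records = []
while True:
    # limit and order stay fixed; only offset changes between calls
    page = requests.get(url, params={"offset": offset, "limit": limit, "order": "id"}).json()
    if not page:  # an empty array means the data is exhausted
        break
    records.extend(page)
    offset += limit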
Parsing
Each API response is an array of JSON structs and is contained in an attribute named "_response_body". The array is exploded into separate features with a JSONFragmenter. The JSON Query parameter means we're fragmenting the top-level array, and the other settings ensure delivery of fragments that look like the code block above.
JSONFragmenter
Now we parse each fragment feature with a JSONExtractor.
JSONExtractor
The Extract Queries grid defines the output attribute names and where the data comes from in the fragment. The JSON queries follow a simple syntax you can key in, but there is a trick to getting a picker that lets you navigate the JSON structure and automate query generation. Ahead of the JSONExtractor, temporarily sample a single feature and write it to a file, then temporarily set your JSONExtractor input source to this file and you'll get a handy picker for your queries! When you have populated your extract queries, remove the Sampler and reset your JSONExtractor input source to be the document in _response_body.
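For anyone curious what the equivalent parsing looks like in code, here is a minimal Python sketch that flattens one fragment like the record shown earlier (the field choices are illustrative; the JSONExtractor does this for you without code):

import json

fragment = '{"issue_date": "2024-01-04T00:00:00.000", "zip": "60629", "location_1": {"latitude": "41.781382", "longitude": "-87.732853"}}'
record = json.loads(fragment)
row = {
    "issue_date": record.get("issue_date"),
    "zip": record.get("zip"),
    # The nested location_1 struct supplies the coordinates for geometry creation
    "latitude": float(record["location_1"]["latitude"]),
    "longitude": float(record["location_1"]["longitude"]),
}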
So that is pagination and parsing de-mystified! In the blog download there are two workspace source fmw files. CreateFC creates a file geodatabase output and is where I figured out the data processing, like fixing data errors that defeated creation of a datetime field. I wanted an initial feature class anyway to symbolize and share as my target hosted feature service. UpdateFS borrows the data processing steps in CreateFC but contains the looping custom transformer and some logic to detect and apply data changes whenever the ETL tool is run, which is what you'll likely need in production.
Please comment in this post with any questions and observations. Page away!
09-16-2024
08:02 AM
|
1
|
1
|
1435
|
BLOG
|
Here is a run with some sample feature counts.

Upsert with feature counts
09-09-2024
01:47 PM
|
0
|
0
|
883
|
BLOG
|
Thanks for the question. I edited the article to make it clearer where upsert capability is supported in ArcGIS. My worked example uses ArcGIS Data Interoperability, which can work against hundreds of data sources, but note also that core geoprocessing supports upsert in the Append geoprocessing tool, and ArcGIS Data Pipelines also has an upsert capability.
09-09-2024
06:01 AM
|
0
|
0
|
953
|
BLOG
|
Today's featured transaction is upsert, ably assisted by delete. If you are maintaining a geodatabase or hosted feature layer (ArcGIS Online or ArcGIS Enterprise), unless you're doing something drastic like a truncate, you can package all your ETL edits up into these two methods using ArcGIS Data Interoperability (but also see the core options below). Inserts and updates travel together as upserts, while deletes speak for themselves.

For example use case data, I'm using the City of Los Angeles Open Data Address Points. If you're keen to take a look at the raw material, this link will trigger a CSV file download (161MB+, 1M+ rows).

Los Angeles address points

The map shows the result of ETL into a hosted feature layer, which is my target information product. It could also be any type of geodatabase. The raw data is typical of many ETL sources, with these properties:
The data is not immediately accessible to ArcGIS
The data schema is a little cryptic
The data changes frequently, but only a small fraction of a big dataset
Edits may be inserts, updates or deletes
No metadata fields exist to track edits
The data has a persistent primary key field

The first two properties are basic ETL challenges which are easily solved. The frequent but small-fraction change pattern suggests that upserts and deletes are candidate methods, while the persistent primary key is what makes upserts possible. Using a match key is the secret sauce for this post, but like any good cooking show the recipe comes after a look at the result!

Upsert & Delete ETL Tool

This single ETL tool supports two stages in the lifecycle of my target information product:
Creation of the feature service
Applying updates to the feature service on demand

To create the feature service, the greyed-out writer labeled Creation would be enabled and the rest of the workspace would not exist. After service creation I disabled this writer and added the rest of the workspace. The workspace still reads the source CSV file at a public URL, but also reads the target feature service (which of course must already have been created), calculates the upsert and delete transactions, and applies them.

It's a simple and powerful ETL tool, but there is critical enabling work needed outside it. The REST API Append method that supports upsert operations requires the match key field (House_Number_ID in my case) to:
Have a unique index
Not allow null values

Meeting these conditions requires two simple processing steps outside the ETL tool before upsert processing will work - there is no way to set these properties in the ETL tool itself. Two factors are relevant: the Alter Field geoprocessing tool does not support changing the allow-nulls property for a feature layer field, but the Add Attribute Index geoprocessing tool does allow creating a unique index on a feature layer field.

To work around the allow-nulls problem I exported the initial feature layer to my project default geodatabase using the Export Features tool and, in the field map control, unset the allow-nulls property for the House_Number_ID field.

Export Features to Geodatabase

With the output feature class in Pro, I then created a map layer, taking the opportunity to apply symbology other than the pink lemonade default, then overwrote my target feature layer. With the overwritten target layer in place, I added a unique index for my match key field (a scripted equivalent is sketched at the end of this post):

Add Unique Index

Now the target layer is in shape for configuring upsert (and delete) processing! The ChangeDetector transformer is what generates the upsert and delete change sets. Here are the settings:

Change Detector Parameters

Using match attributes:

Detection Fields

Change detection comes at the expense of reading the target features in addition to the source features, but it does deliver an optimally small edit payload. For my subject matter data the whole job takes 5 minutes. To reiterate, while I'm showing a hosted feature layer as my target information product, upsert write capability is also available for geodatabase features via ArcGIS Data Interoperability.

For completeness, I'll give a shout out to two other places upsert capability is offered in ArcGIS, namely the Append geoprocessing tool and the Data Pipelines Add and Update output option. However, if you need to delete features as well as upsert them, then in core geoprocessing you'll need to use Delete Features or Delete Rows, or in Data Pipelines replace the output feature layer. This pipeline does that job for the feature layer my ETL tool operates on.

Data Pipeline Replacing Los Angeles Addresses

I'll take the opportunity here to call out that using Data Pipelines to maintain hosted feature layers in ArcGIS Online that were created by separate ETL processes is perfectly valid. In my pipeline you'll see the Map Fields tool that greatly assists connecting the schema coming from a CSV file with how I defined the schema using Data Interoperability.

So there it is, upsert is there for you, ready to go! The blog download has my Spatial ETL tool plus the associated toolbox with the models.
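As a rough Python sketch of the unique index step (this mirrors the Add Attribute Index geoprocessing tool; the feature class path and index name are assumptions for illustration):

import arcpy

# Hypothetical path to the exported feature class before overwriting the target layer
addresses_fc = r"C:\Projects\LA\Default.gdb\LosAngelesAddresses"
arcpy.management.AddIndex(addresses_fc, fields=["House_Number_ID"],
                          index_name="HouseNumberID_unique",
                          unique="UNIQUE", ascending="ASCENDING")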
09-06-2024
12:37 PM
|
2
|
4
|
1607
|
POST
|
Hi Ryan, there are a few ways. You can add a Spatial ETL tool to a ModelBuilder model, in which case you can do things like use the Select Data model tool to work with output data, add it to a map, etc. You can also use a FeatureSet input parameter in a model to interact with a map, for example getting the JSON representation of the featureset and making it an ETL tool input. Inside ETL tools themselves, a shutdown script can call ArcPy, including arcpy.mp and so on. Spatial ETL tools can also be called with Python - you can get the code by manually running one and using the history entry Copy Python Command - and use it in a Notebook; just remember to add code to check out the extension. Finally, you can schedule ETL tools just like any Pro geoprocessing tool.
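A minimal sketch of that Python pattern, assuming a toolbox path, alias and tool name copied from your own Copy Python Command output (all names below are placeholders):

import arcpy

# Spatial ETL tools need the Data Interoperability extension checked out
arcpy.CheckOutExtension("DataInteroperability")
arcpy.ImportToolbox(r"C:\Projects\MyProject\MyToolbox.tbx")
arcpy.MyETLTool_MyToolbox()  # placeholder tool name and toolbox alias
arcpy.CheckInExtension("DataInteroperability")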
09-05-2024
06:06 AM
|
1
|
0
|
491
|
IDEA
|
Bud, the Data Interoperability extension can write multiple worksheets into a single Excel file. The Excel file may already exist or be created by the process.
08-29-2024
12:55 PM
|
0
|
0
|
768
|
BLOG
|
Connecting to (geo)relational data is fundamental to working in ArcGIS, and many popular database platforms are supported as workspaces. However, there are also many unsupported database platforms out there. How can you use them? Well, this connectivity gap is so common across information technology that a standardized way to bridge it emerged long ago, namely ODBC.

ODBC stands for Open Database Connectivity, a widely used technology that provides a standardized way for programs to talk to databases. While slower than native drivers, it is plenty performant for everyday usage, and very easy for GIS analysts to install, given sufficient permissions. In ArcGIS Pro, OLE DB is used to implement ODBC connectivity, so documentation is organized under Work with OLE DB Connections. While the documentation for OLE DB connections in Pro correctly states that connections are read-only, the database being connected to may be read-write using other approaches in Pro, such as the ArcGIS Data Interoperability extension or a suitable Python module.

In the ModelBuilder session below, I am ingesting normalized data from a Microsoft Access connection into an ArcGIS Pro project geodatabase, including spatially enabling a column storing well-known-text (WKT) geometry. Four tables are joined to create the desired output information product for GIS mapping and analysis. The result can be refreshed on demand should the source data change.

ETL of OLE DB data into geodatabase

Of course, if your work is supported by simply reading an OLE DB source, you don't need any ETL steps at all! I'm just demonstrating using a core tool - ModelBuilder - for some typical tasks.

Why would you choose an unsupported database platform for your GIS work? The thing is, you might not choose to; it may just be a fact of life for your organization. For example, I have seen organizations in the energy sector collaborate using Microsoft Access databases, with various apps using parts of the schema, and the whole project, from inception to archival, depending on the database. ArcGIS Pro can collaborate in this environment.

OLE DB connections in ArcGIS Pro are configurable in a three-stage process:
Install and configure an ODBC driver (if not already present)
Create a Windows Data Source Name (DSN)
Create an OLE DB connection in Pro

Note in the DSN link above you have flexibility over who can access an OLE DB connection in Pro: one user, any user logged onto the machine, or a file-based approach suitable for wider sharing of a connection.

Here are my tables read from Microsoft Access; they are joined Person->BusinessEntityAddress->Address->StateProvince.

OLE DB tables from Access

After joining and creation of geometry the data is mapped! Note the data is fake - it's the venerable AdventureWorks data from Microsoft, created for tutorial purposes (but not included in the blog download) - there aren't really cyclists floating in the middle of Port Phillip Bay in Victoria, Australia!

Person table, spatially enabled

In a couple of related posts (local files, cloud files) I showed that remote files can be used as tabular sources and databases respectively, and the same is true of OLE DB connections - if an ODBC driver exists for a database, and the database is cloud hosted or web-aware, then you can use web data via OLE DB. There are many web-aware databases and data stores offering ODBC drivers.

In summary, OLE DB connections offer a simple way to extend the reach of ArcGIS into databases or data stores that are not supported ArcGIS workspaces.
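If you do need write access to such a database from Python, a minimal sketch with the pyodbc module might look like the following (the DSN, table and column names are assumptions, and this is separate from the read-only OLE DB connection in Pro):

import pyodbc

# Connect through the Windows DSN created for the ODBC driver (name is a placeholder)
conn = pyodbc.connect("DSN=AdventureWorksAccess")
cursor = conn.cursor()
# Example write against a hypothetical table and key
cursor.execute("UPDATE Person SET Suffix = ? WHERE BusinessEntityID = ?", ("Jr", 1))
conn.commit()
conn.close()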
08-27-2024
11:53 AM
|
2
|
1
|
627
|
BLOG
|
For a while now, file-based data has been migrating to the web, and we're all used to manually downloading what we need and taking it from there in our desktop software like ArcGIS Pro. This blog is about automating that experience, replacing the download and processing steps with easy to understand, well defined and shareable tooling that delivers your information product on demand.

At two ends of data scale I have worldwide political divisions and building permit activity in Vancouver, Canada. The former dataset is big and sourced from an AWS S3 bucket; the latter is small and sourced from an open data portal. They are both ingested using notebooks in ArcGIS Pro with the same basic tooling.

The small dataset first:

Vancouver building permits

The big dataset:

Worldwide political divisions

And in the middle ground, the political divisions for just Germany:

A view of political divisions for Germany

All these datasets were ingested similarly but with noteworthy differences we'll discuss below. If you're not familiar with GeoParquet you will be, and I'm making it my well-known format of choice for this post. Other common cloud file types like CSV, Excel and JSON would work too. By "cloud" I mean not just where you go to get them but formats that lend themselves to remote query without your having to download and inspect a local copy.

While I say that the same basic tooling can reach the big data on S3 and the small data on a website, the capabilities of the respective servers matter. S3 knows how to deliver files according to queries against hive storage, whereas a website just knows how to deliver downloads of whole files.

Let's see how to tackle the smaller dataset first - building permits, by automated download from a website. The data resides in an open data portal; if you surf the site you'll see download format options. The download is automated in the notebook VancouverBuildingPermits. If you inspect the code you'll note I'm using DuckDB to help out - it is a spatially enabled SQL database and web client. You'll see the data download and conversion to feature class is very simple - the tooling infers the schema from the Parquet source plus SQL statements.

import arcpy
from arcgis.features import GeoAccessor, GeoSeriesAccessor
import duckdb
import math
import os
arcpy.env.overwriteOutput = True
url = r"https://opendata.vancouver.ca/api/explore/v2.1/catalog/datasets/issued-building-permits/exports/parquet?lang=en&refine=issuedate%3A%222024%22&timezone=America%2FLos_Angeles"
conn = duckdb.connect()
conn.sql("set force_download=true;")
conn.sql("install httpfs;load httpfs;")
conn.sql("install spatial;load spatial;")
conn.sql(f"create temp view permitsView as select * from read_parquet('{url}');")
bldgPermits = conn.sql("""select permitnumber as Permit_Number,
try_cast (permitnumbercreateddate as date) as Permit_Number_Created_Date,
try_cast (issuedate as date) as Issue_Date,
permitelapseddays::int as Permit_Elapsed_Days,
projectvalue::float as Project_Value,
typeofwork as Type_Of_Work,
address as Address,
projectdescription as Project_Description,
permitcategory as Permit_Category,
applicant as Applicant,
applicantaddress as Applicant_Address,
propertyuse as Property_Use,
specificusecategory as Specific_Use_Category,
buildingcontractor as Building_Contractor,
buildingcontractoraddress as Building_Contractor_Address,
issueyear as Issue_Year,
geolocalarea as Geo_Loc_Area,
yearmonth as Year_Month,
ST_AsText(ST_GeomFromWKB(geom)) as SHAPE
from permitsView;""")
df = bldgPermits.df()
df.spatial.set_geometry("SHAPE",sr=4326,inplace=True)
aprx = arcpy.mp.ArcGISProject("CURRENT")
gdb = aprx.defaultGeodatabase
out_fc = "Issued_Building_Permits"
location = os.path.join(gdb,out_fc)
df.spatial.to_featureclass(location,sanitize_columns=False)

If you inspect the code in the download you'll see some extra steps I used to round up text field widths to multiples of 10 - an unnecessary flourish in this case, but it shows how to do this if you expect the data to be edited in future and don't want truncation of values. So that is a simple case of ingesting a small cloud-resident file into ArcGIS with minimal control over the schema - note again that standard SQL statements are available to define your view of the cloud-sourced file. Compare this approach to defining the schema using geoprocessing in this related post.

Let's step up to global scale data using the Division_Area notebook, using data from Overture Maps Foundation.

# Get the Division_Area layer from Overture Maps Foundation
import arcpy
from datetime import datetime
import duckdb
import os
aprx = arcpy.mp.ArcGISProject("CURRENT")
homeFolder = aprx.homeFolder
gdb = aprx.defaultGeodatabase
release = "2024-07-22.0"
def getNow():
    return str(datetime.utcnow().replace(microsecond=0))
arcpy.env.overwriteOutput = True
sR = arcpy.SpatialReference(4326)
conn = duckdb.connect()
conn.sql("install spatial;load spatial;")
conn.sql("install httpfs;load httpfs;")
conn.sql("set s3_region='us-west-2';")
conn.sql("set enable_object_cache=true;")
# Make the DuckDB relation
# Insert any desired where clause, for example "country = 'GB'"
# The order-by expression helps to display small divisions on top of large ones, but is expensive
print(f"Processing starting at {getNow()}")
whereExp = "1 = 1"
sql = f"""select id,
bbox.xmin as xmin,
bbox.ymin as ymin,
bbox.xmax as xmax,
bbox.ymax as ymax,
version,
subtype,
names.primary as primary_name,
class,
region,
country,
norms.driving_side as driving_side,
geometry
from read_parquet('s3://overturemaps-us-west-2/release/{release}/theme=divisions/type=division_area/*',filename=true, hive_partitioning=1)
where {whereExp}
order by ST_Area(ST_GeomFromWKB(geometry)) desc;"""
duckDivisions = conn.sql(sql)
# Make the output feature class with precise schema control
print('Creating output feature class at {}'.format(getNow()))
arcDivisions = arcpy.management.CreateFeatureclass(out_path=gdb,
out_name="Division_Areas",
geometry_type="POLYGON",
spatial_reference=sR).getOutput(0)
arcpy.management.AddField(in_table=arcDivisions,field_name="id",field_type="TEXT",field_length=32)
for f in ["xmin","ymin","xmax","ymax"]:
    arcpy.management.AddField(in_table=arcDivisions,field_name=f,field_type="FLOAT")
arcpy.management.AddField(in_table=arcDivisions,field_name="version",field_type="SHORT")
arcpy.management.AddField(in_table=arcDivisions,field_name="subtype",field_type="TEXT",field_length=20)
arcpy.management.AddField(in_table=arcDivisions,field_name="primary_name",field_type="TEXT",field_length=100)
arcpy.management.AddField(in_table=arcDivisions,field_name="class",field_type="TEXT",field_length=6)
arcpy.management.AddField(in_table=arcDivisions,field_name="region",field_type="TEXT",field_length=10)
arcpy.management.AddField(in_table=arcDivisions,field_name="country",field_type="TEXT",field_length=2)
arcpy.management.AddField(in_table=arcDivisions,field_name="driving_side",field_type="TEXT",field_length=10)
# Write the output
print('Writing output at {}'.format(getNow()))
with arcpy.da.InsertCursor(arcDivisions,["id","xmin","ymin","xmax","ymax","version",
                                         "subtype","primary_name","class","region",
                                         "country","driving_side","shape@"]) as iCursor:
    row = duckDivisions.fetchone()
    i = 1
    while row:
        if i % 100000 == 0:
            print(f"Inserted {i} Division_Areas rows...")
        row = list(row)
        row[-1] = arcpy.FromWKB(row[-1])
        iCursor.insertRow(row)
        i += 1
        row = duckDivisions.fetchone()
del iCursor
# Repair geometries, drop null geometries
print(f"Fixing any bad geometries at {getNow()}")
arcpy.management.RepairGeometry(
in_features=arcDivisions,
delete_null="DELETE_NULL",
validation_method="OGC")
print(f"Finished at {getNow()}") The principal differences between this global scale data ingest and that for the building permits is that the source is obtained by a wildcard query on S3 hive storage, and the schema is defined manually. The tooling is the same. So that is a look at user-focused ingest of data at large and small scale, but what if you're a data owner and you want to offer infinitely flexible cloud well-known file-based data delivery to a wide catchment of users without their having to go to the trouble of building tooling? This is where the mid-scale example of political divisions for Germany comes in. The client library we're using - DuckDB - is also a storage format, but one with spatial database capabilities, including views, so for the case of cloud-sourced data it is straightforward to offer a data product that contains only a view which is read on-demand by any end user. This means, for the effort of figuring out a view definition a data product of any scale can be delivered in a tiny database file. The blog download contains such a data product - mydivisionview.duckdb - which was created by the notebook MakeViewDatabase and can be ingested by anyone using the notebook ReadViewDatabase. In this case then a file of only 268KB can deliver data of national (or even global) scale. This is the power of cloud-native well-known files read on-demand from the cloud. If the dataset source changes the same DuckDB database will always access the latest data. Below is the code for ReadViewDatabase. If you copy mydivisionview.duck db into your project hole folder it will deliver the view contents into your project default geodatabase in seconds (15 on my machine): import arcpy
from datetime import datetime
import duckdb
import os
def getNow():
    return str(datetime.utcnow().replace(microsecond=0))
print(f"Extracting data at {getNow()}")
arcpy.env.overwriteOutput = True
aprx = arcpy.mp.ArcGISProject("CURRENT")
homeFolder = aprx.homeFolder
gdb = aprx.defaultGeodatabase
sR = arcpy.SpatialReference(4326)
duckDB = os.path.join(homeFolder,"MyDivisionView.duckdb")
conn = duckdb.connect(duckDB)
conn.sql("install spatial;load spatial;")
conn.sql("install httpfs;load httpfs;")
conn.sql("set enable_object_cache=true;")
duckDivisions = conn.sql(f"""select id,
xmin::float as xmin,
ymin::float as ymin,
xmax::float as xmax,
ymax::float as ymax,
version::short as version,
subtype,
primary_name,
class,
region,
country,
driving_side,
geometry
from MyDivisionView;""")
arcDivisions = arcpy.management.CreateFeatureclass(gdb,
out_name="My_Division_Areas",
geometry_type="POLYGON",
spatial_reference=sR).getOutput(0)
arcpy.management.AddField(in_table=arcDivisions,field_name="id",field_type="TEXT",field_length=32)
for f in ["xmin","ymin","xmax","ymax"]:
    arcpy.management.AddField(in_table=arcDivisions,field_name=f,field_type="FLOAT")
arcpy.management.AddField(in_table=arcDivisions,field_name="version",field_type="SHORT")
arcpy.management.AddField(in_table=arcDivisions,field_name="subtype",field_type="TEXT",field_length=20)
arcpy.management.AddField(in_table=arcDivisions,field_name="primary_name",field_type="TEXT",field_length=100)
arcpy.management.AddField(in_table=arcDivisions,field_name="class",field_type="TEXT",field_length=6)
arcpy.management.AddField(in_table=arcDivisions,field_name="region",field_type="TEXT",field_length=10)
arcpy.management.AddField(in_table=arcDivisions,field_name="country",field_type="TEXT",field_length=2)
arcpy.management.AddField(in_table=arcDivisions,field_name="driving_side",field_type="TEXT",field_length=10)
with arcpy.da.InsertCursor(arcDivisions,["id","xmin","ymin","xmax","ymax","version","subtype",
                                         "primary_name","class","region","country",
                                         "driving_side","shape@"]) as iCursor:
    row = duckDivisions.fetchone()
    i = 1
    while row:
        #if i % 1000 == 0:
        #    print('Inserted {} My_Division_Area rows'.format(i))
        row = list(row)
        row[-1] = arcpy.FromWKB(row[-1])
        iCursor.insertRow(row)
        i += 1
        row = duckDivisions.fetchone()
del iCursor
conn.close()
print(f"Data extracted at {getNow()}") I'll finish with a few summary points: Small cloud well-known files in the cloud are easily ingested. Unlimited scale cloud well-known files in object storage are easily ingested. Any scale cloud well-known file data in any view definition can be easily shared in tiny containers.
08-20-2024
08:33 AM
|
1
|
1
|
1184
|
BLOG
|
In the Extract, Transform and Load (ETL) world you can make the mistake of focusing on data movement at the expense of data usability. As data moves and evolves, one crucial aspect that often gets overlooked is metadata. Metadata - the information about your data - plays a vital role in ensuring the context, accuracy, and integrity of your data. It also embraces the FAIR data principles (Findable, Accessible, Interoperable, and Reusable). @BruceHarold and @JordanDuft team up in this post to show how to create and maintain metadata as data moves through ArcGIS. As the use cases involve ETL, ArcGIS Data Interoperability will be used.

We'll show three use cases for metadata management:
1. Populating feature class and table metadata in ArcGIS Pro
2. Populating metadata for a hosted feature layer in ArcGIS Online
3. Updating metadata for multiple portal items in ArcGIS Online in bulk

Not shown, but discussed near the end of this post: deep copying data and metadata between environments using metadata.xml.

Let's start with an example of Case #1 - feature class and table metadata written at ETL time. To set us up, Bruce imported data from Vancouver's open data web site. The data was brought into an ArcGIS Pro project geodatabase by a spatial ETL tool (ImportBuildingPermits, in the blog download) that reads from Vancouver's open data web site. The dots on the map are today's permit locations in the feature class BuildingPermits.

Not only was the building permit data written to a feature class, it also had metadata generated by an ETL process. In the screenshot below, we are viewing the metadata for the BuildingPermits feature class in ArcGIS Pro. In the metadata we can see the item information (title, tags, summary, description, credits, use limitations).

Building permits metadata

If you're familiar with writers in spatial ETL tools you'll be aware they don't have much in the way of metadata support. However, as Data Interoperability is an ArcGIS Pro extension it knows about everything ArcGIS Pro does - including ArcPy's metadata module, and that is where we go for this support!

To back up a little bit, the Workbench app delivered by Data Interoperability has a feature called a shutdown script (there is also a startup script), which lets you run Python code with any installed interpreter on completion of any tool run. For the subject matter data, we leverage this feature to do two things the built-in writers don't know about:
Creating relationship classes between output feature types
Creating metadata for each output (yay!)

The only trick to using this feature is knowing that the shutdown script has a built-in dictionary object (fme.macroValues) with keys for all tool parameters, including input and output data paths. The rest, as they say, is easy! We built the script starting from snippets from the Copy Python Command option in ArcGIS Pro's History pane. Here is the shutdown script:

# Create relationship classes between BuildingPermits and child tables PropertyUse and SpecificUseCategory
import arcpy
from datetime import datetime
import fme
import os
import pytz
arcpy.env.overwriteOutput = True
gdb = fme.macroValues['OutputGDB']
arcpy.env.workspace = gdb
origin = "BuildingPermits"
for destination in ["PropertyUse","SpecificUseCategory"]:
    arcpy.management.CreateRelationshipClass(
        origin_table=origin,
        destination_table=destination,
        out_relationship_class=origin + '_' + destination,
        relationship_type="SIMPLE",
        forward_label=destination,
        backward_label=origin,
        message_direction="NONE",
        cardinality="ONE_TO_MANY",
        attributed="NONE",
        origin_primary_key="PermitNumber",
        origin_foreign_key="PermitNumber")
# Create layer level metadata
current_time_utc = datetime.now(pytz.utc)
pst = pytz.timezone('Canada/Pacific')
current_time_pst = current_time_utc.astimezone(pst)
current_time_pst_formatted = current_time_pst.strftime('%Y-%m-%d %H:%M:%S')
descriptions = {"BuildingPermits":"Construction projects and any change of land use or occupancy on private property requiring a building permit",
"PropertyUse":"General use of property; multiple uses will be accessible in a 1:M lookup",
"SpecificUseCategory":"Category of property use; multiple categories will be accessible in a 1:M lookup"}
for obj in ["BuildingPermits","PropertyUse","SpecificUseCategory"]:
    new_md = arcpy.metadata.Metadata()
    name = obj.replace("gP","g P").replace("yU","y U").replace("cU","c U").replace("eC","e C")
    new_md.title = f"{name} of Vancouver, Canada 2024."
    new_md.tags = f"Demo,Esri,City of Vancouver,Canada,applications,{obj},{name}"
    new_md.summary = f"Layer includes information of all {name} issued to date by the City of Vancouver in 2024"
    new_md.description = descriptions[obj]
    new_md.credits = f"City of Vancouver, {current_time_pst_formatted} Pacific."
    new_md.accessConstraints = "https://opendata.vancouver.ca/pages/licence/"
    tgt_item_md = arcpy.metadata.Metadata(obj)
    tgt_item_md.copy(new_md)
    tgt_item_md.save()

Taking the time to embed metadata automation at ETL time ensures, at minimum, that you don't forget it after the fact, and is a best practice. Why? To:
Ensure that essential information is documented for your data.
Help consumers of your data understand its context and how it can be used in future mapping and analysis.
Enhance search and discovery of your data - making it easier for people to find.

You'll have noticed in the map and metadata screen grab above that Bruce used the unique value renderer to display the data in categories (type of permitted work) and configured pop-ups. He then published the layer and related tables to ArcGIS Online - the target information product. At publication time the metadata for each layer flows to the sublayers of the hosted feature layer, like this - and here it is in the BuildingPermits layer:

Sublayer metadata flows through

Because metadata flows from the data to the published sublayers of the hosted feature layer, this saves Bruce the trouble of recreating it.

So that's how to get metadata into your information product at ETL time and why it's important to do so. However, there is another level of metadata we need - for the hosted feature layer itself - so we'll move on to Case #2: populating metadata for a hosted feature layer using ETL tools. Below is the details page of the hosted feature layer in ArcGIS Online:

Item home page showing metadata

Note, the metadata for the hosted feature layer looks amazing!! However, Bruce must confess something… he doesn't enjoy creating metadata. Therefore, he didn't write a line of the summary, description, tags or terms of use metadata on the item details page. So where did he get the metadata? How did he populate the item information on the item details page? Using magic? (…maybe, depending on how you define magic!)

This metadata was harvested straight from the source - Vancouver's portal. Below is the landing page, with datasets sorted on popularity (which we may be skewing 😉). We highlight the dataset we are harvesting, and if you look near the bottom we also highlight a link to an Excel spreadsheet with all dataset metadata.

Open Data Landing Page

Opening the spreadsheet we can see the row for the subject matter dataset - Issued building permits.

Dataset metadata spreadsheet

It is just a matter of getting the cell values for metadata elements from the Excel spreadsheet to the hosted feature layer - no problem - that's the sort of thing this technology was built for. Here is where that happens, the RefreshBuildingPermits ETL tool (in the blog download):

RefreshBuildingPermits ETL tool

Without making this a tutorial on ETL tool design, the steps we're using for our information product are to:
1. Make an initial file geodatabase incarnation of the data, using the ETL tool ImportBuildingPermits that has the shutdown script shown above.
2. In ArcGIS Pro, create the layer with the symbology and pop-up behavior we want.
3. Publish the feature class and related tables as a hosted feature layer to ArcGIS Online.
4. Maintain the hosted feature layer using the RefreshBuildingPermits ETL tool, which means we put the hosted feature layer's metadata update functions in this second ETL tool.

If you look bottom right you'll see a bookmark, Update Metadata. The FeatureWriter transformer handling the actual permit data has a summary port; if any data is written, a hidden connector (hidden for aesthetic reasons) triggers a FeatureReader which accesses the Excel file on the open data site and delivers the right cell contents to an ArcGISOnlineConnector (which also works with ArcGIS Enterprise) with the following settings:

Set metadata details with ArcGISOnlineConnector

So it's as easy as filling in the form! If there are no data changes there are no metadata changes; otherwise there may be. How easy is that? You now know how to harvest and write metadata using ETL tools!

Note: The ArcGISOnlineConnector is downloadable manually, or will download automatically if you type the transformer name into your Workbench canvas while connected to the internet.

Lastly, the third leg to the stool: Case #3, making bulk metadata changes to any number of portal items. Let's say we manage an ArcGIS Online or ArcGIS Enterprise organization with hundreds of items and we need to change our terms of use for all hosted feature layers in a content folder.

This is what we have now:

This is what we want:

This is how to effect the bulk change - it is crazy simple with another ETL tool: an ArcGISOnlineConnector used to list the candidate portal items, a Tester to filter for hosted feature layers, then another ArcGISOnlineConnector to update the target element. If no value is supplied for an item it is left unchanged; to zero it out you would supply a value of NULL.

Now, we mentioned above that another ETL pattern of interest is deep copying of data and metadata between ArcGIS environments, for example between ArcGIS Enterprise and ArcGIS Online. There are core approaches for this use case in which metadata travels with the data, as you would expect, but there are also use cases where you wish to perform edits on the data and/or metadata during the ETL process. In this situation you might work with the metadata.xml file for each dataset. This can be accomplished with an option in the Get Details and Set Details actions in the ArcGISOnlineConnector to get and set the value of metadata.xml.

Move metadata between portal environments

So that's it, managing metadata end to end using ArcGIS Data Interoperability!
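For comparison only, the same bulk terms-of-use change could also be scripted with the ArcGIS API for Python; a rough sketch follows (the folder name and terms text are placeholders, and this is not the no-code ArcGISOnlineConnector workflow shown above):

from arcgis.gis import GIS

gis = GIS("home")  # the active portal connection in ArcGIS Pro or a notebook
new_terms = "Placeholder terms of use text"
# Walk the items in one content folder and update only hosted feature layers
for item in gis.users.me.items(folder="My Content Folder"):
    if item.type == "Feature Service":
        item.update(item_properties={"licenseInfo": new_terms})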
08-09-2024
12:37 PM
|
3
|
2
|
1606
|
BLOG
|
At the simple end of the ETL Patterns continuum is the most common format of them all - CSV, or comma separated values - and its close cousin, the Excel spreadsheet. Vast troves of data move around in these formats - easily shared, human readable - what's not to like in ArcGIS? Well, these things!
No proper schema built into the format
Column names often not valid in ArcGIS
Text fields with numeric values are treated as numeric
Text field widths are assumed to be 8000 (CSV) or 255 (Excel) bytes
Integer fields may be treated as double precision float
Null values may be encoded with zero or another unlikely value
Geometry is often encoded in a format unsupported for feature class creation

Take this popular dataset from data.gov, mirrored from Washington's open data portal. First, the data as a map:

Washington EV Population

Now the source CSV file:

CSV or Excel Data Issues

In red I'm picking out some of these usual suspect violations. Column names have spaces or metacharacters, USPS postcodes can have leading zeros so should be text, zeroes have been used to encode nulls in the Electric Range and Base MSRP columns, Vehicle Location is in WKT format, and 2020 Census Tract has big integers and a name beginning with a number. The subject dataset has velocity - it is regularly updated - so if you're interested in using it, or any of the thousands like it, you'll want to automate handling these issues with geoprocessing.

First, some research! In the blog download is a script tool I used to scan the CSV file for maximum data widths; here is the code:

import arcpy

tbl = arcpy.GetParameterAsText(0)
d = arcpy.da.Describe(tbl)
flds = [f.name for f in d['fields'] if f.type == 'String']
mDict = {f:0 for f in flds}
with arcpy.da.SearchCursor(tbl,flds) as cursor:
    for row in cursor:
        for f in flds:
            if row[flds.index(f)]:
                mDict[f] = max(mDict[f],len(row[flds.index(f)]))
for k in mDict.keys():
arcpy.AddMessage(f"""Input table '{tbl}' field '{k}' has maximum data width '{mDict[k]}'""") This yields: Maximum Text Field Widths Ignoring the Postal Code column, which local knowledge tells me is 5 characters wide, I can now see the widths required per text field. I'm now in a position to design a schema that correctly handles the subject matter file. I chose ModelBuilder to persist my processing as it can wrap core geoprocessing and Python functions I plan to use. The finished model is this, which we'll walk through. EVPopulation Model The first order of business is to impose a designed schema. This is done using the field map control in Export Table, basically copying the data into an in-memory table in the schema you want. The Field Map control lets you rename, cast and even construct fields very flexibly. In this case there is only one source field for each output field, so the Action is always First, and only the field properties are manipulated. Field Map Downstream of Export Table the output feature class is created using the in-memory table as a template, rows (still with no geometry) are appended to it, then geometry is created from the WKT column and the two fields with zero-encoded nulls are fixed with Calculate Field tools (note the conditional processing). Lastly, the Vehicle Location field is dropped as surplus to requirements. Are we done? No! Best practice is full automation, and this data is coming from the web. The URL used for file download isn't valid as a geoprocessing Text File input like the model wants. There is also a risk that a user might use an input CSV file of a different name than Electric_Vehicle_Population_Data.csv, which would trigger the field map in Export Table to revert to default field handling, undoing all our schema design work. To handle both web data input and consistent file name input the model EVPopulation is wrapped in a parent model, URL2EVPopulation: URL2EVPopulation Web Download First a Calculate Value model tool downloads the current data to a consistently named file of Text File parameter type, then the EVPopulation model is called to process the data. See how a very simple Python snippet takes care of automation? 😉 Now we have automated CSV file handling! I stretched the concept of "local well-known files" a little by reaching out to the web for data, but these days files across the internet are local. I can schedule or manually run URL2EVPopulation any time I like. The models and script tool are in the blog download.
08-05-2024
09:59 AM
|
2
|
0
|
682
|
BLOG
|
SQL powered ETL. How hard can it be? You know COPY in PostgreSQL and how to stage data in cloud warehouses, right? How about downloading 51 million road segments from a cloud object store into your Pro project?

Road segments in the USA

Or the same data for France?

Parisian transport segments from Overture via SQL

In the world of data engineering, the process of Extract, Transform, Load (ETL) can often be a daunting task, especially when dealing with complex columnar data from cloud-native sources. However, with the right tools and techniques, it can be simplified significantly. This blog post delves into using SQL to streamline ETL processes, specifically focusing on handling complex georelational data types. By leveraging the power of SQL, along with tools like DuckDB and ArcGIS, we can efficiently manage and transform large datasets, such as those provided by the Overture Maps Foundation. Whether you're dealing with GeoParquet, JSON, or CSV files, this guide will show you how to unpack and normalize complex data structures, making your ETL tasks more manageable and less time-consuming.

You're probably aware of GeoParquet as an upcoming format, and this post uses the format as subject matter. If it's possible to inject complex objects into columnar data, someone will do it. I suppose the rationale is to supply a rich data model in a single row and a complete dataset in a single file, or set of files of the same schema. However, it puts pressure on us ETL folks to unpack it into a normalized schema.

Now to throw a little light on the subject matter, look closely at the map above (click to enlarge). The linear features are transport segments of many types (road, rail, and water types, with many subtypes of road) from the Overture Maps Foundation transportation theme. Overture delivers global scale data as GeoParquet files in S3 and Azure blob storage with anonymous access. The parquet files are partitioned into chunks of about 1GB using a common schema per theme. The segment features for an area of interest can be split across any number of parquet files; random spatial access is facilitated by a bounding box property on the data. This is big data - you don't want to download it all to get going.

In the lower right-hand area, you'll see a yellow feature heading southeast out of Paris; it's the E54 route that makes its way to Mulhouse on the German border. Here is a smaller-scale view:

E54 from Paris to Mulhouse

Now you can see I have lots of data – millions and millions of transport segments in France. While the source data is of a global scale, I just extracted features in France. Not shown are many more millions of various event types - road surface, condition flags, access restrictions, speed zones, turn restrictions, routes, and so on. All are imported to ArcGIS from GeoParquet in an S3 bucket using SQL. Full disclosure, the ETL is given a leg up by ArcPy, but nothing that will surprise you. Processing uses notebooks in ArcGIS Pro. If you wanted, these could be hosted notebooks in ArcGIS Online or ArcGIS Enterprise.

My invaluable helper in all this is the DuckDB package, supplying connectivity to web data sources, remote query capability for S3-API-compliant object stores, a performant local database, and above all, SQL support for files, including GeoParquet, plus spatial SQL! Who knew files could behave like DBMS objects?!

Before we go further, what is one of these complex objects I'm talking about, I hear you say? Here is one:

{
"speed_limits": [
{
"is_max_speed_variable": true,
"between": [
0,
0.025818175
]
},
{
"max_speed": {
"value": 120,
"unit": "km/h"
},
"is_max_speed_variable": true,
"between": [
0.025818175,
0.125278098
]
},
{
"max_speed": {
"value": 100,
"unit": "km/h"
},
"is_max_speed_variable": true,
"between": [
0.125278098,
0.219762685
]
},
{
"max_speed": {
"value": 80,
"unit": "km/h"
},
"is_max_speed_variable": true,
"between": [
0.219762685,
0.831591278
]
},
{
"is_max_speed_variable": true,
"between": [
0.831591278,
1
]
}
]
}

The above is a struct column value from a single transport segment that defines speed zones along the segment. There are five zones for the segment and four have a maximum speed property. They could also have a minimum speed property, or conditional properties pertaining to heavy vehicles, axle count, time of day, and so on. The 'between' array is a pair of linear event coordinates that ArcGIS can display given a suitable route system (which we'll build). You're going to see how to unravel all this with SQL.

My data is the Overture Maps Foundation transportation theme at the 2024-08-20.0 release. Much of Overture's data is out of beta, but the transportation theme at writing remains in beta, so while you may edit the notebooks to use a later release, there may be schema changes which break the notebooks.

Where do you start? You start by seeing what you've got in the data source. Here, I am doing that in a utility notebook:

Inspecting the segments theme

The important things in the upper cell are importing duckdb (you'll need to clone your environment and install it from Conda Forge or PyPi) and creating a connection (which could be just to memory - no database path). The lower cell describes my subject matter data by reading the S3 hive. By inspection, I can see ordinary columns that are VARCHAR, but one, 'connector_ids', is an array, signified by the [] brackets at the end. There are a bunch of these structs, which I think of as JSON dictionaries. Frustratingly, the column_type details do not line-wrap, so a little Python to the rescue, and we can see struct columns can also be arrays, as we see in speed_limits above (a generic sketch of this describe step appears a little further below).

Showing the full schema

So, having inspected the segment schema documentation and sample data, we can see the basic shape of it: segment linestring features (with unique key 'id'), WKB geometry, a bounding box struct, some descriptive fields, plus a set of array and struct fields we'll have to unpack into linear events. In classic divide and conquer fashion, I split the job into tractable chunks with a view to making it easy for you to modify for your area of interest.

To get started, add the blog download notebooks to a Pro 3.3+ project; I created a new project. I'm guessing most people will have an area of interest, not the whole planet, so I based my processing around this. You are free to extract the whole planet but it may take some time. 😉

Overture includes a political divisions dataset named division_area. The notebook that extracts this data is DivisionArea. Open and run DivisionArea, and inspect the output.

DivisionArea features

Identify some features in your area of interest. Determine a layer definition query ('where' clause) that defines the total set of features within which you want to extract transportation theme data from Overture. For example, my area of interest is the country of France:

country = 'FR' and subtype in ('country') and names.primary in ('France')

You don't need to apply the layer definition query; it's the SQL we want. You can select any number or combination of DivisionArea features, disjoint or not.

The next step is to extract the segments data from S3 to a local project DuckDB database. That will save going back to the S3 bucket again for data. Open the Segments notebook. Edit the line for wktRelation to use your where clause (you'll see a few samples I left behind). Save and run the notebook. Depending on the feature count in your area of interest it may take some time to extract the data.
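As promised above, here is a sketch of an equivalent schema-inspection query (the release and hive path are assumptions based on the 2024-08-20.0 transportation theme; the utility notebook in the download is the authoritative version):

import duckdb

conn = duckdb.connect()
conn.sql("install httpfs;load httpfs;")
conn.sql("set s3_region='us-west-2';")
# Describe the remote GeoParquet schema without downloading the data
schema = conn.sql("""describe select *
from read_parquet('s3://overturemaps-us-west-2/release/2024-08-20.0/theme=transportation/type=segment/*', hive_partitioning=1)
limit 1;""").df()
# Print the full column_type values, which do not line-wrap in the notebook grid
for name, ctype in zip(schema["column_name"], schema["column_type"]):
    print(name, ctype)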
The notebook also makes a route system. In my case it took about 50 minutes for all metropolitan France, including making a Connectors feature class, which you may consider optional.

Now that you have your transport segments table in DuckDB you can run the rest of the notebooks, in any order, to make your transport information product - the various events on the segments. Here is a simple one, for road condition flags, in the Flags notebook:

Flags notebook

If you do a little reading on DuckDB you'll see it includes the GDAL library, which has the OpenFileGDB driver, so it's possible to use a simpler approach to write out the data. However, while this automatically derives a schema, it defaults text fields to 64K width and appears not to offer the OPENFILEGDB_DEFAULT_STRING_WIDTH configuration parameter, so I defined the schema manually with ArcPy. To find suitable text field widths I used a two-stage process: first writing with very large widths, then checking the actual data size using a simple script tool (code below, atbx included in the blog download) to refine the text widths:

import arcpy

tbl = arcpy.GetParameterAsText(0)
d = arcpy.da.Describe(tbl)
flds = [f.name for f in d['fields'] if f.type == 'String']
mDict = {f:0 for f in flds}
with arcpy.da.SearchCursor(tbl,flds) as cursor:
    for row in cursor:
        for f in flds:
            if row[flds.index(f)]:
                mDict[f] = max(mDict[f],len(row[flds.index(f)]))
for k in mDict.keys():
arcpy.AddMessage(f"""Input table '{tbl}' field '{k}' has maximum data width '{mDict[k]}'""") You'll see in the notebooks I may use multiple UNNEST steps on each array and struct. This powerful function does what it suggests and flattens the data - arrays into new rows and structs into new columns. However, if a row has a null value for an array or struct then the unnest function does not emit the row so you'll see UNION clauses to bring all rows through, for example in the AccessRestrictions notebook. # when_using
conn.sql("""create or replace temp view restrictions_view3 as select
id,
access_type,
when_during,
when_heading,
unnest(when_using) as when_using,
when_recognized,
when_mode,
vehicle,
begins,
ends
from restrictions_view2 where when_using is not null
union all by name
select * exclude (when_using) from restrictions_view2 where when_using is null;""") If 'when_using' is null we still want the row because we have yet to unpack 'when_recognized' and 'when_mode'. The exclude term prevents a column type error between VARCHAR values unnested from 'when_using' and STRUCT values in 'when_using' before unnesting. All the notebooks except ProhibitedTransitions create event tables you get on the map as linear events along the SegmentsM route system, using the begins and ends fields as from-measure and to-measure values. See the Make Route Event Layer geoprocessing tool. ProhibitedTransitions creates line features with a transition_order sequence flag. That's it, bulk data delivered from a cloud object store with the help of SQL! If you look back at the problem before starting, it looked very daunting, something for some serious developer time. Remote GeoParquet files in an S3 hive, complex column types with nested data, geometric scoping using linear referencing – all these problems go away when you throw the right tools at them. The star of the show though is SQL, making complex ETL simpler. Notes: I had some feedback that prompts me to add this clarification: The order in which to implement this sample is: Download the blog attachment to a project home folder, preferably a new project Install the DuckDB package if you have not already, and activate the environment This is called python-duckdb in Conda Forge, duckdb in PyPi Run the DivisionArea notebook and inspect the output feature layer Figure out the where clause you need to select DivisionArea features for your area of interest Plug the where clause into the Segments notebook Run the Segments notebook This will create a segments.duckdb local database Run the other notebooks These depend on reading segments.duckdb
07-30-2024
10:53 AM
|
3
|
0
|
1116
|
IDEA
|
Rob, can you share a link to the issues for the field map view? Thanks.
07-10-2024
12:59 PM
|
0
|
0
|
681
|