
Best practices for uploading big data to ArcGIS Online

10-18-2023 06:08 AM
DavidLovesArcGIS
New Contributor II

Dear ArcGIS community,

We are in the process of migrating from PostGIS + QGIS to ArcGIS Online, and we are wondering what the best practices are for uploading large columnar datasets to ArcGIS Online. The data is available as local files, a PostGIS server (PostgreSQL), a cloud data warehouse (Snowflake), and files on binary storage (Azure Blob). We have multiple datasets of roughly 10 million records each that we would like to make available in maps and applications in ArcGIS Online. These datasets receive about 20,000 inserts/updates/deletes per day.

Currently we are considering these options:

- Data warehouse via ETL with Python to the ArcGIS Online REST API (see the sketch after this list)

- Data warehouse query layer via web layer in ArcGIS Pro to ArcGIS Online

- File from blob storage via web layer in ArcGIS Pro to ArcGIS Online
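
For the first option, here is a minimal sketch of the extract step, assuming the daily changes are tracked in a change-log table in Snowflake. The account, credentials, table, and column names below are placeholders, not our actual schema:

```python
# Hedged sketch of the extract step, assuming a change-log table in Snowflake.
# Account, credentials, table and column names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",     # placeholder account identifier
    user="etl_user",
    password="***",
    warehouse="ETL_WH",
    database="GIS",
    schema="PUBLIC",
)

cur = conn.cursor()
cur.execute(
    """
    SELECT id, lon, lat, status, change_type   -- change_type: INSERT / UPDATE / DELETE
    FROM CHANGE_LOG
    WHERE changed_at >= DATEADD(day, -1, CURRENT_TIMESTAMP())
    """
)
changes = cur.fetchall()   # list of (id, lon, lat, status, change_type) tuples
cur.close()
conn.close()
```

The same kind of query could run against PostGIS instead; the point is only to pull the rows that changed since the last run rather than the full 10 million records.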

We are looking into advantages and disadvantages of these options, and will post our findings here later too. Does anybody else have some experience with uploading big data to ArcGIS Online?

8 Replies
BethanyScott
Esri Contributor

Hi @DavidLovesArcGIS ,

Thank you for your question.

I'd like to propose another option to try out. Data Pipelines (currently in beta) is a new data integration application available in ArcGIS Online. Data Pipelines can connect to and read from a variety of external data sources, including Snowflake and Azure Blob, and can write that data out as hosted feature layers or tables that are readily available in your content.

One advantage of using Data Pipelines is that, unlike the other options you've listed, Data Pipelines is a no-code solution that can accomplish the workflow right in ArcGIS Online (no need for ArcGIS Pro, Python, or other software and scripting). Additionally, in the ArcGIS Online update coming next week, there will be a new feature for scheduling data pipeline workflows - this means you'd be able to create an automated schedule for your daily inserts and updates.

I will note that Data Pipelines is not explicitly designed to be a big data solution; there are some limits with the size of data that can be written (particularly when the data contains complex polygon geometries). However, I do think it's worth trying!

Here are some resources to get started with Data Pipelines:

  1. Tutorial outlining how to access and use Data Pipelines
  2. How to connect to Snowflake
  3. How to connect to Azure Blob
  4. Requirements to access Data Pipelines

Try it out and if you have any questions or run into any blocks, please let us know with a post in the Data Pipelines Questions forum. The Data Pipelines team monitors this closely and we are happy to help. 🙂

Thank you,

Bethany

DavidLovesArcGIS
New Contributor II

Hey Bethany,

Thanks for your suggestion to look into Data Pipelines. Really nice tool by Esri. The GUI, inputs, transformers, and outputs are very easy to use. It's perfect for easily uploading (big) data a single time.

Some suggestions for improving Data Pipelines so that it can also be used to upload data repeatedly:

  • Scheduling or triggering data pipelines is not available yet (https://community.esri.com/t5/data-pipelines-ideas/run-a-data-pipeline-on-a-schedule/idi-p/1299574/j...). So we can't use Data Pipelines for automatic updates yet.
  • Automatic creation of pipelines via an API is not possible. We have 150 data pipeline definitions that could be turned into 150 data pipelines if an API were available; we would prefer not to create them manually in the Data Pipelines GUI.
  • There is no automatic monitoring/alerting for data pipelines. If scheduled runs were possible, we would want to monitor all runs of the data pipelines. Preferably a REST API would return run success/failure results, but a graphical user interface would work too.
  • No incremental update mechanism yet (?): Data Pipelines works well for full loads that ingest all records, but we could not find a mechanism to only insert/update/delete changed records in the ArcGIS Online layer. Uploading the full dataset every time makes it more costly than uploading incrementally.

We are still looking into comparing AGOL REST API and Snowflake Query Layer.

BethanyScott
Esri Contributor

Hi @DavidLovesArcGIS ,

We appreciate your feedback. Here's some more information in case it is helpful:

  • You can update existing records by specifying the output feature layer method of Add and update; this will append new records and update existing records. Note that you'll want to use the Unique identifier parameter to update existing records, and you may need to set that field as unique in the feature layer item page. To delete existing records, the only option currently is to use the Replace output method. This will completely truncate your existing feature layer and write to it with the data from the source. See the doc here for more information on Add and update and Replace.
  • Scheduling will be available next week. I will post here with some documentation and a quick blog with a video when it's available so you can check it out.
  • You're right, there is no API or alerting for Data Pipelines yet. These features will likely come in a future release, but there is no promised timeline yet. We will take your feedback into account for planning and prioritization.

Let us know if you try out Add and update or Replace and whether it can work for you or if you need something more.

Thanks again for your feedback. Someone from the product management team will be reaching out to you to learn more about your workflow, feedback, and ideas.

Bethany

DavidLovesArcGIS
New Contributor II

Very helpful information. We had missed the Add and update feature. Good to know that incremental updates are available and that scheduling will be available soon!

DavidLovesArcGIS
New Contributor II

We looked at several options:

| Option | Advantages | Disadvantages |
|---|---|---|
| Data warehouse (Snowflake) query layer | Direct connection from ArcGIS to Snowflake (our main data platform). | Not scalable for millions of records according to Esri documentation; needs ArcGIS Pro for the ETL and still duplicates data in ArcGIS Online. |
| Data Pipelines (ArcGIS Online product) | Easy-to-use standalone ETL tool; integrates easily with multiple data sources (e.g. Snowflake/Blob); scheduling built in. | Still in beta (as of November 2023); no monitoring or API integration yet. |
| ArcGIS Online REST API via Python | Flexible integration via API; scalable. | More code and ETL work needed. |

We chose the ArcGIS Online REST API since we required scalability and flexible integration via an API. If you don't have your own ETL and/or Python developers, the Data Pipelines product is probably the way to go for uploading big data to ArcGIS Online on a schedule.
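
For anyone considering the same route, below is a minimal sketch of the load step using the ArcGIS API for Python, which wraps the REST applyEdits endpoint. The item ID, field names, and the shape of the `changes` list are placeholders rather than our production code:

```python
# Hedged sketch of the load step, using the ArcGIS API for Python
# (arcgis package), which calls the REST applyEdits endpoint under the hood.
# The item ID, field names, and `changes` records are placeholders.
from arcgis.gis import GIS

gis = GIS("https://www.arcgis.com", "etl_user", "***")
layer = gis.content.get("<feature-layer-item-id>").layers[0]

# (id, lon, lat, status, change_type) tuples from the warehouse extract
changes = []

def lookup_objectid(source_id):
    """Find the OBJECTID of the feature carrying our own key (placeholder field names)."""
    fs = layer.query(where=f"source_id = {source_id}", out_fields="OBJECTID")
    return fs.features[0].attributes["OBJECTID"] if fs.features else None

adds, updates, deletes = [], [], []
for rec_id, lon, lat, status, change_type in changes:
    feature = {
        "attributes": {"source_id": rec_id, "status": status},
        "geometry": {"x": lon, "y": lat, "spatialReference": {"wkid": 4326}},
    }
    if change_type == "INSERT":
        adds.append(feature)
    elif change_type == "UPDATE":
        oid = lookup_objectid(rec_id)
        if oid is not None:
            feature["attributes"]["OBJECTID"] = oid
            updates.append(feature)
    else:  # DELETE
        oid = lookup_objectid(rec_id)
        if oid is not None:
            deletes.append(str(oid))

# edit_features maps to applyEdits: adds/updates as feature dicts,
# deletes as a comma-separated string of object IDs.
result = layer.edit_features(
    adds=adds,
    updates=updates,
    deletes=",".join(deletes) if deletes else None,
)
print(result)
```

Updates and deletes have to be matched on OBJECTID, so you need a lookup against your own key field; for larger batches you would query the OBJECTIDs in bulk rather than per record, and chunk the edits to stay within request size limits.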

avillamo
New Contributor II

You can use the Data Interoperability extension. You don't need to consume credits as you do with Data Pipelines, and you can have multiple workspaces and run them from your computer. ArcGIS Data Interoperability Extension | Get Data to Work in Your Workflows (esri.com)

DavidLovesArcGIS
New Contributor II

That is indeed another option to run locally without credits, but then one person is responsible for running those pipelines from their laptop. If that person is on holiday, their laptop breaks, or they leave the company, then the data won't be updated in ArcGIS Online anymore. So that would be a no-go for us.

avillamo
New Contributor II

You can have a notebook in AGOL, or you can run the workspace on a server; you only need to have ArcGIS Pro installed there. If you have enough annotations in the workspace, anyone can understand what happens. You can use an admin account for this purpose.