Dear ArcGIS community,
We are in the process of migrating from PostGIS + QGIS to ArcGIS Online and are wondering about best practices for uploading large columnar datasets to ArcGIS Online. The data is available as local files, a PostGIS server (PostgreSQL), a cloud data warehouse (Snowflake), and files on binary storage (Azure Blob). We have multiple datasets of roughly 10 million records each that we would like to make available in maps and applications in ArcGIS Online. These datasets receive about 20,000 inserts/updates/deletes per day.
Currently we are considering these options:
- Data warehouse via ETL with Python to ArcGIS Online REST API
- Data warehouse query layer via web layer in ArcGIS Pro to ArcGIS Online
- File from blob storage via web layer in ArcGIS Pro to ArcGIS Online
We are looking into advantages and disadvantages of these options, and will post our findings here later too. Does anybody else have some experience with uploading big data to ArcGIS Online?
We looked at several options:
| Option | Advantages | Disadvantages |
| --- | --- | --- |
| Data warehouse (Snowflake) query layer | Direct connection from ArcGIS to Snowflake (our main data platform). | Not scalable for millions of records according to Esri documentation; needs ArcGIS Pro to do the ETL and still duplicates data in ArcGIS Online. |
| Data Pipelines (ArcGIS Online product) | Easy-to-use standalone ETL tool; integrates easily with multiple data sources (e.g. Snowflake/Blob); scheduling built in. | Still in beta (as of November 2023); no monitoring or API integrations yet. |
| ArcGIS Online REST API via Python | Flexible integration via API; scalable. | More code and ETL work needed. |
We chose the ArcGIS Online REST API because we required scalability and flexible integration via API to upload big data to ArcGIS Online on a schedule. If you don't have your own ETL and/or Python developers, the Data Pipelines product is probably the way to go.
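For the REST API route we ended up with, a minimal sketch of what the daily ETL could look like: batch the edits and POST them to the hosted feature layer's `applyEdits` endpoint. The layer URL, token handling, and record fields below are illustrative assumptions, not something from this thread; `applyEdits` itself is a documented ArcGIS REST API operation.

```python
# Sketch: apply daily inserts/updates/deletes to a hosted feature layer
# through the ArcGIS Online REST API. LAYER_URL and the feature fields are
# hypothetical placeholders.
import json
import urllib.parse
import urllib.request

# Hypothetical hosted feature layer endpoint:
LAYER_URL = "https://services.arcgis.com/<org>/arcgis/rest/services/<layer>/FeatureServer/0"

def chunked(items, size=1000):
    """Split a list of edits into batches; with ~20,000 edits per day it is
    safer to send several smaller applyEdits requests than one huge one."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def build_payload(adds=(), updates=(), delete_ids=(), token=""):
    """Build the form parameters for one applyEdits POST request.
    adds/updates are feature dicts; deletes is a comma-separated OID list."""
    return {
        "f": "json",
        "token": token,
        "adds": json.dumps(list(adds)),
        "updates": json.dumps(list(updates)),
        "deletes": ",".join(str(i) for i in delete_ids),
    }

def apply_edits(payload):
    """POST one batch of edits (network call -- not exercised here)."""
    data = urllib.parse.urlencode(payload).encode()
    with urllib.request.urlopen(f"{LAYER_URL}/applyEdits", data=data) as resp:
        return json.loads(resp.read())

# Example feature in the JSON shape applyEdits expects (hypothetical fields):
feature = {
    "attributes": {"asset_id": 42, "status": "active"},
    "geometry": {"x": 5.12, "y": 52.09, "spatialReference": {"wkid": 4326}},
}
```

A scheduler (e.g. a cron job or an orchestration tool) would then call `apply_edits(build_payload(...))` per batch once a day.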
Hi @DavidLovesArcGIS ,
Thank you for your question.
I'd like to propose another option to try out. Data Pipelines (currently in beta) is a new data integration application available in ArcGIS Online. Data Pipelines can connect to and read from a variety of external data sources, including Snowflake and Azure Blob, and can write that data out as hosted feature layers or tables that are readily available in your content.
One advantage of using Data Pipelines is that, unlike the other options you've listed, Data Pipelines is a no-code solution that can accomplish the workflow right in ArcGIS Online (no need for ArcGIS Pro, Python, or other software and scripting). Additionally, the ArcGIS Online update coming next week will add a feature for scheduling data pipeline workflows, which means you'd be able to create an automated schedule for your daily inserts and updates.
I will note that Data Pipelines is not explicitly designed to be a big data solution; there are some limits with the size of data that can be written (particularly when the data contains complex polygon geometries). However, I do think it's worth trying!
Here are some resources to get started with Data Pipelines:
Try it out and if you have any questions or run into any blocks, please let us know with a post in the Data Pipelines Questions forum. The Data Pipelines team monitors this closely and we are happy to help. 🙂
Thank you,
Bethany
Hey Bethany,
Thanks for your suggestion to look into Data Pipelines. Really nice tool by Esri. The GUI, inputs, transformers, and outputs are very easy to use. It's perfect for uploading (big) data easily as a one-time load.
Some remarks for improvement of Data Pipelines to make it possible to upload data multiple times:
We are still comparing the AGOL REST API and the Snowflake query layer.
Hi @DavidLovesArcGIS ,
We appreciate your feedback. Here's some more information in case it is helpful:
Let us know if you try out Add and update or Replace and whether it can work for you or if you need something more.
Thanks again for your feedback. Someone from the product management team will be reaching out to you to learn more about your workflow, feedback, and ideas.
Bethany
Very helpful information. We missed the Add and update feature. Good to know that incremental updates are available and that scheduling will be available soon!
You can use the Data Interoperability extension. You don't need to consume credits as with Data Pipelines, and you can have multiple workspaces and run them from your own computer. ArcGIS Data Interoperability Extension | Get Data to Work in Your Workflows (esri.com)
That is indeed another option for running locally without credits, but then one person is responsible for running those pipelines from their laptop. If that person is on holiday, their laptop breaks, or they leave the company, the data won't be updated in ArcGIS Online anymore. So that would be a no-go for us.
You can have a notebook in AGOL, or you can run the workspace on a server; you only need to have ArcGIS Pro installed there. If you have enough annotations in the workspace, anyone can understand what happens. You can use an admin account for this purpose.
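For the AGOL-notebook route, here is a minimal sketch of a scheduled update using the ArcGIS API for Python (the `arcgis` package, preinstalled in ArcGIS Online notebooks). The item ID and source-record fields are assumptions for illustration; `FeatureLayer.edit_features` and `GIS("home")` are real parts of that API.

```python
# Sketch of a scheduled update run from an ArcGIS Online notebook, using the
# ArcGIS API for Python. The item ID and source-record fields are hypothetical.

def to_feature(record):
    """Convert one source record (a plain dict) into the feature JSON
    structure that FeatureLayer.edit_features expects."""
    return {
        "attributes": {"asset_id": record["id"], "status": record["status"]},
        "geometry": {
            "x": record["lon"],
            "y": record["lat"],
            "spatialReference": {"wkid": 4326},
        },
    }

def run_update(item_id, adds, delete_oids):
    """Apply one day's edits to the hosted layer (network call)."""
    from arcgis.gis import GIS  # imported lazily; available in AGOL notebooks

    gis = GIS("home")  # uses the notebook's signed-in identity
    layer = gis.content.get(item_id).layers[0]
    return layer.edit_features(
        adds=[to_feature(r) for r in adds],
        deletes=",".join(str(o) for o in delete_oids),
    )
```

With the notebook scheduling that AGOL provides, `run_update` could then run daily without depending on any one person's laptop.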