Introduction to ArcGIS Data Pipelines Webinar: Q&A

Sarah_Hanson

In our recent webinar, "Introduction to ArcGIS Data Pipelines", we were thrilled to receive a multitude of thoughtful questions. Although we couldn't address all of them during the live session, we've compiled them and provided answers here. For easier navigation, we have organized the questions into five sections: Usage and Capabilities, Data Sources and Formats, Credit Consumption, Product Comparison, and Miscellaneous.

Whether you are experienced in using ArcGIS Data Pipelines and looking to enhance your skills, or you are considering getting started with Data Pipelines, this post is designed to provide you with valuable insights and clarity.

Let's explore the answers to some of the questions we received during the webinar.

Usage and Capabilities:

Can it be used for geoprocessing services?
Data Pipelines cannot run the analysis tools that you find in Map Viewer. Instead, Data Pipelines is a data integration and data engineering solution and the tools it supports are specifically for that. If your goal is to automate the use of analysis tools, you might check out creating a Notebook.

Can you use data pipelines to update an existing layer, or is it only to create new output layers?
Yes, you can use Data Pipelines to update an existing layer. You can either replace all of the data in the target layer, or you can add new records and update existing records based on a unique ID. The output methods that are supported within the Feature Layer output are documented here.

Is there any way to host the output data on ArcGIS Server rather than hosting on ArcGIS Online?
No, this is not possible. The only supported output option today is a hosted feature layer (or a hosted table if there is no geometry) in ArcGIS Online.

How does Data Pipelines handle security and sharing of connections and data pipelines, especially when dealing with password-protected data sources?
All connection information to secured data sources is encrypted and stored within a data store item. This data store item is saved to your content in ArcGIS Online, and can be used in the data pipeline, but cannot be shared with other users. To learn more about the cloud-based data stores that we support establishing connections to, please refer to this product documentation: Work with input data.

After writing out the features… are those results live or a static snapshot? I.e. when data changes in Snowflake, are those changes reflected in the feature service?
After the features are written, they are a static snapshot. If the data changes in the input, the data pipeline would need to be re-run to update the layer. This can be automated using the scheduling feature. To learn more about scheduled tasks, review this documentation: Schedule a data pipeline task.

Can multiple users share connection info within an organization?
In ArcGIS Data Pipelines, users can connect to external data stores that require authentication. This connection information is encrypted and stored as a data store item in the user’s content to enable reuse in other data pipelines. However, data store items cannot be shared with other ArcGIS Online users. Each user must establish their own connection to secured data stores for use in a data pipeline.

Can it filter attributes or reduce columns?
Yes, Data Pipelines can limit the data being processed. The Select Fields tool can be used to limit or reduce the columns, and the Filter by Attribute tool can be used to limit the records based on an attribute query.

Can you get notifications if a scheduled task fails?
No, but this is functionality we are considering for a future release.

Can I write multiple pipelines to a single feature layer as multiple sub-layers?
You can replace or update layers that are part of a hosted feature layer containing multiple layers, but if your data pipeline is creating a new feature layer, you can only create a single layer in it. If this is something you would like to see changed in the future, please create an idea on the Data Pipelines Ideas site.

Will Data Pipelines get more advanced features, such as creating buffers?
We are actively working on adding support for Arcade geometry functions within the Calculate field tool, which would allow you to create buffers, but we do not have plans for a specific Create buffers tool at this time. If there are specific tools you'd like to see added in the future, please post them to the Data Pipelines Ideas site, as we'd really appreciate your input!

Could you elaborate on the schema adjustment tools and their functionalities?
Schema management is an important part of any data preparation workflow, and Data Pipelines has tools and functionality that make it fast and easy to modify, transform, and validate the design and structure of your data before writing it out to layers in ArcGIS Online. Schema management is at the heart of ArcGIS Data Pipelines as a whole and is not limited to a specific tool or feature, so for a quick overview I'd encourage anyone with this question to review this introductory blog and related demo: Introducing Data Pipelines in ArcGIS Online (beta release).

How does Data Pipelines handle real-time data processing tasks?
Data Pipelines is not designed to support the integration of real-time data with ArcGIS Online. Instead, ArcGIS Velocity is the solution for this.

Can it detect changes in input data automatically?
Data Pipelines does not monitor when updates are made to the source data. However, you can schedule data pipelines to be run on a schedule of your choosing to ensure that information in your hosted feature layers and tables is kept current.

Are there any limits on the items created by data pipelines? Can they be shared in a collaboration and/or shared publicly?
Data Pipelines creates standard hosted feature layers and, just like any other item, they can be shared via collaboration and shared with the public without issue. Please note that if your goal is to share the items created by a data pipeline, such as the resulting feature layer, you do not need to share the input data sources, data stores, or data pipeline item. Those items can remain unshared and just the resulting feature layer can be shared with others.

Does the output support publishing to ArcGIS Hub?
Yes! Data Pipelines creates the same hosted feature layers as other parts of ArcGIS Online, and these can be used in Hub.

What is the difference between Add and update vs. Replace?
Add and update and Replace are two output methods used in the Feature layer output when running a data pipeline on a schedule. Replace does a full truncate (delete) of the data in the target layer and then appends the newly prepared data from your data pipeline. It does this as a single transaction to minimize downtime and to support rolling back any failures. Add and update adds (appends) new records and, optionally, updates existing ones if you specify a unique ID. One key difference has to do with features that have been deleted from the source. If you use Replace, these features will be removed from the target layer when the data pipeline is run. With Add and update, however, records that have been deleted from the source will persist in the output feature layer.
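To make the difference concrete, here is a minimal Python sketch of the two behaviors described above. It is only an illustration using plain dictionaries keyed by a unique ID. The function names and sample records are hypothetical, and this is not how Data Pipelines is implemented or called. It assumes a unique ID has been specified so that existing records can be updated.

```python
def replace(target: dict, incoming: dict) -> dict:
    """Replace: discard everything in the target, then load the incoming records."""
    return dict(incoming)


def add_and_update(target: dict, incoming: dict) -> dict:
    """Add and update: add new IDs, overwrite matching IDs, keep all other records."""
    merged = dict(target)
    merged.update(incoming)
    return merged


# Hypothetical target layer, keyed by a unique ID.
target = {1: "Oak St", 2: "Elm St", 3: "Pine St"}
# Hypothetical source on the next run: ID 2 changed, ID 4 is new,
# and IDs 1 and 3 were deleted from the source.
incoming = {2: "Elm Street", 4: "Maple St"}

print(replace(target, incoming))         # IDs 1 and 3 are gone
print(add_and_update(target, incoming))  # IDs 1 and 3 persist
```

In the sketch, records 1 and 3 disappear under Replace but remain under Add and update, which mirrors the point about deleted source features.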

Data Sources and Formats:

Can you provide more information on the supported input data formats, such as CSV on OneDrive?
Yes, you can use CSVs from OneDrive, Google Drive, or Dropbox. The supported input data stores and file formats that can be used in a data pipeline are detailed in the product documentation. To learn more, please review: Work with input data.

Is there a solution for being able to create Data Pipelines using raster data in the future?
Support for using raster data as input to a data pipeline is not possible today. If you’d like to see this support added in the future, please create an idea on the Data Pipelines Ideas site.

Will feature classes from a file geodatabase be supported as an input?
Yes, this is on our near-term roadmap and is planned for the June 2024 update.

Will it support GPX files or Salesforce import in the future?
This is not on our near or mid-term roadmap currently. If there are additional file formats and/or data sources that you’d like to see supported in ArcGIS Data Pipelines in the future, I’d love it if you could please post the idea to the Data Pipelines Ideas site so it may be considered for a future release.

Can pipeline input be from Databricks or REDCap?
No, this is not possible today. If there are additional file formats and/or data sources that you’d like to see supported in the future, please create an idea on the Data Pipelines Ideas site so we can hear from you.

Can shared content from another organization be a data input?
Yes, certainly. The data used as input to a data pipeline does not need to originate from within your organization.

How can ArcGIS Server services be utilized within Data Pipelines?
ArcGIS Data Pipelines can use feature services from ArcGIS Enterprise as an input to a data pipeline. To do this, first add the feature service URL as a new item to your content, and then use the Feature layer input to load data from the web service. For more information and relevant limitations, please refer to the documentation for the Feature layer input.

Can Data Pipelines integrate with on-premises geodatabases (SQL Server or Oracle)?
ArcGIS Data Pipelines does not support connecting to on-premises databases such as SQL Server or Oracle today. This is largely because it is uncommon for organizations to configure the security of their relational database management systems (RDBMS), like SQL Server or Oracle, to allow connections that originate from the open internet or from outside their network.

Credit Consumption:

How are credits consumed? Is it based on storage or the number of pipelines?
Credits are consumed based on the amount of time, in minutes, that you work in Data Pipelines. This is both for working interactively to build and edit your data pipeline as well as for any scheduled jobs if you schedule your data pipeline to run on a user-defined schedule. For more information, please review this documentation.

Why does it charge credits to build a data pipeline?
Data Pipelines charges credits while you build a data pipeline because the authoring experience is visual, low-code, and interactive. It offers built-in support for inferring dataset schemas, generating data previews, caching inputs, and generating error and warning messages.

How many credits per minute are charged?
Data Pipelines charges 50 credits per hour, on a per-minute basis, when working interactively (in the editor when building a data pipeline), and 70 credits per hour when data pipelines are run on a schedule to keep data current, again on a per-minute basis.

For example: if you spend 30 minutes working in the editor, building a data pipeline, you would be charged 25 credits. If you spend 50 minutes working in the editor, you would be charged about 42 credits. In the case of scheduled tasks, we can see in our logs that the average duration of scheduled data pipeline runs is 4 minutes. At the scheduled rate, that translates to about 5 credits per run.
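The arithmetic behind these examples can be sketched in a few lines of Python. The rates are the ones quoted in this post; the function name is hypothetical, and rounding to whole credits is an assumption made here only to match the figures above, not a statement about how billing is rounded.

```python
# Rates quoted in this post, in credits per hour.
INTERACTIVE_CREDITS_PER_HOUR = 50
SCHEDULED_CREDITS_PER_HOUR = 70


def credits_for(minutes: float, credits_per_hour: float) -> float:
    """Credits for a number of minutes at an hourly rate, charged per minute."""
    return minutes * credits_per_hour / 60


examples = [
    ("30 min in the editor", 30, INTERACTIVE_CREDITS_PER_HOUR),
    ("50 min in the editor", 50, INTERACTIVE_CREDITS_PER_HOUR),
    ("4 min scheduled run", 4, SCHEDULED_CREDITS_PER_HOUR),
]

for label, minutes, rate in examples:
    credits = credits_for(minutes, rate)
    # Rounding is an assumption for display only.
    print(f"{label}: {credits:.2f} credits (about {round(credits)})")
```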

For more information on credits in ArcGIS Online, please review this documentation.

Is there a way to estimate credit usage before running a data pipeline?
It is not possible for Data Pipelines to estimate the amount of time a data pipeline will take to run. Note, however, that when you are working in the editor, where you build and edit a data pipeline, clicking Run to execute the data pipeline does not result in any additional charges.

What is the credit cost per minute of workflow runtime in ArcGIS Data Pipelines?
Interactive use: 50 credits per hour. This would equate to 0.83 credits per minute.
Scheduled use: 70 credits per hour. This would equate to 1.17 credits per minute.

What does a typical data pipeline creation "cost" in terms of credits?
That will depend on the amount of time you spend in the editor, creating your data pipeline, as the cost is directly tied to usage in terms of minutes you had the editor open. However, I can try to provide some context. The average amount of time spent working interactively is about 40 minutes. This is about 33 credits. Now, compare that investment to the one that would be required to pay a staff analyst or services consultant to write a Python script to accomplish the same task. This comparison is just to highlight that if a data integration task can be accomplished using Data Pipelines, as an alternative to creating and maintaining an automation script, the value of Data Pipelines will be quickly demonstrated.

Does generating previews consume credits?
When working within the data pipeline editor, where you interactively build and design your data pipeline, you can preview the results at any stage of the workflow – viewing the attributes and/or map to validate that the results are what you expected. This does not charge you extra credits. Instead, you are charged credits for time spent working in the editor on a per minute basis. You can preview the data 15 times in an hour, or not at all, and the credit charge would be the same: directly tied to how many minutes you spent working in the editor.

If you are scheduling data pipelines to run hourly that could be costly because of the pricing model, correct?
One way to look at this: running a data pipeline 24 times per day (for example, every hour) will cost 24 times as much as running it once per day. The frequency at which you schedule your data pipelines to run is up to you and will be based on your business requirements and the importance of keeping the information current.
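As a rough sketch of that comparison, the Python below estimates daily credits for different schedules. It uses the scheduled rate quoted in this post and assumes each run takes 4 minutes, the average duration mentioned earlier; your own runs may be shorter or longer, and the function name is hypothetical.

```python
SCHEDULED_CREDITS_PER_HOUR = 70  # scheduled rate quoted in this post
ASSUMED_RUN_MINUTES = 4          # assumption: average run duration quoted in this post


def daily_scheduled_credits(runs_per_day: int,
                            run_minutes: float = ASSUMED_RUN_MINUTES) -> float:
    """Estimated credits per day for a schedule, charged per minute of run time."""
    return runs_per_day * run_minutes * SCHEDULED_CREDITS_PER_HOUR / 60


once_daily = daily_scheduled_credits(1)
hourly = daily_scheduled_credits(24)

print(f"Once per day: {once_daily:.2f} credits per day")
print(f"Every hour:   {hourly:.2f} credits per day")
```

Under these assumptions, the estimate scales linearly with the number of runs, so the main levers are how often the pipeline runs and how long each run takes.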

Product Comparison:

What are the differences between Data Pipelines and Data Interoperability?
Data Pipelines is focused on data integration for ArcGIS Online, with results being written out to a hosted feature layer, while Data Interoperability supports a larger set of inputs and file types and is able to write results back to the source.

How is Data Pipelines different from ArcGIS Velocity?
While there are similarities in that both allow you to connect to external data sources and import the data into ArcGIS Online for use across the ArcGIS system, they serve distinct purposes. Velocity is specifically designed for real-time and big data processing, efficiently handling high-speed data streams from sensors and similar sources. It also is focused on enabling analytics such as device tracking, incident detection, and pattern analysis.

If I have ArcGIS Pro on-premises and an ArcGIS Online organization, what is the benefit of using Data Pipelines instead of ModelBuilder to process data and publish it to Online as a hosted feature layer?
Data Pipelines allows you to create and automate your data integration workflows for ArcGIS Online in the cloud, without having to worry about the infrastructure that it runs on. Scheduling updates to data in Online using ArcGIS Pro and ModelBuilder is certainly possible, but it comes with a few challenges that need to be considered and worked around.

How does Data Pipelines differ from ModelBuilder? / Is Data Pipelines like ModelBuilder but for ArcGIS Online?
ModelBuilder and Data Pipelines are similar in that they offer a low-code, drag-and-drop user experience for authoring repeatable workflows. However, there are some key differences:

ModelBuilder is a capability included in ArcGIS Pro; Data Pipelines is a capability included in ArcGIS Online.
ModelBuilder can be used to automate or streamline analysis workflows, leveraging any of the geoprocessing tools found in ArcGIS Pro (over a thousand); Data Pipelines can be used to automate or streamline data integration and preparation workflows, and includes a number of focused tools designed to clean, format, and prepare data for visualization and analysis.
Note: ModelBuilder for Map Viewer is scheduled for a future release. It is designed for Online and Enterprise users, enabling them to chain Map Viewer tools and data into models to automate and share their analysis workflows.

Setting this up looks very similar to Workflow Manager. What's the main difference between Workflow Manager and Data Pipelines?
Workflow Manager is a tool for designing, managing, and tracking geospatial workflows and business processes within an organization, while Data Pipelines is a data integration capability for ArcGIS Online.

Miscellaneous:

Is there a plan to make Data Pipelines available for personal use licenses?
No, not at this time.

Can Data Pipelines be used with ArcGIS Enterprise?
Data Pipelines is only available in ArcGIS Online today. However, if you are using ArcGIS Enterprise and ArcGIS Online together, you can add feature layers from ArcGIS Enterprise as items to your Online organization and use those feature layers as input to your data pipeline. For more information, please refer to the documentation for the Feature layer input.

Can a data pipeline be triggered by a webhook?
Support for triggering a data pipeline to run using a webhook is not functionality that is available today, but we are considering it for a future release of ArcGIS Online.

Can a data pipeline workflow be triggered using Workflow Manager?
This is not possible today.

What criteria are used for Add/Update? Is there a way to specify the "primary" key?
Add and update is based on a uniquely identifying field that you specify when configuring the output. This field needs to have a unique index in the feature layer, which can be configured in the "Data" tab of the feature layer item page.

Can we add a Python tool (from a toolbox, etc.) as a tool component in a data pipeline?
Today, we only support processing data using tools that are included in our toolsets. If you’d like to see support added for using custom web tools in a data pipeline in the future, please let us know by creating a post to the Data Pipelines Ideas site.

The End!

Please feel free to leave comments and questions in this thread or make a post on the Data Pipelines Questions forum. To learn more about ArcGIS Data Pipelines, check out the documentation and related blog posts. Thank you!

ChristopherCounsell

Great blog Sarah!