How it works: ArcGIS Velocity & Chaos Engineering

QingyingWu · 04-06-2021

ArcGIS Velocity is built on Kubernetes, which provides built-in health checks, failure recovery, and autoscaling. These capabilities make ArcGIS Velocity robust and resilient for continuous, real-time data ingestion and analysis.

The Real-Time Visualization & Analytics product team routinely runs resiliency tests by introducing failures into the system and observing (1) how the system reacts and (2) whether it recovers gracefully. This practice is called Chaos Engineering, and it helps us identify weaknesses in the system so we can address them and improve resiliency.
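The inject-and-observe loop described above can be sketched in a few lines of Python. This is only an illustrative outline, not the team's actual tooling: `is_healthy` and `inject_failure` are hypothetical callables that a real test would back with a service health check and a fault injector, and the timeout and polling values are placeholders.

```python
import time
from typing import Callable, Optional


def run_experiment(
    is_healthy: Callable[[], bool],        # hypothetical steady-state check
    inject_failure: Callable[[], None],    # hypothetical fault injector
    recovery_timeout_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Inject one failure, then record how the system reacts and whether it recovers."""
    # Only start from a known-good state; otherwise the result is meaningless.
    if not is_healthy():
        return {"started": False, "degraded": False,
                "healthy_at_end": False, "recovery_s": None}

    inject_failure()
    start = clock()
    degraded = False
    recovery_s: Optional[float] = None

    while clock() - start < recovery_timeout_s:
        if is_healthy():
            if degraded:
                # The system went unhealthy and has now come back.
                recovery_s = clock() - start
                break
        else:
            degraded = True
        sleep(poll_interval_s)

    # Healthy at the end means it either never degraded or it recovered in time.
    healthy_at_end = (not degraded) or (recovery_s is not None)
    return {"started": True, "degraded": degraded,
            "healthy_at_end": healthy_at_end, "recovery_s": recovery_s}
```

The `clock` and `sleep` parameters exist only so the loop can be exercised with a simulated clock. A result with `degraded` true and `healthy_at_end` false is the kind of finding that points to a weakness worth fixing.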

For example, ArcGIS Velocity stores data in feature layers backed by a spatiotemporal data store. During a chaos test, we may saturate the disk, CPU, or memory of a pod running the data store, and then observe whether data continues to be written successfully. Because the data store consists of a distributed set of client and master pods, it should keep working and storing data even if one pod is not performing normally. If, however, the data store deployment fails to respond, we have an opportunity to find and fix the problem before users encounter it.
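One way to picture the "does data keep getting written" observation is a small write probe that runs while a pod is under stress. The sketch below is an assumption-laden illustration: `write_once` is a hypothetical callable standing in for whatever writes a test record to a feature layer, and it is expected to raise an exception when a write fails.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeReport:
    attempts: int
    failures: int
    longest_failure_streak: int  # consecutive failed writes, a rough outage length

    @property
    def success_rate(self) -> float:
        if self.attempts == 0:
            return 1.0
        return (self.attempts - self.failures) / self.attempts


def probe_writes(
    write_once: Callable[[int], None],  # hypothetical: writes record i, raises on failure
    attempts: int,
    interval_s: float = 0.0,
) -> ProbeReport:
    """Attempt a fixed number of writes and summarize how many failed."""
    failures = 0
    streak = 0
    longest = 0
    for i in range(attempts):
        try:
            write_once(i)
            streak = 0  # a successful write ends any failure streak
        except Exception:
            failures += 1
            streak += 1
            longest = max(longest, streak)
        if interval_s > 0:
            time.sleep(interval_s)
    return ProbeReport(attempts, failures, longest)
```

Comparing a report taken during the fault with a baseline taken before it gives a simple measure of how much the stressed pod affected writes. A long failure streak suggests the data store stopped responding rather than slowing down.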

Another chaos test measures compute node resiliency. An ArcGIS Velocity deployment runs on multiple nodes (virtual machines) in a Kubernetes cluster, so if a node fails, the other nodes in the cluster can continue to work together to provide services. We periodically shut down a node abruptly to simulate unforeseen outages that may occur in the underlying cloud infrastructure. This lets us observe whether the pods that were running on that node are redeployed to other running nodes, and whether ArcGIS Velocity continues to perform with minimal disruption of service.
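The redeployment check can be illustrated by comparing two snapshots of pod placement, one taken before the node shutdown and one after. This is a simplified sketch with made-up workload and node names. In a real test the snapshots would come from the cluster itself (for example, from the output of `kubectl get pods -o wide`). The sketch counts ready replicas per workload rather than tracking pod names, since replacement pods often get new names.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class PodStatus(NamedTuple):
    workload: str  # hypothetical workload name, e.g. "datastore"
    node: str      # node the pod is scheduled on
    ready: bool    # whether the pod reports ready


def ready_replicas(pods: Iterable[PodStatus],
                   exclude_node: Optional[str] = None) -> Counter:
    """Count ready pods per workload, optionally ignoring one node."""
    counts: Counter = Counter()
    for pod in pods:
        if pod.ready and pod.node != exclude_node:
            counts[pod.workload] += 1
    return counts


def unrecovered_workloads(before: Iterable[PodStatus],
                          after: Iterable[PodStatus],
                          failed_node: str) -> Dict[str, Tuple[int, int]]:
    """Map each under-replicated workload to (ready now, ready before).

    Only pods on surviving nodes count toward the "ready now" figure.
    """
    expected = ready_replicas(before)
    actual = ready_replicas(after, exclude_node=failed_node)
    return {w: (actual[w], n) for w, n in expected.items() if actual[w] < n}
```

An empty result means every workload has at least as many ready replicas on the surviving nodes as it had before the shutdown. Anything else names the workloads that did not come back.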

Chaos Engineering is a continuous effort. We run chaos tests in parallel on development environments to make sure the system is robust and recovers from unexpected interruptions. As the product team implements new capabilities and fixes reported issues, we will continue to incorporate chaos tests into our development and release cycles. This helps ensure that many potential issues are uncovered well before they could impact our users or their mission.