Reliability is expected of all enterprise systems, though its exact definition can vary by organization. In ArcGIS Enterprise on Kubernetes, there are several techniques and features you can leverage to meet the performance goals of your organization. In his plenary session at the 2025 Developer & Technology Summit, Chris Pawlyszyn demonstrates how you can achieve reliability by architecting and maintaining a system that is both highly available and recoverable.
Availability
ArcGIS Enterprise on Kubernetes should be implemented in a way that handles pod, node, and availability zone (AZ) outages while maintaining access to critical services and applications.
In his demonstration, Chris’s existing Kubernetes clusters span all availability zones in the region—minimizing the effects of AZ loss. He highlights the following options to consider during the deployment and configuration of your organization:
- During deployment, you can update a topology key in the properties file so that the replicas of each StatefulSet are spread across the appropriate topology. This keeps replicas from being concentrated unevenly in a single availability zone.
- During configuration, you can select the enhanced availability architecture profile. As you move from the development profile to the standard and enhanced availability profiles, the redundancy of individual workloads increases. Enhanced availability is the only profile that guarantees adequate coverage for all stateful workloads in the case of an AZ failure.
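
To make the goal of zone spreading concrete, the following Python sketch checks how evenly the replicas of one workload are distributed across zones. It is an illustration only: the zone names, the function names, and the replica layouts are assumptions, and it does not read the ArcGIS Enterprise properties file or query a cluster. In a real cluster, the zone of each node is commonly exposed through the well-known Kubernetes node label topology.kubernetes.io/zone.

```python
from collections import Counter


def zone_skew(replica_zones, all_zones):
    """Return the gap between the most- and least-populated zone.

    replica_zones: zone name for each replica of one workload.
    all_zones: every zone the cluster spans (zones with no replica count as 0).
    """
    counts = Counter(replica_zones)
    per_zone = [counts.get(zone, 0) for zone in all_zones]
    return max(per_zone) - min(per_zone)


def survives_zone_loss(replica_zones):
    """True when at least one replica remains after any single zone fails."""
    return len(set(replica_zones)) >= 2


# Hypothetical zones and replica placements, for illustration only.
zones = ["us-east-2a", "us-east-2b", "us-east-2c"]
balanced = ["us-east-2a", "us-east-2b", "us-east-2c"]
lopsided = ["us-east-2a", "us-east-2a", "us-east-2a"]

print(zone_skew(balanced, zones), survives_zone_loss(balanced))
print(zone_skew(lopsided, zones), survives_zone_loss(lopsided))
```

A skew of zero means every zone holds the same number of replicas. The lopsided layout places all three replicas in one zone, so losing that zone takes the workload down.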

Organization storage is another area where you can increase the availability of your organization. When considering storage options, you can choose to integrate with cloud services for both the object and relational stores. This provides durable, easily scalable solutions and reduces the footprint of stateful workloads running within the cluster.

Recoverability
In addition to increasing the availability of your organization, you must ensure the state of your system is recoverable in the event of complete loss of the infrastructure running your primary system.
When it comes to backups, Chris highlights the following options:
- Register a cloud object store as a backup store to provide increased durability and scalability
- Create a backup schedule in ArcGIS Enterprise Manager at a frequency defined by the recovery point objective (RPO) of your organization
- Set up retention policies to avoid the endless accumulation of backups
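
The relationship between backup frequency, RPO, and retention can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how ArcGIS Enterprise Manager implements schedules or retention: the function names, the daily schedule, the seven-day retention window, and the 12-hour RPO are all hypothetical.

```python
from datetime import datetime, timedelta


def backups_to_prune(backup_times, now, retention):
    """Return backups older than the retention window, oldest first."""
    cutoff = now - retention
    return sorted(t for t in backup_times if t < cutoff)


def meets_rpo(backup_interval, rpo):
    """Worst-case data loss is roughly one full interval between backups."""
    return backup_interval <= rpo


now = datetime(2025, 3, 15, 0, 0)
# Hypothetical daily backups for the previous ten days.
backups = [now - timedelta(days=d) for d in range(1, 11)]

expired = backups_to_prune(backups, now, retention=timedelta(days=7))
print(len(expired))
print(meets_rpo(timedelta(hours=24), rpo=timedelta(hours=12)))
```

In this example, a daily schedule does not satisfy a 12-hour RPO, because up to a day of changes could be lost between backups. Shortening the interval to match the RPO produces more backups, which is where a retention policy becomes necessary.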

Testing the recoverability of an organization
To demonstrate the recoverability of the organization and to test that his recovery time objective (RTO) is met, Chris simulates a regional failure in which the primary cluster is taken completely down.
Chris has set up two clusters and named each organization to indicate the region that traffic is being routed to.

Chris runs a script that will cordon all nodes in his cluster and forcefully terminate all pods in the deployment namespace.
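
The script itself is not shown in the session recap. As a rough idea of what such a script might do, the following Python sketch builds the kubectl commands to cordon a list of nodes and force-delete every pod in a namespace. The node names and the namespace are hypothetical, and the sketch defaults to a dry run that only prints the commands. Running it for real is destructive and should only be done against a cluster set aside for this kind of test.

```python
import subprocess


def outage_commands(node_names, namespace):
    """Build kubectl invocations that cordon nodes, then force-delete pods."""
    commands = [["kubectl", "cordon", node] for node in node_names]
    commands.append([
        "kubectl", "delete", "pods", "--all",
        "--namespace", namespace,
        "--grace-period=0", "--force",
    ])
    return commands


def simulate_outage(node_names, namespace, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in outage_commands(node_names, namespace):
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


# Hypothetical node and namespace names; the dry run only prints the commands.
simulate_outage(["node-a", "node-b"], "arcgis", dry_run=True)
```

Cordoning first matters: it marks the nodes unschedulable, so the pods that are terminated afterward cannot be rescheduled onto the same nodes and the outage persists.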

In his system architecture, Chris has implemented Global Accelerator as a traffic manager in front of the two regionally separate endpoint groups. This minimizes the downtime for end users and automatically reroutes traffic to the secondary organization following a failure in the primary.

Once he reopens his organization home page in a new browser, he is directed to the replicated organization in us-west-2 instead of the primary in us-east-2. During this interruption, his organization has accumulated under a minute of downtime, allowing his end users to continue work with minimal disruption.
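
One way to check a failover test like this against an RTO is to probe the organization URL at a fixed interval and measure the longest run of failed probes. The Python sketch below shows only the measurement step. The probe results, the 15-second interval, and the 60-second objective are made-up values for illustration, not figures from the demonstration.

```python
def longest_outage_seconds(samples):
    """Return the longest completed run of failed probes, in seconds.

    samples: (timestamp_seconds, is_up) pairs sorted by time. An outage runs
    from the first failed probe to the next successful one. An outage that is
    still ongoing at the last sample is not counted.
    """
    longest = 0
    outage_start = None
    for timestamp, is_up in samples:
        if not is_up and outage_start is None:
            outage_start = timestamp
        elif is_up and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None
    return longest


# Hypothetical probes taken every 15 seconds against the organization URL.
probes = [
    (0, True), (15, True), (30, False), (45, False),
    (60, False), (75, True), (90, True),
]
rto_seconds = 60  # assumed objective, not a figure from the session

downtime = longest_outage_seconds(probes)
print(downtime, downtime <= rto_seconds)
```

Because the measurement can only be as precise as the probe interval, a shorter interval gives a tighter estimate of the actual downtime.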
Conclusion
Not only does this replicated environment buy time to assess and respond to large-scale outages, it also supports numerous testing possibilities while still preserving the integrity of the primary system, such as:
- Running regular disaster recovery drills
- Verifying backup integrity
- Testing functionality in upgrades
- Running tests of IT maintenance tasks
This instills confidence in the reliability of your system—no matter the scale of the outage, you can recover and you know how long that recovery will take.
To learn more, review the following resources: