ArcGIS Blog

Reliability in ArcGIS Enterprise on Kubernetes

By Jordan Hooey

Reliability is expected of all enterprise systems, though its exact definition can vary by organization. In ArcGIS Enterprise on Kubernetes, there are several techniques and features you can leverage to meet the performance goals of your organization. In his plenary session at the 2025 Developer & Technology Summit, Chris Pawlyszyn demonstrates how you can achieve reliability by architecting and maintaining a system that is both highly available and recoverable.

Availability

ArcGIS Enterprise on Kubernetes should be implemented in a way that handles pod, node, and availability zone (AZ) outages while maintaining access to critical services and applications.

In his demonstration, Chris's Kubernetes clusters span all availability zones in the region, which minimizes the effect of losing an AZ. He highlights the following options to consider when deploying and configuring your organization:

  • During deployment, you can update a topology key in the properties file to ensure that the replicas of each statefulset are spread across the appropriate topology. This prevents replicas from concentrating in a single availability zone.
  • During configuration, you can select the enhanced availability architecture profile. The redundancy of individual workloads increases as you move from the development profile to standard availability and then to enhanced availability. Enhanced availability is the only profile that guarantees adequate coverage for all stateful workloads in the case of an AZ failure.
The option to select the enhanced availability architecture profile when creating an organization.
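
To make the idea of spreading replicas across a topology concrete, here is a minimal sketch that measures how unevenly the replicas of one statefulset are distributed across zones. The node names, pod names, and placements are hypothetical, and `zone_skew` is an illustrative helper rather than anything shipped with ArcGIS Enterprise on Kubernetes. The only real identifier is `topology.kubernetes.io/zone`, the well-known Kubernetes node label for availability zones.

```python
from collections import Counter

# Well-known Kubernetes node label that records a node's availability zone.
ZONE_LABEL = "topology.kubernetes.io/zone"

# Hypothetical nodes, one per zone in us-east-2.
node_labels = {
    "node-a": {ZONE_LABEL: "us-east-2a"},
    "node-b": {ZONE_LABEL: "us-east-2b"},
    "node-c": {ZONE_LABEL: "us-east-2c"},
}


def zone_skew(pod_to_node, node_labels):
    """Return the replica count of the fullest zone minus that of the emptiest."""
    zones = {labels[ZONE_LABEL] for labels in node_labels.values()}
    counts = Counter(node_labels[node][ZONE_LABEL] for node in pod_to_node.values())
    return max(counts[z] for z in zones) - min(counts[z] for z in zones)


# Hypothetical placements of three replicas of one statefulset.
balanced = {"store-0": "node-a", "store-1": "node-b", "store-2": "node-c"}
lopsided = {"store-0": "node-a", "store-1": "node-a", "store-2": "node-b"}

print(zone_skew(balanced, node_labels))  # 0
print(zone_skew(lopsided, node_labels))  # 2
```

A low skew means that losing any one zone takes out only a small share of the replicas, which is the outcome the topology key is meant to enforce.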

Organization storage is another area where you can increase the availability of your organization. When considering storage options, you can choose to integrate with cloud services for both the object and relational stores. This provides durable, easily scalable solutions and reduces the footprint of stateful workloads running within the cluster.

A view from ArcGIS Enterprise Manager when cloud services are used for the organization's object and relational store.

Recoverability

In addition to increasing the availability of your organization, you must ensure the state of your system is recoverable in the event of complete loss of the infrastructure running your primary system.

When it comes to backups, Chris highlights the following options:

Create a backup schedule and set a retention policy from ArcGIS Enterprise Manager.
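
As a rough illustration of what a retention policy does, the sketch below selects the backups that fall outside a retention window. It is not how ArcGIS Enterprise Manager implements retention; the backup names, dates, and the `expired_backups` helper are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def expired_backups(backups, retention_days, now):
    """Return the names of backups created before the retention cutoff."""
    cutoff = now - timedelta(days=retention_days)
    return sorted(name for name, created in backups.items() if created < cutoff)


# Hypothetical backups and their creation times.
backups = {
    "nightly-03-01": datetime(2025, 3, 1),
    "nightly-03-20": datetime(2025, 3, 20),
    "nightly-03-30": datetime(2025, 3, 30),
}

# With a 14-day window evaluated on 31 March, only the 1 March backup has expired.
print(expired_backups(backups, retention_days=14, now=datetime(2025, 3, 31)))
```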

Testing the recoverability of an organization

To demonstrate the recoverability of the organization and to test that his recovery time objective (RTO) is met, Chris simulates a regional failure in which the primary cluster is taken completely down.

Chris has set up two clusters and named each organization after its region, so it is clear where traffic is being routed.

A view of Lens prior to running the cluster_down.sh script.

Chris runs a script, cluster_down.sh, that cordons all nodes in his cluster and forcefully terminates all pods in the deployment namespace.
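
The contents of Chris's script are not shown, so the following is only a sketch of the kind of commands such a script might issue. It builds standard `kubectl cordon` and `kubectl delete pods` invocations without running them. The node names and the `arcgis` namespace are placeholders, and `cluster_down_commands` is a hypothetical helper.

```python
import shlex


def cluster_down_commands(nodes, namespace):
    """Build kubectl commands that cordon every node, then force-delete all pods."""
    # Cordoning marks a node unschedulable so deleted pods cannot be rescheduled onto it.
    commands = [["kubectl", "cordon", node] for node in nodes]
    # Force deletion with a zero grace period skips graceful termination.
    commands.append(
        ["kubectl", "delete", "pods", "--all",
         "--namespace", namespace, "--grace-period=0", "--force"]
    )
    return commands


# Print the commands for review instead of executing them (placeholder names).
for command in cluster_down_commands(["node-a", "node-b"], "arcgis"):
    print(shlex.join(command))
```

To run the commands for real, each list could be passed to `subprocess.run(command, check=True)`. Only do that against a cluster you intend to take down.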

A view of Lens after the pods have been deleted.

In his system architecture, Chris has implemented AWS Global Accelerator as a traffic manager in front of the two regionally separate endpoint groups. It automatically reroutes traffic to the secondary organization when the primary fails, which minimizes downtime for end users.

A view of the AWS Global Accelerator console.
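
The failover behavior described above can be modeled in a few lines. This is a deliberately simplified model and not the AWS API: it treats the endpoint groups as an ordered list and sends traffic to the first healthy one. The `EndpointGroup` class and `route` function are hypothetical, and the real service considers more than an ordered list.

```python
from dataclasses import dataclass


@dataclass
class EndpointGroup:
    region: str
    healthy: bool


def route(groups):
    """Return the region of the first healthy group in preference order, or None."""
    for group in groups:
        if group.healthy:
            return group.region
    return None


primary = EndpointGroup("us-east-2", healthy=True)
secondary = EndpointGroup("us-west-2", healthy=True)

print(route([primary, secondary]))  # us-east-2

# Simulate the regional failure: health checks against the primary start failing.
primary.healthy = False
print(route([primary, secondary]))  # us-west-2
```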

Once he reopens his organization home page in a new browser, he is directed to the replicated organization in us-west-2 instead of the primary in us-east-2. The interruption amounts to less than a minute of downtime, so his end users can continue working with minimal disruption.
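
One way to check a result like this against an RTO is to probe the organization URL at a fixed interval during the drill and compute the longest outage afterward. The sketch below assumes the probe results have already been collected as (timestamp in seconds, is_up) pairs. The sample data, the 60-second RTO, and the `longest_outage` helper are illustrative, not measurements from the demonstration.

```python
def longest_outage(probes):
    """Return the longest span, in seconds, from a first failed probe to the next success.

    probes is a list of (timestamp_seconds, is_up) pairs sorted by time.
    """
    longest = 0.0
    down_since = None
    for timestamp, is_up in probes:
        if not is_up and down_since is None:
            down_since = timestamp  # outage begins
        elif is_up and down_since is not None:
            longest = max(longest, timestamp - down_since)  # outage ends
            down_since = None
    if down_since is not None:
        # Still down at the last probe: count the outage up to that point.
        longest = max(longest, probes[-1][0] - down_since)
    return longest


# Hypothetical probe results taken every 10 seconds during a failover drill.
probes = [(0, True), (10, True), (20, False), (30, False),
          (40, False), (50, True), (60, True)]

RTO_SECONDS = 60  # assumed recovery time objective
outage = longest_outage(probes)
print(outage, outage <= RTO_SECONDS)
```

Because the probes are discrete, the measured outage is only accurate to within one probe interval.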

Conclusion

Not only does this replicated environment buy time to assess and respond to large-scale outages, it also lets you run tests without compromising the integrity of the primary system, such as:

  • Running regular disaster recovery drills
  • Verifying backup integrity
  • Testing functionality in upgrades
  • Running tests of IT maintenance tasks

This instills confidence in the reliability of your system: no matter the scale of the outage, you can recover, and you know how long that recovery will take.
