Dev Summit 2024: Observability and reliability in ArcGIS Enterprise on Kubernetes

When the unexpected happens in your production environment, you want that environment to be resilient enough to remain operational and recover as quickly as possible. With an ArcGIS Enterprise on Kubernetes deployment, organizations have access to multiple features that provide this level of resiliency. In their plenary demonstration, Chris and Bill highlight some of these features, including support for deployments across multiple availability zones.

Configuring a highly available deployment

For their demonstration, Chris and Bill have set up a highly available deployment that will undergo a chaos test. While an ArcGIS Enterprise on Kubernetes deployment has built-in high availability, Chris and Bill are also leveraging a number of other features to reduce downtime and data loss and to improve resiliency.

The current Grafana dashboard view of Chris and Bill's Kubernetes cluster ahead of the chaos test.

Cloud native object stores

Starting at the 11.2 release, ArcGIS Enterprise on Kubernetes organizations can use cloud native services to integrate cloud object stores. Cloud object stores act as the organization’s object store or backup store location. This aids in increasing reliability and resiliency while also reducing the demand on in-cluster resources.

Multi-AZ deployments

Also starting at 11.2, organizations can deploy the Kubernetes cluster that hosts ArcGIS Enterprise on Kubernetes across multiple availability zones. By using topology spread constraints, administrators can control how pods are scheduled across the cluster. For Chris and Bill’s deployment, each availability zone becomes the separation boundary between replicas of each workload.
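As a rough illustration of how a topology spread constraint keeps replicas balanced, the Python sketch below builds a constraint in the shape of the Kubernetes topologySpreadConstraints pod-spec field and checks the skew of a placement against it. The zone names, the example-workload label, and the helpers zone_skew and satisfies are hypothetical. ArcGIS Enterprise on Kubernetes manages its own workload specifications, so this is not the configuration used in the demonstration.

```python
from collections import Counter

# Shape of a Kubernetes topology spread constraint (pod spec field
# `topologySpreadConstraints`). The label value is hypothetical.
SPREAD_CONSTRAINT = {
    "maxSkew": 1,
    "topologyKey": "topology.kubernetes.io/zone",
    "whenUnsatisfiable": "DoNotSchedule",
    "labelSelector": {"matchLabels": {"app": "example-workload"}},
}


def zone_skew(replica_zones, all_zones):
    """Return the max minus min replica count across the given zones."""
    counts = Counter(replica_zones)
    per_zone = [counts.get(zone, 0) for zone in all_zones]
    return max(per_zone) - min(per_zone)


def satisfies(replica_zones, all_zones, constraint=SPREAD_CONSTRAINT):
    """True if the placement's skew is within the constraint's maxSkew."""
    return zone_skew(replica_zones, all_zones) <= constraint["maxSkew"]


# Hypothetical zone names and replica placements.
zones = ["zone-a", "zone-b", "zone-c"]
balanced = ["zone-a", "zone-b", "zone-c"]
lopsided = ["zone-a", "zone-a", "zone-a"]

print(satisfies(balanced, zones))  # True
print(satisfies(lopsided, zones))  # False
```

With one replica per zone, losing a single zone removes at most one replica of the workload, which is the separation boundary described above.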

Enhanced availability architecture profile

The enhanced availability architecture profile is designed for use in business or mission-critical production environments and provides the highest level of availability, as it includes increased and expanded redundancy across critical pods. If an organization is configured to use multiple availability zones, the enhanced availability profile is the only profile that guarantees adequate coverage for all stateful workloads in the case of an availability zone failure.

Testing reliability when faced with failure

Chris begins his demonstration by showing the current state of his organization. Due to the interplay between the multiple availability zone deployment, enhanced availability profile, and cloud native object store, the organization is set up to be able to withstand a significant outage.

To put this to the test, Chris stops one of the availability zones, disrupting the cluster. As a result, machines in that zone have terminated, and numerous pods are shifting from a running state to a pending state.

A view of the pods within Chris and Bill's Kubernetes cluster, immediately after the availability zone has been stopped.

Observability and monitoring after failure

With Chris’s chaos test implemented, Bill now needs to evaluate the impact of the zone’s outage. From the number of pods in a pending state, we can see that Kubernetes is attempting to rebalance after the loss in capacity.

A view of the pods in Chris and Bill's Kubernetes cluster, showing that Kubernetes is attempting to restart, schedule, and rebalance after the outage.
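The same signal can be read without a dashboard. As a minimal sketch, the Python below tallies pods by their status.phase field, assuming input in the shape of the pod list returned by kubectl get pods -o json. The helper count_phases and the sample pods are made up for illustration and are not data from the demonstration.

```python
from collections import Counter


def count_phases(pod_list):
    """Count pods by `status.phase` in a Kubernetes pod-list-shaped dict."""
    return Counter(
        pod.get("status", {}).get("phase", "Unknown")
        for pod in pod_list.get("items", [])
    )


# Made-up sample in the shape of `kubectl get pods -o json` output.
sample = {
    "items": [
        {"metadata": {"name": "pod-1"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "pod-2"}, "status": {"phase": "Pending"}},
        {"metadata": {"name": "pod-3"}, "status": {"phase": "Pending"}},
    ]
}

phases = count_phases(sample)
print(phases["Pending"], phases["Running"])  # 2 1
```

A pending count that rises after the outage and then falls as pods are rescheduled is consistent with the rebalancing described above.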

Next, Bill opens ArcGIS Enterprise Manager to see whether any critical logs have been recorded. The lack of critical logs shows that the enhanced availability architecture profile is providing the expected resiliency.

Bill then performs a basic health check on the core framework services, which shows that the relational store health check has failed.

The results of the basic health check, showing that the relational store health check has failed.

When Bill validates the relational data store, it returns a warning status. This means that the primary relational store is healthy, but the standby is not.

The results after validating the relational store in ArcGIS Enterprise Manager.

Though no longer highly available, the organization remains available in this degraded state. Bill moves on to monitoring the organization, using his Grafana dashboard to track current performance.

The first chart in the dashboard shows the successful HTTP requests over time. This chart shows that while there is an impact to throughput while Kubernetes is rebalancing, all services are continuing to respond and are returning to their expected performance.

A chart monitoring the HTTP requests occurring over time.

The second chart shows request failures. While some errors have occurred, this chart also shows that critical services remain operational even while Kubernetes is working to rebalance after the outage.

A chart showing the error count of all requests over time.
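To make the relationship between the two charts concrete, the Python sketch below combines per-interval counts of successful and failed requests into an error rate. The helper error_rates and the sample counts are invented for illustration. They are not taken from Chris and Bill's dashboard, and the dashboard itself may compute its panels differently.

```python
def error_rates(samples):
    """Return (timestamp, failed / total) for each interval with traffic."""
    rates = []
    for timestamp, succeeded, failed in samples:
        total = succeeded + failed
        if total == 0:
            continue  # no traffic in this interval, so the rate is undefined
        rates.append((timestamp, failed / total))
    return rates


# Made-up per-minute counts: (minute, successful requests, failed requests).
samples = [(0, 100, 0), (1, 60, 20), (2, 0, 0), (3, 95, 5)]

for minute, rate in error_rates(samples):
    print(minute, f"{rate:.2f}")
# 0 0.00
# 1 0.25
# 3 0.05
```

Reading the error rate alongside raw throughput helps separate a brief dip during rebalancing from a sustained failure.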

Conclusion

During their presentation, Chris and Bill showed how ArcGIS Enterprise on Kubernetes remains reliable, resilient, and operational, even when faced with critical failures.

For more information on ArcGIS Enterprise on Kubernetes, and to see whether this deployment solution is the right one for your organization, refer to our documentation and other blogs.

About the authors

Tori O’Hara

Tori is a technical writer on the ArcGIS Enterprise team.

Bill is a lead software development engineer on the ArcGIS Enterprise team, focused on ArcGIS Notebooks, security, Kubernetes, and framework development.
