Fault tolerance pitfalls and resolutions in Microsoft Azure Cloud


Ensuring continuous service availability is crucial for any cloud-based application. Microsoft Azure offers several built-in mechanisms to handle failures, whether they stem from hardware issues (such as hard-disk crashes) or temporary availability problems with dependent services (such as storage or networking). Because Azure’s infrastructure is software-controlled, these mechanisms can detect failures and recover from them automatically, keeping services available.

Azure’s infrastructure, managed by the Fabric Controller, is designed to detect and respond to failures quickly. For instance, if a virtual machine (VM) encounters a hardware problem, the Fabric Controller moves the VM to another physical node without user intervention. Similarly, updates and upgrades are coordinated to minimize service disruptions. High availability across Azure’s services rests on two fundamental concepts: fault domains and upgrade domains.

In this article, we explore Azure’s approach to fault tolerance, its infrastructure’s role in supporting high availability, and practical insights on minimizing downtime even when failures occur. We also delve into the concepts of fault domains and upgrade domains, shedding light on how they influence both Platform-as-a-Service (PaaS) and Infrastructure-as-a-Service (IaaS) deployments.

Understanding Azure’s Datacenter Architecture

To grasp the concepts of fault and upgrade domains, it’s essential to visualize Azure’s datacenter architecture. Azure datacenters utilize an architecture known as Quantum 10, which supports higher throughput compared to previous designs. This topology implements a non-blocking, meshed network that provides a high-bandwidth backplane for each Azure datacenter.

The architecture consists of nodes arranged in racks, grouped into clusters. Each cluster is dedicated to specific tasks such as computing, storage, or SQL. The Top-Of-Rack (TOR) switch connects each rack to the rest of the network, and its failure could impact the entire rack. Each cluster has multiple Fabric Controllers that manage all nodes within the cluster, ensuring fault tolerance.

The Fabric Controller orchestrates deployments across nodes, detects failures, and automatically re-provisions affected VMs to other physical nodes as necessary. To monitor and maintain node health, each machine within a cluster has dedicated agents communicating with the Fabric Controller. Key components involved include:

  • Host OS: The operating system running on the physical machine.
  • Host Agent: Facilitates communication between the physical machine and the Fabric Controller.
  • Guest OS: The operating system running inside the VM.
  • Guest Agent: Interacts with the Host Agent to ensure VM health.

 

Fault Domains and Upgrade Domains Explained

Azure uses fault domains and upgrade domains to maintain the high availability of PaaS applications:

  • Fault Domains: A fault domain represents a physical unit of failure, closely tied to the datacenter’s physical infrastructure. For Azure, this typically corresponds to a rack of servers. Azure ensures that PaaS applications with multiple instances are distributed across multiple fault domains, minimizing the impact of hardware failures. The Fabric Controller determines the exact number of fault domains for an application based on resource availability.
  • Upgrade Domains: An upgrade domain is a logical unit that allows updates to be applied to applications without causing downtime. PaaS applications are spread across five upgrade domains by default (the count is configurable in the service definition), so updates are rolled out one domain at a time rather than to every instance at once.
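
To make the two concepts concrete, here is a minimal sketch of how instances might be spread across fault and upgrade domains. The Fabric Controller's real placement logic is internal to Azure and is not described in this article; the round-robin rule, the function name assign_domains, and the domain counts below are assumptions chosen purely for illustration.

```python
from collections import Counter


def assign_domains(instance_count, fault_domains=2, upgrade_domains=5):
    """Spread instances round-robin over fault and upgrade domains.

    Illustrative only: this is NOT the Fabric Controller's actual algorithm.
    """
    return [
        {
            "instance": i,
            "fault_domain": i % fault_domains,
            "upgrade_domain": i % upgrade_domains,
        }
        for i in range(instance_count)
    ]


placement = assign_domains(6)
per_fault_domain = Counter(p["fault_domain"] for p in placement)
per_upgrade_domain = Counter(p["upgrade_domain"] for p in placement)

# Worst case if a single rack (fault domain) fails: the fullest domain is lost.
worst_case_loss = max(per_fault_domain.values())
print("instances per fault domain:", dict(per_fault_domain))
print("instances per upgrade domain:", dict(per_upgrade_domain))
print("instances lost if one fault domain fails:", worst_case_loss)
```

With six instances and two fault domains, a single rack failure takes out half of the instances in this sketch, which is why the number of fault domains matters as much as the number of instances.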

Ensuring High Availability with Availability Sets for IaaS VMs

For Infrastructure-as-a-Service (IaaS) VMs, Azure offers Availability Sets to achieve high availability. VMs within an availability set are distributed across two or more fault domains and are assigned different upgrade domain values. This setup ensures that even if one fault domain experiences issues, the other instances remain operational.

However, it’s important to note that the availability-set Service-Level Agreement (SLA) applies only when two or more VMs are deployed in the same availability set. Without an availability set, there’s a higher risk of service interruptions during failures or routine maintenance.

Example Scenario:

When Azure deploys OS updates across datacenters, it updates the host OS one fault domain at a time and the guest OS one upgrade domain at a time. This approach ensures that applications remain available during updates, provided there are at least two instances per service or two VMs within an availability set.
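
The reason for the two-instance requirement can be sketched with a small simulation. This is an illustration under stated assumptions, not Azure's update mechanism: the instance names, the mapping of instances to upgrade domains, and the helper names rolling_update and min_available are all hypothetical.

```python
def rolling_update(upgrade_domain_of):
    """Yield (upgrade_domain, instances_still_serving) for each update step.

    upgrade_domain_of maps an instance name to its upgrade domain index.
    Every instance in the domain being updated is treated as offline.
    """
    for domain in sorted(set(upgrade_domain_of.values())):
        serving = [name for name, ud in upgrade_domain_of.items() if ud != domain]
        yield domain, serving


def min_available(upgrade_domain_of):
    """Smallest number of instances serving at any point in the rollout."""
    return min(len(serving) for _, serving in rolling_update(upgrade_domain_of))


two_instances = {"web-0": 0, "web-1": 1}  # hypothetical two-instance service
single_instance = {"web-0": 0}            # hypothetical single-instance service

print(min_available(two_instances))    # 1: one instance always serves
print(min_available(single_instance))  # 0: the service is down during its update
```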

 

Overcoming Challenges with Fault Domains

While fault domains and upgrade domains work well for stateless applications like web services, maintaining high availability for stateful workloads (such as databases) can be more challenging. For example, a database cluster may rely on quorum votes for node elections, making it critical that enough nodes remain operational to maintain the cluster’s health.

Azure guarantees that VMs within an availability set are deployed across at least two fault domains. However, in practice, this may mean exactly two fault domains, which can pose a risk for stateful workloads if nodes are lost during an upgrade or hardware failure.
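
A short calculation shows why exactly two fault domains is a problem for quorum-based clusters. The sketch below assumes a simple majority quorum and voters spread as evenly as possible across fault domains; real database clusters may use different quorum rules, and the function names are hypothetical.

```python
import math


def quorum(voters):
    """Simple majority: more than half of all voters."""
    return voters // 2 + 1


def survives_one_domain_loss(voters, fault_domains):
    """True if a majority remains after losing the fullest fault domain,
    assuming voters are spread as evenly as possible."""
    largest_domain = math.ceil(voters / fault_domains)
    return voters - largest_domain >= quorum(voters)


for voters in (3, 5):
    for domains in (2, 3):
        ok = survives_one_domain_loss(voters, domains)
        print(f"{voters} voters, {domains} fault domains -> survives: {ok}")
```

Under these assumptions, no voter count survives with two fault domains: the fuller domain always holds at least half of the voters, so losing it leaves fewer than a majority. With three fault domains, both the three-voter and five-voter clusters keep quorum.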

 

Strategies for High Availability in Stateful Services

For stateful services like databases, achieving high availability may require deploying across more than two fault domains. Availability sets for version 2 IaaS VMs, deployed through Azure Resource Manager, can span up to three fault domains in regions that support them, which addresses some of these concerns. Where only two fault domains are available, users can mitigate risks by:

  1. Deploying a mix of VMs within and outside an availability set to stagger the impact of updates.
  2. Strategically placing critical nodes, such as database masters and quorum voters, across fault domains to maintain service continuity.
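
A placement plan for quorum voters can be checked before deployment with a few lines of code. The following is a sketch under assumptions: it uses a simple majority quorum, and the node names, fault domain labels, and the function domains_that_break_quorum are hypothetical rather than part of any Azure API.

```python
from collections import defaultdict


def domains_that_break_quorum(voter_domains):
    """Return the fault domains whose loss leaves fewer than a majority.

    voter_domains maps a voter name to its fault domain label.
    """
    by_domain = defaultdict(list)
    for voter, domain in voter_domains.items():
        by_domain[domain].append(voter)

    majority = len(voter_domains) // 2 + 1
    return sorted(
        domain
        for domain, voters in by_domain.items()
        if len(voter_domains) - len(voters) < majority
    )


# Hypothetical three-voter database cluster, placed two different ways.
two_domain_plan = {"db-primary": "FD0", "db-replica": "FD1", "witness": "FD0"}
three_domain_plan = {"db-primary": "FD0", "db-replica": "FD1", "witness": "FD2"}

print(domains_that_break_quorum(two_domain_plan))    # ['FD0']
print(domains_that_break_quorum(three_domain_plan))  # []
```

An empty result means no single fault domain failure can take the cluster below a majority; any label in the result marks a domain that holds too many voters.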

By following these strategies, businesses can minimize downtime and ensure their services remain resilient, even during failures or scheduled maintenance.

With AVASOFT’s expertise in Microsoft Azure solutions, you can confidently design, deploy, and manage resilient cloud infrastructure. Our tailored strategies and solutions help you leverage Azure’s capabilities to achieve maximum uptime and performance, ensuring business continuity even in the face of unexpected disruptions.

Reach out to AVASOFT to learn more about how we can enhance your cloud architecture for better reliability and efficiency.
