Failure/Recovery from failure of Site/Node. Dell AX-640, AX-6515, AX-740XD, AX-7525



Failure/Recovery from failure of Site/Node

This chapter presents the following topics:


Planned failover

Operation steps

Planned failover

Windows Admin Center has a Switch Direction feature that allows you to migrate workloads from one site to the other. This must be initiated on each volume. VMs hosted on the volumes follow the volumes to the migrated site after 10 minutes. This feature is helpful in scenarios such as:

● There is a planned downtime

● A potential weather event could take the site down

To use the Switch Direction feature, go to Windows Admin Center and select Storage Replica on the left pane. Then select the SR Partnership for which you would like to change the Replication Direction. Select More and click Switch Direction.
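Because the switch must be initiated on each volume, it can help to prepare one command per partnership in advance. The following is a minimal sketch, not a procedure from this manual: it assumes the direction is reversed with the Storage Replica cmdlet Set-SRPartnership instead of Windows Admin Center, and every node name and replication group name shown is hypothetical. The helper only builds command text; it does not run anything.

```python
def ps_quote(value: str) -> str:
    """Wrap a value in PowerShell single quotes, doubling embedded quotes."""
    return "'" + value.replace("'", "''") + "'"


def switch_direction_command(new_source_node: str, new_source_group: str,
                             new_destination_node: str,
                             new_destination_group: str) -> str:
    """Build a Set-SRPartnership command line that reverses one partnership.

    The node that currently holds the replica becomes the new source.
    """
    return (
        "Set-SRPartnership"
        f" -NewSourceComputerName {ps_quote(new_source_node)}"
        f" -SourceRGName {ps_quote(new_source_group)}"
        f" -DestinationComputerName {ps_quote(new_destination_node)}"
        f" -DestinationRGName {ps_quote(new_destination_group)}"
    )


# Hypothetical partnerships, one per replicated volume:
# (new source node, its group, new destination node, its group)
partnerships = [
    ("Site2-Node1", "Replica-Vol1", "Site1-Node1", "Source-Vol1"),
    ("Site2-Node1", "Replica-Vol2", "Site1-Node1", "Source-Vol2"),
]

commands = [switch_direction_command(*p) for p in partnerships]
for command in commands:
    print(command)
```

Review the generated commands against the output of Get-SRPartnership before running them, because the group names must match the existing partnership exactly.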

In the event of a site failure, if a volume is replicating synchronously then the data and the log volume automatically come online on the surviving site, along with VMs associated with this volume because the RPO is 0. For asynchronous replication the data and the log volume do not come online automatically because the RPO is not equal to 0.
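The rule above can be captured as a small lookup when planning a runbook. This is an illustrative sketch only: the volume names are hypothetical, and the function simply restates the behavior described in this section (synchronous volumes come online automatically, asynchronous volumes do not).

```python
from enum import Enum
from typing import Dict, List


class ReplicationMode(Enum):
    SYNCHRONOUS = "synchronous"
    ASYNCHRONOUS = "asynchronous"


def comes_online_automatically(mode: ReplicationMode) -> bool:
    """Synchronous volumes (RPO of 0) come online on the surviving site."""
    return mode is ReplicationMode.SYNCHRONOUS


def volumes_needing_manual_action(volumes: Dict[str, ReplicationMode]) -> List[str]:
    """Return the volumes an operator must bring online after a site failure."""
    return sorted(
        name for name, mode in volumes.items()
        if not comes_online_automatically(mode)
    )


# Hypothetical inventory of replicated volumes.
inventory = {
    "Volume01": ReplicationMode.SYNCHRONOUS,
    "Volume02": ReplicationMode.ASYNCHRONOUS,
    "Volume03": ReplicationMode.SYNCHRONOUS,
}
print(volumes_needing_manual_action(inventory))
```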

When the failed site comes back online, the Replica and Replica-Log volumes are moved to the primary site with persistent disk reservations, and replication begins again. For a synchronously replicated volume, the replication direction cannot be changed until replication is 100 percent complete.

Operation steps

The following sections describe the steps to take in the event of different failure types.

Node failure

Handling a node failure on either site in a stretched cluster environment is no different from managing one in a traditional or standalone Azure Stack HCI cluster. A complete node failure can result from operating system or HBA corruption, or from complete hardware failure on the node. In either case, restoring system functionality is the priority.

The high-level steps to do this are:

1. Replace the hardware as needed.

2. Re-install the operating system on the operating system drives (if needed).

3. Join the system to the domain.

4. Ensure you assign the new node IPs specific to the site where the node is hosted.

5. Add the node to the existing stretched cluster.

6. Based on the IP subnets used or the Cluster Fault Domains added, the cluster adds the drives to the correct pool.

7. Wait for the storage jobs to complete.

8. Throughout this process, the workloads on the affected site keep running and replication should not be interrupted.
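Step 7 is usually a polling loop in practice. The following is a minimal sketch under stated assumptions: list_running_jobs is a hypothetical callable that returns the names of storage jobs still in progress (for example, a wrapper around the output of Get-StorageJob), and the polling interval and timeout are illustrative values, not recommendations from this manual.

```python
import time
from typing import Callable, Sequence


def wait_for_storage_jobs(list_running_jobs: Callable[[], Sequence[str]],
                          poll_seconds: float = 30.0,
                          timeout_seconds: float = 3600.0,
                          sleep: Callable[[float], None] = time.sleep,
                          clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until no storage jobs remain.

    Returns True when the job list is empty, or False if the timeout
    expires while jobs are still running.
    """
    deadline = clock() + timeout_seconds
    while True:
        if not list_running_jobs():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)


# Example with a stand-in job source: two polls report a repair job,
# the third reports that nothing is running.
_fake_results = iter([["Repair"], ["Repair"], []])
finished = wait_for_storage_jobs(lambda: next(_fake_results),
                                 sleep=lambda seconds: None)
print(finished)
```

The sleep and clock parameters exist only so the loop can be exercised without real delays; leave them at their defaults in normal use.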


Site failure

A site failure in a stretched cluster topology requires rebuilding all of the nodes of the affected site. If the failure happens at the primary site, the following scenarios occur:

● All volumes hosted on the affected site and associated VMs become inaccessible.

● After a brief period, the volumes move to the secondary site.

● The VMs restart on the secondary site.

● Depending on whether synchronous or asynchronous replication is being used, you either have zero data loss or data loss within the limits of the defined RPO:

○ For the replica volumes configured with synchronous replication, the VMs are crash consistent. Application recovery depends on the available backup/recovery of the application.

○ For the replica volumes configured with asynchronous replication, the VMs are not crash consistent. The default RPO is 30 seconds. It can be configured using PowerShell or Windows Admin Center. Application recovery still depends on the available backup/recovery of the application.

Site recovery

Follow these steps to recover the nodes on the failed site:

1. Remove the failed nodes from the cluster and remove their computer names from Active Directory.

2. Remove SRPartnership and SRGroups using PowerShell cmdlets. Replication can also be disabled from the Failover Cluster Manager.

3. Bring up all the nodes on the affected site. The node names and IPs used should be the same as those used before the crash.

4. Join the nodes to the domain.

5. Add all the nodes to the existing stretched cluster at the same time.

6. All drives in the new site will be added to a new pool.

7. Re-create and enable replication for replica volumes and associated log volumes using Failover Cluster Manager or PowerShell cmdlets.
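Step 3 requires the rebuilt nodes to reuse the node names and IPs that were in place before the crash. A quick comparison against a record of the original values can catch mistakes before the nodes are added to the cluster. This is a minimal sketch: the node names and addresses are hypothetical, and it assumes such a record was kept somewhere outside the failed site.

```python
from typing import Dict, List


def identity_mismatches(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Compare node name -> IP mappings and describe every difference."""
    problems = []
    for name, ip in sorted(before.items()):
        if name not in after:
            problems.append(f"{name}: missing from rebuilt site")
        elif after[name] != ip:
            problems.append(f"{name}: IP {after[name]} differs from original {ip}")
    for name in sorted(set(after) - set(before)):
        problems.append(f"{name}: name was not in use before the failure")
    return problems


# Hypothetical record of the site before the failure and after the rebuild.
recorded = {"Site1-Node1": "192.168.10.11", "Site1-Node2": "192.168.10.12"}
rebuilt = {"Site1-Node1": "192.168.10.11", "Site1-Node2": "192.168.10.22"}

for problem in identity_mismatches(recorded, rebuilt):
    print(problem)
```

An empty result means the rebuilt nodes match the recorded names and IPs.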
