CHAPTER 30

Split Brain Recovery
Split brain mode refers to a state in which each database server does not know the high availability (HA)
role of its redundant peer, and cannot determine which server currently has the primary HA role. In split
brain mode, data modifications may have been made on either node, and those changes may not be
replicated to the peer. Also, neither or both nodes may be functioning in the primary HA role.
Split brain mode occurs when there is a temporary failure of the network connections between the two database servers, for example, due to one of the following occurrences:

• Restart of either database server during synchronization.
• Physical disconnection of the Ethernet cables from a database server.
• Loss of power to one or both database servers.

Note   If three or more servers failed, see the "Recovering from a Situation Where Three or More Servers Failed" section on page 33-1.
This chapter includes the following topics:

• Diagnosing Split Brain Mode, page 30-1
• Recovering from Split Brain Mode, page 30-3
• Verifying Synchronization of the Database Servers, page 30-4
• Diagnosing Corrupted DRBD Metadata, page 30-6
• Recovering from Corrupted DRBD Metadata, page 30-6
Diagnosing Split Brain Mode
Use this procedure to determine whether your database servers are in split brain mode.
Before You Begin
Make sure that the database servers are correctly cabled. See the “Cabling Requirements for the
Database Servers” section on page 4-3.
Installation and Administration Guide for the Cisco TelePresence Exchange System Release 1.1
30-1
Procedure
Step 1   Log in to the CLI of each database server.

Step 2   On each database server, enter the utils service database status command.
Step 3   Examine the output. If the role values are Primary/Secondary on one server and Secondary/Primary on the other, with only one server reporting a current HA role of primary, the database servers are not in split brain mode.

Any of the following conditions indicates split brain mode:

• The Connection Sync Status field is "StandAlone."
• The role values display one of the following combinations:
  – "Primary/Unknown" on one server and "Secondary/Unknown" on the other server.
  – "Secondary/Secondary" on both servers. In this particular case, if the connection sync status on both servers is "Connected," then the MySQL database is corrupted, and the split brain recovery procedure will not help. Instead, see the "Corrupted MySQL Database Recovery" chapter.
  – "Secondary/Unknown" on both servers. In this particular case, if you know that one of the database servers was rebooted during the initial synchronization process, your database system is functioning in a mode for which the split brain recovery procedure will not help. To recover, you need to reinstall both database servers. See the "Installing and Synchronizing the Cisco TelePresence Exchange System Database Servers" section on page 5-4.

Step 4   To recover from split brain mode, proceed to the "Recovering from Split Brain Mode" section on page 30-3.
Example
In the following example, one of the database servers is StandAlone, which indicates that the nodes are
in split brain mode:
admin: utils service database status
--------------------------------------------------------------------------------
The initial configured HA role of this node   : secondary
The current HA role of this node              : primary
The database vip address                      : 10.22.130.54
Node name                                     : ctx-db-1
Node IP address                               : 10.22.130.49
Corosync status                               : Running PID <18613>
Current Designated Controller (DC)            : ctx-db-2 - partition with quorum
MySQL status                                  : Running pid 2810
Connection Sync Status                        : StandAlone
Role (this-node/peer-node)                    : Primary/Unknown
Disk Status (this-node/peer-node)             : UpToDate/DUnknown
--------------------------------------------------------------------------------
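The conditions listed in this section can also be checked mechanically. The following POSIX shell sketch is hypothetical (it is not part of the product CLI) and assumes that you have saved the output of utils service database status from one node into a text file; the field labels are taken from the sample output above, and the function names are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: classify one node's saved status transcript.

# Print the value after the colon for a given field label.
status_field() {
  awk -F': *' -v label="$2" 'index($1, label) == 1 { print $2; exit }' "$1"
}

# Apply the split brain conditions described in this section.
diagnose_split_brain() {
  sync=$(status_field "$1" "Connection Sync Status")
  role=$(status_field "$1" "Role (this-node/peer-node)")
  case "$sync:$role" in
    StandAlone:*)          echo "split brain: Connection Sync Status is StandAlone" ;;
    *:Primary/Unknown)     echo "split brain: peer role unknown" ;;
    *:Secondary/Unknown)   echo "split brain: peer role unknown" ;;
    *:Secondary/Secondary) echo "check for MySQL corruption: see the Corrupted MySQL Database Recovery chapter" ;;
    *)                     echo "no split brain detected on this node" ;;
  esac
}
```

Note that a transcript from each node must be checked separately; the "Secondary/Unknown on both servers" case requires comparing the results from both nodes.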
Related Topics
• Command Reference, page C-1
Recovering from Split Brain Mode
Use this procedure to recover your database servers from split brain mode.
Before You Begin
• Complete the "Diagnosing Split Brain Mode" section on page 30-1 to confirm that your system is in split brain mode.
• Decide which node has the data that you want to keep. In this procedure, you will give this node the primary HA role. All data on the other node will be lost during this procedure and will not be recoverable.
  If you do not know which node has the most recent or most valuable data, follow these recommendations:
  – If the utils service database status command output on both nodes indicates that one node currently has the primary HA role while the other node currently has the secondary HA role, you should choose the current primary node to keep as the primary database server.

admin: utils service database status
--------------------------------------------------------------------------------
The initial configured HA role of this node   : primary
The current HA role of this node              : primary
The database vip address                      : 10.22.130.54
…

  – If the utils service database status command output on both nodes indicates that neither or both nodes have the primary HA role, choose the node that you initially installed as the primary server to keep as the primary database server.

admin: utils service database status
--------------------------------------------------------------------------------
The initial configured HA role of this node   : primary
The current HA role of this node              : secondary
The database vip address                      : 10.22.130.54
…
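The recommendations above amount to a simple selection rule, sketched here as a hypothetical shell function. The function name, argument order, and node names are placeholders for illustration only; they are not product commands.

```shell
#!/bin/sh
# Hypothetical sketch of the node-selection rule described above.
# Arguments:
#   $1 = current HA role of the first node  (primary|secondary)
#   $2 = current HA role of the second node (primary|secondary)
#   $3 = name of the first node
#   $4 = name of the second node
#   $5 = name of the node that was initially installed as primary
choose_keep_node() {
  if [ "$1" = "primary" ] && [ "$2" != "primary" ]; then
    echo "$3"   # exactly one current primary: keep it
  elif [ "$2" = "primary" ] && [ "$1" != "primary" ]; then
    echo "$4"   # exactly one current primary: keep it
  else
    echo "$5"   # neither or both primary: keep the initial primary
  fi
}
```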
Procedure
Step 1   Log in to the CLI of the database server that has the data that you want to keep.

Step 2   Enter the utils service database drbd keep-node command to reset the server to currently function in the primary HA role.

admin: utils service database drbd keep-node
This command will make this node as Primary
Trying to assume primary role......... [Done]
Reconnecting to MySQL......... [Done]
Step 3   Log in to the CLI of the other database server.

Step 4   Enter the utils service database drbd discard-node command to reset the server to currently function in the secondary HA role.

admin: utils service database drbd discard-node
This command will make this node as Secondary
Trying to assume secondary role......... [Done]
Ensuring DRBD volume unmounted...
Ensuring DRBD role is Secondary...
Discarding local MySQL data..... [Done]
Synchronization begins between the two database servers.
Step 5   Proceed to the "Verifying Synchronization of the Database Servers" section on page 30-4.
Related Topics
• Command Reference, page C-1
Verifying Synchronization of the Database Servers
Procedure
Step 1   Log in to the CLI of each database server.

Step 2   On each database server, enter the utils service database status command.
The following examples show that synchronization is in progress and proceeding successfully, because
each node is aware of the HA role of its redundant peer, and the output displays the percentage of the
synchronization progress. Also, the current primary database server identifies itself as the SyncSource,
while the current secondary database server identifies itself as the SyncTarget.
Sample output from the current primary database server:
admin: utils service database status
--------------------------------------------------------------------------------
The initial configured HA role of this node   : primary
The current HA role of this node              : primary
The database vip address                      : 10.22.130.54
Node name                                     : ctx-db-1
Node IP address                               : 10.22.130.49
Corosync status                               : Running PID <20414>
Current Designated Controller (DC)            : ctx-db-2 - partition with quorum
MySQL status                                  : Running pid 10100
Connection Sync Status                        : SyncSource
Role (this-node/peer-node)                    : Primary/Secondary
Disk Status (this-node/peer-node)             : UpToDate/Inconsistent
Sample output from the current secondary database server:
admin: utils service database status
--------------------------------------------------------------------------------
The initial configured HA role of this node   : secondary
The current HA role of this node              : secondary
The database vip address                      : 10.22.130.54
Node name                                     : ctx-db-1
Node IP address                               : 10.22.130.49
Corosync status                               : Running PID <17842>
Current Designated Controller (DC)            : ctx-db-2 - partition with quorum
MySQL status                                  : Not running (only runs on database server with current role primary.)
Connection Sync Status                        : SyncTarget
Role (this-node/peer-node)                    : Secondary/Primary
Disk Status (this-node/peer-node)             : Inconsistent/UpToDate
--------------------------------------------------------------------------------
Note   The synchronization takes approximately 40 minutes. During this time, the disk status value of the current secondary server is shown as Inconsistent. An Inconsistent disk state indicates that the synchronization between the database servers is not complete.

Step 3   To confirm that the synchronization is complete, enter the utils service database status command on both the primary and secondary database servers.
The following examples show that synchronization is complete, because the disk status of the current
secondary server is now up to date.
Sample output from the primary database server:
admin: utils service database status
--------------------------------------------------------------------------------
The initial configured HA role of this node   : primary
The current HA role of this node              : primary
The database vip address                      : 10.22.130.54
Node name                                     : ctx-db-1
Node IP address                               : 10.22.130.49
Corosync status                               : Running PID <20414>
Current Designated Controller (DC)            : ctx-db-2 - partition with quorum
MySQL status                                  : Running pid 10100
Connection Sync Status                        : Connected
Role (this-node/peer-node)                    : Primary/Secondary
Disk Status (this-node/peer-node)             : UpToDate/UpToDate
--------------------------------------------------------------------------------
Sample output from the secondary database server:
admin: utils service database status
--------------------------------------------------------------------------------
The initial configured HA role of this node   : secondary
The current HA role of this node              : secondary
The database vip address                      : 10.22.130.54
Node name                                     : ctx-db-1
Node IP address                               : 10.22.130.49
Corosync status                               : Running PID <17842>
Current Designated Controller (DC)            : ctx-db-2 - partition with quorum
MySQL status                                  : Not running (only runs on database server with current role primary.)
Connection Sync Status                        : Connected
Role (this-node/peer-node)                    : Secondary/Primary
Disk Status (this-node/peer-node)             : UpToDate/UpToDate
--------------------------------------------------------------------------------
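The completion check described in this section can be applied to a saved status transcript. This is a hypothetical sketch (the function name is a placeholder, and the field labels come from the sample output above): it succeeds only when the transcript shows a Connected sync status and an UpToDate/UpToDate disk status.

```shell
#!/bin/sh
# Hypothetical check: exit 0 when a saved "utils service database status"
# transcript shows that synchronization is complete.
sync_complete() {
  awk -F': *' '
    index($1, "Connection Sync Status") == 1            { sync = $2 }
    index($1, "Disk Status (this-node/peer-node)") == 1 { disk = $2 }
    END { exit !(sync == "Connected" && disk == "UpToDate/UpToDate") }
  ' "$1"
}
```

Run the check against a transcript from each node; synchronization is complete only when both nodes pass.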
Tip   If this verification procedure shows that the split brain recovery procedure did not work for either or both servers, proceed to the "Diagnosing Corrupted DRBD Metadata" section on page 30-6.
Diagnosing Corrupted DRBD Metadata
If, after you complete the split brain recovery procedure, the database servers still cannot connect to each other and complete synchronization, the metadata for the Distributed Replicated Block Device (DRBD) may be corrupted. DRBD synchronizes the secondary database with changes that are made on the primary database.
Before You Begin
This procedure applies only after you attempt split brain recovery. (See the “Recovering from Split Brain
Mode” section on page 30-3.)
Procedure
Step 1   Log in to the CLI of each database server.

Step 2   On each database server, enter the utils service database status command.

Step 3   Examine the output. The DRBD metadata is corrupted if the disk status value is "Inconsistent/Inconsistent" while the connection sync status is "StandAlone" or "WFConnection" on one or both servers.

Step 4   To recover from corrupted DRBD metadata, proceed to the "Recovering from Corrupted DRBD Metadata" section on page 30-6.
Example
In the following example, the status of one database server indicates that the nodes have corrupted
DRBD metadata:
admin: utils service database status
--------------------------------------------------------------------------------
The initial configured HA role of this node   : secondary
The current HA role of this node              : secondary
The database vip address                      : 10.22.130.54
Node name                                     : ctx-db-1
Node IP address                               : 10.22.130.49
Corosync status                               : Running PID <11459>
Current Designated Controller (DC)            : ctx-db-2 - partition with quorum
MySQL status                                  : Not running (only runs on database server with current role primary.)
Connection Sync Status                        : WFConnection
Role (this-node/peer-node)                    : Secondary/Unknown
Disk Status (this-node/peer-node)             : Inconsistent/Inconsistent
--------------------------------------------------------------------------------
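The signature described in Step 3 can likewise be tested against a saved status transcript. A hypothetical sketch, with a placeholder function name and field labels taken from the example above:

```shell
#!/bin/sh
# Hypothetical check: exit 0 when a saved status transcript matches the
# corrupted-DRBD-metadata signature described in this section.
drbd_metadata_corrupted() {
  awk -F': *' '
    index($1, "Connection Sync Status") == 1            { sync = $2 }
    index($1, "Disk Status (this-node/peer-node)") == 1 { disk = $2 }
    END {
      stuck = (sync == "StandAlone" || sync == "WFConnection")
      exit !(disk == "Inconsistent/Inconsistent" && stuck)
    }
  ' "$1"
}
```

Because the condition may appear on one or both servers, check a transcript from each node.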
Related Topics
• Command Reference, page C-1
Recovering from Corrupted DRBD Metadata
Before You Begin
• Make sure that the database servers are correctly cabled. See the "Cabling Requirements for the Database Servers" section on page 4-3.
• Complete the "Diagnosing Corrupted DRBD Metadata" section on page 30-6 to confirm that your system has corrupted DRBD metadata.
Procedure
Step 1   Log in to the CLI of the database server that has the data that you want to keep.

This should be the same node whose data you decided to keep when you completed the procedure in the "Recovering from Split Brain Mode" section on page 30-3.

Step 2   Enter the utils service database drbd force-keep-node command to reset the DRBD metadata and set the server to currently function in the primary HA role.

admin: utils service database drbd force-keep-node
This command will make this node as Primary
Trying to assume primary role......... [Done]
Overwriting peer data... [Done]
Step 3   Log in to the CLI of the other database server.

Step 4   Enter the utils service database drbd force-discard-node command to reset the DRBD metadata and set the server to currently function in the secondary HA role.
admin: utils service database drbd force-discard-node
Shutting down Heartbeat...
Stopping High-Availability services:
[ OK ]
Ensuring DRBD volume unmounted...
umount: /dev/drbd0: not mounted
Taking down DRBD Resource...
Recreating DRBD meta-data...
NOT initialized bitmap
Bringing up DRBD...
Starting Heartbeat...
Starting High-Availability services:
[ OK ]
[Done]
Synchronization begins between the two database servers.
Step 5   Proceed to the "Verifying Synchronization of the Database Servers" section on page 30-4.
Related Topics
• Command Reference, page C-1