This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

GracefulMasterTakeover logic flaw in semi-sync replication scenario #1508

Open
Fanduzi opened this issue Sep 14, 2024 · 0 comments
Fanduzi commented Sep 14, 2024

Consider the following topology:

                                                         ,- replica1 (Rpl_semi_sync_slave_status=1)
                                                        ,
                                                       ,
  master (
          Rpl_semi_sync_master_status=1
          rpl_semi_sync_master_wait_no_slave=1         - - replica2 (Rpl_semi_sync_slave_status=1)
          rpl_semi_sync_master_wait_for_slave_count=2
         )  
                                                       `
                                                        `
                                                         `- replica3 (Rpl_semi_sync_slave_status=1)

Assume that replica1 is the designated new master. Based on the source code, Orchestrator first has replica1 take over its siblings, that is, the old master’s other replicas (replica2 and replica3).

The topology then becomes:

                                                                                                    ,- replica2 (Rpl_semi_sync_slave_status=1)
  master (                                                                                        ,
          Rpl_semi_sync_master_status=1
          rpl_semi_sync_master_wait_no_slave=1         - replica1 (Rpl_semi_sync_slave_status=1)
          rpl_semi_sync_master_wait_for_slave_count=2
         )                                                                                        `
                                                                                                    `- replica3 (Rpl_semi_sync_slave_status=1)

This presents a problem. The old master has rpl_semi_sync_master_wait_for_slave_count=2 but now has only one semi-sync replica (replica1), so every DML commit on it blocks while waiting for a second ACK that cannot arrive.
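To make the condition concrete, here is a minimal sketch of the blocking rule as I understand it. It is a simplified model, not MySQL's or Orchestrator's code: the class and method names are hypothetical, and it only encodes "fewer attached semi-sync replicas than rpl_semi_sync_master_wait_for_slave_count, with rpl_semi_sync_master_wait_no_slave=1, means commits wait until the timeout".

```python
from dataclasses import dataclass


@dataclass
class SemiSyncMaster:
    """Hypothetical, simplified model of the old master's semi-sync settings."""
    wait_for_slave_count: int   # rpl_semi_sync_master_wait_for_slave_count
    wait_no_slave: bool         # rpl_semi_sync_master_wait_no_slave
    semi_sync_replicas: int     # replicas currently attached with semi-sync on

    def commit_waits_for_ack(self) -> bool:
        """True if a commit would stall until rpl_semi_sync_master_timeout."""
        if self.semi_sync_replicas >= self.wait_for_slave_count:
            return False  # enough replicas can ACK
        # Too few replicas: with wait_no_slave=1 the master keeps waiting.
        return self.wait_no_slave


# Before the takeover: replica1, replica2 and replica3 are attached.
before = SemiSyncMaster(wait_for_slave_count=2, wait_no_slave=True, semi_sync_replicas=3)
# After replica1 takes over its siblings: only replica1 is attached.
after = SemiSyncMaster(wait_for_slave_count=2, wait_no_slave=True, semi_sync_replicas=1)

print(before.commit_waits_for_ack())  # False
print(after.commit_waits_for_ack())   # True
```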

Next, Orchestrator attempts to set read_only=1 on the old master. Enabling read_only has to wait for in-flight commits to finish, and those commits are blocked as described above, so the set read_only=1 statement blocks as well. If rpl_semi_sync_master_timeout is set so high that it is effectively infinite, the switchover hangs indefinitely, because ExecInstance has no timeout.
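For illustration only, this is the kind of deadline I would expect around such a statement. It is a generic Python sketch, not Orchestrator's ExecInstance: `run_with_deadline` and `stuck_set_read_only` are hypothetical names, and the stub merely simulates a statement that never returns.

```python
import concurrent.futures
import threading


def run_with_deadline(fn, timeout_s):
    """Run fn in a worker thread; return (finished, result) without waiting past timeout_s."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return True, future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return False, None
    finally:
        # Do not join the worker here, otherwise the caller would hang too.
        pool.shutdown(wait=False)


release = threading.Event()


def stuck_set_read_only():
    # Hypothetical stand-in for "set read_only=1" stuck behind blocked commits.
    release.wait()
    return "read_only=1"


finished, result = run_with_deadline(stuck_set_read_only, timeout_s=0.2)
print(finished, result)  # False None
release.set()  # let the worker thread exit so the script can terminate
```

Note that a deadline like this only stops the caller from waiting; the statement itself would still be pending on the server, so a real fix would also have to cancel it or change the order of the steps.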

Even if rpl_semi_sync_master_timeout is finite, the stall can add up to that timeout to the switchover, which prolongs the impact on the business.

In contrast, MHA’s switchover avoids this issue because it proceeds as follows:

  1. Block writes on the old master.
  2. Wait for the new master to catch up, then remove its read-only restriction; at this point, the business can resume operations.
  3. Repoint all replicas of the old master (except the new master) to the new master.
  4. Finally, repoint the old master to the new master with change master.
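The difference between the two orderings can be sketched with a toy model. This is only an illustration under my own assumptions: the step names paraphrase the descriptions in this issue, the replica counts follow the topology above, and `exposed_steps` is a hypothetical helper, not code from Orchestrator or MHA.

```python
WAIT_FOR_SLAVE_COUNT = 2  # rpl_semi_sync_master_wait_for_slave_count on the old master


def exposed_steps(steps, initial_replicas=3):
    """Return the steps after which the old master still accepts writes
    but has fewer semi-sync replicas than WAIT_FOR_SLAVE_COUNT."""
    replicas = initial_replicas
    writable = True
    risky = []
    for name, replica_delta, blocks_writes in steps:
        replicas += replica_delta
        if blocks_writes:
            writable = False
        if writable and replicas < WAIT_FOR_SLAVE_COUNT:
            risky.append(name)
    return risky


# (step name, change in old master's replica count, step blocks writes?)
orchestrator_order = [
    ("move replica2 and replica3 under replica1", -2, False),
    ("set read_only=1 on old master", 0, True),
]
mha_order = [
    ("block writes on old master", 0, True),
    ("new master catches up and becomes writable", 0, False),
    ("move replica2 and replica3 under replica1", -2, False),
    ("point old master at replica1", 0, False),
]

print(exposed_steps(orchestrator_order))  # ['move replica2 and replica3 under replica1']
print(exposed_steps(mha_order))           # []
```

In this model the Orchestrator ordering leaves a window where the old master is writable with too few semi-sync replicas, while the MHA ordering never does because writes are blocked first.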

I don’t understand why Orchestrator first lets the new master (replica1) take over the old master’s siblings. This approach introduces issues that MHA avoids.

Fanduzi changed the title from "GracefulMasterTakeover logic flaw in semi-synchronous replication scenario" to "GracefulMasterTakeover logic flaw in semi-sync replication scenario" on Sep 14, 2024