Skip to content

Latest commit

 

History

History
268 lines (235 loc) · 22.7 KB

生产事故.md

File metadata and controls

268 lines (235 loc) · 22.7 KB

一、情况说明

StarRocks Version: 3.0.8-1975985

集群列表

机器名 角色
172.22.0.46 fe
172.22.0.47 be
172.22.0.48 be
172.22.0.49 be
172.22.0.62 be
172.22.0.63 be
172.22.0.64 be

现需求: 增加三台fe节点,并且下线当前fe节点:

机器名 角色
172.22.0.66 fe
172.22.0.67 fe
172.22.0.68 fe

官方文档 扩容三台fe之后, 在 172.22.0.46 机器执行 STARROCKS_HOME/bin/stop_fe.sh 命令之后,集群宕机无法访问. 相关错误信息如下:

172.22.0.66 机器的错误日志:

Caused by: com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 18.3.16) 172.22.0.66_9010_1705906345772(4):/data/starrocks/fe/meta/bdb Clock delta: 10506 ms. between Feeder: 172.22.0.68_9010_1705907421469 and this Replica exceeds max permissible delta: 5000 ms. HANDSHAKE_ERROR: Error during the handshake between two nodes. Some validity or compatibility check failed, preventing further communication between the nodes. Environment is invalid and must be closed. Originally thrown by HA thread: RepNode 172.22.0.66_9010_1705906345772(-1) Originally thrown by HA thread: RepNode 172.22.0.66_9010_1705906345772(-1)
        at com.sleepycat.je.rep.stream.ReplicaFeederHandshake.checkClockSkew(ReplicaFeederHandshake.java:432) ~[starrocks-bdb-je-18.3.16.jar:?]
        at com.sleepycat.je.rep.stream.ReplicaFeederHandshake.execute(ReplicaFeederHandshake.java:269) ~[starrocks-bdb-je-18.3.16.jar:?]
        at com.sleepycat.je.rep.impl.node.Replica.initReplicaLoop(Replica.java:709) ~[starrocks-bdb-je-18.3.16.jar:?]
        at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoopInternal(Replica.java:485) ~[starrocks-bdb-je-18.3.16.jar:?]
        at com.sleepycat.je.rep.impl.node.Replica.runReplicaLoop(Replica.java:412) ~[starrocks-bdb-je-18.3.16.jar:?]
        at com.sleepycat.je.rep.impl.node.RepNode.run(RepNode.java:1869) ~[starrocks-bdb-je-18.3.16.jar:?]
2024-01-25 11:09:40,606 INFO (UNKNOWN 172.22.0.66_9010_1705906345772(-1)|1) [BDBEnvironment.close():517] start to close epoch database
2024-01-25 11:09:40,606 INFO (UNKNOWN 172.22.0.66_9010_1705906345772(-1)|1) [BDBEnvironment.close():526] close epoch database end
2024-01-25 11:09:40,606 INFO (UNKNOWN 172.22.0.66_9010_1705906345772(-1)|1) [BDBEnvironment.close():528] start to close replicated environment
2024-01-25 11:09:40,611 INFO (UNKNOWN 172.22.0.66_9010_1705906345772(-1)|1) [BDBEnvironment.close():538] close replicated environment end
2024-01-25 11:09:40,611 ERROR (UNKNOWN 172.22.0.66_9010_1705906345772(-1)|1) [StarRocksFE.start():184] StarRocksFE start failed
com.starrocks.journal.JournalException: failed to setup environment after retried 3 times
        at com.starrocks.journal.bdbje.BDBEnvironment.setupEnvironment(BDBEnvironment.java:306) ~[starrocks-fe.jar:?]
        at com.starrocks.journal.bdbje.BDBEnvironment.setup(BDBEnvironment.java:175) ~[starrocks-fe.jar:?]
        at com.starrocks.journal.bdbje.BDBEnvironment.initBDBEnvironment(BDBEnvironment.java:153) ~[starrocks-fe.jar:?]
        at com.starrocks.journal.JournalFactory.create(JournalFactory.java:31) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.initJournal(GlobalStateMgr.java:1083) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.initialize(GlobalStateMgr.java:1032) ~[starrocks-fe.jar:?]
        at com.starrocks.StarRocksFE.start(StarRocksFE.java:130) ~[starrocks-fe.jar:?]
        at com.starrocks.StarRocksFE.main(StarRocksFE.java:82) ~[starrocks-fe.jar:?]
Caused by: com.sleepycat.je.EnvironmentFailureException: (JE 18.3.16) Environment must be closed, caused by: com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 18.3.16) 172.22.0.66_9010_1705906345772(4):/data/starrocks/fe/meta/bdb Clock delta: 10506 ms. between Feeder: 172.22.0.68_9010_1705907421469 and this Replica exceeds max permissible delta: 5000 ms. HANDSHAKE_ERROR: Error during the handshake between two nodes. Some validity or compatibility check failed, preventing further communication between the nodes. Environment is invalid and must be closed. Originally thrown by HA thread: RepNode 172.22.0.66_9010_1705906345772(-1) Originally thrown by HA thread: RepNode 172.22.0.66_9010_1705906345772(-1)
        at com.sleepycat.je.EnvironmentFailureException.wrapSelf(EnvironmentFailureException.java:230) ~[starrocks-bdb-je-18.3.16.jar:?]
        at com.sleepycat.je.dbi.EnvironmentImpl.checkIfInvalid(EnvironmentImpl.java:1835) ~[starrocks-bdb-je-18.3.16.jar:?]
        at com.sleepycat.je.dbi.EnvironmentImpl.checkOpen(EnvironmentImpl.java:1844) ~[starrocks-bdb-je-18.3.16.jar:?]
        at com.sleepycat.je.Environment.checkOpen(Environment.java:2697) ~[starrocks-bdb-je-18.3.16.jar:?]
        at com.sleepycat.je.Environment.openDatabase(Environment.java:659) ~[starrocks-bdb-je-18.3.16.jar:?]
        at com.starrocks.journal.bdbje.BDBEnvironment.setupEnvironment(BDBEnvironment.java:291) ~[starrocks-fe.jar:?]
        ... 7 more

172.22.0.67 机器的错误日志:

2024-01-25 10:59:58,633 INFO (starrocks-mysql-nio-pool-1099|35631) [MysqlChannel.fetchOnePacket():189] Receive packet header failed, remote 172.22.0.40:21412 may close the channel.
2024-01-25 10:59:58,633 INFO (starrocks-mysql-nio-pool-1099|35631) [ConnectScheduler.unregisterConnection():160] Connection closed. remote=172.22.0.40:21412, connectionId=4601
2024-01-25 10:59:59,514 INFO (starrocks-mysql-nio I/O-4|155) [AcceptListener.handleEvent():79] Connection established. remote=/172.22.0.40:21420, connectionId=4602
2024-01-25 10:59:59,514 INFO (starrocks-mysql-nio-pool-1099|35631) [AcceptListener.lambda$handleEvent$1():87] Connection scheduled to worker thread 35631. remote=/172.22.0.40:21420, connectionId=4602
2024-01-25 10:59:59,517 WARN (starrocks-mysql-nio-pool-1099|35631) [StmtExecutor.execute():627] execute Exception, sql /* mysql-connector-java-5.1.49 ( Revision: ad86f36e100e104cd926c6b81c8cab9565750116 ) */SELECT  @@session.auto_increment_increment AS auto_increment_increment, @@character_set_client AS character_set_client, @@character_set_connection AS character_set_connection, @@character_set_results AS character_set_results, @@character_set_server AS character_set_server, @@collation_server AS collation_server, @@collation_connection AS collation_connection, @@init_connect AS init_connect, @@interactive_timeout AS interactive_timeout, @@language AS language, @@license AS license, @@lower_case_table_names AS lower_case_table_names, @@max_allowed_packet AS max_allowed_packet, @@net_buffer_length AS net_buffer_length, @@net_write_timeout AS net_write_timeout, @@query_cache_size AS query_cache_size, @@query_cache_type AS query_cache_type, @@sql_mode AS sql_mode, @@system_time_zone AS system_time_zone, @@time_zone AS time_zone, @@tx_isolation AS transaction_isolation, @@wait_timeout AS wait_timeout
com.sleepycat.je.rep.UnknownMasterException: (JE 18.3.16) Could not determine master from helpers at:[/172.22.0.68:9010, /172.22.0.66:9010, /172.22.0.67:9010, /172.22.0.46:9010]
        at com.sleepycat.je.rep.elections.Learner.findMaster(Learner.java:443) ~[starrocks-bdb-je-18.3.16.jar:?]
        at com.sleepycat.je.rep.util.ReplicationGroupAdmin.getMasterNodeName(ReplicationGroupAdmin.java:208) ~[starrocks-bdb-je-18.3.16.jar:?]
        at com.starrocks.ha.BDBHA.getLeaderNodeName(BDBHA.java:148) ~[starrocks-fe.jar:?]
        at com.starrocks.server.NodeMgr.getLeaderIpAndRpcPort(NodeMgr.java:1067) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.getLeaderIpAndRpcPort(GlobalStateMgr.java:3083) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.LeaderOpExecutor.forward(LeaderOpExecutor.java:154) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.LeaderOpExecutor.execute(LeaderOpExecutor.java:102) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.StmtExecutor.forwardToLeader(StmtExecutor.java:704) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.StmtExecutor.execute(StmtExecutor.java:453) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:351) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.dispatch(ConnectProcessor.java:465) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.processOnce(ConnectProcessor.java:741) ~[starrocks-fe.jar:?]
        at com.starrocks.mysql.nio.ReadListener.lambda$handleEvent$0(ReadListener.java:69) ~[starrocks-fe.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0-282]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0-282]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0-282]
2024-01-25 10:59:59,518 INFO (starrocks-mysql-nio-pool-1099|35631) [MysqlChannel.fetchOnePacket():189] Receive packet header failed, remote 172.22.0.40:21420 may close the channel.
2024-01-25 10:59:59,518 INFO (starrocks-mysql-nio-pool-1099|35631) [ConnectScheduler.unregisterConnection():160] Connection closed. remote=172.22.0.40:21420, connectionId=4602
2024-01-25 10:59:59,577 WARN (starrocks-mysql-nio-pool-1099|35631) [StmtExecutor.execute():627] execute Exception, sql /* ApplicationName=DBeaver 23.0.4 - SQLEditor <Script-278.sql> */ select * from ods_hd_hdb_lm_item_mapping
LIMIT 0, 200
com.sleepycat.je.rep.UnknownMasterException: (JE 18.3.16) Could not determine master from helpers at:[/172.22.0.68:9010, /172.22.0.66:9010, /172.22.0.67:9010, /172.22.0.46:9010]
        at com.sleepycat.je.rep.elections.Learner.findMaster(Learner.java:443) ~[starrocks-bdb-je-18.3.16.jar:?]
        at com.sleepycat.je.rep.util.ReplicationGroupAdmin.getMasterNodeName(ReplicationGroupAdmin.java:208) ~[starrocks-bdb-je-18.3.16.jar:?]
        at com.starrocks.ha.BDBHA.getLeaderNodeName(BDBHA.java:148) ~[starrocks-fe.jar:?]
        at com.starrocks.server.NodeMgr.getLeaderIpAndRpcPort(NodeMgr.java:1067) ~[starrocks-fe.jar:?]
        at com.starrocks.server.GlobalStateMgr.getLeaderIpAndRpcPort(GlobalStateMgr.java:3083) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.LeaderOpExecutor.forward(LeaderOpExecutor.java:154) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.LeaderOpExecutor.execute(LeaderOpExecutor.java:102) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.StmtExecutor.forwardToLeader(StmtExecutor.java:704) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.StmtExecutor.execute(StmtExecutor.java:453) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:351) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.dispatch(ConnectProcessor.java:465) ~[starrocks-fe.jar:?]
        at com.starrocks.qe.ConnectProcessor.processOnce(ConnectProcessor.java:741) ~[starrocks-fe.jar:?]
        at com.starrocks.mysql.nio.ReadListener.lambda$handleEvent$0(ReadListener.java:69) ~[starrocks-fe.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0-282]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0-282]
        at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0-282]

172.22.0.68 机器的错误日志:

2024-01-25 11:01:39,561 INFO (stateChangeExecutor|72) [StateChangeExecutor.runOneCycle():85] begin to transfer FE type from INIT to UNKNOWN
2024-01-25 11:01:39,561 INFO (stateChangeExecutor|72) [StateChangeExecutor.runOneCycle():179] finished to transfer FE type from INIT to UNKNOWN
2024-01-25 11:01:41,561 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:01:43,561 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:01:45,562 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:01:47,562 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:01:49,563 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:01:51,563 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:01:53,564 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:01:55,564 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:01:57,565 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:01:59,565 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:01,566 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:03,566 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:05,567 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:07,567 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:09,568 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:11,569 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:13,569 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:15,570 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:17,571 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:19,571 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:21,572 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:23,573 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:25,573 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:27,573 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:29,574 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:31,574 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:33,575 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:35,576 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:37,576 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:39,577 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:41,577 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:43,578 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:45,578 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:47,578 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:49,579 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:51,580 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false
2024-01-25 11:02:53,580 INFO (UNKNOWN 172.22.0.68_9010_1705907421469(-1)|1) [GlobalStateMgr.waitForReady():1107] wait globalStateMgr to be ready. FE type: INIT. is ready: false

此时fe全部异常, sr集群无法访问, 情况十万火急.

二、解决办法

先备份所有fe的meta文件夹

cp -r meta meta.bak

官方文档

在4个fe节点上执行

$ cd STARROCKS_HOME
$ java -jar lib/starrocks-bdb-je-18.3.16.jar DbPrintLog -h meta/bdb/ -vd

能够拿到以下信息

<DbPrintLog>
file 0x216 numRepRecords = 15725 firstVLSN = 12,919,352 lastVLSN = 12,935,076
file 0x217 numRepRecords = 26410 firstVLSN = 12,935,077 lastVLSN = 12,961,486
file 0x218 numRepRecords = 20099 firstVLSN = 12,961,487 lastVLSN = 12,981,585
file 0x219 numRepRecords = 32549 firstVLSN = 12,981,586 lastVLSN = 13,014,134
file 0x21a numRepRecords = 28304 firstVLSN = 13,014,135 lastVLSN = 13,042,438
file 0x21b numRepRecords = 20679 firstVLSN = 13,042,439 lastVLSN = 13,063,117
file 0x21c numRepRecords = 24025 firstVLSN = 13,063,118 lastVLSN = 13,087,142
file 0x21d numRepRecords = 9586 firstVLSN = 13,087,143 lastVLSN = 13,096,728
file 0x21e numRepRecords = 223 firstVLSN = 13,096,729 lastVLSN = 13,096,951
file 0x21f numRepRecords = 82 firstVLSN = 13,096,952 lastVLSN = 13,097,033
file 0x220 numRepRecords = 18603 firstVLSN = 13,097,034 lastVLSN = 13,115,636
file 0x221 numRepRecords = 18057 firstVLSN = 13,115,637 lastVLSN = 13,133,693
file 0x222 numRepRecords = 24469 firstVLSN = 13,133,694 lastVLSN = 13,158,162
file 0x223 numRepRecords = 22279 firstVLSN = 13,158,163 lastVLSN = 13,180,441
file 0x224 numRepRecords = 18859 firstVLSN = 13,180,442 lastVLSN = 13,199,300
file 0x225 numRepRecords = 17738 firstVLSN = 13,199,301 lastVLSN = 13,217,038
file 0x226 numRepRecords = 19652 firstVLSN = 13,217,039 lastVLSN = 13,236,690
file 0x227 numRepRecords = 17039 firstVLSN = 13,236,691 lastVLSN = 13,253,729
file 0x228 numRepRecords = 24045 firstVLSN = 13,253,730 lastVLSN = 13,277,774
file 0x229 numRepRecords = 22491 firstVLSN = 13,277,775 lastVLSN = 13,300,265
file 0x22a numRepRecords = 22581 firstVLSN = 13,300,266 lastVLSN = 13,322,846
file 0x22b numRepRecords = 25054 firstVLSN = 13,322,847 lastVLSN = 13,347,900
file 0x22c numRepRecords = 18812 firstVLSN = 13,347,901 lastVLSN = 13,366,712
file 0x22d numRepRecords = 16263 firstVLSN = 13,366,713 lastVLSN = 13,382,975
file 0x22e numRepRecords = 18421 firstVLSN = 13,382,976 lastVLSN = 13,401,396
file 0x22f numRepRecords = 15583 firstVLSN = 13,401,397 lastVLSN = 13,416,979
file 0x230 numRepRecords = 10105 firstVLSN = 13,416,980 lastVLSN = 13,427,084
file 0x231 numRepRecords = 15820 firstVLSN = 13,427,085 lastVLSN = 13,442,904
file 0x232 numRepRecords = 15230 firstVLSN = 13,442,905 lastVLSN = 13,458,134
file 0x233 numRepRecords = 20270 firstVLSN = 13,458,135 lastVLSN = 13,478,404
file 0x234 numRepRecords = 17820 firstVLSN = 13,478,405 lastVLSN = 13,496,224
file 0x235 numRepRecords = 12654 firstVLSN = 13,496,225 lastVLSN = 13,508,878
file 0x236 numRepRecords = 14011 firstVLSN = 13,508,879 lastVLSN = 13,522,889
file 0x237 numRepRecords = 23708 firstVLSN = 13,522,890 lastVLSN = 13,546,597
file 0x238 numRepRecords = 21347 firstVLSN = 13,546,598 lastVLSN = 13,567,944
file 0x239 numRepRecords = 17552 firstVLSN = 13,567,945 lastVLSN = 13,585,496
file 0x23a numRepRecords = 15096 firstVLSN = 13,585,497 lastVLSN = 13,600,592
file 0x23b numRepRecords = 9936 firstVLSN = 13,600,593 lastVLSN = 13,610,528
file 0x23c numRepRecords = 12041 firstVLSN = 13,610,529 lastVLSN = 13,622,569
file 0x23d numRepRecords = 10977 firstVLSN = 13,622,570 lastVLSN = 13,633,546
file 0x23e numRepRecords = 15229 firstVLSN = 13,633,547 lastVLSN = 13,648,775
file 0x23f numRepRecords = 9611 firstVLSN = 13,648,776 lastVLSN = 13,658,386
file 0x240 numRepRecords = 10254 firstVLSN = 13,658,387 lastVLSN = 13,668,640
file 0x241 numRepRecords = 12622 firstVLSN = 13,668,641 lastVLSN = 13,681,262
file 0x242 numRepRecords = 1772 firstVLSN = 13,681,263 lastVLSN = 13,683,034
... 0 files at end
First file: 0x216
Last file: 0x242
</DbPrintLog>

对比4个fe节点的 lastVLSN 的值, 看谁的值最大, 找到这台机器, 修改这台机器的fe.conf 文件

# 增加这个配置
metadata_failure_recovery = true

然后重新启动这个fe

sh bin/start_fe.sh --daemon

等待启动完成, 然后用mysql-client连接数据库(这台机器的ip端口), 检查fe角色

show frontends;

没问题之后再修改这台机器的fe.conf 文件

# 删除这个配置
# metadata_failure_recovery = true

然后重启fe

然后将之前加入的fe follower 全部删除重新加入, 需要注意的是把其他fe的meta目录清空(记得备份), fe follower 第一次启动时执行:

bin/start_fe.sh --helper "172.22.0.67:9010" --daemon

在mysql-client里面添加节点

ALTER SYSTEM ADD follower "172.22.0.68:9010";

加入完成之后再检查fe角色

show frontends;