Using the Disaster Recovery Tool

    This document describes how to use the disaster recovery (DR) tool to initialize a DR environment, and how to recover with the tool when the DR environment fails.

    Under the same-city dual-datacenter architecture, the machines are divided into two subnets, SUB1 (sdbserver1, sdbserver2) and SUB2 (sdbserver3), and the configuration is modified on sdbserver1 and sdbserver3.

    • Modify the cluster_opr.js file on sdbserver1
    • Modify the cluster_opr.js file on sdbserver3
    if ( typeof(USERNAME) != "string" ) { USERNAME = "sdbadmin" ; }
    if ( typeof(PASSWD) != "string" ) { PASSWD = "sdbadmin" ; }
    if ( typeof(SDBUSERNAME) != "string" ) { SDBUSERNAME = "sdbadmin" ; }
    if ( typeof(SDBPASSWD) != "string" ) { SDBPASSWD = "sdbadmin" ; }
    if ( typeof(SUB1HOSTS) == "undefined" ) { SUB1HOSTS = [ "sdbserver1", "sdbserver2" ] ; }
    if ( typeof(SUB2HOSTS) == "undefined" ) { SUB2HOSTS = [ "sdbserver3" ] ; }
    if ( typeof(COORDADDR) == "undefined" ) { COORDADDR = [ "sdbserver3:11810" ] }
    if ( typeof(CURSUB) == "undefined" ) { CURSUB = 2 ; }
    if ( typeof(ACTIVE) == "undefined" ) { ACTIVE = false; }

    (The listing above shows the configuration on sdbserver3, as indicated by CURSUB = 2.)
    • Run init on sdbserver1
    [sdbadmin@sdbserver1 dr_ha]$ sh init.sh
    Begin to check args...
    Done
    Begin to check enviroment...
    Done
    Begin to init cluster...
    Start to copy init file to cluster host
    Copy init file to sdbserver2 success
    Copy init file to sdbserver3 success
    Done
    Begin to update catalog and data nodes's config...Done
    Begin to reload catalog and data nodes's config...Done
    Begin to reelect all groups...Done
    Done

    Note:

    • Running init.sh generates a "datacenter_init.info" file in the SequoiaDB installation directory. If this file already exists, delete or back it up first.
    • The cluster_opr.js parameter NEEDBROADCASTINITINFO defaults to "true", which means the initialization result file is distributed to every host in the cluster, so the initialization only needs to be run on the SUB1 machine "sdbserver1".
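    Backing up the existing file rather than deleting it keeps the previous initialization record around. A minimal sketch, assuming the /opt/sequoiadb installation path used in this document's examples (override SDB_HOME for other layouts):

    ```shell
    # Back up any existing datacenter_init.info before re-running init.sh.
    # SDB_HOME is an assumed override; the default matches the install path
    # shown in this document's examples.
    SDB_HOME="${SDB_HOME:-/opt/sequoiadb}"
    INIT_INFO="$SDB_HOME/datacenter_init.info"
    if [ -f "$INIT_INFO" ]; then
      BACKUP="$INIT_INFO.bak.$(date +%Y%m%d%H%M%S)"
      mv "$INIT_INFO" "$BACKUP"
      echo "moved $INIT_INFO to $BACKUP"
    else
      echo "no previous init info at $INIT_INFO"
    fi
    ```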

    Failover

    When all machines in the primary datacenter fail, every machine in SUB1 is unavailable and two of the SequoiaDB cluster's three replicas stop working. The split tool must then be used to detach the single replica in the DR datacenter (SUB2) from the original cluster, turning it into an independent cluster with read and write capability so that SequoiaDB service can be restored.

    • Modify the configuration item in cluster_opr.js on sdbserver3
    /* whether to activate this sub-cluster; valid values: true/false */
    if ( typeof(ACTIVE) == "undefined" ) { ACTIVE = true ; }
    • Run split on sdbserver3
    [sdbadmin@sdbserver3 dr_ha]$ sh split.sh
    Begin to check args...
    Done
    Begin to check enviroment...
    Done
    Begin to split cluster...
    Stop 11800 succeed in sdbserver3
    Start 11800 by standalone succeed in sdbserver3
    Change sdbserver3:11800 to standalone succeed
    Kick host[sdbserver2] from group[SYSCatalogGroup]
    Kick host[sdbserver1] from group[SYSCatalogGroup]
    Update kicked group[SYSCatalogGroup] to sdbserver3:11800 succeed
    Kick host[sdbserver1] from group[group1]
    Kick host[sdbserver2] from group[group1]
    Update kicked group[group1] to sdbserver3:11910 succeed
    Kick host[sdbserver1] from group[group2]
    Kick host[sdbserver2] from group[group2]
    Update kicked group[group2] to sdbserver3:11920 succeed
    Kick host[sdbserver1] from group[group3]
    Kick host[sdbserver2] from group[group3]
    Update kicked group[group3] to sdbserver3:11930 succeed
    Kick host[sdbserver1] from group[SYSCoord]
    Kick host[sdbserver2] from group[SYSCoord]
    Update kicked group[SYSCoord] to sdbserver3:11810 succeed
    Update sdbserver3:11800 catalog's info succeed
    Update sdbserver3:11800 catalog's readonly prop succeed
    Update all nodes's catalogaddr to sdbserver3:11803 succeed
    Restart all nodes succeed in sdbserver3
    Restart all host nodes succeed
    Done

    The DR datacenter has now completed the switchover and runs as an independent business cluster, providing normal service.

    • After the primary datacenter is repaired, modify the configuration item in cluster_opr.js on sdbserver1

      if ( typeof(ACTIVE) == "undefined" ) { ACTIVE = false ; }

      Setting ACTIVE=false puts the two-replica cluster produced by the split into read-only mode, so that only the single-replica cluster in the DR datacenter can accept writes, which avoids a split-brain situation.
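    Because the flag is edited by hand on each side, it is easy to end up with ACTIVE = true on both. The helpers below are a hypothetical sketch (not part of the dr_ha tool) that extract the ACTIVE default from a copy of cluster_opr.js and warn when the two sides are not complementary:

    ```shell
    # Hypothetical sanity-check helpers; the arguments are local copies of
    # cluster_opr.js fetched from each host.
    active_value() {
      # prints the ACTIVE default ("true" or "false") found in the given file
      sed -n 's/.*ACTIVE = *\([a-z]*\).*/\1/p' "$1" | head -n 1
    }

    check_complementary() {
      a=$(active_value "$1")
      b=$(active_value "$2")
      if [ "$a" = "$b" ]; then
        echo "WARNING: both $1 and $2 have ACTIVE=$a" >&2
        return 1
      fi
      echo "ok: $1 ACTIVE=$a, $2 ACTIVE=$b"
    }
    ```

    Running check_complementary against the sdbserver1 and sdbserver3 copies should report one true and one false before split.sh is executed.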

    • Enable automatic full synchronization on the data nodes

    If the primary-datacenter (SUB1) nodes terminated abnormally, the restarted nodes must recover their data through a full synchronization. With the data node parameter dataerrorop=2, full synchronization is blocked and the data nodes cannot start. Therefore, before running the split operation for the primary datacenter (SUB1), set dataerrorop=1 in the configuration file (sdb.conf) of every data node so that the data nodes can start normally.
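    Editing every sdb.conf by hand is error-prone. The sketch below assumes the conventional layout <install>/conf/local/<svcname>/sdb.conf; the CONF_ROOT default and the set_dataerrorop helper are illustrative, not part of the dr_ha tool:

    ```shell
    # Set dataerrorop in every node's sdb.conf under CONF_ROOT.
    # CONF_ROOT is an assumed default; adjust it for your deployment and
    # restart the affected nodes afterwards for the change to take effect.
    CONF_ROOT="${CONF_ROOT:-/opt/sequoiadb/conf/local}"

    set_dataerrorop() {
      conf="$1"
      value="$2"
      if grep -q '^dataerrorop=' "$conf"; then
        # replace the existing setting in place
        sed -i "s/^dataerrorop=.*/dataerrorop=$value/" "$conf"
      else
        # append the setting if it was never configured
        echo "dataerrorop=$value" >> "$conf"
      fi
    }

    for conf in "$CONF_ROOT"/*/sdb.conf; do
      if [ -f "$conf" ]; then
        set_dataerrorop "$conf" 1
      fi
    done
    ```

    The same helper with a value of 2 restores the original setting once the clusters are merged and the data has caught up.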

    • Run split on sdbserver1
    [sdbadmin@sdbserver1 dr_ha]$ sh split.sh
    Begin to check args...
    Done
    Begin to check enviroment...
    Done
    Begin to split cluster...
    ...
    Done

    Recovery

    • Run merge

    Once all failures in the primary datacenter have been repaired, the two separated independent clusters must be merged to restore the original state. The command can be run on sdbserver1 and sdbserver3 at the same time.
    $ sh merge.sh
    Begin to check args...
    Done
    Begin to check enviroment...
    Done
    Begin to merge cluster...
    Stop 11800 succeed in sdbserver3
    Start 11800 by standalone succeed in sdbserver3
    Change sdbserver3:11800 to standalone succeed
    Restore group[SYSCatalogGroup] to sdbserver3:11800 succeed
    Restore group[group1] to sdbserver3:11800 succeed
    Restore group[group2] to sdbserver3:11800 succeed
    Restore group[group3] to sdbserver3:11800 succeed
    Restore group[SYSCoord] to sdbserver3:11800 succeed
    Restore sdbserver3:11800 catalog's info succeed
    Update sdbserver3:11800 catalog's readonly prop succeed
    ...
    Update all nodes's catalogaddr to sdbserver1:11803,sdbserver2:11803,sdbserver3:11803 succeed
    Restart all nodes succeed in sdbserver3
    Restart all host nodes succeed
    Done
    • Disable automatic full synchronization on the data nodes

    After the merge completes and the data in the primary datacenter (SUB1) and the DR datacenter (SUB2) has caught up, automatic full synchronization of the data nodes is no longer needed, so change the dataerrorop parameter of all data nodes back to its original setting, dataerrorop=2.

    • Run init again to restore the cluster to its original state

    After the merge, all of the cluster's primary nodes are located in the DR cluster (SUB2), so init must be run again to redistribute the primary nodes back to the primary datacenter (SUB1).

    Modify the configuration item in cluster_opr.js on sdbserver1:

    [sdbadmin@sdbserver1 dr_ha]$ grep 'ACTIVE =' cluster_opr.js
    if ( typeof(ACTIVE) == "undefined" ) { ACTIVE = true; }

    Modify the configuration item in cluster_opr.js on sdbserver3:

    [sdbadmin@sdbserver3 dr_ha]$ grep 'ACTIVE =' cluster_opr.js
    if ( typeof(ACTIVE) == "undefined" ) { ACTIVE = false; }

    Run init on sdbserver1:

    Note:

    Before re-running init, first delete the datacenter_init.info file in the SequoiaDB installation directory; otherwise init.sh reports the following error:

    Already init. If you want to re-init, you should to remove the file: /opt/sequoiadb/datacenter_init.info

    Under the two-city three-datacenter architecture, initialization divides the machines into two subnets, SUB1 (sdbserver1) and SUB2 (sdbserver2, sdbserver3).

    • Modify the cluster_opr.js file on sdbserver1:
    if ( typeof(SEQPATH) != "string" || SEQPATH.length == 0 ) { SEQPATH = "/opt/sequoiadb/" ; }
    if ( typeof(USERNAME) != "string" ) { USERNAME = "sdbadmin" ; }
    if ( typeof(PASSWD) != "string" ) { PASSWD = "sdbadmin" ; }
    if ( typeof(SDBUSERNAME) != "string" ) { SDBUSERNAME = "sdbadmin" ; }
    if ( typeof(SDBPASSWD) != "string" ) { SDBPASSWD = "sdbadmin" ; }
    if ( typeof(SUB1HOSTS) == "undefined" ) { SUB1HOSTS = [ "sdbserver1" ] ; }
    if ( typeof(SUB2HOSTS) == "undefined" ) { SUB2HOSTS = [ "sdbserver2", "sdbserver3" ] ; }
    if ( typeof(COORDADDR) == "undefined" ) { COORDADDR = [ "sdbserver1:11810" ] }
    if ( typeof(CURSUB) == "undefined" ) { CURSUB = 1 ; }
    if ( typeof(CUROPR) == "undefined" ) { CUROPR = "split" ; }
    if ( typeof(ACTIVE) == "undefined" ) { ACTIVE = true ; }
    • Run init on sdbserver1
    [sdbadmin@sdbserver1 dr_ha]$ sh init.sh
    Begin to check args...
    Done
    Begin to check enviroment...
    Done
    Begin to init cluster...
    Start to copy init file to cluster host
    Copy init file to sdbserver3 success
    Copy init file to sdbserver2 success
    Done
    Begin to update catalog and data nodes's config...Done
    Begin to reload catalog and data nodes's config...Done
    Begin to reelect all groups...Done
    Done

    Note:

    • Running init.sh generates a "datacenter_init.info" file in the SequoiaDB installation directory. If this file already exists, delete or back it up first.
    • The cluster_opr.js parameter NEEDBROADCASTINITINFO defaults to "true", which means the initialization result file is distributed to every host in the cluster, so the initialization only needs to be run on the SUB1 machine "sdbserver1".

    Failover

    When all machines in the primary datacenter and DR datacenter B fail, two of the SequoiaDB cluster's three replicas stop working. The split tool must then be used to detach the single replica in DR datacenter A from the original cluster, turning it into an independent cluster with read and write capability so that SequoiaDB service can be restored.

    The subnets are now divided as follows:

    • Modify the configuration items in cluster_opr.js on sdbserver2
    if ( typeof(SUB2HOSTS) == "undefined" ) { SUB2HOSTS = [ "sdbserver1", "sdbserver3" ] ; }
    if ( typeof(COORDADDR) == "undefined" ) { COORDADDR = [ "sdbserver2:11810" ] }
    if ( typeof(CURSUB) == "undefined" ) { CURSUB = 1 ; }
    if ( typeof(ACTIVE) == "undefined" ) { ACTIVE = true ; }
    • Run split on sdbserver2
    [sdbadmin@sdbserver2 dr_ha]$ sh split.sh
    Begin to check args...
    Done
    Begin to check enviroment...
    Done
    Begin to split cluster...
    Stop 11800 succeed in sdbserver2
    Start 11800 by standalone succeed in sdbserver2
    Change sdbserver2:11800 to standalone succeed
    Kick host[sdbserver3] from group[SYSCatalogGroup]
    Update kicked group[SYSCatalogGroup] to sdbserver2:11800 succeed
    Kick host[sdbserver1] from group[group1]
    Kick host[sdbserver3] from group[group1]
    Update kicked group[group1] to sdbserver2:11800 succeed
    Kick host[sdbserver1] from group[group2]
    Kick host[sdbserver3] from group[group2]
    Update kicked group[group2] to sdbserver2:11800 succeed
    Kick host[sdbserver1] from group[group3]
    Kick host[sdbserver3] from group[group3]
    Update kicked group[group3] to sdbserver2:11800 succeed
    Kick host[sdbserver1] from group[SYSCoord]
    Kick host[sdbserver3] from group[SYSCoord]
    Update kicked group[SYSCoord] to sdbserver2:11800 succeed
    Update sdbserver2:11800 catalog's info succeed
    Update sdbserver2:11800 catalog's readonly prop succeed
    Update all nodes's catalogaddr to sdbserver2:11803 succeed
    Restart all nodes succeed in sdbserver2
    Restart all host nodes succeed
    Done

    DR datacenter A (sdbserver2) has now completed the switchover and runs as an independent business cluster, providing normal service.

    • After the primary datacenter and DR datacenter B are repaired, modify the configuration items in cluster_opr.js on sdbserver1
    if ( typeof(SUB1HOSTS) == "undefined" ) { SUB1HOSTS = [ "sdbserver2" ] ; }
    if ( typeof(SUB2HOSTS) == "undefined" ) { SUB2HOSTS = [ "sdbserver1", "sdbserver3" ] ; }
    if ( typeof(COORDADDR) == "undefined" ) { COORDADDR = [ "sdbserver1:11810" ] }
    if ( typeof(CURSUB) == "undefined" ) { CURSUB = 2 ; }
    if ( typeof(ACTIVE) == "undefined" ) { ACTIVE = false ; }

    Setting ACTIVE=false puts the two-replica cluster produced by the split into read-only mode, so that only the single-replica cluster in DR datacenter A can accept writes, which avoids a split-brain situation.

    • Enable automatic full synchronization on the data nodes

    If the nodes in SUB2 terminated abnormally, the restarted nodes must recover their data through a full synchronization. With the data node parameter dataerrorop=2, full synchronization is blocked and the data nodes cannot start. Therefore, before running the split operation for the primary datacenter (SUB1), set dataerrorop=1 in the configuration file (sdb.conf) of every data node so that the data nodes can start normally.

    • Run split on sdbserver1
    [sdbadmin@sdbserver1 dr_ha]$ sh split.sh
    Begin to check args...
    Done
    Begin to check enviroment...
    Done
    Begin to split cluster...
    Stop 11800 succeed in sdbserver1
    Start 11800 by standalone succeed in sdbserver1
    ...
    Restart all nodes succeed in sdbserver1
    Restart all nodes succeed in sdbserver3
    Restart all host nodes succeed
    Done

    Recovery

    Once all failures in the primary datacenter and DR datacenter B have been repaired, the independent clusters separated across the three datacenters must be merged to restore the original state.

    • Run merge on sdbserver2
    [sdbadmin@sdbserver2 dr_ha]$ sh merge.sh
    Begin to check args...
    Done
    Begin to check enviroment...
    Done
    Begin to merge cluster...
    Stop 11800 succeed in sdbserver2
    Start 11800 by standalone succeed in sdbserver2
    Change sdbserver2:11800 to standalone succeed
    Restore group[SYSCatalogGroup] to sdbserver2:11800 succeed
    Restore group[group1] to sdbserver2:11800 succeed
    Restore group[group2] to sdbserver2:11800 succeed
    Restore group[group3] to sdbserver2:11800 succeed
    Restore group[SYSCoord] to sdbserver2:11800 succeed
    Restore sdbserver2:11800 catalog's info succeed
    Update sdbserver2:11800 catalog's readonly prop succeed
    Update all nodes's catalogaddr to sdbserver1:11803,sdbserver2:11803,sdbserver3:11803 succeed
    Restart all nodes succeed in sdbserver2
    Restart all host nodes succeed
    Done
    • Run merge on sdbserver1
    [sdbadmin@sdbserver1 dr_ha]$ sh merge.sh
    Begin to check args...
    Done
    Begin to check enviroment...
    Done
    Begin to merge cluster...
    Stop 11800 succeed in sdbserver1
    Start 11800 by standalone succeed in sdbserver1
    Change sdbserver1:11800 to standalone succeed
    Restore group[SYSCatalogGroup] to sdbserver1:11800 succeed
    Restore group[group1] to sdbserver1:11800 succeed
    Restore group[group2] to sdbserver1:11800 succeed
    Restore group[group3] to sdbserver1:11800 succeed
    Restore group[SYSCoord] to sdbserver1:11800 succeed
    Restore sdbserver1:11800 catalog's info succeed
    Update sdbserver1:11800 catalog's readonly prop succeed
    Stop 11800 succeed in sdbserver3
    Start 11800 by standalone succeed in sdbserver3
    Change sdbserver3:11800 to standalone succeed
    Restore group[SYSCatalogGroup] to sdbserver3:11800 succeed
    Restore group[group1] to sdbserver3:11800 succeed
    Restore group[group2] to sdbserver3:11800 succeed
    Restore group[group3] to sdbserver3:11800 succeed
    Restore group[SYSCoord] to sdbserver3:11800 succeed
    Restore sdbserver3:11800 catalog's info succeed
    Update sdbserver3:11800 catalog's readonly prop succeed
    Update all nodes's catalogaddr to sdbserver1:11803,sdbserver2:11803,sdbserver3:11803 succeed
    Restart all nodes succeed in sdbserver1
    Restart all nodes succeed in sdbserver3
    Restart all host nodes succeed
    Done
    • Disable automatic full synchronization on the data nodes

    After the merge completes and the data in SUB1 and SUB2 has caught up, automatic full synchronization of the data nodes is no longer needed, so change the dataerrorop parameter of all data nodes back to its original setting, dataerrorop=2.

    • Run init again to restore the cluster to its original state

    After the cluster is merged, init must be run again to redistribute the primary nodes back to the primary datacenter and restore the cluster's original state. The subnets are now divided as follows:

    Modify the configuration items in cluster_opr.js on sdbserver1:

    if ( typeof(SUB1HOSTS) == "undefined" ) { SUB1HOSTS = [ "sdbserver1" ] ; }
    if ( typeof(SUB2HOSTS) == "undefined" ) { SUB2HOSTS = [ "sdbserver2", "sdbserver3" ] ; }
    if ( typeof(COORDADDR) == "undefined" ) { COORDADDR = [ "sdbserver1:11810" ] }
    if ( typeof(CURSUB) == "undefined" ) { CURSUB = 1 ; }
    if ( typeof(CUROPR) == "undefined" ) { CUROPR = "split" ; }
    if ( typeof(ACTIVE) == "undefined" ) { ACTIVE = true ; }

    Run the initialization on sdbserver1:

    Note:

    Before re-running init, first delete the datacenter_init.info file in the SequoiaDB installation directory; otherwise init.sh reports an "Already init" error.

    Usage of the DR tool under the two-city three-datacenter architecture can follow the procedure described above.

    Under the three-city five-datacenter architecture, initialization divides the machines into two subnets, SUB1 (sdbserver1) and SUB2 (sdbserver2, sdbserver3, sdbserver4, sdbserver5).

    • Modify the cluster_opr.js file on sdbserver1
    if ( typeof(SEQPATH) != "string" || SEQPATH.length == 0 ) { SEQPATH = "/opt/sequoiadb/" ; }
    if ( typeof(USERNAME) != "string" ) { USERNAME = "sdbadmin" ; }
    if ( typeof(PASSWD) != "string" ) { PASSWD = "sdbadmin" ; }
    if ( typeof(SDBUSERNAME) != "string" ) { SDBUSERNAME = "sdbadmin" ; }
    if ( typeof(SDBPASSWD) != "string" ) { SDBPASSWD = "sdbadmin" ; }
    if ( typeof(SUB1HOSTS) == "undefined" ) { SUB1HOSTS = [ "sdbserver1" ] ; }
    if ( typeof(SUB2HOSTS) == "undefined" ) { SUB2HOSTS = [ "sdbserver2", "sdbserver3", "sdbserver4", "sdbserver5" ] ; }
    if ( typeof(COORDADDR) == "undefined" ) { COORDADDR = [ "sdbserver1:11810" ] }
    if ( typeof(CURSUB) == "undefined" ) { CURSUB = 1 ; }
    if ( typeof(CUROPR) == "undefined" ) { CUROPR = "split" ; }
    if ( typeof(ACTIVE) == "undefined" ) { ACTIVE = true ; }
    • Run init on sdbserver1
    [sdbadmin@sdbserver1 dr_ha]$ sh init.sh
    Begin to check args...
    Done
    Begin to check enviroment...
    Done
    Begin to init cluster...
    Start to copy init file to cluster host
    Copy init file to sdbserver3 success
    Copy init file to sdbserver4 success
    Copy init file to sdbserver5 success
    Copy init file to sdbserver2 success
    Done
    Begin to update catalog and data nodes's config...Done
    Begin to reload catalog and data nodes's config...Done
    Begin to reelect all groups...Done
    Done

    Note:

    • The cluster_opr.js parameter NEEDBROADCASTINITINFO defaults to "true", which means the initialization result file is distributed to every host in the cluster, so the initialization only needs to be run on the SUB1 machine sdbserver1.

    Failover

    When city 1 and city 3 fail entirely, the machines sdbserver1, sdbserver2 and sdbserver5 are unavailable and three of the SequoiaDB cluster's five replicas stop working. The split tool must then be used to detach the two replicas in city 2 from the original cluster, turning them into an independent cluster with read and write capability so that SequoiaDB service can be restored.

    The subnets are now divided as follows:

    • Modify the configuration items in cluster_opr.js on sdbserver4
    if ( typeof(SUB1HOSTS) == "undefined" ) { SUB1HOSTS = [ "sdbserver3", "sdbserver4" ] ; }
    if ( typeof(SUB2HOSTS) == "undefined" ) { SUB2HOSTS = [ "sdbserver1", "sdbserver2", "sdbserver5" ] ; }
    if ( typeof(COORDADDR) == "undefined" ) { COORDADDR = [ "sdbserver4:11810" ] }
    if ( typeof(CURSUB) == "undefined" ) { CURSUB = 1 ; }
    if ( typeof(ACTIVE) == "undefined" ) { ACTIVE = true ; }
    • Run split on sdbserver4
    [sdbadmin@sdbserver4 dr_ha]$ sh split.sh
    Done
    Begin to check enviroment...
    Done
    Begin to split cluster...
    Stop 11800 succeed in sdbserver3
    Start 11800 by standalone succeed in sdbserver3
    Change sdbserver3:11800 to standalone succeed
    Kick host[sdbserver1] from group[SYSCatalogGroup]
    Kick host[sdbserver2] from group[SYSCatalogGroup]
    Kick host[sdbserver5] from group[SYSCatalogGroup]
    Update kicked group[SYSCatalogGroup] to sdbserver3:11800 succeed
    Kick host[sdbserver1] from group[group1]
    Kick host[sdbserver2] from group[group1]
    Kick host[sdbserver5] from group[group1]
    Update kicked group[group1] to sdbserver3:11800 succeed
    Kick host[sdbserver1] from group[group2]
    Kick host[sdbserver2] from group[group2]
    Kick host[sdbserver5] from group[group2]
    Update kicked group[group2] to sdbserver3:11800 succeed
    Kick host[sdbserver1] from group[group3]
    Kick host[sdbserver2] from group[group3]
    Kick host[sdbserver5] from group[group3]
    Update kicked group[group3] to sdbserver3:11800 succeed
    Kick host[sdbserver1] from group[SYSCoord]
    Kick host[sdbserver2] from group[SYSCoord]
    Kick host[sdbserver5] from group[SYSCoord]
    Update kicked group[SYSCoord] to sdbserver3:11800 succeed
    Update sdbserver3:11800 catalog's info succeed
    Update sdbserver3:11800 catalog's readonly prop succeed
    Stop 11800 succeed in sdbserver4
    Start 11800 by standalone succeed in sdbserver4
    Change sdbserver4:11800 to standalone succeed
    Kick host[sdbserver1] from group[SYSCatalogGroup]
    Kick host[sdbserver2] from group[SYSCatalogGroup]
    Kick host[sdbserver5] from group[SYSCatalogGroup]
    Update kicked group[SYSCatalogGroup] to sdbserver4:11800 succeed
    Kick host[sdbserver1] from group[group1]
    Kick host[sdbserver2] from group[group1]
    Kick host[sdbserver5] from group[group1]
    Update kicked group[group1] to sdbserver4:11800 succeed
    Kick host[sdbserver1] from group[group2]
    Kick host[sdbserver2] from group[group2]
    Kick host[sdbserver5] from group[group2]
    Update kicked group[group2] to sdbserver4:11800 succeed
    Kick host[sdbserver1] from group[group3]
    Kick host[sdbserver2] from group[group3]
    Kick host[sdbserver5] from group[group3]
    Update kicked group[group3] to sdbserver4:11800 succeed
    Kick host[sdbserver1] from group[SYSCoord]
    Kick host[sdbserver2] from group[SYSCoord]
    Kick host[sdbserver5] from group[SYSCoord]
    Update kicked group[SYSCoord] to sdbserver4:11800 succeed
    Update sdbserver4:11800 catalog's info succeed
    Update sdbserver4:11800 catalog's readonly prop succeed
    Update all nodes's catalogaddr to sdbserver3:11803,sdbserver4:11803 succeed
    Restart all nodes succeed in sdbserver3
    Restart all nodes succeed in sdbserver4
    Restart all host nodes succeed
    Done

    City 2 (sdbserver3, sdbserver4) has now completed the switchover and runs as an independent business cluster, providing normal service.

    • After city 1 and city 3 are repaired, modify the configuration items in cluster_opr.js on sdbserver1
    if ( typeof(SUB1HOSTS) == "undefined" ) { SUB1HOSTS = [ "sdbserver3", "sdbserver4" ] ; }
    if ( typeof(SUB2HOSTS) == "undefined" ) { SUB2HOSTS = [ "sdbserver1", "sdbserver2", "sdbserver5" ] ; }
    if ( typeof(COORDADDR) == "undefined" ) { COORDADDR = [ "sdbserver1:11810" ] }
    if ( typeof(CURSUB) == "undefined" ) { CURSUB = 2 ; }
    if ( typeof(ACTIVE) == "undefined" ) { ACTIVE = false ; }

    Setting ACTIVE=false puts the three-replica cluster produced by the split into read-only mode, so that only the two-replica cluster in city 2 can accept writes, which avoids a split-brain situation.

    • Enable automatic full synchronization on the data nodes

    If the nodes in SUB2 terminated abnormally, the restarted nodes must recover their data through a full synchronization. With the data node parameter dataerrorop=2, full synchronization is blocked and the data nodes cannot start. Therefore, before running the split operation for the primary datacenter (SUB1), set dataerrorop=1 in the configuration file (sdb.conf) of every data node so that the data nodes can start normally.

    • Run split on sdbserver1
    [sdbadmin@sdbserver1 dr_ha]$ sh split.sh
    Begin to check args...
    Done
    Begin to check enviroment...
    Done
    Begin to split cluster...
    Stop 11800 succeed in sdbserver1
    Start 11800 by standalone succeed in sdbserver1
    ...
    Update all nodes's catalogaddr to sdbserver1:11803,sdbserver2:11803,sdbserver5:11803 succeed
    Restart all nodes succeed in sdbserver1
    Restart all nodes succeed in sdbserver2
    Restart all nodes succeed in sdbserver5
    Restart all host nodes succeed
    Done

    Recovery

    Once all failures in city 1 and city 3 have been repaired, the independent clusters separated into the two subnets must be merged to restore the original state.

    • Run merge in city 2
    [sdbadmin@sdbserver4 dr_ha]$ sh merge.sh
    Begin to check args...
    Done
    Begin to check enviroment...
    Done
    Begin to merge cluster...
    Stop 11800 succeed in sdbserver3
    Start 11800 by standalone succeed in sdbserver3
    Change sdbserver3:11800 to standalone succeed
    Restore group[SYSCatalogGroup] to sdbserver3:11800 succeed
    Restore group[group1] to sdbserver3:11800 succeed
    Restore group[group2] to sdbserver3:11800 succeed
    Restore group[group3] to sdbserver3:11800 succeed
    Restore group[SYSCoord] to sdbserver3:11800 succeed
    Restore sdbserver3:11800 catalog's info succeed
    Update sdbserver3:11800 catalog's readonly prop succeed
    Stop 11800 succeed in sdbserver4
    Start 11800 by standalone succeed in sdbserver4
    Change sdbserver4:11800 to standalone succeed
    Restore group[SYSCatalogGroup] to sdbserver4:11800 succeed
    Restore group[group1] to sdbserver4:11800 succeed
    Restore group[group2] to sdbserver4:11800 succeed
    Restore group[group3] to sdbserver4:11800 succeed
    Restore group[SYSCoord] to sdbserver4:11800 succeed
    Restore sdbserver4:11800 catalog's info succeed
    Update sdbserver4:11800 catalog's readonly prop succeed
    Update all nodes's catalogaddr to sdbserver1:11803,sdbserver2:11803,sdbserver3:11803,sdbserver4:11803,sdbserver5:11803 succeed
    Restart all nodes succeed in sdbserver4
    Restart all nodes succeed in sdbserver3
    Restart all host nodes succeed
    Done
    • Run merge in city 1 and city 3 (SUB2)
    [sdbadmin@sdbserver1 dr_ha]$ sh merge.sh
    Begin to check args...
    Done
    Begin to check enviroment...
    Done
    Begin to merge cluster...
    ...
    Restart all nodes succeed in sdbserver1
    Restart all nodes succeed in sdbserver2
    Restart all nodes succeed in sdbserver5
    Restart all host nodes succeed
    Done
    • Disable automatic full synchronization on the data nodes

    After the merge completes and the data in SUB2 and SUB1 has caught up, automatic full synchronization of the data nodes is no longer needed, so change the dataerrorop parameter of all data nodes back to its original setting, dataerrorop=2.

    • Run init again to restore the cluster to its original state

    After the merge, all of the cluster's primary nodes are located in the DR cluster (SUB2), so init must be run again to redistribute the primary nodes back to the primary datacenter (SUB1).

    The subnets are now divided as follows:

    Modify the configuration items in cluster_opr.js on sdbserver1:

    if ( typeof(SUB1HOSTS) == "undefined" ) { SUB1HOSTS = [ "sdbserver1" ] ; }
    if ( typeof(SUB2HOSTS) == "undefined" ) { SUB2HOSTS = [ "sdbserver2", "sdbserver3", "sdbserver4", "sdbserver5" ] ; }
    if ( typeof(COORDADDR) == "undefined" ) { COORDADDR = [ "sdbserver1:11810" ] }
    if ( typeof(CURSUB) == "undefined" ) { CURSUB = 1 ; }
    if ( typeof(CUROPR) == "undefined" ) { CUROPR = "split" ; }
    if ( typeof(ACTIVE) == "undefined" ) { ACTIVE = true ; }

    Run the initialization (init) on sdbserver1:

    Note: