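Journaling provides data recovery, but it protects a single node. A replica set is a group of mongod processes, usually spread over several nodes. Journaling keeps the data intact on each node, while the replica set as a whole provides automatic failover and therefore high availability. In production a replica set should contain at least three members: exactly one primary that holds data, one or more secondaries that hold data, and, as in the setup used here, an arbiter. The primary receives all write operations. A replica set has one and only one primary able to acknowledge writes under a write concern (write concern is covered later), and the primary records every change to its data sets in its operation log, the oplog. A typical layout is shown below: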

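Secondaries replicate the data held on the primary, and a replica set can have several of them. If the primary becomes unavailable, the remaining members hold an election and one of the secondaries becomes the new primary; the arbiter only contributes a vote. The role of a secondary is illustrated below: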

MongoDB replica set - Figure 1

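Besides the primary and the secondaries, you can add one more mongod instance to the replica set as an arbiter. An arbiter does not maintain a data set. Its main job is to exchange heartbeats with every other member so that enough voters are available for an election. Because it stores no data, an arbiter is a cheaper way to reach a quorum than a full data-bearing member. If a replica set has an even number of members, adding an arbiter lets a primary obtain a majority of the votes. An arbiter needs no dedicated hardware. The role of the arbiter is as follows: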

A primary and a secondary may swap roles in an election (triggered when the primary fails), but an arbiter always remains an arbiter.
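Why an arbiter helps a set with an even number of members can be checked with a little arithmetic. The sketch below assumes that electing a primary needs a strict majority of the voting members and that every member has one vote; the helper names `majority` and `faultTolerance` are illustrative and not part of MongoDB.

```javascript
// Votes needed to elect a primary: a strict majority of the voting members.
function majority(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// How many voting members may fail while a primary can still be elected.
function faultTolerance(votingMembers) {
  return votingMembers - majority(votingMembers);
}

for (const n of [2, 3, 4, 5]) {
  console.log(
    `${n} voting members: majority ${majority(n)}, tolerates ${faultTolerance(n)} failure(s)`
  );
}
```

Under these assumptions two data-bearing members alone tolerate no failure, while the same two members plus an arbiter tolerate one.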

The failover process is described here:

https://docs.mongodb.com/manual/replication/#edge-cases-2-primaries

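Creating the replica set:

Open a command prompt in the D:\MongoDB\Server\3.2\bin folder and start three mongod processes, one per replica set member (they listen on ports 40000, 40001 and 40002 in the output that follows).

Note: the subfolders rs0_0, rs0_1 and rs0_2 must be created under the data folder first.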
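The start commands themselves are not shown above. As a rough sketch, the small script below prints what they might look like. The ports and the set name `rs0` come from the output later in this article, and `--port`, `--dbpath` and `--replSet` are standard mongod options, but the data root `D:\MongoDB\data` is an assumption and should be replaced with the real location of the data folder.

```javascript
// Hypothetical data root: the article only says the subfolders live under "data".
const dataRoot = "D:\\MongoDB\\data";
const replSetName = "rs0";

// Builds the command line for member number `index` listening on `port`.
function mongodCommand(index, port) {
  const dbpath = `${dataRoot}\\${replSetName}_${index}`;
  return `mongod --port ${port} --dbpath ${dbpath} --replSet ${replSetName}`;
}

const commands = [40000, 40001, 40002].map((port, index) => mongodCommand(index, port));
commands.forEach((cmd) => console.log(cmd));
```

Each printed line would be run in its own command prompt window so that the three processes stay in the foreground.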

Then start a mongo shell client connected to one of the instances and run the command that initializes the replica set:

    > rs.initiate()

The result is:

    {
        "info2" : "no configuration specified. Using a default configuration for the set",
        "me" : "linfl-PC:40000",
        "ok" : 1
    }
    rs0:OTHER> rs.conf()
    {
        "_id" : "rs0",
        "version" : 1,
        "protocolVersion" : NumberLong(1),
        "members" : [
            {
                "_id" : 0,
                "host" : "linfl-PC:40000",
                "arbiterOnly" : false,
                "buildIndexes" : true,
                "hidden" : false,
                "priority" : 1,
                "tags" : {
                },
                "slaveDelay" : NumberLong(0),
                "votes" : 1
            }
        ],
        "settings" : {
            "chainingAllowed" : true,
            "heartbeatIntervalMillis" : 2000,
            "heartbeatTimeoutSecs" : 10,
            "electionTimeoutMillis" : 10000,
            "getLastErrorModes" : {
            },
            "getLastErrorDefaults" : {
                "w" : 1,
                "wtimeout" : 0
            },
            "replicaSetId" : ObjectId("58a2e2f5c2e580f7b1c85b18")
        }
    }

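Add the other two nodes to the set (the status output further down shows linfl-PC:40001 as a secondary and linfl-PC:40002 as an arbiter).

Note: the shell prompt has now changed to rs0:PRIMARY.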
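The commands for this step are not shown above. A minimal sketch, assuming the mongo shell helpers `rs.add()` and `rs.addArb()` and the host names that appear in the later status output; the wrapper function `addMembers` is hypothetical and exists only so the snippet can also be run outside the shell.

```javascript
// Adds the second data-bearing member and the arbiter through an `rs`-like helper.
function addMembers(rsHelper) {
  rsHelper.add("linfl-PC:40001");    // data-bearing member, becomes a secondary
  rsHelper.addArb("linfl-PC:40002"); // arbiter: votes but stores no data
}

// `rs` only exists inside the mongo shell, so the call is guarded elsewhere.
if (typeof rs !== "undefined") {
  addMembers(rs);
}
```

Typed directly at the rs0:PRIMARY prompt, the two calls inside `addMembers` would be entered one after the other.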

Check the status of the replica set:

    rs0:PRIMARY> rs.status()

The output is:

    {
        "set" : "rs0", // replica set name
        "date" : ISODate("2017-02-14T11:00:36.634Z"),
        "myState" : 1, // 1: primary; 2: secondary
        "term" : NumberLong(1),
        "heartbeatIntervalMillis" : NumberLong(2000),
        "members" : [
            {
                "_id" : 0,
                "name" : "linfl-PC:40000",
                "health" : 1, // 1: up; 0: down
                "state" : 1,
                "stateStr" : "PRIMARY",
                "uptime" : 216, // how long the member has been online (seconds)
                "optime" : {
                    "ts" : Timestamp(1487070006, 1),
                    "t" : NumberLong(1)
                },
                "optimeDate" : ISODate("2017-02-14T11:00:06Z"),
                "infoMessage" : "could not find member to sync from",
                "electionTime" : Timestamp(1487069941, 2),
                "electionDate" : ISODate("2017-02-14T10:59:01Z"),
                "configVersion" : 3,
                "self" : true
            },
            {
                "_id" : 1,
                "name" : "linfl-PC:40001",
                "health" : 1,
                "state" : 2,
                "stateStr" : "SECONDARY",
                "uptime" : 40,
                "optime" : {
                    "ts" : Timestamp(1487070006, 1),
                    "t" : NumberLong(1)
                },
                "optimeDate" : ISODate("2017-02-14T11:00:06Z"),
                "lastHeartbeat" : ISODate("2017-02-14T11:00:36.075Z"),
                "lastHeartbeatRecv" : ISODate("2017-02-14T11:00:35.082Z"),
                "pingMs" : NumberLong(0), // round-trip time of a packet between the remote member and this instance
                "syncingTo" : "linfl-PC:40000", // the instance this member syncs from
                "configVersion" : 3
            },
            {
                "_id" : 2,
                "name" : "linfl-PC:40002",
                "health" : 1,
                "state" : 7,
                "stateStr" : "ARBITER",
                "uptime" : 5,
                "lastHeartbeat" : ISODate("2017-02-15T02:01:11.170Z"),
                "lastHeartbeatRecv" : ISODate("2017-02-15T02:01:10.172Z"),
                "pingMs" : NumberLong(0),
                "syncingTo" : "linfl-PC:40001",
                "configVersion" : 5
            }
        ],
        "ok" : 1
    }

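An arbiter does not replicate data. It only votes when the primary fails and a new primary has to be elected from the remaining secondaries, so the machine that runs the arbiter does not need much storage.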
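Most of the `rs.status()` output is detail; for a quick check the `name`, `stateStr` and `health` fields are usually enough. The sketch below condenses a status document into one line per member. The function name `summarizeStatus` and the trimmed sample document are illustrative; only the field names are taken from the output above.

```javascript
// Condenses an rs.status()-shaped document into one line per member.
function summarizeStatus(status) {
  return status.members.map((m) => {
    const health = m.health === 1 ? "up" : "down";
    return `${m.name} ${m.stateStr} (${health})`;
  });
}

// Trimmed sample that uses the same field names as the real output.
const sampleStatus = {
  set: "rs0",
  members: [
    { name: "linfl-PC:40000", health: 1, state: 1, stateStr: "PRIMARY" },
    { name: "linfl-PC:40001", health: 1, state: 2, stateStr: "SECONDARY" },
    { name: "linfl-PC:40002", health: 1, state: 7, stateStr: "ARBITER" },
  ],
};

summarizeStatus(sampleStatus).forEach((line) => console.log(line));
```

Inside the mongo shell the same function could be fed the live document, for example `summarizeStatus(rs.status())`.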

Next we use shell commands to show how data is synchronized and explain the process. First, list all databases in the replica set:

    rs0:PRIMARY> show dbs
    local 0.000GB

There is only the local database. List its collections:

    rs0:PRIMARY> use local
    switched to db local
    rs0:PRIMARY> show collections
    me
    oplog.rs
    replset.election
    startup_log
    system.replset

Note: MongoDB synchronizes data between replica set members through oplog.rs. To watch oplog.rs change, we insert a record into the cms database:

    rs0:PRIMARY> use cms
    switched to db cms
    rs0:PRIMARY> db.customers.insert({id:11,name:'lisi',orders:[{orders_id:1,create_time:'2017-02-06',products:[{product_name:'MiPad',price:'$100.00'},{product_name:'iphone',price:'$399.00'}]}],mobile:'13161020110',address:{city:'beijing',street:'taiyanggong'}})
    WriteResult({ "nInserted" : 1 })
    rs0:PRIMARY> db.customers.find()
    { "_id" : ObjectId("58a3bb2ca0bd576baa4763de"), "id" : 11, "name" : "lisi", "orders" : [ { "orders_id" : 1, "create_time" : "2017-02-06", "products" : [ { "product_name" : "MiPad", "price" : "$100.00" }, { "product_name" : "iphone", "price" : "$399.00" } ] } ], "mobile" : "13161020110", "address" : { "city" : "beijing", "street" : "taiyanggong" } }
    rs0:PRIMARY> use local
    switched to db local
    rs0:PRIMARY> db.oplog.rs.find()
    { "ts" : Timestamp(1487069941, 1), "h" : NumberLong("-6355743292210446009"), "v" : 2, "op" : "n", "ns" : "", "o" : { "msg" : "initiating set" } }
    { "ts" : Timestamp(1487069942, 1), "t" : NumberLong(1), "h" : NumberLong("-1263029456710822127"), "v" : 2, "op" : "n", "ns" : "", "o" : { "msg" : "new primary" } }
    { "ts" : Timestamp(1487069995, 1), "t" : NumberLong(1), "h" : NumberLong("6502719191955655967"), "v" : 2, "op" : "n", "ns" : "", "o" : { "msg" : "Reconfig set", "version" : 2 } }
    { "ts" : Timestamp(1487070006, 1), "t" : NumberLong(1), "h" : NumberLong("-2415405716599170931"), "v" : 2, "op" : "n", "ns" : "", "o" : { "msg" : "Reconfig set", "version" : 3 } }
    { "ts" : Timestamp(1487124022, 1), "t" : NumberLong(1), "h" : NumberLong("-478589502849657245"), "v" : 2, "op" : "n", "ns" : "", "o" : { "msg" : "Reconfig set", "version" : 4 } }
    { "ts" : Timestamp(1487125292, 1), "t" : NumberLong(1), "h" : NumberLong("4089071333042150540"), "v" : 2, "op" : "c", "ns" : "cms.$cmd", "o" : { "create" : "customers" } }
    { "ts" : Timestamp(1487125292, 2), "t" : NumberLong(1), "h" : NumberLong("-682469243777763072"), "v" : 2, "op" : "i", "ns" : "cms.customers", "o" : { "_id" : ObjectId("58a3bb2ca0bd576baa4763de"), "id" : 11, "name" : "lisi", "orders" : [ { "orders_id" : 1, "create_time" : "2017-02-06", "products" : [ { "product_name" : "MiPad", "price" : "$100.00" }, { "product_name" : "iphone", "price" : "$399.00" } ] } ], "mobile" : "13161020110", "address" : { "city" : "beijing", "street" : "taiyanggong" } } }
    rs0:PRIMARY>

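oplog.rs now contains an entry for the record we just inserted (the last line), preceded by an entry for the implicit creation of the customers collection.

The op field is the operation code: i means insert (the listing also shows c for a command and n for a no-op). ns is the namespace the operation applies to, and o is the object the operation carries.

After the primary completes the insert, every secondary takes a few steps to stay in sync. It checks oplog.rs in its own local database and finds the timestamp of its most recent entry. It then queries oplog.rs on the primary with that timestamp as the condition and fetches every entry whose timestamp is greater. Finally it inserts those entries into its own oplog.rs and applies the operations they describe, which completes the synchronization.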
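The three steps above can be sketched as a toy simulation. This is a simplified model under stated assumptions, not MongoDB's actual replication code: timestamps are plain numbers, only insert operations are applied, and the names `syncOnce`, `primaryOplog` and `secondary` are made up for the example.

```javascript
// One synchronization round of a toy secondary against a toy primary oplog.
function syncOnce(primaryOplog, secondary) {
  // 1. Find the timestamp of the newest entry the secondary already holds.
  const lastTs = secondary.oplog.length
    ? secondary.oplog[secondary.oplog.length - 1].ts
    : 0;
  // 2. Fetch every entry on the primary that is newer than that timestamp.
  const missing = primaryOplog.filter((entry) => entry.ts > lastTs);
  // 3. Append each entry to the local oplog and apply the operation it describes.
  for (const entry of missing) {
    secondary.oplog.push(entry);
    if (entry.op === "i") {
      secondary.data.push(entry.o);
    }
  }
  return missing.length; // number of entries copied in this round
}

const primaryOplog = [
  { ts: 1, op: "i", ns: "cms.customers", o: { id: 11, name: "lisi" } },
  { ts: 2, op: "i", ns: "cms.customers", o: { id: 12, name: "zhangsan" } },
];
// The secondary has already applied the first entry only.
const secondary = { oplog: [primaryOplog[0]], data: [primaryOplog[0].o] };

console.log("entries copied:", syncOnce(primaryOplog, secondary));
console.log("documents on the secondary:", secondary.data.length);
```

Because the query is driven by the secondary's own newest timestamp, a second round that finds nothing newer copies nothing, which is what makes the procedure safe to repeat.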

Now look at the databases on the secondary:

    D:\MongoDB\Server\3.2\bin>mongo --port 40001
    MongoDB shell version: 3.2.9
    connecting to: 127.0.0.1:40001/test
    rs0:SECONDARY> show dbs
    2017-02-15T10:40:07.204+0800 E QUERY [thread1] Error: listDatabases failed:{ "ok" : 0, "errmsg" : "not master and slaveOk=false", "code" : 13435 } :
    _getErrorWithCode@src/mongo/shell/utils.js:25:13
    Mongo.prototype.getDBs@src/mongo/shell/mongo.js:62:1
    shellHelper.show@src/mongo/shell/utils.js:761:19
    shellHelper@src/mongo/shell/utils.js:651:15
    @(shellhelp2):1:1
    rs0:SECONDARY> rs.slaveOk() // Note: by default a secondary accepts neither reads nor writes; this call allows reads on this connection
    rs0:SECONDARY> show dbs
    cms 0.000GB
    local 0.000GB

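Also note that oplog.rs has a fixed size. The default is about 50 MB on 32-bit systems and 5% of the free disk space on 64-bit systems; it can be set at startup with --oplogSize.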

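MongoDB's automatic failover relies on heartbeats, which is what the lastHeartbeat field seen earlier records. Each mongod sends a heartbeat to the other members every two seconds, and the "health" value that rs.status() returns for each member reflects the result. If the primary becomes unavailable, the remaining members trigger an election for a new primary. The arbiter only votes for other members and never stands as a candidate itself. If there are several secondaries, the one with the most recent oplog timestamp or the higher priority becomes primary. If a secondary fails, no election takes place. We now simulate both cases and look at how the data is handled: first the secondary goes down, then the primary goes down.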
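The rule of thumb in the paragraph above can be turned into a toy model. It is a deliberate simplification and not the real election protocol, which involves terms, vote requests and further checks. It assumes one vote per member, and the function name `electionOutcome`, the sample arrays and the `optimeTs` field are invented for the example.

```javascript
// Toy model: decides whether an election is needed and who would win it.
function electionOutcome(members) {
  const healthy = members.filter((m) => m.health === 1);
  // No election while a healthy primary exists (for example when only a secondary failed).
  if (healthy.some((m) => m.stateStr === "PRIMARY")) {
    return { electionNeeded: false, newPrimary: null };
  }
  // A new primary needs votes from a strict majority of all members.
  const majority = Math.floor(members.length / 2) + 1;
  // Arbiters vote but never stand as candidates.
  const candidates = healthy.filter((m) => m.stateStr === "SECONDARY");
  if (healthy.length < majority || candidates.length === 0) {
    return { electionNeeded: true, newPrimary: null };
  }
  // Prefer the higher priority, then the most recent oplog timestamp.
  const best = candidates
    .slice()
    .sort((a, b) => b.priority - a.priority || b.optimeTs - a.optimeTs)[0];
  return { electionNeeded: true, newPrimary: best.name };
}

const primaryDown = [
  { name: "linfl-PC:40000", health: 0, stateStr: "(not reachable/healthy)" },
  { name: "linfl-PC:40001", health: 1, stateStr: "SECONDARY", priority: 1, optimeTs: 1487922414 },
  { name: "linfl-PC:40002", health: 1, stateStr: "ARBITER" },
];
const secondaryDown = [
  { name: "linfl-PC:40000", health: 1, stateStr: "PRIMARY", priority: 1, optimeTs: 1487921807 },
  { name: "linfl-PC:40001", health: 0, stateStr: "(not reachable/healthy)" },
  { name: "linfl-PC:40002", health: 1, stateStr: "ARBITER" },
];

console.log(electionOutcome(primaryDown));
console.log(electionOutcome(secondaryDown));
```

In the first sample the surviving secondary and the arbiter together still form a majority of the three members, so the secondary can be elected; in the second sample nothing happens because the primary is still healthy.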

With the secondary on port 40001 stopped, rs.status() on the primary shows:

    rs0:PRIMARY> rs.status()
    {
        "set" : "rs0",
        "date" : ISODate("2017-02-24T07:39:17.573Z"),
        "myState" : 1,
        "term" : NumberLong(2),
        "heartbeatIntervalMillis" : NumberLong(2000),
        "members" : [
            {
                "_id" : 0,
                "name" : "linfl-PC:40000",
                "health" : 1,
                "state" : 1,
                "stateStr" : "PRIMARY",
                "uptime" : 295,
                "optime" : {
                    "ts" : Timestamp(1487921807, 1),
                    "t" : NumberLong(2)
                },
                "optimeDate" : ISODate("2017-02-24T07:36:47Z"),
                "electionTime" : Timestamp(1487921806, 1),
                "electionDate" : ISODate("2017-02-24T07:36:46Z"),
                "configVersion" : 5,
                "self" : true
            },
            {
                "_id" : 1,
                "name" : "linfl-PC:40001",
                "health" : 0,
                "state" : 8,
                "stateStr" : "(not reachable/healthy)",
                "uptime" : 0,
                "optime" : {
                    "ts" : Timestamp(0, 0),
                    "t" : NumberLong(-1)
                },
                "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
                "lastHeartbeat" : ISODate("2017-02-24T07:39:15.017Z"),
                "lastHeartbeatRecv" : ISODate("2017-02-24T07:38:33.501Z"),
                "pingMs" : NumberLong(0),
                "lastHeartbeatMessage" : "Couldn't get a connection within the time limit",
                "configVersion" : -1
            },
            {
                "_id" : 2,
                "name" : "linfl-PC:40002",
                "health" : 1,
                "state" : 7,
                "stateStr" : "ARBITER",
                "uptime" : 156,
                "lastHeartbeat" : ISODate("2017-02-24T07:39:16.961Z"),
                "lastHeartbeatRecv" : ISODate("2017-02-24T07:39:15.647Z"),
                "pingMs" : NumberLong(0),
                "configVersion" : 5
            }
        ],
        "ok" : 1
    }

The state of the secondary has changed to 8 (member down), and lastHeartbeatMessage reads: Couldn't get a connection within the time limit.

Insert a record on the primary and check the status again:

    rs0:PRIMARY> use cms
    switched to db cms
    rs0:PRIMARY> db.customers.insert({id:12,name:'zhangsan'})
    WriteResult({ "nInserted" : 1 })
    rs0:PRIMARY> rs.status()
    {
        "set" : "rs0",
        "date" : ISODate("2017-02-24T07:46:58.458Z"),
        "myState" : 1,
        "term" : NumberLong(2),
        "heartbeatIntervalMillis" : NumberLong(2000),
        "members" : [
            {
                "_id" : 0,
                "name" : "linfl-PC:40000",
                "health" : 1,
                "state" : 1,
                "stateStr" : "PRIMARY",
                "uptime" : 756,
                "optime" : {
                    "ts" : Timestamp(1487922414, 1),
                    "t" : NumberLong(2)
                },
                "optimeDate" : ISODate("2017-02-24T07:46:54Z"),
                "electionTime" : Timestamp(1487921806, 1),
                "electionDate" : ISODate("2017-02-24T07:36:46Z"),
                "configVersion" : 5,
                "self" : true
            },
            {
                "_id" : 1,
                "name" : "linfl-PC:40001",
                "health" : 0,
                "state" : 8,
                "stateStr" : "(not reachable/healthy)",
                "uptime" : 0,
                "optime" : {
                    "ts" : Timestamp(0, 0),
                    "t" : NumberLong(-1)
                },
                "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
                "lastHeartbeat" : ISODate("2017-02-24T07:46:57.445Z"),
                "lastHeartbeatRecv" : ISODate("2017-02-24T07:38:33.501Z"),
                "pingMs" : NumberLong(0),
                "lastHeartbeatMessage" : "由于目标计算机积极拒绝,无法连接。", // Windows socket error: the target machine actively refused the connection
                "configVersion" : -1
            },
            {
                "_id" : 2,
                "name" : "linfl-PC:40002",
                "health" : 1,
                "state" : 7,
                "stateStr" : "ARBITER",
                "lastHeartbeat" : ISODate("2017-02-24T07:46:57.005Z"),
                "lastHeartbeatRecv" : ISODate("2017-02-24T07:46:55.655Z"),
                "pingMs" : NumberLong(0),
                "configVersion" : 5
            }
        ],
        "ok" : 1
    }
    rs0:PRIMARY>

The optime of the primary has advanced. Restart the secondary and check again:

    rs0:PRIMARY> rs.status()
    {
        "set" : "rs0",
        "date" : ISODate("2017-02-24T07:49:51.633Z"),
        "myState" : 1,
        "term" : NumberLong(2),
        "heartbeatIntervalMillis" : NumberLong(2000),
        "members" : [
            {
                "_id" : 0,
                "name" : "linfl-PC:40000",
                "health" : 1,
                "state" : 1,
                "stateStr" : "PRIMARY",
                "uptime" : 929,
                "optime" : {
                    "ts" : Timestamp(1487922414, 1),
                    "t" : NumberLong(2)
                },
                "optimeDate" : ISODate("2017-02-24T07:46:54Z"),
                "electionTime" : Timestamp(1487921806, 1),
                "electionDate" : ISODate("2017-02-24T07:36:46Z"),
                "configVersion" : 5,
                "self" : true
            },
            {
                "_id" : 1,
                "name" : "linfl-PC:40001",
                "health" : 1,
                "state" : 2,
                "stateStr" : "SECONDARY",
                "uptime" : 6,
                "optime" : {
                    "ts" : Timestamp(1487921807, 1),
                    "t" : NumberLong(2)
                },
                "optimeDate" : ISODate("2017-02-24T07:36:47Z"),
                "lastHeartbeat" : ISODate("2017-02-24T07:49:51.570Z"),
                "lastHeartbeatRecv" : ISODate("2017-02-24T07:49:47.386Z"),
                "pingMs" : NumberLong(0),
                "configVersion" : 5
            },
            {
                "_id" : 2,
                "name" : "linfl-PC:40002",
                "health" : 1,
                "state" : 7,
                "stateStr" : "ARBITER",
                "uptime" : 790,
                "lastHeartbeat" : ISODate("2017-02-24T07:49:51.016Z"),
                "lastHeartbeatRecv" : ISODate("2017-02-24T07:49:50.656Z"),
                "pingMs" : NumberLong(0),
                "configVersion" : 5
            }
        ],
        "ok" : 1
    }

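The secondary is healthy again and reports the same t (election term) in its optime as the primary. In this snapshot, taken six seconds after the restart, its ts still trails the primary's; it catches up as synchronization proceeds. (Note: two records were inserted earlier by mistake.)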
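How far a secondary trails the primary can be read off the optimeDate fields of `rs.status()`. The sketch below computes that gap in seconds; the function name `replicationLagSeconds` is illustrative, and the sample document is trimmed down to the fields it needs, with the dates copied from the output above.

```javascript
// Seconds by which each secondary trails the primary, from an rs.status()-shaped document.
function replicationLagSeconds(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) {
    return []; // without a primary there is nothing to compare against
  }
  return status.members
    .filter((m) => m.stateStr === "SECONDARY")
    .map((m) => ({
      name: m.name,
      // Subtracting two Date objects yields milliseconds.
      lagSeconds: (primary.optimeDate - m.optimeDate) / 1000,
    }));
}

const snapshot = {
  members: [
    { name: "linfl-PC:40000", stateStr: "PRIMARY", optimeDate: new Date("2017-02-24T07:46:54Z") },
    { name: "linfl-PC:40001", stateStr: "SECONDARY", optimeDate: new Date("2017-02-24T07:36:47Z") },
    { name: "linfl-PC:40002", stateStr: "ARBITER" },
  ],
};

console.log(replicationLagSeconds(snapshot));
```

For this snapshot the gap is the time between 07:36:47 and 07:46:54, that is 607 seconds; a later call should show it shrinking towards zero.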

Now we test a primary failure. Shut down the primary, connect to the node on port 40001 and check the replica set status:

    D:\MongoDB\Server\3.2\bin>mongo --port 40001
    2017-02-24T15:57:09.654+0800 I CONTROL [main] Hotfix KB2731284 or later update is not installed, will zero-out data files
    MongoDB shell version: 3.2.9
    connecting to: 127.0.0.1:40001/test
    rs0:PRIMARY> rs.status()
    {
        "set" : "rs0",
        "date" : ISODate("2017-02-24T07:57:12.710Z"),
        "myState" : 1,
        "term" : NumberLong(3),
        "heartbeatIntervalMillis" : NumberLong(2000),
        "members" : [
            {
                "_id" : 0,
                "name" : "linfl-PC:40000",
                "health" : 0,
                "state" : 8,
                "stateStr" : "(not reachable/healthy)",
                "uptime" : 0,
                "optime" : {
                    "ts" : Timestamp(0, 0),
                    "t" : NumberLong(-1)
                },
                "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
                "lastHeartbeat" : ISODate("2017-02-24T07:57:12.562Z"),
                "lastHeartbeatRecv" : ISODate("2017-02-24T07:56:51.767Z"),
                "pingMs" : NumberLong(0),
                "lastHeartbeatMessage" : "no response within election timeout period",
                "configVersion" : -1
            },
            {
                "_id" : 1,
                "name" : "linfl-PC:40001",
                "health" : 1,
                "state" : 1,
                "stateStr" : "PRIMARY",
                "uptime" : 448,
                "optime" : {
                    "ts" : Timestamp(1487923023, 1),
                    "t" : NumberLong(3)
                },
                "optimeDate" : ISODate("2017-02-24T07:57:03Z"),
                "infoMessage" : "could not find member to sync from",
                "electionTime" : Timestamp(1487923022, 1),
                "electionDate" : ISODate("2017-02-24T07:57:02Z"),
                "configVersion" : 5,
                "self" : true
            },
            {
                "_id" : 2,
                "name" : "linfl-PC:40002",
                "health" : 1,
                "state" : 7,
                "stateStr" : "ARBITER",
                "uptime" : 445,
                "lastHeartbeat" : ISODate("2017-02-24T07:57:12.565Z"),
                "lastHeartbeatRecv" : ISODate("2017-02-24T07:57:10.743Z"),
                "pingMs" : NumberLong(0),
                "configVersion" : 5
            }
        ],
        "ok" : 1
    }

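After an election in which the arbiter cast its vote, the node on port 40001 has become the primary, while the node on port 40000 is reported as unreachable (state 8); it rejoins as a secondary once it is restarted. Now insert a record, restart the node on port 40000 and check the replica set status again.

At this point data synchronization is complete and the replica set is working normally.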

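A few things to keep in mind:

1. By default MongoDB allows reads and writes on the primary only

2. When an application is connected to the replica set and the primary fails, the replica set closes all socket connections to the application while the failover is in progress

If a write in non-safe mode is in flight at that moment, its outcome is uncertain. With writes in safe mode, the driver uses the getLastError command to learn which writes succeeded and which failed, and passes the failures back to the application, which decides how to handle them.

By default a replica set applies write concern to the primary only. When an application issues a write, the driver calls getLastError to report how the write went, and getLastError behaves according to the configured write concern options. The common options are: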

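1. Option w: -1 disables write concern and ignores all network and socket errors; 0 disables write concern and reports only network and socket errors; 1 enables write concern for the primary only (the default for both replica sets and standalone instances); with a value N greater than 1, write concern covers N members of the replica set, and the client receives acknowledgement only after all N have applied the write

2. Option wtimeout: the time limit (in milliseconds) within which the write concern must return; if it is not set, the write may block indefinitely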
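As an illustration of the two options, the sketch below passes a write concern with an insert. It assumes the `writeConcern` option document that `db.collection.insert()` accepts in the mongo shell; the wrapper `insertAcknowledged` and the sample document are hypothetical.

```javascript
// Inserts `doc` and waits for acknowledgement from `w` members, for at most `wtimeoutMs` ms.
function insertAcknowledged(collection, doc, w, wtimeoutMs) {
  return collection.insert(doc, { writeConcern: { w: w, wtimeout: wtimeoutMs } });
}

// `db` only exists inside the mongo shell, so the call is guarded elsewhere.
if (typeof db !== "undefined") {
  // Hypothetical document; wait for two members, give up after five seconds.
  insertAcknowledged(db.getSiblingDB("cms").customers, { id: 13, name: "wangwu" }, 2, 5000);
}
```

In the three-member set built above only two members hold data, so w: 2 can be satisfied only while both the primary and the secondary are up; with the secondary stopped, the wtimeout is what keeps the write from blocking.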

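Read preference routes a client's read requests to a chosen member of the replica set, such as a secondary. By default reads go to the primary, which guarantees that the data is the latest. Data read from a secondary may be slightly stale, which is acceptable for applications that do not need real-time data. Read preference does not increase the overall read and write capacity of the system, but it can route a client's reads to the most suitable secondary (for example, a client in South China reading from a secondary in South China), which makes reads faster for that client.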

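Read preference modes:

1. primary: all reads go to the primary; if the primary is down, reads fail with an error or exception

2. primaryPreferred: reads normally go to the primary; if the primary fails, reads are routed to a secondary

3. secondary: all reads go to secondaries; if there is no secondary or none is available, reads fail with an error or exception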
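The routing rules of these three modes can be sketched as a small function. This is a toy model: real drivers also weigh latency and tags when several secondaries are available. The function name `routeRead` and the sample member list are made up for the example.

```javascript
// Picks the member a read is routed to under the three modes listed above.
function routeRead(mode, members) {
  const primary = members.find((m) => m.stateStr === "PRIMARY" && m.health === 1);
  const secondary = members.find((m) => m.stateStr === "SECONDARY" && m.health === 1);
  switch (mode) {
    case "primary":
      if (!primary) throw new Error("no primary available");
      return primary.name;
    case "primaryPreferred":
      if (primary) return primary.name;
      if (secondary) return secondary.name;
      throw new Error("no member available");
    case "secondary":
      if (!secondary) throw new Error("no secondary available");
      return secondary.name;
    default:
      throw new Error(`unsupported mode: ${mode}`);
  }
}

// Snapshot during a failover: the old primary is down and no new primary is elected yet.
const duringFailover = [
  { name: "linfl-PC:40000", stateStr: "(not reachable/healthy)", health: 0 },
  { name: "linfl-PC:40001", stateStr: "SECONDARY", health: 1 },
];

console.log(routeRead("primaryPreferred", duringFailover));
try {
  routeRead("primary", duringFailover);
} catch (err) {
  console.log("primary mode failed:", err.message);
}
```

In the mongo shell a mode can be chosen for the current connection with `db.getMongo().setReadPref("primaryPreferred")`.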