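First, a replica set has three types of nodes:
1. primary: handles reads and writes from clients.
2. secondary: a hot standby. It applies the operations recorded in the primary's oplog to stay consistent with the primary. By default it does not serve client reads, and it never accepts writes.
There are two kinds of secondary:
1) normal secondary: stays in sync with the primary at all times.
2) delayed secondary: stays a configured amount of time behind the primary, as protection against operator mistakes.
3. arbiter: serves no reads or writes. It acts only as a tie-breaker, taking part in the election among the remaining nodes when the primary goes down.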
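When the primary of a replica set goes down, a failover takes place. The election strategy is as follows:
After the primary goes down, the remaining nodes elect a new primary. The arbiter also votes, which avoids a deadlock. (Without an arbiter, in a two-node replica set, if the secondary goes down the primary demotes itself to secondary and the whole replica set becomes unavailable.) The node chosen is the one with the highest priority and the freshest data.
The primary uses heartbeats to track how many members of the set it can see. If it cannot see a majority (more than half) of the members, it automatically demotes itself to secondary. This prevents the deadlock described above, and also covers the case where a network partition has isolated the primary from the rest of the set.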
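The heartbeat rule above can be modelled in a few lines. This is only a sketch of the idea, not MongoDB source code: the function name `shouldStepDown`, the `lastHeartbeatMs` field, and the timeout value are all assumptions made for illustration.

```javascript
// Hypothetical model of the "can I still see a majority?" check.
// members: one entry per set member; the primary itself is marked self: true.
// A member counts as visible if its last heartbeat is recent enough.
function shouldStepDown(members, nowMs, timeoutMs) {
  const visible = members.filter(
    (m) => m.self || nowMs - m.lastHeartbeatMs <= timeoutMs
  ).length;
  // Stay primary only with a strict majority: more than half of all members.
  return visible <= members.length / 2;
}

// Two-node set, the secondary has been silent for 30 s (assumed 10 s timeout).
const twoNodeSet = [{ self: true }, { lastHeartbeatMs: 0 }];
console.log(shouldStepDown(twoNodeSet, 30000, 10000)); // true

// Three-node set, one member silent: 2 of 3 are still visible.
const threeNodeSet = [
  { self: true },
  { lastHeartbeatMs: 29000 },
  { lastHeartbeatMs: 0 },
];
console.log(shouldStepDown(threeNodeSet, 30000, 10000)); // false
```

Note that the primary counts itself, which is why a two-node set with one node down sees exactly half of the set and still has to step down.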
An example from the official documentation:
Initial state:
server-a: secondary oplog: ()
server-b: secondary oplog: ()
server-c: secondary oplog: ()
The primary accepts writes
server-a: primary oplog: (a1,a2,a3,a4,a5)
server-b: secondary oplog: ()
server-c: secondary oplog: ()
The secondaries apply the data
server-a: primary oplog: (a1,a2,a3,a4,a5)
server-b: secondary oplog: (a1)
server-c: secondary oplog: (a1,a2,a3)
…
The primary, server-a, goes down
…
server-b: secondary oplog: (a1)
server-c: secondary oplog: (a1,a2,a3)
...
server-b: secondary oplog: (a1)
server-c: primary oplog: (a1,a2,a3) // c has the freshest data and is elected primary
...
server-b: secondary oplog: (a1,a2,a3)
server-c: primary oplog: (a1,a2,a3,c4)
...
server-a comes back up
...
server-a: recovering oplog: (a1,a2,a3,a4,a5) -- performing recovery
server-b: secondary oplog: (a1,a2,a3)
server-c: primary oplog: (a1,a2,a3,c4)
… server-a applies the data from server-c; at this point a4 and a5 are lost
server-a: recovering oplog: (a1,a2,a3,c4)
server-b: secondary oplog: (a1,a2,a3,c4)
server-c: primary oplog: (a1,a2,a3,c4)
The new primary, server-c, accepts writes.
server-a: secondary oplog: (a1,a2,a3,c4)
server-b: secondary oplog: (a1,a2,a3,c4)
server-c: primary oplog: (a1,a2,a3,c4,c5,c6,c7,c8)
…
server-a: secondary oplog: (a1,a2,a3,c4,c5,c6,c7,c8)
server-b: secondary oplog: (a1,a2,a3,c4,c5,c6,c7,c8)
server-c: primary oplog: (a1,a2,a3,c4,c5,c6,c7,c8)
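As the sequence above shows, server-c becomes the primary and the other nodes apply server-c's oplog. The operations a4 and a5 are lost.
Once a new primary is elected, its data is assumed to be the latest in the whole set. Operations on the other nodes (including the former primary) that the new primary does not have are rolled back, even if the former primary has come back up. To complete the rollback, every node resynchronizes after connecting to the new primary. The process is:
Each node inspects its own oplog, finds the operations that were never applied on the new primary, and then asks the new primary for the latest copy of the documents those operations touched, in order to bring its data back in sync.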
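To make the server-a case concrete, here is a simplified model of that resync. It treats an oplog as a plain array of operation names and finds the point where the two logs diverge. The function `resync` and its return shape are my own assumptions; a real node works with timestamped oplog entries and refetches the affected documents rather than comparing strings.

```javascript
// Hypothetical sketch: compare a rejoining node's oplog with the new primary's.
function resync(localOplog, primaryOplog) {
  // Walk forward while both logs agree; this is the common point.
  let common = 0;
  while (
    common < localOplog.length &&
    common < primaryOplog.length &&
    localOplog[common] === primaryOplog[common]
  ) {
    common++;
  }
  return {
    rolledBack: localOplog.slice(common), // ops the new primary never saw
    toApply: primaryOplog.slice(common),  // ops the node still has to fetch
    result: primaryOplog.slice(),         // the node ends up matching the primary
  };
}

const outcome = resync(["a1", "a2", "a3", "a4", "a5"], ["a1", "a2", "a3", "c4"]);
console.log(outcome.rolledBack); // [ 'a4', 'a5' ]
console.log(outcome.toApply);    // [ 'c4' ]
console.log(outcome.result);     // [ 'a1', 'a2', 'a3', 'c4' ]
```

This matches the example: server-a ends with (a1,a2,a3,c4), and a4, a5 are the operations that get rolled back.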
On the election strategy in a replica set:
We use a consensus protocol to pick a primary. Exact details will be spared here but that basic process is:
1 get maxLocalOpOrdinal from each server.
2 if a majority of servers are not up (from this server's POV), remain in Secondary mode and stop.
3 if the last op time seems very old, stop and await human intervention.
4 else, using a consensus protocol, pick the server with the highest maxLocalOpOrdinal as the Primary.
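The quoted steps can be sketched roughly as below. This is a toy model under my own assumptions, not the actual consensus protocol: `pickPrimary` and the `up` flag are invented names, step 3 (the "very old" check) is left out because the quote gives no threshold, and the consensus voting itself is reduced to a simple maximum.

```javascript
// Hypothetical sketch of steps 1, 2 and 4 from the quoted description.
// servers: [{ name, up, maxLocalOpOrdinal }] as seen from one server's point of view.
function pickPrimary(servers) {
  const up = servers.filter((s) => s.up);
  // Step 2: without a majority of servers up, stay secondary and stop.
  if (up.length <= servers.length / 2) {
    return null;
  }
  // Step 4: choose the reachable server with the highest maxLocalOpOrdinal.
  return up.reduce((best, s) =>
    s.maxLocalOpOrdinal > best.maxLocalOpOrdinal ? s : best
  ).name;
}

// The example above: server-a is down, server-c is ahead of server-b.
console.log(pickPrimary([
  { name: "server-a", up: false, maxLocalOpOrdinal: 5 },
  { name: "server-b", up: true, maxLocalOpOrdinal: 1 },
  { name: "server-c", up: true, maxLocalOpOrdinal: 3 },
])); // server-c

// Four members with two down: no majority, so nobody becomes primary.
console.log(pickPrimary([
  { name: "n1", up: true, maxLocalOpOrdinal: 7 },
  { name: "n2", up: true, maxLocalOpOrdinal: 7 },
  { name: "n3", up: false, maxLocalOpOrdinal: 7 },
  { name: "n4", up: false, maxLocalOpOrdinal: 7 },
])); // null
```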
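Regarding step 2: when a node cannot see a majority of the set up, the remaining nodes stay in secondary mode and stop serving.
In my test with a 4-node replica set, when two secondaries went down the primary became a secondary. The whole set was effectively down, because secondaries do not accept writes (or, by default, reads).
Shut down two secondary nodes in the set, rac4:27019 and rac3:27017: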
[mongodb@rac4 bin]$ ./mongo 127.0.0.1:27019
MongoDB shell version: 2.0.1
connecting to: 127.0.0.1:27019/test
SECONDARY>
SECONDARY> use admin
switched to db admin
SECONDARY> db.shutdownServer();
Wed Nov 2 11:02:29 DBClientCursor::init call() failed
Wed Nov 2 11:02:29 query failed : admin.$cmd { shutdown: 1.0 } to: 127.0.0.1:27019
server should be down...
[mongodb@rac3 bin]$ ./mongo 10.250.7.241:27017
MongoDB shell version: 2.0.1
connecting to: 127.0.0.1:27017/test
SECONDARY>
SECONDARY> use admin
switched to db admin
SECONDARY> db.shutdownServer();
Tue Nov 1 22:02:46 DBClientCursor::init call() failed
Tue Nov 1 22:02:46 query failed : admin.$cmd { shutdown: 1.0 } to: 127.0.0.1:27017
server should be down...
Tue Nov 1 22:02:46 trying reconnect to 127.0.0.1:27017
Tue Nov 1 22:02:46 reconnect 127.0.0.1:27017 failed couldn't connect to server 127.0.0.1:27017
Tue Nov 1 22:02:46 Error: error doing query: unknown shell/collection.js:150
After exiting the client on the primary and reconnecting, the prompt has changed from PRIMARY to SECONDARY. Check the replica set status:
[mongodb@rac4 bin]$ ./mongo 127.0.0.1:27020
MongoDB shell version: 2.0.1
connecting to: 127.0.0.1:27020/test
SECONDARY>
SECONDARY> rs.status();
{
"set" : "myset",
"date" : ISODate("2011-11-01T13:56:05Z"),
"myState" : 2,
"members" : [
{
"_id" : 0,
"name" : "10.250.7.220:27018",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 101,
"optime" : {
"t" : 1320154033000,
"i" : 1
},
"optimeDate" : ISODate("2011-11-01T13:27:13Z"),
"lastHeartbeat" : ISODate("2011-11-01T13:56:04Z"),
"pingMs" : 0
},
{
"_id" : 1,
"name" : "10.250.7.220:27019",
"health" : 0, -- already shut down
"state" : 8,
"stateStr" : "(not reachable/healthy)",
"uptime" : 0,
"optime" : {
"t" : 1320154033000,
"i" : 1
},
"optimeDate" : ISODate("2011-11-01T13:27:13Z"),
"lastHeartbeat" : ISODate("2011-11-01T13:53:50Z"),
"pingMs" : 0,
"errmsg" : "socket exception"
},
{
"_id" : 2,
"name" : "10.250.7.220:27020",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY", -- changed from primary to secondary
"optime" : {
"t" : 1320154033000,
"i" : 1
},
"optimeDate" : ISODate("2011-11-01T13:27:13Z"),
"self" : true
},
{
"_id" : 3,
"name" : "10.250.7.241:27017",
"health" : 0,
"state" : 8,
"stateStr" : "(not reachable/healthy)",
"uptime" : 0,
"optime" : {
"t" : 1320154033000,
"i" : 1
},
"optimeDate" : ISODate("2011-11-01T13:27:13Z"),
"lastHeartbeat" : ISODate("2011-11-01T13:53:54Z"),
"pingMs" : 0,
"errmsg" : "socket exception"
}
],
"ok" : 1
}
SECONDARY> exut
Wed Nov 2 15:23:02 ReferenceError: exut is not defined (shell):1
Wed Nov 2 15:23:02 DBClientCursor::init call() failed
> exit
bye
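Continuing from the previous article, this part looks further at the replica set election mechanism.
Create a two-node replica set with one primary and one secondary. If the secondary goes down, the primary becomes a secondary, and the set no longer has a primary. Why does this happen?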
[mongodb@rac4 bin]$ mongo 127.0.0.1:27018 init1node.js
MongoDB shell version: 2.0.1
connecting to: 127.0.0.1:27018/test
[mongodb@rac4 bin]$ ./mongo 127.0.0.1:27019
MongoDB shell version: 2.0.1
connecting to: 127.0.0.1:27019/test
RECOVERING>
SECONDARY>
SECONDARY> use admin
switched to db admin
SECONDARY> db.shutdownServer()
Sun Nov 6 20:16:11 DBClientCursor::init call() failed
Sun Nov 6 20:16:11 query failed : admin.$cmd { shutdown: 1.0 } to: 127.0.0.1:27019
should be down...
Sun Nov 6 20:16:11 trying reconnect to 127.0.0.1:27019
Sun Nov 6 20:16:11 reconnect 127.0.0.1:27019 failed couldn't connect to server 127.0.0.1:27019
Sun Nov 6 20:16:11 Error: error doing query: unknown shell/collection.js:150
>
After the secondary goes down, the primary changes from PRIMARY to SECONDARY:
[mongodb@rac4 bin]$ mongo 127.0.0.1:27018
MongoDB shell version: 2.0.1
connecting to: 127.0.0.1:27018/test
PRIMARY>
PRIMARY>
PRIMARY>
SECONDARY>
The log shows how the primary reacted after the secondary went down:
Sun Nov 6 20:16:13 [rsHealthPoll] replSet info 10.250.7.220:27019 is down (or slow to respond): DBClientBase::findN: transport error: 10.250.7.220:27019 query: { replSetHeartbeat: "myset", v: 1, pv: 1, checkEmpty: false, from: "10.250.7.220:27018" }
Sun Nov 6 20:16:13 [rsHealthPoll] replSet member 10.250.7.220:27019 is now in state DOWN
Sun Nov 6 20:16:13 [conn7] end connection 10.250.7.220:13217
Sun Nov 6 20:16:37 [rsMgr] can't see a majority of the set, relinquishing primary
Sun Nov 6 20:16:37 [rsMgr] replSet relinquishing primary state
Sun Nov 6 20:16:37 [rsMgr] replSet SECONDARY
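This follows from MongoDB's primary election strategy. Suppose the secondary had not crashed but the network had been cut instead: if a node that can see only itself were allowed to elect itself, both nodes would become primary, since each can reach only itself. After the network recovered, a complicated consistency problem would have to be resolved, and the longer the partition lasted, the more complicated it would be. So MongoDB's strategy is that a node that can see only itself does not elect itself primary.
So the right approach is to run more than two nodes, or to add an arbiter. Adding an arbiter is the best and most convenient option: an arbiter only takes part in elections and carries almost no load, so it can run on any idle machine. This not only avoids the no-primary situation described above, it can also make elections finish faster. With three data nodes, when one goes down the other two may well each vote for themselves as primary, so it can take a long time to reach a result. In practice, the choice of primary is determined by two conditions: priority and data freshness.
From the official documentation: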
Example: if B and C are candidates in an election, B having a higher priority but C being the most up to date:
1 C will be elected primary
2 Once B catches up a re-election should be triggered and B (the higher priority node) should win the election between B and C
3 Alternatively, suppose that, once B is within 12 seconds of synced to C, C goes down.
B will be elected primary.
When C comes back up, those 12 seconds of unsynced writes will be written to a file in the rollback directory of your data directory (rollback is created when needed).
You can manually apply the rolled-back data, see Replica Sets - Rollbacks.
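The B and C example can be replayed with a small model. It is deliberately simplified and is not MongoDB's real election code: I assume the freshest reachable candidate wins and that priority only breaks ties between equally fresh candidates, which is enough to reproduce the three cases quoted above. The names `electPrimary`, `optime`, and `up` are assumptions, and the majority check is left out.

```javascript
// Hypothetical model: freshest data first, priority as the tie-breaker.
function electPrimary(candidates) {
  const reachable = candidates.filter((c) => c.up);
  if (reachable.length === 0) {
    return null;
  }
  // Sort by optime descending, then by priority descending.
  reachable.sort((x, y) => y.optime - x.optime || y.priority - x.priority);
  return reachable[0].name;
}

// Case 1: B has the higher priority, but C is more up to date.
console.log(electPrimary([
  { name: "B", priority: 2, optime: 90, up: true },
  { name: "C", priority: 1, optime: 100, up: true },
])); // C

// Case 2: B has caught up, so a re-election goes to the higher priority node.
console.log(electPrimary([
  { name: "B", priority: 2, optime: 100, up: true },
  { name: "C", priority: 1, optime: 100, up: true },
])); // B

// Case 3: C goes down while B is still 12 seconds behind.
console.log(electPrimary([
  { name: "B", priority: 2, optime: 88, up: true },
  { name: "C", priority: 1, optime: 100, up: false },
])); // B
```

In case 3 the 12 seconds of writes that only C had are exactly what ends up in the rollback file when C returns.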
Rebuild the replica set, this time with an arbiter:
[mongodb@rac4 bin]$ cat init2node.js
rs.initiate({
_id : "myset",
members : [
{_id : 0, host : "10.250.7.220:28018"},
{_id : 1, host : "10.250.7.220:28019"},
{_id : 2, host : "10.250.7.220:28020", arbiterOnly: true}
]
})
[mongodb@rac4 bin]$ ./mongo 127.0.0.1:28018 init2node.js
[mongodb@rac4 bin]$ ./mongo 127.0.0.1:28018
MongoDB shell version: 2.0.1
connecting to: 127.0.0.1:28018/test
PRIMARY> rs.status()
{
"set" : "myset",
"date" : ISODate("2011-11-06T14:16:13Z"),
"myState" : 1,
"members" : [
{
"_id" : 0,
"name" : "10.250.7.220:28018",
"health" : 1,
"state" : 1,
...
},
{
"_id" : 1,
"name" : "10.250.7.220:28019",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
....
},
{
"_id" : 2,
"name" : "10.250.7.220:28020",
"health" : 1,
"state" : 7,
"stateStr" : "ARBITER",
....
}
],
"ok" : 1
}
PRIMARY>
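Test again to see whether the primary still becomes a secondary.
For the multi-node case from the previous article, for example four data nodes (primary and secondaries) plus one arbiter: when two nodes go down, the set does not become unavailable the way the earlier article described for half of the machines being down, because three of the five voting members are still up. But if three of the four data nodes go down, the whole set becomes unavailable!
The "majority of" in the log message does not give a specific number. From the tests so far, a primary needs to see more than half of the voting members; once half or more of them are unreachable, the set has no primary: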
Sun Nov 6 19:34:16 [rsMgr] can't see a majority of the set, relinquishing primary
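Putting numbers on "majority" for the set sizes tested in this post: the helper below is plain arithmetic, and the name `faultTolerance` is made up for illustration. It counts voting members, so an arbiter is included.

```javascript
// Hypothetical helper: how many voters form a majority, and how many can fail.
function faultTolerance(votingMembers) {
  const majority = Math.floor(votingMembers / 2) + 1; // strictly more than half
  return { majority, tolerated: votingMembers - majority };
}

for (const n of [2, 3, 4, 5]) {
  const { majority, tolerated } = faultTolerance(n);
  console.log(`${n} voters: majority ${majority}, can lose ${tolerated}`);
}
// 2 voters: majority 2, can lose 0
// 3 voters: majority 2, can lose 1
// 4 voters: majority 3, can lose 1
// 5 voters: majority 3, can lose 2
```

This lines up with the tests: the two-node set lost its primary with one node down, the four-node set with two down, and the five-voter set (four data nodes plus an arbiter) survived two failures but not three.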
References:
http://www.mongodb.org/display/DOCS/Replica+Sets+-+Priority
http://blog.nosqlfan.com/html/2523.html