[SERVER-24234] one primary mongod of the cluster out of service occasionally Created: 20/May/16  Updated: 08/Jun/16  Resolved: 08/Jun/16

Status: Closed
Project: Core Server
Component/s: Admin
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: ufo ofu Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File QQ图片20160520230528.png     Text File a.log     File aa.log.tgz    
Operating System: ALL
Participants:

 Description   

Cluster has 5 shard. A total of collection 3000+
Global environment
system: centos6.6
shard 1 primary 2 secondary 2 arb
mongos 3.0.8
mongod 3.0.8 wt engine snapp compress 1 primary 2 secondary 2 arb
mongo-configure 3.0.8 mmap engine

At the time point around 2016-05-20T20:08:36.657, it seems that the primary mongod of one shard entered a unstable situation. It blocked
almost all requests from the client for about 10 seconds (BTW, we're using mongo-java-driver 3.2.0).

And we also tried to do some simple log analysis. We tried to count each command for every seconds. (Usage is 'command=count')
We're not sure but we guess there is something wrong with our replcaSet.

Fri May 20 20:08:35 CST 2016   {count=17, delete=1, findAndModify=202, insert=335, isMaster=4, logout=1, ping=1, replSetHeartbeat=3, replSetUpdatePosition=247, saslContinue=2, saslStart=1, serverStatus=1, setShardVersion=2, top=1, update=3}
Fri May 20 20:08:36 CST 2016   {count=19, delete=1, findAndModify=137, insert=178, isMaster=2, logout=2, replSetHeartbeat=1, replSetUpdatePosition=171, saslContinue=4, saslStart=2, serverStatus=1, setShardVersion=28, update=3}
Fri May 20 20:08:37 CST 2016   {isMaster=1, replSetHeartbeat=3, setShardVersion=159}
Fri May 20 20:08:38 CST 2016   {isMaster=3, replSetHeartbeat=1, saslContinue=196, saslStart=98, setShardVersion=216}
Fri May 20 20:08:39 CST 2016   {ping=1, replSetHeartbeat=3}
Fri May 20 20:08:40 CST 2016   {isMaster=2, ping=1, replSetHeartbeat=1}
Fri May 20 20:08:41 CST 2016   {isMaster=1, replSetHeartbeat=3}
Fri May 20 20:08:42 CST 2016   {isMaster=1, replSetHeartbeat=1}
Fri May 20 20:08:43 CST 2016   {isMaster=20, replSetHeartbeat=3}
Fri May 20 20:08:44 CST 2016   {isMaster=180, ping=1, replSetHeartbeat=1}
Fri May 20 20:08:45 CST 2016   {isMaster=222, ping=1, replSetHeartbeat=3}
Fri May 20 20:08:46 CST 2016   {isMaster=175, replSetHeartbeat=1}
Fri May 20 20:08:47 CST 2016   {isMaster=161, replSetHeartbeat=3}
Fri May 20 20:08:48 CST 2016   {attempted=1, count=108, delete=7, findAndModify=476, insert=827, isMaster=199, replSetHeartbeat=1, replSetUpdatePosition=122, saslContinue=310, saslStart=609, serverStatus=1, setShardVersion=934, top=1, update=14}
Fri May 20 20:08:49 CST 2016   {count=133, delete=10, findAndModify=503, insert=537, isMaster=387, logout=1, ping=1, replSetHeartbeat=3, replSetUpdatePosition=204, saslContinue=2049, saslStart=1117, serverStatus=1, setShardVersion=3790, update=16}

We're expecting for your help. Thanks in advance.And we are also wondering to know more details of 'setShardVersion' command. What's the purpose of this command, and when will be triggered?

The following are raw messages from log file:

2016-05-20T20:08:36.647+0800 I QUERY    [conn36825] query xxx.aaa query: {.... } 2ms
2016-05-20T20:08:36.648+0800 I COMMAND  [conn33760] command xxx.$cmd command: findAndModify { .....} 0ms
2016-05-20T20:08:36.648+0800 I QUERY    [conn162] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1463585585000|3 } } cursorid:169950952477 ntoreturn:0 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:2 reslen:879 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, oplog: { acquireCount: { r: 3 } } } 8ms
2016-05-20T20:08:36.649+0800 I QUERY    [conn158] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1463585585000|3 } } cursorid:168949710400 ntoreturn:0 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:1 reslen:302 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, oplog: { acquireCount: { r: 1 } } } 0ms
2016-05-20T20:08:36.650+0800 I COMMAND  [conn163] command admin.$cmd command: replSetUpdatePosition { replSetUpdatePosition: 1, optimes: [ { _id: ObjectId('565f116c85ef5155d15a0d14'), optime: Timestamp 1463746070000|112, memberId: 0, cfgver: 7, config: { _id: 0, host: "10.6.7.115:47003", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } }, { _id: ObjectId('5720be08143133e67c0fb1a8'), optime: Timestamp 1463585585000|3, memberId: 1, cfgver: 7, config: { _id: 1, host: "10.6.7.116:47003", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } }, { _id: ObjectId('56dd435072b07953b7619bcc'), optime: Timestamp 1463746116000|302, memberId: 5, cfgver: 7, config: { _id: 5, host: "10.6.7.129:47003", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } } ] } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:97 locks:{} 0ms
2016-05-20T20:08:36.650+0800 I COMMAND  [conn163] command xxx.aaa command: replSetUpdatePosition { replSetUpdatePosition: 1, optimes: [ { _id: ObjectId('565f116c85ef5155d15a0d14'), optime: Timestamp 1463746070000|112, memberId: 0, cfgver: 7, config: { _id: 0, host: "10.6.7.115:47003", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } }, { _id: ObjectId('5720be08143133e67c0fb1a8'), optime: Timestamp 1463585585000|3, memberId: 1, cfgver: 7, config: { _id: 1, host: "10.6.7.116:47003", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } }, { _id: ObjectId('56dd435072b07953b7619bcc'), optime: Timestamp 1463746116000|303, memberId: 5, cfgver: 7, config: { _id: 5, host: "10.6.7.129:47003", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } } ] } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:97 locks:{} 0ms
2016-05-20T20:08:36.650+0800 I QUERY    [conn36825]   query ....... 0ms
2016-05-20T20:08:36.655+0800 I QUERY    [conn36825]   query ....... 0ms
2016-05-20T20:08:36.655+0800 I QUERY    [conn23356]   query ....... 0ms
2016-05-20T20:08:36.655+0800 I WRITE    [conn26584]   insert ...... 0ms
2016-05-20T20:08:36.656+0800 I COMMAND  [conn26584] command bbb.xxx command: insert ... 0ms
2016-05-20T20:08:36.656+0800 I QUERY    [conn162] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1463585585000|3 } } cursorid:169950952477 ntoreturn:0 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:1 reslen:513 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, oplog: { acquireCount: { r: 2 } } } 4ms
2016-05-20T20:08:36.656+0800 I QUERY    [conn158] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1463585585000|3 } } cursorid:168949710400 ntoreturn:0 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:1 reslen:513 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, oplog: { acquireCount: { r: 2 } } } 4ms
2016-05-20T20:08:36.656+0800 I QUERY    [conn33760] query bbb.xxx query: { userId: 28344562 } planSummary: IXSCAN { userId: 1.0 } ntoreturn:0 ntoskip:0 nscanned:5 nscannedObjects:5 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:5 reslen:1985 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } 1ms
2016-05-20T20:08:36.657+0800 I COMMAND  [conn163] command admin.$cmd command: replSetUpdatePosition { replSetUpdatePosition: 1, optimes: [ { _id: ObjectId('565f116c85ef5155d15a0d14'), optime: Timestamp 1463746070000|112, memberId: 0, cfgver: 7, config: { _id: 0, host: "10.6.7.115:47003", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } }, { _id: ObjectId('5720be08143133e67c0fb1a8'), optime: Timestamp 1463585585000|3, memberId: 1, cfgver: 7, config: { _id: 1, host: "10.6.7.116:47003", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } }, { _id: ObjectId('56dd435072b07953b7619bcc'), optime: Timestamp 1463746116000|304, memberId: 5, cfgver: 7, config: { _id: 5, host: "10.6.7.129:47003", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } } ] } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:97 locks:{} 0ms
2016-05-20T20:08:36.747+0800 I COMMAND  [conn34681] command admin.$cmd command: setShardVersion { setShardVersion: "xxx.bbb", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 5000|14, versionEpoch: ObjectId('573cc54cc19de8acaafe1323') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:146 locks:{} 0ms
2016-05-20T20:08:36.749+0800 I COMMAND  [conn25365] command admin.$cmd command: setShardVersion { setShardVersion: "xxx.bbb", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 5000|14, versionEpoch: ObjectId('573cc54cc19de8acaafe1323') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:289 locks:{} 0ms
2016-05-20T20:08:36.749+0800 I COMMAND  [conn25816] command admin.$cmd command: setShardVersion { setShardVersion: "xxx.bbb", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 5000|26, versionEpoch: ObjectId('573cc54cc19de8acaafe1323') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:146 locks:{} 0ms
2016-05-20T20:08:36.751+0800 I NETWORK  [conn25365] end connection 10.6.7.111:39415 (1265 connections now open)
2016-05-20T20:08:36.753+0800 I COMMAND  [conn36650] command admin.$cmd command: setShardVersion { setShardVersion: "xxx.bbb", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 5000|26, versionEpoch: ObjectId('573cc54cc19de8acaafe1323') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:146 locks:{} 0ms
2016-05-20T20:08:36.767+0800 I COMMAND  [conn31145] command admin.$cmd command: setShardVersion { setShardVersion: "xxx.bbb", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 5000|26, versionEpoch: ObjectId('573cc54cc19de8acaafe1323') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:146 locks:{} 0ms
2016-05-20T20:08:36.776+0800 I COMMAND  [conn349] command admin.$cmd command: setShardVersion { setShardVersion: "yyy.cccc", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 644000|682, versionEpoch: ObjectId('55e46bd095f6e6a11b57c560') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:146 locks:{} 0ms
2016-05-20T20:08:36.776+0800 I COMMAND  [conn34452] command admin.$cmd command: setShardVersion { setShardVersion: "xxx.bbb", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 5000|26, versionEpoch: ObjectId('573cc54cc19de8acaafe1323') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:146 locks:{} 0ms
2016-05-20T20:08:36.793+0800 I COMMAND  [conn36680] command admin.$cmd command: setShardVersion { setShardVersion: "xxx.bbb", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 5000|26, versionEpoch: ObjectId('573cc54cc19de8acaafe1323') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:146 locks:{} 0ms
2016-05-20T20:08:36.807+0800 I COMMAND  [conn36643] command admin.$cmd command: setShardVersion { setShardVersion: "xxx.bbb", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 5000|26, versionEpoch: ObjectId('573cc54cc19de8acaafe1323') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:146 locks:{} 0ms
2016-05-20T20:08:36.808+0800 I COMMAND  [conn34662] command admin.$cmd command: setShardVersion { setShardVersion: "xxx.bbb", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 5000|26, versionEpoch: ObjectId('573cc54cc19de8acaafe1323') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:146 locks:{} 0ms
2016-05-20T20:08:36.823+0800 I COMMAND  [conn8764] command admin.$cmd command: logout { logout: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:97 locks:{} 0ms
2016-05-20T20:08:36.823+0800 I COMMAND  [conn8764] command admin.$cmd command: isMaster { ismaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:539 locks:{} 0ms
2016-05-20T20:08:36.826+0800 I COMMAND  [conn31256] command admin.$cmd command: setShardVersion { setShardVersion: "vox-exam.afentiWrongQuestionLibrary", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 644000|682, versionEpoch: ObjectId('55e46bd095f6e6a11b57c560') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:146 locks:{} 0ms
2016-05-20T20:08:36.838+0800 I COMMAND  [conn19766] command admin.$cmd command: setShardVersion { setShardVersion: "vox-exam.afentiWrongQuestionLibrary", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 644000|682, versionEpoch: ObjectId('55e46bd095f6e6a11b57c560') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:146 locks:{} 0ms
2016-05-20T20:08:36.853+0800 I NETWORK  [conn50144] end connection 10.6.7.115:46528 (1264 connections now open)
2016-05-20T20:08:36.854+0800 I NETWORK  [initandlisten] connection accepted from 10.6.7.115:46532 #50148 (1265 connections now open)
2016-05-20T20:08:36.878+0800 I COMMAND  [conn25438] command admin.$cmd command: setShardVersion { setShardVersion: "xxx.bbbb", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 5000|26, versionEpoch: ObjectId('573cc54cc19de8acaafe1323') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:146 locks:{} 0ms
2016-05-20T20:08:36.880+0800 I COMMAND  [conn50148] command local.$cmd command: saslStart { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: "xxx" } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:239 locks:{} 22ms



 Comments   
Comment by Ramon Fernandez Marina [ 08/Jun/16 ]

Thanks for your report liang.zhang@17zuoye.com. Please note that the SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag, where your question will reach a larger audience. A question like this involving more discussion would be best posted on the mongodb-user group. See also our Technical Support page for additional support resources.

Regards,
Ramón.

Comment by ufo ofu [ 03/Jun/16 ]

unstable situation occurs again. and have debug log

Comment by ufo ofu [ 20/May/16 ]

log file

Generated at Thu Feb 08 04:05:46 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.