- Type: Bug
- Resolution: Incomplete
- Priority: Major - P3
- Affects Version/s: 1.8.2
- Component/s: None
- Environment: Ubuntu on EC2 (Linux)
Something (I don't know what yet) caused one of my shards to fail over. When I look at the logs on the old master, I see this:
Thu May 12 15:47:22 [conn21060] remove example.userEventsJournal query:
{ _id: ObjectId('4d56ea01fa00a03b38c17fde') } 232ms
Thu May 12 15:47:23 [conn21060] remove example.userEventsJournal query:
291ms
Thu May 12 15:47:25 [conn20802] query example.$cmd ntoreturn:1 command: { distinct: "cookbook", key: "cookbookId", query: {} } reslen:1445 1401ms
Thu May 12 15:47:25 [conn20841] query example.$cmd ntoreturn:1 command: { distinct: "cookbook", key: "cookbookId", query: {} } reslen:1445 664ms
Thu May 12 15:47:28 [dur] lsn set 120976684
Thu May 12 15:47:41 [conn21062] getmore local.oplog.rs cid:3926093491549750074 getMore: { ts:
} bytes:789 nreturned:3 12119ms
Thu May 12 15:47:41 [ReplSetHealthPollTask] replSet member mongodb05.example.com:27018 PRIMARY
Thu May 12 15:47:41 [conn21035] Assertion: 13312:replSet error : logOp() but not primary?
0x55f1b9 0x712a31 0x7108f2 0x668fa1 0x66ba60 0x75814f 0x75aae1 0x8a75ee 0x8ba510 0x7fb642df69ca 0x7fb6423a570d
/usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x129) [0x55f1b9]
/usr/bin/mongod() [0x712a31]
/usr/bin/mongod(_ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_Pb+0x42) [0x7108f2]
/usr/bin/mongod(_ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugEPNS_11RemoveSaverE+0x1131) [0x668fa1]
/usr/bin/mongod(_ZN5mongo13updateObjectsEPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugE+0x130) [0x66ba60]
/usr/bin/mongod(_ZN5mongo14receivedUpdateERNS_7MessageERNS_5CurOpE+0x47f) [0x75814f]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_8SockAddrE+0x1941) [0x75aae1]
/usr/bin/mongod(_ZN5mongo10connThreadEPNS_13MessagingPortE+0x21e) [0x8a75ee]
/usr/bin/mongod(thread_proxy+0x80) [0x8ba510]
/lib/libpthread.so.0(+0x69ca) [0x7fb642df69ca]
/lib/libc.so.6(clone+0x6d) [0x7fb6423a570d]
Thu May 12 15:47:41 [conn21035] update example.stats query:
exception 13312 replSet error : logOp() but not primary? 1ms
Thu May 12 15:47:41 [conn20999] Assertion: 13312:replSet error : logOp() but not primary?
0x55f1b9 0x712a31 0x7108f2 0x668fa1 0x66ba60 0x75814f 0x75aae1 0x8a75ee 0x8ba510 0x7fb642df69ca 0x7fb6423a570d
/usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x129) [0x55f1b9]
/usr/bin/mongod() [0x712a31]
/usr/bin/mongod(_ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_Pb+0x42) [0x7108f2]
/usr/bin/mongod(_ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugEPNS_11RemoveSaverE+0x1131) [0x668fa1]
/usr/bin/mongod(_ZN5mongo13updateObjectsEPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugE+0x130) [0x66ba60]
/usr/bin/mongod(_ZN5mongo14receivedUpdateERNS_7MessageERNS_5CurOpE+0x47f) [0x75814f]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_8SockAddrE+0x1941) [0x75aae1]
/usr/bin/mongod(_ZN5mongo10connThreadEPNS_13MessagingPortE+0x21e) [0x8a75ee]
/usr/bin/mongod(thread_proxy+0x80) [0x8ba510]
/lib/libpthread.so.0(+0x69ca) [0x7fb642df69ca]
/lib/libc.so.6(clone+0x6d) [0x7fb6423a570d]
There are many more of those "logOp() but not primary?" messages after this, followed by several messages like this:
Thu May 12 15:47:41 [conn21003] update example.stats query:
{ entityId: "hal_livingSocial_DFA03", bucketId: "groups", variationId: "_default" } exception 13312 replSet error : logOp() but not primary? 1ms
Thu May 12 15:47:41 [conn21033] assertion 13435 not master and slaveok=false ns:example.ads query:
Thu May 12 15:47:41 [conn21033] ntoskip:0 ntoreturn:-1
Thu May 12 15:47:41 [conn21024] assertion 13435 not master and slaveok=false ns:example.cookbook query:
Thu May 12 15:47:41 [conn21024] ntoskip:0 ntoreturn:-1
Thu May 12 15:47:41 [conn21012] assertion 13435 not master and slaveok=false ns:example.cookbook query:
Thu May 12 15:47:41 [conn21012] ntoskip:0 ntoreturn:-1
Thu May 12 15:47:41 [conn21018] assertion 13435 not master and slaveok=false ns:example.cookbook query:
Thu May 12 15:47:41 [conn21018] ntoskip:0 ntoreturn:-1
Thu May 12 15:47:41 [conn20992] assertion 13435 not master and slaveok=false ns:example.cookbook query:
Thu May 12 15:47:41 [conn20992] ntoskip:0 ntoreturn:-1
Thu May 12 15:47:41 [conn20994] assertion 13435 not master and slaveok=false ns:example.cookbook query:
{ cookbookId: "livingSocialBaselineVariation08" }
Thu May 12 15:47:41 [conn20994] ntoskip:0 ntoreturn:-1
Then this:
Thu May 12 15:47:43 [replica set sync] replSet our last op time written: May 12 15:47:29:10
Thu May 12 15:47:43 [replica set sync] replset source's GTE: May 12 15:47:41:1
Thu May 12 15:47:43 [replica set sync] replSet rollback 0
Thu May 12 15:47:43 [replica set sync] replSet rollback 1
Thu May 12 15:47:43 [replica set sync] replSet rollback 2 FindCommonPoint
Thu May 12 15:47:43 [replica set sync] replSet info rollback our last optime: May 12 15:47:29:10
Thu May 12 15:47:43 [replica set sync] replSet info rollback their last optime: May 12 15:47:42:a
Thu May 12 15:47:43 [replica set sync] replSet info rollback diff in end of log times: -13 seconds
Thu May 12 15:47:43 [replica set sync] replSet rollback found matching events at May 12 15:47:29:d
Thu May 12 15:47:43 [replica set sync] replSet rollback findcommonpoint scanned : 38
Thu May 12 15:47:43 [replica set sync] replSet replSet rollback 3 fixup
Thu May 12 15:47:43 [replica set sync] replSet rollback 3.5
Thu May 12 15:47:43 [replica set sync] replSet rollback 4 n:3
Thu May 12 15:47:43 [replica set sync] replSet minvalid=May 12 15:47:42 4dcc011e:a
Thu May 12 15:47:43 [replica set sync] replSet rollback 4.6
Thu May 12 15:47:43 [replica set sync] replSet rollback 4.7
Thu May 12 15:47:43 [replica set sync] replSet rollback 5 d:0 u:3
Thu May 12 15:47:43 [replica set sync] replSet rollback 6
Thu May 12 15:47:43 [replica set sync] replSet rollback 7
Thu May 12 15:47:43 [replica set sync] replSet rollback done
Thu May 12 15:47:43 [dur] lsn set 120982480
And finally:
Thu May 12 15:48:24 [dur] lsn set 121000882
Thu May 12 15:48:25 [ReplSetHealthPollTask] DBClientCursor::init call() failed
Thu May 12 15:48:25 [ReplSetHealthPollTask] replSet info mongodb05.example.com:27018 is down (or slow to respond): DBClientBase::findOne: transport error: mongodb05.example.com:27018 query:
Thu May 12 15:48:25 [rs Manager] replSet info electSelf 0
Thu May 12 15:48:25 [rs Manager] replSet couldn't elect self, only received -9999 votes
Thu May 12 15:48:29 [ReplSetHealthPollTask] replSet info mongodb05.example.com:27018 is up
Thu May 12 15:49:10 [conn21155] query local.$cmd ntoreturn:1 command:
reslen:284 3638ms
Thu May 12 15:49:33 [dur] lsn set 121030916
Thu May 12 15:50:20 [dur] lsn set 121060941
On the new master, I see this:
Thu May 12 15:47:40 [ReplSetHealthPollTask] DBClientCursor::init call() failed
Thu May 12 15:47:40 [ReplSetHealthPollTask] replSet info mongodb04.socialmedia.com:27018 is down (or slow to respond): DBClientBase::findOne: transport error: mongodb04.socialmedia.com:27018 query:
Thu May 12 15:47:40 [rs Manager] replSet info electSelf 1
Thu May 12 15:47:40 [rs Manager] replSet PRIMARY
Thu May 12 15:47:41 [replica set sync] replSet stopping syncTail we are now primary
Thu May 12 15:47:42 [ReplSetHealthPollTask] replSet info mongodb04.socialmedia.com:27018 is up
Thu May 12 15:47:42 [ReplSetHealthPollTask] replSet member mongodb04.socialmedia.com:27018 RECOVERING
Thu May 12 15:47:46 [conn52890] remove socialmedia.userEventsJournal query:
115ms
Thu May 12 15:47:46 [ReplSetHealthPollTask] replSet member mongodb04.socialmedia.com:27018 SECONDARY
So failover appears to have completed successfully, though now I'm seeing exceptions like this in my app:
ERR [20110512-15:47:43.401] blueeyes: com.mongodb.MongoException: not talking to master and retries used up
ERR [20110512-15:47:43.401] blueeyes: at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:227)
ERR [20110512-15:47:43.401] blueeyes: at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:229)
ERR [20110512-15:47:43.401] blueeyes: at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:229)
ERR [20110512-15:47:43.401] blueeyes: at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:295)
ERR [20110512-15:47:43.401] blueeyes: at com.mongodb.DBCursor._check(DBCursor.java:354)
ERR [20110512-15:47:43.401] blueeyes: at com.mongodb.DBCursor._hasNext(DBCursor.java:484)
ERR [20110512-15:47:43.401] blueeyes: at com.mongodb.DBCursor.hasNext(DBCursor.java:509)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.persistence.mongo.RealDatabaseCollection$$anon$1.hasNext(RealMongoImplementation.scala:82)
ERR [20110512-15:47:43.401] blueeyes: at scala.collection.IterableLike$class.isEmpty(IterableLike.scala:92)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.persistence.mongo.IterableViewImpl.isEmpty(RealMongoImplementation.scala:122)
ERR [20110512-15:47:43.401] blueeyes: at scala.collection.TraversableLike$class.headOption(TraversableLike.scala:483)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.persistence.mongo.IterableViewImpl.headOption(RealMongoImplementation.scala:122)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.persistence.mongo.QueryBehaviours$SelectOneQueryBehaviour$class.query(QueryBehaviours.scala:128)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.persistence.mongo.MongoSelectOneQuery.query(MongoQuery.scala:71)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.persistence.mongo.MongoSelectOneQuery.query(MongoQuery.scala:71)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.persistence.mongo.QueryBehaviours$MongoQueryBehaviour$class.apply(QueryBehaviours.scala:16)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.persistence.mongo.MongoSelectOneQuery.apply(MongoQuery.scala:71)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.persistence.mongo.MongoSelectOneQuery.apply(MongoQuery.scala:71)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.persistence.mongo.MongoDatabase$$anonfun$blueeyes$persistence$mongo$MongoDatabase$$mongoActor$1.apply(Mongo.scala:46)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.persistence.mongo.MongoDatabase$$anonfun$blueeyes$persistence$mongo$MongoDatabase$$mongoActor$1.apply(Mongo.scala:43)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.concurrent.ActorExecutionStrategySequential$$anon$14$$anonfun$submit$1.apply(Actor.scala:17)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.concurrent.Future$$anonfun$deliver$1.apply(Future.scala:50)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.concurrent.Future$$anonfun$deliver$1.apply(Future.scala:46)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.concurrent.ReadWriteLock$class.writeLock(ReadWriteLock.scala:10)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.concurrent.Future$$anon$1.writeLock(Future.scala:22)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.concurrent.Future.deliver(Future.scala:45)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.concurrent.ActorExecutionStrategySequential$$anon$14.submit(Actor.scala:17)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.concurrent.ActorImplementationMultiThreaded$StrategyWorker$$anonfun$run$1.apply$mcV$sp(Actor.scala:195)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.concurrent.ActorImplementationMultiThreaded$ActorContext$.withActorFn(Actor.scala:230)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.concurrent.ActorImplementationMultiThreaded$StrategyWorker.run(Actor.scala:189)
ERR [20110512-15:47:43.401] blueeyes: (...more...)
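The "not talking to master and retries used up" exception is the Java driver giving up after its own small number of retries while the set is between primaries. As a hedged sketch only (FailoverRetry, the attempt count, and the backoff are invented for illustration and are not taken from this app), one way a caller on the ~2.x driver could ride out the election window:

import com.mongodb.MongoException;

import java.util.concurrent.Callable;

public final class FailoverRetry {
    // Re-run an operation a few times so the replica set has time to elect a
    // new primary and the driver has time to rediscover it. Attempt count and
    // backoff are illustrative, not tuned values.
    public static <T> T withRetry(Callable<T> op, int attempts, long backoffMs) throws Exception {
        MongoException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return op.call();
            } catch (MongoException e) {
                // e.g. "not talking to master and retries used up" during failover
                last = e;
                Thread.sleep(backoffMs * (i + 1));
            }
        }
        if (last == null) {
            throw new IllegalArgumentException("attempts must be > 0");
        }
        throw last;
    }
}

A caller would wrap the findOne (or update) in a Callable and pass it to withRetry; how many attempts and how much backoff make sense depends on how long elections take in the deployment.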
I do not see any errors in mongos.log around this time:
Thu May 12 15:47:16 [Balancer] skipping balancing round because balancing is disabled
Thu May 12 15:47:26 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb10.example.com:27018
checking replica set: 1
Thu May 12 15:47:26 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb01.example.com:27018
checking replica set: 2
Thu May 12 15:47:26 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb04.example.com:27018
checking replica set: 3
Thu May 12 15:47:26 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb07.example.com:27018
checking replica set: 4
Thu May 12 15:47:33 [LockPinger] dist_lock pinged successfully for: appserver01:1305178845:1804289383
Thu May 12 15:47:36 [WriteBackListener] writebacklisten result:
Thu May 12 15:47:36 [WriteBackListener] writebacklisten result:
{ noop: true, ok: 1.0 }
Thu May 12 15:47:36 [WriteBackListener] writebacklisten result:
{ noop: true, ok: 1.0 }
Thu May 12 15:47:36 [WriteBackListener] writebacklisten result:
{ noop: true, ok: 1.0 }
Thu May 12 15:47:36 [WriteBackListener] writebacklisten result:
{ noop: true, ok: 1.0 }
Thu May 12 15:47:36 [WriteBackListener] writebacklisten result:
{ noop: true, ok: 1.0 }
Thu May 12 15:47:36 [WriteBackListener] writebacklisten result:
{ noop: true, ok: 1.0 }
Thu May 12 15:47:46 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb10.example.com:27018
{ setName: "4", ismaster: true, secondary: false, hosts: [ "mongodb10.example.com:27018", "mongodb11.example.com:27018" ], arbiters: [ "mongodb12.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }
checking replica set: 1
Thu May 12 15:47:46 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb01.example.com:27018
checking replica set: 2
Thu May 12 15:47:46 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb04.example.com:27018
_check : 2/mongodb05.example.com:27018,mongodb04.example.com:27018,mongodb04.example.com:27018
Thu May 12 15:47:46 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb05.example.com:27018
checking replica set: 3
Thu May 12 15:47:46 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb07.example.com:27018
checking replica set: 4
Thu May 12 15:47:46 BackgroundJob starting: ConnectBG
Thu May 12 15:47:46 [Balancer] skipping balancing round because balancing is disabled
Thu May 12 15:47:47 BackgroundJob starting: ConnectBG
Thu May 12 15:47:50 [WriteBackListener] writebacklisten result:
Thu May 12 15:48:03 [LockPinger] dist_lock pinged successfully for: appserver01:1305178845:1804289383
Thu May 12 15:48:06 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb10.example.com:27018
checking replica set: 1
Thu May 12 15:48:06 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb01.example.com:27018
checking replica set: 2
Thu May 12 15:48:06 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb05.example.com:27018
checking replica set: 3
Thu May 12 15:48:06 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb07.example.com:27018
checking replica set: 4
- related to SERVER-2882: operations are still written on primary after it steps down (Closed)