[SERVER-3087] "Assertion: 13312:replSet error : logOp() but not primary?" and "com.mongodb.MongoException: not talking to master and retries used up" Created: 12/May/11  Updated: 15/Aug/12  Resolved: 13/Jun/12

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 1.8.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Michael Conigliaro Assignee: Greg Studer
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu on EC2


Issue Links:
Depends
Related
related to SERVER-2882 operations are still written on prima... Closed
Operating System: Linux
Participants:

 Description   

Something (I don't know what yet) caused one of my shards to fail over. When I look at the logs on the old master, I see this:

Thu May 12 15:47:22 [conn21060] remove example.userEventsJournal query:

{ _id: ObjectId('4d56ea01fa00a03b38c17fde') }

232ms
Thu May 12 15:47:23 [conn21060] remove example.userEventsJournal query:

{ _id: ObjectId('4d56ea0a04e235872bd9e29e') }

291ms
Thu May 12 15:47:25 [conn20802] query example.$cmd ntoreturn:1 command: { distinct: "cookbook", key: "cookbookId", query: {} } reslen:1445 1401ms
Thu May 12 15:47:25 [conn20841] query example.$cmd ntoreturn:1 command: { distinct: "cookbook", key: "cookbookId", query: {} } reslen:1445 664ms
Thu May 12 15:47:28 [dur] lsn set 120976684
Thu May 12 15:47:41 [conn21062] getmore local.oplog.rs cid:3926093491549750074 getMore: { ts:

{ $gte: new Date(5605849438531616885) }

} bytes:789 nreturned:3 12119ms
Thu May 12 15:47:41 [ReplSetHealthPollTask] replSet member mongodb05.example.com:27018 PRIMARY
Thu May 12 15:47:41 [conn21035] Assertion: 13312:replSet error : logOp() but not primary?
0x55f1b9 0x712a31 0x7108f2 0x668fa1 0x66ba60 0x75814f 0x75aae1 0x8a75ee 0x8ba510 0x7fb642df69ca 0x7fb6423a570d
/usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x129) [0x55f1b9]
/usr/bin/mongod() [0x712a31]
/usr/bin/mongod(_ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_Pb+0x42) [0x7108f2]
/usr/bin/mongod(_ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugEPNS_11RemoveSaverE+0x1131) [0x668fa1]
/usr/bin/mongod(_ZN5mongo13updateObjectsEPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugE+0x130) [0x66ba60]
/usr/bin/mongod(_ZN5mongo14receivedUpdateERNS_7MessageERNS_5CurOpE+0x47f) [0x75814f]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_8SockAddrE+0x1941) [0x75aae1]
/usr/bin/mongod(_ZN5mongo10connThreadEPNS_13MessagingPortE+0x21e) [0x8a75ee]
/usr/bin/mongod(thread_proxy+0x80) [0x8ba510]
/lib/libpthread.so.0(+0x69ca) [0x7fb642df69ca]
/lib/libc.so.6(clone+0x6d) [0x7fb6423a570d]
Thu May 12 15:47:41 [conn21035] update example.stats query:

{ entityId: "hal_livingSocial_DFA03", bucketId: "groups", variationId: "_default" }

exception 13312 replSet error : logOp() but not primary? 1ms
Thu May 12 15:47:41 [conn20999] Assertion: 13312:replSet error : logOp() but not primary?
0x55f1b9 0x712a31 0x7108f2 0x668fa1 0x66ba60 0x75814f 0x75aae1 0x8a75ee 0x8ba510 0x7fb642df69ca 0x7fb6423a570d
/usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x129) [0x55f1b9]
/usr/bin/mongod() [0x712a31]
/usr/bin/mongod(_ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_Pb+0x42) [0x7108f2]
/usr/bin/mongod(_ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugEPNS_11RemoveSaverE+0x1131) [0x668fa1]
/usr/bin/mongod(_ZN5mongo13updateObjectsEPKcRKNS_7BSONObjES2_bbbRNS_7OpDebugE+0x130) [0x66ba60]
/usr/bin/mongod(_ZN5mongo14receivedUpdateERNS_7MessageERNS_5CurOpE+0x47f) [0x75814f]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_8SockAddrE+0x1941) [0x75aae1]
/usr/bin/mongod(_ZN5mongo10connThreadEPNS_13MessagingPortE+0x21e) [0x8a75ee]
/usr/bin/mongod(thread_proxy+0x80) [0x8ba510]
/lib/libpthread.so.0(+0x69ca) [0x7fb642df69ca]
/lib/libc.so.6(clone+0x6d) [0x7fb6423a570d]

There are many more of those "logOp() but not primary?" messages after this, followed by several messages like this:

Thu May 12 15:47:41 [conn21003] update example.stats query:

{ entityId: "hal_livingSocial_DFA03", bucketId: "groups", variationId: "_default" }

exception 13312 replSet error : logOp() but not primary? 1ms
Thu May 12 15:47:41 [conn21033] assertion 13435 not master and slaveok=false ns:example.ads query:

{ adId: "hal_livingSocial_cookbook_dfa_winning02" }

Thu May 12 15:47:41 [conn21033] ntoskip:0 ntoreturn:-1
Thu May 12 15:47:41 [conn21024] assertion 13435 not master and slaveok=false ns:example.cookbook query:

{ cookbookId: "livingSocialBaselineVariation07" }

Thu May 12 15:47:41 [conn21024] ntoskip:0 ntoreturn:-1
Thu May 12 15:47:41 [conn21012] assertion 13435 not master and slaveok=false ns:example.cookbook query:

{ cookbookId: "livingSocialBaselineVariation08" }

Thu May 12 15:47:41 [conn21012] ntoskip:0 ntoreturn:-1
Thu May 12 15:47:41 [conn21018] assertion 13435 not master and slaveok=false ns:example.cookbook query:

{ cookbookId: "livingSocialBaselineVariation07" }

Thu May 12 15:47:41 [conn21018] ntoskip:0 ntoreturn:-1
Thu May 12 15:47:41 [conn20992] assertion 13435 not master and slaveok=false ns:example.cookbook query:

{ cookbookId: "livingSocialBaselineVariation08" }

Thu May 12 15:47:41 [conn20992] ntoskip:0 ntoreturn:-1
Thu May 12 15:47:41 [conn20994] assertion 13435 not master and slaveok=false ns:example.cookbook query:

{ cookbookId: "livingSocialBaselineVariation08" }

Thu May 12 15:47:41 [conn20994] ntoskip:0 ntoreturn:-1

Then this:

Thu May 12 15:47:43 [replica set sync] replSet our last op time written: May 12 15:47:29:10
Thu May 12 15:47:43 [replica set sync] replset source's GTE: May 12 15:47:41:1
Thu May 12 15:47:43 [replica set sync] replSet rollback 0
Thu May 12 15:47:43 [replica set sync] replSet rollback 1
Thu May 12 15:47:43 [replica set sync] replSet rollback 2 FindCommonPoint
Thu May 12 15:47:43 [replica set sync] replSet info rollback our last optime: May 12 15:47:29:10
Thu May 12 15:47:43 [replica set sync] replSet info rollback their last optime: May 12 15:47:42:a
Thu May 12 15:47:43 [replica set sync] replSet info rollback diff in end of log times: -13 seconds
Thu May 12 15:47:43 [replica set sync] replSet rollback found matching events at May 12 15:47:29:d
Thu May 12 15:47:43 [replica set sync] replSet rollback findcommonpoint scanned : 38
Thu May 12 15:47:43 [replica set sync] replSet replSet rollback 3 fixup
Thu May 12 15:47:43 [replica set sync] replSet rollback 3.5
Thu May 12 15:47:43 [replica set sync] replSet rollback 4 n:3
Thu May 12 15:47:43 [replica set sync] replSet minvalid=May 12 15:47:42 4dcc011e:a
Thu May 12 15:47:43 [replica set sync] replSet rollback 4.6
Thu May 12 15:47:43 [replica set sync] replSet rollback 4.7
Thu May 12 15:47:43 [replica set sync] replSet rollback 5 d:0 u:3
Thu May 12 15:47:43 [replica set sync] replSet rollback 6
Thu May 12 15:47:43 [replica set sync] replSet rollback 7
Thu May 12 15:47:43 [replica set sync] replSet rollback done
Thu May 12 15:47:43 [dur] lsn set 120982480

And finally:

Thu May 12 15:48:24 [dur] lsn set 121000882
Thu May 12 15:48:25 [ReplSetHealthPollTask] DBClientCursor::init call() failed
Thu May 12 15:48:25 [ReplSetHealthPollTask] replSet info mongodb05.example.com:27018 is down (or slow to respond): DBClientBase::findOne: transport error: mongodb05.example.com:27018 query:

{ replSetHeartbeat: "2", v: 7, pv: 1, checkEmpty: false, from: "mongodb04.example.com:27018" }

Thu May 12 15:48:25 [rs Manager] replSet info electSelf 0
Thu May 12 15:48:25 [rs Manager] replSet couldn't elect self, only received -9999 votes
Thu May 12 15:48:29 [ReplSetHealthPollTask] replSet info mongodb05.example.com:27018 is up
Thu May 12 15:49:10 [conn21155] query local.$cmd ntoreturn:1 command:

{ collstats: "oplog.rs" }

reslen:284 3638ms
Thu May 12 15:49:33 [dur] lsn set 121030916
Thu May 12 15:50:20 [dur] lsn set 121060941

On the new master, I see this:

Thu May 12 15:47:40 [ReplSetHealthPollTask] DBClientCursor::init call() failed
Thu May 12 15:47:40 [ReplSetHealthPollTask] replSet info mongodb04.socialmedia.com:27018 is down (or slow to respond): DBClientBase::findOne: transport error: mongodb04.socialmedia.com:27018 query:

{ replSetHeartbeat: "2", v: 7, pv: 1, checkEmpty: false, from: "mongodb05.socialmedia.com:27018" }

Thu May 12 15:47:40 [rs Manager] replSet info electSelf 1
Thu May 12 15:47:40 [rs Manager] replSet PRIMARY
Thu May 12 15:47:41 [replica set sync] replSet stopping syncTail we are now primary
Thu May 12 15:47:42 [ReplSetHealthPollTask] replSet info mongodb04.socialmedia.com:27018 is up
Thu May 12 15:47:42 [ReplSetHealthPollTask] replSet member mongodb04.socialmedia.com:27018 RECOVERING
Thu May 12 15:47:46 [conn52890] remove socialmedia.userEventsJournal query:

{ _id: ObjectId('4d56eb456655aa982210b2b9') }

115ms
Thu May 12 15:47:46 [ReplSetHealthPollTask] replSet member mongodb04.socialmedia.com:27018 SECONDARY

So failover appears to have completed successfully, though now I'm seeing exceptions like this in my app:

ERR [20110512-15:47:43.401] blueeyes: com.mongodb.MongoException: not talking to master and retries used up
ERR [20110512-15:47:43.401] blueeyes: at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:227)
ERR [20110512-15:47:43.401] blueeyes: at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:229)
ERR [20110512-15:47:43.401] blueeyes: at com.mongodb.DBTCPConnector.call(DBTCPConnector.java:229)
ERR [20110512-15:47:43.401] blueeyes: at com.mongodb.DBApiLayer$MyCollection.__find(DBApiLayer.java:295)
ERR [20110512-15:47:43.401] blueeyes: at com.mongodb.DBCursor._check(DBCursor.java:354)
ERR [20110512-15:47:43.401] blueeyes: at com.mongodb.DBCursor._hasNext(DBCursor.java:484)
ERR [20110512-15:47:43.401] blueeyes: at com.mongodb.DBCursor.hasNext(DBCursor.java:509)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.persistence.mongo.RealDatabaseCollection$$anon$1.hasNext(RealMongoImplementation.scala:82)
ERR [20110512-15:47:43.401] blueeyes: at scala.collection.IterableLike$class.isEmpty(IterableLike.scala:92)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.persistence.mongo.IterableViewImpl.isEmpty(RealMongoImplementation.scala:122)
ERR [20110512-15:47:43.401] blueeyes: at scala.collection.TraversableLike$class.headOption(TraversableLike.scala:483)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.persistence.mongo.IterableViewImpl.headOption(RealMongoImplementation.scala:122)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.persistence.mongo.QueryBehaviours$SelectOneQueryBehaviour$class.query(QueryBehaviours.scala:128)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.persistence.mongo.MongoSelectOneQuery.query(MongoQuery.scala:71)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.persistence.mongo.MongoSelectOneQuery.query(MongoQuery.scala:71)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.persistence.mongo.QueryBehaviours$MongoQueryBehaviour$class.apply(QueryBehaviours.scala:16)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.persistence.mongo.MongoSelectOneQuery.apply(MongoQuery.scala:71)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.persistence.mongo.MongoSelectOneQuery.apply(MongoQuery.scala:71)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.persistence.mongo.MongoDatabase$$anonfun$blueeyes$persistence$mongo$MongoDatabase$$mongoActor$1.apply(Mongo.scala:46)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.persistence.mongo.MongoDatabase$$anonfun$blueeyes$persistence$mongo$MongoDatabase$$mongoActor$1.apply(Mongo.scala:43)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.concurrent.ActorExecutionStrategySequential$$anon$14$$anonfun$submit$1.apply(Actor.scala:17)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.concurrent.Future$$anonfun$deliver$1.apply(Future.scala:50)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.concurrent.Future$$anonfun$deliver$1.apply(Future.scala:46)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.concurrent.ReadWriteLock$class.writeLock(ReadWriteLock.scala:10)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.concurrent.Future$$anon$1.writeLock(Future.scala:22)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.concurrent.Future.deliver(Future.scala:45)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.concurrent.ActorExecutionStrategySequential$$anon$14.submit(Actor.scala:17)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.concurrent.ActorImplementationMultiThreaded$StrategyWorker$$anonfun$run$1.apply$mcV$sp(Actor.scala:195)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.concurrent.ActorImplementationMultiThreaded$ActorContext$.withActorFn(Actor.scala:230)
ERR [20110512-15:47:43.401] blueeyes: at blueeyes.concurrent.ActorImplementationMultiThreaded$StrategyWorker.run(Actor.scala:189)
ERR [20110512-15:47:43.401] blueeyes: (...more...)
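
(As a side note, a minimal way to confirm from the shell which member the driver should now be treating as primary is the isMaster command; the output shape sketched below is an assumption based on the ismaster documents in the mongos log further down.)

<code>
// Illustrative only: run in the mongo shell against any member of shard 2 after the failover
db.isMaster()
// roughly: { ismaster: false, secondary: true, primary: "mongodb05.example.com:27018", ... }
</code>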

I do not see any errors in the mongos log around this time:

Thu May 12 15:47:16 [Balancer] skipping balancing round because balancing is disabled
Thu May 12 15:47:26 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb10.example.com:27018

{ setName: "4", ismaster: true, secondary: false, hosts: [ "mongodb10.example.com:27018", "mongodb11.example.com:27018" ], arbiters: [ "mongodb12.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

checking replica set: 1
Thu May 12 15:47:26 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb01.example.com:27018

{ setName: "1", ismaster: true, secondary: false, hosts: [ "mongodb01.example.com:27018", "mongodb02.example.com:27018" ], arbiters: [ "mongodb03.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

checking replica set: 2
Thu May 12 15:47:26 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb04.example.com:27018

{ setName: "2", ismaster: true, secondary: false, hosts: [ "mongodb04.example.com:27018", "mongodb05.example.com:27018" ], arbiters: [ "mongodb06.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

checking replica set: 3
Thu May 12 15:47:26 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb07.example.com:27018

{ setName: "3", ismaster: true, secondary: false, hosts: [ "mongodb07.example.com:27018", "mongodb08.example.com:27018" ], arbiters: [ "mongodb09.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

checking replica set: 4
Thu May 12 15:47:33 [LockPinger] dist_lock pinged successfully for: appserver01:1305178845:1804289383
Thu May 12 15:47:36 [WriteBackListener] writebacklisten result:

{ noop: true, ok: 1.0 }

Thu May 12 15:47:36 [WriteBackListener] writebacklisten result:

{ noop: true, ok: 1.0 }

Thu May 12 15:47:36 [WriteBackListener] writebacklisten result:

{ noop: true, ok: 1.0 }

Thu May 12 15:47:36 [WriteBackListener] writebacklisten result:

{ noop: true, ok: 1.0 }

Thu May 12 15:47:36 [WriteBackListener] writebacklisten result:

{ noop: true, ok: 1.0 }

Thu May 12 15:47:36 [WriteBackListener] writebacklisten result:

{ noop: true, ok: 1.0 }

Thu May 12 15:47:36 [WriteBackListener] writebacklisten result:

{ noop: true, ok: 1.0 }

Thu May 12 15:47:46 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb10.example.com:27018

{ setName: "4", ismaster: true, secondary: false, hosts: [ "mongodb10.example.com:27018", "mongodb11.example.com:27018" ], arbiters: [ "mongodb12.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

checking replica set: 1
Thu May 12 15:47:46 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb01.example.com:27018

{ setName: "1", ismaster: true, secondary: false, hosts: [ "mongodb01.example.com:27018", "mongodb02.example.com:27018" ], arbiters: [ "mongodb03.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

checking replica set: 2
Thu May 12 15:47:46 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb04.example.com:27018

{ setName: "2", ismaster: false, secondary: true, hosts: [ "mongodb04.example.com:27018", "mongodb05.example.com:27018" ], arbiters: [ "mongodb06.example.com:27018" ], primary: "mongodb05.example.com:27018", maxBsonObjectSize: 16777216, ok: 1.0 }

_check : 2/mongodb05.example.com:27018,mongodb04.example.com:27018,mongodb04.example.com:27018
Thu May 12 15:47:46 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb05.example.com:27018

{ setName: "2", ismaster: true, secondary: false, hosts: [ "mongodb05.example.com:27018", "mongodb04.example.com:27018" ], arbiters: [ "mongodb06.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

checking replica set: 3
Thu May 12 15:47:46 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb07.example.com:27018

{ setName: "3", ismaster: true, secondary: false, hosts: [ "mongodb07.example.com:27018", "mongodb08.example.com:27018" ], arbiters: [ "mongodb09.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

checking replica set: 4
Thu May 12 15:47:46 BackgroundJob starting: ConnectBG
Thu May 12 15:47:46 [Balancer] skipping balancing round because balancing is disabled
Thu May 12 15:47:47 BackgroundJob starting: ConnectBG
Thu May 12 15:47:50 [WriteBackListener] writebacklisten result:

{ noop: true, ok: 1.0 }

Thu May 12 15:48:03 [LockPinger] dist_lock pinged successfully for: appserver01:1305178845:1804289383
Thu May 12 15:48:06 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb10.example.com:27018

{ setName: "4", ismaster: true, secondary: false, hosts: [ "mongodb10.example.com:27018", "mongodb11.example.com:27018" ], arbiters: [ "mongodb12.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

checking replica set: 1
Thu May 12 15:48:06 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb01.example.com:27018

{ setName: "1", ismaster: true, secondary: false, hosts: [ "mongodb01.example.com:27018", "mongodb02.example.com:27018" ], arbiters: [ "mongodb03.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

checking replica set: 2
Thu May 12 15:48:06 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb05.example.com:27018

{ setName: "2", ismaster: true, secondary: false, hosts: [ "mongodb05.example.com:27018", "mongodb04.example.com:27018" ], arbiters: [ "mongodb06.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

checking replica set: 3
Thu May 12 15:48:06 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb07.example.com:27018

{ setName: "3", ismaster: true, secondary: false, hosts: [ "mongodb07.example.com:27018", "mongodb08.example.com:27018" ], arbiters: [ "mongodb09.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

checking replica set: 4



 Comments   
Comment by Greg Studer [ 13/Jun/12 ]

The cursor causing the problem has been deprecated since 1.8 - no repro.

Comment by Ian Whalen (Inactive) [ 25/Apr/12 ]

@michael could you let us know if you're seeing this on 2.0?

Comment by Greg Studer [ 02/Nov/11 ]

Are you still seeing this issue with newer mongos versions? The SerialServer cursor has been deprecated in 2.0.

Comment by Greg Studer [ 12/May/11 ]

Think there may be an issue in the way limits are propagated between mongo clients – looking into it...

See SERVER-2997 for a similar issue with this...

Comment by Michael Conigliaro [ 12/May/11 ]

Quick update. nscanned on shard 1 actually makes sense because it turns out that none of the documents there match the query (a full scan would have been necessary to figure that out). I don't understand what's going on with shard 2 though. I've been running that script directly on mongodb04 for over an hour now with no problems. If I try running it via mongos, it triggers a failover very quickly.

Comment by Michael Conigliaro [ 12/May/11 ]

Here's the explain() output from the mongos. Why is nscanned so much bigger here?

> db.userEventsJournal.find({ "eventId" : { $nin: ["click", "clickThrough", "conversion", "emailConversion", "purchaseConversion"] } }).limit(1).explain()
{
"clusteredType" : "SerialServer",
"shards" : {
"1/mongodb01.example.com:27018,mongodb02.example.com:27018" : [
{
"cursor" : "BasicCursor",
"nscanned" : 132391,
"nscannedObjects" : 132391,
"n" : 0,
"millis" : 203,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {

}
}
],
"2/mongodb05.example.com:27018,mongodb04.example.com:27018" : [
{
"cursor" : "BasicCursor",
"nscanned" : 6985870,
"nscannedObjects" : 6985870,
"n" : 1,
"millis" : 24938,
"nYields" : 224,
"nChunkSkips" : 6974952,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {

}
}
],
"3/mongodb07.example.com:27018,mongodb08.example.com:27018" : [
{
"cursor" : "BasicCursor",
"nscanned" : 1,
"nscannedObjects" : 1,
"n" : 1,
"millis" : 13,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {

}
}
],
"4/mongodb10.example.com:27018,mongodb11.example.com:27018" : [
{
"cursor" : "BasicCursor",
"nscanned" : 1,
"nscannedObjects" : 1,
"n" : 1,
"millis" : 71,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {

}
}
]
},
"n" : 3,
"nChunkSkips" : 6974952,
"nYields" : 224,
"nscanned" : 7118263,
"nscannedObjects" : 7118263,
"millisTotal" : 25225,
"millisAvg" : 6306,
"numQueries" : 4,
"numShards" : 4
}

Comment by Michael Conigliaro [ 12/May/11 ]

OK, I just reproduced this. It was caused by this script I'm currently using to delete old data:

<code>
allowed_events = ["click", "clickThrough", "conversion", "emailConversion", "purchaseConversion"];
limit = 500;

print("total: " + db.userEventsJournal.count());
i = 0;
db.userEventsJournal.find({ "eventId" : { $nin: allowed_events } }).limit(limit).forEach(function(o) {
  /* print(o.eventId); */
  db.userEventsJournal.remove({ _id: o._id });
  i += 1;
});
print("deleted: " + i);
</code>

When I run it via mongos, it absolutely hammers one of the masters in my replica set (mongodb04). CPU goes to 100% and if it stays pegged long enough, a failover occurs. Now here's the really interesting thing. When I run this script directly on the master instance, everything works fine. I can even remove 50000+ records at a time with no issues at all. Here's some explain() output:

db.userEventsJournal.find({ "eventId" : { $nin: allowed_events } }).limit(limit).explain()
{
"cursor" : "BasicCursor",
"nscanned" : 592,
"nscannedObjects" : 592,
"n" : 500,
"millis" : 1,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {

}
}

So it only scans 592 before it finds the 500 I'm interested in. Great! But when I run it via the mongos, I see messages like this on the master:

Thu May 12 18:15:09 [conn22079] query example.userEventsJournal ntoreturn:5000 reslen:923122 nscanned:7414056 { eventId: { $nin: [ "click", "clickThrough", "conversion", "emailConversion", "purchaseConversion" ] } } nreturned:5000 18539ms

Notice that via mongos, it's scanning 7414056 records before it finds 5000 that I'm interested in. I'm sure this is what's causing the master to become unresponsive, but I don't understand why the behavior is different when I run the query through a mongos...
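
A minimal sketch of how the same cleanup could be run in bounded passes directly against the shard primary (the mode that works fine above); the batch size and the pause between passes are illustrative assumptions, not tested values:

<code>
// Illustrative only: same query as the script above, but run directly on the
// shard primary (not via mongos) and repeated in small batches with a pause,
// so the primary is never pegged for long. Batch size and sleep are guesses.
allowed_events = ["click", "clickThrough", "conversion", "emailConversion", "purchaseConversion"];
batch = 500;
do {
  removed = 0;
  db.userEventsJournal.find({ "eventId" : { $nin: allowed_events } }).limit(batch).forEach(function(o) {
    db.userEventsJournal.remove({ _id: o._id });
    removed += 1;
  });
  print("removed this pass: " + removed);
  sleep(1000); // give replication and other traffic room between passes
} while (removed == batch);
</code>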

Comment by Kristina Chodorow (Inactive) [ 12/May/11 ]

There seems to be more going on than just the stepdown bug, though.
