[SERVER-6684] multi_mongos2.js SCE thrown in 2.0 mongod/2.2 mongos test
Created: 01/Aug/12  Updated: 11/Jul/16  Resolved: 08/Aug/12

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 2.2.0-rc1

Type: Bug
Priority: Major - P3
Reporter: Greg Studer
Assignee: Spencer Brody (Inactive)
Resolution: Done
Votes: 0
Labels: buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File multi_mongos2.js    
Operating System: ALL
Participants:

 Description   

Not sure what is causing this.



 Comments   
Comment by auto [ 09/Aug/12 ]

Author: Spencer T Brody <spencer@10gen.com> (2012-08-09T08:25:33-07:00)

Message: Make handling of stale configs in ParallelSortClusteredCursor safer
by only checking the result object on commands. SERVER-6685 SERVER-6684
Branch: master
https://github.com/mongodb/mongo/commit/58acf9dc86cb571a18e5b720eff11c5e4b55a598

Comment by auto [ 08/Aug/12 ]

Author: Spencer T Brody <spencer@10gen.com> (2012-08-07T12:52:51-07:00)

Message: Check result object of command for stale config in ParallelSortClusteredCursor. SERVER-6684 SERVER-6685
Branch: master
https://github.com/mongodb/mongo/commit/797654e85cb58cc8fa6310be3b619bb3f568c784

Comment by Spencer Brody (Inactive) [ 07/Aug/12 ]

The problem seems to be with the handling of StaleConfigExceptions in the count command.

Running this same test with both mongos and mongod using 2.0.6 passes. In the logs from a purely 2.0 run you can see that at the point where it fails with a 2.2 mongos (during a count command), the same StaleConfigException is thrown and shows up in the logs, but the 2.0 mongos recovers while the 2.2 mongos does not. The mongos implementation of count in 2.0 had a bunch of logic for retrying on stale config exceptions (and sending an authoritative setShardVersion after 3 failed attempts), whereas the 2.2 mongos uses SHARDED->commandOp, which uses a ParallelSortClusteredCursor. I suspect that the ParallelSortClusteredCursor isn't handling StaleConfigExceptions in the same way that the count command did in 2.0, and that's what's causing the problem.
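To make the 2.0 behaviour described above concrete, here is a rough sketch of a retry loop that recovers from stale config errors and switches to an authoritative refresh after 3 failed attempts. This is not the actual mongos code: `StaleConfigError`, `run_with_stale_config_retry`, the `refresh` hook, and the `max_attempts` cap are all hypothetical names and values chosen for illustration. Only the "authoritative after 3 failures" threshold comes from the description above.

```python
class StaleConfigError(Exception):
    """Hypothetical stand-in for the StaleConfigException raised in mongos."""


def run_with_stale_config_retry(op, refresh, max_attempts=5, authoritative_after=3):
    """Run op(), retrying whenever it raises StaleConfigError.

    refresh(authoritative=...) is a hypothetical hook that reloads shard
    version information. After `authoritative_after` failures it is called
    with authoritative=True, loosely mirroring the authoritative
    setShardVersion mentioned in the comment. max_attempts is an assumed cap.
    """
    failures = 0
    while True:
        try:
            return op()
        except StaleConfigError:
            failures += 1
            if failures >= max_attempts:
                raise  # give up and surface the error to the caller
            refresh(authoritative=failures >= authoritative_after)


# Usage: a fake count command that is stale for its first three calls.
calls = {"n": 0}


def flaky_count():
    calls["n"] += 1
    if calls["n"] <= 3:
        raise StaleConfigError("stale config")
    return 42


refreshes = []
result = run_with_stale_config_retry(
    flaky_count, lambda authoritative: refreshes.append(authoritative)
)
```

In this sketch the first two refreshes are non-authoritative and the third is authoritative, after which the fake count succeeds. A caller with no retry loop at all would surface the first stale config error directly, which is roughly the failure mode suspected for the 2.2 mongos path.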

Comment by Greg Studer [ 01/Aug/12 ]

Could be due to better detection of dropped collections, but unsure why it wouldn't fail in 2.0 tests also.

Comment by Greg Studer [ 01/Aug/12 ]

bouncing_count.js may have a similar intermittent failure.
