[SERVER-4215] multiple connections to shards can rarely get mutually inconsistent versions when migrate occurs
Created: 04/Nov/11  Updated: 11/Jul/16  Resolved: 18/Jan/12

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.1.0
Fix Version/s: 2.1.0

Type: Bug
Priority: Major - P3
Reporter: Greg Studer
Assignee: Greg Studer
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: mr_output_that_fails.txt (Text File)
Issue Links: related to SERVER-4220 refactor getChunkManager() and shardi... (Closed)
Operating System: ALL
Participants:

 Description   

I think this has to do with a migration triggered from the same mongos, so the version is auto-reloaded afterward.

Attached is output from a sample bigMapReduce.js run that fails: one of the connections has a stale version compared to the current version.



 Comments   
Comment by Greg Studer [ 18/Jan/12 ]

Fixed, will continue refactoring in a new ticket.

Comment by Greg Studer [ 29/Dec/11 ]

Keeping the lock for sharded m/r output, but the cursor method here should allow m/r to work with migrations on the input collection. Added a new test of m/r with continuous migrations, which can be tweaked in the future for sharded output as well.

Comment by auto [ 29/Dec/11 ]

Author: gregs (gregstuder) <greg@10gen.com>

Message: SERVER-4215 use local cursor instead of dist lock for protection against migrate in m/r
Branch: master
https://github.com/mongodb/mongo/commit/009cc229ea0af766e632db87d5d6b45f7a60a275

Comment by auto [ 29/Dec/11 ]

Author: gregs (gregstuder) <greg@10gen.com>

Message: SERVER-4215 check version and lock docs to server with cursor early in m/r
Branch: master
https://github.com/mongodb/mongo/commit/64b596993b18e5a22becf729549ec0386ce278a3

Comment by Greg Studer [ 23/Dec/11 ]

Was looking at this yesterday (and a bit more today). I think one issue isn't the number of retries per se, but that we don't check the ns version until after all the prep work (creating temporary collections) has been done. Fast migrations in the meantime can interfere, and with a large number of shards almost always will.

In this case, I think we may just need to check the ns version right away, and potentially open a cursor to make sure the data we're mapReducing doesn't get deleted. Long running ops in general are tricky.
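
A rough sketch of the ordering issue described above, as a toy Python model rather than actual mongod/mongos code. `ToyShard`, `migrate_away`, and both `*_map_reduce` functions are hypothetical names invented for illustration, and copying the documents only stands in for opening a cursor (real cursor semantics are more subtle).

```python
class StaleVersionError(Exception):
    """Raised when the namespace version no longer matches what the caller expects."""


class ToyShard:
    """Hypothetical stand-in for a shard; not a real MongoDB API."""

    def __init__(self, docs):
        self.ns_version = 1
        self.docs = list(docs)

    def migrate_away(self):
        # Toy migration: bump the namespace version and delete the moved documents.
        self.ns_version += 1
        self.docs.clear()


def late_check_map_reduce(shard, expected_version, during_prep):
    # Prep work happens first, so a fast migration can slip in before the check.
    during_prep()
    if shard.ns_version != expected_version:
        raise StaleVersionError("version changed during prep work")
    return sum(shard.docs)


def early_check_map_reduce(shard, expected_version, during_prep):
    # Check the version right away, before any prep work.
    if shard.ns_version != expected_version:
        raise StaleVersionError("stale before prep work")
    # Assumption: this copy plays the role of an open cursor pinning the input data.
    pinned = list(shard.docs)
    during_prep()
    return sum(pinned)
```

In this model, a migration that runs during the prep step makes the late check fail, while the early check plus pinned documents still finishes on the data it started with.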

Comment by auto [ 22/Dec/11 ]

Author: agirbal (agirbal) <antoine@10gen.com>

Message: SERVER-4215: cannot retry enough times to satisfy tests, use lock
Branch: master
https://github.com/mongodb/mongo/commit/4691085ea565db1b52be8ab63332da94347f2fea

Comment by auto [ 21/Dec/11 ]

Author: agirbal (agirbal) <antoine@10gen.com>

Message: SERVER-4215: previous commit broke output of MR
Branch: master
https://github.com/mongodb/mongo/commit/4fa93b277db5613bdd11019d04c1cd0da1c1d162

Comment by Antoine Girbal [ 17/Dec/11 ]

Seems like the current code does not retry enough.
bigMapReduce.js fails about 50% of the time:
{
    "errmsg" : "exception: could not run map command on all shards for ns test.foo and query {} :: caused by :: ns: test.foo too many retries of stale version info(send)",
    "code" : 13388,
    "ok" : 0
}
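
To illustrate the failure mode behind the error above, here is a minimal sketch of a bounded retry loop, assuming the command is simply retried after a version refresh. The names `run_with_retries`, `make_flaky`, and `max_attempts` are hypothetical; the actual retry limit and retry logic in mongos are not stated in this ticket.

```python
class StaleConfigError(Exception):
    """Toy stand-in for a stale version error returned by a shard."""


def run_with_retries(command, refresh, max_attempts):
    # Try the command up to max_attempts times, refreshing version info
    # after each stale-version failure.
    for _ in range(max_attempts):
        try:
            return command()
        except StaleConfigError:
            refresh()
    raise RuntimeError("too many retries of stale version info")


def make_flaky(stale_count):
    # Build a command that reports a stale version stale_count times,
    # as if a migration completed just before each attempt.
    remaining = [stale_count]

    def command():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise StaleConfigError("stale version")
        return "done"

    return command
```

With a fixed budget, the command succeeds only if migrations stop interfering before the attempts run out; if they keep landing between attempts, the loop gives up no matter how the budget is tuned.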

Comment by auto [ 14/Dec/11 ]

Author: Greg Studer (gregstuder) <greg@10gen.com>

Message: SERVER-4215 use pcursor for finalized reduce step
Branch: master
https://github.com/mongodb/mongo/commit/44ef35280f4d1112934e7f338007f95a0aeadae1

Comment by auto [ 14/Dec/11 ]

Author: Greg Studer (gregstuder) <greg@10gen.com>

Message: SERVER-4215 use consistent cursor for map phase
Branch: master
https://github.com/mongodb/mongo/commit/082a92b92097041174a5bd32a4dcd352b5ae3a00

Comment by Greg Studer [ 04/Nov/11 ]

Also, to clarify the attachment: versionStale doesn't mean that a stale config exception was triggered, just that the sequence number of the current version of the namespace/shard is different from the sequence number of the chunkManager of the correctly executed command (the actual version number used is not available after it changes).
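
The comparison described above can be sketched roughly as follows. This is a toy model only: `ToyChunkManager`, `sequence_number`, and `version_stale` are hypothetical names, assuming each reload of routing info produces a manager with a fresh sequence number. The real ChunkManager differs.

```python
import itertools

# Assumption: a process-wide counter hands out one sequence number per reload.
_sequence = itertools.count(1)


class ToyChunkManager:
    """Hypothetical routing-info object; not the real ChunkManager."""

    def __init__(self):
        self.sequence_number = next(_sequence)


def version_stale(manager_used_by_command, current_manager):
    # Flags only that the routing info was reloaded since the command ran,
    # not that a stale config exception was ever thrown.
    return manager_used_by_command.sequence_number != current_manager.sequence_number
```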

Comment by Greg Studer [ 04/Nov/11 ]

This may also apply to the general parallel cursor case.
