[SERVER-4215] multiple connections to shards can rarely get mutually inconsistent versions when migrate occurs Created: 04/Nov/11 Updated: 11/Jul/16 Resolved: 18/Jan/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 2.1.0 |
| Fix Version/s: | 2.1.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Greg Studer | Assignee: | Greg Studer |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL | ||||||||
| Description |
|
...I think this has to do with a migration triggered from the same mongos, so the version is auto-reloaded afterward. Sample: a bigMapReduce.js run that fails, where one of the connections has a stale version compared to the current version. |
| Comments |
| Comment by Greg Studer [ 18/Jan/12 ] |
|
Fixed, will continue refactoring in a new ticket. |
| Comment by Greg Studer [ 29/Dec/11 ] |
|
Keeping the lock for sharded m/r output, but the cursor method here should allow m/r to work with migrations on the input collection. Added a new test of m/r with continuous migrations; it can be tweaked in the future for sharded output as well. |
| Comment by auto [ 29/Dec/11 ] |
|
Author: gregs (login: gregstuder, email: greg@10gen.com) Message: |
| Comment by auto [ 29/Dec/11 ] |
|
Author: gregs (login: gregstuder, email: greg@10gen.com) Message: |
| Comment by Greg Studer [ 23/Dec/11 ] |
|
Was looking at this yesterday (and a bit more today). I think one issue isn't the number of retries per se, but that we don't check the ns version until after all the prep work (creating temporary collections) has been done. Fast migrations in the meantime can interfere, and with a large number of shards they almost always will. In this case, I think we may just need to check the ns version right away, and potentially open a cursor to make sure the data we're mapReducing doesn't get deleted. Long-running ops in general are tricky. |
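The ordering suggested in this comment (check the namespace version first, then pin the input with a cursor, and only then do the prep work) can be sketched roughly as below. This is a minimal in-memory illustration, not the server's implementation: `FakeShard`, `runLongOp`, and every field name are hypothetical stand-ins, and the "cursor" here is only a counter.

```javascript
// Illustrative only: an in-memory stand-in for one shard. All names here are
// hypothetical and do not correspond to MongoDB server code.
class FakeShard {
  constructor(version) {
    this.version = version; // current namespace version on the shard
    this.openCursors = 0;   // cursors pinning the input data
    this.prepCount = 0;     // how many times the expensive prep work ran
  }
  migrate() { this.version += 1; } // a migration bumps the version
  openCursor() {
    this.openCursors += 1;
    return { close: () => { this.openCursors -= 1; } };
  }
  prepare() { this.prepCount += 1; } // e.g. creating temporary collections
}

// Check the version before any prep work; reload and retry cheaply if stale.
function runLongOp(shard, cachedVersion, work, maxRetries = 3) {
  let version = cachedVersion;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    if (shard.version !== version) {
      version = shard.version; // stand-in for reloading the routing info
      continue;                // nothing was prepared, so retrying is cheap
    }
    const cursor = shard.openCursor(); // stand-in for pinning the input data
    try {
      shard.prepare();
      return work(version);
    } finally {
      cursor.close();
    }
  }
  throw new Error("version still stale after " + maxRetries + " retries");
}

const shard = new FakeShard(1);
shard.migrate(); // the caller's cached version (1) is now stale
const result = runLongOp(shard, 1, (v) => "ran at version " + v);
console.log(result, "| prep runs:", shard.prepCount);
```

Because the stale check happens before `prepare()`, a migration that lands early costs only a reload, not a discarded set of temporary collections.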
| Comment by auto [ 22/Dec/11 ] |
|
Author: agirbal (login: agirbal, email: antoine@10gen.com) Message: |
| Comment by auto [ 21/Dec/11 ] |
|
Author: agirbal (login: agirbal, email: antoine@10gen.com) Message: |
| Comment by Antoine Girbal [ 17/Dec/11 ] |
|
Seems like the current code does not retry enough. |
| Comment by auto [ 14/Dec/11 ] |
|
Author: Greg Studer (login: gregstuder, email: greg@10gen.com) Message: |
| Comment by auto [ 14/Dec/11 ] |
|
Author: Greg Studer (login: gregstuder, email: greg@10gen.com) Message: |
| Comment by Greg Studer [ 04/Nov/11 ] |
|
Also, to clarify the attachment: versionStale doesn't mean that a stale config exception was triggered, just that the sequence number of the current version of the namespace/shard differs from the sequence number of the chunkManager used by the correctly executed command (the actual version number used is not available after it changes). |
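To make the meaning of that flag concrete, here is a small sketch of a sequence-number comparison of the kind described. The object shape and the names `loadChunkManager` and `isVersionStale` are assumptions made for illustration, not the actual mongos data structures.

```javascript
// Illustrative only; these objects and field names are hypothetical.
let nextSequenceNumber = 0;

// Each (re)load of routing info yields a new snapshot with a fresh sequence number.
function loadChunkManager(ns) {
  nextSequenceNumber += 1;
  return { ns: ns, sequenceNumber: nextSequenceNumber };
}

// True when the snapshot a command ran with is no longer the current one.
// This only compares sequence numbers; it says nothing about whether a
// stale config exception was raised.
function isVersionStale(usedManager, currentManager) {
  return usedManager.sequenceNumber !== currentManager.sequenceNumber;
}

const used = loadChunkManager("test.foo");
const staleBefore = isVersionStale(used, used);
const reloaded = loadChunkManager("test.foo"); // e.g. after a migration
const staleAfter = isVersionStale(used, reloaded);
console.log(staleBefore, staleAfter);
```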
| Comment by Greg Studer [ 04/Nov/11 ] |
|
May also apply to the general parallel cursor case. |