[SERVER-9622] establishing cursor for query can fail with stale config after first batch built - testshard1.js failing on Windows Created: 08/May/13 Updated: 06/Dec/22 Resolved: 21/Dec/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Ian Whalen (Inactive) | Assignee: | [DO NOT USE] Backlog - Sharding Team |
| Resolution: | Done | Votes: | 4 |
| Labels: | todo_in_code | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: |
Sharding
|
| Operating System: | ALL | ||||||||||||||||
| Linked BF Score: | 0 | ||||||||||||||||
| Description |
|
This issue is related to rapid chunk migrations causing too many retries when setting the shard version, even though the shard version is set successfully each time. First seen failing at http://buildbot.mongodb.org:8081/builders/Windows%2064-bit%202008R2%2B/builds/7; it seems to be happening more frequently now (#22-#26). |
| Comments |
| Comment by Gregory McKeon (Inactive) [ 21/Dec/18 ] | ||||||
|
We upped the number of retries from 3 to > 10 in 3.6 | ||||||
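To illustrate why a larger retry budget helps with this failure mode, here is a toy sketch. It is not the mongos code: `query_with_retries`, `pending_migrations`, and the budget of 11 standing in for "> 10" are all made up for illustration. Every attempt sets the shard version successfully, but a chunk migration commits before the first batch is built, so the attempt still ends as stale.

```python
class StaleConfigError(Exception):
    """Toy stand-in for the 'too many retries of stale version info' failure."""


def query_with_retries(pending_migrations: int, max_retries: int) -> int:
    """Return how many attempts the query needed, or raise StaleConfigError.

    Hypothetical model: the shard starts at version 1 and each pending
    migration bumps the version once, always in the middle of an attempt.
    """
    shard_version = 1
    for attempt in range(1, max_retries + 1):
        # Setting the shard version succeeds on every attempt.
        client_version = shard_version
        if pending_migrations > 0:
            # A migration commits while the first batch is being built.
            pending_migrations -= 1
            shard_version += 1
        if client_version == shard_version:
            return attempt  # versions still agree, the cursor is established
    raise StaleConfigError("too many retries of stale version info")


if __name__ == "__main__":
    print(query_with_retries(pending_migrations=5, max_retries=11))
    try:
        query_with_retries(pending_migrations=5, max_retries=3)
    except StaleConfigError as exc:
        print("failed:", exc)
```

In this model five back-to-back migrations exhaust a budget of 3, while the larger budget outlasts them. A bigger budget only raises the threshold; migrations that arrive faster than the budget would still fail the query.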
| Comment by Greg Studer [ 13/Feb/14 ] | ||||||
|
This is non-regression and does not affect safety (just query retry). Writes now have a more advanced notion of "progress". | ||||||
| Comment by Greg Studer [ 13/Aug/13 ] | ||||||
|
Hmm... this is definitely a different issue - you're not moving any chunks (and the version isn't changing) when you're getting these errors. This issue is related to very rapid chunk migrations (though that wasn't at all obvious from the description). I'd recommend you open a SUPPORT ticket with the mongos log here as well as a mongod log at logLevel 2 (from shard "db2", or, if you need to repro again, from the shard primary you're versioning against). In particular, it seems like you're repeatedly setting the same version, but the version on mongod is somehow not getting set correctly - something is definitely wrong, but it's not related to establishing a cursor because of rapid migrations, so we should open a different ticket. | ||||||
| Comment by Peter Maedel [ 09/Aug/13 ] | ||||||
|
greg_10gen the affected collection is queried solely by the _id index, and each query returns only a single document, though each document can be quite large (up to 800k), so problems due to batch size might apply. I have attached the relevant log from mongos. | ||||||
| Comment by Greg Studer [ 08/Aug/13 ] | ||||||
|
pmaedel Do you have a sample error and the exact operation which caused this issue? Logs from the mongos and mongod at level 2 are really the only way to confirm - the error message "too many retries of stale version info" can be caused by a number of problems, not necessarily always related to this issue. If this is the problem, workarounds should be to turn off balancing during high-load periods using the balancer window, ensure indexes on slow queries, and/or change the query batchsize to return smaller batches. | ||||||
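For the balancer-window workaround mentioned above, a minimal sketch of the settings document involved. It assumes the documented `activeWindow` field on the `balancer` document in `config.settings`. `balancer_window_update` is a hypothetical helper, and the pymongo call shown in the trailing comment is untested here and needs a live mongos.

```python
def balancer_window_update(start: str, stop: str):
    """Build (filter, update) documents that restrict balancing to a window.

    Times are "HH:MM" strings; this helper only validates and builds the
    documents, it does not talk to a cluster.
    """
    for label, value in (("start", start), ("stop", stop)):
        hours, sep, minutes = value.partition(":")
        valid = (
            sep == ":"
            and hours.isdigit()
            and minutes.isdigit()
            and 0 <= int(hours) <= 23
            and 0 <= int(minutes) <= 59
        )
        if not valid:
            raise ValueError(f"{label} must be HH:MM, got {value!r}")
    flt = {"_id": "balancer"}
    upd = {"$set": {"activeWindow": {"start": start, "stop": stop}}}
    return flt, upd


if __name__ == "__main__":
    flt, upd = balancer_window_update("23:00", "06:00")
    print(flt, upd)
    # Assumed usage against a mongos with pymongo (not run here):
    #   client.config.settings.update_one(flt, upd, upsert=True)
```

The batch-size workaround is a per-query setting instead, for example `cursor.batch_size(n)` on a pymongo cursor. Which value helps depends on the document sizes involved.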
| Comment by Peter Maedel [ 06/Aug/13 ] | ||||||
|
This bug occurs frequently on our production systems. Is there any known workaround? | ||||||
| Comment by Greg Studer [ 21/May/13 ] | ||||||
|
Failing because of query.cpp:715 - we yield when constructing the first batch of a cursor and throw a stale config. Pretty sure this is always wrong - if we have a cursor associated with a particular version, we never should throw stale config. | ||||||
| Comment by auto [ 15/May/13 ] | ||||||
|
Author: Greg Studer <greg@10gen.com>, 2013-05-15T19:50:57Z. Message: temporary, will be removed in final fix for | ||||||
| Comment by Greg Studer [ 15/May/13 ] | ||||||
|
Looks like late stale config exceptions in aggregation are causing problems somehow; enabled enhanced tracing to see exactly where they're being thrown from. | ||||||
| Comment by Greg Studer [ 13/May/13 ] | ||||||
|
Looks like this resolved the issue, but feel free to reopen if this starts failing again. | ||||||
| Comment by Greg Studer [ 10/May/13 ] | ||||||
|
Changed test to ensure there isn't any config-server-as-shard weirdness which would allow migrations to interfere with the loading of the migration metadata. Also raised the log level, so that if this happens again we capture how the retries are happening. | ||||||
| Comment by auto [ 10/May/13 ] | ||||||
|
Author: Greg Studer <greg@10gen.com>, 2013-05-10T15:25:13Z. Message: | ||||||
| Comment by Mathias Stearn [ 08/May/13 ] | ||||||
|
Looks like a sharding issue to me. Feel free to reassign to me if it looks like agg is to blame. There haven't been any real changes to agg since 2.3.2. Failure message:
|