[SERVER-9622] establishing cursor for query can fail with stale config after first batch built - testshard1.js failing on Windows Created: 08/May/13  Updated: 06/Dec/22  Resolved: 21/Dec/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ian Whalen (Inactive) Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Done Votes: 4
Labels: todo_in_code
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongos_stale_info.log, Text File testshard1.txt
Issue Links:
Depends
is depended on by SERVER-11299 still unresolved: assertion 13388 sha... Closed
Related
is related to SERVER-5752 query yielding with frequent migratio... Closed
Assigned Teams:
Sharding
Operating System: ALL
Participants:
Linked BF Score: 0

 Description   

This issue is related to rapid chunk migrations causing too many retries when setting the shard version despite successfully setting the shard version each time.

First seen failing at http://buildbot.mongodb.org:8081/builders/Windows%2064-bit%202008R2%2B/builds/7

It seems to be happening more frequently now (#22-#26).
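To illustrate the failure mode described above, here is a toy model of a bounded retry loop. It is only a sketch and is not the mongos implementation: the function name, the integer versions, and the `maxRetries` and `migrationsDuringOp` parameters are all assumptions made for illustration. Every attempt "successfully" sets the shard version, yet the operation still fails if a migration bumps the version again before each attempt completes.

```javascript
// Toy model only; names and semantics here are hypothetical.
function runWithStaleRetries(maxRetries, migrationsDuringOp) {
    let shardVersion = 1;   // version the shard currently expects
    let routerVersion = 1;  // version the router last set on the shard
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
        // Setting the shard version succeeds: the router catches up.
        routerVersion = shardVersion;
        // A migration commits while the operation is being established.
        if (attempt < migrationsDuringOp) {
            shardVersion++;
        }
        if (routerVersion === shardVersion) {
            return { ok: true, attempts: attempt + 1 };
        }
        // Otherwise the shard reports stale config and the router retries.
    }
    return { ok: false, error: "too many retries of stale version info" };
}

console.log(runWithStaleRetries(3, 10).ok);   // false: migrations outpace 3 retries
console.log(runWithStaleRetries(10, 10).ok);  // true: a larger retry budget rides it out
```

In this model a larger retry budget only helps once migrations stop arriving faster than attempts complete.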



 Comments   
Comment by Gregory McKeon (Inactive) [ 21/Dec/18 ]

We raised the number of retries from 3 to more than 10 in 3.6.

Comment by Greg Studer [ 13/Feb/14 ]

This is not a regression and does not affect safety (only query retries). Writes now have a more advanced notion of "progress".

Comment by Greg Studer [ 13/Aug/13 ]

Hmm... this is definitely a different issue - you're not moving any chunks (and the version isn't changing) when you're getting these errors. This issue is related to very rapid chunk migrations (though that wasn't at all obvious from the description). I'd recommend you open a SUPPORT ticket with the mongos log here as well as a mongod log (from shard "db2", or if you need to repro again a log from the shard primary you're versioning against) at logLevel 2.

In particular, it seems like you're repeatedly setting the same version, but the version on mongod is somehow not being set correctly. Something is definitely wrong, but it's not related to establishing a cursor because of rapid migrations, so we should open a different ticket.

Comment by Peter Maedel [ 09/Aug/13 ]

greg_10gen The affected collection is queried solely by the _id index, and each query returns only a single document. Each document can be quite large (up to 800k), though, so problems due to batch size might apply. I have attached the relevant log from mongos.

Comment by Greg Studer [ 08/Aug/13 ]

pmaedel Do you have a sample error and the exact operation which caused this issue? Logs from the mongos and mongod at level 2 are really the only way to confirm - the error message "too many retries of stale version info" can be caused by a number of problems, not necessarily always related to this issue.

If this is the problem, workarounds are to turn off balancing during high-load periods using the balancer window, ensure slow queries are indexed, and/or reduce the query batch size so that smaller batches are returned.
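As a rough sketch of the balancer-window and batch-size workarounds, the snippet below builds the documents for an `activeWindow` update on the `config.settings` collection. The helper name, the window times, and the database, collection, and batch-size values in the comments are placeholders, not recommendations. The shell calls are shown as comments because they need a live mongos.

```javascript
// Build the filter and update documents for a balancer activeWindow.
// Times are "HH:MM" strings; the values used below are examples only.
function balancerWindowUpdate(start, stop) {
    const hhmm = /^([01]?\d|2[0-3]):[0-5]\d$/;
    if (!hhmm.test(start) || !hhmm.test(stop)) {
        throw new Error("start and stop must be HH:MM");
    }
    return {
        filter: { _id: "balancer" },
        update: { $set: { activeWindow: { start: start, stop: stop } } }
    };
}

const balancerWindow = balancerWindowUpdate("23:00", "6:00");

// In a mongo shell connected to mongos (not executed here):
//   var config = db.getSiblingDB("config");
//   config.settings.update(balancerWindow.filter, balancerWindow.update, { upsert: true });
//
// Smaller batches for a query ("mydb", "mycoll" and 10 are placeholders):
//   db.getSiblingDB("mydb").mycoll.find({ _id: someId }).batchSize(10);
```

Restricting the window does not stop migrations altogether; it only limits when the balancer may start them.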

Comment by Peter Maedel [ 06/Aug/13 ]

This bug occurs frequently on our production systems. Is there any known workaround?

Comment by Greg Studer [ 21/May/13 ]

Failing because of query.cpp:715 - we yield when constructing the first batch of a cursor and throw a stale config. Pretty sure this is always wrong - if we have a cursor associated with a particular version, we never should throw stale config.

Comment by auto [ 15/May/13 ]

Author: Greg Studer <greg@10gen.com>, 2013-05-15T19:50:57Z

Message: SERVER-9622 buildbot testshard1.js turn on exception tracing in mongod

temporary, will be removed in final fix for SERVER-9622
Branch: master
https://github.com/mongodb/mongo/commit/9df07aa58f54646c0792734d72f0702b41b16ceb

Comment by Greg Studer [ 15/May/13 ]

Looks like late stale config exceptions in aggregation are causing problems somehow. Enabled enhanced tracing to see exactly where they're being thrown from.

Comment by Greg Studer [ 13/May/13 ]

Looks like this resolved the issue, but feel free to reopen if this starts failing again.

Comment by Greg Studer [ 10/May/13 ]

Changed the test to ensure there isn't any config-server-as-shard weirdness which would allow migrations to interfere with the loading of the migration metadata. Also raised the log level, in case this happens again, to capture how the retries are happening.

Comment by auto [ 10/May/13 ]

Author: Greg Studer <greg@10gen.com>, 2013-05-10T15:25:13Z

Message: SERVER-9622 buildbot testshard1.js make shard not config server too, loglevel 2
Branch: master
https://github.com/mongodb/mongo/commit/f10527979b27da40fe23a4f2a5a75243b588d808

Comment by Mathias Stearn [ 08/May/13 ]

Looks like a sharding issue to me. Feel free to reassign to me if it looks like agg is to blame. There haven't been any real changes to agg since 2.3.2.

Failure message:

 m30999| Thu Apr 25 03:27:36.634 [conn1] AssertionException while processing op type : 2004 to : aggShard.ts1 :: caused by :: 13388 too many retries of stale version info ( ns : aggShard.ts1, received : 9|1||5178dabbc4a798787fa6cbb2, wanted : 10|0||5178dabbc4a798787fa6cbb2, send )
	Thu Apr 25 03:27:36.634 JavaScript execution failed: error: {
		"$err" : "too many retries of stale version info ( ns : aggShard.ts1, received : 9|1||5178dabbc4a798787fa6cbb2, wanted : 10|0||5178dabbc4a798787fa6cbb2, send )",
		"code" : 13388
	} at src/mongo/shell/query.js:L131
	failed to load: D:\slave\Windows_64bit_2008R2+\mongo\jstests\aggregation\testshard1.js
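
To read the version strings in the failure message, the sketch below assumes they follow a `major|minor||epoch` layout, which is how they appear in this log. The function names are hypothetical. With the values above, the two versions share an epoch and mongos is one major version behind what the shard wants.

```javascript
// Parse a version string of the assumed form "major|minor||epoch".
function parseChunkVersion(s) {
    const m = /^(\d+)\|(\d+)\|\|([0-9a-f]{24})$/.exec(s);
    if (m === null) {
        throw new Error("unrecognized version string: " + s);
    }
    return { major: Number(m[1]), minor: Number(m[2]), epoch: m[3] };
}

// Compare the version mongos sent with the one the shard wanted.
function describeMismatch(received, wanted) {
    const r = parseChunkVersion(received);
    const w = parseChunkVersion(wanted);
    return {
        sameEpoch: r.epoch === w.epoch,
        majorBehind: w.major - r.major
    };
}

const mismatch = describeMismatch(
    "9|1||5178dabbc4a798787fa6cbb2",
    "10|0||5178dabbc4a798787fa6cbb2"
);
console.log(mismatch); // { sameEpoch: true, majorBehind: 1 }
```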
