[SERVER-32692] Make zbigMapReduce.js, sharding_balance4.js, and bulk_shard_insert.js more resilient under slow machines Created: 12/Jan/18  Updated: 30/Oct/23  Resolved: 23/Sep/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.3.1

Type: Bug Priority: Major - P3
Reporter: Jack Mulrow Assignee: Matthew Saltz (Inactive)
Resolution: Fixed Votes: 1
Labels: gm-ack
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
depends on SERVER-42914 Implement random chunk selection poli... Closed
Duplicate
is duplicated by SERVER-32694 Retry find on StaleShardVersion in sh... Closed
Related
related to SERVER-53670 Make zbigMapReduce.js more resilient ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.0, v3.6
Sprint: Sharding 2019-09-09, Sharding 2019-09-23, Sharding 2019-10-07
Participants:
Linked BF Score: 37

 Description   

zbigMapReduce.js fails occasionally because more than 5 migrations manage to finish after the start of either of the two bulk writes it executes, so the write never establishes a shard version and the test fails. As was done for sharding_balance4.js in SERVER-28697, we should ignore a certain number of NoProgressMade errors so the test fails less frequently.
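A minimal sketch of what that tolerance might look like in the test (illustrative only, not the actual patch; testDB, docs, and kMaxNoProgressErrors are assumed names):

    // Illustrative sketch: retry the bulk insert and tolerate a bounded number
    // of NoProgressMade write errors before failing the test.
    const kMaxNoProgressErrors = 2;  // assumed bound, for illustration
    let noProgressCount = 0;
    let done = false;
    while (!done) {
        const res = testDB.runCommand({insert: "coll", documents: docs, ordered: false});
        assert.commandWorked(res);
        const writeErrors = res.writeErrors || [];
        if (writeErrors.length === 0) {
            done = true;
        } else {
            // Only tolerate NoProgressMade; any other write error still fails the test.
            writeErrors.forEach(err => assert.eq(ErrorCodes.NoProgressMade, err.code, tojson(err)));
            noProgressCount++;
            assert.lte(noProgressCount, kMaxNoProgressErrors, "too many NoProgressMade errors");
            // A real version would resend only the documents that failed.
        }
    }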

sharding_balance4.js and bulk_shard_insert.js occasionally fail because more than 10 migrations complete during the course of a find command, exhausting mongos's retry attempts and failing the test. Modifying the tests to retry a couple of times on StaleShardVersion should make them fail less often.

We can also consider writing a generic override for read commands that retries on StaleShardVersion errors, so it can be load()-ed into tests that involve frequent migrations.
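A minimal sketch of the kind of retry the previous two paragraphs describe (illustrative only; the helper name, the retry budget, and the assumption that the shell surfaces the server error code as e.code are not from the actual patch):

    // Illustrative sketch: retry a find a few extra times when mongos gives up
    // with StaleShardVersion because too many migrations committed mid-command.
    function findCountWithStaleRetries(coll, query, kMaxStaleRetries) {
        for (let attempt = 0;; attempt++) {
            try {
                return coll.find(query).itcount();
            } catch (e) {
                // Rethrow anything that is not a stale-routing failure, and give
                // up once the retry budget is spent.
                if (e.code !== ErrorCodes.StaleShardVersion || attempt >= kMaxStaleRetries) {
                    throw e;
                }
                jsTest.log("Retrying find after StaleShardVersion, attempt " + (attempt + 1));
            }
        }
    }

A test would call something like findCountWithStaleRetries(coll, {}, 3); a load()-able override would presumably install a similar loop around Mongo.prototype.runCommand instead of using a per-test helper.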



 Comments   
Comment by Githook User [ 18/Nov/19 ]

Author:

{'username': 'saltzm', 'email': 'matthew.saltz@mongodb.com', 'name': 'Matthew Saltz'}

Message: SERVER-32692 Make zbigMapReduce.js, sharding_balance4.js, and bulk_shard_insert.js more resilient under slow machines

(cherry picked from commit 1e0f4f8136e640d90093476695bb07b851da2da9)
(cherry picked from commit a17bb0d5dcf4294954c4b2468216335a5e9b9023)
Branch: v4.0
https://github.com/mongodb/mongo/commit/b7f72c287e1ab9a92b784697a80211e3f365cd08

Comment by Githook User [ 04/Nov/19 ]

Author:

{'username': 'saltzm', 'email': 'matthew.saltz@mongodb.com', 'name': 'Matthew Saltz'}

Message: SERVER-32692 Make zbigMapReduce.js, sharding_balance4.js, and bulk_shard_insert.js more resilient under slow machines

(cherry picked from commit 1e0f4f8136e640d90093476695bb07b851da2da9)
Branch: v3.6
https://github.com/mongodb/mongo/commit/a17bb0d5dcf4294954c4b2468216335a5e9b9023

Comment by Githook User [ 23/Sep/19 ]

Author:

{'name': 'Matthew Saltz', 'username': 'saltzm', 'email': 'matthew.saltz@mongodb.com'}

Message: SERVER-32692 Make zbigMapReduce.js, sharding_balance4.js, and bulk_shard_insert.js more resilient under slow machines
Branch: master
https://github.com/mongodb/mongo/commit/1e0f4f8136e640d90093476695bb07b851da2da9

Comment by Jack Mulrow [ 01/Aug/19 ]

Yeah I think throttling the balancer for these tests would help.

Comment by Matthew Saltz (Inactive) [ 01/Aug/19 ]

Being able to throttle the balancer actually seems like a useful feature in general - some parameter that lets you specify max migrations per second or a parameter that says how long to sleep in between rounds. Should be easy to implement and backportable too. Let me know if I should create a ticket for that.

Not sure if there's a good way to do it client side?

jack.mulrow does that seem like it'd help?
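For reference, a purely illustrative sketch of the "client side" idea above, pausing the balancer between bursts of test work with the mongo shell's sh.startBalancer()/sh.stopBalancer() helpers (the loop shape, doSomeWork(), and the pause length are assumptions, not anything proposed on this ticket):

    // Illustrative only: alternate bursts of test work with short balancer pauses.
    const kPauseBetweenBurstsMs = 2000;  // assumed pause length
    for (let burst = 0; burst < 10; burst++) {
        sh.startBalancer();   // let migrations run while the test does work
        doSomeWork(burst);    // hypothetical slice of the test's inserts/finds
        sh.stopBalancer();    // waits for any in-flight balancing round to finish
        sleep(kPauseBetweenBurstsMs);
    }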

Comment by Kaloian Manassiev [ 01/Aug/19 ]

By "lowering constants" do you mean e.g. inserting less data?

Yes, this is what I meant. However, what Randolph proposes above also seems legit.

Comment by Randolph Tan [ 01/Aug/19 ]

If there is a way to throttle the balancer to take a small pause after each migration, then I think that would help too.

Comment by Matthew Saltz (Inactive) [ 01/Aug/19 ]

By "lowering constants" do you mean e.g. inserting less data?

Comment by Matthew Saltz (Inactive) [ 01/Aug/19 ]

If kMaxNumStaleVersionRetries were a server parameter, I'd say we should increase that value in the test, but since it's not, I was thinking we'd just have the tests themselves retry a few times on the StaleShardVersion error code, as I believe Jack was suggesting. Bumping kMaxNumStaleVersionRetries would be easier since it's all in one place; I didn't know whether that would be acceptable or not.

Comment by Kaloian Manassiev [ 01/Aug/19 ]

matthew.saltz, are you suggesting bumping kMaxNumStaleVersionRetries? I have no recollection of how this value was reached (renctan, do you?), but I don't think it is out of the question to double it, as long as we also obey the MaxTimeMS.

Alternatively, can we just lower some constants in the test so it is more lightweight?

Comment by Kaloian Manassiev [ 18/Jan/18 ]

jack.mulrow, I am not sure that retries are the right solution for these tests, because I think it defeats their purpose, which is to make sure no anomalies happen under some form of stress. How useful these tests are is a different question.

max.hirschhorn, can we just blacklist these two tests in the DEBUG suites so we clear some red?
