[SERVER-53094] Tests which use {waitForDelete:true} on moveChunk are not safe to run in the sharding_csrs_continuous_config_stepdown suite Created: 27/Nov/20  Updated: 05/Jan/24  Resolved: 05/Jan/24

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kaloian Manassiev Assignee: Allison Easton
Resolution: Gone away Votes: 0
Labels: oldshardingemea, sharding-csrs-stepdown-upkeep, shardingemea-qw
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
is related to SERVER-46669 moveChunk may succeed but not respect... Closed
is related to SERVER-59891 Replace the coverage from sharding_co... Backlog
Assigned Teams:
Catalog and Routing
Operating System: ALL
Sprint: Sharding 2021-07-12, Sharding 2021-10-04, Sharding 2021-10-18, Sharding 2021-11-01, CAR Team 2023-12-25, CAR Team 2024-01-08
Participants:
Linked BF Score: 0
Story Points: 2

 Description   

As an example, all of the update_shard_key_*.js tests perform some kind of moveChunk with waitForDelete:true and rely on there being no orphans for their testing expectations. However, waitForDelete:true doesn't work well with stepdowns of the Config Server, because there are numerous places where we can't honor it (this one for example).

Because of this, it is not safe to run tests with waitForDelete:true in the sharding_csrs_continuous_config_stepdown suite and they should be blacklisted.

In more detail, this is the minimum that can go wrong:

  • A test relies on range deletion succeeding so that no orphans are left on the cluster.
  • A moveChunk command commits successfully on the Config Server, BUT returns an error to the Balancer, because the Config Server was down and it couldn't re-check its work (this is okay). However, this means it didn't wait for the range deletion.
  • This code notices that the chunk actually committed, so it doesn't pass the error on to the Router/Client.
  • The test happily continues, even though there are orphans on the donor shard.
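
The sequence above can be sketched as a small toy model. This is not MongoDB source code: simulateMoveChunk, its option name, the error message and the document counts are all hypothetical, chosen only to show why a client can see ok:1 while orphans remain on the donor.

```javascript
// Toy model of the failure sequence; every name here is hypothetical.
function simulateMoveChunk({ configStepsDownAfterCommit }) {
  const docsInChunk = 3;
  const donor = { ownedDocs: docsInChunk, orphanDocs: 0 };

  // Step 1: the migration commits on the Config Server. The donor no longer
  // owns the range, so the documents it still holds are now orphans.
  const committed = true;
  donor.ownedDocs = 0;
  donor.orphanDocs = docsInChunk;

  // Step 2: with waitForDelete:true the donor should wait for range deletion,
  // but a Config Server stepdown interrupts it before the wait happens.
  let donorResult;
  if (configStepsDownAfterCommit) {
    donorResult = { ok: 0, errmsg: "config server unavailable (simulated)" };
  } else {
    donor.orphanDocs = 0; // range deletion completed before returning
    donorResult = { ok: 1 };
  }

  // Step 3: the caller notices the chunk actually committed and swallows the
  // error, so the client cannot tell that the cleanup was skipped.
  const clientResult =
    donorResult.ok === 1 || committed ? { ok: 1 } : donorResult;

  return { clientResult, orphansOnDonor: donor.orphanDocs };
}

// Usage: compare a run without a stepdown to one with a stepdown.
const quiet = simulateMoveChunk({ configStepsDownAfterCommit: false });
const stepdown = simulateMoveChunk({ configStepsDownAfterCommit: true });
console.log(quiet, stepdown);
```

In this model both runs report success to the client, and only the stepdown run leaves orphans behind, which is the situation a test relying on waitForDelete:true cannot detect.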


 Comments   
Comment by Lamont Nelson [ 24/Aug/22 ]

kaloian.manassiev@mongodb.com This was routed from the team triage meeting due to having SERVER-59891 on the emea backlog. I was just the messenger in this case.

I agree regarding tech debt.

Comment by Max Hirschhorn [ 11/Sep/21 ]

Hoping to not do this ticket and to do SERVER-59891 instead.

Comment by Max Hirschhorn [ 17/Jun/21 ]

However waitForDelete:true doesn't work well with stepdowns of the Config Server, because there are numerous places where we can't obey (this one for example).

kaloian.manassiev, while working on SERVER-42192 during the resumable range deleter project, I also ran into waitForDelete:true not being upheld when the chunk has migrated successfully but the wait for cleanup gets interrupted, and I filed SERVER-46669 about it. The code you linked to rewrites the error, so the client doesn't even know that the cleanup was interrupted.

Is there a reason you'd prefer to disable / change the jstests/sharding/ tests that use waitForDelete:true rather than addressing the error and retry behavior of the moveChunk command? Was it to make the build greener while we sit on making a more involved change to address the server behavior?
