[SERVER-84384] Resharding test infrastructure must be resilient to intermittent errors. Created: 21/Dec/23  Updated: 07/Feb/24

Status: In Code Review
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Nandini Bhartiya
Assignee: Aitor Esteve Alvarado
Resolution: Unresolved
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Problem/Incident
Assigned Teams:
Catalog and Routing
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: CAR Team 2024-02-05, CAR Team 2024-02-19
Participants:
Linked BF Score: 161
Story Points: 2

 Description   

As seen in https://jira.mongodb.org/browse/BF-31177, resharding hit an intermittent error but was able to restart and move towards completion. However, the test interpreted the mongos error response as a resharding failure and proceeded to run the metadata consistency checks and end the test even though resharding was not yet complete. This test (and maybe the resharding test infrastructure) must be modified and made resilient to retryable errors.
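
One way to picture the resilience asked for here is a helper that retries only on errors classified as retryable and lets every other error surface as a genuine failure. The sketch below is illustrative only and is not the change made for this ticket: `RetryableError`, `run_until_complete`, and `make_flaky_operation` are hypothetical names, and the flaky operation merely simulates a resharding command that hits intermittent errors before it succeeds.

```python
import time


class RetryableError(Exception):
    """Hypothetical stand-in for an intermittent error (e.g. a network error)."""


def run_until_complete(operation, max_attempts=5, delay_secs=0.0):
    """Call `operation` until it returns, retrying only on RetryableError.

    Any other exception propagates immediately and is treated as a real failure.
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation()
        except RetryableError as err:
            last_error = err
            time.sleep(delay_secs)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


def make_flaky_operation(failures):
    """Build a simulated operation that raises RetryableError `failures` times."""
    calls = {"n": 0}

    def operation():
        calls["n"] += 1
        if calls["n"] <= failures:
            raise RetryableError("simulated intermittent error")
        return "done"

    return operation, calls
```

With this shape, the test body only proceeds to its consistency checks once `run_until_complete` has returned, so a transient error response is no longer mistaken for a failed resharding operation. Whether a real resharding command can safely be reissued this way is an assumption of the sketch.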



 Comments   
Comment by Githook User [ 07/Feb/24 ]

Author:

{'name': 'atesteve', 'email': 'aitor.esteve@mongodb.com', 'username': 'atesteve'}

Message: SERVER-84384 Resharding test infrastructure must be resilient to intermittent errors (#18767)

GitOrigin-RevId: 8e4947825046a1a734c822c264b3d5a2134a03a4
Branch: master
https://github.com/mongodb/mongo/commit/dcb5f4d22f6e9717bf8a493e75e349cbb09153e2

Comment by Githook User [ 29/Jan/24 ]

Author:

{'name': 'Aitor Esteve Alvarado', 'email': 'aitor.esteve@mongodb.com', 'username': 'atesteve'}

Message: Revert "SERVER-84384 Ignore system.resharding.* collections in checkHistoricalPlacementMetadataConsistency (#18410)"

This reverts commit 16eccaef8a1066146ff72e49cf934afd1d725a7c.

GitOrigin-RevId: f0478c6df5266423f5991db784d64a599eb748a0
Branch: master
https://github.com/mongodb/mongo/commit/4b9445d06f0e850c56a3489cfe9600f5a63cd4ac

Comment by Githook User [ 29/Jan/24 ]

Author:

{'name': 'atesteve', 'email': 'aitor.esteve@mongodb.com', 'username': 'atesteve'}

Message: SERVER-84384 Ignore system.resharding.* collections in checkHistoricalPlacementMetadataConsistency (#18410)

GitOrigin-RevId: 16eccaef8a1066146ff72e49cf934afd1d725a7c
Branch: master
https://github.com/mongodb/mongo/commit/5202fce591b77bf54b3fee356c1ce1f21b3694e4

Comment by Max Hirschhorn [ 27/Dec/23 ]

Antithesis is designed to ignore the errors from individual JavaScript tests because our tests were not authored to handle intermittent errors (e.g. network errors). Retrying within tests is not practical as a general solution because some operations can still cause individual assertion statements to throw an exception (e.g. the total number of documents updated not matching). Instead, the errors which Antithesis propagates relate to properties which must always hold true, such as the server not crashing and our data consistency checks.

To address the CheckRoutingTableConsistency hook failure in BF-31177, either (a) the RoutingTableConsistencyChecker hook must wait for the resharding operation to complete, or (b) the hook must ignore inconsistencies related to the system.resharding collection and config.placementHistory when running in Antithesis. Data consistency checks are generally expected to wait for the system to have quiesced. (For historical context, a special procedure involving a no-op collMod was used to drain any index builds still running as part of running the dbhash check.) A test failure is not expected to also lead to a hook failure. CC paolo.polato@mongodb.com
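
The two options can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the hook's actual implementation: `count_in_progress` is a hypothetical callable reporting how many resharding operations are still running, the `"ns"` key on an inconsistency record is a hypothetical shape, and the `<db>.system.resharding.<suffix>` namespace pattern is inferred from the `system.resharding.*` wording in the commit message above.

```python
import time


def wait_until_quiesced(count_in_progress, timeout_secs=10.0, poll_secs=0.1):
    """Option (a): poll until no resharding operation is reported in progress.

    Returns True once the count reaches zero, or False if the timeout expires.
    """
    deadline = time.monotonic() + timeout_secs
    while True:
        if count_in_progress() == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_secs)


def is_resharding_temp_namespace(namespace):
    """Return True for namespaces of the form <db>.system.resharding.<suffix>."""
    _, _, collection = namespace.partition(".")
    return collection.startswith("system.resharding.")


def filter_inconsistencies(inconsistencies):
    """Option (b): drop inconsistencies that refer to resharding temp collections."""
    return [
        item for item in inconsistencies
        if not is_resharding_temp_namespace(item["ns"])
    ]
```

Option (a) matches the general expectation that consistency checks run only after the system has quiesced. Option (b) narrows what the check reports instead, so a real inconsistency on those namespaces would go unnoticed.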

Generated at Thu Feb 08 06:54:51 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.