[SERVER-36751] Prevent concurrent dropDatabase commands in the concurrency_simultaneous_replication suite Created: 17/Aug/18  Updated: 29/Oct/23  Resolved: 26/Dec/18

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 4.1.7

Type: Task Priority: Major - P3
Reporter: Robert Guo (Inactive) Assignee: Max Hirschhorn
Resolution: Fixed Votes: 0
Labels: tig-bfday-eligible, tig-concurrency
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Backport Requested: v4.0, v3.6
Sprint: STM 2018-12-31
Participants:
Linked BF Score: 0
Story Points: 1

 Description   

Problem
In the concurrency_simultaneous_replication test suite, we run 10 operations in parallel on the same database, and there's a small chance (e.g. 5% for some workloads) that any given operation is a dropDatabase. On slower build variants, a single dropDatabase command can take multiple minutes to finish when other workloads running in parallel generate heavy activity.

Our tests will retry an operation for up to 10 minutes if DatabaseDropPending errors are encountered. After seeing the error, a getLastError command is used to wait for the dropDatabase command to be committed.

There is variability in the order in which getLastError returns to different workload clients, which may cause certain workload clients to always be stuck behind other clients that are issuing more dropDatabase commands. When this happens, the client receives another DatabaseDropPending error. But the client is unable to distinguish whether the error is caused by the same dropDatabase command or a new one, so the new wait continues to eat into the 10 minute timeout. There is a small probability that this cycle happens a handful of times in a row, which, when combined with slow multi-minute dropDatabase commands, will exceed the 10 minute timeout.
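
The failure mode can be modeled with a small simulation. This is only a sketch under stated assumptions, not the actual test code: simulateRetries, the minute-based fake clock, and the per-drop durations are hypothetical, and only the 10 minute budget and the DatabaseDropPending wait come from the description.

```javascript
// Hypothetical model of the retry loop: one fixed deadline is shared by
// every wait, so back-to-back drops keep consuming the same budget.
const TIMEOUT_MINUTES = 10;

// dropDurations[i] is how long (in minutes) the i-th consecutive
// dropDatabase keeps the database in the drop-pending state.
function simulateRetries(dropDurations) {
    let elapsed = 0;
    for (const duration of dropDurations) {
        // The operation saw DatabaseDropPending and waits for this drop to
        // commit. It cannot tell whether this is the same drop as before or
        // a new one, so the deadline is never reset.
        elapsed += duration;
        if (elapsed > TIMEOUT_MINUTES) {
            return {ok: false, elapsedMinutes: elapsed};
        }
    }
    // No further drop is pending, so the retried operation succeeds.
    return {ok: true, elapsedMinutes: elapsed};
}

console.log(simulateRetries([3]));           // one slow drop fits in the budget
console.log(simulateRetries([3, 3, 3, 3]));  // four in a row exceed it
```

In this model no single wait is unusually long. The timeout is only exceeded because several multi-minute waits are charged against the same deadline.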

Solution
The solution is to avoid retrying a dropDatabase command when it returns a DatabaseDropPending error. The workload then transitions to a new state, and continues to do so until the new state is no longer a dropDatabase call. At that point it waits on the ongoing dropDatabase call.
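
A minimal sketch of that policy follows. It is not the code from the override file: runWithDropPendingPolicy, runCommand, and awaitDropCommitted are hypothetical names, the 10 minute timeout is omitted for brevity, and returning {ok: 1} for a skipped drop is an assumption about how the no-op could be reported.

```javascript
// Hypothetical sketch of the retry policy described above; the function and
// parameter names are illustrative and not taken from the real override file.
function isDropPending(res) {
    return res.ok === 0 && res.codeName === "DatabaseDropPending";
}

// runCommand(cmdObj) sends one command and returns the server's response.
// awaitDropCommitted() blocks until the pending dropDatabase is committed.
function runWithDropPendingPolicy(cmdObj, runCommand, awaitDropCommitted) {
    // By convention the command name is the first key of the command object.
    const cmdName = Object.keys(cmdObj)[0];
    let res = runCommand(cmdObj);
    while (isDropPending(res)) {
        if (cmdName === "dropDatabase") {
            // Assumption: report the skipped drop as a successful no-op so
            // the workload simply transitions to its next state.
            return {ok: 1};
        }
        awaitDropCommitted();
        res = runCommand(cmdObj);
    }
    return res;
}

// Usage with a fake server on which one drop is already pending.
let dropPending = true;
const fakeRun = () => dropPending ? {ok: 0, codeName: "DatabaseDropPending"} : {ok: 1};
const fakeAwait = () => { dropPending = false; };

console.log(runWithDropPendingPolicy({dropDatabase: 1}, fakeRun, fakeAwait));  // not retried
console.log(runWithDropPendingPolicy({insert: "coll"}, fakeRun, fakeAwait));   // waits, then retries
```

Only non-drop commands ever wait in this sketch, so a client cannot queue a second dropDatabase behind the one that is already in progress.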

When the database is finally dropped, it's guaranteed that none of the clients waiting on it is issuing another dropDatabase, so they should all be able to proceed. There might be edge cases where one client is able to execute multiple commands and one of those commands is another dropDatabase, but the likelihood of this happening 5 times in a row should be much smaller, if not negligible.

From a correctness perspective, this change implicitly turns some dropDatabase commands into no-ops, which should not cause loss of test coverage, as databases can't be dropped in parallel in the first place. The tests that run parallel dropDatabase commands are also all randomized tests and don't expect these operations to all succeed when there are parallel clients operating on the same database.

We should also write a dedicated regression test that does a high number of collection DDL operations while dropping and creating databases to simulate the timeout failures we've seen. The changes from this ticket should prevent the test from failing.

The new test and the changes to not retry dropDatabase should be limited to the concurrency_simultaneous_replication suite, as we have not seen this failure elsewhere so far.



 Comments   
Comment by Githook User [ 26/Dec/18 ]

Author: Max Hirschhorn (username: visemet, email: max.hirschhorn@mongodb.com)

Message: SERVER-36751 Skip retrying dropDatabase on DatabaseDropPending error.
Branch: master
https://github.com/mongodb/mongo/commit/62c7e599ba211209eb93ae8f652d17fc8f6c251f

Comment by Max Hirschhorn [ 24/Dec/18 ]

"We should also write a dedicated regression test that does a high number of collection DDL operations while dropping and creating databases to simulate the timeout failures we've seen. The changes from this ticket should prevent the test from failing."

I don't think such a test case is going to be practical to run in Evergreen on a continuous basis.

Comment by Max Hirschhorn [ 30/Aug/18 ]

If there are other users of the implicitly_retry_on_database_drop_pending.js override file outside of the concurrency framework, we can put the new behavior behind a TestData option that's only enabled for the concurrency framework.
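
Such a gate could look roughly like the sketch below. The option name skipDropDatabaseOnDatabaseDropPending and the helper shouldSkipRetry are assumptions made for illustration. In the mongo shell TestData is a global object, but it is passed as a parameter here so the snippet runs on its own.

```javascript
// Hypothetical gate: the option name is an assumption for illustration.
function shouldSkipRetry(cmdName, testData) {
    const optedIn = Boolean(testData && testData.skipDropDatabaseOnDatabaseDropPending);
    return optedIn && cmdName === "dropDatabase";
}

// The concurrency framework would set the option; other users of the
// override file would leave it unset and keep the existing retry behavior.
const concurrencyTestData = {skipDropDatabaseOnDatabaseDropPending: true};
const otherTestData = {};

console.log(shouldSkipRetry("dropDatabase", concurrencyTestData));
console.log(shouldSkipRetry("dropDatabase", otherTestData));
console.log(shouldSkipRetry("insert", concurrencyTestData));
```

Defaulting to the existing retry behavior when the option is unset keeps the change invisible to any suite that does not opt in.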
