[SERVER-57476] Operation may block on prepare conflict while holding oplog slot, stalling replication indefinitely
Created: 05/Jun/21 Updated: 29/Oct/23 Resolved: 10/Jun/21
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.2.0, 4.4.0, 5.0.0-rc0 |
| Fix Version/s: | 4.2.15, 4.4.7, 5.0.0-rc2, 5.1.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Max Hirschhorn | Assignee: | Daniel Gottlieb (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v5.0, v4.4, v4.2 |
| Sprint: | Repl 2021-06-14 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
wiredTigerPrepareConflictRetry() doesn't release any storage resources when WiredTiger returns WT_PREPARE_CONFLICT. This is because prepare conflicts are expected to resolve quickly and the storage transaction may already have a snapshot compatible with the commitTimestamp of the prepared transaction. However, committing or aborting a prepared transaction depends on the ability of the replica set to replicate and confirm majority-committed writes. If the operation blocked in wiredTigerPrepareConflictRetry() has reserved an oplog slot, then oplog readers will block until the oplog hole is resolved. But oplog readers being blocked prevents new incoming writes from becoming majority-committed, which can prevent a prepared transaction from exiting the kPrepared state.
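For illustration, the following is a simplified, self-contained sketch of the retry pattern described above. It is not the server's actual implementation; apart from the WT_PREPARE_CONFLICT return code, the type and function names (OperationContext, waitUntilPreparedUnitOfWorkCommitsOrAborts) are stand-ins for the real ones.

{code:cpp}
#include <functional>

// Stand-in for the WiredTiger return code; the real constant is defined in wiredtiger.h.
constexpr int WT_PREPARE_CONFLICT = -31808;

struct OperationContext {};  // placeholder for the server's OperationContext

// Placeholder: in the server this blocks until some prepared transaction
// commits or aborts, which requires the replica set to majority-commit writes.
void waitUntilPreparedUnitOfWorkCommitsOrAborts(OperationContext*) {}

// The operation is retried in place. Nothing is released while waiting: the
// storage snapshot stays open and, crucially, any oplog slot the operation has
// already reserved stays reserved, leaving an oplog hole that blocks readers.
int prepareConflictRetrySketch(OperationContext* opCtx, std::function<int()> op) {
    int ret = op();
    while (ret == WT_PREPARE_CONFLICT) {
        waitUntilPreparedUnitOfWorkCommitsOrAborts(opCtx);
        ret = op();
    }
    return ret;
}
{code}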
| Comments |
| Comment by Githook User [ 11/Jun/21 ] |
Author: Daniel Gottlieb <daniel.gottlieb@mongodb.com> (dgottlieb)
Message: (cherry picked from commit 1e7e343fb6c90fbf0c62deabf61630353e2e5e29)
| Comment by Githook User [ 10/Jun/21 ] |
Author: Daniel Gottlieb <daniel.gottlieb@mongodb.com> (dgottlieb)
Message: (cherry picked from commit 1e7e343fb6c90fbf0c62deabf61630353e2e5e29)
| Comment by Githook User [ 10/Jun/21 ] |
Author: Daniel Gottlieb <daniel.gottlieb@mongodb.com> (dgottlieb)
Message: (cherry picked from commit 1e7e343fb6c90fbf0c62deabf61630353e2e5e29)
| Comment by Githook User [ 10/Jun/21 ] |
Author: Daniel Gottlieb <daniel.gottlieb@mongodb.com> (dgottlieb)
Message:
| Comment by Daniel Gottlieb (Inactive) [ 09/Jun/21 ] |
Changing my answer: no. This should only create {{WriteConflictException}}s that get retried internally and are never returned to a user. This is based on the assumption that (MDB) transactions don't generate their oplog timestamps until the very end.
| Comment by Githook User [ 09/Jun/21 ] |
Author: Daniel Gottlieb <daniel.gottlieb@mongodb.com> (dgottlieb)
Message: Revert "
This reverts commit 44419183dfabb246c0a112f6060a372e90ee0d44.
| Comment by Githook User [ 09/Jun/21 ] |
Author: Daniel Gottlieb <daniel.gottlieb@mongodb.com> (dgottlieb)
Message:
| Comment by Daniel Gottlieb (Inactive) [ 08/Jun/21 ] |
Thanks for the thoughtful questions, bruce.lucas. Given the interest in this scenario, I appreciate the timely feedback that helps us quickly gain confidence that the described change is correct, complete, and minimizes impact on users.
Operations that were unlikely to encounter a write conflict are now more likely to observe one[1]. But I don't believe we return a write conflict for anything that couldn't do so previously. I've been using this chart to describe the interleaving that leads to a stall:
But if we reorder who wins the race to allocate a timestamp, we'd now create an "unnecessary" write conflict (the prepare conflict would resolve, allowing the "vectored insert" to make progress without retrying the whole operation):
I don't think we'll see operations hit write conflicts at a high rate, but it's hard to say for certain. The above diagram demonstrates how the benign case can result in a write conflict while we wait for a majority write. The window can be large (a network timescale versus a CPU timescale for healthy systems). But to be clear, the claim is that this patch does not return a write conflict to any operation that couldn't already have seen one before the patch (see [1]). That said, we aren't fully aware of all the cases where we have this race. From the stalls we've investigated, some don't seem to be this specific scenario with vectored inserts against a collection with a non-_id unique index (but we were unable to rule it out entirely; we didn't have visibility into unsharded collections and their index specifications).
I'm not familiar with any such mechanism. I know we already struggle to return users an appropriate error message when violating a unique constraint for index keys in the presence of a non-default collation. This feels like a similar problem.

[1] All of these prepare conflicts being transformed into a write conflict could already be a write conflict today if, for example, the vectored insert tries to write to the document before the "prepared operation" goes into a prepared state. If the vectored insert was not part of an MDB transaction, the write conflict will be retried internally. If it were part of an MDB transaction, the transaction would be aborted and the application would have to retry. Applications that are already resilient to write conflicts in their MDB transactions today should continue to be so after this patch. As noted above, this patch changes the window in which applications can see write conflicts from a CPU timescale to a network timescale. Thus it's plausible this could expose retry bugs in applications that were (un?)lucky enough not to run into them yet.
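To make the footnote concrete, here is a hedged sketch of the two paths it describes: outside a multi-document transaction the conflict is absorbed by an internal retry loop, while inside a transaction it aborts the transaction and the application must retry. The names below (retryOnWriteConflict, backoff) are illustrative stand-ins, not the server's actual symbols.

{code:cpp}
#include <algorithm>
#include <chrono>
#include <exception>
#include <thread>

struct WriteConflictException : std::exception {};  // stand-in for the server's exception type

// Hypothetical backoff helper: sleep a little longer after each failed attempt.
void backoff(int attempt) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1 << std::min(attempt, 6)));
}

// Ordinary (non-transactional) write path: the conflict is retried internally
// and the client never observes it.
template <typename Functor>
auto retryOnWriteConflict(Functor&& op) {
    for (int attempt = 0;; ++attempt) {
        try {
            return op();
        } catch (const WriteConflictException&) {
            backoff(attempt);
        }
    }
}

// Inside a multi-document (MDB) transaction there is no internal retry: the
// WriteConflictException aborts the transaction and the error is returned to
// the application, which must retry the whole transaction itself.
{code}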
| Comment by Bruce Lucas (Inactive) [ 08/Jun/21 ] |
daniel.gottlieb, does this make it possible for user operations that didn't previously encounter write conflict errors, or were unlikely to, to now see a write conflict? That's not necessarily a problem (unless it makes operations likely to see write conflicts at a high rate), but it may be something we need to be aware of in case customers have questions. Also, I wonder if there is any mechanism in the error returned to indicate to a user why they are seeing a write conflict? We have seen cases where write conflicts can be generated that catch users by surprise (even though we say they should always be prepared to handle write conflicts).
| Comment by Daniel Gottlieb (Inactive) [ 07/Jun/21 ] |
For this ticket I plan on failing with a WriteConflictException any operation that encounters a prepare conflict while holding a reserved oplog slot.

This resolution should eliminate the possibility that any deployments in the wild get stuck in the way this ticket describes. However, this change won't discover existing potential problems that are not already being exercised.
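A minimal sketch of how that plan could look, assuming a hypothetical holdsOplogSlot() predicate in place of the real check (this is not the actual patch):

{code:cpp}
#include <functional>

constexpr int WT_PREPARE_CONFLICT = -31808;  // stand-in; the real constant comes from wiredtiger.h

struct OperationContext {};          // placeholder for the server's OperationContext
struct WriteConflictException {};    // placeholder for the server's exception type

bool holdsOplogSlot(OperationContext*) { return false; }               // hypothetical check
void waitUntilPreparedUnitOfWorkCommitsOrAborts(OperationContext*) {}  // placeholder wait
[[noreturn]] void throwWriteConflictException() { throw WriteConflictException{}; }

int prepareConflictRetryWithFix(OperationContext* opCtx, std::function<int()> op) {
    int ret = op();
    while (ret == WT_PREPARE_CONFLICT) {
        if (holdsOplogSlot(opCtx)) {
            // Blocking here would leave an oplog hole and could prevent the
            // prepared transaction from ever committing or aborting. Fail the
            // operation instead, so it is retried internally (or its MDB
            // transaction is aborted and retried by the application).
            throwWriteConflictException();
        }
        waitUntilPreparedUnitOfWorkCommitsOrAborts(opCtx);
        ret = op();
    }
    return ret;
}
{code}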