[SERVER-52667] write_ops_exec::performInserts with retryable writes can lose writes and return an out of order results array Created: 06/Nov/20  Updated: 29/Oct/23  Resolved: 16/Nov/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Bug Priority: Major - P3
Reporter: Daniel Gottlieb (Inactive) Assignee: Daniel Gottlieb (Inactive)
Resolution: Fixed Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Problem/Incident
causes SERVER-52978 Ordered bulk insert no longer stops o... Closed
Backwards Compatibility: Fully Compatible
Participants:

 Description   

Found via code inspection in a CR.

performInserts takes as input a write_ops::Insert (a vector of documents) and returns a result containing a vector of per-document outcomes (results). It's possible for some earlier writes to produce errors while later ones succeed. However, consider the case where an input batch contains 2 writes: {not yet executed insert, already executed insert}. The output results will be ordered as {already executed insert, not yet executed insert}.

This comment demonstrates a scenario where I believe the bug in performInserts can manifest as a wrong response to a user.

My understanding is that today we have no bug. In all cases when the function is called, retryable writes are always at the beginning of the batch (I haven't verified this). But we are adding new code that uses this method and in ways where the writes may be relatively opaque (sourced by the oplog).

I think adding an invariant to formalize the contract or being more careful about batching are both acceptable outcomes.

Assigned to sharding for a first look as this seems specific to retryable writes.



 Comments   
Comment by Githook User [ 16/Nov/20 ]

Author:

{'name': 'Daniel Gottlieb', 'email': 'daniel.gottlieb@mongodb.com', 'username': 'dgottlieb'}

Message: SERVER-52667: Correct performInserts batching.
Branch: master
https://github.com/mongodb/mongo/commit/e1e8ea257b835c4d8e130205a2fd8297bcccbd47

Comment by Daniel Gottlieb (Inactive) [ 10/Nov/20 ]

Had a conversation with max.hirschhorn and there might be a bug today, but connecting the dots is a bit tedious and I'm not familiar enough to know if there's something the system does to prevent this. First to quickly summarize the contract bug with performInserts. The batching code roughly looks like this:

Results[] performInserts(Document[] inputBatch) {
    Result[] results;
    Document[] batchToInsert;
    for (auto doc in inputBatch) {
        if (wasAlreadyExecuted(doc)) {
            results.append(OK);
            continue;
        }
        batchToInsert.append(doc);
    }

    insert(batchToInsert, results);
    return results;
}

Based on the subset of documents where wasAlreadyExecuted returns true, the indexes referring to a document inside inputBatch may not correspond with the index in the results array. In the simplest case, consider a batch of two inserts. The first needs to be processed, but the second was successfully executed and won't be retried: [Op0, Op1]. The output results will be in reverse order: [Result(Op1), Result(Op0)].

Additionally, when the last document in the input was an already executed retryable write, the pending batch to be inserted is never actually inserted. The results array will contain entries only for the already-executed documents, omitting the deferred inserts entirely. But the response will seem as if everything was successful.

I believe a case that exercises this bug in sharding is the following:

  • Perform a retryable batch insert that goes to two shards, A and B.
  • Shard B succeeds inserting all documents.
  • Shard A gets a retryable error and is only partially successful.
  • Before retrying, a chunk migration moves documents that were in the retryable write from B to A.
  • The retryable write is reissued to A, including the new documents that were originally sent to B.
  • performInserts is called on A in a way that can give incorrect output.

    |                                           | S0                                                               | S1                     |
    |-------------------------------------------+------------------------------------------------------------------+------------------------|
    | Retryable BatchInsert({_id: 0}, {_id: 1}) |                                                                  |                        |
    |                                           | Insert({_id: 0}) -> Network Error, not inserted                  |                        |
    |                                           |                                                                  | Insert({_id: 1}) -> OK |
    | ChunkMigration {_id: 1} from S1 -> S0     |                                                                  |                        |
    |                                           | Insert({_id: 1}) -> OK                                           |                        |
    |                                           |                                                                  |                        |
    | Retry BatchInsert({_id: 0}, {_id: 1})     |                                                                  |                        |
    |                                           | performInserts([{_id: 0}, {_id: 1}]) -> {_id: 0} is not inserted |                        |
    |                                           | Response received: [{ok: 1, results: [OK]}]                      |                        |
    

Comment by Daniel Gottlieb (Inactive) [ 09/Nov/20 ]

Saving for posterity, but this comment adds little value. See the following comment for a simplified demonstration of the bug.

The perils of static analysis: without a full grasp of the context, it's hard to assess risk. As an extreme example, there could be code that changes the value of opCtx->inMultiDocumentTransaction. But that's not a viable criticism.

>What does it mean for a batch to contain both a retryable insert and a non-retryable insert?

Aside from the fact that the API can't express this (retryability is noted on the OpCtx), I've now come to the understanding that a batch of documents coming from a single insert command must either be entirely retryable or entirely not.

However, write_ops_exec::performInserts isn't aligned specifically with batches from the perspective of the insert command. One could certainly find two oplog entries that represent inserts, one of which was retryable and the other was not. However, I now see that when performInserts is used to replay oplog entries, the caller must not be passing in inputs (i.e., setting OpCtx state) that dictate the inserts are to be considered retryable writes.

It's becoming a little unclear to me whether write_ops_exec::performInserts should really serve both use-cases (writing a batch from an insert command as well as doing replication). I believe you (Max) also recently had to avoid performInserts because it implicitly created a collection; that's another property beneficial for the insert command that's not as well suited to replication.

But before I close this ticket, I'd like to at least correct the terminology (which I perhaps simplified to a fault) and double-check that I know all of the assumptions that are made for the function to be correct (i.e: the output "results" index maps to the input batch insert index).

Apologies for a terse mathematical definition: let (P)redicate be the set of conditions such that a loop iteration hits this emplace_back, and let (N)on-predicate represent the case where the condition does not hold and the document is added to the batch.

The function is correct iff the input batch matches the pattern: P*N*. I think there's agreement on that?

Jumping a few levels of indirection, I get to this clause. It's not obvious that calling that function in a sequence (even with only increasing values of stmtId) can only return True*False* (corresponding to the higher-level satisfaction of P*N*). I'd be more convinced if that method were doing a range comparison, i.e., "all stmtIds <= 100 are executed and none > 100 are executed".

Comment by Max Hirschhorn [ 07/Nov/20 ]

However, consider the example where an input batch contains 2 documents: {non-retryable insert, retryable insert}. So long as the non-retryable document does not hit the batching limit (has its write deferred), the output results will be for {retryable insert, non-retryable insert}.

My understanding is that today we have no bug. In all cases when the function is called, retryable writes are always at the beginning of the batch (I haven't verified this). But we are adding new code that uses this method and in ways where the writes may be relatively opaque (sourced by the oplog).

What does it mean for a batch to contain both a retryable insert and a non-retryable insert? An insert command contains either non-retryable inserts or retryable inserts (or inserts in a multi-document transaction) depending on whether the "txnNumber" field is present in the command arguments. The txnNumber becomes a property of the OperationContext, and so a batch of inserts can be executed either only as retryable or only as non-retryable.

Generated at Thu Feb 08 05:28:40 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.