[SERVER-79848] Difference in behavior of $_resumeAfter between classic and SBE engines Created: 08/Aug/23  Updated: 18/Dec/23  Resolved: 15/Nov/23

Status: Closed
Project: Core Server
Component/s: Query Execution
Affects Version/s: None
Fix Version/s: 7.3.0-rc0, 7.2.0-rc2

Type: Bug Priority: Major - P3
Reporter: Irina Yatsenko (Inactive) Assignee: Mihai Andrei
Resolution: Fixed Votes: 0
Labels: query-director-triage
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by COMPASS-7456 Investigate changes in SERVER-79848: ... Closed
is depended on by TOOLS-3421 Investigate changes in SERVER-79848: ... Closed
Documented
is documented by DOCS-16497 Investigate changes in SERVER-79848: ... Backlog
Problem/Incident
causes SERVER-83243 Fix clang tidy error Closed
Related
is related to SERVER-84205 Robustify find_with_resume_after_para... Open
Assigned Teams:
Query Execution
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v7.2, v7.0
Steps To Reproduce:

> db.u.drop()
> db.u.insertMany([{x:1}, {x:2}])
> db.runCommand({find: "u", batchSize: 1, $_requestResumeToken: true, hint:  {$natural:1}})
// it will return a resume token with {"$recordId" : NumberLong(1)}
> db.runCommand({find: "u", batchSize: 1, $_requestResumeToken: true, hint:  {$natural:1}, $_resumeAfter: {"$recordId" : NumberLong(1)}})
// it will return a resume token with {"$recordId" : NumberLong(2)} -- the last record in the collection
 
> db.adminCommand({setParameter: 1, internalQueryFrameworkControl: "trySbeEngine"})
> db.runCommand({find: "u", batchSize: 1, $_requestResumeToken: true, hint:  {$natural:1}, $_resumeAfter: {"$recordId" : NumberLong(2)}})
{
        "cursor" : {
                "firstBatch" : [ ],
                "postBatchResumeToken" : {
                        "$recordId" : NumberLong(2)
                },
                "id" : NumberLong(0),
                "ns" : "my.u"
        },
        "ok" : 1
}
 
> db.adminCommand({setParameter: 1, internalQueryFrameworkControl: "forceClassicEngine"})
> db.runCommand({find: "u", batchSize: 1, $_requestResumeToken: true, hint:  {$natural:1}, $_resumeAfter: {"$recordId" : NumberLong(2)}})
{
        "cursor" : {
                "firstBatch" : [ ],
                "postBatchResumeToken" : {
                        "$recordId" : null
                },
                "id" : NumberLong(0),
                "ns" : "my.u"
        },
        "ok" : 1
}

Sprint: QE 2023-11-13, QE 2023-11-27
Participants:
Linked BF Score: 138

 Description   

When resuming from the last record in a collection, the classic engine returns resume token {"$recordId" : null} and SBE returns the last record.
Resuming from {"$recordId" : null} in the classic record cycles back to the start of the collection and raises "KeyNotFound" error in SBE.



 Comments   
Comment by Githook User [ 20/Nov/23 ]

Author:

{'name': 'Mihai Andrei', 'email': 'mihai.andrei@mongodb.com', 'username': 'mtandrei'}

Message: SERVER-79848 Unify null semantics for $_requestResumeToken and $_resumeAfter for both engines

(cherry picked from commit 022c8097ecbf10757a8ee24ad8c957bea6130aa4)
Branch: v7.2
https://github.com/mongodb/mongo/commit/b0c4a578f5540a2f48d6ed157acbbe3f8c5ca4a2

Comment by Githook User [ 14/Nov/23 ]

Author:

{'name': 'Mihai Andrei', 'email': 'mihai.andrei@mongodb.com', 'username': 'mtandrei'}

Message: SERVER-79848 Unify null semantics for $_requestResumeToken and $_resumeAfter for both engines
Branch: master
https://github.com/mongodb/mongo/commit/022c8097ecbf10757a8ee24ad8c957bea6130aa4

Comment by Geert Bosch [ 07/Nov/23 ]

The BF that is waiting for this bug fix has amassed 4713 BFGs in the 3 months since August 8. It's hot due to severity and most importantly frequency, but it still sits here in "Needs Scheduling". How is this getting missed?

Comment by Jiawei Yang [ 01/Nov/23 ]

bernard.gorman@mongodb.com Resharding will use the resumeToken to track the status of cloning collection, so every time failover happens, the reshardingRecipientService will be reconstructed and use the resumeToken persisted in the metadata to resume.

My statement of always fail is actually not very accurate. It only always fail if the cloning of one donor shard has been completed, which is the following case:

  1. The cloning is done for one donor shard and the resumeToken is stored as null for classic engine (last record for SBE)
  2. Failover happens and resharding restarts, then resharding recipient uses the the resumeToken to try fetching from donor again.
  3. Classic engine will rescan the collection, giving the resharding recipient the same data that resharding recipient has already written, causing DuplicateKeyError. SBE is correct since it will return nothing but telling the recipient that the cursor reaches the end.

I feel handling that special case in resharding is not safe because it's dangerous for resharding to assume that null always means end. If in some cases resharding get a resumeToken: null but it actually doesn't reach the end, resharding will succeed in a state of data inconsistency.

Note: we see a lot of this failure in evergreen because those suites have less data and failover is quite frequent. In real production environment, it won't be this easy to hit.

Comment by Bernard Gorman [ 01/Nov/23 ]

I agree that the desirable behaviour here would be

  • SBE returns null when attempting to resume after the final record in the collection, signifying "no more results"
  • Classic and SBE both throw KeyNotFound if attempting to resume after null

The direct consequence of this issue is that if a mongod is using classic engine and running resharding, any failover during that resharding will make the resharding fail, but there should not be any unobservable data inconsistency after the resharding fails.

jiawei.yang@mongodb.com, can you explain why the resumeAfter behaviour would cause resharding to always fail in this scenario? And couldn't resharding just check whether the recordId it's attempting to use is null before proceeding?

Comment by Jiawei Yang [ 31/Oct/23 ]

Since this issue has not been scheduled and we are approaching 7.2.0-rc0, I want to write down the affect of this issue so we can better evaluate this.

The direct consequence of this issue is that if a mongod is using classic engine and running resharding, any failover during that resharding will make the resharding fail, but there should not be any unobservable data inconsistency after the resharding fails.

I don't know how many customers are still using classic engine in a new release like 7.2 so I'm not sure how much affect this issue will make if we don't fix it. 

Comment by Jiawei Yang [ 27/Sep/23 ]

bernard.gorman@mongodb.com Thanks for the context. I think {recordId: null} meaning no more records is okay but the scan should not restart if a query uses {recordId: null} in resumeAfter. I think it's this logic.

Comment by Bernard Gorman [ 27/Sep/23 ]

amr.elhelw@mongodb.com, irina.yatsenko@mongodb.com: the original use-case for this feature was Resumable Initial Sync; it allows the sync to resume a collection scan from the last point it saw if it disconnects while cloning the collection's data. From jiawei.yang@mongodb.com's comment above, it sounds like resharding has more recently started using this mechanism as well - I believe they were initially using a non-type-bracketing scan over _id for cloning.

The reason for the classic behaviour is, I suspect, in keeping with the original semantics of resume tokens and resumeAfter that change streams defined. You are resuming AFTER the specified point because by definition, if you have a resume token for recordID X it means you've already seen that record. If you attempt to resume after the last record, therefore, you get back null because there are no more records. I presume Resumable Initial Sync checks this to determine whether they've reached the end of the collection.

irina.yatsenko@mongodb.com, does this behaviour in SBE deviate from classic only for the very last record in the collection? That is, if we do {$_resumeAfter: {"$recordId" : NumberLong(1)}} in your example above, is the document with recordID 1 actually included in the batch of results we get back? Or does the batch start at the document with recordID 2? Is this behaviour the same in classic and SBE?

Comment by Jiawei Yang [ 26/Sep/23 ]

After PM-2322, resharding is now using $_requestResumeToken to do natural order scan during the cloning phase. I think the classic engine behavior is not correct since we expect to end the scan when resume token is pointing to the last record instead of restarting the scan.

I'm seeing in our evergreen classic engine build variants like https://buildbaron.corp.mongodb.com/ui/#/bfg/BFG-2102722, this behavior is causing resharding failing because when the resharding cloner receives a PBRT with "resumeToken: {$recordID: null}", it can continue use that resumeToken to request the next batch of data and end up receiving some already seen records, leading to duplicate key error on insert.

Comment by Irina Yatsenko (Inactive) [ 29/Aug/23 ]

amr.elhelw@mongodb.com, I don't know re the use case. I ran into this while investigating failures in jstests/core/timeseries/timeseries_resume_after.js.

Generated at Thu Feb 08 06:42:04 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.