[SERVER-79848] Difference in behavior of $_resumeAfter between classic and SBE engines Created: 08/Aug/23 Updated: 18/Dec/23 Resolved: 15/Nov/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Query Execution |
| Affects Version/s: | None |
| Fix Version/s: | 7.3.0-rc0, 7.2.0-rc2 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Irina Yatsenko (Inactive) | Assignee: | Mihai Andrei |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | query-director-triage |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Query Execution |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v7.2, v7.0 |
| Steps To Reproduce: | |
| Sprint: | QE 2023-11-13, QE 2023-11-27 |
| Participants: | |
| Linked BF Score: | 138 |
| Description |
|
When resuming a collection scan from the last record in a collection, the classic engine returns the resume token {"$recordId" : null}, while SBE returns the last record. |
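To make the reported divergence concrete, here is a minimal, self-contained Python sketch (not the server code; the collection contents, batch size, and token shapes are invented for illustration) modelling the classic-engine behaviour the comments describe: once the scan has consumed the final record, the post-batch resume token becomes {"$recordId": null}, and passing that null token back via $_resumeAfter restarts the scan from the beginning instead of ending it:

```python
# Hypothetical model of a collection keyed by recordId.
COLLECTION = [(1, "a"), (2, "b"), (3, "c")]

def scan_batch(resume_after=None, batch_size=2):
    """Return (docs, postBatchResumeToken) for one batch.

    Mimics the reported classic-engine behaviour: after the final record
    is consumed, the returned token is {"$recordId": None}, and resuming
    with that null token restarts the scan (the bug's side effect).
    """
    if resume_after is None or resume_after["$recordId"] is None:
        start = 0  # initial scan, or null token restarting the scan
    else:
        # Resume strictly AFTER the given recordId.
        start = next(i for i, (rid, _) in enumerate(COLLECTION)
                     if rid == resume_after["$recordId"]) + 1
    batch = COLLECTION[start:start + batch_size]
    if start + len(batch) >= len(COLLECTION):
        token = {"$recordId": None}          # end-of-collection marker
    else:
        token = {"$recordId": batch[-1][0]}  # id of last record returned
    return [doc for _, doc in batch], token
```

Feeding the null token back in yields already-seen records, which matches the restart behaviour attributed to the classic engine in the comments; per the description, SBE instead reports a token identifying the last record.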
| Comments |
| Comment by Githook User [ 20/Nov/23 ] |
|
Author: {'name': 'Mihai Andrei', 'email': 'mihai.andrei@mongodb.com', 'username': 'mtandrei'}
Message: (cherry picked from commit 022c8097ecbf10757a8ee24ad8c957bea6130aa4) |
| Comment by Githook User [ 14/Nov/23 ] |
|
Author: {'name': 'Mihai Andrei', 'email': 'mihai.andrei@mongodb.com', 'username': 'mtandrei'}
Message: |
| Comment by Geert Bosch [ 07/Nov/23 ] |
|
The BF that is waiting for this bug fix has amassed 4713 BFGs in the 3 months since August 8. It's hot due to severity and, most importantly, frequency, but it still sits here in "Needs Scheduling". How is this getting missed? |
| Comment by Jiawei Yang [ 01/Nov/23 ] |
|
bernard.gorman@mongodb.com Resharding uses the resumeToken to track the status of cloning a collection, so every time a failover happens, the reshardingRecipientService is reconstructed and uses the resumeToken persisted in the metadata to resume. My statement that it always fails is not quite accurate: it only always fails if the cloning of one donor shard has already been completed, which is the following case:
I feel that handling that special case in resharding is not safe, because it's dangerous for resharding to assume that null always means the end of the scan. If in some case resharding gets a resumeToken: null but the scan hasn't actually reached the end, resharding will succeed while leaving the data inconsistent. Note: we see a lot of these failures in Evergreen because those suites have little data and failovers are quite frequent. In a real production environment, it won't be this easy to hit. |
| Comment by Bernard Gorman [ 01/Nov/23 ] |
|
I agree that the desirable behaviour here would be
jiawei.yang@mongodb.com, can you explain why the resumeAfter behaviour would cause resharding to always fail in this scenario? And couldn't resharding just check whether the recordId it's attempting to use is null before proceeding? |
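The guard Bernard suggests can be sketched in a few lines of Python (the function name and token shape are hypothetical stand-ins, not server API): check the post-batch resume token's recordId for null before handing it back to $_resumeAfter, treating null as end-of-collection:

```python
def next_resume_filter(pbrt):
    """Return the $_resumeAfter spec for the next batch, or None when the
    post-batch resume token indicates the scan is exhausted.

    `pbrt` is a dict like {"$recordId": 42} or {"$recordId": None};
    this shape is illustrative, not the actual server API.
    """
    if pbrt.get("$recordId") is None:
        return None  # null recordId: this donor's clone is complete, stop
    return {"$_resumeAfter": pbrt}
```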
| Comment by Jiawei Yang [ 31/Oct/23 ] |
|
Since this issue has not been scheduled and we are approaching 7.2.0-rc0, I want to write down the effect of this issue so we can better evaluate it. The direct consequence is that if a mongod is using the classic engine and running resharding, any failover during that resharding will make the resharding fail, but there should not be any unobservable data inconsistency after the resharding fails. I don't know how many customers are still using the classic engine in a new release like 7.2, so I'm not sure how much impact this issue will have if we don't fix it. |
| Comment by Jiawei Yang [ 27/Sep/23 ] |
|
bernard.gorman@mongodb.com Thanks for the context. I think {recordId: null} meaning no more records is okay, but the scan should not restart if a query uses {recordId: null} in resumeAfter. I think it's this logic. |
| Comment by Bernard Gorman [ 27/Sep/23 ] |
|
amr.elhelw@mongodb.com, irina.yatsenko@mongodb.com: the original use-case for this feature was Resumable Initial Sync; it allows the sync to resume a collection scan from the last point it saw if it disconnects while cloning the collection's data. From jiawei.yang@mongodb.com's comment above, it sounds like resharding has more recently started using this mechanism as well - I believe they were initially using a non-type-bracketing scan over _id for cloning.

The reason for the classic behaviour is, I suspect, in keeping with the original semantics of resume tokens and resumeAfter that change streams defined. You are resuming AFTER the specified point because, by definition, if you have a resume token for recordID X it means you've already seen that record. If you attempt to resume after the last record, therefore, you get back null because there are no more records. I presume Resumable Initial Sync checks this to determine whether it has reached the end of the collection.

irina.yatsenko@mongodb.com, does this behaviour in SBE deviate from classic only for the very last record in the collection? That is, if we do {$_resumeAfter: {"$recordId" : NumberLong(1)}} in your example above, is the document with recordID 1 actually included in the batch of results we get back? Or does the batch start at the document with recordID 2? Is this behaviour the same in classic and SBE? |
| Comment by Jiawei Yang [ 26/Sep/23 ] |
|
After PM-2322, resharding now uses $_requestResumeToken to do a natural-order scan during the cloning phase. I think the classic engine behaviour is not correct, since we expect the scan to end when the resume token points to the last record, instead of restarting the scan. In our Evergreen classic-engine build variants, like https://buildbaron.corp.mongodb.com/ui/#/bfg/BFG-2102722, this behaviour is causing resharding to fail: when the resharding cloner receives a PBRT with "resumeToken: {$recordID: null}", it can continue to use that resumeToken to request the next batch of data and ends up receiving some already-seen records, leading to a duplicate key error on insert. |
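This failure mode can be sketched as a toy Python model (not the resharding cloner itself; the collection contents, batch size, and DuplicateKeyError class are stand-ins): the cloner keeps passing the last post-batch token into the next request, and because a null token restarts the scan under the reported classic behaviour, a further request after exhaustion replays records the target has already inserted:

```python
class DuplicateKeyError(Exception):
    """Stand-in for the server's duplicate key error on insert."""

COLLECTION = [(1, "a"), (2, "b"), (3, "c")]  # (recordId, doc) pairs

def batch_after(record_id, n=2):
    """One cloner batch; models the reported classic behaviour where a
    null recordId in the resume token restarts the scan from the top."""
    ids = [rid for rid, _ in COLLECTION]
    start = 0 if record_id is None else ids.index(record_id) + 1
    batch = COLLECTION[start:start + n]
    exhausted = start + len(batch) >= len(COLLECTION)
    return batch, (None if exhausted else batch[-1][0])

def clone(num_batches):
    """Insert `num_batches` batches into a target with a unique recordId index."""
    target, token = {}, None
    for _ in range(num_batches):
        batch, token = batch_after(token)
        for rid, doc in batch:
            if rid in target:
                raise DuplicateKeyError(f"duplicate recordId {rid}")
            target[rid] = doc
    return target
```

Two batches clone the whole toy collection cleanly; a third request, as happens when the cloner resumes with the null token after a failover, replays recordId 1 and hits the duplicate key error.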
| Comment by Irina Yatsenko (Inactive) [ 29/Aug/23 ] |
|
amr.elhelw@mongodb.com, I don't know re the use case. I ran into this while investigating failures in jstests/core/timeseries/timeseries_resume_after.js. |