[SERVER-81473] Changing the timeseries granularity/bucketing values can cause tenant migration and logical initial sync to fail. Created: 26/Sep/23 Updated: 25/Jan/24 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Suganthi Mani | Assignee: | Backlog - Storage Execution Team |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | former-storex-namer | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | |
| Issue Links: | |
| Assigned Teams: | Storage Execution |
| Operating System: | ALL |
| Backport Requested: | v7.3 |
| Participants: | |
| Linked BF Score: | 35 |
| Description |
|
Running a collMod command to change the timeseries granularity during tenant migration or logical initial sync can cause those data migration protocols to fail with the following error:
The error is expected, since we apply oplog entries on top of inconsistent data in both tenant migration and logical initial sync. We need to ignore the error if the oplog application mode is kInitialSync or kUnstableRecovering, just like […]. (Attached a repro for the initial sync case.)

EDIT (11/10/2023): Regarding the fix, I'm considering whether catching these errors individually and ignoring them for the kInitialSync oplog application mode is the correct approach. In the future we may encounter similar cases, and waiting for build failures or production issues to address them doesn't seem ideal. I'm thinking of a solution where we simply ignore any errors when applying oplog entries during kInitialSync mode. However, I'm unsure about the safety of this approach and believe it might require further investigation.
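For context, the granularity is changed via the collMod command's timeseries option. Below is a minimal driver-side sketch of such a command using the mongocxx driver, only to show the shape of the operation that triggers the behavior described above; the connection string, database name, and collection name are placeholders.

```cpp
#include <bsoncxx/builder/basic/document.hpp>
#include <bsoncxx/builder/basic/kvp.hpp>
#include <bsoncxx/json.hpp>
#include <iostream>
#include <mongocxx/client.hpp>
#include <mongocxx/instance.hpp>
#include <mongocxx/uri.hpp>

int main() {
    using bsoncxx::builder::basic::kvp;
    using bsoncxx::builder::basic::make_document;

    mongocxx::instance inst{};  // one instance per process
    mongocxx::client client{mongocxx::uri{"mongodb://localhost:27017"}};
    auto db = client["test"];  // placeholder database name

    // collMod on a timeseries collection, widening its granularity.
    // Issuing this while a node is a tenant-migration recipient or a logical
    // initial-sync target is the scenario this ticket describes.
    auto reply = db.run_command(make_document(
        kvp("collMod", "weather"),  // placeholder collection name
        kvp("timeseries", make_document(kvp("granularity", "hours")))));

    std::cout << bsoncxx::to_json(reply.view()) << std::endl;
}
```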
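As a separate illustration of the proposed handling (swallow application errors only when the mode is kInitialSync or kUnstableRecovering, rethrow otherwise), here is a self-contained sketch. The enum, helper names, and the simulated error are hypothetical stand-ins, not the actual server code.

```cpp
// Illustrative model only: OplogApplicationMode and
// applyEntryIgnoringInitialSyncErrors are hypothetical stand-ins.
#include <functional>
#include <iostream>
#include <stdexcept>

enum class OplogApplicationMode { kSecondary, kInitialSync, kUnstableRecovering };

// Errors may be swallowed only while the node's data is still inconsistent,
// i.e. during initial sync or unstable recovery.
bool canIgnoreApplicationErrors(OplogApplicationMode mode) {
    return mode == OplogApplicationMode::kInitialSync ||
           mode == OplogApplicationMode::kUnstableRecovering;
}

// Wraps a single oplog-entry application step and applies the proposed policy.
void applyEntryIgnoringInitialSyncErrors(const std::function<void()>& applyEntry,
                                         OplogApplicationMode mode) {
    try {
        applyEntry();
    } catch (const std::exception& e) {
        if (canIgnoreApplicationErrors(mode)) {
            // The collection is brought to a consistent state later in the
            // sync, so the failed entry is treated as applied.
            std::cerr << "Ignoring oplog application error: " << e.what() << '\n';
            return;
        }
        throw;  // Steady-state (kSecondary) application must still fail loudly.
    }
}

int main() {
    auto failingCollMod = [] {
        // Stand-in for a collMod entry that changes timeseries granularity
        // against inconsistent data and fails.
        throw std::runtime_error("simulated collMod application failure");
    };

    // Ignored: the node is still initial syncing.
    applyEntryIgnoringInitialSyncErrors(failingCollMod,
                                        OplogApplicationMode::kInitialSync);

    // Rethrown: steady-state replication must surface the error.
    try {
        applyEntryIgnoringInitialSyncErrors(failingCollMod,
                                            OplogApplicationMode::kSecondary);
    } catch (const std::exception& e) {
        std::cerr << "Steady-state application failed as expected: " << e.what() << '\n';
    }
    return 0;
}
```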
|
| Comments |
| Comment by Suganthi Mani [ 10/Nov/23 ] |
|
Considering whether to delay addressing this issue, I think it's better to fix it sooner. This problem affects logical initial sync. I agree that we retry upon initial sync failure some number of times, and the retry should succeed for the issue mentioned in this ticket. But it's important to note that logical initial syncs are expensive, and we drop all the cloned collections before each retry, leading to wasted effort. Additionally, this problem causes test failures in Evergreen, and we need to address it to reduce noise. It also raises the question of why this issue isn't caught in the initial sync test suites. Aren't we testing timeseries workloads in those initial sync suites? gregory.noma@mongodb.com |
| Comment by Steven Vannelli [ 03/Oct/23 ] |
|
Backlogging this ticket for now, since it is not a high priority for the team. This won't be a problem for tenant migrations in future versions. |