[SERVER-81473] Changing the timeseries granularity/bucketing values can cause tenant migration and logical initial sync to fail. Created: 26/Sep/23  Updated: 25/Jan/24

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Suganthi Mani Assignee: Backlog - Storage Execution Team
Resolution: Unresolved Votes: 0
Labels: former-storex-namer
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File SERVER-81473_repro.js    
Issue Links:
Backports
Depends
Assigned Teams:
Storage Execution
Operating System: ALL
Backport Requested:
v7.3
Participants:
Linked BF Score: 35

 Description   

Running a collMod command to change the timeseries granularity during a tenant migration or logical initial sync can cause those data migration protocols to fail with the following error:

"error":{"code":72,"codeName":"InvalidOptions","errmsg":"Invalid transition for timeseries.granularity. Can only transition from 'seconds' to 'minutes' or 'minutes' to 'hours'."}}}

The error is expected, as both tenant migration and logical initial sync apply oplog entries on top of inconsistent data. We need to ignore the error when the oplog application mode is kInitialSync or kUnstableRecovering, just like SERVER-80301. The fix would be to add InvalidOptions to the collMod ignore list.
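To illustrate the failure mode, here is a minimal sketch (illustrative only, not the server's actual code) of the granularity transition rule and why replaying a collMod oplog entry over inconsistently cloned data trips it: during logical initial sync the collection can be cloned *after* a later collMod already ran, so replaying an earlier collMod looks like a backward transition.

```javascript
// Granularity may only move "forward": seconds -> minutes -> hours.
const order = { seconds: 0, minutes: 1, hours: 2 };

// Hypothetical stand-in for the server-side check that raises InvalidOptions.
function validateGranularityTransition(current, requested) {
  if (order[requested] < order[current]) {
    const err = new Error(
      "Invalid transition for timeseries.granularity. Can only transition " +
      "from 'seconds' to 'minutes' or 'minutes' to 'hours'.");
    err.code = 72;
    err.codeName = "InvalidOptions";
    throw err;
  }
}

// On the sync source the transitions happen in order:
validateGranularityTransition("seconds", "minutes"); // ok

// During initial sync the cloned catalog may already say 'hours'.
// Replaying the earlier collMod (to 'minutes') is then a backward
// transition and throws InvalidOptions (code 72).
try {
  validateGranularityTransition("hours", "minutes");
} catch (e) {
  console.log(e.codeName); // InvalidOptions
}
```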

Regarding the fix, I'm not sure that catching these errors individually and ignoring them for the kInitialSync oplog application mode is the correct approach. We may encounter similar cases in the future, and waiting for build failures or production issues to surface them doesn't seem ideal. One option is to simply ignore any error when applying oplog entries in kInitialSync mode, but I'm unsure about the safety of that approach and believe it needs further investigation.
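The narrower, per-code approach could be sketched as follows. This is a hedged illustration, not the server's implementation: the mode names mirror the server's oplog application modes, but the helper and ignore list are hypothetical.

```javascript
// Illustrative oplog application modes (names mirror the server's).
const Mode = { kInitialSync: 0, kUnstableRecovering: 1, kSecondary: 2 };

// SERVER-80301 added a similar per-command ignore list for collMod;
// the proposed fix is to include InvalidOptions in it.
const collModIgnorableErrors = new Set(["InvalidOptions"]);

// Hypothetical wrapper around applying a single collMod oplog entry.
function applyCollModOplogEntry(applyFn, mode) {
  try {
    applyFn();
    return "applied";
  } catch (err) {
    const inconsistentMode =
      mode === Mode.kInitialSync || mode === Mode.kUnstableRecovering;
    if (inconsistentMode && collModIgnorableErrors.has(err.codeName)) {
      // The node converges to a consistent state once it catches up,
      // so the stale collMod can safely be skipped here.
      return "ignored";
    }
    throw err; // steady-state replication must not swallow errors
  }
}
```

The blanket alternative would drop the ignore-list check and swallow every error in kInitialSync mode, which is simpler but, as noted above, of unproven safety.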

(Attached a repro for the initial sync case.)

EDIT (11/10/2023)
Modifying the bucketing values (e.g. bucketMaxSpanSeconds) during a concurrent tenant migration or logical initial sync will cause the migration/initial sync to fail:

[j1:rs1:prim] | 2023-11-09T02:27:20.103+00:00 D1 TENANT_M 4886005 [TenantMigrationRecipientService-4] "TenantOplogApplier::_finishShutdown","attr":{"protocol":0,"migrationId":{"uuid":{"$uuid":"adc565c0-0844-4ea2-a36d-b76c4699bdfc"}},"error":"InvalidOptions: Timeseries 'bucketMaxSpanSeconds' needs to be equal or greater to transition"}



 Comments   
Comment by Suganthi Mani [ 10/Nov/23 ]

Considering whether to delay addressing this issue: I think it's better to fix it sooner, since the problem affects logical initial sync. I agree that we retry after an initial sync failure some number of times, and the retry should succeed for the issue mentioned in this ticket. But it's important to note that logical initial syncs are expensive, and we drop all the cloned collections before retrying, leading to wasted effort.

Additionally, this problem causes test failures in Evergreen, and we need to address it to reduce noise. It also raises the question of why this issue isn't caught in the initial sync test suites. Aren't we testing timeseries workloads in those suites? gregory.noma@mongodb.com

Comment by Steven Vannelli [ 03/Oct/23 ]

Backlogging this ticket for now since it is not a high priority for the team. This won't be a problem for tenant migrations in future versions.

Generated at Thu Feb 08 06:46:36 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.