[SERVER-56178] Investigate oplog batcher during shutdown Created: 19/Apr/21  Updated: 10/May/21  Resolved: 10/May/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Jason Zhang Assignee: Lingzhi Deng
Resolution: Duplicate Votes: 0
Labels: post-rc0
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-56767 Check for interrupt before initializi... Closed
Operating System: ALL
Steps To Reproduce:

Start a migration, wait for the donor to commit, and immediately shut down the donor and recipient.

Sprint: Repl 2021-05-03, Repl 2021-05-17
Participants:

 Description   

If we stop a recipient replica set while the oplog applier is still running (namely, while the oplog batcher is scheduling the next batch), the migration seems to hang, which suggests a potential race.

 



 Comments   
Comment by Lingzhi Deng [ 07/May/21 ]

In this log:

[js_test:tenant_migration_donor_kill_op_retry] d20022| 2021-04-16T18:41:27.101+00:00 D2 TENANT_M 5350800 [TenantMigrationRecipientService-1] "Already completed fetching retryable writes oplog entries from donor, skipping stage","attr":{"migrationId":{"uuid":{"$uuid":"1ce38aa5-b667-4e49-989a-f79912efd97f"}},"tenantId":"testTenantId2"}
[js_test:tenant_migration_donor_kill_op_retry] d20022| 2021-04-16T18:41:27.103+00:00 E  TENANT_M 4881204 [TenantMigrationRecipientService-1] "Recipient migration service oplog fetcher failed","attr":{"tenantId":"testTenantId2","migrationId":{"uuid":{"$uuid":"1ce38aa5-b667-4e49-989a-f79912efd97f"}},"error":{"code":120,"codeName":"OplogStartMissing","errmsg":"Received an empty batch from sync source."}}
[js_test:tenant_migration_donor_kill_op_retry] d20022| 2021-04-16T18:41:27.103+00:00 D1 TENANT_M 4881202 [TenantMigrationRecipientService-4] "Recipient migration service creating oplog applier","attr":{"tenantId":"testTenantId2","migrationId":{"uuid":{"$uuid":"1ce38aa5-b667-4e49-989a-f79912efd97f"}},"startApplyingDonorOpTime":{"ts":{"$timestamp":{"t":1618598470,"i":4}},"t":1}}
[js_test:tenant_migration_donor_kill_op_retry] d20022| 2021-04-16T18:41:27.103+00:00 D2 TENANT_M 5351401 [TenantMigrationRecipientService-4] "Already completed fetching committed transactions from donor, skipping stage","attr":{"migrationId":{"uuid":{"$uuid":"1ce38aa5-b667-4e49-989a-f79912efd97f"}},"tenantId":"testTenantId2"}
[js_test:tenant_migration_donor_kill_op_retry] d20022| 2021-04-16T18:41:27.103+00:00 D1 TENANT_M 4881200 [TenantMigrationRecipientService-4] "Recipient migration service starting oplog applier","attr":{"tenantId":"testTenantId2","migrationId":{"uuid":{"$uuid":"1ce38aa5-b667-4e49-989a-f79912efd97f"}}}

The oplog fetcher failed before the oplog applier was created, and it looks like the oplog applier was never shut down. I think this is likely a duplicate of SERVER-56767.
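
The ordering in the log above (fetcher failure first, applier creation afterwards) can be illustrated with a minimal sketch. This is not MongoDB code: `RecipientInstance`, `Applier`, `interrupt`, `start_applier_unchecked`, and `start_applier_checked` are hypothetical names invented for illustration, and the sketch assumes that an interrupt only shuts down an applier that already exists at the time it runs.

```python
import threading


class Applier:
    """Hypothetical stand-in for an oplog applier; not the real class."""

    def __init__(self):
        self.started = False
        self.shut_down = False

    def startup(self):
        self.started = True

    def shutdown(self):
        self.shut_down = True


class RecipientInstance:
    """Hypothetical stand-in for a recipient migration instance."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._interrupted = False
        self.applier = None

    def interrupt(self):
        # Assumed to run when the fetcher fails or the node shuts down.
        # It can only shut down an applier that already exists.
        with self._mutex:
            self._interrupted = True
            if self.applier is not None:
                self.applier.shutdown()

    def start_applier_unchecked(self):
        # Racy ordering: if interrupt() already ran, nothing will ever
        # shut this applier down.
        with self._mutex:
            self.applier = Applier()
            self.applier.startup()

    def start_applier_checked(self):
        # Check the interrupt flag under the same mutex before creating
        # the applier, so a late start cannot slip past an interrupt.
        with self._mutex:
            if self._interrupted:
                return False
            self.applier = Applier()
            self.applier.startup()
            return True
```

In the unchecked path, calling `interrupt()` before `start_applier_unchecked()` leaves an applier that is started but never shut down, which matches the symptom described in the comment. The checked path refuses to create the applier once the interrupt flag is set.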
