[SERVER-30901] RocksDB Initial sync crash due to stopTimestamp < startTimestamp Created: 30/Aug/17  Updated: 06/Dec/22  Resolved: 03/Jul/18

Status: Closed
Project: Core Server
Component/s: Replication, Storage
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Spencer Brody (Inactive) Assignee: Backlog - Storage Execution Team
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Assigned Teams:
Storage Execution
Operating System: ALL
Participants:
Linked BF Score: 25

 Description   

We saw this failure on evergreen: https://evergreen.mongodb.com/task/mongodb_mongo_master_ubuntu1404_rocksdb_sharding_auth_ea31111dc95eb309269545348c34791b472f6c25_17_08_15_22_36_34/0

A secondary crashed during initial sync:

[js_test:sharding_rs1] 2017-08-16T10:25:17.000+0000 d20762| 2017-08-16T10:25:16.999+0000 I REPL     [repl writer worker 5] Finished cloning data: OK. Beginning oplog replay.
[js_test:sharding_rs1] 2017-08-16T10:25:17.011+0000 d20762| 2017-08-16T10:25:17.010+0000 E REPL     [replication-1] Possible rollback on sync source ip-10-186-5-61:20760. Currently at { : Timestamp 1502879111000|2 }. Started at { : Timestamp 1502879111000|3 }
[js_test:sharding_rs1] 2017-08-16T10:25:17.011+0000 d20762| 2017-08-16T10:25:17.011+0000 I ASIO     [NetworkInterfaceASIO-RS-0] Ending connection to host ip-10-186-5-61:20760 due to bad connection status; 2 connections to that host remain open
[js_test:sharding_rs1] 2017-08-16T10:25:17.012+0000 d20762| 2017-08-16T10:25:17.011+0000 I REPL     [replication-0] Finished fetching oplog during initial sync: CallbackCanceled: error in fetcher batch callback: oplog fetcher is shutting down. Last fetched optime and hash: { ts: Tim
[js_test:sharding_rs1] 2017-08-16T10:25:17.012+0000 d20762| 2017-08-16T10:25:17.011+0000 I REPL     [replication-0] Initial sync attempt finishing up.
[js_test:sharding_rs1] 2017-08-16T10:25:17.012+0000 d20762| 2017-08-16T10:25:17.011+0000 I REPL     [replication-0] Initial Sync Attempt Statistics: { failedInitialSyncAttempts: 0, maxFailedInitialSyncAttempts: 1, initialSyncStart: new Date(1502879111978), initialSyncAttempts: [], f
[js_test:sharding_rs1] 2017-08-16T10:25:17.013+0000 d20762| 2017-08-16T10:25:17.011+0000 E REPL     [replication-0] Initial sync attempt failed -- attempts left: 0 cause: OplogOutOfOrder: Possible rollback on sync source ip-10-186-5-61:20760. Currently at { : Timestamp 1502879111000
[js_test:sharding_rs1] 2017-08-16T10:25:17.013+0000 d20762| 2017-08-16T10:25:17.011+0000 F REPL     [replication-0] The maximum number of retries have been exhausted for initial sync.
[js_test:sharding_rs1] 2017-08-16T10:25:17.015+0000 d20762| 2017-08-16T10:25:17.013+0000 E REPL     [replication-0] Initial sync failed, shutting down now. Restart the server to attempt a new initial sync.
[js_test:sharding_rs1] 2017-08-16T10:25:17.015+0000 d20762| 2017-08-16T10:25:17.013+0000 F -        [replication-0] Fatal assertion 40088 OplogOutOfOrder: Possible rollback on sync source ip-10-186-5-61:20760. Currently at { : Timestamp 1502879111000|2 }. Started at { : Timestamp 15
[js_test:sharding_rs1] 2017-08-16T10:25:17.015+0000 d20762| 2017-08-16T10:25:17.013+0000 F -        [replication-0]
[js_test:sharding_rs1] 2017-08-16T10:25:17.016+0000 d20762|
[js_test:sharding_rs1] 2017-08-16T10:25:17.016+0000 d20762| ***aborting after fassert() failure
[js_test:sharding_rs1] 2017-08-16T10:25:17.016+0000 d20762|
[js_test:sharding_rs1] 2017-08-16T10:25:17.016+0000 d20762|



 Comments   
Comment by Matthew Russotto [ 29/Sep/17 ]

Specifically, it appears RocksDB enforces oplog visibility on reverse scans, while the last timestamp fetcher depends on a reverse scan returning the true last entry regardless of visibility.
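
A rough, self-contained sketch of that interaction (illustrative only, not mongo-rocks or MongoDB code; the Oplog struct, lastVisible boundary, and the two scan functions are invented for the example): a reverse scan that enforces the visibility boundary can report an older "last" entry than one that ignores it.

// Illustrative only: a toy oplog with a visibility boundary, not mongo-rocks code.
#include <cstdint>
#include <iostream>
#include <optional>
#include <vector>

struct Timestamp {
    uint64_t secs;
    uint32_t inc;
    bool operator<(const Timestamp& o) const {
        return secs < o.secs || (secs == o.secs && inc < o.inc);
    }
};

std::ostream& operator<<(std::ostream& os, const Timestamp& ts) {
    return os << "Timestamp " << ts.secs << "|" << ts.inc;
}

struct Oplog {
    std::vector<Timestamp> entries;  // sorted ascending, like local.oplog.rs
    Timestamp lastVisible;           // visibility boundary maintained by the storage engine

    // Reverse scan that respects visibility: skips entries newer than lastVisible.
    std::optional<Timestamp> lastVisibleEntry() const {
        for (auto it = entries.rbegin(); it != entries.rend(); ++it)
            if (!(lastVisible < *it))
                return *it;
        return std::nullopt;
    }

    // Reverse scan that ignores visibility: returns the newest entry on disk.
    std::optional<Timestamp> lastEntry() const {
        if (entries.empty())
            return std::nullopt;
        return entries.back();
    }
};

int main() {
    Oplog oplog;
    oplog.entries = {{1502879111000, 1}, {1502879111000, 2}, {1502879111000, 3}};
    oplog.lastVisible = {1502879111000, 2};  // the |3 entry is written but not yet visible

    std::cout << "visibility-respecting reverse scan: " << *oplog.lastVisibleEntry() << "\n";
    std::cout << "visibility-ignoring reverse scan:   " << *oplog.lastEntry() << "\n";
    return 0;
}

Compiled with any C++17 compiler, the first scan prints ...|2 while the second prints ...|3. If initial sync started from |3 but computed its stop timestamp via the visibility-respecting scan, it would see a stop timestamp behind its start timestamp, mirroring the "Currently at ...|2. Started at ...|3" message in the log.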

Comment by Spencer Brody (Inactive) [ 30/Aug/17 ]

Filed https://github.com/mongodb-partners/mongo-rocks/issues/102 for tracking.

Comment by Spencer Brody (Inactive) [ 30/Aug/17 ]

Copying a comment from siyuan.zhou from the initial investigation on the build failure ticket:

"Agree with Yves, the root cause of this issue is _initialSyncState->beginTimestamp > _initialSyncState->stopTimestamp.

[js_test:sharding_rs1] 2017-08-16T10:25:17.011+0000 d20762| 2017-08-16T10:25:17.010+0000 E REPL     [replication-1] Possible rollback on sync source ip-10-186-5-61:20760. Currently at { : Timestamp 1502879111000|2 }. Started at { : Timestamp 1502879111000|3 }

Before the start of initial sync at , the sync source has optime at 1502879111000|3, t: 1.

[js_test:sharding_rs1] 2017-08-16T10:25:11.923+0000 d20760| 2017-08-16T10:25:11.921+0000 D REPL_HB  [conn5] Processed heartbeat from  and generated response, ... durableOpTime: { ts: Timestamp 1502879111000|2, t: 1 }, opTime: { ts: Timestamp 1502879111000|3, t: 1 }

Later on, the latest optime ts: Timestamp 1502879111000|3, t: 1 became durable and committed.

[js_test:sharding_rs1] 2017-08-16T10:25:16.984+0000 d20762| 2017-08-16T10:25:16.983+0000 D REPL_HB  [replexec-3] Received response to heartbeat (requestId: 24) from ip-10-186-5-61:20760, { ok: 1.0, electionTime: new Date(6454816622996619266), state: 1, v: 2, hbmsg: "", set: "sharding_rs1-rs0", term: 1, primaryId: 0, durableOpTime: { ts: Timestamp 1502879111000|3, t: 1 }, opTime: { ts: Timestamp 1502879111000|3, t: 1 }, $replData: { term: 1, lastOpCommitted: { ts: Timestamp 1502879111000|3, t: 1 }, lastOpVisible: { ts: Timestamp 0|0, t: -1 }, configVersion: 2, replicaSetId: ObjectId('59941d85a5f5304f61ec7a0f'), primaryIndex: 0, syncSourceIndex: -1 } }

However, the last timestamp fetcher got a stale optime ts: Timestamp 1502879111000|2, t: 1.

I suspect this is an issue with the oplog visibility rules in RocksDB, since we recently changed the oplog visibility code."
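
To make the failure mode concrete, here is a simplified, hypothetical version of the sanity check that ends in fassert 40088 (names such as checkOplogOrder and Status are made up for the sketch; this is not the actual InitialSyncer code): if the stop timestamp read back from the sync source after cloning is older than the begin timestamp recorded when initial sync started, the attempt fails with OplogOutOfOrder / "Possible rollback on sync source".

// Simplified, hypothetical check; not the actual MongoDB initial sync code.
#include <cstdint>
#include <iostream>
#include <string>

struct Timestamp {
    uint64_t secs;
    uint32_t inc;
    bool operator<(const Timestamp& o) const {
        return secs < o.secs || (secs == o.secs && inc < o.inc);
    }
};

static std::string toString(const Timestamp& ts) {
    return "Timestamp " + std::to_string(ts.secs) + "|" + std::to_string(ts.inc);
}

struct Status {
    bool ok;
    std::string reason;
};

// beginTimestamp: last oplog entry observed on the sync source before cloning started.
// stopTimestamp:  last oplog entry read back from the sync source after cloning finished.
Status checkOplogOrder(const Timestamp& beginTimestamp,
                       const Timestamp& stopTimestamp,
                       const std::string& syncSource) {
    if (stopTimestamp < beginTimestamp) {
        return {false,
                "OplogOutOfOrder: Possible rollback on sync source " + syncSource +
                    ". Currently at { : " + toString(stopTimestamp) + " }. Started at { : " +
                    toString(beginTimestamp) + " }"};
    }
    return {true, ""};
}

int main() {
    // Values from this failure: sync began at ...|3, but the stale reverse scan on the
    // sync source reported ...|2 as the last entry.
    Status s = checkOplogOrder({1502879111000, 3}, {1502879111000, 2}, "ip-10-186-5-61:20760");
    std::cout << (s.ok ? "OK" : s.reason) << std::endl;
    return 0;
}

With the values from this run (begin ...|3, stop ...|2) the check fails, which is exactly the fatal assertion shown in the log above.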
