[SERVER-38790] The secondary of a shard aborting after fassert() failure Created: 28/Dec/18  Updated: 04/Jan/19  Resolved: 28/Dec/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.4.9
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: chen hui Assignee: Danny Hatcher (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

Hello everyone.

I installed MongoDB 3.4.9 on my computer running CentOS 7.

I configured MongoDB with one shard consisting of a primary, a secondary, and an arbiter, and then inserted data into MongoDB through mongos with 10 threads. But the secondary crashed (I tried many times). The error information in the log file is shown below.

 

2018-12-28T10:01:23.530+0800 I REPL     [replication-4] Restarting oplog query due to error: OperationFailed: GetMore command executor error: CappedPositionLost: CollectionScan died due to failure to restore tailable cursor position. Last seen record id: RecordId(6639858301032998504). Last fetched optime (with hash): { ts: Timestamp 1545962482000|9832, t: 1 }[-5405713631010586723]. Restarts remaining: 3
2018-12-28T10:01:23.531+0800 I REPL     [replication-4] Scheduled new oplog query Fetcher source: localhost:21001 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1545962482000|9832 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 1 } query metadata: { $replData: 1, $oplogQueryData: 1, $ssm: { $secondaryOk: true } } active: 1 timeout: 65000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 175756 -- target:localhost:21001 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp 1545962482000|9832 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 1 } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms
2018-12-28T10:01:23.872+0800 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: Received an empty batch from sync source.
2018-12-28T10:01:23.872+0800 I REPL     [rsBackgroundSync] rollback 0
2018-12-28T10:01:23.872+0800 I REPL     [rsBackgroundSync] transition to ROLLBACK
2018-12-28T10:01:23.872+0800 I NETWORK  [rsBackgroundSync] legacy transport layer closing all connections
2018-12-28T10:01:23.872+0800 I NETWORK  [rsBackgroundSync] Skip closing connection for connection # 5
2018-12-28T10:01:23.872+0800 I NETWORK  [rsBackgroundSync] Skip closing connection for connection # 4
2018-12-28T10:01:23.873+0800 I REPL     [rsBackgroundSync] beginning rollback
2018-12-28T10:01:23.873+0800 I -        [conn9] end connection 127.0.0.1:58724 (6 connections now open)
2018-12-28T10:01:23.873+0800 I -        [conn8] end connection 127.0.0.1:58722 (6 connections now open)
2018-12-28T10:01:23.873+0800 I -        [conn7] end connection 127.0.0.1:58125 (6 connections now open)
2018-12-28T10:01:23.873+0800 I REPL     [rsBackgroundSync] rollback 1
2018-12-28T10:01:23.873+0800 I -        [conn6] end connection 127.0.0.1:58110 (6 connections now open)
2018-12-28T10:01:23.874+0800 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
2018-12-28T10:01:23.874+0800 I REPL     [rsBackgroundSync] rollback our last optime:   Dec 28 10:01:22:2668
2018-12-28T10:01:23.874+0800 I REPL     [rsBackgroundSync] rollback their last optime: Dec 28 10:01:23:9de8
2018-12-28T10:01:23.874+0800 I REPL     [rsBackgroundSync] rollback diff in end of log times: -1 seconds
2018-12-28T10:01:23.961+0800 F REPL     [rsBackgroundSync] rollback error RS100 reached beginning of remote oplog
2018-12-28T10:01:23.961+0800 I REPL     [rsBackgroundSync]   them:      localhost:21001 (127.0.0.1)
2018-12-28T10:01:23.961+0800 I REPL     [rsBackgroundSync]   theirTime: Dec 28 10:01:22 5c2583f2:6201
2018-12-28T10:01:23.961+0800 I REPL     [rsBackgroundSync] rollback finished
2018-12-28T10:01:23.961+0800 F REPL     [rsBackgroundSync] Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: need to rollback, but unable to determine common point between local and remote oplog: NoMatchingDocument: RS100 reached beginning of remote oplog [1] @ 18752
2018-12-28T10:01:23.961+0800 I -        [rsBackgroundSync] Fatal Assertion 28723 at src/mongo/db/repl/rs_rollback.cpp 915
2018-12-28T10:01:23.961+0800 I -        [rsBackgroundSync]

***aborting after fassert() failure



 Comments   
Comment by chen hui [ 04/Jan/19 ]

Thank you for your answer

Comment by Danny Hatcher (Inactive) [ 28/Dec/18 ]

Hello,

From the log snippet you posted, it appears that you are overflowing the oplog of the Primary before the Secondary can adequately catch up. If the Secondary node is unable to find a common point between its own oplog and that of its sync source, it can no longer function correctly, and it aborts with the fassert() failure you are seeing. I recommend increasing the size of the oplog on both the Primary and the Secondary, and/or changing the write concern of your inserts to {w:"majority"}. This will increase the total time it takes for your writes to complete, but it should help prevent your Secondary from falling off the oplog.
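As a rough illustration of both suggestions from the mongo shell (the collection name and insert loop below are hypothetical; rs.printReplicationInfo() must be run while connected directly to the replica set members, not through mongos):

    // On the Primary and the Secondary: report the configured oplog size and
    // the time window ("log length start to end") it currently covers.
    rs.printReplicationInfo()

    // Through mongos: hypothetical insert loop using w:"majority". Each write
    // is only acknowledged once a majority of the shard's replica set (the
    // Primary plus the Secondary here, since the arbiter holds no data) has
    // applied it, which throttles the writers to the pace the Secondary can sustain.
    for (var i = 0; i < 1000; i++) {
        db.test.insert({ _id: i, payload: "x" }, { writeConcern: { w: "majority" } });
    }

If the secondary's oplog window reported above is only a few seconds under your 10-thread insert load, that is consistent with it falling off the Primary's oplog.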

Please note that the SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion, please post on the mongodb-user group or Stack Overflow with the mongodb tag. See also our Technical Support page for additional support resources.

Thank you,

Danny
