[SERVER-23841] Mongod always complain "Fatal assertion 18750 UnrecoverableRollbackError" after mongod abnormal termination Created: 21/Apr/16  Updated: 08/Feb/23  Resolved: 08/Sep/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: 아나 하리 Assignee: Backlog - Replication Team
Resolution: Duplicate Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongodb_log.tar.gz    
Issue Links:
Duplicate
duplicates SERVER-7275 node can't roll back if behind minValid Closed
duplicates SERVER-7200 use oplog as op buffer on secondaries Closed
Related
Assigned Teams:
Replication
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

1) kill primary mongod of replica-set during crud operation,
2) restart primary mongod

Sprint: Repl 15 (06/03/16), Repl 16 (06/24/16)
Participants:
Case:

 Description   

The MongoDB server always complains with "Fatal assertion 18750" after an abnormal mongod termination.

Sometimes the mongod server hangs and becomes almost completely unresponsive (SERVER-23778), so I can't terminate mongod normally.
In that case I kill mongod with SIGKILL, or sometimes Linux kills mongod because of excessive memory usage. Either way, after an abnormal shutdown, mongod always reports this assertion error during startup and stops replicating.

2016-04-20T22:51:04.192+0900 I REPL     [ReplicationExecutor] This node is test02-mongo3:27017 in the config
2016-04-20T22:51:04.193+0900 I REPL     [ReplicationExecutor] Member test02-mongo1:27017 is now in state SECONDARY
2016-04-20T22:51:04.193+0900 I REPL     [ReplicationExecutor] Member test02-mongo2:27017 is now in state PRIMARY
2016-04-20T22:51:05.183+0900 I REPL     [ReplicationExecutor] syncing from: test02-mongo2:27017
2016-04-20T22:51:05.184+0900 I REPL     [SyncSourceFeedback] setting syncSourceFeedback to test02-mongo2:27017
2016-04-20T22:51:05.185+0900 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to test02-mongo2:27017
2016-04-20T22:51:05.187+0900 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 15, timestamp: Apr 20 10:20:00:7). source's GTE: (term: 16, timestamp: Apr 20 10:20:00:7) hashes: (-157997914050892151/-8410528485294631657)
2016-04-20T22:51:05.187+0900 I -        [rsBackgroundSync] Fatal assertion 18750 UnrecoverableRollbackError: need to rollback, but in inconsistent state. minvalid: (term: 16, timestamp: Apr 20 10:20:01:2) > our last optime: (term: 15, timestamp: Apr 20 10:20:00:7)
2016-04-20T22:51:05.187+0900 I -        [rsBackgroundSync]
 
***aborting after fassert() failure
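The assertion fires because the node's minValid optime (recorded before the crash) is ahead of its last applied optime, so a rollback could not leave the node in a consistent state. The logic behind the check can be sketched as follows, with optimes modeled as (term, timestamp) tuples; the names and structure are illustrative, not MongoDB's actual internals:

```python
# Illustrative sketch of the rollback-safety check behind fassert 18750.
# Optimes are modeled as (term, timestamp) tuples; Python compares tuples
# lexicographically, matching "compare term first, then timestamp".

class UnrecoverableRollbackError(Exception):
    """Raised when rollback cannot leave the node in a consistent state."""

def check_rollback_allowed(min_valid, last_applied):
    """Refuse to roll back if minValid is ahead of the last applied optime.

    min_valid marks how far this node must replicate before its data is
    consistent; if it is ahead of what was actually applied, undoing
    operations cannot reach a consistent point, so the server aborts
    instead of attempting the rollback.
    """
    if min_valid > last_applied:
        raise UnrecoverableRollbackError(
            "need to rollback, but in inconsistent state. "
            f"minvalid: {min_valid} > our last optime: {last_applied}")
    return True

# Values from the log above: minValid is from term 16, but the last
# applied optime is from term 15, so the check fails and mongod aborts.
min_valid = (16, "Apr 20 10:20:01:2")
last_applied = (15, "Apr 20 10:20:00:7")
```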

Is this normal behavior, or might there be something wrong with my configuration?
And is there a fast way to recover without a full re-sync?

I found a few similar bug reports in JIRA, but they say the issue is already fixed,
so I am reporting it again.

Thanks.



 Comments   
Comment by Eric Milkie [ 08/Sep/16 ]

This problem is now fixed by the work in SERVER-7200.

Comment by Scott Hernandez (Inactive) [ 22/Jul/16 ]

This is really a duplicate of SERVER-7275, where the solution may be SERVER-7200.

Comment by Benety Goh [ 09/Jun/16 ]

Hi matt.lee,

After examining the logs, it appears shard01-mongo3 tried to sync from another secondary (shard01-mongo2) that had not yet synced its oplog with the new primary (shard01-mongo1). The crash resulted because shard01-mongo3 contained internal post-rollback state that was inconsistent with the secondary it was trying to sync from. We recognize that this is a defect in the server, and the work that will address this issue is described in SERVER-7200.

If you restart shard01-mongo3, it should be able to find two consistent nodes to sync from.

I'll close this ticket as a duplicate of SERVER-7200. Feel free to continue this discussion in SERVER-7200.

Regards,
Ben

Comment by Kelsey Schubert [ 12/May/16 ]

Hi matt.lee,

Thank you for providing the logs files. We have what we need to debug this issue so I am sending this ticket to the replication team. Please continue to watch for updates.

Kind regards,
Thomas

Comment by 아나 하리 [ 12/May/16 ]

Hi Thomas Schubert.

I have experienced the same issue on a stable replica set.

shard01-mongo1 : Primary
shard01-mongo2 : Secondary
shard01-mongo3 : Secondary

I stepped down the primary (shard01-mongo1) by changing its priority; at that point everything was fine,
and shard01-mongo2 became the new primary.

>> shard01-mongo1.domain.io
2016-05-12T14:57:52.540+0900 I REPL     [conn32458] replSetReconfig admin command received from client

After one minute, I stepped down the new primary (shard01-mongo2) by changing its priority. But this time shard01-mongo3 could not make it:
it wrote the assertion error to its log file and crashed.

>> shard01-mongo2.domain.io
2016-05-12T14:58:52.437+0900 I REPL     [conn29958] replSetReconfig admin command received from client

Please check the attached log files.
Thanks.

Comment by 아나 하리 [ 25/Apr/16 ]

It looks like this issue is caused by SERVER-23778.
It happens when stepping down the primary of a replica set that is suffering from the SERVER-23778 issue.

Thanks.

Comment by 아나 하리 [ 22/Apr/16 ]

Hi Thomas Schubert.

1. Yes

storage:
    syncPeriodSecs: 60
    directoryPerDB: true
    dbPath: /data/mongodb
    journal:
        enabled: true
        commitIntervalMs: 100
 
    engine: wiredTiger
 
    wiredTiger:
        engineConfig:
            cacheSizeGB: 40
            journalCompressor: none
            directoryForIndexes: false
        collectionConfig:
            blockCompressor: snappy
        indexConfig:
            prefixCompression: false

2-1. This is a test environment and there is no backup, so I have only used MongoDB 3.2.5.
2-2. 3.2.5
2-3. I don't know yet.
3. I already removed all log files while copying the whole data directory from a healthy node. (I will attach logs if this happens again.)

Anyway, is it correct that a MongoDB node can find its last position in the oplog, replicate the remaining operations, and rejoin the replica set even after a crash (even if a rollback is needed), provided journaling is enabled and the disk is not corrupted?

Thanks.
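The recovery model asked about above can be sketched in simplified form: with journaling enabled, a crashed node recovers its data files from the journal, then resumes pulling and applying oplog entries newer than its last applied checkpoint; because oplog application is idempotent, re-applying from at or before the checkpoint is safe. (This ticket captures the exception: when minValid is ahead of the last applied optime, that resume is impossible and the node fasserts.) The sketch below is a toy model with illustrative names, not MongoDB code:

```python
# Toy sketch of crash-safe, idempotent oplog replay (illustrative only).
# Each op sets a key to a value at a given timestamp; applying an op
# twice yields the same state, so replay from a checkpoint is safe.

def apply_op(data, op):
    """Apply a single idempotent oplog entry to the data map."""
    data[op["key"]] = op["value"]

def replay(data, oplog, last_applied_ts):
    """Re-apply every op newer than the checkpoint; return the new checkpoint."""
    for op in oplog:
        if op["ts"] > last_applied_ts:
            apply_op(data, op)
            last_applied_ts = op["ts"]
    return last_applied_ts

oplog = [
    {"ts": 1, "key": "a", "value": 1},
    {"ts": 2, "key": "b", "value": 2},
    {"ts": 3, "key": "a", "value": 3},
]
data = {"a": 1}          # state as of ts 1, recovered from the journal
checkpoint = replay(data, oplog, last_applied_ts=1)
```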

Comment by Kelsey Schubert [ 21/Apr/16 ]

Hi matt.lee,

Thank you for reporting this behavior. To help us investigate this issue, please answer the following questions:

  1. Can you confirm that journaling is enabled?
  2. Can you provide some details about the history of this node's data files?
    1. On which version of MongoDB did you perform an initial sync or restore from a back up?
    2. Which versions of MongoDB has this node run?
    3. Have you observed any other issues associated with this node besides SERVER-23778?
  3. Can you please upload the complete logs for each node in the replica set?

To recover, I would recommend either completing an initial sync or restoring from a backup.

Kind regards,
Thomas

Generated at Thu Feb 08 04:04:37 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.