[SERVER-27573] Member stuck in ROLLBACK after unclean restart Created: 04/Jan/17  Updated: 29/Jul/17  Resolved: 26/Jun/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.11
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Amanpreet Singh Assignee: Kelsey Schubert
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File affected_mongo.log     PNG File mms-fullpage.png     File mongod-27017.log.2017-01-04T06-05-48     File mongod-27017.log.2017-01-04T06-36-02     File mongod-27017.log.2017-01-04T07-18-26     Text File primary_mongo.log    
Operating System: ALL
Steps To Reproduce:

It does not happen every time.

1. Kill primary member by OOMKILL
2. Wait for a secondary member to become primary
3. Restart killed member
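
Roughly, the steps above correspond to something like the following (hostnames, the config path, and the use of SIGKILL to approximate the OOM killer are assumptions, not the exact commands used):

# Step 1: kill the current primary abruptly (SIGKILL approximates an OOM kill).
kill -9 "$(pgrep -x mongod)"

# Step 2: from another member, wait for a secondary to be elected primary.
mongo --host ip-10-0-17-156:27017 --eval 'rs.status().members.forEach(function (m) { print(m.name, m.stateStr); })'

# Step 3: restart the killed member and watch its state in rs.status().
mongod --config /etc/mongod.conf    # config path is an assumption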

Participants:

 Description   

The primary member of the 3-node replica set was OOM-killed and a secondary member was promoted to primary. Upon restarting the killed member, it came back up but is stuck in the ROLLBACK state with these logs:

mongod.log

2017-01-04T06:08:58.781+0000 I REPL     [ReplicationExecutor] syncing from: ip-10-0-17-156:27017
2017-01-04T06:08:58.796+0000 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: -1, timestamp: Jan  4 03:39:41:1d). source's GTE: (term: -1, timestamp: Jan  4 03:39:46:1) hashes: (-8237435499851558070/-4585935198278308689)
2017-01-04T06:08:58.796+0000 I REPL     [rsBackgroundSync] beginning rollback
2017-01-04T06:08:58.796+0000 I REPL     [rsBackgroundSync] rollback 0
2017-01-04T06:08:58.796+0000 I REPL     [rsBackgroundSync] rollback 1
2017-01-04T06:08:58.798+0000 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
2017-01-04T06:08:58.799+0000 I REPL     [rsBackgroundSync] rollback our last optime:   Jan  4 03:39:41:1d
2017-01-04T06:08:58.799+0000 I REPL     [rsBackgroundSync] rollback their last optime: Jan  4 06:08:57:2
2017-01-04T06:08:58.799+0000 I REPL     [rsBackgroundSync] rollback diff in end of log times: -8956 seconds
2017-01-04T06:09:28.797+0000 I -        [rsBackgroundSync] caught exception (socket exception [FAILED_STATE] for ip-10-0-17-156:27017 (10.0.17.156) failed) in destructor (kill)
2017-01-04T06:09:28.797+0000 W REPL     [rsBackgroundSync] rollback 2 exception 10278 dbclient error communicating with server: ip-10-0-17-156:27017; sleeping 1 min
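
For context, this is roughly how I verify the stuck state from the shell (assuming ip-10-0-33-140 is the affected member; these commands are a sketch, not copied from the incident):

# Confirm the member is stuck in ROLLBACK (myState 9) and list member states.
mongo --host ip-10-0-33-140:27017 --eval '
  var s = rs.status();
  print("myState:", s.myState);
  s.members.forEach(function (m) { print(m.name, m.stateStr); });
'

# Check basic connectivity to the sync source outside of replication.
mongo --host ip-10-0-17-156:27017 --eval 'printjson(db.adminCommand({ping: 1}))'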

The error suggests a network issue, which is incorrect; the servers can reach each other just fine:

[ec2-user@ip-10-0-33-140 ~]$ nc -v -z ip-10-0-17-156 27017
Connection to ip-10-0-17-156 27017 port [tcp/*] succeeded!

I can even connect a mongo shell to the remote server and run queries fine. Also, if I delete all the data and do a full resync, the member is able to connect without any issues.
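
For completeness, the full resync I mention is just the standard wipe-and-restart procedure, roughly as follows (the dbpath and config path are assumptions; adjust to the actual configuration):

# Stop the affected member, wipe its dbpath, and restart it so it performs
# an initial sync from the current primary.
mongo --host ip-10-0-33-140:27017 --eval 'db.getSiblingDB("admin").shutdownServer()'
rm -rf /var/lib/mongo/*             # dbpath is an assumption
mongod --config /etc/mongod.conf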



 Comments   
Comment by Kelsey Schubert [ 26/Jun/17 ]

Hi amanpreet@codigami.com,

Sorry for the delay getting back to you. Unfortunately, we have not been able to reproduce this issue.

If this continues to be an issue, I would recommend posting on the mongodb-user group or Stack Overflow with the mongodb tag for MongoDB-related support discussion. It's possible that some particular network configuration is contributing to this issue.

Additionally, please note that we have SERVER-20739 open to track work to improve how we handle network errors during rollback, which may improve the behavior you're observing. Please feel free to vote for it and watch it for updates.

Kind regards,
Thomas

Comment by Amanpreet Singh [ 13/Jan/17 ]

Hey Thomas,

Sorry about that. I've attached some other relevant log files as well.

Thanks!

Comment by Kelsey Schubert [ 12/Jan/17 ]

Hi amanpreet@codigami.com,

I took a look at the logs you've provided, but am not seeing rollback described. Would you please ensure that the correct log files have been uploaded?

Thank you,
Thomas

Comment by Amanpreet Singh [ 08/Jan/17 ]

Hi @Thomas Schubert,

I've attached logs and a screenshot of MMS dashboard to give some context.

Thanks!

Comment by Kelsey Schubert [ 04/Jan/17 ]

Hi amanpreet@codigami.com,

Thanks for the report. So that we can continue to investigate, would you please upload the complete logs for the affected mongod and the primary node?

Kind regards,
Thomas
