Details
Bug
Resolution: Cannot Reproduce
Major - P3
None
3.2.11
None
ALL
Description
The primary member of a 3-node replica set was OOM-killed and a secondary member was promoted to primary. After restarting the dead member, it came up but is stuck in ROLLBACK state with these logs:
mongod.log
2017-01-04T06:08:58.781+0000 I REPL [ReplicationExecutor] syncing from: ip-10-0-17-156:27017
2017-01-04T06:08:58.796+0000 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: -1, timestamp: Jan 4 03:39:41:1d). source's GTE: (term: -1, timestamp: Jan 4 03:39:46:1) hashes: (-8237435499851558070/-4585935198278308689)
2017-01-04T06:08:58.796+0000 I REPL [rsBackgroundSync] beginning rollback
2017-01-04T06:08:58.796+0000 I REPL [rsBackgroundSync] rollback 0
2017-01-04T06:08:58.796+0000 I REPL [rsBackgroundSync] rollback 1
2017-01-04T06:08:58.798+0000 I REPL [rsBackgroundSync] rollback 2 FindCommonPoint
2017-01-04T06:08:58.799+0000 I REPL [rsBackgroundSync] rollback our last optime: Jan 4 03:39:41:1d
2017-01-04T06:08:58.799+0000 I REPL [rsBackgroundSync] rollback their last optime: Jan 4 06:08:57:2
2017-01-04T06:08:58.799+0000 I REPL [rsBackgroundSync] rollback diff in end of log times: -8956 seconds
2017-01-04T06:09:28.797+0000 I - [rsBackgroundSync] caught exception (socket exception [FAILED_STATE] for ip-10-0-17-156:27017 (10.0.17.156) failed) in destructor (kill)
2017-01-04T06:09:28.797+0000 W REPL [rsBackgroundSync] rollback 2 exception 10278 dbclient error communicating with server: ip-10-0-17-156:27017; sleeping 1 min
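For anyone reading these logs, the arithmetic behind two of the numbers can be checked with a short Python sketch. It reproduces the "diff in end of log times" value from the two optimes, and it measures how long the member waited between the last rollback message and the socket exception. The helper names (`optime_to_datetime`, `log_time`) are made up for this illustration, and the year 2017 is assumed from the log line timestamps because the optimes do not carry a year.

```python
from datetime import datetime


def optime_to_datetime(text, year=2017):
    """Parse an optime such as 'Jan 4 03:39:41:1d'.

    The part after the last colon is the per-second counter and is dropped.
    The year is an assumption taken from the surrounding log timestamps.
    """
    stamp, _, _counter = text.rpartition(":")
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def log_time(text):
    """Parse a mongod log timestamp such as '2017-01-04T06:08:58.799+0000'."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f%z")


# Values copied from the 'rollback our/their last optime' lines.
ours = optime_to_datetime("Jan 4 03:39:41:1d")
theirs = optime_to_datetime("Jan 4 06:08:57:2")
diff_seconds = int((ours - theirs).total_seconds())

# Time between the last FindCommonPoint-stage message and the socket exception.
gap_seconds = (
    log_time("2017-01-04T06:09:28.797+0000")
    - log_time("2017-01-04T06:08:58.799+0000")
).total_seconds()

print(diff_seconds)  # -8956, matching the log
print(gap_seconds)   # just under 30 seconds
```

The member is about 2.5 hours behind the sync source, and the exception arrives almost exactly 30 seconds after the rollback starts looking for the common point. That spacing would be consistent with a timeout on an already established connection rather than an unreachable host, but the log alone does not confirm this.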
The error suggests a network issue, which is incorrect. The servers can reach each other just fine:
[ec2-user@ip-10-0-33-140 ~]$ nc -v -z ip-10-0-17-156 27017
Connection to ip-10-0-17-156 27017 port [tcp/*] succeeded!
I can even connect the mongo shell to the remote server and run queries without problems. Also, if I delete all data and do a full resync, the member connects without any issues.
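The same reachability check can be scripted, which is handy when it needs to run repeatedly from the affected member. Below is a minimal Python sketch of what `nc -z` does. The function name `can_connect` and the 5-second default timeout are choices made for this example, not anything from MongoDB.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    This only proves that the port accepts connections, like `nc -z`.
    It says nothing about how long a query on that connection may take.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False


# Example call, using the host and port from this ticket:
# can_connect("ip-10-0-17-156", 27017)
```

A successful result here, like the `nc` output above, only shows that the TCP handshake works. It does not rule out a slow operation on the sync source that exceeds a client-side socket timeout.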