[SERVER-33899] Replica Node Never Becomes Secondary after Recovering Created: 15/Mar/18  Updated: 07/Apr/23  Resolved: 21/Jun/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Pratiksha Aggarwal Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File logs_of_node2.txt    
Participants:

 Description   

Configuration - MongoDB with three nodes : Primary , Secondary,1 Secondary2

When we shutdown secondary1 node and do not bring it up for long time (say more the 10 mins) then in the output of rs.status() we will see that secondary1 is in RECOVERING state and it doesn't resume into state SECONDARY until we do not perform the manual resync steps (by empyting the data directory and restart process)

Referred https://jira.mongodb.org/browse/SERVER-26360 but can't get the exact RCA or fix.



 Comments   
Comment by Kelsey Schubert [ 05/Apr/18 ]

Hi pragupta,

It looks like your configured oplog size is very small, which is likely resulting in this issue. I'd recommend increasing your oplog size as I suspect that this may resolve this issue.

Please note that the SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag. A question like this involving more discussion would be best posted on the mongodb-users group.

See also our Technical Support page for additional support resources.

Kind regards,
Kelsey

Comment by Pratiksha Aggarwal [ 28/Mar/18 ]

Hi,

Any update on this? Please look into this on priority.

Comment by Pratiksha Aggarwal [ 21/Mar/18 ]

Below is the output of rs.printReplicationInfo()

configured oplog size: 128MB
log length start to end: 381201secs (105.89hrs)
oplog first event time: Fri Mar 16 2018 14:52:41 GMT+0530 (IST)
oplog last event time: Wed Mar 21 2018 00:46:02 GMT+0530 (IST)
now: Wed Mar 21 2018 10:07:04 GMT+0530 (IST)

Also as far as I have faced the issue, every time we shut down a node then it never resumes to SECONDARY state automatically. Always goes into RECOVERING state.

Comment by Kelsey Schubert [ 20/Mar/18 ]

Hi pragupta,

Thanks for the logs it appears that after restarting the node it has fallen off the oplog and can no longer be resynced. To continue to investigate, would you please prove the output of rs.printReplicationInfo() so we can determine the size of the oplog window?

Thank you,
Kelsey

Comment by Pratiksha Aggarwal [ 19/Mar/18 ]

Hi
Any update on this?

Comment by Pratiksha Aggarwal [ 16/Mar/18 ]

MongoDB version - 3.2.11

Output of rs.status :

rs0:PRIMARY> rs.status()
{
	"set" : "rs0",
	"date" : ISODate("2018-03-16T09:43:13.836Z"),
	"myState" : 1,
	"term" : NumberLong(19),
	"heartbeatIntervalMillis" : NumberLong(2000),
	"members" : [
		{
			"_id" : 0,
			"name" : "<ip of primary instance>:27017",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 98424,
			"optime" : {
				"ts" : Timestamp(1521192631, 248),
				"t" : NumberLong(19)
			},
			"optimeDate" : ISODate("2018-03-16T09:30:31Z"),
			"electionTime" : Timestamp(1521094984, 1),
			"electionDate" : ISODate("2018-03-15T06:23:04Z"),
			"configVersion" : 41212,
			"self" : true
		},
		{
			"_id" : 1,
			"name" : "<ip of secondary1 instance>:27017",
			"health" : 1,
			"state" : 3,
			"stateStr" : "RECOVERING",
			"uptime" : 770,
			"optime" : {
				"ts" : Timestamp(1521192112, 103),
				"t" : NumberLong(19)
			},
			"optimeDate" : ISODate("2018-03-16T09:21:52Z"),
			"lastHeartbeat" : ISODate("2018-03-16T09:43:11.988Z"),
			"lastHeartbeatRecv" : ISODate("2018-03-16T09:43:08.969Z"),
			"pingMs" : NumberLong(0),
			"configVersion" : 41212
		},
		{
			"_id" : 2,
			"name" : "<ip of secondary2 instance>:27017",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 15612,
			"optime" : {
				"ts" : Timestamp(1521192631, 248),
				"t" : NumberLong(19)
			},
			"optimeDate" : ISODate("2018-03-16T09:30:31Z"),
			"lastHeartbeat" : ISODate("2018-03-16T09:43:13.515Z"),
			"lastHeartbeatRecv" : ISODate("2018-03-16T09:43:12.771Z"),
			"pingMs" : NumberLong(0),
			"syncingTo" : "<ip of primary instance>:27017",
			"configVersion" : 41212
		}
	],
	"ok" : 1
}

logs_of_node2.txt Attaching logs of node which went to recovering after start up.

Scenario:
Shut down instance i.e secondary 1 of mongo and started it after 8-10 mins. Then it went to recovering state forever.

Comment by Ramon Fernandez Marina [ 15/Mar/18 ]

SERVER-26360 has already been fixed, but you may or may not be affected by it – what version(s) of MongoDB are you using?

Can you also please provide:

  • The output of rs.status()
  • The logs for the affected node after restarting it

Thanks,
Ramón.

Generated at Thu Feb 08 04:34:55 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.