[SERVER-67960] Promotion of new 4.2 config server primary stuck creating index on chunks.ns_1_min_1 Created: 11/Jul/22  Updated: 20/Sep/22  Resolved: 20/Sep/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Iván Groenewold Assignee: Chris Kelly
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

Summary
I have a sharded cluster running MongoDB Community 4.0.28 and I am trying to upgrade the config servers to 4.2.20.

The config server replica set has 3 nodes running 4.0.28, and I have added 3 new nodes running 4.2.20, for a total of six servers. I have set priority to 1 on just 2 nodes (one 4.0 and one 4.2) so that I can control the promotion process; the rest of the nodes all have priority 0.

When I try to promote the 4.2 server via rs.stepDown(), the process does not complete and clients start complaining that they are unable to reach the primary.
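
For reference, this is roughly how the promotion is being driven from the mongo shell (hostnames, member array indexes and the step-down timeout below are placeholders, not the real values):

    // Illustrative sketch only - member indexes depend on the actual rs.conf() ordering.
    // Keep priority 1 on the current 4.0.28 primary and on one new 4.2.20 node; all others stay at 0.
    cfg = rs.conf()
    cfg.members[0].priority = 1   // existing 4.0.28 node
    cfg.members[3].priority = 1   // new 4.2.20 node
    rs.reconfig(cfg)

    // Then, on the 4.0.28 primary, step down so the priority-1 4.2.20 node can be elected.
    rs.stepDown(120)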

Looking at db.currentOp() on the 4.2 host that is trying to become primary, I see the following operation, which seems to be blocking the promotion process:

{
			"type" : "op",
			"host" : "xxxxx:27019",
			"desc" : "rsSync-0",
			"active" : true,
			"currentOpTime" : "2022-07-11T17:51:11.927+0000",
			"effectiveUsers" : [
				{
					"user" : "__system",
					"db" : "local"
				}
			],
			"opid" : 19266327,
			"secs_running" : NumberLong(271),
			"microsecs_running" : NumberLong(271661567),
			"op" : "command",
			"ns" : "config.$cmd",
			"command" : {
				"createIndexes" : "chunks",
				"indexes" : [
					{
						"name" : "ns_1_min_1",
						"key" : {
							"ns" : 1,
							"min" : 1
						},
						"unique" : true
					}
				],
				"$db" : "config"
			},
			"numYields" : 0,
			"waitingForLatch" : {
				"timestamp" : ISODate("2022-07-11T17:46:40.366Z"),
				"captureName" : "ReplicationCoordinatorImpl::_mutex"
			},
			"locks" : {
				"ReplicationStateTransition" : "W"
			},
			"waitingForLock" : false,
			"lockStats" : {
				"ParallelBatchWriterMode" : {
					"acquireCount" : {
						"r" : NumberLong(2)
					}
				},
				"ReplicationStateTransition" : {
					"acquireCount" : {
						"w" : NumberLong(2)
					}
				},
				"Global" : {
					"acquireCount" : {
						"w" : NumberLong(2)
					}
				},
				"Database" : {
					"acquireCount" : {
						"w" : NumberLong(2)
					}
				},
				"Collection" : {
					"acquireCount" : {
						"w" : NumberLong(1)
					}
				},
				"Mutex" : {
					"acquireCount" : {
						"r" : NumberLong(1)
					}
				}
			},
			"waitingForFlowControl" : false,
			"flowControlStats" : {
 
			}
		},

The config.chunks collection only has 9 chunks, so the index creation should be very fast.
Any help would be appreciated.
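
For reference, an operation like the one above can be isolated with a currentOp filter along these lines (sketch only; the 60-second threshold is arbitrary):

    // Illustrative: list long-running commands against the config database.
    db.currentOp({
        "op" : "command",
        "ns" : "config.$cmd",
        "secs_running" : { "$gt" : 60 }
    })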



 Comments   
Comment by Chris Kelly [ 20/Sep/22 ]

Hi igroene@gmail.com,

 

Thanks for your report - just to add:

In 4.2, MongoDB already uses a write concern of "majority" when writing to config servers. In 4.4+, the setDefaultRWConcern command is introduced instead.

Starting in 5.0, we completely ignore getLastErrorDefaults (SERVER-55701), and we will fail to start up or reconfigure a node that sets getLastErrorDefaults (SERVER-56241).
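
For example, on 4.4+ the cluster-wide defaults are managed with the admin commands, roughly like this (sketch only):

    // Inspect and set the cluster-wide default write concern on 4.4+.
    db.adminCommand({ getDefaultRWConcern : 1 })
    db.adminCommand({
        setDefaultRWConcern : 1,
        defaultWriteConcern : { w : "majority" }
    })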

 

Regards,

Christopher

Comment by Iván Groenewold [ 15/Jul/22 ]

For anyone who runs into a similar issue: the problem in this case was that the config server replica set was provisioned with a non-standard default write concern:

		"getLastErrorDefaults" : {
			"w" : "majority",
			"j" : true,
			"wtimeout" : 0
		},

instead of the default values:

		"getLastErrorDefaults" : {
			"w" : 1,
			"wtimeout" : 0
		},

Changing back to w: 1 fixed the problem. For some reason this setting did not cause issues with elections in versions earlier than 4.2.
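
Roughly what was changed, from a mongo shell connected to the config server replica set primary (sketch only; adjust for your own configuration):

    // Reset getLastErrorDefaults back to the defaults and reconfigure the replica set.
    cfg = rs.conf()
    printjson(cfg.settings.getLastErrorDefaults)        // was { w: "majority", j: true, wtimeout: 0 }
    cfg.settings.getLastErrorDefaults = { w : 1, wtimeout : 0 }
    rs.reconfig(cfg)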
