[SERVER-47966] Operation interrupted error adding second csrs member on 4.4.0 Created: 05/May/20  Updated: 27/Oct/23  Resolved: 06/May/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Louisa Berger Assignee: Pavithra Vetriselvan
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File csrs_primary.log     Text File new_csrs.log    
Issue Links:
Related
Sprint: Repl 2020-05-18
Participants:

 Description   

I'm getting "operation was interrupted" errors trying to add a second csrs member to a sharded cluster on 4.4.

I started up a sharded cluster on 4.4.0-rc3 with one CSRS member and waited for it to become primary (ismaster: true). Then I started up a second CSRS member and tried to add it to the CSRS, but the reconfig fails after exactly 10 seconds with:

(InterruptedDueToReplStateChange) Reconfig finished but failed to propagate to a majority :: caused by :: Current config with {version: 2, term: 1} has not yet propagated to a majority of nodes :: caused by :: operation was interrupted


A fuller log snippet from our side:

[.info] [cm/action/dbcmd.go:runRsReconfigCmd:237] <shFT_cs1> [13:44:59.824] Running ReplSetReconfig
[.error] [cm/action/dbcmd.go:runRsReconfigCmd:256] <shFT_cs1> [13:45:09.932] Error running command:
	cmd=[{replSetReconfig map[_id:csrs configsvr:true members:[map[_id:0 arbiterOnly:false buildIndexes:true hidden:false host:louisamac:9004 priority:1 slaveDelay:0 tags:map[] votes:1] map[_id:1 arbiterOnly:false buildIndexes:true hidden:false host:louisamac:9019 priority:1 slaveDelay:0 tags:map[] votes:1]] protocolVersion:1 settings:map[catchUpTakeoverDelayMillis:30000 catchUpTimeoutMillis:-1 chainingAllowed:true electionTimeoutMillis:10000 getLastErrorDefaults:map[w:1 wtimeout:0] getLastErrorModes:map[] heartbeatIntervalMillis:2000 heartbeatTimeoutSecs:10 replicaSetId:ObjectID("5eb1a5c5950801a13ee31207")] version:2]} {maxTimeMS 40000}]
	connParams=louisamac:9004 (local=false) : (InterruptedDueToReplStateChange) Reconfig finished but failed to propagate to a majority :: caused by :: Current config with {version: 2, term: 1} has not yet propagated to a majority of nodes :: caused by :: operation was interrupted
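
For clarity, this is roughly the manual shell equivalent of the reconfig our automation issues (hostname and ports taken from the log above; a sketch, not our actual code):

mongo --port 9004 --eval '
  var cfg = rs.conf();                                   // current config (version 1, single member)
  cfg.members.push({ _id: 1, host: "louisamac:9019" });  // add the new CSRS member
  rs.reconfig(cfg);                                      // the shell helper bumps the version to 2
'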

Entirely possible we're doing something wrong here on our end, but I'm not sure what step we're missing.

I've attached the first and second CSRS member logs (csrs_primary.log and new_csrs.log). Let me know if you need anything else.



 Comments   
Comment by Louisa Berger [ 06/May/20 ]

The rest is being covered in HELP-15470, feel free to close this one out, and thank you for your help!

Comment by Pavithra Vetriselvan [ 05/May/20 ]

No problem! And sure, feel free to post updated logs if you need insight from us!

Comment by Louisa Berger [ 05/May/20 ]

Thank you Pavi! The missing --configsvr flag was the problem here. We're now looking into a second problem that might be related. Is it okay if I leave this ticket open until we iron that out?

Comment by Pavithra Vetriselvan [ 05/May/20 ]

louisa.berger It looks like the root cause is the new node rejecting the config from the primary because of a BadValue error. This comes from our config checks, so I don't think this is a bug in the safe reconfig protocol. Based on the following log line, I think it has to do with how the new node is being set up:

Nodes being used for config servers must be started with the --configsvr flag

I'm not too familiar with starting config servers, but could you try starting the new node with that flag? I see that the new config already has "configsvr": true and "writeConcernMajorityJournalDefault": true, which is good, because both of those fields are required.
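
For reference, a config server replica set member would normally be started along these lines (ports from this ticket; dbpath/logpath are illustrative placeholders, and the existing 9004 member would be started the same way):

# sketch: start the new CSRS member with the required --configsvr flag
mongod --configsvr --replSet csrs --port 9019 \
       --dbpath /data/csrs-9019 --logpath /data/csrs-9019/mongod.log --fork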

Comment by Pavithra Vetriselvan [ 05/May/20 ]

It looks like the new node 9019 rejects the new config:

| 2020-05-05T13:44:59.864-04:00 I  REPL     21400   [conn21] "Scheduling heartbeat to fetch a new config since we are not a member of our current config","attr":{"senderHost":"louisamac:9004"}
| 2020-05-05T13:44:59.865-04:00 I  CONNPOOL 22576   [ReplNetwork] "Connecting to {hostAndPort}","attr":{"hostAndPort":"louisamac:9004"}
| 2020-05-05T13:44:59.911-04:00 W  REPL     21487   [ReplCoord-0] "Not persisting new configuration in heartbeat response to disk because it is invalid","attr":{"error":{"code":2,"codeName":"BadValue","errmsg":"Nodes being used for config servers must be started with the --configsvr flag"}}
| 2020-05-05T13:44:59.911-04:00 E  REPL     21490   [ReplCoord-0] "Could not validate configuration received from remote node; Removing self until an acceptable configuration arrives","attr":{"error":{"code":2,"codeName":"BadValue","errmsg":"Nodes being used for config servers must be started with the --configsvr flag"}}
| 2020-05-05T13:44:59.911-04:00 I  REPL     21392   [ReplCoord-0] "New replica set config in use","attr":{"config":{"_id":"csrs","version":2,"term":1,"configsvr":true,"protocolVersion":1,"writeConcernMajorityJournalDefault":true,"members":[{"_id":0,"host":"louisamac:9004","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":1.0,"tags":{},"slaveDelay":0,"votes":1},{"_id":1,"host":"louisamac:9019","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":1.0,"tags":{},"slaveDelay":0,"votes":1}],"settings":{"chainingAllowed":true,"heartbeatIntervalMillis":2000,"heartbeatTimeoutSecs":10,"electionTimeoutMillis":10000,"catchUpTimeoutMillis":-1,"catchUpTakeoverDelayMillis":30000,"getLastErrorModes":{},"getLastErrorDefaults":{"w":1,"wtimeout":0},"replicaSetId":{"$oid":"5eb1a5c5950801a13ee31207"}}}}
| 2020-05-05T13:44:59.924-04:00 I  REPL     21394   [ReplCoord-0] "This node is not a member of the config"
| 2020-05-05T13:44:59.937-04:00 I  REPL     21358   [ReplCoord-0] "Replica set state transition","attr":{"newState":"REMOVED","oldState":"STARTUP"}

Since node 9019 is removed, I think the original primary 9004 will eventually step down (after 10 seconds, which is the default election timeout). The reconfig command is interrupted because we kill the operation on step down:

| 2020-05-05T13:45:09.913-04:00 I  REPL     21809   [ReplCoord-2] "Can't see a majority of the set, relinquishing primary"
| 2020-05-05T13:45:09.922-04:00 I  REPL     21475   [ReplCoord-2] "Stepping down from primary in response to heartbeat"
| 2020-05-05T13:45:09.922-04:00 I  REPL     21343   [RstlKillOpThread] "Starting to kill user operations"
| 2020-05-05T13:45:09.922-04:00 I  REPL     21344   [RstlKillOpThread] "Stopped killing user operations"
| 2020-05-05T13:45:09.922-04:00 I  REPL     21340   [RstlKillOpThread] "State transition ops metrics","attr":{"metrics":{"lastStateTransition":"stepDown","userOpsKilled":2,"userOpsRunning":0}}
...
| 2020-05-05T13:45:09.923-04:00 I  COMMAND  51803   [conn104] "Slow query","attr":{"type":"command","ns":"local.system.replset","appName":"MongoDB Automation Agent v10.15.0 (git: DEV)","command":{"replSetReconfig":{"_id":"csrs","version":2,"members":[{"slaveDelay":0,"votes":1,"arbiterOnly":false,"hidden":false,"tags":{},"buildIndexes":true,"host":"louisamac:9004","priority":1.0,"_id":0},{"hidden":false,"priority":1.0,"tags":{},"votes":1,"host":"louisamac:9019","buildIndexes":true,"arbiterOnly":false,"slaveDelay":0,"_id":1}],"settings":{"chainingAllowed":true,"heartbeatTimeoutSecs":10,"catchUpTimeoutMillis":-1,"getLastErrorDefaults":{"w":1,"wtimeout":0},"getLastErrorModes":{},"electionTimeoutMillis":10000,"replicaSetId":{"$oid":"5eb1a5c5950801a13ee31207"},"heartbeatIntervalMillis":2000,"catchUpTakeoverDelayMillis":30000},"configsvr":true,"protocolVersion":1},"maxTimeMS":40000,"lsid":{"id":{"$uuid":"da7e2d9a-165d-4383-8033-10bc0b1b480e"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1588700699,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$db":"admin","$readPreference":{"mode":"primaryPreferred"}},"numYields":0,"ok":0,"errMsg":"Reconfig finished but failed to propagate to a majority :: caused by :: Current config with {version: 2, term: 1} has not yet propagated to a majority of nodes :: caused by :: operation was interrupted","errName":11602,"errCode":11602,"reslen":604,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":5}},"ReplicationStateTransition":{"acquireCount":{"w":5}},"Global":{"acquireCount":{"r":1,"w":4}},"Database":{"acquireCount":{"r":1,"w":2,"W":1}},"Collection":{"acquireCount":{"w":2}},"Mutex":{"acquireCount":{"r":3}},"oplog":{"acquireCount":{"r":1}}},"flowControl":{"acquireCount":2,"timeAcquiringMicros":3},"protocol":"op_msg","durationMillis":10097}
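
The 10-second gap between the reconfig at 13:44:59 and the failure at 13:45:09 matches electionTimeoutMillis in the config above. A quick way to confirm that value on the primary (a sketch; port from the logs):

# the primary relinquishes primaryship once it hasn't seen a majority for this long
mongo --port 9004 --eval 'rs.conf().settings.electionTimeoutMillis'   # expect 10000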

Comment by Pavithra Vetriselvan [ 05/May/20 ]

It looks like the primary (9004) is able to persist the new config to disk and is waiting for a majority to replicate the config:

| 2020-05-05T13:44:59.906-04:00 I  REPL     21392   [conn104] "New replica set config in use","attr":{"config":{"_id":"csrs","version":2,"term":1,"configsvr":true,"protocolVersion":1,"writeConcernMajorityJournalDefault":true,"members":[{"_id":0,"host":"louisamac:9004","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":1.0,"tags":{},"slaveDelay":0,"votes":1},{"_id":1,"host":"louisamac:9019","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":1.0,"tags":{},"slaveDelay":0,"votes":1}],"settings":{"chainingAllowed":true,"heartbeatIntervalMillis":2000,"heartbeatTimeoutSecs":10,"electionTimeoutMillis":10000,"catchUpTimeoutMillis":-1,"catchUpTakeoverDelayMillis":30000,"getLastErrorModes":{},"getLastErrorDefaults":{"w":1,"wtimeout":0},"replicaSetId":{"$oid":"5eb1a5c5950801a13ee31207"}}}}
| 2020-05-05T13:44:59.909-04:00 I  REPL     21393   [conn104] "Found self in config","attr":{"hostAndPort":"louisamac:9004"}
| 2020-05-05T13:44:59.910-04:00 I  REPL     4508702 [conn104] "Waiting for the current config to propagate to a majority of nodes."

A bit later, we see failed heartbeats to the new node we added (9019):

| 2020-05-05T13:45:01.922-04:00 I  REPL_HB  23974   [ReplCoord-2] "Heartbeat failed after max retries","attr":{"target":"louisamac:9019","maxHeartbeatRetries":2,"error":{"code":93,"codeName":"InvalidReplicaSetConfig","errmsg":"Our replica set configuration is invalid or does not include us"}}
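
For what it's worth, one way to watch whether the new config has reached each member is from the primary's view of the set (a sketch; port from the logs):

# each member's installed config version, as seen by the primary; the reconfig only
# returns success once a majority of members report the new version (2 here)
mongo --port 9004 --eval 'rs.status().members.forEach(function(m) { print(m.name, m.stateStr, "configVersion:", m.configVersion); })'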
