[SERVER-56129] Problem on rs.stepDown() when upgrading from 4.2.13 to 4.4.5 Created: 15/Apr/21  Updated: 15/Feb/23  Resolved: 12/May/21

Status: Closed
Project: Core Server
Component/s: Upgrade/Downgrade
Affects Version/s: 4.4.4, 4.4.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Augusto Fonseca Assignee: Matthew Russotto
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-54876 Upgrade mongodb 4.4.4 from 4.2.12 failed Closed
Operating System: ALL
Steps To Reproduce:

1 - Stop secondary

2- Upgrade secondary from 4.2.13 to 4.4.4

3- Start secondary with new version

4- On primary do rs.stepDown()

Participants:

 Description   

Hi, i have a replica set of 3 nodes, with 4.2.13 version, i upgraded the secondary to 4.4.4 and started it. I see that this node becames SECONDARY and can comunicate with PRIMARY. 

Then i go to the primary and i do rs.stepDown() to install the new version. The upgraded node becames PRIMARY and then dies.

Log of node 4.4.4 becaming primary:

{"t":{"$date":"2021-04-15T13:16:53.211+01:00"},"s":"I",  "c":"COMMAND",  "id":21581,   "ctx":"conn5","msg":"Received replSetStepUp request"}
{"t":{"$date":"2021-04-15T13:16:53.211+01:00"},"s":"I",  "c":"ELECTION", "id":4615661, "ctx":"conn5","msg":"Starting an election due to step up request"}
{"t":{"$date":"2021-04-15T13:16:53.211+01:00"},"s":"I",  "c":"ELECTION", "id":21437,   "ctx":"conn5","msg":"Skipping dry run and running for election","attr":{"newTerm":23}}
{"t":{"$date":"2021-04-15T13:16:53.212+01:00"},"s":"I",  "c":"REPL",     "id":21752,   "ctx":"ReplCoord-1","msg":"Scheduling remote command request","attr":{"context":"vote request","request":"RemoteCommand 202 -- target:127.0.0.1:27017 db:admin cmd:{ replSetRequestVotes: 1, setName: \"sigo\", dryRun: false, term: 23, candidateIndex: 1, configVersion: 1396130, lastCommittedOp: { ts: Timestamp(1618489005, 1), t: 22 } }"}}
{"t":{"$date":"2021-04-15T13:16:53.213+01:00"},"s":"I",  "c":"REPL",     "id":21752,   "ctx":"ReplCoord-1","msg":"Scheduling remote command request","attr":{"context":"vote request","request":"RemoteCommand 203 -- target:127.0.0.3:27018 db:admin cmd:{ replSetRequestVotes: 1, setName: \"sigo\", dryRun: false, term: 23, candidateIndex: 1, configVersion: 1396130, lastCommittedOp: { ts: Timestamp(1618489005, 1), t: 22 } }"}}
{"t":{"$date":"2021-04-15T13:16:53.213+01:00"},"s":"I",  "c":"ELECTION", "id":51799,   "ctx":"ReplCoord-0","msg":"VoteRequester processResponse","attr":{"term":23,"dryRun":false,"vote":"no","from":"127.0.0.3:27018","reason":"can see a healthy primary (127.0.0.1:27017) of equal or greater priority","message":{"term":23,"voteGranted":false,"reason":"can see a healthy primary (127.0.0.1:27017) of equal or greater priority","ok":1.0}}}
{"t":{"$date":"2021-04-15T13:16:53.214+01:00"},"s":"I",  "c":"ELECTION", "id":51799,   "ctx":"ReplCoord-2","msg":"VoteRequester processResponse","attr":{"term":23,"dryRun":false,"vote":"yes","from":"127.0.0.1:27017","message":{"term":23,"voteGranted":true,"reason":"","ok":1.0,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1618489005,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"operationTime":{"$timestamp":{"t":1618489005,"i":1}}}}}
{"t":{"$date":"2021-04-15T13:16:53.214+01:00"},"s":"I",  "c":"ELECTION", "id":21450,   "ctx":"ReplCoord-2","msg":"Election succeeded, assuming primary role","attr":{"term":23}}
{"t":{"$date":"2021-04-15T13:16:53.214+01:00"},"s":"I",  "c":"REPL",     "id":21358,   "ctx":"ReplCoord-2","msg":"Replica set state transition","attr":{"newState":"PRIMARY","oldState":"SECONDARY"}}
{"t":{"$date":"2021-04-15T13:16:53.214+01:00"},"s":"I",  "c":"REPL",     "id":21106,   "ctx":"ReplCoord-2","msg":"Resetting sync source to empty","attr":{"previousSyncSource":"127.0.0.1:27017"}}
{"t":{"$date":"2021-04-15T13:16:53.214+01:00"},"s":"I",  "c":"REPL",     "id":21359,   "ctx":"ReplCoord-2","msg":"Entering primary catch-up mode"}
{"t":{"$date":"2021-04-15T13:16:53.214+01:00"},"s":"I",  "c":"REPL",     "id":21215,   "ctx":"ReplCoord-5","msg":"Member is in new state","attr":{"hostAndPort":"127.0.0.1:27017","newState":"SECONDARY"}}
{"t":{"$date":"2021-04-15T13:16:53.214+01:00"},"s":"I",  "c":"REPL",     "id":21364,   "ctx":"ReplCoord-5","msg":"Caught up to the latest optime known via heartbeats after becoming primary","attr":{"targetOpTime":{"ts":{"$timestamp":{"t":1618489005,"i":1}},"t":22},"myLastApplied":{"ts":{"$timestamp":{"t":1618489005,"i":1}},"t":22}}}
{"t":{"$date":"2021-04-15T13:16:53.214+01:00"},"s":"I",  "c":"REPL",     "id":21363,   "ctx":"ReplCoord-5","msg":"Exited primary catch-up mode"}
{"t":{"$date":"2021-04-15T13:16:53.214+01:00"},"s":"I",  "c":"REPL",     "id":21107,   "ctx":"ReplCoord-5","msg":"Stopping replication producer"}
{"t":{"$date":"2021-04-15T13:16:53.214+01:00"},"s":"I",  "c":"NETWORK",  "id":20125,   "ctx":"ReplCoordExtern-0","msg":"DBClientConnection failed to receive message","attr":{"connString":"127.0.0.1:27017","error":"HostUnreachable: Connection closed by peer"}}
{"t":{"$date":"2021-04-15T13:16:53.215+01:00"},"s":"I",  "c":"REPL",     "id":21275,   "ctx":"ReplCoordExtern-0","msg":"Recreating cursor for oplog fetcher due to error","attr":{"lastOpTimeFetched":{"ts":{"$timestamp":{"t":1618489005,"i":1}},"t":22},"attemptsRemaining":1,"error":"HostUnreachable: Error while getting the next batch in the oplog fetcher :: caused by :: dbclient error communicating with server 127.0.0.1:27017 :: caused by :: Connection closed by peer"}}
{"t":{"$date":"2021-04-15T13:16:53.214+01:00"},"s":"I",  "c":"REPL",     "id":21239,   "ctx":"ReplBatcher","msg":"Oplog buffer has been drained","attr":{"term":23}}
{"t":{"$date":"2021-04-15T13:16:53.215+01:00"},"s":"I",  "c":"REPL",     "id":21343,   "ctx":"RstlKillOpThread","msg":"Starting to kill user operations"}
{"t":{"$date":"2021-04-15T13:16:53.215+01:00"},"s":"I",  "c":"REPL",     "id":21344,   "ctx":"RstlKillOpThread","msg":"Stopped killing user operations"}
{"t":{"$date":"2021-04-15T13:16:53.215+01:00"},"s":"I",  "c":"REPL",     "id":21340,   "ctx":"RstlKillOpThread","msg":"State transition ops metrics","attr":{"metrics":{"lastStateTransition":"stepUp","userOpsKilled":0,"userOpsRunning":0}}}
{"t":{"$date":"2021-04-15T13:16:53.215+01:00"},"s":"I",  "c":"STORAGE",  "id":20657,   "ctx":"OplogApplier-0","msg":"IndexBuildsCoordinator::onStepUp - this node is stepping up to primary"}
{"t":{"$date":"2021-04-15T13:16:53.216+01:00"},"s":"I",  "c":"REPL",     "id":21331,   "ctx":"OplogApplier-0","msg":"Transition to primary complete; database writes are now permitted"}
{"t":{"$date":"2021-04-15T13:16:53.216+01:00"},"s":"I",  "c":"NETWORK",  "id":20120,   "ctx":"ReplCoordExtern-0","msg":"Trying to reconnect","attr":{"connString":"127.0.0.1:27017 failed"}}
{"t":{"$date":"2021-04-15T13:16:53.216+01:00"},"s":"I",  "c":"NETWORK",  "id":20121,   "ctx":"ReplCoordExtern-0","msg":"Reconnect attempt failed","attr":{"connString":"127.0.0.1:27017 failed","error":""}}
{"t":{"$date":"2021-04-15T13:16:53.216+01:00"},"s":"I",  "c":"-",        "id":4615600, "ctx":"ReplCoordExtern-0","msg":"caught exception in destructor","attr":{"exception":"socket exception [CONNECT_ERROR] server [socket exception [FAILED_STATE] server [127.0.0.1:27017 failed]]","function":"kill"}}
{"t":{"$date":"2021-04-15T13:16:53.216+01:00"},"s":"I",  "c":"REPL",     "id":21094,   "ctx":"BackgroundSync","msg":"Replication producer stopped after oplog fetcher finished returning a batch from our sync source. Abandoning this batch of oplog entries and re-evaluating our sync source"}
{"t":{"$date":"2021-04-15T13:16:55.213+01:00"},"s":"F",  "c":"-",        "id":23079,   "ctx":"conn6","msg":"Invariant failure","attr":{"expr":"opTime.opTime.isNull() || opTime.wallTime > Date_t()","file":"src/mongo/db/repl/member_data.cpp","line":175}}
{"t":{"$date":"2021-04-15T13:16:55.213+01:00"},"s":"F",  "c":"-",        "id":23080,   "ctx":"conn6","msg":"\n\n***aborting after invariant() failure\n\n"}
{"t":{"$date":"2021-04-15T13:16:55.213+01:00"},"s":"F",  "c":"CONTROL",  "id":4757800, "ctx":"conn6","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}
{"t":{"$date":"2021-04-15T13:16:55.213+01:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"Connection accepted","attr":{"remote":"127.0.0.1:37538","connectionId":14,"connectionCount":4}}
{"t":{"$date":"2021-04-15T13:16:55.213+01:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn14","msg":"client metadata","attr":{"remote":"127.0.0.1:37538","client":"conn14","doc":{"driver":{"name":"NetworkInterfaceTL","version":"4.2.13"},"os":{"type":"Linux","name":"Red Hat Enterprise Linux Server release 6.8 (Santiago)","architecture":"x86_64","version":"Kernel 2.6.32-642.11.1.el6.x86_64"}}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31431,   "ctx":"conn6","msg":"BACKTRACE: {bt}","attr":{"bt":{"backtrace":[{"a":"7FEE165E5101","b":"7FEE138DC000","o":"2D09101","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.606","s+":"1E1"},{"a":"7FEE165E6739","b":"7FEE138DC000","o":"2D0A739","s":"_ZN5mongo15printStackTraceEv","s+":"29"},{"a":"7FEE165E3F96","b":"7FEE138DC000","o":"2D07F96","s":"_ZN5mongo12_GLOBAL__N_116abruptQuitActionEiP7siginfoPv","s+":"66"},{"a":"7FEE11F257E0","b":"7FEE11F16000","o":"F7E0","s":"_L_unlock_16","s+":"2D"},{"a":"7FEE11BB45E5","b":"7FEE11B82000","o":"325E5","s":"gsignal","s+":"35"},{"a":"7FEE11BB5DC5","b":"7FEE11B82000","o":"33DC5","s":"abort","s+":"175"},{"a":"7FEE147F6AB3","b":"7FEE138DC000","o":"F1AAB3","s":"_ZN5mongo15invariantFailedEPKcS1_j","s+":"12C"},{"a":"7FEE145489D0","b":"7FEE138DC000","o":"C6C9D0","s":"_ZN5mongo4repl10MemberData35advanceLastDurableOpTimeAndWallTimeENS0_17OpTimeAndWallTimeENS_6Date_tE.cold.338","s+":"18"},{"a":"7FEE14C6CC33","b":"7FEE138DC000","o":"1390C33","s":"_ZN5mongo4repl19TopologyCoordinator13setLastOptimeERKNS0_18UpdatePositionArgs10UpdateInfoENS_6Date_tEPx","s+":"2E3"},{"a":"7FEE14BBA0C5","b":"7FEE138DC000","o":"12DE0C5","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl14_setLastOptimeENS_8WithLockERKNS0_18UpdatePositionArgs10UpdateInfoEPx","s+":"55"},{"a":"7FEE14BBA22F","b":"7FEE138DC000","o":"12DE22F","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl28processReplSetUpdatePositionERKNS0_18UpdatePositionArgsEPx","s+":"AF"},{"a":"7FEE14B74777","b":"7FEE138DC000","o":"1298777","s":"_ZN5mongo4repl24CmdReplSetUpdatePosition3runEPNS_16OperationContextERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS_7BSONObjERNS_14BSONObjBuilderE","s+":"167"},{"a":"7FEE155500CA","b":"7FEE138DC000","o":"1C740CA","s":"_ZN5mongo37BasicCommandWithReplyBuilderInterface10Invocation3runEPNS_16OperationContextEPNS_3rpc21ReplyBuilderInterfaceE","s+":"11A"},{"a":"7FEE1554A60F","b":"7FEE138DC000","o":"1C6E60F","s":"_ZN5mongo14CommandHelpers20runCommandInvocationEPNS_16OperationContextERKNS_12OpMsgRequestEPNS_17CommandInvocationEPNS_3rpc21ReplyBuilderInterfaceE","s+":"7F"},{"a":"7FEE14CF1AC5","b":"7FEE138DC000","o":"1415AC5","s":"_ZN5mongo12_GLOBAL__N_114runCommandImplEPNS_16OperationContextEPNS_17CommandInvocationERKNS_12OpMsgRequestEPNS_3rpc21ReplyBuilderInterfaceENS_11LogicalTimeERKNS_23ServiceEntryPointCommon5HooksEPNS_14BSONObjBuilderERKNS_30OperationSessionInfoFromClientE","s+":"135"},{"a":"7FEE14CF4929","b":"7FEE138DC000","o":"1418929","s":"_ZN5mongo12_GLOBAL__N_119execCommandDatabaseEPNS_16OperationContextEPNS_7CommandERKNS_12OpMsgRequestEPNS_3rpc21ReplyBuilderInterfaceERKNS_23ServiceEntryPointCommon5HooksE","s+":"11B9"},{"a":"7FEE14CF5C10","b":"7FEE138DC000","o":"1419C10","s":"_ZN5mongo12_GLOBAL__N_116receivedCommandsEPNS_16OperationContextERKNS_7MessageERKNS_23ServiceEntryPointCommon5HooksE","s+":"650"},{"a":"7FEE14CF689D","b":"7FEE138DC000","o":"141A89D","s":"_ZN5mongo23ServiceEntryPointCommon13handleRequestEPNS_16OperationContextERKNS_7MessageERKNS0_5HooksE","s+":"50D"},{"a":"7FEE14CE434C","b":"7FEE138DC000","o":"140834C","s":"_ZN5mongo23ServiceEntryPointMongod13handleRequestEPNS_16OperationContextERKNS_7MessageE","s+":"3C"},{"a":"7FEE14CEEDD5","b":"7FEE138DC000","o":"1412DD5","s":"_ZN5mongo19ServiceStateMachine15_processMessageENS0_11ThreadGuardE","s+":"F5"},{"a":"7FEE14CECAE5","b":"7FEE138DC000","o":"1410AE5","s":"_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE","s+":"125"},{"a":"7FEE14CEDA76","b":"7FEE138DC000","o":"1411A76","s":"_ZNSt17_Function_handlerIFvvEZN5mongo19ServiceStateMachine22_scheduleNextWithGuardENS2_11ThreadGuardENS1_9transport15ServiceExecutor13ScheduleFlagsENS4_23ServiceExecutorTaskNameENS2_9OwnershipEEUlvE_E9_M_invokeERKSt9_Any_data","s+":"56"},{"a":"7FEE15FABBE2","b":"7FEE138DC000","o":"26CFBE2","s":"_ZN5mongo9transport26ServiceExecutorSynchronous8scheduleESt8functionIFvvEENS0_15ServiceExecutor13ScheduleFlagsENS0_23ServiceExecutorTaskNameE","s+":"182"},{"a":"7FEE14CEBEFB","b":"7FEE138DC000","o":"140FEFB","s":"_ZN5mongo19ServiceStateMachine22_scheduleNextWithGuardENS0_11ThreadGuardENS_9transport15ServiceExecutor13ScheduleFlagsENS2_23ServiceExecutorTaskNameENS0_9OwnershipE","s+":"DB"},{"a":"7FEE14CED25D","b":"7FEE138DC000","o":"141125D","s":"_ZN5mongo19ServiceStateMachine15_sourceCallbackENS_6StatusE","s+":"6AD"},{"a":"7FEE14CED340","b":"7FEE138DC000","o":"1411340","s":"_ZN5mongo14future_details4callIRZNS_19ServiceStateMachine14_sourceMessageENS2_11ThreadGuardEEUlNS_10StatusWithINS_7MessageEEEE0_S6_EEDaOT_OT0_","s+":"60"},{"a":"7FEE14CED5E5","b":"7FEE138DC000","o":"14115E5","s":"_ZN5mongo19ServiceStateMachine14_sourceMessageENS0_11ThreadGuardE","s+":"145"},{"a":"7FEE14CECB5A","b":"7FEE138DC000","o":"1410B5A","s":"_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE","s+":"19A"},{"a":"7FEE14CEDA76","b":"7FEE138DC000","o":"1411A76","s":"_ZNSt17_Function_handlerIFvvEZN5mongo19ServiceStateMachine22_scheduleNextWithGuardENS2_11ThreadGuardENS1_9transport15ServiceExecutor13ScheduleFlagsENS4_23ServiceExecutorTaskNameENS2_9OwnershipEEUlvE_E9_M_invokeERKSt9_Any_data","s+":"56"},{"a":"7FEE15FAC248","b":"7FEE138DC000","o":"26D0248","s":"_ZNSt17_Function_handlerIFvvEZN5mongo9transport26ServiceExecutorSynchronous8scheduleESt8functionIS0_ENS2_15ServiceExecutor13ScheduleFlagsENS2_23ServiceExecutorTaskNameEEUlvE0_E9_M_invokeERKSt9_Any_data","s+":"B8"},{"a":"7FEE162FC706","b":"7FEE138DC000","o":"2A20706","s":"_ZNSt17_Function_handlerIFvvEZN5mongo25launchServiceWorkerThreadESt8functionIS0_EEUlvE1_E9_M_invokeERKSt9_Any_data","s+":"56"},{"a":"7FEE162FC774","b":"7FEE138DC000","o":"2A20774","s":"_ZN5mongo12_GLOBAL__N_17runFuncEPv","s+":"14"},{"a":"7FEE11F1DAA1","b":"7FEE11F16000","o":"7AA1","s":"start_thread","s+":"D1"},{"a":"7FEE11C6AAAD","b":"7FEE11B82000","o":"E8AAD","s":"clone","s+":"6D"}],"processInfo":{"mongodbVersion":"4.4.4","gitVersion":"8db30a63db1a9d84bdcad0c83369623f708e0397","compiledModules":[],"uname":{"sysname":"Linux","release":"2.6.32-642.3.1.el6.x86_64","version":"#1 SMP Sun Jun 26 18:16:44 EDT 2016","machine":"x86_64"},"somap":[{"b":"7FEE138DC000","elfType":3,"buildId":"58BE9325EB37EA25F1A25D0963EAF5DB9C4F4A67"}]}}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE165E5101","b":"7FEE138DC000","o":"2D09101","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.606","s+":"1E1"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE165E6739","b":"7FEE138DC000","o":"2D0A739","s":"_ZN5mongo15printStackTraceEv","s+":"29"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE165E3F96","b":"7FEE138DC000","o":"2D07F96","s":"_ZN5mongo12_GLOBAL__N_116abruptQuitActionEiP7siginfoPv","s+":"66"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE11F257E0","b":"7FEE11F16000","o":"F7E0","s":"_L_unlock_16","s+":"2D"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE11BB45E5","b":"7FEE11B82000","o":"325E5","s":"gsignal","s+":"35"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE11BB5DC5","b":"7FEE11B82000","o":"33DC5","s":"abort","s+":"175"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE147F6AB3","b":"7FEE138DC000","o":"F1AAB3","s":"_ZN5mongo15invariantFailedEPKcS1_j","s+":"12C"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE145489D0","b":"7FEE138DC000","o":"C6C9D0","s":"_ZN5mongo4repl10MemberData35advanceLastDurableOpTimeAndWallTimeENS0_17OpTimeAndWallTimeENS_6Date_tE.cold.338","s+":"18"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE14C6CC33","b":"7FEE138DC000","o":"1390C33","s":"_ZN5mongo4repl19TopologyCoordinator13setLastOptimeERKNS0_18UpdatePositionArgs10UpdateInfoENS_6Date_tEPx","s+":"2E3"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE14BBA0C5","b":"7FEE138DC000","o":"12DE0C5","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl14_setLastOptimeENS_8WithLockERKNS0_18UpdatePositionArgs10UpdateInfoEPx","s+":"55"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE14BBA22F","b":"7FEE138DC000","o":"12DE22F","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl28processReplSetUpdatePositionERKNS0_18UpdatePositionArgsEPx","s+":"AF"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE14B74777","b":"7FEE138DC000","o":"1298777","s":"_ZN5mongo4repl24CmdReplSetUpdatePosition3runEPNS_16OperationContextERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS_7BSONObjERNS_14BSONObjBuilderE","s+":"167"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE155500CA","b":"7FEE138DC000","o":"1C740CA","s":"_ZN5mongo37BasicCommandWithReplyBuilderInterface10Invocation3runEPNS_16OperationContextEPNS_3rpc21ReplyBuilderInterfaceE","s+":"11A"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE1554A60F","b":"7FEE138DC000","o":"1C6E60F","s":"_ZN5mongo14CommandHelpers20runCommandInvocationEPNS_16OperationContextERKNS_12OpMsgRequestEPNS_17CommandInvocationEPNS_3rpc21ReplyBuilderInterfaceE","s+":"7F"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE14CF1AC5","b":"7FEE138DC000","o":"1415AC5","s":"_ZN5mongo12_GLOBAL__N_114runCommandImplEPNS_16OperationContextEPNS_17CommandInvocationERKNS_12OpMsgRequestEPNS_3rpc21ReplyBuilderInterfaceENS_11LogicalTimeERKNS_23ServiceEntryPointCommon5HooksEPNS_14BSONObjBuilderERKNS_30OperationSessionInfoFromClientE","s+":"135"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE14CF4929","b":"7FEE138DC000","o":"1418929","s":"_ZN5mongo12_GLOBAL__N_119execCommandDatabaseEPNS_16OperationContextEPNS_7CommandERKNS_12OpMsgRequestEPNS_3rpc21ReplyBuilderInterfaceERKNS_23ServiceEntryPointCommon5HooksE","s+":"11B9"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE14CF5C10","b":"7FEE138DC000","o":"1419C10","s":"_ZN5mongo12_GLOBAL__N_116receivedCommandsEPNS_16OperationContextERKNS_7MessageERKNS_23ServiceEntryPointCommon5HooksE","s+":"650"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE14CF689D","b":"7FEE138DC000","o":"141A89D","s":"_ZN5mongo23ServiceEntryPointCommon13handleRequestEPNS_16OperationContextERKNS_7MessageERKNS0_5HooksE","s+":"50D"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE14CE434C","b":"7FEE138DC000","o":"140834C","s":"_ZN5mongo23ServiceEntryPointMongod13handleRequestEPNS_16OperationContextERKNS_7MessageE","s+":"3C"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE14CEEDD5","b":"7FEE138DC000","o":"1412DD5","s":"_ZN5mongo19ServiceStateMachine15_processMessageENS0_11ThreadGuardE","s+":"F5"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE14CECAE5","b":"7FEE138DC000","o":"1410AE5","s":"_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE","s+":"125"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE14CEDA76","b":"7FEE138DC000","o":"1411A76","s":"_ZNSt17_Function_handlerIFvvEZN5mongo19ServiceStateMachine22_scheduleNextWithGuardENS2_11ThreadGuardENS1_9transport15ServiceExecutor13ScheduleFlagsENS4_23ServiceExecutorTaskNameENS2_9OwnershipEEUlvE_E9_M_invokeERKSt9_Any_data","s+":"56"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE15FABBE2","b":"7FEE138DC000","o":"26CFBE2","s":"_ZN5mongo9transport26ServiceExecutorSynchronous8scheduleESt8functionIFvvEENS0_15ServiceExecutor13ScheduleFlagsENS0_23ServiceExecutorTaskNameE","s+":"182"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE14CEBEFB","b":"7FEE138DC000","o":"140FEFB","s":"_ZN5mongo19ServiceStateMachine22_scheduleNextWithGuardENS0_11ThreadGuardENS_9transport15ServiceExecutor13ScheduleFlagsENS2_23ServiceExecutorTaskNameENS0_9OwnershipE","s+":"DB"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE14CED25D","b":"7FEE138DC000","o":"141125D","s":"_ZN5mongo19ServiceStateMachine15_sourceCallbackENS_6StatusE","s+":"6AD"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE14CED340","b":"7FEE138DC000","o":"1411340","s":"_ZN5mongo14future_details4callIRZNS_19ServiceStateMachine14_sourceMessageENS2_11ThreadGuardEEUlNS_10StatusWithINS_7MessageEEEE0_S6_EEDaOT_OT0_","s+":"60"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE14CED5E5","b":"7FEE138DC000","o":"14115E5","s":"_ZN5mongo19ServiceStateMachine14_sourceMessageENS0_11ThreadGuardE","s+":"145"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE14CECB5A","b":"7FEE138DC000","o":"1410B5A","s":"_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE","s+":"19A"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE14CEDA76","b":"7FEE138DC000","o":"1411A76","s":"_ZNSt17_Function_handlerIFvvEZN5mongo19ServiceStateMachine22_scheduleNextWithGuardENS2_11ThreadGuardENS1_9transport15ServiceExecutor13ScheduleFlagsENS4_23ServiceExecutorTaskNameENS2_9OwnershipEEUlvE_E9_M_invokeERKSt9_Any_data","s+":"56"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE15FAC248","b":"7FEE138DC000","o":"26D0248","s":"_ZNSt17_Function_handlerIFvvEZN5mongo9transport26ServiceExecutorSynchronous8scheduleESt8functionIS0_ENS2_15ServiceExecutor13ScheduleFlagsENS2_23ServiceExecutorTaskNameEEUlvE0_E9_M_invokeERKSt9_Any_data","s+":"B8"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE162FC706","b":"7FEE138DC000","o":"2A20706","s":"_ZNSt17_Function_handlerIFvvEZN5mongo25launchServiceWorkerThreadESt8functionIS0_EEUlvE1_E9_M_invokeERKSt9_Any_data","s+":"56"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE162FC774","b":"7FEE138DC000","o":"2A20774","s":"_ZN5mongo12_GLOBAL__N_17runFuncEPv","s+":"14"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE11F1DAA1","b":"7FEE11F16000","o":"7AA1","s":"start_thread","s+":"D1"}}}
{"t":{"$date":"2021-04-15T13:16:55.376+01:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"conn6","msg":"  Frame: {frame}","attr":{"frame":{"a":"7FEE11C6AAAD","b":"7FEE11B82000","o":"E8AAD","s":"clone","s+":"6D"}}}

IP INFO, not the original ip's:
127.0.0.1: PRIMARY 4.2.13
127.0.0.2: SECONDARY 4.4.4
127.0.0.3: ABRITER 4.2.13

Replicaset conf:

 {
        "_id" : "myid",
        "version" : 1396130,
        "protocolVersion" : NumberLong(1),
        "writeConcernMajorityJournalDefault" : true,
        "members" : [
                {
                        "_id" : 0,
                        "host" : "127.0.0.1:27017",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 2,
                        "tags" : {                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                },
                {
                        "_id" : 1,
                        "host" : "127.0.0.2:27017",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 2,
                        "tags" : {                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                },
                {
                        "_id" : 3,
                        "host" : "127.0.0.3:27018",
                        "arbiterOnly" : true,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 0,
                        "tags" : {                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                }
        ],
        "settings" : {
                "chainingAllowed" : true,
                "heartbeatIntervalMillis" : 2000,
                "heartbeatTimeoutSecs" : 10,
                "electionTimeoutMillis" : 10000,
                "catchUpTimeoutMillis" : -1,
                "catchUpTakeoverDelayMillis" : 30000,
                "getLastErrorModes" : {                },
                "getLastErrorDefaults" : {
                        "w" : 1,
                        "wtimeout" : 0
                }
        }
}

I already googled for a solution but now i am stuck. This is a quality enviroment and i want to upgrade PRD, but first i need to understand the problem.



 Comments   
Comment by Tin Cvitković [ 15/Feb/23 ]

Any solution yet for anyone ?

I am experiencing same issues running 4.4.18 version of MongoDB, same error happens during step down and one of the nodes gets killed and I can't find a reason why

Has anyone tried upgrading to newer versions ?

Comment by Augusto Fonseca [ 21/Apr/21 ]

Upgrading arbiter problem persists. I hope its nothing to do with timezones .

Comment by Augusto Fonseca [ 20/Apr/21 ]

Question, should i try update arbiter and then rs.stepDown() on the old node?

Comment by Augusto Fonseca [ 20/Apr/21 ]

Hi, failed again, need a resolution or a workarround . Side note, i didnt upgrade arbiter yet.

Comment by Augusto Fonseca [ 19/Apr/21 ]

Hi tried to upgrade my PRD cluster and i got the same problem after rs.stepDown(). Sent the logs of all nodes as prd_logs.7z.

To clarify, 

primary: rs1logmongo.log

secondary: rs2logmongo.log

alternative_secondary_cannot_be_primary: rs3logmongo.log

arbiter: arblogmongo.log

 

Did rs.stepDown() on primary @2021-04-19T18:32:46

secondary crashs 2021-04-19T18:32:47

 

Will try tomorow to upgrade.

Comment by Augusto Fonseca [ 16/Apr/21 ]

Uploaded the logs, the node that crashed is the one with log name "rs2logmongo.log".

Started node @11:58:37

Step Down @12:27:18

Error @12:27:19.643

Then i tried to use the version 4.4.4 and observed the same error.

Comment by Augusto Fonseca [ 16/Apr/21 ]

In my case, was allways a arbiter. I had to "resync" data to wiretiger because he was using MMAPv1, but this was on 3.x version. So its wierd.

I will check for the time that i did the operation and upload the logs.

Comment by Augusto Fonseca [ 16/Apr/21 ]

Let me check.

Comment by Bruce Lucas (Inactive) [ 16/Apr/21 ]

s0ulhunt3rpt@gmail.com, there's another user who seems to be hitting the same issue (SERVER-54876). The cause is unknown but one theory relates to the possibility of oplog entries in the arbiter written by earlier versions of mongod when the arbiter was a data-bearing secondary, before being made an arbiter. Is that a possible scenario in your case?

Also, even though you didn't reproduce the problem again, is it possible to upload the log files covering a previous occurrence of the problem?

Comment by Augusto Fonseca [ 16/Apr/21 ]

BTW what are you suspecting the problem was?

Comment by Augusto Fonseca [ 16/Apr/21 ]

Well, will leave this during the weekend, but this time i did the upgrade and rs.stepDown() and did not crash, and with this, i was able to upgrade all nodes without any problem. Will not execute "db.adminCommand( { setFeatureCompatibilityVersion: "4.4" } )" for now, if i want to go back and if any problems happens.

Will say something next week.

 I will not upload the logs because i was expectic crash.

Comment by Augusto Fonseca [ 16/Apr/21 ]

Hi Bruce, this replicaset has authentication, so i cant execute admin commands on arbiter.

I executed the command's on primary.

xxx:PRIMARY> use local
switched to db local
xxx:PRIMARY> show collections
me
oplog.rs
replset.election
replset.minvalid
replset.oplogTruncateAfterPoint
startup_log
system.replset
system.rollback.id
xxx:PRIMARY> db.oplog.rs.find().sort({$natural: -1}).limit(1)
{ "ts" : Timestamp(1618584665, 1), "t" : NumberLong(27), "h" : NumberLong(0), "v" : 2, "op" : "n", "ns" : "", "wall" : ISODate("2021-04-16T14:51:05.737Z"), "o" : { "msg" : "periodic noop" } }

Comment by Bruce Lucas (Inactive) [ 16/Apr/21 ]

s0ulhunt3rpt@gmail.com, thank you for the problem report.

Can you please upload the complete log files for all three nodes to this secure private portal?

Also, can you please connect to the arbiter with the mongo shell show us whether there is an oplog, and if so what the last entry is:

use local
show collections
db.oplog.rs.find().sort({$natural: -1}).limit(1)

Generated at Thu Feb 08 05:38:25 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.