[SERVER-54876] Upgrade mongodb 4.4.4 from 4.2.12 failed Created: 02/Mar/21  Updated: 06/Dec/22  Resolved: 12/May/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.3, 4.4.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Nicola Battista Assignee: Backlog - Replication Team
Resolution: Done Votes: 0
Labels: Upgrade/Downgrade, post-rc0
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Vmware virtual machines Centos 7.9.
Replica set 7 nodes ( six db and one arbiter).


Attachments: File mongod.log.rpmsave.gz     Zip Archive mongod.log.rpmsave.zip    
Issue Links:
Backports
Duplicate
is duplicated by SERVER-56129 Problem on rs.stepDown() when upgradi... Closed
Issue split
split to SERVER-56619 Arbiters should explicitly clear thei... Backlog
split to SERVER-56620 Nodes should explicitly clear arbiter... Closed
Assigned Teams:
Replication
Operating System: ALL
Backport Requested:
v4.4
Sprint: Repl 2021-04-19, Repl 2021-05-03, Repl 2021-05-17
Participants:

 Description   

Hi,

we have a replica set with six MongoDB data-bearing nodes and one arbiter.

I tried a rolling upgrade from 4.2.12 to 4.4.4 and performed the following steps:

  1. Upgraded one secondary node from 4.2.12 to 4.4.4; the node worked perfectly as a secondary.
  2. Switched the primary over from a 4.2.12 node to the upgraded 4.4.4 node (see the sketch below), and the 4.4.4 node crashed.
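
For reference, a minimal mongo-shell sketch of how step 2 is typically done; this assumes default election settings and is not necessarily the exact procedure used here:

// On the current 4.2.12 primary: check the cluster-wide featureCompatibilityVersion first.
db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 })

// Step down for 120 seconds so the upgraded 4.4.4 secondary can win the election.
rs.stepDown(120)

// Afterwards, confirm which member is now primary.
rs.status().members.forEach(function (m) { print(m.name + "  " + m.stateStr); })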

This is the log output just before the node crashed:

{"t":\{"$date":"2021-03-02T08:08:19.684+00:00"},"s":"I", "c":"CONTROL", "id":31431, "ctx":"conn1407","msg":"BACKTRACE: \{bt}","attr":\{"bt":{"backtrace":[{"a":"559B86781BE1","b":"559B83A79000","o":"2D08BE1","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.606","s+":"1E1"},\{"a":"559B86783219","b":"559B83A79000","o":"2D0A219","s":"_ZN5mongo15printStackTraceEv","s+":"29"},\{"a":"559B86780A76","b":"559B83A79000","o":"2D07A76","s":"_ZN5mongo12_GLOBAL__N_116abruptQuitActionEiP9siginfo_tPv","s+":"66"},\{"a":"7F336A2F1630","b":"7F336A2E2000","o":"F630","s":"_L_unlock_13","s+":"34"},\{"a":"7F3369F4A3D7","b":"7F3369F14000","o":"363D7","s":"gsignal","s+":"37"},\{"a":"7F3369F4BAC8","b":"7F3369F14000","o":"37AC8","s":"abort","s+":"148"},\{"a":"559B84993677","b":"559B83A79000","o":"F1A677","s":"_ZN5mongo15invariantFailedEPKcS1_j","s+":"12C"},\{"a":"559B846E5560","b":"559B83A79000","o":"C6C560","s":"_ZN5mongo4repl10MemberData35advanceLastDurableOpTimeAndWallTimeENS0_17OpTimeAndWallTimeENS_6Date_tE.cold.338","s+":"18"},\{"a":"559B84E097D3","b":"559B83A79000","o":"13907D3","s":"_ZN5mongo4repl19TopologyCoordinator13setLastOptimeERKNS0_18UpdatePositionArgs10UpdateInfoENS_6Date_tEPx","s+":"2E3"},\{"a":"559B84D56C65","b":"559B83A79000","o":"12DDC65","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl14_setLastOptimeENS_8WithLockERKNS0_18UpdatePositionArgs10UpdateInfoEPx","s+":"55"},\{"a":"559B84D56DCF","b":"559B83A79000","o":"12DDDCF","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl28processReplSetUpdatePositionERKNS0_18UpdatePositionArgsEPx","s+":"AF"},\{"a":"559B84D11317","b":"559B83A79000","o":"1298317","s":"_ZN5mongo4repl24CmdReplSetUpdatePosition3runEPNS_16OperationContextERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS_7BSONObjERNS_14BSONObjBuilderE","s+":"167"},\{"a":"559B856EC64A","b":"559B83A79000","o":"1C7364A","s":"_ZN5mongo37BasicCommandWithReplyBuilderInterface10Invocation3runEPNS_16OperationContextEPNS_3rpc21ReplyBuilderInterfaceE","s+":"11A"},\{"a":"559B856E6B8F","b":"559B83A79000","o":"1C6DB8F","s":"_ZN5mongo14CommandHelpers20runCommandInvocationEPNS_16OperationContextERKNS_12OpMsgRequestEPNS_17CommandInvocationEPNS_3rpc21ReplyBuilderInterfaceE","s+":"7F"},\{"a":"559B84E8E665","b":"559B83A79000","o":"1415665","s":"_ZN5mongo12_GLOBAL__N_114runCommandImplEPNS_16OperationContextEPNS_17CommandInvocationERKNS_12OpMsgRequestEPNS_3rpc21ReplyBuilderInterfaceENS_11LogicalTimeERKNS_23ServiceEntryPointCommon5HooksEPNS_14BSONObjBuilderERKNS_30OperationSessionInfoFromClientE","s+":"135"},\{"a":"559B84E914C9","b":"559B83A79000","o":"14184C9","s":"_ZN5mongo12_GLOBAL__N_119execCommandDatabaseEPNS_16OperationContextEPNS_7CommandERKNS_12OpMsgRequestEPNS_3rpc21ReplyBuilderInterfaceERKNS_23ServiceEntryPointCommon5HooksE","s+":"11B9"},\{"a":"559B84E927B0","b":"559B83A79000","o":"14197B0","s":"_ZN5mongo12_GLOBAL__N_116receivedCommandsEPNS_16OperationContextERKNS_7MessageERKNS_23ServiceEntryPointCommon5HooksE","s+":"650"},\{"a":"559B84E9343D","b":"559B83A79000","o":"141A43D","s":"_ZN5mongo23ServiceEntryPointCommon13handleRequestEPNS_16OperationContextERKNS_7MessageERKNS0_5HooksE","s+":"50D"},\{"a":"559B84E80EEC","b":"559B83A79000","o":"1407EEC","s":"_ZN5mongo23ServiceEntryPointMongod13handleRequestEPNS_16OperationContextERKNS_7MessageE","s+":"3C"},\{"a":"559B84E8B975","b":"559B83A79000","o":"1412975","s":"_ZN5mongo19ServiceStateMachine15_processMessageENS0_11ThreadGuardE","s+":"F5"},\{"a":"559B84E89685","b":"559B83A79000","o":"1410685","s":"_ZN
5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE","s+":"125"},\{"a":"559B84E8A616","b":"559B83A79000","o":"1411616","s":"_ZNSt17_Function_handlerIFvvEZN5mongo19ServiceStateMachine22_scheduleNextWithGuardENS2_11ThreadGuardENS1_9transport15ServiceExecutor13ScheduleFlagsENS4_23ServiceExecutorTaskNameENS2_9OwnershipEEUlvE_E9_M_invokeERKSt9_Any_data","s+":"56"},\{"a":"559B86148162","b":"559B83A79000","o":"26CF162","s":"_ZN5mongo9transport26ServiceExecutorSynchronous8scheduleESt8functionIFvvEENS0_15ServiceExecutor13ScheduleFlagsENS0_23ServiceExecutorTaskNameE","s+":"182"},\{"a":"559B84E88A9B","b":"559B83A79000","o":"140FA9B","s":"_ZN5mongo19ServiceStateMachine22_scheduleNextWithGuardENS0_11ThreadGuardENS_9transport15ServiceExecutor13ScheduleFlagsENS2_23ServiceExecutorTaskNameENS0_9OwnershipE","s+":"DB"},\{"a":"559B84E89DFD","b":"559B83A79000","o":"1410DFD","s":"_ZN5mongo19ServiceStateMachine15_sourceCallbackENS_6StatusE","s+":"6AD"},\{"a":"559B84E89EE0","b":"559B83A79000","o":"1410EE0","s":"_ZN5mongo14future_details4callIRZNS_19ServiceStateMachine14_sourceMessageENS2_11ThreadGuardEEUlNS_10StatusWithINS_7MessageEEEE0_S6_EEDaOT_OT0_","s+":"60"},\{"a":"559B84E8A185","b":"559B83A79000","o":"1411185","s":"_ZN5mongo19ServiceStateMachine14_sourceMessageENS0_11ThreadGuardE","s+":"145"},\{"a":"559B84E896FA","b":"559B83A79000","o":"14106FA","s":"_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE","s+":"19A"},\{"a":"559B84E8A616","b":"559B83A79000","o":"1411616","s":"_ZNSt17_Function_handlerIFvvEZN5mongo19ServiceStateMachine22_scheduleNextWithGuardENS2_11ThreadGuardENS1_9transport15ServiceExecutor13ScheduleFlagsENS4_23ServiceExecutorTaskNameENS2_9OwnershipEEUlvE_E9_M_invokeERKSt9_Any_data","s+":"56"},\{"a":"559B861487C8","b":"559B83A79000","o":"26CF7C8","s":"_ZNSt17_Function_handlerIFvvEZN5mongo9transport26ServiceExecutorSynchronous8scheduleESt8functionIS0_ENS2_15ServiceExecutor13ScheduleFlagsENS2_23ServiceExecutorTaskNameEEUlvE0_E9_M_invokeERKSt9_Any_data","s+":"B8"},\{"a":"559B864991E6","b":"559B83A79000","o":"2A201E6","s":"_ZNSt17_Function_handlerIFvvEZN5mongo25launchServiceWorkerThreadESt8functionIS0_EEUlvE1_E9_M_invokeERKSt9_Any_data","s+":"56"},\{"a":"559B86499254","b":"559B83A79000","o":"2A20254","s":"_ZN5mongo12_GLOBAL__N_17runFuncEPv","s+":"14"},\{"a":"7F336A2E9EA5","b":"7F336A2E2000","o":"7EA5","s":"start_thread","s+":"C5"},\{"a":"7F336A0129FD","b":"7F3369F14000","o":"FE9FD","s":"clone","s+":"6D"}],"processInfo":\{"mongodbVersion":"4.4.4","gitVersion":"8db30a63db1a9d84bdcad0c83369623f708e0397","compiledModules":[],"uname":{"sysname":"Linux","release":"3.10.0-1160.15.2.el7.x86_64","version":"#1 SMP Wed Feb 3 15:06:38 UTC 2021","machine":"x86_64"},"somap":[\{"b":"559B83A79000","elfType":3,"buildId":"CC93F86C6D00CE717DB23676747D258B4027E066"},\{"b":"7F336A2E2000","path":"/lib64/libpthread.so.0","elfType":3,"buildId":"E10CC8F2B932FC3DAEDA22F8DAC5EBB969524E5B"},\{"b":"7F3369F14000","path":"/lib64/libc.so.6","elfType":3,"buildId":"7CD4A08C18C60E1E2EA1AEBB88C9379BD7289D38"}]}}}} 

I also tried version 4.4.3 but got the same errors.

 



 Comments   
Comment by Matthew Russotto [ 12/May/21 ]

Investigation is complete; fix will be tracked as SERVER-56619 and SERVER-56620.

Comment by Matthew Russotto [ 04/May/21 ]

It turns out we can get an optime/wall time which is a "ghost in the machine". If an arbiter gets a durable optime/wall time at any point (which can happen if it goes into REMOVED while it has an oplog entry; there may be other bugs which can cause this), that optime will be passed to the other members of the set. Even if the arbiter is subsequently restarted or even resynced, that durable optime will be greater than the (correct) null durable optime the arbiter is sending in heartbeats, so it will remain in the state of the other nodes in the system, which will pass it around using replSetUpdatePosition.
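
One way to observe the stale value from the outside is to look at the durable optime the data-bearing nodes have recorded for the arbiter. A minimal mongo-shell sketch, assuming rs.status() on the affected build reports optimeDurable for remote members:

// Run on any data-bearing member. A healthy arbiter should show Timestamp(0, 0);
// a real optime here is the "ghost" value described above.
rs.status().members
    .filter(function (m) { return m.state === 7; })        // state 7 = ARBITER
    .forEach(function (m) {
        printjson({ name: m.name,
                    optimeDurable: m.optimeDurable,
                    optimeDurableDate: m.optimeDurableDate });
    });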

Comment by Nicola Battista [ 07/Apr/21 ]

Hi Matthew,

On the Arbiter :

xxx-mongodb:ARBITER> use local
switched to db local
xxx-mongodb:ARBITER> show collections;
oplog.rs
replset.election
replset.minvalid
replset.oplogTruncateAfterPoint
startup_log
system.replset
system.rollback.id
xxx-mongodb:ARBITER> db.oplog.rs.find()
xxx-mongodb:ARBITER>
xxx-mongodb:ARBITER>

 

Regards

Nicola

Comment by Matthew Russotto [ 07/Apr/21 ]

dmitry.agranat
I have been able to reproduce this bug, but it requires a somewhat unusual sequence of events, and I would like to confirm whether that sequence was possible here before going forward.

What would be required is that the arbiter was at one point not an arbiter, and wrote some oplog entries with a walltime of 0 (which could have happened in some previous versions). Then it became an arbiter; we would continue to send that last oplog time as the last durable optime of the arbiter, which would crash a 4.4 node.

So to check this, I would like to see if we can get the last entry in the oplog collection (local.oplog.rs) on the arbiter. There really shouldn't be such a collection; if there isn't, that isn't the problem.
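
A minimal shell sketch of that check, run directly on the arbiter; it simply sorts local.oplog.rs by $natural to get the newest entry, if the collection exists at all:

use local
show collections
// If local.oplog.rs exists, fetch its newest entry and inspect its "wall" field.
db.oplog.rs.find().sort({ $natural: -1 }).limit(1).pretty()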

Comment by Dmitry Agranat [ 17/Mar/21 ]

Thanks nicola.battista89@gmail.com. We're assigning this ticket to the appropriate team to be further investigated. Updates will be posted on this ticket as they happen.

Comment by Nicola Battista [ 17/Mar/21 ]

Hi,

All nodes return this output:

db.adminCommand( { getParameter: 1, featureCompatibilityVersion: 1 } )
{
    "featureCompatibilityVersion" : { "version" : "4.2" },
    "ok" : 1,
    "$clusterTime" : {
        "clusterTime" : Timestamp(1615988483, 684),
        "signature" : {
            "hash" : BinData(0,"EFp6D8mRrtb88nUTJ2zWtnX/SO0="),
            "keyId" : NumberLong("6919393090692186131")
        }
    },
    "operationTime" : Timestamp(1615988483, 683)
}

 

Thanks

Regards.

Nicola

 

Comment by Dmitry Agranat [ 17/Mar/21 ]

Thanks nicola.battista89@gmail.com for uploading the mongodb logs. I have another question: what is the current featureCompatibilityVersion on all 7 members of this replica set? To view the featureCompatibilityVersion for a mongod instance, run the following command on each instance:

db.adminCommand( { getParameter: 1, featureCompatibilityVersion: 1 } )
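
For convenience, a hedged shell sketch that runs the same check against every data-bearing member from a single session; it assumes the shell can reach each host directly and that no authentication is required (with auth, each connection would need credentials):

// Connect to each non-arbiter member and print its reported featureCompatibilityVersion.
rs.status().members
    .filter(function (m) { return m.state !== 7; })        // skip the arbiter (state 7)
    .forEach(function (m) {
        var conn = new Mongo(m.name);
        var fcv = conn.getDB("admin").runCommand({ getParameter: 1, featureCompatibilityVersion: 1 });
        print(m.name + " -> " + tojson(fcv.featureCompatibilityVersion));
    });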

Comment by Nicola Battista [ 17/Mar/21 ]

Hi,

I've sent the mongodb logs, named mongod_tp2_mongo00.log.gz and mongod_tp2_mongo01.log.gz.

Thanks

Regards

Nicola

 

Comment by Dmitry Agranat [ 17/Mar/21 ]

Hi nicola.battista89@gmail.com, the uploaded data only contains the diagnostic.data archive but no mongod logs. Can you upload the mongod logs covering the time of the reported event?

Comment by Nicola Battista [ 15/Mar/21 ]

Hi Dmitry Agranat,

I've sent you the files via the secure portal.

The files are named log_tp2-mongo00.tar.gz and log_tp2-mongo01.tar.gz.

Thank you.

Regards

Nicola

Comment by Nicola Battista [ 08/Mar/21 ]

Hi, since this is a production database, we will reproduce the bug again tomorrow and I'll send you the required logs.

Thank you.

Regards.

Nicola

Comment by Dmitry Agranat [ 08/Mar/21 ]

Hi nicola.battista89@gmail.com,

We still need additional information to diagnose the problem. If this is still an issue for you, would you please compress and upload the following into this secure portal:

  • full mongod logs and the $dbpath/diagnostic.data directory from the original primary, covering the time before and during the upgrade
  • full mongod logs and the $dbpath/diagnostic.data directory from the current primary (which was upgraded from 4.2.12 to 4.4.4 while it was a secondary), covering the time before and during the upgrade

Thanks,
Dima

Comment by Dmitry Agranat [ 02/Mar/21 ]

Hi nicola.battista89@gmail.com, thank you for the report.

For completeness, could you please compress and upload the following into this secure portal:

  • full mongod logs and the $dbpath/diagnostic.data directory from the original primary, covering the time before and during the upgrade
  • full mongod logs and the $dbpath/diagnostic.data directory from the current primary (which was upgraded from 4.2.12 to 4.4.4 while it was a secondary), covering the time before and during the upgrade

Thanks,
Dima
