[SERVER-35795] 3.4 secondaries crashing after some time with 3.6 primary Created: 25/Jun/18  Updated: 08/Jan/24  Resolved: 19/Jul/18

Status: Closed
Project: Core Server
Component/s: Admin
Affects Version/s: 3.6.5
Fix Version/s: 3.6.7

Type: Bug Priority: Critical - P2
Reporter: Sree Himakunthala Assignee: Misha Tyulenev
Resolution: Fixed Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Problem/Incident
Related
related to SERVER-36104 LogicalSessions should destroy cache ... Closed
related to SERVER-36223 Add hook for the fuzzer to not send l... Closed
is related to SERVER-35957 While upgrading from Shard 3.4 to Sha... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2018-07-16, Sharding 2018-07-30
Participants:
Case:
Linked BF Score: 0

 Description   

MongoDB replica set secondaries crashing and failing to restart
I have a 3-node replica set running MongoDB 3.4.13 on Ubuntu 16.
I upgraded one secondary node from 3.4.13 to 3.6.5 and switched it over to become primary. That node became primary and is running fine with no issues.
Soon after that, the other two secondary nodes crashed with the stack trace given below.
I tried to restart them, but I keep getting the same errors and mongod aborts.

Any pointers?

2018-06-21T16:58:00.493-0700 I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory '/xxxx/diagnostic.data'
2018-06-21T16:58:00.499-0700 I REPL     [initandlisten] Replaying stored operations from { ts: Timestamp 1529618536000|1, t: 18 } (exclusive) to { ts: Timestamp 1529618546000|1, t: 18 } (inclusive).
2018-06-21T16:58:00.532-0700 E REPL     [initandlisten] Failed command { create: "system.sessions", idIndex: { v: 2, key: { _id: 1 }, name: "_id_", ns: "config.system.sessions" } } on config with status BadValue: cannot write to 'config.system.sessions' during oplog application
2018-06-21T16:58:00.532-0700 I -        [initandlisten] Fatal assertion 40294 BadValue: cannot write to 'config.system.sessions' at src/mongo/db/repl/replication_coordinator_external_state_impl.cpp 669
2018-06-21T16:58:00.532-0700 I -        [initandlisten]
 
***aborting after fassert() failure
 
 
2018-06-21T16:58:00.570-0700 F -        [initandlisten] Got signal: 6 (Aborted).
 
 0x7efde8e5a6a1 0x7efde8e598b9 0x7efde8e59d9d 0x7efde64dc390 0x7efde6136428 0x7efde613802a 0x7efde8104eb7 0x7efde88c129d 0x7efde88d0423 0x7efde88d1f4c 0x7efde80f28d1 0x7efde8110a46 0x7efde6121830 0x7efde81700f9
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"7EFDE78E8000","o":"15726A1","s":"_ZN5mongo15printStackTraceERSo"},{"b":"7EFDE78E8000","o":"15718B9"},{"b":"7EFDE78E8000","o":"1571D9D"},{"b":"7EFDE64CB000","o":"11390"},{"b":"7EFDE6101000","o":"35428","s":"gsignal"},{"b":"7EFDE6101000","o":"3702A","s":"abort"},{"b":"7EFDE78E8000","o":"81CEB7","s":"_ZN5mongo42fassertFailedWithStatusNoTraceWithLocationEiRKNS_6StatusEPKcj"},{"b":"7EFDE78E8000","o":"FD929D","s":"_ZN5mongo4repl39ReplicationCoordinatorExternalStateImpl21cleanUpLastApplyBatchEPNS_16OperationContextE"},{"b":"7EFDE78E8000","o":"FE8423","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl21_startLoadLocalConfigEPNS_16OperationContextE"},{"b":"7EFDE78E8000","o":"FE9F4C","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl7startupEPNS_16OperationContextE"},{"b":"7EFDE78E8000","o":"80A8D1"},{"b":"7EFDE78E8000","o":"828A46","s":"main"},{"b":"7EFDE6101000","o":"20830","s":"__libc_start_main"},{"b":"7EFDE78E8000","o":"8880F9","s":"_start"}],"processInfo":{ "mongodbVersion" : "3.4.13", "gitVersion" : "fbdef2ccc53e0fcc9afb570063633d992b2aae42", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.13.0-112-generic", "version" : "#159-Ubuntu SMP Fri Mar 3 15:26:07 UTC 2017", "machine" : "x86_64" }, "somap" : [ { "b" : "7EFDE78E8000", "elfType" : 3, "buildId" : "9A34D266FEE22AA39FD157A54318E11A8F6D072B" }, { "b" : "7FFF268AB000", "elfType" : 3, "buildId" : "012E1338BA43AF7C0DC7D069F64F0A6490CC6D9C" }, { "b" : "7EFDE7457000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "DCF10134B91ED2139E3E8C72564668F5CDBA8522" }, { "b" : "7EFDE7013000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "1649272BE0CA9FA22F082DC86372B6C9959779B0" }, { "b" : "7EFDE6E0B000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "89C34D7A182387D76D5CDA1F7718F5D58824DFB3" }, { "b" : "7EFDE6C07000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : 
"8CC8D0D119B142D839800BFF71FB71E73AEA7BD4" }, { "b" : "7EFDE68FE000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "DFB85DE42DAFFD09640C8FE377D572DE3E168920" }, { "b" : "7EFDE66E8000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "68220AE2C65D65C1B6AAA12FA6765A6EC2F5F434" }, { "b" : "7EFDE64CB000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "CE17E023542265FC11D9BC8F534BB4F070493D30" }, { "b" : "7EFDE6101000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "B5381A457906D279073822A5CEB24C4BFEF94DDB" }, { "b" : "7EFDE76C0000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "5D7B6259552275A3C17BD4C3FD05F5A6BF40CAA5" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x7efde8e5a6a1]
 mongod(+0x15718B9) [0x7efde8e598b9]
 mongod(+0x1571D9D) [0x7efde8e59d9d]
 libpthread.so.0(+0x11390) [0x7efde64dc390]
 libc.so.6(gsignal+0x38) [0x7efde6136428]
 libc.so.6(abort+0x16A) [0x7efde613802a]
 mongod(_ZN5mongo42fassertFailedWithStatusNoTraceWithLocationEiRKNS_6StatusEPKcj+0x0) [0x7efde8104eb7]
 mongod(_ZN5mongo4repl39ReplicationCoordinatorExternalStateImpl21cleanUpLastApplyBatchEPNS_16OperationContextE+0xA0D) [0x7efde88c129d]
 mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl21_startLoadLocalConfigEPNS_16OperationContextE+0x363) [0x7efde88d0423]
 mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl7startupEPNS_16OperationContextE+0x1DC) [0x7efde88d1f4c]
 mongod(+0x80A8D1) [0x7efde80f28d1]
 mongod(main+0x966) [0x7efde8110a46]
 libc.so.6(__libc_start_main+0xF0) [0x7efde6121830]
 mongod(_start+0x29) [0x7efde81700f9]
-----  END BACKTRACE  -----

Fix Implementation

1. Return an error status from the SessionsCollection-derived ::setupSessionsCollection methods if the featureCompatibilityVersion is not fully upgraded to 3.6. This status will be checked in the _refresh method: https://github.com/mongodb/mongo/blob/r3.6.6/src/mongo/db/logical_session_cache_impl.cpp#L275-L279. This is the main defense against setting up the sessions collection too early.

2. Return early from _refresh if the FCV is not fully upgraded to 3.6, at https://github.com/mongodb/mongo/blob/r3.6.6/src/mongo/db/logical_session_cache_impl.cpp#L262, i.e. after computing statistics but before calling setupSessionsCollection. This way the logs will not be polluted with repeated messages. Full downgrade support is tracked in SERVER-36104.

3. Not required for this fix, but nice to have while on it: change the return type at https://github.com/mongodb/mongo/blob/r3.6.6/src/mongo/db/initialize_operation_session_info.cpp#L41 from void to Status, and consequently do not uassert inside initializeOperationSessionInfo but rather uassert in the callers.
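A minimal Python sketch of the two guards described above. The actual fix is C++ in logical_session_cache_impl.cpp; the names, the Status class, and the control flow here are simplified stand-ins, not the real server API.

```python
# Model of the FCV guards in the sessions-collection refresh path.
# Guard 1 lives in setup_sessions_collection (the main defense);
# guard 2 short-circuits the periodic refresh before it touches
# the collection, so logs are not flooded with setup failures.

FCV_FULLY_UPGRADED = "3.6"


class Status:
    """Tiny stand-in for mongo::Status."""

    def __init__(self, ok, reason=""):
        self.ok = ok
        self.reason = reason


def setup_sessions_collection(fcv):
    # Guard 1: refuse to create config.system.sessions until the
    # featureCompatibilityVersion is fully upgraded to 3.6.
    if fcv != FCV_FULLY_UPGRADED:
        return Status(False, "FCV %s: cannot set up config.system.sessions" % fcv)
    return Status(True)


def refresh(fcv):
    # Guard 2: bail out of the periodic _refresh early -- after
    # statistics are computed, but before collection setup.
    if fcv != FCV_FULLY_UPGRADED:
        return Status(False, "skipping refresh: FCV not fully upgraded to 3.6")
    return setup_sessions_collection(fcv)
```

With FCV "3.4" both guards fail closed; once the cluster is fully at "3.6", the refresh proceeds to the collection setup.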



 Comments   
Comment by Githook User [ 19/Jul/18 ]

Author:

{'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com', 'username': 'mikety'}

Message: SERVER-35795 do not allow sessions with FCV 3.4
Branch: v3.6
https://github.com/mongodb/mongo/commit/646d68003cadcd60fed5abaf1e92368390a4a1cb

Comment by Lukáš Křečan [ 12/Jul/18 ]

Happened to us as well.

Comment by Tatiana [ 11/Jul/18 ]

Thanks for the replies, guys! Yeah, basically that's what I did back then: just shut down the mongod service on the new secondary (the former primary that had stepped down) and upgraded it to 3.6.5 too. Thanks for the answers, and good luck with fixing it!

Comment by Sree Himakunthala [ 05/Jul/18 ]

Kelsey,

1) Do you have any ETA on the fix/next release?

2) Does this problem exist in 3.6.4?

Thanks

Sree

Comment by Kelsey Schubert [ 05/Jul/18 ]

Yes, that's a good suggestion; thanks for sharing, sreek95051.

I should clarify my previous comment: running a mixed-version replica set for an extended period of time is not supported. We recommend following the documented upgrade procedure. However, we are aware that unexpected elections may cause a 3.6 node to become primary before all secondaries have been upgraded, so we are working on resolving this issue in a later version of MongoDB 3.6.

Kind regards,
Kelsey

Comment by Sree Himakunthala [ 05/Jul/18 ]

Tatiana,

You have a good point. I'll take the liberty of suggesting a method for that problem (use your discretion in following this procedure) unless the MongoDB team has a better recommendation.

Instead of stepping down the primary, just shut it down so that a new primary is elected, then upgrade that node's binaries to 3.6.5 and bring it back up, which circumvents the problem identified here.

Hope this helps.

Comment by Tatiana [ 04/Jul/18 ]

Hi Kelsey,

About "but ensure that 3.6 nodes should only be primary when all secondaries have been upgraded to 3.6": that seems impossible given the current upgrade procedure, https://docs.mongodb.com/manual/release-notes/3.6-upgrade-replica-set/#upgrade-a-replica-set-to-3-6. Once you have upgraded your secondaries from e.g. 3.4.10 to 3.6.5, you need to make the primary step down according to that guide, so it becomes a secondary (but it is still 3.4), the issue reproduces, and the mongod service ends up in a failed state. Shouldn't the guide mention that we just need to ignore this, stop mongod on the recently stepped-down node, and upgrade it to 3.6.5 as well?

Thanks for the confirmation!

Comment by Kelsey Schubert [ 03/Jul/18 ]

Hi sreek95051,

The upgrade procedure you describe is not supported, nor recommended. As documented in our upgrade process, we suggest that all secondaries be upgraded before upgrading the primary. Note, for example, that testing a 3.6 node as primary with 3.4 secondaries may not be a reasonable measure of the cluster's performance once fully upgraded. If the 3.4 secondaries are not able to keep up with the 3.6 primary's writes, this could cause additional cache pressure and impact the performance of the upgraded primary.

We are working on correcting the issue described by this ticket, but following our recommended upgrade best practices would fully avoid running into this problem. For a complete list of issues resolved in 3.6.5 (not included in 3.6.4), please review this filter. If you're unable to modify your upgrade procedure, I would recommend waiting for this ticket to be resolved.

Kind regards,
Kelsey

Comment by Sree Himakunthala [ 02/Jul/18 ]

Kelsey,

We cannot upgrade all secondaries to 3.6.5 before making a 3.6.5 node primary.

Our plan is to upgrade one node to 3.6.5, make it the primary, and let it bake for a few days before upgrading the rest of the nodes. This way, we have other nodes to fall back on in case of a problem with 3.6.5. For this upgrade plan, what is the most stable version of MongoDB without the current problem?

Thanks

Sree

Comment by Andy Schwerin [ 02/Jul/18 ]

Setting this fix to "3.6 Required", as it looks pretty likely to need work on the 3.6 branch. My interpretation is that it shouldn't require work on master or 4.0, as it's largely upgrade/downgrade related. misha.tyulenev, feel free to add fix versions or file other tickets if work needs to happen on the 4.0 and master branches.

Comment by Kelsey Schubert [ 02/Jul/18 ]

Hi sreek95051,

We're continuing to investigate and will update this ticket when we have a fix for this issue. In the meantime, I would suggest that you continue to use 3.6.5, but ensure that 3.6 nodes only become primary once all secondaries have been upgraded to 3.6. If you want to protect against unexpected elections, I would suggest setting your 3.6 secondary to priority 0 until the replica set is fully upgraded.
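For illustration, the priority-0 suggestion amounts to the following config edit, shown here as a Python sketch of the document shapes. The host names are hypothetical; in the mongo shell you would fetch the config with rs.conf() and apply the edited copy with rs.reconfig(cfg).

```python
# Sketch of demoting the already-upgraded 3.6 member to priority 0 so it
# cannot be elected primary during the rolling upgrade. Host names are
# hypothetical; the config shape mirrors what rs.conf() returns.

def demote_member(cfg, host):
    """Return a copy of the replica set config with `host` at priority 0."""
    new_cfg = dict(cfg)
    new_cfg["members"] = [
        dict(m, priority=0) if m["host"] == host else dict(m)
        for m in cfg["members"]
    ]
    new_cfg["version"] = cfg["version"] + 1  # a reconfig bumps the version
    return new_cfg


cfg = {
    "_id": "rs0",
    "version": 1,
    "members": [
        {"_id": 0, "host": "node1:27017", "priority": 1},  # still 3.4
        {"_id": 1, "host": "node2:27017", "priority": 1},  # still 3.4
        {"_id": 2, "host": "node3:27017", "priority": 1},  # upgraded to 3.6
    ],
}
new_cfg = demote_member(cfg, "node3:27017")
```

Once all members are upgraded, the same edit in reverse restores the member's original priority.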

Kind regards,
Kelsey

Comment by Sree Himakunthala [ 02/Jul/18 ]

Joe,

do you recommend us to stick to 3.6.4 instead of 3.6.5?

If this was introduced in 3.6.5, is there a patch coming to fix it?

Thanks

Sree

Comment by Joe Caswell [ 02/Jul/18 ]

In my testing for the rolling upgrade ticket, this issue appears to have been introduced in 3.6.5.

I have, as yet, been unable to reproduce it in 3.6.4.

Comment by Sree Himakunthala [ 28/Jun/18 ]

Kelsey,

The mongod log of the primary has been uploaded to the FTP site.

thx

Sree

Comment by Kelsey Schubert [ 27/Jun/18 ]

Hi sreek95051,

Yes, we'd still like to see the logs so we can understand how the replica set got into this state. The error you observed on the secondaries running 3.4 was coming from oplog application here which originates during the namespace check in the application of the create command here.

This error suggests that the 3.6 primary created the 'config.system.sessions' collection even though the secondaries were still running 3.4, a version that does not support sessions or have any awareness of that collection. So it would be helpful for us to review the logs of the 3.6 primary to understand the context around when this collection was created. The approach you suggest for going forward with your upgrade sounds reasonable, given our current understanding of the situation.

Note that in 3.6, this collection name is valid and does not produce any errors while applying the oplog, as you can see here; consequently, upgrading the secondaries to 3.6 allowed the nodes to successfully apply these operations.
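A rough Python model of why the two versions diverge during oplog application. The real logic is the namespace check in the server's C++ oplog-application code linked above; this function is a simplified illustration, not the actual implementation.

```python
# Model of the divergent namespace check during oplog application.
# A 3.4 secondary has no notion of sessions, so a replicated create of
# config.system.sessions fails its namespace check (and the startup
# replay then trips a fatal assertion); a 3.6 node recognizes the
# collection and applies the op normally.

SESSIONS_NS = "config.system.sessions"


def can_apply_create(ns, server_version):
    """Can a secondary at server_version apply a replicated create of ns?"""
    if ns == SESSIONS_NS:
        return server_version >= (3, 6)  # unknown/unwritable on 3.4
    return True
```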

Please let us know when you've uploaded the logs, so we can continue to investigate this issue.

Thank you,
Kelsey

Comment by Sree Himakunthala [ 27/Jun/18 ]

Kelsey,

I upgraded one of the erroring secondaries to 3.6.5 and was able to bring it up successfully.

Why did it error out with the 3.4.x version?

Do you still need the old log for your debugging?

Our plan is to upgrade one secondary first, promote it to primary, and observe the system before upgrading the rest of the nodes. Do you see any problem with this approach?

Thanks

Sree

Comment by Kelsey Schubert [ 26/Jun/18 ]

Hi sreek95051,

To get your cluster back into a healthy state as we continue to investigate what caused this issue, I would suggest upgrading your secondaries to 3.6, which should allow them to successfully start. If you would then like to downgrade back to 3.4, my advice would be to set feature compatibility to 3.6, then follow the standard downgrade steps described in our documentation.

Kind regards,
Kelsey

Comment by Sree Himakunthala [ 26/Jun/18 ]

Kelsey,

This node was NEVER run with feature compatibility of 3.6.

I will upload the logs soon.

Thanks

Sree

Comment by Kelsey Schubert [ 26/Jun/18 ]

Hi sreek95051,

Thanks for confirming that the feature compatibility version is appropriately configured for this replica set. Has this node ever run with feature compatibility 3.6?

To help us investigate this issue, would you please provide the complete logs from this node, going back to at least before it was promoted to primary?

I've created a secure upload portal for you to use. Files uploaded to this portal are only visible to MongoDB employees investigating this issue and are routinely deleted after some time.

Thanks again,
Kelsey

Comment by Sree Himakunthala [ 26/Jun/18 ]

Kelsey,

here is the output:

{ "featureCompatibilityVersion" : "3.4", "ok" : 1 }
Comment by Kelsey Schubert [ 25/Jun/18 ]

Hi sreek95051,

Thanks for reporting this issue. So we can continue to investigate, would you please provide the output of the following command 

db.adminCommand( { getParameter: 1, featureCompatibilityVersion: 1 } )

executed directly against the primary which was upgraded to 3.6.5?

Thank you,
Kelsey

Generated at Thu Feb 08 04:41:01 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.