[SERVER-34683] Downgrade replicaset from 3.6.4 to 3.4.14 fails due to the presence of `config.system.sessions` Created: 26/Apr/18  Updated: 29/Oct/23  Resolved: 10/May/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.6.4
Fix Version/s: 3.6.5

Type: Bug Priority: Major - P3
Reporter: Wojciech Sielski Assignee: Misha Tyulenev
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Upgrade from 3.4 to 3.6.4, then attempt to downgrade back to 3.4 while config.system.sessions exists

Sprint: Sharding 2018-05-07, Sharding 2018-05-21
Participants:
Case:

 Description   

Hi

I upgraded my replica set from 3.4.1 to 3.6.4, but the application was not yet using the newest driver, so I was forced to downgrade (per the documentation, I chose the newest 3.4.14).

The downgrade was not possible, even after wiping the data (forcing replication from scratch). I constantly got errors about the config.system.sessions collection:

2018-04-25T17:21:01.319+0200 F REPL     [repl writer worker 1] writer worker caught exception: 10156 cannot update system collection: config.system.sessions q: { _id: { id: BinData(4, 96E23194367D482B81B1D75B6423AFBD), uid: BinData(0, EEDC58CB4933CBA895757E2FD8275E869A732F964407D28FA4D462EE4FC5B046) } } u: { $set: { lastUse: new Date(1524669659199) } } on: { ts: Timestamp 1524669659000|10, t: 39, h: 6122200662662238595, v: 2, op: "u", ns: "config.system.sessions", o2: { _id: { id: BinData(4, 96E23194367D482B81B1D75B6423AFBD), uid: BinData(0, EEDC58CB4933CBA895757E2FD8275E869A732F964407D28FA4D462EE4FC5B046) } }, wall: new Date(1524669659197), o: { $set: { lastUse: new Date(1524669659199) } } }
2018-04-25T17:21:01.319+0200 I -        [repl writer worker 1] Fatal assertion 16359 Location10156: cannot update system collection: config.system.sessions q: { _id: { id: BinData(4, 96E23194367D482B81B1D75B6423AFBD), uid: BinData(0, EEDC58CB4933CBA895757E2FD8275E869A732F964407D28FA4D462EE4FC5B046) } } u: { $set: { lastUse: new Date(1524669659199) } } at src/mongo/db/repl/sync_tail.cpp 1082
2018-04-25T17:21:01.319+0200 F REPL     [repl writer worker 15] writer worker caught exception: 10156 cannot update system collection: config.system.sessions q: { _id: { id: BinData(4, F7749D80D99044C18AEC4F16DB0841A6), uid: BinData(0, 575CD72AF2368BDED794D83BD07BA14894021281F75D97BE815F1FE5D795D1E1) } } u: { $set: { lastUse: new Date(1524669659199) } } on: { ts: Timestamp 1524669659000|6, t: 39, h: 9214092136229990651, v: 2, op: "u", ns: "config.system.sessions", o2: { _id: { id: BinData(4, F7749D80D99044C18AEC4F16DB0841A6), uid: BinData(0, 575CD72AF2368BDED794D83BD07BA14894021281F75D97BE815F1FE5D795D1E1) } }, wall: new Date(1524669659197), o: { $set: { lastUse: new Date(1524669659199) } } }
2018-04-25T17:21:01.319+0200 I -        [repl writer worker 1]
 
***aborting after fassert() failure

I have even tried to drop the collection and the config database, but it was recreated.



 Comments   
Comment by Ramon Fernandez Marina [ 25/Jun/18 ]

victorgp, robomon1: since this ticket corresponds to a specific bug and it has already been closed, I'd request that you either post on the mongodb-user group if you have a support-related question, or open a new SERVER ticket if you believe you've found a bug.

Thanks,
Ramón.

Comment by Robert Ford [ 21/Jun/18 ]

Not sure this is totally fixed in 3.6.5. I just did a rolling upgrade of a 3-node replica set that was on 3.4 with FCV=3.4. The nodes were AWS instances, and it was easy enough to just wipe them out and recreate them with 3.6.5. Nodes 1 and 2 went fine. When I ran rs.stepDown() on the primary, it took more than a few seconds to elect a new primary. Then the mongod service on Node 3, which was still on 3.4, aborted with this error. After that I couldn't restart the service on Node 3. I finally just upgraded Node 3 to 3.6 and everything started fine.
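
As a side note, a minimal mongo shell sketch of the step-down step described above; the timeout values here are illustrative assumptions, not taken from the report:

    // run on the current primary: step down for up to 60 seconds, allowing
    // secondaries up to 10 seconds to catch up before an election is forced
    rs.stepDown(60, 10)
    // afterwards, list which member has become PRIMARY
    rs.status().members.filter(function (m) { return m.stateStr === "PRIMARY"; })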

Comment by VictorGP [ 19/Jun/18 ]

I'm affected by this issue in a slightly different way.

We downgraded from 3.6 to 3.4, and in one of the shards of the cluster the config.system.sessions collection remained, but not in the rest. It is not in the config servers either, so what I'm trying to do is drop the collection in that shard, but I cannot find a role or set of roles that allows me to do that.

I tried the admin, restore and root roles with no luck; I always get: "not authorized on config to execute command".

Do you know what permissions I need to set to perform this operation in that shard?

Comment by Githook User [ 10/May/18 ]

Author:

{'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com', 'username': 'mikety'}

Message: SERVER-34683 drop config.system.sessions on downgrade to 3.4
Branch: v3.6
https://github.com/mongodb/mongo/commit/8d736eabdbc0da2d4846edffad26df538a6adad4

Comment by Kevin Pulo [ 03/May/18 ]

For users currently affected by this issue (running 3.6.4 with FCV 3.4 (either never set to 3.6, or set back to 3.4), attempting to downgrade to 3.4, and encountering these errors on the 3.4 nodes), another potential workaround is to perform a rolling restart on 3.6.4 with --setParameter disableLogicalSessionCacheRefresh=true, and then perform the rolling downgrade to 3.4, removing this parameter when starting each 3.4 mongod (since leaving it in place would prevent the 3.4 mongod from starting).

Note that this is an undocumented, internal-only parameter that should not be used otherwise. While set to true it inhibits the writes to config.system.sessions, which are normal and necessary in homogeneous 3.6 replica sets but are the source of this issue in mixed 3.4/3.6 replica sets.
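
One way to sanity-check this workaround after each 3.6.4 restart is to read the parameter back from the mongo shell; this is only a sketch and assumes getParameter exposes disableLogicalSessionCacheRefresh like other registered server parameters:

    // run against each restarted 3.6.4 member;
    // expect { disableLogicalSessionCacheRefresh: true, ok: 1 }
    db.adminCommand({ getParameter: 1, disableLogicalSessionCacheRefresh: 1 })
    // with the refresh disabled, no new updates should land in the internal
    // sessions collection, so its document count should stop changing
    db.getSiblingDB("config").getCollection("system.sessions").count()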

Comment by Andy Schwerin [ 01/May/18 ]

This plan sounds reasonable. We should also see about improving the downgrade test coverage in the multiversion suite.

Comment by Randolph Tan [ 30/Apr/18 ]

The approach sounds reasonable to me

Comment by Kaloian Manassiev [ 30/Apr/18 ]

Currently (as of 3.6.4) the config.system.sessions collection is unconditionally created by the config server and all other nodes (shards and mongos) just check for its presence before attempting to write to it.

In order to fix this downgrade problem, I propose that we make the following changes:

  • Change SessionsCollectionConfigServer to only create the collection if FCV is 3.6 (this will require it to take the FCV lock as well, in order to serialize with possible concurrent FCV changes)
  • On FCV downgrade to 3.4, drop the sharded config.system.sessions collection, driven by the config server again.

The last step still has a race condition where a stray write to the config.system.sessions collection may accidentally recreate it on the config server, so I propose to also disallow the creation of config.system.sessions at all if FCV is not 3.6.

This is the general direction and some race conditions may still have to be fleshed out, but I wanted to verify that the direction sounds correct before we put more time into designing it. schwerin, renctan?

With these fixes, customers who are on 3.6.4 and are unable to downgrade to 3.4 will have two options:

  • Manually delete the config.system.sessions collection on all nodes (config server and shards)
  • Upgrade to 3.6.5, set FCV to 3.4, and then downgrade to 3.4 (see the sketch below)
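
A minimal mongo shell sketch of that second option, assuming the 3.6.5 binaries are already installed on every member and the command is run against the primary (or a mongos for a sharded cluster); the rolling binary-downgrade mechanics themselves are omitted:

    // lower the feature compatibility version; with the fix in 3.6.5 this is
    // also expected to drop the config.system.sessions collection
    db.adminCommand({ setFeatureCompatibilityVersion: "3.4" })
    // confirm the internal sessions collection is gone before downgrading binaries
    db.getSiblingDB("config").getCollectionNames()
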
Comment by Kaloian Manassiev [ 26/Apr/18 ]

Hi sielaq,

Thank you for the detailed report and for confirming that the feature compatibility version has been downgraded to 3.4.

From a cursory look, it appears that the code which downgrades the FCV omits dropping the internal config.system.sessions collection (which is something new we introduced in 3.6 in order to support logical sessions), or it somehow gets recreated after the FCV downgrade (more likely).

While we investigate this issue further, if you are blocked because of the failing downgrade, you can work around the problem by manually dropping the config.system.sessions collection.
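
Concretely, a minimal sketch of that manual drop from the mongo shell; on a sharded cluster it would presumably need to be repeated against each shard primary and the config server primary:

    // drop the internal sessions collection introduced in 3.6
    db.getSiblingDB("config").getCollection("system.sessions").drop()
    // verify it is no longer listed on this node; note the reporter observed
    // that 3.6 nodes can recreate it while the session cache keeps refreshing
    db.getSiblingDB("config").getCollectionNames()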

Best regards,
-Kal.

Comment by Wojciech Sielski [ 26/Apr/18 ]

Anticipating the obvious question: yes, the feature compatibility version has been downgraded to 3.4, all according to the documented procedure.
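
For reference, the check that corresponds to this confirmation can be run from the mongo shell on the primary:

    // expected to report FCV 3.4 (the exact response shape differs slightly
    // between 3.4 and 3.6 binaries)
    db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 })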
