[SERVER-49468] Invalidate previous OperationContext when a new OperationContext is created
Created: 13/Jul/20  Updated: 29/Oct/23  Resolved: 09/Mar/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Bug
Priority: Major - P3
Reporter: Benjamin Caimano (Inactive)
Assignee: Benjamin Caimano (Inactive)
Resolution: Fixed
Votes: 0
Labels: servicearch-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
related to SERVER-53857 Successive segfaults of several clust... Closed
is related to SERVER-53566 Investigate and reproduce "opCtx != n... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.4
Sprint: Service arch 2020-11-30
Participants:

 Description   

We have an aggressive invariant here. This means that we crash the server whenever we attempt to replace an OperationContext instead of explicitly destroying and then recreating. We can keep that invariant in test environments. In production, I believe we should interrupt the previous OperationContext to produce an AssertionError. We should probably also log and emit an exception at the call site to make sure we don't end up in an unsatisfiable wait.
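
The behaviour proposed above could be sketched roughly as follows. This is an illustration only, not the code from the actual commit: `OpCtx`, `ClientSlot`, `OperationInterrupted`, `ContextOverwritten`, and the `testEnvironment` flag are hypothetical stand-ins for the server's real classes and configuration. It assumes a client owns at most one context at a time, and that the owner of the previous context notices the kill the next time it checks for interruption.

```cpp
#include <atomic>
#include <cassert>
#include <cstdlib>
#include <iostream>
#include <memory>
#include <stdexcept>

// Hypothetical error raised inside the operation whose context was killed.
struct OperationInterrupted : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical error raised at the call site that tried to overwrite a context.
struct ContextOverwritten : std::logic_error {
    using std::logic_error::logic_error;
};

// Hypothetical stand-in for an operation context.
class OpCtx {
public:
    void markKilled() { _killed.store(true); }
    bool isKilled() const { return _killed.load(); }

    // Called by the owning operation at safe points; throws once killed,
    // so the operation cannot sit in a wait that will never be satisfied.
    void checkForInterrupt() const {
        if (isKilled())
            throw OperationInterrupted("operation context was replaced");
    }

private:
    std::atomic<bool> _killed{false};
};

// Hypothetical stand-in for the per-client slot holding the current context.
class ClientSlot {
public:
    explicit ClientSlot(bool testEnvironment) : _testEnvironment(testEnvironment) {}

    std::shared_ptr<OpCtx> makeOperationContext() {
        if (_current) {
            if (_testEnvironment) {
                // Test environments keep the aggressive behaviour: crash.
                std::cerr << "invariant failure: context replaced without being destroyed\n";
                std::abort();
            }
            // Production: interrupt the previous context, log, and throw at
            // the call site instead of crashing the server.
            _current->markKilled();
            std::cerr << "killed previous context; refusing to overwrite it\n";
            throw ContextOverwritten("previous context must be destroyed first");
        }
        _current = std::make_shared<OpCtx>();
        assert(!_current->isKilled());  // a fresh context starts out live
        return _current;
    }

    // Explicit destruction, after which a new context may be created.
    void resetOperationContext() { _current.reset(); }

private:
    bool _testEnvironment;
    std::shared_ptr<OpCtx> _current;
};
```

In this sketch both parties learn about the conflict: the caller that attempted the overwrite gets `ContextOverwritten` immediately, and the holder of the earlier context gets `OperationInterrupted` at its next `checkForInterrupt()`.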



 Comments   
Comment by Lucas Bonnet [ 01/Apr/21 ]

Awesome, can't wait for 4.4.5 then, thanks.

Comment by Billy Donahue [ 01/Apr/21 ]

Lucas,

The root cause of the problems you're referring to has been identified and fixed by SERVER-53566, which has been backported and is indeed expected to be included in the upcoming 4.4.5 release.

This ticket here represents an earlier defensive and diagnostic measure, implemented before the counter overflow was identified as the root cause.

Comment by Lucas Bonnet [ 01/Apr/21 ]

Hello,

Just a reminder that we still can't touch the replica set to remove or add members, or it crashes completely, and servers crash anyway after 20 days. How is that marked as "FIXED"? Why are you refusing to fix this bug in 4.4? I fail to see how you're keeping 4.4 "as stable as possible" with a bug that paralyses cluster operations.

Comment by Githook User [ 09/Mar/21 ]

Author:

{'name': 'Judah Schvimer', 'email': 'judah@mongodb.com', 'username': 'judahschvimer'}

Message: Revert "SERVER-49468 Kill and throw when OperationContexts are overwritten"

This reverts commit d1478455494a1d8b4a5ceec91eb4983f118a45b4.
Branch: v4.4
https://github.com/mongodb/mongo/commit/ca84ce1f36dae35cb433b578c1f04283586c093e

Comment by Judah Schvimer [ 09/Mar/21 ]

To keep the 4.4 branch as stable as possible, and since there is ongoing discussion about the best way forward on this ticket, I will be reverting the 4.4 fix until we've gained consensus.

Comment by Githook User [ 08/Mar/21 ]

Author:

{'name': 'Ben Caimano', 'email': 'ben.caimano@10gen.com'}

Message: SERVER-49468 Kill and throw when OperationContexts are overwritten

(cherry picked from commit b7cf8fbfcc547015f7fcd8521f4890b8ee8598f6)
Branch: v4.4
https://github.com/mongodb/mongo/commit/d1478455494a1d8b4a5ceec91eb4983f118a45b4

Comment by Юрий Соколов [ 08/Mar/21 ]

There is a conclusion on SERVER-53566 about the source of the issue and the way to fix it.

Comment by Benjamin Caimano (Inactive) [ 08/Mar/21 ]

Thanks for the poke, sz and lucas@lichess.org. I'd left this on the backlog for our architecture team but it seems that it needs to be better prioritized. I'll attempt to get it backported for r4.4.5.

Comment by Lucas Bonnet [ 08/Mar/21 ]

Indeed our two crashes are 20 days apart here too. That's not a long uptime for a db server...

Comment by Sergey Zagursky [ 08/Mar/21 ]

@Lucas, our clusters crashed after exactly 20 days of uptime, and it looks like this is linked to this constant: https://github.com/mongodb/mongo/blob/r4.4.3/src/mongo/db/keys_collection_manager.cpp#L62 because the invariant failure occurred in the KeysCollectionManager::PeriodicRunner thread.

Comment by Lucas Bonnet [ 08/Mar/21 ]

I second this, we're now on our second cluster-wide crash with no warning after weeks of uptime and zero action during the crash... With this issue the mongodb cluster went from something we did not worry about to something that might suddenly crash (and potentially lose data) without notice or change in app behaviour.

Comment by Sergey Zagursky [ 07/Mar/21 ]

@Benjamin Caimano, we're in desperate need of a backport of this to the next release of 4.4. We are experiencing cascading crashes on our production sharded clusters.

Comment by Юрий Соколов [ 15/Feb/21 ]

Strong upvote for backporting to 4.4.

Today I set featureCompatibilityVersion on several sharded clusters. Almost every shard replica set crashed with this error: sometimes all 3 replicas simultaneously, sometimes only the master. Reading that it may also happen with dropIndex and stepDown makes me nervous.

Comment by Githook User [ 24/Nov/20 ]

Author:

{'name': 'Ben Caimano', 'email': 'ben.caimano@10gen.com'}

Message: SERVER-49468 Kill and throw when OperationContexts are overwritten
Branch: master
https://github.com/mongodb/mongo/commit/b7cf8fbfcc547015f7fcd8521f4890b8ee8598f6
