[SERVER-49468] Invalidate previous OperationContext when a new OperationContext is created Created: 13/Jul/20 Updated: 29/Oct/23 Resolved: 09/Mar/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 4.9.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Benjamin Caimano (Inactive) | Assignee: | Benjamin Caimano (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | servicearch-wfbf-day |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v4.4 |
| Sprint: | Service arch 2020-11-30 |
| Participants: | |
| Description |
|
We have an aggressive invariant here: the server crashes whenever code attempts to replace an existing OperationContext rather than explicitly destroying it and creating a new one. We can keep that invariant in test environments. In production, I believe we should instead interrupt the previous OperationContext so that it produces an AssertionError. We should probably also log and throw an exception at the call site to make sure we don't end up in an unsatisfiable wait. |
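A minimal standalone sketch of the proposed behaviour, assuming a toy Client/OperationContext model rather than the real MongoDB types; the kTestBuild switch, method names, and exception type below are illustrative only:

```cpp
#include <cstdlib>
#include <iostream>
#include <memory>
#include <stdexcept>

// Toy stand-ins for the real types; not the MongoDB implementation.
struct OperationContext {
    bool killed = false;
    void markKilled() { killed = true; }  // interrupt anything waiting on this context
};

constexpr bool kTestBuild = false;  // assumed test/production switch

class Client {
public:
    OperationContext* makeOperationContext() {
        if (_opCtx) {
            // Test environments keep the hard invariant: abort the process.
            if (kTestBuild) {
                std::abort();
            }
            // Production: interrupt the stale context and surface an error
            // at the call site instead of taking the whole server down.
            _opCtx->markKilled();
            throw std::logic_error(
                "replaced an OperationContext that was never destroyed");
        }
        _opCtx = std::make_unique<OperationContext>();
        return _opCtx.get();
    }

    void destroyOperationContext() { _opCtx.reset(); }

private:
    std::unique_ptr<OperationContext> _opCtx;
};

int main() {
    Client client;
    client.makeOperationContext();
    try {
        client.makeOperationContext();  // second context without destroying the first
    } catch (const std::logic_error& err) {
        std::cerr << "caught at call site: " << err.what() << '\n';
    }
    return 0;
}
```

In the actual server the test-versus-production decision would hang off the existing debug/test build configuration, and the error surfaced at the call site would presumably be a uassert-style AssertionError rather than a std::logic_error.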
| Comments |
| Comment by Lucas Bonnet [ 01/Apr/21 ] |
|
Awesome, can't wait for 4.4.5 then, thanks. |
| Comment by Billy Donahue [ 01/Apr/21 ] |
|
Lucas, the root cause of the problems you're referring to, a counter overflow, has been identified and fixed in a separate ticket. This ticket represents an earlier defensive and diagnostic measure, implemented before the counter overflow was identified as the root cause. |
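As a rough illustration of why a counter overflow surfaces as crashes at a consistent uptime (the roughly 20-day interval reported further down this thread), here is a back-of-the-envelope sketch; the counter width and the ID-consumption rate in it are arbitrary assumptions, not values taken from the server source:

```cpp
#include <iostream>

// Generic illustration only, not code from the server: a fixed-width counter
// consumed at a roughly constant rate wraps after a fixed amount of uptime.
int main() {
    const double counterCapacity = 4294967296.0;  // 2^32 distinct values (assumed width)
    const double idsPerSecond = 1000.0;           // assumed, purely illustrative rate
    const double secondsToWrap = counterCapacity / idsPerSecond;
    std::cout << "counter wraps after ~" << secondsToWrap / 86400.0
              << " days of uptime\n";             // ~49.7 days with this assumed rate
    return 0;
}
```

A constant consumption rate, such as a periodic background job, pins the wrap-around point to a specific uptime, which matches the reports of clusters crashing after the same number of days.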
| Comment by Lucas Bonnet [ 01/Apr/21 ] |
|
Hello,
Just a reminder that we still can't touch the replicaSet to remove or add members, or it crashes completely, and servers crash anyway after 20 days. How is that marked as "FIXED"? Why are you refusing to fix this bug in 4.4? I fail to see how you're keeping 4.4 "as stable as possible" with a bug that paralyses cluster operations. |
| Comment by Githook User [ 09/Mar/21 ] |
|
Author: Judah Schvimer <judah@mongodb.com> (judahschvimer)
Message: Revert "
This reverts commit d1478455494a1d8b4a5ceec91eb4983f118a45b4. |
| Comment by Judah Schvimer [ 09/Mar/21 ] |
|
To keep the 4.4 branch as stable as possible, and since there is ongoing discussion about the best way forward on this ticket, I will be reverting the 4.4 fix until we've reached consensus. |
| Comment by Githook User [ 08/Mar/21 ] |
|
Author: Ben Caimano <ben.caimano@10gen.com>
Message: (cherry picked from commit b7cf8fbfcc547015f7fcd8521f4890b8ee8598f6) |
| Comment by Юрий Соколов [ 08/Mar/21 ] |
|
There is a conclusion on # |
| Comment by Benjamin Caimano (Inactive) [ 08/Mar/21 ] |
|
Thanks for the poke, sz and lucas@lichess.org. I'd left this on the backlog for our architecture team but it seems that it needs to be better prioritized. I'll attempt to get it backported for r4.4.5. |
| Comment by Lucas Bonnet [ 08/Mar/21 ] |
|
Indeed our two crashes are 20 days apart here too. That's not a long uptime for a db server... |
| Comment by Sergey Zagursky [ 08/Mar/21 ] |
|
@Lucas, our clusters have crashed after exactly 20 days of uptime, and it looks like it is linked to this constant: https://github.com/mongodb/mongo/blob/r4.4.3/src/mongo/db/keys_collection_manager.cpp#L62 because the invariant failure occurred in the KeysCollectionManager PeriodicRunner thread. |
| Comment by Lucas Bonnet [ 08/Mar/21 ] |
|
I second this: we're now on our second cluster-wide crash, with no warning, after weeks of uptime and no operator action at the time of the crash... With this issue the MongoDB cluster went from something we did not worry about to something that might suddenly crash (and potentially lose data) without notice or any change in application behaviour. |
| Comment by Sergey Zagursky [ 07/Mar/21 ] |
|
@Benjamin Caimano, we're in desperate need of a backport of this to the next release of 4.4. We are experiencing cascading crashes on our production sharded clusters. |
| Comment by Юрий Соколов [ 15/Feb/21 ] |
|
Heavy upvote for backporting to 4.4. Today I set featureCompatibilityVersion on several sharded clusters, and almost every shard replica set crashed with this error, sometimes all 3 replicas simultaneously, sometimes only the master. Reading that this can also happen with dropIndex and stepDown makes me nervous. |
| Comment by Githook User [ 24/Nov/20 ] |
|
Author: Ben Caimano <ben.caimano@10gen.com>
Message: |