[SERVER-53566] Investigate and reproduce "opCtx != nullptr && _opCtx == nullptr" invariant Created: 04/Jan/21  Updated: 29/Oct/23  Resolved: 23/Mar/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.9.0, 4.4.5, 4.0.24, 4.2.14

Type: Task Priority: Major - P3
Reporter: Matthew Tretin (Inactive) Assignee: Tyler Seip (Inactive)
Resolution: Fixed Votes: 8
Labels: sa-groomed
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Duplicate
duplicates SERVER-57007 MongoDB version 4.4.4 service crashed Closed
is duplicated by SERVER-52735 mongodb crash with "Invariant failure... Closed
is duplicated by SERVER-53857 Successive segfaults of several clust... Closed
is duplicated by SERVER-59350 Invariant failure exception Closed
is duplicated by SERVER-60303 Mongodb PSS all nodes crashed Closed
Related
related to SERVER-49468 Invalidate previous OperationContext ... Closed
related to SERVER-52735 mongodb crash with "Invariant failure... Closed
related to SERVER-53857 Successive segfaults of several clust... Closed
related to SERVER-53747 replica set down after setFeatureComp... Closed
is related to SERVER-53482 Invariant failure: opCtx != nullptr &... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.4, v4.2, v4.0
Sprint: Service Arch 2021-03-08, Service Arch 2021-03-22, Service Arch 2021-04-05
Participants:
Case:
Story Points: 5

 Description   

An unknown issue occasionally causes a particular invariant to be hit, which results in nodes unexpectedly terminating. The invariant is improved as part of SERVER-49468, but this ticket is to track a root-cause investigation.

Someone needs to spend dedicated time trying to unwind what may be happening here and determining the steps required to reproduce this particular bug.

Acceptance Criteria:  Either:

  • A set of procedures that a user could follow which causes the invariant to trigger.
  • A plan for adding additional, backportable diagnostic data that will allow us to properly track down the cause of this invariant being triggered.


 Comments   
Comment by Githook User [ 24/Mar/21 ]

Author:

{'name': 'Tyler Seip', 'email': 'Tyler.Seip@mongodb.com', 'username': 'tseip-mongo'}

Message: SERVER-53566: Protect ServiceContext from issuing duplicate operation IDs
Branch: v4.0
https://github.com/mongodb/mongo/commit/fbf13d24435cc05a04cbc56e560397e8b52d4ac9

Comment by Githook User [ 24/Mar/21 ]

Author:

{'name': 'Tyler Seip', 'email': 'Tyler.Seip@mongodb.com', 'username': 'tseip-mongo'}

Message: SERVER-53566: Protect ServiceContext from issuing duplicate operation IDs
Branch: v4.2
https://github.com/mongodb/mongo/commit/2d30db50b66fb8c31432b6f622e010a5b7bce256

Comment by Githook User [ 23/Mar/21 ]

Author:

{'name': 'Tyler Seip', 'email': 'Tyler.Seip@mongodb.com', 'username': 'tseip-mongo'}

Message: SERVER-53566: Protect ServiceContext from issuing duplicate operation IDs
Branch: v4.4
https://github.com/mongodb/mongo/commit/5fe45689b6eb936190203c66b506d5d042a76eb0

Comment by Daniel Pasette (Inactive) [ 23/Mar/21 ]

Yes, BACKPORT tickets are private, as they're just used for internal bookkeeping. The fixVersion on the ticket will be updated and the commit message will be included as a comment in this ticket (with the branch name in the message) when the backport lands.

Comment by Tyler Seip (Inactive) [ 23/Mar/21 ]

Ah, it must be the case that they're only visible to employees (the 4.4 ticket for example is BACKPORT-8675) - the backports themselves are working their way through the commit queues now and should be committed in a day or two.

Comment by Sergey Zagursky [ 23/Mar/21 ]

@Tyler Seip, I don't see backporting tickets linked to the issue. Are they visible only to MongoDB employees?

Comment by Tyler Seip (Inactive) [ 23/Mar/21 ]

This has been fixed on master and is being backported on each of the respective linked backporting tickets. Closing this as resolved.

Comment by Githook User [ 22/Mar/21 ]

Author:

{'name': 'Tyler Seip', 'email': 'Tyler.Seip@mongodb.com', 'username': 'tseip-mongo'}

Message: SERVER-53566: Protect ServiceContext from issuing duplicate operation IDs
Branch: master
https://github.com/mongodb/mongo/commit/dbb6b0d496badf3e841ab41270e23cc0e4dc2ea4

Comment by Artem Navrotskiy [ 18/Mar/21 ]

We made a PR against the v4.4 branch that checks for and skips already-used opIds to avoid the _nextOpId wraparound issue.

Comment by Artem Navrotskiy [ 11/Mar/21 ]

I'm sorry, but it looks like MongoDB 4.4.5 is not released yet.

Comment by Tyler Seip (Inactive) [ 11/Mar/21 ]

In response to this comment, we currently think that the server only needs enough uptime to trigger this issue, but the large number of operations emitted by events such as stepDowns or network issues makes it more likely that this issue will occur contemporaneously with them.

 

In response to this comment, the solution from that ticket will prevent the server from crashing entirely but does not fix the underlying issue, which will be fixed and backported on this ticket.

Comment by Luca - [ 09/Mar/21 ]

Hi !

As mentioned in the SERVER-52735 ticket, we encountered the same issue ( "ctx":"monitoring-keys-for-HMAC","msg":"Invariant failure","attr":{"expr":"opCtx != nullptr && _opCtx == nullptr","file":"src/mongo/db/client.cpp","line":126}} ) on the replicas of one of our Mongo databases, two days apart between replica 1 (crash on March 8 at 23:56:15 UTC) and replica 2 (crash this morning, March 9 at 00:04:25 UTC), with exactly the same log/stacktrace as the one in the description of SERVER-52735.

We saw that one of the related tickets (SERVER-49468) has been marked as fixed in version 4.4.5; do you know whether updating to that version will fix the problem?

Thanks

Comment by Юрий Соколов [ 08/Mar/21 ]

> Does the server need to have enough uptime AND a network issue to trigger the crash?

There is also a 20-day forced refresh period in KeysCollectionManager::PeriodicRunner::_doPeriodicRefresh ("ctx": "monitoring-keys-for-HMAC" in the log), as found and mentioned by Sergey Zagursky in SERVER-49468. It looks like 20 days is often enough for the wraparound to happen.

Comment by Adrien Jarthon [ 08/Mar/21 ]

Hi, if this explanation is confirmed: since we noticed that the crash occurs when the server is stepping down or suffering from network issues, what would be the link between the two issues (long uptime with integer overflow, and network issues)? Does the server need to have enough uptime AND a network issue to trigger the crash? If there are no network issues / stepDowns, would the server keep running despite this integer overflow?
Thanks for the clarification!

Comment by Tyler Seip (Inactive) [ 08/Mar/21 ]

We currently suspect that this comment has correctly identified the issue (thanks funny.falcon@gmail.com!). amirsaman.memaripour wrote a unit test that is able to reproduce this condition on wraparound. We're currently planning on leaving OperationId a uint32_t (as it is directly exposed in multiple parts of the system and so is not just an implementation detail) but adding some slow-path checking to _clientByOperationId in the event of a collision. This should be sufficient to resolve the issue within the week, assuming performance holds with the additional checks.
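For anyone following along, here is a minimal, self-contained sketch of the suspected failure mode. The map and names below are only a model of ServiceContext's bookkeeping, not the actual server code:

```cpp
// Simplified model of the suspected failure mode; not MongoDB source.
// A long-lived operation keeps its id registered while the 32-bit counter
// wraps around and hands the same id to a new, short-lived operation.
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<uint32_t, std::string> clientByOperationId;

    // Long-lived background operation (e.g. monitoring-keys-for-HMAC) registers its id.
    const uint32_t longLivedId = 42;
    clientByOperationId[longLivedId] = "long-lived background op";

    // ...roughly 2^32 operations later, the counter wraps and reissues the same id.
    const uint32_t duplicateId = longLivedId;
    clientByOperationId[duplicateId] = "short-lived client op";  // silently overwrites

    // The short-lived op finishes and is delisted, erasing the shared entry.
    clientByOperationId.erase(duplicateId);

    // The long-lived op now looks "already delisted", so its cleanup
    // (resetOperationContext) is skipped, and the next time an operation
    // context is attached to that client the invariant fires.
    std::cout << "entries left: " << clientByOperationId.size() << '\n';  // 0, not 1
}
```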

Comment by Benjamin Caimano (Inactive) [ 08/Mar/21 ]

rectalogic, that is a good data point to have, thank you!

Comment by Andrew Wason [ 08/Mar/21 ]

Both times we experienced this crash, I believe our uptime was on the order of 6 months or more when we stepped down the primary and it crashed.

Comment by Benjamin Caimano (Inactive) [ 08/Mar/21 ]

Back-of-the-envelope calculation says (2^32-1)/1000(ms/s)/60(s/min)/60(min/h)/24(h/day) ≈ 50 days, so we'd need to generate a new operation every millisecond for 50 days. At a scale like 3 months of uptime, that's not impossible. There are certainly going to be issues with BSON type conversion for our RPC, so we'd probably want to put this under a flag to allow users with especially long uptimes to opt in. Checking for operation-id integer overflow under the hood is probably significantly less intrusive as a first step.
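As a quick sanity check of that arithmetic, a small sketch assuming a sustained rate of one new operation per millisecond:

```cpp
// Back-of-the-envelope check: time until a uint32_t operation-id counter wraps
// if a new operation is created roughly every millisecond.
#include <iostream>

int main() {
    constexpr double idsPerSecond = 1000.0;       // one new opId per millisecond (assumption)
    constexpr double idSpace = 4294967295.0;      // 2^32 - 1
    constexpr double seconds = idSpace / idsPerSecond;
    std::cout << seconds / 86400.0 << " days\n";  // ~49.7 days
}
```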

Comment by Юрий Соколов [ 08/Mar/21 ]

I've made a PR against the v4.4 branch changing OperationId to uint64_t. I believe it should fix the issue.

Comment by Юрий Соколов [ 07/Mar/21 ]

All of the failing traces are in long-lived loops in non-client threads:

  • KeysCollectionManager::PeriodicRunner::_doPeriodicRefresh
  • WaitForMajorityService::_periodicallyWaitForMajority
  • TopologyVersionObserver::_workerThreadBody

 

Here is one possible scenario: I believe the long-lived process simply got a `_nextOpId` wraparound, so the long-lived operation's entry in `_clientByOperationId` was overwritten by a shorter-lived operation and then deleted. `_delistOperation` then believes the operation has already been cleaned up and doesn't call `resetOperationContext`.

The simplest fix is to add a check in `ServiceContext::makeOperationContext` for an already-existing key in `_clientByOperationId`. Otherwise the `OperationId` type has to be changed to `uint64_t`.
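A minimal sketch of that first suggestion, with illustrative names and structure (not the actual MongoDB implementation): keep advancing the counter until an id that is not currently registered is found, instead of silently overwriting the existing entry.

```cpp
// Illustrative sketch of "skip already-used ids"; not the actual MongoDB patch.
#include <cstdint>
#include <mutex>
#include <unordered_map>

class Client;  // opaque here

class ServiceContextModel {
public:
    uint32_t issueOperationId(Client* client) {
        std::lock_guard<std::mutex> lk(_mutex);
        uint32_t id = _nextOpId++;
        // Slow path: on a collision (only possible after wraparound), keep
        // searching for a free id instead of overwriting the old entry.
        while (_clientByOperationId.count(id) != 0) {
            id = _nextOpId++;
        }
        _clientByOperationId.emplace(id, client);
        return id;
    }

    void delistOperation(uint32_t id) {
        std::lock_guard<std::mutex> lk(_mutex);
        _clientByOperationId.erase(id);
    }

private:
    std::mutex _mutex;
    uint32_t _nextOpId = 0;
    std::unordered_map<uint32_t, Client*> _clientByOperationId;
};
```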

Comment by Adrien Jarthon [ 03/Mar/21 ]

Ok thanks, I uploaded the logs + diagnostic.data for my 3 nodes (called bhs, db3 and rbx) on this portal and I am following this ticket.
Let me know if I can help more!

Comment by Matthew Tretin (Inactive) [ 03/Mar/21 ]

Thanks for the report! If you want to upload your log data to this ticket we'll take a look; however, we've seen a few instances of this issue and logging has not proved fruitful yet. We have a number of folks actively engaged in reproducing the problem to generate a core dump so we can track down a root cause. Please keep an eye on this ticket for further information.

Comment by Adrien Jarthon [ 03/Mar/21 ]

I just got one of these chain failures (like in SERVER-52735) on 2 out of 3 of my RS nodes, leaving the RS in a read-only state. In my case it happened after a short network interruption (planned maintenance) on one of them (the first primary).
It went like this:

  1. db3.updn.io (Primary, 4.4.1) notices the network split ("Can't see a majority of the set, relinquishing primary") and crashes with the same invariant failure ("opCtx != nullptr && _opCtx == nullptr").
  2. rbx.updn.io (Secondary, 4.4.2) notices the network split at the same time, triggering an election and assuming the primary role ("Election succeeded, assuming primary role"), then instantly crashes with the same exception (at the exact same second as db3).
  3. bhs.updn.io (Secondary, 4.4.1), meanwhile, notices both servers down but, after voting for rbx once, doesn't assume primary as it's the only one left, and stays alive.

Do you need my log samples and diagnostic data? Where shall I upload them? Thanks!

Comment by Billy Donahue [ 07/Jan/21 ]

Maybe we can do something like what heap checkers do and record some backtrace information. Have all calls to setOperationContext record an address-only backtrace. We will only need memory for one of these traces per Client. Later, when a competing setOperationContext call comes in, we can log the saved backtrace along with the incoming backtrace. This might show us why they are in conflict.
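A rough sketch of that idea, assuming a glibc/Linux build where <execinfo.h> is available; the class and members here are illustrative stand-ins, not the real Client:

```cpp
// Illustration of recording an address-only backtrace at setOperationContext time
// so a later conflicting call can log both traces. Not actual MongoDB code.
#include <execinfo.h>   // ::backtrace, ::backtrace_symbols_fd (glibc/Linux)
#include <array>
#include <cstdio>

struct SavedTrace {
    std::array<void*, 32> addrs{};
    int size = 0;
};

class ClientModel {
public:
    void setOperationContext(void* opCtx) {
        if (_opCtx != nullptr) {
            // Conflict: dump the trace saved by the previous setter to stderr.
            std::fprintf(stderr, "previous setOperationContext call:\n");
            ::backtrace_symbols_fd(_lastSetter.addrs.data(), _lastSetter.size, 2);
        }
        _opCtx = opCtx;
        // Record the addresses of the current call; one trace kept per client.
        _lastSetter.size =
            ::backtrace(_lastSetter.addrs.data(), static_cast<int>(_lastSetter.addrs.size()));
    }

private:
    void* _opCtx = nullptr;
    SavedTrace _lastSetter;
};
```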

I don't think backporting SERVER-49468 will yield better information. The purpose there was merely to keep servers alive after they suffer this invariant violation. That may not be safe to do! It downgrades the server's response from process-fatal to merely operation-fatal. We don't know where the corruption is coming from, and it might not be local to the single operation.
