[SERVER-53566] Investigate and reproduce "opCtx != nullptr && _opCtx == nullptr" invariant Created: 04/Jan/21 Updated: 29/Oct/23 Resolved: 23/Mar/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 4.9.0, 4.4.5, 4.0.24, 4.2.14 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Matthew Tretin (Inactive) | Assignee: | Tyler Seip (Inactive) |
| Resolution: | Fixed | Votes: | 8 |
| Labels: | sa-groomed |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v4.4, v4.2, v4.0 |
| Sprint: | Service Arch 2021-03-08, Service Arch 2021-03-22, Service Arch 2021-04-05 |
| Participants: | |
| Case: | (copied to CRM) |
| Story Points: | 5 |
| Description |
|
There seems to be an unknown issue causing a particular invariant to occasionally be hit, which results in nodes unexpectedly terminating. The invariant is improved as part of […]. Someone needs to spend some dedicated time to try to unwind what may be happening here and determine the steps required to reproduce this particular bug.

Acceptance Criteria: Either:
[…]
|
| Comments |
| Comment by Githook User [ 24/Mar/21 ] |
|
Author: Tyler Seip &lt;Tyler.Seip@mongodb.com&gt; (tseip-mongo)
Message: |
| Comment by Githook User [ 24/Mar/21 ] |
|
Author: Tyler Seip &lt;Tyler.Seip@mongodb.com&gt; (tseip-mongo)
Message: |
| Comment by Githook User [ 23/Mar/21 ] |
|
Author: Tyler Seip &lt;Tyler.Seip@mongodb.com&gt; (tseip-mongo)
Message: |
| Comment by Daniel Pasette (Inactive) [ 23/Mar/21 ] |
|
Yes, BACKPORT tickets are private, as they're just used for internal bookkeeping. The fixVersion on the ticket will be updated and the commit message will be included as a comment in this ticket (with the branch name in the message) when the backport lands. |
| Comment by Tyler Seip (Inactive) [ 23/Mar/21 ] |
|
Ah, it must be the case that they're only visible to employees (the 4.4 ticket for example is BACKPORT-8675) - the backports themselves are working their way through the commit queues now and should be committed in a day or two. |
| Comment by Sergey Zagursky [ 23/Mar/21 ] |
|
@Tyler Seip, I don't see backporting tickets linked to the issue. Are they visible only to MongoDB employees? |
| Comment by Tyler Seip (Inactive) [ 23/Mar/21 ] |
|
This has been fixed on master and is being backported on each of the respective linked backporting tickets. Closing this as resolved. |
| Comment by Githook User [ 22/Mar/21 ] |
|
Author: Tyler Seip &lt;Tyler.Seip@mongodb.com&gt; (tseip-mongo)
Message: |
| Comment by Artem Navrotskiy [ 18/Mar/21 ] |
|
We made a PR against the v4.4 branch that checks for and skips already-used opIds, to avoid the _nextOpId wraparound issue. |
| Comment by Artem Navrotskiy [ 11/Mar/21 ] |
|
I'm sorry, but it looks like MongoDB 4.4.5 is not released yet. |
| Comment by Tyler Seip (Inactive) [ 11/Mar/21 ] |
|
In response to this comment: we currently think that the server only needs enough uptime to trigger this issue, but the large number of operations emitted by events such as stepDowns or network issues makes it more likely that the issue will occur contemporaneously with them.
In response to this comment: the solution from that ticket will prevent the server from crashing entirely, but it does not fix the underlying issue, which will be fixed and backported on this ticket. |
| Comment by Luca - [ 09/Mar/21 ] |
|
Hi! As said in […], we saw that one of the related tickets ([…]) […] Thanks |
| Comment by Юрий Соколов [ 08/Mar/21 ] |
|
> Does the server need to have enough uptime AND a network issue to trigger the crash?

There is also a 20-day forced refresh period in KeysCollectionManager::PeriodicRunner::_doPeriodicRefresh ("ctx": "monitoring-keys-for-HMAC" in the log), as found and mentioned by Sergey Zagursky in |
| Comment by Adrien Jarthon [ 08/Mar/21 ] |
|
Hi, if this explanation is confirmed: since we noticed that the crash occurs when the server is stepping down or suffering from network issues, what would be the link between the two causes (long uptime with integer overflow, and network issues)? Does the server need to have enough uptime AND a network issue to trigger the crash? If there are no network issues or stepDowns, would the server live with this integer overflow? |
| Comment by Tyler Seip (Inactive) [ 08/Mar/21 ] |
|
We currently suspect that this comment has correctly identified the issue (thanks funny.falcon@gmail.com!). amirsaman.memaripour wrote a unit test that is able to reproduce this condition on wraparound. We're currently planning on leaving OperationId a uint32_t (as it is directly exposed in multiple parts of the system, so it is not just an implementation detail) but adding some slow-path checking to _clientByOperationId in the event of a collision. This should be sufficient to resolve the issue within the week, assuming performance holds with the additional checks. |
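For illustration only, here is a minimal sketch of what such a slow-path collision check on a 32-bit operation id could look like. The names (`OperationIdRegistry`, `acquire`, `release`) and the plain `std::unordered_map` are assumptions for the sketch, not the actual ServiceContext code:

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_map>

using OperationId = std::uint32_t;

class OperationIdRegistry {
public:
    // Returns an id that is not currently in use. The while-loop is the "slow
    // path": it only spins when the 32-bit counter has wrapped around onto an
    // id that still belongs to a live operation.
    OperationId acquire(void* client) {
        std::lock_guard<std::mutex> lk(_mutex);
        OperationId id = _nextOpId++;
        while (_clientByOperationId.count(id) != 0) {  // collision after wraparound
            id = _nextOpId++;
        }
        _clientByOperationId.emplace(id, client);
        return id;
    }

    void release(OperationId id) {
        std::lock_guard<std::mutex> lk(_mutex);
        _clientByOperationId.erase(id);
    }

private:
    std::mutex _mutex;
    OperationId _nextOpId = 1;
    std::unordered_map<OperationId, void*> _clientByOperationId;
};
```

In the common case the map lookup is a single probe, so the extra cost only matters near wraparound, which is the performance question mentioned above.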
| Comment by Benjamin Caimano (Inactive) [ 08/Mar/21 ] |
|
rectalogic, that is a good data point to have, thank you! |
| Comment by Andrew Wason [ 08/Mar/21 ] |
|
Both times we experienced this crash I believe our uptime was on the order of 6 months or more when we stepped down the primary and crashed. |
| Comment by Benjamin Caimano (Inactive) [ 08/Mar/21 ] |
|
Back-of-the-envelope calculation says (2^32 - 1) / 1000 (ms/s) / 60 (s/min) / 60 (min/h) / 24 (h/day) ≈ 50 days, so we'd need to generate a new operation every millisecond for about 50 days. At a scale like three months of uptime, that's not impossible. There are certainly going to be issues with BSON type conversion for our RPC, so we'd probably want to put this under a flag to allow users with especially long uptimes to opt in. Checking for operation id integer overflow under the hood is probably significantly less intrusive as a first step. |
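For reference, the arithmetic above written out as a standalone snippet (not server code):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // If a new OperationId were consumed every millisecond, a uint32_t counter
    // would wrap after (2^32 - 1) milliseconds.
    const double ms = static_cast<double>(UINT32_MAX);
    const double days = ms / 1000.0 / 60.0 / 60.0 / 24.0;
    std::printf("%.1f days until wraparound\n", days);  // ~49.7 days
    return 0;
}
```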
| Comment by Юрий Соколов [ 08/Mar/21 ] |
|
I've made a PR against v4.4 branch with changing OperationId to uint64_t. I believe it should fix the issue. |
| Comment by Юрий Соколов [ 07/Mar/21 ] |
|
All failing traces are in long-living loops in non-client threads:

Here is one possible scenario:

I believe it is because:

Therefore, I believe a long-living process simply got a `_nextOpId` wraparound, and the operation in `_clientByOperationId` was overwritten by a shorter-lived operation and then deleted. Then `_delistOperation` believes the operation has already been cleaned up and doesn't call `resetOperationContext`. The simplest fix is to add a check in `ServiceContext::makeOperationContext` for an already-existing key in `_clientByOperationId`. Otherwise, the `OperationId` type has to be changed to `uint64_t`. |
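A self-contained toy illustration of the scenario described above, assuming the registry behaves like a map from a 32-bit id to a client; the container and names here are simplified stand-ins for the real ServiceContext internals:

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <map>

int main() {
    std::map<std::uint32_t, const char*> clientByOperationId;
    std::uint32_t nextOpId = 1;

    // A long-lived operation (e.g. a periodic runner) registers and keeps its id.
    std::uint32_t longLivedId = nextOpId++;
    clientByOperationId[longLivedId] = "long-lived";

    // ~2^32 operations later the counter has wrapped back to the same value
    // (fast-forwarded here for illustration).
    nextOpId = longLivedId;
    std::uint32_t shortLivedId = nextOpId++;
    assert(shortLivedId == longLivedId);
    clientByOperationId[shortLivedId] = "short-lived";  // silently replaces the long-lived entry

    // The short-lived operation finishes and is delisted, erasing the shared key.
    clientByOperationId.erase(shortLivedId);

    // When the long-lived operation is later delisted, its entry is already gone,
    // so the cleanup that would call resetOperationContext() is skipped, and a
    // later setOperationContext() trips "opCtx != nullptr && _opCtx == nullptr".
    std::cout << "entries left: " << clientByOperationId.size() << '\n';  // prints 0
    return 0;
}
```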
| Comment by Adrien Jarthon [ 03/Mar/21 ] |
|
OK, thanks. I uploaded the logs + diagnostic.data for my 3 nodes (called bhs, db3, and rbx) to this portal, and I am following this ticket. |
| Comment by Matthew Tretin (Inactive) [ 03/Mar/21 ] |
|
Thanks for the report! If you want to upload your log data to this ticket, we'll take a look; however, we've seen a few instances of this issue and logging has not proved fruitful yet. We have a number of folks actively engaged in reproducing the problem to generate a core dump so we can track down a root cause. Please keep an eye on this ticket for further information. |
| Comment by Adrien Jarthon [ 03/Mar/21 ] |
|
I just got one of these chain failures (like in […]).
Do you need my log samples and diagnostic data? Where shall I upload them? Thanks! |
| Comment by Billy Donahue [ 07/Jan/21 ] |
|
Maybe we can do something like what heap checkers do and record some backtrace information: have all calls to setOperationContext record an address-only backtrace. We would only need memory for one of these traces per Client. Later, when a competing setOperationContext call comes in, we can log the saved backtrace along with the incoming backtrace. This might show us why they are in conflict. I don't think backporting |
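A rough sketch of that idea, assuming glibc's backtrace facilities are available (Linux-only); `ClientDebugInfo` and the function names are hypothetical, not existing server code:

```cpp
#include <execinfo.h>  // glibc: backtrace(), backtrace_symbols_fd()

#include <array>

// Hypothetical per-Client bookkeeping: remember where the last successful
// setOperationContext() came from, so a conflicting call can log both call sites.
struct ClientDebugInfo {
    std::array<void*, 32> lastSetterFrames{};
    int lastSetterDepth = 0;
};

// Called on every successful setOperationContext(): capture an address-only trace.
void recordSetter(ClientDebugInfo& info) {
    info.lastSetterDepth = backtrace(info.lastSetterFrames.data(),
                                     static_cast<int>(info.lastSetterFrames.size()));
}

// Called when a competing setOperationContext() arrives while _opCtx is still set.
void onConflictingSetter(const ClientDebugInfo& info) {
    // Dump the saved trace of the earlier setter to stderr (fd 2)...
    backtrace_symbols_fd(info.lastSetterFrames.data(), info.lastSetterDepth, 2);
    // ...and the current call stack, so the two conflicting paths can be compared.
    std::array<void*, 32> now{};
    int depth = backtrace(now.data(), static_cast<int>(now.size()));
    backtrace_symbols_fd(now.data(), depth, 2);
}
```

Keeping only raw frame addresses per Client keeps the memory cost to a few hundred bytes; symbolization happens only on the rare conflict.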