[SERVER-49688] TTLMonitor Fatal assertion 29001 Bad Value - Invalid Argument in wt record store Created: 17/Jul/20 Updated: 22/Jun/22 Resolved: 20/Oct/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Stability, TTL |
| Affects Version/s: | 4.2.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Chad Kreimendahl | Assignee: | Eric Milkie |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Debian Stretch VM |
||
| Operating System: | ALL | ||||
| Steps To Reproduce: | Have TTL Indexes? Wait until they run? |
||||
| Sprint: | Execution Team 2020-08-24, Execution Team 2020-09-21, Execution Team 2020-10-05, Execution Team 2020-11-02 | ||||
| Description |
|
Hard crash on 4.2.7 during a TTL run. Of note: these TTL indexes ran fine for several weeks without any issue. We have a number of them and are unsure which specific one caused this.
|
| Comments |
| Comment by Eric Milkie [ 20/Oct/20 ] |
|
Unfortunately, after some attempts we were unable to reproduce this issue, and careful code inspection did not otherwise reveal the way in which this error was produced. If it happens again, please let us know so that we can reopen this ticket and continue the investigation with further failure data. |
| Comment by Chad Kreimendahl [ 11/Aug/20 ] |
|
It is disabled. This is the only environment in which that is true, as it's the only environment in which we use a PSA architecture. It's an integration testing system, where we create copies of high-load stuff and validate MongoDB updates, and where errors like this one would prevent our upgrade process (from 3.6 to 4.2, hopefully soon). We would not use this config in production, as there's no use for arbiters. Other than for failover, there's also nearly no use for the secondaries. They're a security blanket for us more than a destination for queries. Less than 0.001% of our queries can be performed on data that's not up-to-date.
(excluding some authorization items from config for obvious reasons)
|
| Comment by Daniel Gottlieb (Inactive) [ 11/Aug/20 ] |
|
Hi chad@onspring.com, can you share the configuration that instance was running with? With luck, the configuration information should be available from when the node restarts, and it is repeated at the top of a file when performing a MongoDB log rotation, e.g.:
I'm mostly curious about whether majority read concern is disabled. |
| Comment by Chad Kreimendahl [ 11/Aug/20 ] |
|
It has not. I always prefer when things are repeatable or have some pattern. This one has no easily identifiable pattern. The previous 6 hours of logs contain just permutations of this message:
And the lines in question:
|
| Comment by Eric Milkie [ 11/Aug/20 ] |
|
And it hasn't happened again since July 17? |
| Comment by Chad Kreimendahl [ 11/Aug/20 ] |
|
I'll dig for it, but it may be gone at this point. I went through the 200 lines of logs just prior, and all of them were normal things you'd expect to see happening all the time in our db: all things that had already happened hundreds to millions of times that day. |
| Comment by Eric Milkie [ 10/Aug/20 ] |
|
Hi Chad, do you have the full log from the database server that crashed? I'm interested in what the server was doing prior to hitting the problem. |
| Comment by Gregory Wlodarek [ 03/Aug/20 ] |
|
We'll investigate this in the next sprint. |