[SERVER-50971] Invariant failure, WT_NOTFOUND: item not found (Created: 16/Sep/20, Updated: 17/Jan/24, Resolved: 04/Feb/21)
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Stability, WiredTiger |
| Affects Version/s: | 4.4.0, 4.4.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Pieter Jordaan | Assignee: | Jonathan Streets (Inactive) |
| Resolution: | Done | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
| Issue Links: |
| Operating System: | ALL |
| Steps To Reproduce: | Unfortunately I cannot get it reproduced at will yet. |
| Participants: | |
| Case: | (copied to CRM) |
| Linked BF Score: | 119 |
| Description |
Hi
We have been getting infrequent (1-2 times a day) aborts of this kind lately. This happened on 4.4.1 and 4.4.0 too. We have had years without issue (3-node replica set), upgrading as new versions come around, and it isn't tied to specific times either. This is on an AWS i3 instance with an NVMe drive formatted as XFS. A snippet of the logs is included here and in the attachment. We have tried removing the DB path and resyncing from a secondary, but this did not fix it. What could cause this? Could it be a specific query? Could the disk itself be corrupt? How can I help pinpoint the issue?
| Comments |
| Comment by Jonathan Streets (Inactive) [ 10/Mar/22 ] |
hi firdous.bano@flipkart.com, we believe this has been fixed in the latest version of 4.4. jon
| Comment by Firdous Bano [ 09/Mar/22 ] |
Any update on this issue? Or what was the solution for it? I have been facing this for the last few days on our production machine. MongoDB version report: MongoDB shell version v4.4.0. I see this error in the mongo logs:
| Comment by Jonathan Streets (Inactive) [ 04/Feb/21 ] |
Hi pieterwjordaanpc@gmail.com and agustin@boostup.ai, Regards, Jon
| Comment by Alexander Gorrod [ 29/Dec/20 ] |
Thanks for letting us know - I'm glad you found a workaround.
| Comment by Agustin Barto [ 28/Dec/20 ] |
We got rid of a large write-heavy collection with a TTL index, and haven't had an incident since.
| Comment by Agustin Barto [ 09/Dec/20 ] |
Spoke too soon. Our primary started crashing again.
| Comment by Agustin Barto [ 01/Dec/20 ] |
We had the same issue on the primary of the replica set. We think we fixed it by switching the primary over to another member, clearing all the MongoDB files on the affected server, and then adding it back into the replica set. We haven't had any issues so far. Maybe something got corrupted and WiredTiger can't recover gracefully? We got the idea from this post: https://dba.stackexchange.com/questions/274900/why-mongod-service-is-killed-by-my-system.
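For reference, a minimal sketch of that resync-by-initial-sync approach, assuming the member stays in the replica set configuration; the service name and dbpath are the Ubuntu package defaults and may differ:

    sudo systemctl stop mongod
    sudo rm -rf /var/lib/mongodb/*   # wipe the data files on the affected member only
    sudo systemctl start mongod      # on restart the member performs an initial sync from another member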
| Comment by Luke Pearson [ 01/Dec/20 ] |
Hi pieterwjordaanpc@gmail.com. Only install it on the member of the replica set exhibiting the issue. It's interesting that a resync appears to have reduced the crash frequency or stopped it altogether. Let us know how you go with the updated binary.
| Comment by Pieter Jordaan [ 01/Dec/20 ] |
Hi @luke.pearson Thanks for the potential fix. Should I install this on all the replicas or just the one exhibiting the issue? To update on the status: it hasn't crashed again for a while, though due to other factors I had to resync the node again from scratch. Since then it hasn't crashed again yet. Previously I had resynced as well and it did not fix the issue. Our access patterns and load haven't changed in a way that could account for this, though there have been cases where nothing happened for an extended period.
| Comment by Luke Pearson [ 01/Dec/20 ] |
Hi pieterwjordaanpc@gmail.com, We've compiled a new binary with two bug fixes in it: one mentioned by haribabu.kommi previously, and one which also relates to WT_NOTFOUND being returned incorrectly in WiredTiger. As we have not yet been able to reproduce the issue, we can't say for sure whether the bug fixes included in this patch fix the issue you see. We've added extensive logging internally to increase the diagnostic information we get if you re-encounter this issue. We would greatly appreciate it if you could run this updated binary. Additionally, please run db.setLogLevel(2) in your MongoDB shell on startup. As a result of the additional debugging included in this build, your log file will potentially grow to several GBs, so please be aware of that. Thanks again for your assistance in helping us diagnose this issue. The information you have provided has been instrumental in assisting the team as we work through this issue. The binary can be found here: download link
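For reference, a minimal sketch of raising the log verbosity as requested; the connection details are illustrative:

    # db.setLogLevel(2) raises the verbosity of all log components on the running node
    mongo --eval 'db.setLogLevel(2)'
    # setting systemLog.verbosity: 2 in mongod.conf applies the same level at every startup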
| Comment by Pieter Jordaan [ 18/Nov/20 ] |
Hi Haribabu, I took the liberty of uploading the 5 core dumps since my last upload, along with a log covering all 5 crashes. I will stop uploading new dumps from now on, until you need them again. I am hopeful that the issue can be resolved soon. Thanks
| Comment by Haribabu Kommi [ 18/Nov/20 ] |
pieterwjordaanpc@gmail.com Thanks for providing the core dumps. We were able to identify a problematic code flow that can lead to this invariant failure and are discussing a possible fix within the team. We may need your help again in verifying the fix later. For now, we don't need any more core dumps.
| Comment by Pieter Jordaan [ 18/Nov/20 ] |
3 crashes last night. Should I upload the core dumps?
| Comment by Pieter Jordaan [ 17/Nov/20 ] |
I had two more crashes last night, about 2 hours apart. Would more core dumps help?
| Comment by Haribabu Kommi [ 16/Nov/20 ] |
pieterwjordaanpc@gmail.com Thanks for providing the latest core dump files. We will investigate the latest core dump further for the possible scenarios leading to this issue.
| Comment by Pieter Jordaan [ 13/Nov/20 ] |
@Luke, it happened again last night: core.mongod.113.1af3e321b37948df847752ae4d9a2e31.3285838.1605215389000000000000.lz4. The chunks core.mongod.3.aa through core.mongod.3.ak, as well as the log, are being uploaded to S3.
| Comment by Pieter Jordaan [ 10/Nov/20 ] |
@Luke it happened again shortly after the previous one: core.mongod.113.1af3e321b37948df847752ae4d9a2e31.3208829.1604957277000000000000.lz4
I uploaded it in chunks again, core.mongod.2.aa through core.mongod.2.af. The log is also uploaded.
| Comment by Luke Pearson [ 10/Nov/20 ] |
Thanks pieterwjordaanpc@gmail.com for providing us with the core dump. It's a shame the bug fix I added didn't resolve the issue, but that was also within the realm of possibilities. I've managed to piece the core back together and look at it. I don't spot anything immediately out of the ordinary, but it does rule out concerns about the snapshot and has helped with understanding the state of the system, which has narrowed down the range of possible scenarios that lead to you seeing this failure. I'll continue investigating.
| Comment by Pieter Jordaan [ 09/Nov/20 ] |
Bad news with a silver lining. The server just crashed. I'm uploading the core dump and logs.
core.mongod.113.1af3e321b37948df847752ae4d9a2e31.2715348.1604954701000000000000.lz4 was the original file name. I've had to split it into 1 GiB chunks with the split utility. The files are called core.mongod.aa through core.mongod.ai.
Let me know if I can provide any additional details.
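For reference, a minimal sketch of that split-and-reassemble workflow; the file names are illustrative:

    lz4 core.mongod core.mongod.lz4           # compress the raw core first
    split -b 1G core.mongod.lz4 core.mongod.  # produces core.mongod.aa, core.mongod.ab, ...
    # on the receiving side:
    cat core.mongod.?? > core.mongod.lz4
    lz4 -d core.mongod.lz4 core.mongod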
| Comment by Pieter Jordaan [ 07/Nov/20 ] |
Hi Jonathan, I set those two last night. Hopefully it will work this time.
| Comment by Jonathan Streets (Inactive) [ 06/Nov/20 ] |
hi pieterwjordaanpc@gmail.com,
| Comment by Pieter Jordaan [ 06/Nov/20 ] |
Okay, so the core dump vanished. Not quite sure how that is possible. Are there any sane config settings for core dumps you can recommend?
| Comment by Pieter Jordaan [ 06/Nov/20 ] |
Hi, sorry for being quiet. We have been quite busy and fortunately hadn't had any issues again, but the server just crashed again. I'll upload the log and core dump again, though it is only 12 GiB (zipped) again. Not quite sure why the core dump rules aren't taking effect.
Regarding the fatal crash I mentioned: the same primary died, but none of the secondaries could sync from the primary because there was a corrupted entry in its oplog. I had to sync the primary from the older secondaries.
I'll run the new fixed version this time. Hopefully we don't run into any crashes again. Thanks for all the effort put into getting this fixed.
| Comment by Luke Pearson [ 03/Nov/20 ] |
Hi pieterwjordaanpc@gmail.com, I've created the binary with the bug fix described in my previous comment. To be clear, we're still unsure if this will fix the issue you're seeing, but we would greatly appreciate it if you can run with this binary. In the event it still hits the same issue, it will log additional information (on top of what it was logging additionally last time), which will give us further insight into the issue you're seeing. Thanks,
| Comment by Louis Williams [ 28/Oct/20 ] |
Hi pieterwjordaanpc@gmail.com, sorry that you've been experiencing these problems. What symptoms did you observe with this corrupted oplog entry? Was it a primary node or a secondary?
| Comment by Pieter Jordaan [ 27/Oct/20 ] |
Hi @Luke, we will definitely try the new binary. What still seems a mystery to me is why it only happens on the one host when it is primary. Also, reading your fix's description it seems related to transactions, which I don't explicitly make use of, unless transactions there refer to something else. We haven't had a crash since my last message, though if history repeats itself it should occur soon. My biggest concern regarding the whole issue is that one fatal crash we had where even the oplog had a corrupted entry. Is there anything related to your fix which could explain why that happened? Eventually we needed to resync from a secondary which was older, losing some data in the process.
| Comment by Luke Pearson [ 25/Oct/20 ] |
Hi pieterwjordaanpc@gmail.com, I've found a potential cause for the symptom that occurs at the time of the crash, and since finding that I've been working on a fix for it. When the change is reviewed and tested we'd like to provide you with a custom binary to see if you still see the crash. Does that sound good to you? In the meantime a core dump would still be helpful to rule out additional possibilities.
| Comment by Pieter Jordaan [ 24/Oct/20 ] |
Hi @Jonathan, it has occurred twice since my last report; however, no core dumps were produced whatsoever.
I assume a symbolic link is not suitable for the job, so I've gone ahead and used mount --bind as an alternative.
Hopefully a dump will happen again soon, this time as a full core dump... If you have an alternative approach, please let me know. Thanks
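A minimal sketch of that bind-mount approach, assuming /data/coredump is a directory on the larger NVMe data volume (the path is illustrative):

    sudo mkdir -p /data/coredump
    sudo mount --bind /data/coredump /var/lib/systemd/coredump
    # to survive reboots, an equivalent bind entry can be added to /etc/fstab:
    #   /data/coredump  /var/lib/systemd/coredump  none  bind  0  0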
| Comment by Jonathan Streets (Inactive) [ 23/Oct/20 ] |
hi pieterwjordaanpc@gmail.com, just checking in. do you have any updates for us? thanks. jon
| Comment by Eric Sedor [ 19/Oct/20 ] |
Hi pieterwjordaanpc@gmail.com; I just wanted to chime back in to thank you again for your continued help investigating!
| Comment by Pieter Jordaan [ 19/Oct/20 ] |
Hi Jonathan, unfortunately that is exactly the size of the core dump file in /var/lib/systemd/coredump. I've checked, and the core dump ulimit for my mongod process is indeed unlimited.
However, inspecting /etc/systemd/coredump.conf I see the defaults are in fact 2 GiB. I'm not sure if I would be able to reach 32 GiB as my OS drive is not large enough, so I've made a link to my data drive for storing core dumps. I hope it will work now.
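A minimal sketch of /etc/systemd/coredump.conf with those 2G default caps raised so a ~32 GiB mongod core is not truncated; the sizes are illustrative, have to fit on the disk backing the dump directory, and should take effect for subsequent dumps:

    [Coredump]
    Storage=external
    ProcessSizeMax=64G
    ExternalSizeMax=64G
    MaxUse=80G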
| Comment by Jonathan Streets (Inactive) [ 19/Oct/20 ] |
hi pieterwjordaanpc@gmail.com. the core file appears to be truncated. the uploaded (compressed) file is 1104408543 bytes (1.03 GB), and when we uncompress it, it is 2147483648 bytes (2.00 GB). the file uploader on S3 supports file sizes up to 4 GB. when we try using the core dump, we get an error like: expected core file size >= 32241004544, found: 2147483648
thanks. jon
| Comment by Pieter Jordaan [ 19/Oct/20 ] |
@luke.pearson@mongodb.com Another crash occurred. Core dump and log sent. The core dump is substantially larger now; I hope this will provide more insight.
Fortunately it wasn't as fatal as the other day's crash, where even the oplog was corrupted.
| Comment by Zach Bjornson [ 15/Oct/20 ] |
We're hitting this same issue with 4.4.1 in a standalone configuration. I attached the relevant log snippet, which looks the same as the OP's log. Let me know if there's anything else I can provide. (We're about to roll back to 4.2 though, so likely can't provide much else.)
| Comment by Pieter Jordaan [ 14/Oct/20 ] |
@Jonathan Core dumps were definitely enabled. I have additionally now installed systemd-coredump, run systemctl daemon-reload, and restarted the mongod service too. I trust this should suffice.
@Luke the two members I interchanged the primary status between are on the same OS: Ubuntu 20.04 x86_64, same kernel, same XFS filesystem, started at roughly the same time. There is another secondary with lower priority that never becomes primary; it is running the same MongoDB version on Ubuntu 16.04 and is also a different AWS instance type. Like the other secondary, this server has never manifested these kinds of issues.
| Comment by Luke Pearson [ 14/Oct/20 ] |
Hi pieterwjordaanpc@gmail.com, are the two secondaries running the same OS too? I noticed you mentioned they are physically identical.
| Comment by Jonathan Streets (Inactive) [ 14/Oct/20 ] |
hi pieterwjordaanpc@gmail.com. i tested some instructions for creating core dumps in Ubuntu 20.04. I think luke.pearson would appreciate getting a full core dump - the last one was not a full core. The steps, sketched below, are: set the core-file ulimit to unlimited and check that it is correct for the running process (using NNNNN as the mongod PID); install the core-dump utility with apt install systemd-coredump; to test, we can use kill -11 NNNNN, where NNNNN is the PID of mongod, and the core dump will appear in /var/lib/systemd/coredump. The core dump can be extracted to a file with coredumpctl --output mongo.core dump NNNNN. Please could you check that core dumps are enabled, and send us the next one? thanks
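A minimal sketch of those steps, assuming NNNNN is the mongod PID; limits and paths can differ per setup:

    ulimit -c unlimited                           # or LimitCORE=infinity in the mongod systemd unit
    grep "Max core file size" /proc/NNNNN/limits  # confirm the running process has an unlimited core limit
    sudo apt install systemd-coredump             # install the core-dump handler
    sudo kill -11 NNNNN                           # test: force a segfault so a core is written
    ls /var/lib/systemd/coredump                  # dumps are stored here
    coredumpctl --output mongo.core dump NNNNN    # extract the core for PID NNNNN to mongo.core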
| Comment by Pieter Jordaan [ 14/Oct/20 ] |
@Luke Pearson, interesting. Any clue as to why it would only break on this specific server and not on any of the other secondaries? The two I changed the primary role between are physically identical (on AWS), though only this one server has ever manifested this issue.
| Comment by Luke Pearson [ 14/Oct/20 ] |
Thanks for working with us on this pieterwjordaanpc@gmail.com, the logs provided do give us some insight into this issue. The relevant log lines prior to the invariant:
As we search for the value, WiredTiger ignores the data file value as it's not visible to the current transaction. The read timestamp is 0, so I don't think that's why we're ignoring this value. We then search the history store file and don't find a value. Additionally, the subsequent search we added didn't return the record, indicating that this issue persists for an unknown duration.
| Comment by Pieter Jordaan [ 13/Oct/20 ] |
Dump and log uploaded.
Thanks, I hope it proves useful.
| Comment by Jonathan Streets (Inactive) [ 13/Oct/20 ] |
hi pieterwjordaanpc@gmail.com - please can you check /var/crash for the core dump? the new log files from the dev version may be enough though. jon
| Comment by Pieter Jordaan [ 13/Oct/20 ] |
@Jonathan it happened again just now. I'm preparing to upload the log files, though I cannot find the core dump file. Where would it be saved? /var/lib/systemd/coredumps is empty.
| Comment by Pieter Jordaan [ 12/Oct/20 ] |
@Jonathan Streets after the incident mentioned, we changed primaries and haven't had another incident yet. I have now moved the primary back, enabled core dumps, and am running the dev version mentioned. I have also rotated the log. I'm both hoping it happens again and hoping it doesn't. I will keep you posted on anything out of the ordinary. Thanks again
| Comment by Jonathan Streets (Inactive) [ 12/Oct/20 ] |
hi pieterwjordaanpc@gmail.com, This should not have any impact on the application, and will only print additional information in the event the same bug causes the server to crash. Thanks, Jon
| Comment by Jonathan Streets (Inactive) [ 02/Oct/20 ] |
hi pieterwjordaanpc@gmail.com, The core dump can be uploaded to the secure upload site we set up before. Along with the core dump, please could you also verify the MongoDB version and Ubuntu version in use? thanks
| Comment by Pieter Jordaan [ 02/Oct/20 ] |
So this happened again last night at around 10-ish SAST. After the server restarted, however, the secondaries just stopped syncing. Though there were no error messages, rs.status() showed that there was no source to sync from. Specifically, calling syncFrom produced these error logs:
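For reference, a minimal sketch of checking each member's sync source and forcing one; the host names are illustrative:

    # show each member's state and current sync source
    mongo --eval 'rs.status().members.forEach(function(m) { print(m.name, m.stateStr, m.syncSourceHost); })'
    # ask a secondary to sync from a specific member
    mongo --host secondary1.example.net:27017 --eval 'rs.syncFrom("primary1.example.net:27017")'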
| Comment by Eric Sedor [ 28/Sep/20 ] |
Thanks pieterwjordaanpc@gmail.com, we are investigating and will get back to you with what we find.
| Comment by Pieter Jordaan [ 27/Sep/20 ] |
Hi @Eric, sorry, I somehow never finished commenting here. I have uploaded a new log and the validation output as mentioned.
| Comment by Eric Sedor [ 23/Sep/20 ] |
Yes please, pieterwjordaanpc@gmail.com, we're interested in seeing validation output for any node that has thrown the errors you mentioned. It's difficult to estimate validation times, which vary due to a variety of factors (including available resources). Typical validation times are measured in hours, but that does sound like a fairly small collection, so it is likely safe to run if you make the node a secondary by failing over, and run during off-peak hours or a maintenance window. Again, you could work off of a snapshot or file-based backup if you have any doubts. What version of mongoosejs are you running?
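A minimal sketch of failing the node over before validating; the 300-second stepdown window is illustrative:

    # step down the current primary so validation can run on it as a secondary
    mongo --eval 'rs.stepDown(300)'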
| Comment by Pieter Jordaan [ 23/Sep/20 ] |
@Eric should this be run on the specific instance, or could this be done on any secondary? I believe the events collection (from the secuvue database) may be small enough to be done live, seeing that it only has 250k documents totalling around 115 MiB. How long could the operation take, ballpark? We only have a 3-host replica set without any sharding yet. We don't explicitly make use of any multi-document transactions, unless mongoosejs perhaps does this implicitly for some reason.
The only instance that crashes is the primary whose logs I've sent, so I'm guessing it should be done there?
| Comment by Eric Sedor [ 22/Sep/20 ] |
pieterwjordaanpc@gmail.com, thanks for your patience and help so far. Would you be able to run validate({full: true}) on the events collection? Validate is an impactful operation, but it could reveal important additional information about what's happening. We recommend making a copy of the dbpath and working off of a node built from the copy. Then, provide via the upload portal the validation results and the mongod logs from the execution of validate. We're also curious whether this replica set is part of a sharded cluster, and whether or not you are making use of multi-document transactions. Thanks in advance,
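A minimal sketch of running that full validation from the mongo shell; the database name is taken from this ticket and the connection details are illustrative:

    mongo secuvue --eval 'printjson(db.events.validate({full: true}))'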
| Comment by Pieter Jordaan [ 16/Sep/20 ] |
@Eric Sedor, I've added the two files you requested, as well as a PDF of the charts covering the last time it occurred. Please let me know if I can provide any more information. Thanks
| Comment by Eric Sedor [ 16/Sep/20 ] |
Of course! I've created a secure upload portal for you. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.
| Comment by Pieter Jordaan [ 16/Sep/20 ] |
@Eric Sedor sure, can I upload the log privately though?
| Comment by Pieter Jordaan [ 16/Sep/20 ] |
Thanks @Louis Williams for looking into this. This is the first time we have had any stability issues in almost 10 years. I am not sure how I should capture the query responsible; I haven't been able to pinpoint it, as it happens nondeterministically. Fortunately (ironic, I know) it happens nearly every day, sometimes twice a day. Is there some way I could log queries without severely crippling my production environment? We do around 4k operations per second.
| Comment by Eric Sedor [ 16/Sep/20 ] |
Hi pieterwjordaanpc@gmail.com, Would you please archive (tar or zip) the $dbpath/diagnostic.data directory (the contents are described here) and attach it to this ticket, along with recent logs that include the error (instead of just the error itself)? Thank you! Eric
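A minimal sketch of archiving the diagnostic.data directory and recent logs for upload; the dbpath and log path shown are the Ubuntu package defaults and may differ:

    tar -czf diagnostic.data.tar.gz -C /var/lib/mongodb diagnostic.data
    tar -czf mongod-log.tar.gz -C /var/log/mongodb mongod.log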
| Comment by Louis Williams [ 16/Sep/20 ] |
Hi pieterwjordaanpc@gmail.com, thank you very much for filing this bug report. This doesn't immediately look like data corruption to me because there are no accompanying WiredTiger errors. Someone from our Triage team will follow up with more questions, but in the meantime can you post the failing query, or some redacted version?
| Comment by Pieter Jordaan [ 16/Sep/20 ] |
The client connection was this one (IP redacted):