[SERVER-50971] Invariant failure, WT_NOTFOUND: item not found Created: 16/Sep/20  Updated: 17/Jan/24  Resolved: 04/Feb/21

Status: Closed
Project: Core Server
Component/s: Stability, WiredTiger
Affects Version/s: 4.4.0, 4.4.1
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Pieter Jordaan Assignee: Jonathan Streets (Inactive)
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File MongoDB Abort.json     PNG File image-2020-10-15-16-49-06-849.png     File mongo-log.json    
Issue Links:
Related
related to WT-6811 Allow older readers to read behind a ... Closed
related to WT-6928 Removing historical updates required ... Closed
related to SERVER-52530 Mongo v.4.4.1 crash - UnknownError -3... Closed
is related to SERVER-50880 Mongod Server Failed with signal 6 Closed
is related to SERVER-51386 Mongo 4.4.1 Crashes often Closed
Operating System: ALL
Steps To Reproduce:

Unfortunately I cannot reproduce it at will yet.

Participants:
Case:
Linked BF Score: 119

 Description   

Hi

 

We have been getting infrequent (1-2 times a day) aborts of this kind lately, on both 4.4.0 and 4.4.1. We have had years without issue on this 3-node replica set, upgrading as new versions came out. The aborts aren't tied to specific times either.

This is on an AWS i3 instance with an NVMe drive formatted as XFS.

A snippet of the logs is linked below and in the attachment.

https://pastebin.com/uqauh8H0

We have tried removing the DB path and resyncing from a secondary, but this did not fix it.

What could cause this? Could it be a specific query? Could the disk itself be corrupt? How can I help pinpoint the issue?



 Comments   
Comment by Jonathan Streets (Inactive) [ 10/Mar/22 ]

hi firdous.bano@flipkart.com, we believe this has been fixed in the latest version of 4.4.     jon

Comment by Firdous Bano [ 09/Mar/22 ]

@jonathan.streets

Any update on this issue? Or what was the solution for it? I have been facing this for the last few days on our production machine.

MongoDB version report:

MongoDB shell version v4.4.0
Build Info: {
    "version": "4.4.0",
    "openSSLVersion": "OpenSSL 1.1.0l 10 Sep 2019",
    "modules": [],
    "allocator": "tcmalloc",
    "environment": {
        "distmod": "debian92",
        "distarch": "x86_64",
        "target_arch": "x86_64"
    }
}

I see this error in mongo logs

{"t":{"$date":"2022-03-09T16:33:20.005+05:30"},"s":"F",  "c":"-",        "id":23083,   "ctx":"conn160","msg":"Invariant failure","attr":{"expr":"ret","error":"UnknownError: -31803: WT_NOTFOUND: item not found","file":"src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp","line":1598}}
{"t":{"$date":"2022-03-09T16:33:20.005+05:30"},"s":"F",  "c":"-",        "id":23083,   "ctx":"conn128","msg":"Invariant failure","attr":{"expr":"ret","error":"UnknownError: -31803: WT_NOTFOUND: item not found","file":"src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp","line":1598}}
{"t":{"$date":"2022-03-09T16:33:20.005+05:30"},"s":"F",  "c":"-",        "id":23084,   "ctx":"conn160","msg":"\n\n***aborting after invariant() failure\n\n"}
{"t":{"$date":"2022-03-09T16:33:20.005+05:30"},"s":"F",  "c":"-",        "id":23084,   "ctx":"conn128","msg":"\n\n***aborting after invariant() failure\n\n"}
{"t":{"$date":"2022-03-09T16:33:20.005+05:30"},"s":"F",  "c":"CONTROL",  "id":4757800, "ctx":"conn160","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}

Comment by Jonathan Streets (Inactive) [ 04/Feb/21 ]

 

Hi pieterwjordaanpc@gmail.com and agustin@boostup.ai,
We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Regards,

Jon

Comment by Alexander Gorrod [ 29/Dec/20 ]

Thanks for letting us know - I'm glad you found a workaround.

Comment by Agustin Barto [ 28/Dec/20 ]

We got rid of a large write-heavy collection with a TTL index, and haven't had an incident since.

Comment by Agustin Barto [ 09/Dec/20 ]

Spoke too soon. Our primary started crashing again.

Comment by Agustin Barto [ 01/Dec/20 ]

We had the same issue on the primary of the replica set. We think we fixed it by switching the primary over to another member, clearing all the MongoDB files on the affected server, and then adding it back into the replica set. We haven't had any issues so far. Maybe something got corrupted and WiredTiger can't recover gracefully? We got the idea from this post https://dba.stackexchange.com/questions/274900/why-mongod-service-is-killed-by-my-system.
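For reference, a minimal sketch of the clear-and-resync workaround described above, assuming a systemd-managed mongod and a dbPath of /var/lib/mongodb (both assumptions; verify against your own mongod.conf before deleting anything):

# 1. If the affected member is currently primary, step it down first:
mongo --eval 'rs.stepDown()'

# 2. Stop mongod and clear the data files on the affected member only:
sudo systemctl stop mongod
sudo rm -rf /var/lib/mongodb/*    # assumed dbPath -- double-check before running

# 3. Start mongod again; it rejoins the replica set and performs an initial sync
#    from another member:
sudo systemctl start mongod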

Comment by Luke Pearson [ 01/Dec/20 ]

Hi pieterwjordaanpc@gmail.com.

Only install it on the member of the replica set exhibiting the issue. It's interesting that a resync appears to have reduced the crash frequency or stopped it altogether.

Let us know how you go with the updated binary.

Comment by Pieter Jordaan [ 01/Dec/20 ]

Hi @luke.pearson

Thanks for the potential fix. Should I install this on all the replicas or just the one exhibiting the issue?

To update on the status: it hasn't crashed again for a while, though due to other factors I had to resync the node again from scratch. Since then it hasn't crashed yet.

Previously I had resynced as well and it did not fix the issue. Our access patterns and load haven't changed in any way that could account for this, though there have been periods where nothing happens for an extended time.

 

Comment by Luke Pearson [ 01/Dec/20 ]

Hi pieterwjordaanpc@gmail.com,

We’ve compiled a new binary with two bug fixes: one mentioned previously by haribabu.kommi and another that also relates to WT_NOTFOUND being returned incorrectly in WiredTiger. Since we have not yet been able to reproduce the issue, we can’t say for sure whether the fixes included in this patch address the issue you see.

We’ve added extensive logging internally to increase the diagnostic information we get if you re-encounter this issue. We would greatly appreciate it if you could run this updated binary.

Additionally, please run: db.setLogLevel(2) in your MongoDB shell on startup. As a result of the additional debugging included in this build, your log file will potentially grow to several GBs, so please be aware of that.
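A minimal way to apply and confirm that setting from the command line, assuming a local mongod on the default port (add --host and credentials as needed for your deployment):

mongo --eval 'db.setLogLevel(2)'                 # raise the default log verbosity to 2
mongo --eval 'printjson(db.getLogComponents())'  # confirm the new verbosity levels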

Thanks again for your assistance in helping us diagnose this issue. The information you have provided has been instrumental in assisting the team as we work through this issue.

The binary can be found here: download link

Comment by Pieter Jordaan [ 18/Nov/20 ]

Hi Haribabu

I took the liberty of uploading the 5 core dumps since my last upload, along with a log covering all 5 crashes. I will stop uploading new dumps from now on, until you require them again.

I am hopeful that the issue can be resolved soon.

Thanks

Comment by Haribabu Kommi [ 18/Nov/20 ]

pieterwjordaanpc@gmail.com Thanks for providing core dumps. We are able to identify a problematic code flow that can lead to this invariant failure. We are discussing a possible fix within the team. We may need your help again in verifying the fix later. As of now, we may not need any more core dumps.

Comment by Pieter Jordaan [ 18/Nov/20 ]

3 crashes last night. Should I upload the core dumps?

Comment by Pieter Jordaan [ 17/Nov/20 ]

I had two more crashes last night, about 2 hours apart. Would more core dumps help?

Comment by Haribabu Kommi [ 16/Nov/20 ]

pieterwjordaanpc@gmail.com Thanks for providing the latest core dump files. We will investigate further with the latest core dump file for the possible scenarios leading to this issue.

Comment by Pieter Jordaan [ 13/Nov/20 ]

@Luke, it happened again last night:

core.mongod.113.1af3e321b37948df847752ae4d9a2e31.3285838.1605215389000000000000.lz4

core.mongod.3.aa through core.mongod.3.ak as well as the log is being uploaded to S3

Comment by Pieter Jordaan [ 10/Nov/20 ]

@Luke it happened again shortly after the previous one. core.mongod.113.1af3e321b37948df847752ae4d9a2e31.3208829.1604957277000000000000.lz4

 

I uploaded it in chunks again, core.mongod.2.aa through core.mongod.2.af. Log is also uploaded.

Comment by Luke Pearson [ 10/Nov/20 ]

Thanks pieterwjordaanpc@gmail.com for providing us with the core dump. It's a shame the bug fix I added didn't resolve the issue but that was also within the realm of possibilities.

I've managed to piece the core back together and look at it. I don't spot anything immediately out of the ordinary, but it does rule out concerns about the snapshot and it has helped with understanding the state of the system, which has narrowed down the range of possible scenarios that lead to this failure. I'll continue investigating.

Comment by Pieter Jordaan [ 09/Nov/20 ]

Bad news with a silver lining. The server just crashed. I'm uploading the core dump and logs. 

 

core.mongod.113.1af3e321b37948df847752ae4d9a2e31.2715348.1604954701000000000000.lz4 was the original file name. I've had to split it into 1Gig chunks with the split utility. Files are called core.mongod.aa through core.mongod.ai
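For anyone following the same upload procedure, a sketch of the split and reassembly steps (the chunk size and file names are taken from the comment above):

# split the compressed core into 1 GiB chunks named core.mongod.aa, core.mongod.ab, ...
split -b 1G core.mongod.113.1af3e321b37948df847752ae4d9a2e31.2715348.1604954701000000000000.lz4 core.mongod.

# the chunks concatenate back into the original archive on the receiving side
cat core.mongod.a* > core.mongod.113.1af3e321b37948df847752ae4d9a2e31.2715348.1604954701000000000000.lz4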

 

Let me know if I can provide any additional details. 

Comment by Pieter Jordaan [ 07/Nov/20 ]

Hi Jonathan 

I have set those two last night. Hopefully it will work this time. 

Comment by Jonathan Streets (Inactive) [ 06/Nov/20 ]

hi pieterwjordaanpc@gmail.com,
it looks like there could be more than one coredump.conf in use, according to the files listed in the man page. You mentioned that you had increased the max core dump sizes in /etc/systemd/coredump.conf. Are there any other coredump.conf files in the locations the man page mentions? In particular, the MaxUse= and KeepFree= parameters are used to clean up large core files.
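For reference, a sketch of what /etc/systemd/coredump.conf can look like with the size limits raised; the values below are illustrative assumptions, not recommendations (see man coredump.conf for the defaults and exact semantics):

[Coredump]
Storage=external          # store cores on disk under /var/lib/systemd/coredump
ProcessSizeMax=64G        # largest core systemd-coredump will process
ExternalSizeMax=64G       # largest core it will keep on disk
MaxUse=200G               # cap on total disk space used by stored cores
KeepFree=50G              # always leave at least this much space free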
jon

Comment by Pieter Jordaan [ 06/Nov/20 ]

Okay so the core dump vanished. Not quite sure how that is possible. Are there any sane config settings for coredump you can recommend? 

Comment by Pieter Jordaan [ 06/Nov/20 ]

Hi, sorry for being quiet. We have been quite busy and fortunately hadn't had any issues again, until the server just crashed again. I'll upload the log and core dump again, though it is only 12 Gig (zipped) again. Not quite sure why the coredump rules aren't taking effect.

 

Regarding the fatal crash I mentioned: the same primary died, but none of the secondaries could sync from it because there was a corrupted entry in its oplog. I had to sync the primary from the older secondaries.

 

I'll run the new fixed version this time. Hopefully we don't run into any crashes again. Thanks for all the effort into getting this fixed. 

Comment by Luke Pearson [ 03/Nov/20 ]

Hi pieterwjordaanpc@gmail.com,

I've created the binary with the bug fix described in WT-6811, it can be found here: download link

To be clear, we're still unsure whether this will fix the issue you're seeing, but we would greatly appreciate it if you could run with this binary. In the event it still hits the same issue, it will log additional information (on top of what it was already logging last time), which will give us further insight into the issue you're seeing.

Thanks,
Luke

Comment by Louis Williams [ 28/Oct/20 ]

Hi pieterwjordaanpc@gmail.com, sorry that you've been experiencing these problems. What symptoms did you observe with this corrupted oplog entry? Was it a primary node or a secondary?

Comment by Pieter Jordaan [ 27/Oct/20 ]

Hi @Luke we will definitely try the new binary. 

What still seems a mystery to me is why it only happens on the one host when it is primary. Also, reading your fix's description, it seems related to transactions, which I don't explicitly make use of, unless transactions there refers to something else.

We haven't had a crash since my last message, though if history repeats itself it should occur soon.

My biggest concern regarding the whole issue is that one fatal crash we had where even the oplog had a corrupted entry. Is there anything related to your fix which could explain why that happened? Eventually we needed to resync from a secondary which was older, losing some data in the process.

Comment by Luke Pearson [ 25/Oct/20 ]

Hi pieterwjordaanpc@gmail.com,

I've found a potential cause for the symptom that occurs at the time of crash. Since finding that I've been working on a fix for it, see WT-6811. Currently that fix is being reviewed and tested before it will get merged into the main branch of MongoDB. However this fix relies on certain behaviours occurring internally in mongodb and we cannot guarantee that that is what is happening in your scenario. As such it still may not fix the issue you're seeing.

When the change is reviewed and tested we'd like to provide you with a custom binary to see if you still see the crash. Does that sound good to you?

In the meantime a core dump would still be helpful to rule out additional possibilities.

Comment by Pieter Jordaan [ 24/Oct/20 ]

Hi @Jonathan

It has occurred twice since my last report, however, no core dumps were produced whatsoever.

 

Oct 24 00:11:49 ip-172-30-5-62 systemd[1]: Started Process Core Dump (PID 1624288/UID 0).
Oct 24 00:11:49 ip-172-30-5-62 systemd[1624301]: systemd-coredump@57-1624288-0.service: Failed to set up special execution directory in /var/lib: Not a directory
Oct 24 00:11:49 ip-172-30-5-62 systemd[1624301]: systemd-coredump@57-1624288-0.service: Failed at step STATE_DIRECTORY spawning /lib/systemd/systemd-coredump: Not a directory
Oct 24 00:11:49 ip-172-30-5-62 systemd[1]: systemd-coredump@57-1624288-0.service: Succeeded.
Oct 24 00:11:49 ip-172-30-5-62 systemd-coredump[1624288]: Failed to send coredump datagram: Connection reset by peer
Oct 24 00:11:50 ip-172-30-5-62 systemd[1]: mongod.service: Main process exited, code=killed, status=6/ABRT
Oct 24 00:11:50 ip-172-30-5-62 systemd[1]: mongod.service: Failed with result 'signal'.
Oct 24 00:11:50 ip-172-30-5-62 systemd[1]: mongod.service: Scheduled restart job, restart counter is at 4.
Oct 24 00:11:50 ip-172-30-5-62 systemd[1]: Stopped MongoDB Database Server.
Oct 24 00:11:50 ip-172-30-5-62 systemd[1]: Started MongoDB Database Server.

I assume a symbolic link is not suitable for the job. I've gone ahead and used mount --bind as an alternative.
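A sketch of the bind-mount approach mentioned above; the /data/coredump directory is an assumed location on the larger data drive:

mkdir -p /data/coredump
mount --bind /data/coredump /var/lib/systemd/coredump

# to make it persistent across reboots, an equivalent fstab entry would be:
# /data/coredump  /var/lib/systemd/coredump  none  bind  0  0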

 

I trust a crash will happen again soon, hopefully this time with a full core dump...

If you have an alternative approach, please let me know.

Thanks

 

Comment by Jonathan Streets (Inactive) [ 23/Oct/20 ]

hi pieterwjordaanpc@gmail.com, just checking in. do you have any updates for us ? thanks. jon

Comment by Eric Sedor [ 19/Oct/20 ]

Hi pieterwjordaanpc@gmail.com; I just wanted to chime back in to thank you again for your continued help investigating!

Comment by Pieter Jordaan [ 19/Oct/20 ]

Hi Jonathan, unfortunately that is exactly the size of the core dump file in /var/lib/systemd/coredump. I've checked, and the ulimit for core dumps for my mongod process is indeed unlimited.

 

However, inspecting /etc/systemd/coredump.conf, I see the defaults are in fact 2 GiB. I'm not sure I would be able to reach 32 GiB as my OS drive is not large enough.

I've therefore made a link to my data drive for storing coredumps. I hope it will work now.

Comment by Jonathan Streets (Inactive) [ 19/Oct/20 ]

hi pieterwjordaanpc@gmail.com. the core file appears to be truncated. the uploaded (compressed) file is 1104408543 bytes (1.03 GB), and when we uncompress it it is 2147483648 bytes (2.00 GB). the file uploader on s3 supports file sizes up to 4GB. 

when we try using the core dump, we get an error like expected core file size >= 32241004544, found: 2147483648 

  1. how big is the file that you are trying to upload ?
  2. did you get any errors while uploading it - what were they ?

thanks. jon

Comment by Pieter Jordaan [ 19/Oct/20 ]

@luke.pearson@mongodb.com and @jonathan.streets@mongodb.com 

Another crash occurred. Core dump and log sent. The core dump is substantially larger now; I hope this will provide more insight.

 

Fortunately it wasn't as fatal as the other day's crash where even the oplog was corrupted.

Comment by Zach Bjornson [ 15/Oct/20 ]

We're hitting this same issue with 4.4.1 in a standalone configuration. I attached the relevant log snippet, which looks the same as the OP's log. Let me know if there's anything else I can provide. (We're about to roll back to 4.2 though, so likely can't provide much else.)

mongo-log.json

Comment by Pieter Jordaan [ 14/Oct/20 ]

@Jonathan Coredumps were definitely enabled. I have additionally now installed systemd-coredump. I have run systemctl daemon-reload and restarted the mongod service too.

I trust this should suffice.

 

@Luke the two nodes between which I interchanged the primary role are on the same OS: Ubuntu 20.04 x86_64. Same kernel, same XFS filesystem, and they were started at roughly the same time.

There is another secondary, which has lower priority, and hence never becomes primary. It is running the same Mongo on Ubuntu 16.04. It is also a different AWS instance type. Like the other secondary, this server has never manifested these kinds of issues.

Comment by Luke Pearson [ 14/Oct/20 ]

Hi pieterwjordaanpc@gmail.com, are the two secondaries running the same OS too? I notice you mentioned they are physically identical.

Comment by Jonathan Streets (Inactive) [ 14/Oct/20 ]

hi pieterwjordaanpc@gmail.com. i tested some instructions for creating core dumps in Ubuntu 20.04. I think luke.pearson would appreciate getting a full core dump - the last one was not a full core.

To set ulimit:

  • edit /lib/systemd/system/mongod.service and add LimitCORE=infinity with the other limits (see the sketch after this list)
  • systemctl daemon-reload
  • systemctl start mongod
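A sketch of how the [Service] section of mongod.service can look after that edit; the surrounding Limit* lines are illustrative of the stock unit file and may differ on your system, LimitCORE=infinity is the only addition being described:

[Service]
# ... User=, ExecStart= and the other existing entries stay unchanged ...
LimitFSIZE=infinity
LimitNOFILE=64000
LimitNPROC=64000
LimitCORE=infinity        # new line: allow unlimited-size core dumps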

To check that ulimit is correct - (using NNNNN as the mongod PID)

 root@ip-10-122-0-169:/# cat /proc/NNNNN/limits | grep core
 Max core file size unlimited unlimited bytes

Install the core-dump utility with apt install systemd-coredump

To test, we can use kill -11 NNNNN, where NNNNN is the PID of mongod. The core dump will then appear in the coredumpctl listing:

root@ip-10-122-0-169:/# coredumpctl
TIME                            PID   UID   GID SIG COREFILE  EXE
Wed 2020-10-14 20:21:25 UTC   NNNNN     0     0  11 present   /usr/bin/mongod

The core dump can be extracted to a file with coredumpctl --output mongo.core dump NNNNN

Please could you check that core dumps are enabled, and send us the next one?

thanks
jon

Comment by Pieter Jordaan [ 14/Oct/20 ]

@Luke Pearson, Interesting

Any clue as to why it would only break on this specific server and not on any of the other secondaries?

The two I changed primary role between are identical physically (on AWS), though only this one server ever manifested this issue.

Comment by Luke Pearson [ 14/Oct/20 ]

Thanks for working with us on this pieterwjordaanpc@gmail.com, the logs provided do give us some insight into this issue.

The relevant log lines prior to the invariant:

{"t":{"$date":"2020-10-13T18:05:02.401+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"conn1725","msg":"WiredTiger message","attr":{"message":"Checking history store as no ondisk value is visible to us, on disk tw: start_ts: (1602612301, 1432), start_txn: 206611506, stop_ts: (4294967295, 4294967295), stop_txn: 18446744073709551605"}}
{"t":{"$date":"2020-10-13T18:05:02.401+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"conn1725","msg":"WiredTiger message","attr":{"message":"has read no, read timestamp: (0, 0), txn isolation: 2, stable_timestamp: (1602612302, 517), oldest_timestamp: (1602612297, 517), txnid: 0"}}
{"t":{"$date":"2020-10-13T18:05:02.407+00:00"},"s":"F",  "c":"-",        "id":23083,   "ctx":"conn1725","msg":"Invariant failure","attr":{"expr":"ret","error":"UnknownError: -31803: WT_NOTFOUND: item not found","file":"src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp","line":1609}}
{"t":{"$date":"2020-10-13T18:05:02.407+00:00"},"s":"F",  "c":"-",        "id":23084,   "ctx":"conn1725","msg":"\n\n***aborting after invariant() failure\n\n"}
{"t":{"$date":"2020-10-13T18:05:02.412+00:00"},"s":"F",  "c":"CONTROL",  "id":4757800, "ctx":"conn1725","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}

As we search for the value, WiredTiger ignores the data file value as it's not visible to the current transaction. The read timestamp is 0, so I don't think that's why we're ignoring this value. We then search the history store file and don't find a value. Additionally, the subsequent search we added didn't return the record, indicating that this issue persists for an unknown duration.

Comment by Pieter Jordaan [ 13/Oct/20 ]

Dump and log uploaded.

 

Thanks, I hope it proves useful

Comment by Jonathan Streets (Inactive) [ 13/Oct/20 ]

hi pieterwjordaanpc@gmail.com - please can you check /var/crash for the core dump? the new log files from the dev version may be enough though. jon

Comment by Pieter Jordaan [ 13/Oct/20 ]

@Jonathan it happened again just now. I'm preparing to upload the log files, though I cannot find the core dump file. Where would it be saved? /var/lib/systemd/coredump is empty.

Comment by Pieter Jordaan [ 12/Oct/20 ]

@Jonathan Streets after the incident mentioned we have changed primaries. We haven't had another incident yet. I have now moved primary back, enabled core dumps and am running the dev version mentioned. I have also rotated the log.

I both hope it happens again and hope it doesn't.

I will keep you posted of anything out of the ordinary.

Thanks again

Comment by Jonathan Streets (Inactive) [ 12/Oct/20 ]

hi pieterwjordaanpc@gmail.com,
were you able to generate a core dump as described above? Also, the team has built custom binaries for Ubuntu 20.04 x86_64 with additional logging. The downloadable binaries are here (public link).

This should not have any impact on the application, and will only print additional information in the event the same bug causes the server to crash.

Thanks, Jon

Comment by Jonathan Streets (Inactive) [ 02/Oct/20 ]

hi pieterwjordaanpc@gmail.com
to investigate this further, we will need to get a core dump. By default, ulimit is set to 0 to disable core dumps because they can generate large files. To enable a core dump, please set the ulimit with ulimit -c unlimited in the shell, or set LimitCORE=infinity in the /lib/systemd/system/mongod.service file (for systemd on Ubuntu).

The core dump can be uploaded to the secure upload site we set up before. Along with the core dump, please could you also verify the MongoDB version and Ubuntu version in use ?

thanks
jon

Comment by Pieter Jordaan [ 02/Oct/20 ]

So this happened again last night at around 10-ish SAST. After the server restarted, the secondaries just stopped syncing. There were no error messages, but rs.status() shows that there is no source to sync from. Specifically, calling syncFrom produced these error logs:

{"t":{"$date":"2020-10-02T06:35:17.498+00:00"},"s":"I",  "c":"REPL",     "id":21080,   "ctx":"BackgroundSync","msg":"Clearing sync source to choose a new one","attr":{"syncSource":"ip-172-30-5-62.eu-west-1.compute.internal:27017"}}
{"t":{"$date":"2020-10-02T06:35:17.498+00:00"},"s":"I",  "c":"REPL",     "id":21798,   "ctx":"BackgroundSync","msg":"Could not find member to sync from"}
{"t":{"$date":"2020-10-02T06:35:20.620+00:00"},"s":"I",  "c":"REPL",     "id":21760,   "ctx":"SyncSourceFeedback","msg":"SyncSourceFeedback error sending update","attr":{"syncTarget":"ip-172-30-5-62.eu-west-1.compute.internal:27017","error":{"code":119,"codeName":"InvalidSyncSource","errmsg":"Sync source was cleared. Was ip-172-30-5-62.eu-west-1.compute.internal:27017"}}}
{"t":{"$date":"2020-10-02T06:36:17.514+00:00"},"s":"I",  "c":"REPL",     "id":21799,   "ctx":"BackgroundSync","msg":"Sync source candidate chosen","attr":{"syncSource":"ip-172-30-5-62.eu-west-1.compute.internal:27017"}}
{"t":{"$date":"2020-10-02T06:36:17.516+00:00"},"s":"I",  "c":"REPL",     "id":21088,   "ctx":"BackgroundSync","msg":"Changed sync source","attr":{"oldSyncSource":"empty","newSyncSource":"ip-172-30-5-62.eu-west-1.compute.internal:27017"}}
{"t":{"$date":"2020-10-02T06:36:17.642+00:00"},"s":"I",  "c":"REPL",     "id":21275,   "ctx":"ReplCoordExtern-10","msg":"Recreating cursor for oplog fetcher due to error","attr":{"lastOpTimeFetched":{"ts":{"$timestamp":{"t":1601582607,"i":7731}},"t":432},"attemptsRemaining":1,"error":"InvalidBSON: Error while getting the next batch in the oplog fetcher :: caused by :: invalid bson in element with field name 'zoneAlias' in object with unknown _id"}}
{"t":{"$date":"2020-10-02T06:36:17.717+00:00"},"s":"I",  "c":"REPL",     "id":21274,   "ctx":"ReplCoordExtern-10","msg":"Error returned from oplog query (no more query restarts left)","attr":{"error":"InvalidBSON: Error while getting the next batch in the oplog fetcher :: caused by :: invalid bson in element with field name 'zoneAlias' in object with unknown _id"}}
{"t":{"$date":"2020-10-02T06:36:17.717+00:00"},"s":"W",  "c":"REPL",     "id":21121,   "ctx":"BackgroundSync","msg":"Fetcher got invalid BSON while querying oplog. Blacklisting sync source","attr":{"syncSource":"ip-172-30-5-62.eu-west-1.compute.internal:27017","blacklistDurationSeconds":60}}
{"t":{"$date":"2020-10-02T06:36:17.717+00:00"},"s":"I",  "c":"REPL",     "id":21080,   "ctx":"BackgroundSync","msg":"Clearing sync source to choose a new one","attr":{"syncSource":"ip-172-30-5-62.eu-west-1.compute.internal:27017"}}
{"t":{"$date":"2020-10-02T06:36:17.717+00:00"},"s":"I",  "c":"REPL",     "id":21798,   "ctx":"BackgroundSync","msg":"Could not find member to sync from"}
{"t":{"$date":"2020-10-02T06:36:22.516+00:00"},"s":"I",  "c":"REPL",     "id":21760,   "ctx":"SyncSourceFeedback","msg":"SyncSourceFeedback error sending update","attr":{"syncTarget":"ip-172-30-5-62.eu-west-1.compute.internal:27017","error":{"code":119,"codeName":"InvalidSyncSource","errmsg":"Sync source was cleared. Was ip-172-30-5-62.eu-west-1.compute.internal:27017"}}}
{"t":{"$date":"2020-10-02T06:37:17.735+00:00"},"s":"I",  "c":"REPL",     "id":21799,   "ctx":"BackgroundSync","msg":"Sync source candidate chosen","attr":{"syncSource":"ip-172-30-5-62.eu-west-1.compute.internal:27017"}}
{"t":{"$date":"2020-10-02T06:37:17.736+00:00"},"s":"I",  "c":"REPL",     "id":21088,   "ctx":"BackgroundSync","msg":"Changed sync source","attr":{"oldSyncSource":"empty","newSyncSource":"ip-172-30-5-62.eu-west-1.compute.internal:27017"}}
{"t":{"$date":"2020-10-02T06:37:17.875+00:00"},"s":"I",  "c":"REPL",     "id":21275,   "ctx":"ReplCoordExtern-12","msg":"Recreating cursor for oplog fetcher due to error","attr":{"lastOpTimeFetched":{"ts":{"$timestamp":{"t":1601582607,"i":7731}},"t":432},"attemptsRemaining":1,"error":"InvalidBSON: Error while getting the next batch in the oplog fetcher :: caused by :: invalid bson in element with field name 'zoneAlias' in object with unknown _id"}}
{"t":{"$date":"2020-10-02T06:37:17.951+00:00"},"s":"I",  "c":"REPL",     "id":21274,   "ctx":"ReplCoordExtern-12","msg":"Error returned from oplog query (no more query restarts left)","attr":{"error":"InvalidBSON: Error while getting the next batch in the oplog fetcher :: caused by :: invalid bson in element with field name 'zoneAlias' in object with unknown _id"}}
{"t":{"$date":"2020-10-02T06:37:17.951+00:00"},"s":"W",  "c":"REPL",     "id":21121,   "ctx":"BackgroundSync","msg":"Fetcher got invalid BSON while querying oplog. Blacklisting sync source","attr":{"syncSource":"ip-172-30-5-62.eu-west-1.compute.internal:27017","blacklistDurationSeconds":60}}
{"t":{"$date":"2020-10-02T06:37:17.951+00:00"},"s":"I",  "c":"REPL",     "id":21080,   "ctx":"BackgroundSync","msg":"Clearing sync source to choose a new one","attr":{"syncSource":"ip-172-30-5-62.eu-west-1.compute.internal:27017"}}
{"t":{"$date":"2020-10-02T06:37:17.951+00:00"},"s":"I",  "c":"REPL",     "id":21798,   "ctx":"BackgroundSync","msg":"Could not find member to sync from"}
{"t":{"$date":"2020-10-02T06:37:22.737+00:00"},"s":"I",  "c":"REPL",     "id":21760,   "ctx":"SyncSourceFeedback","msg":"SyncSourceFeedback error sending update","attr":{"syncTarget":"ip-172-30-5-62.eu-west-1.compute.internal:27017","error":{"code":119,"codeName":"InvalidSyncSource","errmsg":"Sync source was cleared. Was ip-172-30-5-62.eu-west-1.compute.internal:27017"}}}

Comment by Eric Sedor [ 28/Sep/20 ]

Thanks pieterwjordaanpc@gmail.com, we are investigating and will get back to you with what we find.

Comment by Pieter Jordaan [ 27/Sep/20 ]

Hi @Eric, sorry, I somehow never finished posting this here. I have uploaded a new log and the validation output as mentioned.

myset:SECONDARY> db.events.validate({full: true})
{
        "nInvalidDocuments" : 0,
        "nrecords" : 254237,
        "nIndexes" : 2,
        "keysPerIndex" : {
                "account_1_active_1_createdAt_-1" : 254237,
                "_id_" : 254237
        },
        "indexDetails" : {
                "account_1_active_1_createdAt_-1" : {
                        "valid" : true,
                        "warnings" : [
                                "Could not complete validation of table:index/782-692351979634215595. This is a transient issue as the collection was actively in use by other operations."
                        ]
                },
                "_id_" : {
                        "valid" : true,
                        "warnings" : [
                                "Could not complete validation of table:index/784-692351979634215595. This is a transient issue as the collection was actively in use by other operations."
                        ]
                }
        },
        "ns" : "secuvue.events",
        "valid" : true,
        "warnings" : [
                "Could not complete validation of table:collection/781-692351979634215595. This is a transient issue as the collection was actively in use by other operations.",
                "Could not complete validation of table:index/782-692351979634215595. This is a transient issue as the collection was actively in use by other operations.",
                "Could not complete validation of table:index/784-692351979634215595. This is a transient issue as the collection was actively in use by other operations."
        ],
        "errors" : [ ],
        "extraIndexEntries" : [ ],
        "missingIndexEntries" : [ ],
        "ok" : 1,
        "$clusterTime" : {
                "clusterTime" : Timestamp(1600890139, 1427),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        },
        "operationTime" : Timestamp(1600890138, 214)
}

Comment by Eric Sedor [ 23/Sep/20 ]

Yes please, pieterwjordaanpc@gmail.com, we're interested in seeing validation output for any node that's thrown the errors you mentioned.

It's difficult to estimate validation times, which vary due to a variety of factors (including available resources). Typical validation times are measured in hours, but that does sound like a fairly small collection, so it is likely safe to run if you make the node a secondary by failing over and run it during off-peak hours or a maintenance window. Again, you could work off of a snapshot or file-based backup if you have any doubts.

What version of mongoosejs are you running?

Comment by Pieter Jordaan [ 23/Sep/20 ]

@Eric should this be run on the specific instance, or could it be done on any secondary? I believe the events collection (from the secuvue database) may be small enough to validate live, seeing as it only has 250k documents totalling around 115 MiB. Roughly how long could the operation take, ballpark?

We only have a 3-host replica set without any sharding yet. We don't explicitly make use of any multi-document transactions unless mongoosejs perhaps does this implicitly for some reason.

 

The only instance that crashes is the primary whose logs I've sent, so I'm guessing it should be done there?

Comment by Eric Sedor [ 22/Sep/20 ]

pieterwjordaanpc@gmail.com, thanks for your patience and help so far.

Would you be able to run validate({full: true}) on the events collection?

Validate is an impactful operation but could reveal important additional information about what's happening. We recommend making a copy of the dbpath and working off of a node built from the copy.

Then, provide via the upload portal the validation results and the mongod logs during the execution of validate.

We're also curious if this replica set is part of a sharded cluster, and whether or not you are making use of multi-document transactions?

Thanks in advance,
Eric

Comment by Pieter Jordaan [ 16/Sep/20 ]

@Eric Sedor, I've added the two files you requested, as well as a PDF of the charts covering the last time it occurred.

Please let me know if I can provide any more information.

Thanks

Comment by Eric Sedor [ 16/Sep/20 ]

Of course! I've created a secure upload portal for you. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Comment by Pieter Jordaan [ 16/Sep/20 ]

@Eric Sedor sure, can I upload the log privately though?

Comment by Pieter Jordaan [ 16/Sep/20 ]

Thanks @Louis Williams for looking into this. This is the first time we have had any stability issues in almost 10 years.

I am not sure how I should capture the query responsible. I haven't been able to pinpoint it, as it happens nondeterministically. Fortunately (ironic I know) it happens nearly every day, sometimes twice a day.

Is there some way I could log queries without severely crippling my production environment? We do around 4k operations per second.

Comment by Eric Sedor [ 16/Sep/20 ]

Hi pieterwjordaanpc@gmail.com,

Would you please archive (tar or zip) the $dbpath/diagnostic.data directory (the contents are described here) and attach it to this ticket, along with recent logs that include the error (instead of just the error itself)?

Thank you!

Eric

Comment by Louis Williams [ 16/Sep/20 ]

Hi pieterwjordaanpc@gmail.com, thank you very much for filing this bug report. This doesn't immediately look like data corruption to me because there are no accompanying WiredTiger errors. Someone from our Triage team will follow up with more questions, but in the meantime can you post the failing query, or some redacted version?

Comment by Pieter Jordaan [ 16/Sep/20 ]

client connection was this one (ip redacted):

{"t":{"$date":"2020-09-15T20:42:44.595+00:00"},"s":"I", "c":"NETWORK", "id":51800, "ctx":"conn429","msg":"client metadata","attr":{"remote":"********","client":"conn429","doc":{"driver":{"name":"nodejs|Mongoose","version":"3.5.7"},"os":{"type":"Linux","name":"linux","architecture":"x64","version":"5.4.0-1024-aws"},"platform":"'Node.j
s v14.4.0, LE (unified)","version":"3.5.7|5.9.15"}}}
