[SERVER-52926] [3.6] Mongo db crash with Got signal: 11 - KVDatabaseCatalogEntryBase::AddCollectionChange::rollback Created: 18/Nov/20  Updated: 16/Oct/21  Resolved: 04/Dec/20

Status: Closed
Project: Core Server
Component/s: Catalog
Affects Version/s: 3.6.10
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Chao Xu Assignee: Benety Goh
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2020-11-19 at 14.33.34.png     Text File mongodb-1.log     Text File mongodb-2.log     Text File mongodb.log    
Issue Links:
Duplicate
duplicates SERVER-37443 Reuse in-memory catalog objects throu... Closed
Related
is related to SERVER-38419 Crash in rename collection Closed
is related to SERVER-49384 the replset primary crashed without a... Closed
Operating System: ALL
Sprint: Execution Team 2020-12-14
Participants:

 Description   

MongoDB shell version v3.6.10
git version: 3e3ab85bfb98875af3bc6e74eeb945b0719f69c8
OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
allocator: tcmalloc
modules: none
build environment:
distmod: rhel70
distarch: x86_64
target_arch: x86_64

 

mongodb.log



 Comments   
Comment by apocarteres [ 10/May/21 ]

By the way, this also affects 3.6.18.

Comment by Benety Goh [ 04/Dec/20 ]

Symptoms match those reported in SERVER-38419, SERVER-49384, and SERVER-37443. The root cause was a defect in the in-memory catalog that was fixed in 4.2 - see SERVER-37443.

Comment by Chao Xu [ 20/Nov/20 ]

Hi Dima,

I'm sorry for the confusion on this case. My colleague left the node in its crashed state at the time and did not try to restart it, but I mistakenly thought it had been restarted and that this error log was from after the restart. So you have already explained the root cause of this crash. I restarted the node this morning. I don't think this is a bug; we can close this case.

Thanks again.

have a good life.

Chao

Comment by Chao Xu [ 19/Nov/20 ]

Hi Dmitry,

Thanks a lot, but that's very strange; I don't think we have such a huge amount of data, so how did this happen?

Unfortunately, a few hours ago another node of this cluster crashed too, so getting the cluster back to work is my priority. Could you give me some solutions? Thanks again.

The attached mongodb.log covers the hour before this node crashed.

Have a good day.

Comment by Dmitry Agranat [ 19/Nov/20 ]

xuchao528610@gmail.com I will keep looking at the potential cause of the segmentation fault. Can you upload the full mongod log covering the time of the reported event to the same secure location?

Comment by Dmitry Agranat [ 19/Nov/20 ]

Hi xuchao528610@gmail.com,

I think this is a rare circumstance where the Segmentation fault you are reporting might just be a symptom and not the cause.

Looking at your cluster, you have a PSA (Primary-Secondary-Arbiter) deployment with read concern majority enabled and with the Secondary member in recovery for the last 70 days. This creates enormous cache pressure on the Primary, which is barely operational with its cache almost 100% full. Another indication that the system is struggling is the number of cache overflow table entries, which is about 6 billion.
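
(For reference, WiredTiger cache pressure and cache overflow activity can be inspected from the mongo shell with serverStatus. The sketch below is a minimal illustration; the statistic names are taken from 3.6-era output and may differ between versions, so verify them against your own serverStatus result.)

// Minimal sketch (mongo shell) for checking WiredTiger cache pressure.
// Statistic names are assumed from 3.6-era serverStatus output and may
// differ slightly on other versions -- verify against your own output.
var cache = db.serverStatus().wiredTiger.cache;

var maxBytes   = cache["maximum bytes configured"];
var usedBytes  = cache["bytes currently in the cache"];
var dirtyBytes = cache["tracked dirty bytes in the cache"];

print("cache used:  " + (100 * usedBytes / maxBytes).toFixed(1) + "% of " + maxBytes + " bytes");
print("cache dirty: " + (100 * dirtyBytes / maxBytes).toFixed(1) + "%");

// Cache overflow (lookaside) activity -- very large values suggest the primary
// is spilling history to disk, e.g. in a PSA deployment where the secondary is
// down and enableMajorityReadConcern is left at its default of true.
printjson({ "cache overflow table entries": cache["cache overflow table entries"] });
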

I do not believe a system under such extreme conditions can be expected to operate without issues, which is what you have experienced. For this issue, and for the overall sizing and tuning of your cluster, we'd encourage you to start by asking our community for help by posting on the MongoDB Developer Community Forums.

Thanks,
Dima

Comment by Chao Xu [ 19/Nov/20 ]

@Dmitry Agranat

Hi Dmitry,

Thanks for your help. I uploaded two files (metrics.2020-11-17T21-49-16Z-00000, metrics.interim); hopefully they will help with your investigation.

Thanks,

Chao

Comment by Dmitry Agranat [ 18/Nov/20 ]

Hi xuchao528610@gmail.com,

I think I understand what's going on here, but to validate my theory we'll also need the archived diagnostic.data directory located under the dbpath. You can upload it to this secure uploader.

Thanks,
Dima
