[SERVER-61752] MongoDB crashed with Invalid access at address + Got signal: 11 (Segmentation fault) Created: 27/Nov/21  Updated: 07/Feb/22  Resolved: 07/Feb/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Vy Nguyen Tan Assignee: Edwin Zhou
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

I am running a MongoDB replica set with 3 nodes. Today the primary crashed with the following error messages:

{"t":{"$date":"2021-11-27T08:52:53.053+07:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn570","msg":"Slow query","attr":{"type":"command","ns":"db_raiden.$cmd","command":{"delete":"deal","ordered":true,"lsid":{"id":{"$uuid":"4fc60658-0d71-4270-89a8-04f61f2e2553"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1637977971,"i":1}},"signature":{"hash":{"$binary":{"base64":"ckPcpj/2DY7V3cnI3qfMbl3lM6c=","subType":"0"}},"keyId":7014040670813814788}},"$db":"db_raiden"},"numYields":14,"reslen":230,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":16}},"ReplicationStateTransition":{"acquireCount":{"w":17}},"Global":{"acquireCount":{"r":1,"w":16}},"Database":{"acquireCount":{"w":16}},"Collection":{"acquireCount":{"w":16}},"Mutex":{"acquireCount":{"r":2}}},"flowControl":{"acquireCount":15,"timeAcquiringMicros":14},"storage":{},"protocol":"op_msg","durationMillis":161}}
{"t":{"$date":"2021-11-27T08:52:54.792+07:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"WTCheckpointThread","msg":"WiredTiger message","attr":{"message":"[1637977974:792733][1024:0x7f2a3066e700], WT_SESSION.checkpoint: [WT_VERB_CHECKPOINT_PROGRESS] saving checkpoint snapshot min: 32552161, snapshot max: 32552161 snapshot count: 0, oldest timestamp: (1637977969, 26) , meta checkpoint timestamp: (1637977973, 332) base write gen: 60140142"}}
{"t":{"$date":"2021-11-27T08:52:55.064+07:00"},"s":"F", "c":"CONTROL", "id":4757800, "ctx":"thread673","msg":"Writing fatal message","attr":{"message":"Invalid access at address: 0"}}
{"t":{"$date":"2021-11-27T08:52:55.064+07:00"},"s":"F", "c":"CONTROL", "id":4757800, "ctx":"thread673","msg":"Writing fatal message","attr":{"message":"Got signal: 11 (Segmentation fault).\n"}}
{"t":{"$date":"2021-11-27T08:52:55.296+07:00"},"s":"I", "c":"CONTROL", "id":31431, "ctx":"thread673","msg":"BACKTRACE: {bt}","attr":{"bt":{"backtrace":[{"a":"5643CD8A23AA","b":"5643CAAD9000","o":"2DC93AA","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.606","s+":"1EA"},{"a":"5643CD8A3E39","b":"5643CAAD9000","o":"2DCAE39","s":"_ZN5mongo15printStackTraceEv","s+":"29"},{"a":"5643CD8A130C","b":"5643CAAD9000","o":"2DC830C","s":"_ZN5mongo12_GLOBAL__N_124abruptQuitWithAddrSignalEiP9siginfo_tPv","s+":"EC"},{"a":"7F2A3B5C7630","b":"7F2A3B5B8000","o":"F630","s":"_L_unlock_13","s+":"34"},{"a":"5643CDA21E4B","b":"5643CAAD9000","o":"2F48E4B","s":"_ZN8tcmalloc11ThreadCache21ReleaseToCentralCacheEPNS0_8FreeListEji","s+":"EB"},{"a":"5643CDA22075","b":"5643CAAD9000","o":"2F49075","s":"_ZN8tcmalloc11ThreadCache11ListTooLongEPNS0_8FreeListEj","s+":"35"},{"a":"5643CBC465FF","b":"5643CAAD9000","o":"116D5FF","s":"__free_skip_list","s+":"7F"},{"a":"5643CBC46686","b":"5643CAAD9000","o":"116D686","s":"__free_skip_array","s+":"46"},{"a":"5643CBC47146","b":"5643CAAD9000","o":"116E146","s":"__wt_page_out","s+":"696"},{"a":"5643CBBA02D3","b":"5643CAAD9000","o":"10C72D3","s":"__wt_evict","s+":"1423"},{"a":"5643CBB9705D","b":"5643CAAD9000","o":"10BE05D","s":"__evict_page","s+":"6BD"},{"a":"5643CBB978D8","b":"5643CAAD9000","o":"10BE8D8","s":"__evict_lru_pages","s+":"78"},{"a":"5643CBB9C6C0","b":"5643CAAD9000","o":"10C36C0","s":"__wt_evict_thread_run","s+":"70"},{"a":"5643CBC015E9","b":"5643CAAD9000","o":"11285E9","s":"__thread_run","s+":"39"},{"a":"7F2A3B5BFEA5","b":"7F2A3B5B8000","o":"7EA5","s":"start_thread","s+":"C5"},{"a":"7F2A3B2E89FD","b":"7F2A3B1EA000","o":"FE9FD","s":"clone","s+":"6D"}],"processInfo":{"mongodbVersion":"4.4.9","gitVersion":"b4048e19814bfebac717cf5a880076aa69aba481","compiledModules":[],"uname":{"sysname":"Linux","release":"3.10.0-1160.42.2.el7.x86_64","version":"#1 SMP Tue Sep 7 14:49:57 UTC 2021","machine":"x86_64"},"somap":[{"b":"5643CAAD9000","elfType":3,"buildId":"375E0455B64A8CCBA2B20814F34100164730166F"},{"b":"7F2A3B5B8000","path":"/lib64/libpthread.so.0","elfType":3,"buildId":"E10CC8F2B932FC3DAEDA22F8DAC5EBB969524E5B"},{"b":"7F2A3B1EA000","path":"/lib64/libc.so.6","elfType":3,"buildId":"A317B42B15368ADCAE21C11107691A03EC91059D"}]}}}}
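
Log entries in this format are one JSON document per line, so the fatal messages and the backtrace symbols can be pulled out with a short script. The following is a minimal Python sketch, assuming each entry sits on a single line and is valid JSON; the helper names `fatal_messages` and `backtrace_symbols` are hypothetical, and the sample entries are shortened copies of the ones in this report.

```python
import json


def _parse(line):
    """Return the decoded entry, or None if the line is not a JSON object."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return None  # blank, truncated, or otherwise non-JSON line
    return entry if isinstance(entry, dict) else None


def fatal_messages(lines):
    """Collect attr.message from every entry whose severity field "s" is "F"."""
    messages = []
    for line in lines:
        entry = _parse(line)
        if entry is not None and entry.get("s") == "F":
            messages.append(entry.get("attr", {}).get("message", ""))
    return messages


def backtrace_symbols(lines):
    """Collect the symbol name "s" of each frame under attr.bt.backtrace."""
    symbols = []
    for line in lines:
        entry = _parse(line)
        if entry is None:
            continue
        frames = entry.get("attr", {}).get("bt", {}).get("backtrace", [])
        symbols.extend(frame["s"] for frame in frames if "s" in frame)
    return symbols


# Shortened sample entries in the same shape as the log lines in this report.
SAMPLE = [
    '{"t":{"$date":"2021-11-27T08:52:55.064+07:00"},"s":"F","c":"CONTROL",'
    '"id":4757800,"ctx":"thread673","msg":"Writing fatal message",'
    '"attr":{"message":"Invalid access at address: 0"}}',
    '{"t":{"$date":"2021-11-27T08:52:55.296+07:00"},"s":"I","c":"CONTROL",'
    '"id":31431,"ctx":"thread673","msg":"BACKTRACE: {bt}",'
    '"attr":{"bt":{"backtrace":[{"a":"5643CBC47146","s":"__wt_page_out"},'
    '{"a":"5643CBBA02D3","s":"__wt_evict"}]}}}',
]

if __name__ == "__main__":
    print(fatal_messages(SAMPLE))
    print(backtrace_symbols(SAMPLE))
```

Run against the full mongod.log (for example `fatal_messages(open(path))`, where `path` is wherever the log lives on the node), this would list every fatal entry along with the frames leading up to the crash.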

MongoDB info:

$ mongod --version
db version v4.4.9
Build Info: {
 "version": "4.4.9",
 "gitVersion": "b4048e19814bfebac717cf5a880076aa69aba481",
 "openSSLVersion": "OpenSSL 1.0.1e-fips 11 Feb 2013",
 "modules": [],
 "allocator": "tcmalloc",
 "environment": {
 "distmod": "rhel70",
 "distarch": "x86_64",
 "target_arch": "x86_64"
 }
}

Resource usage is very low:

  • RAM usage: 3.92/32GB.
  • CPU: ~ 1/8 core.
  • Disk: 13%.

 



 Comments   
Comment by Edwin Zhou [ 07/Feb/22 ]

Hi ntv1090@gmail.com,

We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Best,
Edwin

Comment by Edwin Zhou [ 28/Jan/22 ]

Hi ntv1090@gmail.com,

We still need additional information to diagnose the problem. If this is still an issue for you, would you please let us know whether you've been able to run validate and, if so, what the result was?

Best,
Edwin

Comment by Edwin Zhou [ 07/Jan/22 ]

Hi ntv1090@gmail.com,

Thank you for uploading your log file and for your patience as we investigate this seg fault.

Since this incident, have you experienced repeat seg faults on this cluster? If so, can you provide how frequently these crashes are occurring?

An invalid access may suggest that there's corruption in the document data. The seg fault coincided with heavy write operations on db_raiden.$cmd and db_raiden.deal. My guidance would be to run validate on the collections.

After running validate on the node that experienced the seg fault, can you please let us know whether it identified any inconsistencies or whether it passed?
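
For reference, a minimal sketch of that check from Python, assuming the pymongo driver is installed and that `deal` is the collection of interest. The helper name `validate_collections` and the connection string are hypothetical. It issues the server's `validate` command once per collection and summarizes the `valid`, `errors`, and `warnings` fields of each reply.

```python
def validate_collections(db, names, full=True):
    """Run the validate command on each named collection and summarize the reply.

    `db` is expected to be a pymongo Database (anything with a compatible
    `command` method works, which keeps the helper easy to exercise offline).
    """
    report = {}
    for name in names:
        # Sends {validate: <name>, full: <bool>} to the server.
        result = db.command("validate", name, full=full)
        report[name] = {
            "valid": bool(result.get("valid")),
            "errors": list(result.get("errors", [])),
            "warnings": list(result.get("warnings", [])),
        }
    return report


def main():
    # Third-party driver; imported here so the helper above has no dependencies.
    from pymongo import MongoClient

    # Hypothetical URI: connect directly to the node that crashed.
    client = MongoClient("mongodb://localhost:27017/?directConnection=true")
    report = validate_collections(client["db_raiden"], ["deal"])
    for name, summary in report.items():
        print(name, summary)
```

Calling `main()` runs the check against a live node. A full validation can take a long time on large collections, so it is probably best scheduled outside peak hours.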

Best,
Edwin

Comment by Vy Nguyen Tan [ 30/Nov/21 ]

Hi Edwin Zhou,

I have uploaded the log file. Please check.

 

Thanks,

Comment by Edwin Zhou [ 29/Nov/21 ]

Hi ntv1090@gmail.com,

Thanks for your report! Would you please archive (tar or zip) the mongod.log files and upload them to this support uploader location?

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.
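
If it helps, here is a minimal Python sketch for building that archive, assuming the log files sit in one directory and share the `mongod.log` prefix (rotated files included). The helper name `archive_logs` and the paths in the usage note are hypothetical.

```python
import tarfile
from pathlib import Path


def archive_logs(log_dir, archive_path, pattern="mongod.log*"):
    """Pack every regular file in log_dir matching pattern into a .tar.gz.

    Returns the file names that were added, in sorted order.
    """
    files = sorted(p for p in Path(log_dir).glob(pattern) if p.is_file())
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for path in files:
            # Store only the file name so the archive has no directory prefix.
            tar.add(str(path), arcname=path.name)
    return [p.name for p in files]
```

For example, `archive_logs("/var/log/mongodb", "/tmp/mongod-logs.tar.gz")` would bundle the current and rotated logs, with the directory adjusted to wherever `systemLog.path` points on your nodes.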

Best,
Edwin

Generated at Thu Feb 08 05:53:14 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.