[SERVER-53706] segfault in tcmalloc Created: 12/Jan/21  Updated: 10/Jun/21  Resolved: 27/Jan/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Suraj Narkhede Assignee: Dmitry Agranat
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-53705 segfault in tcmalloc in wt_evict Closed
is duplicated by SERVER-57199 Pord Mongo instance Crash: got Invali... Closed
is duplicated by WT-7049 test/format heap use after free on 4.... Closed
Related
related to WT-7049 test/format heap use after free on 4.... Closed
Operating System: ALL
Steps To Reproduce:

Unknown at this point, but happening multiple times in production.

Participants:

Description

#0  tcmalloc::SLL_Next (t=0x0) at src/third_party/gperftools-2.5/src/linked_list.h:45
#1  tcmalloc::SLL_PopRange (end=<synthetic pointer>, start=<synthetic pointer>, N=128, head=0x559aa7bbc7a8) at src/third_party/gperftools-2.5/src/linked_list.h:76
#2  tcmalloc::ThreadCache::FreeList::PopRange (end=<synthetic pointer>, start=<synthetic pointer>, N=128, this=0x559aa7bbc7a8) at src/third_party/gperftools-2.5/src/thread_cache.h:225
#3  tcmalloc::ThreadCache::ReleaseToCentralCache (this=this@entry=0x559aa7bbc700, src=src@entry=0x559aa7bbc7a8, cl=<optimized out>, N=N@entry=128) at src/third_party/gperftools-2.5/src/thread_cache.cc:195
#4  0x000055966f8bdd8c in tcmalloc::ThreadCache::ListTooLong (this=this@entry=0x559aa7bbc700, list=0x559aa7bbc7a8, cl=<optimized out>) at src/third_party/gperftools-2.5/src/thread_cache.cc:157
#5  0x000055966f8c6a0a in tcmalloc::ThreadCache::Deallocate (cl=<optimized out>, ptr=0x55b849f2a840, this=0x559aa7bbc700) at src/third_party/gperftools-2.5/src/thread_cache.h:393
#6  (anonymous namespace)::do_free_helper (invalid_free_fn=0x55966f8bf4b0 <(anonymous namespace)::InvalidFree(void*)>, size_hint=0, use_hint=false, heap_must_be_valid=true, heap=0x559aa7bbc700, ptr=0x55b849f2a840) at src/third_party/gperftools-2.5/src/tcmalloc.cc:1383
#7  (anonymous namespace)::do_free_with_callback (invalid_free_fn=0x55966f8bf4b0 <(anonymous namespace)::InvalidFree(void*)>, size_hint=0, use_hint=false, ptr=0x55b849f2a840) at src/third_party/gperftools-2.5/src/tcmalloc.cc:1415
#8  (anonymous namespace)::do_free (ptr=0x55b849f2a840) at src/third_party/gperftools-2.5/src/tcmalloc.cc:1423
#9  tc_free (ptr=0x55b849f2a840) at src/third_party/gperftools-2.5/src/tcmalloc.cc:1688
#10 0x000055966dfe4864 in __wt_free_int (session=session@entry=0x559671e720b0, p_arg=p_arg@entry=0x7f6949b8b6b8) at src/third_party/wiredtiger/src/os_common/os_alloc.c:327
#11 0x000055966e0449c8 in __wt_free_ref (session=session@entry=0x559671e720b0, ref=0x0, page_type=6, free_pages=free_pages@entry=false) at src/third_party/wiredtiger/src/btree/bt_discard.c:292
#12 0x000055966e043b4d in __wt_free_ref_index (session=session@entry=0x559671e720b0, page=page@entry=0x55ab30eff540, pindex=0x55bdaff1aa00, free_pages=free_pages@entry=false) at src/third_party/wiredtiger/src/btree/bt_discard.c:309
#13 0x000055966e043ef6 in __free_page_int (page=<optimized out>, session=0x559671e720b0) at src/third_party/wiredtiger/src/btree/bt_discard.c:234
#14 __wt_page_out (session=session@entry=0x559671e720b0, pagep=pagep@entry=0x55c4ad5cd940) at src/third_party/wiredtiger/src/btree/bt_discard.c:119
#15 0x000055966e04481a in __wt_ref_out (session=session@entry=0x559671e720b0, ref=ref@entry=0x55c4ad5cd940) at src/third_party/wiredtiger/src/btree/bt_discard.c:44
#16 0x000055966dfd87df in __evict_page_dirty_update (closing=false, ref=0x55c4ad5cd940, session=0x559671e720b0) at src/third_party/wiredtiger/src/evict/evict_page.c:433
#17 __wt_evict (session=session@entry=0x559671e720b0, ref=ref@entry=0x55c4ad5cd940, closing=closing@entry=false, previous_state=previous_state@entry=5) at src/third_party/wiredtiger/src/evict/evict_page.c:222
#18 0x000055966dfd05eb in __evict_page (session=session@entry=0x559671e720b0, is_server=is_server@entry=false) at src/third_party/wiredtiger/src/evict/evict_lru.c:2334
#19 0x000055966dfd0b43 in __evict_lru_pages (session=session@entry=0x559671e720b0, is_server=is_server@entry=false) at src/third_party/wiredtiger/src/evict/evict_lru.c:1185
#20 0x000055966dfd3957 in __wt_evict_thread_run (session=0x559671e720b0, thread=0x5596755760a0) at src/third_party/wiredtiger/src/evict/evict_lru.c:318
#21 0x000055966e02a9b9 in __thread_run (arg=0x5596755760a0) at src/third_party/wiredtiger/src/support/thread_group.c:31
#22 0x00007f694f18a6ba in start_thread (arg=0x7f6949b8c700) at pthread_create.c:333
#23 0x00007f694eec04dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb)
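
Note on frames #0–#2 (tcmalloc): the fault happens while the thread cache returns a batch of N=128 freed objects from a per-size-class free list back to the central cache. tcmalloc keeps that free list threaded through the freed objects themselves: each object's first word holds the address of the next free object. The sketch below is a simplified paraphrase of gperftools-2.5 src/linked_list.h (illustrative, not the exact source) and shows why reaching SLL_Next with t=0x0 dereferences a null pointer.

// Simplified paraphrase of gperftools-2.5 src/linked_list.h (illustrative, not verbatim).
// A free list is threaded through the freed objects: each object's first word stores
// the address of the next free object.
static inline void *SLL_Next(void *t) {
  return *(reinterpret_cast<void **>(t));  // frame #0: faults when t == NULL
}

static inline void SLL_SetNext(void *t, void *n) {
  *(reinterpret_cast<void **>(t)) = n;
}

// frame #1: pop the first N objects off the list headed at *head, returned as [*start, *end].
// The loop follows N-1 next pointers; if the list is shorter than N, or a freed object's
// first word was overwritten after it was put on the list, the walk lands on NULL (or garbage).
static inline void SLL_PopRange(void **head, int N, void **start, void **end) {
  if (N == 0) {
    *start = *end = NULL;
    return;
  }
  void *tmp = *head;
  for (int i = 1; i < N; ++i) {
    tmp = SLL_Next(tmp);
  }
  *start = *head;
  *end = tmp;
  *head = SLL_Next(tmp);
  SLL_SetNext(tmp, NULL);  // detach the popped range from the remaining list
}

A null next pointer in the middle of the walk means the free-list metadata stored inside a freed object was clobbered, which usually points at a double free or a write-after-free by the caller rather than at tcmalloc itself; that is consistent with the heap use-after-free tracked in WT-7049, given the free is issued from WiredTiger eviction (__wt_evict → __wt_page_out → __wt_free_int) in frames #10–#17.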



Comments
Comment by Dmitry Agranat [ 27/Jan/21 ]

Hi surajn.vnit@gmail.com, I will go ahead and close this ticket. Please reopen it if the issue recurs after upgrading to 4.4.3.

Comment by Dmitry Agranat [ 20/Jan/21 ]

Hi surajn.vnit@gmail.com, we believe the reported issue is related to WT-7049 and, based on preliminary results, it does not reproduce in 4.4. We suggest testing with 4.4.3 and reporting back with the results.

Comment by Sergey G [ 19/Jan/21 ]

@Dmitry Thanks! Re reproduction: we do not yet have a synthetic repro, and it's not possible to deploy an unpatched version into the production environment where we're able to repro this group of crashes consistently.

(However I'm 95% certain our patches are not relevant here – we do not touch any WT code and other high-load situations have not turned up any similar issues.)

Comment by Dmitry Agranat [ 19/Jan/21 ]

surajn.vnit@gmail.com, we are still investigating this and I expect to have an update for you tomorrow.

Comment by Dmitry Agranat [ 18/Jan/21 ]

Thanks surajn.vnit@gmail.com for the provided data and the explanation. Given that you are using a custom MongoDB binary, can you reproduce the same issue with a default, unmodified MongoDB binary?

Comment by Sergey G [ 16/Jan/21 ]

Hi Dima,

I've uploaded an archive of the relevant data to the portal:

  • the diagnostic.data archive (sadly, the files for the date of the crash seem to have been truncated, so I'm not sure how useful this is)
  • the crash backtrace in mongod.log (I can attach any other specific info from the log if you can specify it)
  • the `bt full` backtrace from gdb on the core file 
  • a directory of other gdb backtraces ('other-stacks') that I've collected from other crashes that appear related. 

Also, some context that we now have:

  • This is a build of mongo 3.6.20 (with some small local patches that are known to be safe & stable)
  • We think all of the segfaults are happening on machines that are experiencing a high volume of deletes. (The crash this ticket was opened for came from a one-off delete workload. The 'other-stacks' backtraces are from a pool of machines that has a regular deletion-heavy workload.)
  • The machine with the crashed mongod is a 48-core instance. For the crashes in 'other-stacks', we noticed the segfaults started when the instance size was changed from 16-CPU machines to 36-CPU machines (with no other configuration changes), and they are happening at a steady rate in the cluster. (We do not have a synthetic repro yet.)

Comment by Dmitry Agranat [ 12/Jan/21 ]

Hi surajn.vnit@gmail.com,

Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location?

In addition, please also attach syslog covering the time of the reported event.

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Thanks,
Dima
