[SERVER-50476] gdb.lookup_type on dynamically linked mongo programs can be prohibitively expensive Created: 22/Aug/20  Updated: 29/Oct/23  Resolved: 17/Jan/23

Status: Closed
Project: Core Server
Component/s: Build
Affects Version/s: None
Fix Version/s: 6.3.0-rc0

Type: Bug Priority: Major - P3
Reporter: Daniel Gottlieb (Inactive) Assignee: Daniel Moody
Resolution: Fixed Votes: 2
Labels: post-v4-toolchain
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File screenshot-1.png    
Issue Links:
Duplicate
duplicates SERVER-71211 gdb-index causing bugs in server gdb Closed
Related
is related to SERVER-50537 Create alternative gdb string pretty ... Open
is related to SERVER-71098 Create additional gdb tests Open
is related to SERVER-70839 Spawning dynamically linked mongod pr... Closed
Assigned Teams:
Server Development Platform
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

gdb ./path/to/mongod
$ b main
$ r
$ python print(gdb.lookup_type("std::__cxx11::string"))

Participants:
Linked BF Score: 6

 Description   

Calls to gdb.lookup_type when using gdb are ubiquitous. Perhaps most notably is when pretty printing strings.

Running gdb on a dynamically linked mongo program is struggling mightily compared to a statically linked program with the following invocations:

  • gdb.lookup_type("std::__cxx11::string")
  • gdb.lookup_type("std::__cxx11::wstring")

On a mongo shell, the former call took maybe a minute? (pegs a single CPU) and consumed ~6-7GB of memory before succeeding. A follow-up call succeeded immediately (I assume the result was cached by gdb).

However when examining mongod on a spawn host, the lookup consumes all of the system memory and OOMs the gdb program before it gets a chance to succeed.



 Comments   
Comment by Githook User [ 17/Jan/23 ]

Author:

{'name': 'Daniel Moody', 'email': 'daniel.moody@mongodb.com', 'username': 'dmoody256'}

Message: SERVER-50476 created gdb_index scons tool to generate gdb_index
Branch: master
https://github.com/mongodb/mongo/commit/15e9bc63835c7c0c441607e10034841b3a157e3c

Comment by Daniel Moody [ 04/Jan/23 ]

daniel.gottlieb@mongodb.com Thanks for looking into it, so gdb-index was just silently not being used with lld because of the lack of pubnames option? Seems like gdb should have said some kind of warning about the bad index. Same goes for the gold case as well instead of just using it and running into errors.

I will test it out and reproduce the tests you mentioned, and investigate pubnames a bit more and if it looks good, fix it under this ticket.

Comment by Daniel Gottlieb (Inactive) [ 04/Jan/23 ]

On a whim I tried the combination of gold + -ggnu-pubnames. gdb's startup time was reduced to 30 seconds as well (from the 1 minute baseline for gold + gdb-index - pubnames). GDB also succeeded in doing the lookup_type on the problematic names.

It looks like David Blaikie's assessment that pubnames is required for proper gdb-indexing by the linker is correct:

Comment by Daniel Gottlieb (Inactive) [ 04/Jan/23 ]

I can confirm the issue still exists when using --linker=gold and goes away with lld. I'm happy for this ticket to be closed.

I did get curious and did some follow-up investigation:

I noticed that the startup time for gdb when using lld was ~3 minutes (`execfile` is empty):

dgottlieb@chimichurri ~/xgen/mongo[master]$ time gdb --exec=execfile --batch ./bin/mongod ./core
real	2m56.794s
user	3m50.890s
sys	0m7.840s

while it was ~1 minute for gold:

real	1m12.225s
user	2m6.107s
sys	0m8.053s

So to me that meant we were not building a gdb index with lld. However, I did see it being built:

dgottlieb@chimichurri ~/xgen/mongo[master]$ readelf -S bin/mongod | grep -A 4 index
  [32] .gdb_index        PROGBITS         0000000000000000  00007890
       00000000000020c8  0000000000000000           0     0     1
  [33] .symtab           SYMTAB           0000000000000000  00009958
       0000000000000468  0000000000000018          35    26     8
  [34] .shstrtab         STRTAB           0000000000000000  00009dc0

I stumbled across this bug report that says lld won't hit the goals of a gdb index if the compilation units are not compiled with -ggnu-pubnames. So I ran an experiment to add that. My scons -> ninja invocation:

/opt/mongodbtoolchain/v4/bin/python3 ./buildscripts/scons.py CCFLAGS="-ggnu-pubnames" --link-model=dynamic --dbg --opt=off --variables-files=etc/scons/mongodbtoolchain_stable_clang.vars VARIANT_DIR=newninja --variables-files=etc/scons/developer_versions.vars --modules="" MONGO_GIT_HASH="unknown" ICECC=icecc CCACHE=ccache DESTDIR=$(pwd) --ninja generate-ninja NINJA_PREFIX=dynamic.pubnames

gdb startup time went down to ~30 second:

real	0m33.357s
user	1m27.930s
sys	0m6.297s

The existence of the index seemed to continue to work as expected:

(gdb) python print(gdb.lookup_type("std::stringstream"))
std::stringstream

daniel.moody@mongodb.com I'm happy to close this ticket out. But a question/ask that I would appreciate a response on:

  • Most of that bug report is jargon to me. From your understanding are we impacted by that bug/failure to deliver desired behavior (the index reduces startup times)?
  • Can you reproduce my experiment and results?
  • If so, should we be filing a new ticket to add -ggnu-pubnames to builds where the linker is being asked to create a gdb index?
Comment by Daniel Moody [ 03/Jan/23 ]

This seems to have been resolved by switching to lld linker as the default linker (SERVER-72324).

I tested 

  • python print(gdb.lookup_type("std::string"))
  • python print(gdb.lookup_type("std::stringstream"))

And both were near instantaneous, if any else interested could corroborate, that would be great, otherwise I will close this issue.

Comment by Luke Pearson [ 27/Oct/22 ]

I ran into this on the coredump from this failure when trying to do a type_lookup on uint8_t, similarly for 16/32/64 bits. Disabling pretty printers in this instance didn't work around the issue. I did manage to work around it using a void pointer, which allowed me do WiredTiger insert list key conversion.

This command should reproduce the issue on that coredump in the fourth frame.

(gdb) p  ((WT_CURSOR_BTREE*)c)->ref->page.modify.u2.row_leaf.insert

Thanks daniel.gottlieb@mongodb.com for assisting me with this issue.

Comment by Andrew Morrow (Inactive) [ 22/Apr/22 ]

Some potentially good news on this front: https://tromey.com/blog/?p=1084

Comment by Andrew Morrow (Inactive) [ 04/Mar/22 ]

max.hirschhorn - This comment on the gold bug suggests that at least some version of GDB has workarounds for the duplicated entries that ld.gold produces: https://sourceware.org/bugzilla/show_bug.cgi?id=15646#c5. I will look into whether the version of GDB arriving with the v4 toolchain contains the workaround (along with the fix to the workaround indicated in the comment).

I'm going to enqueue this ticket into the epic where we are doing post-v4 tasks so we can investigate what to do here more thoroughly. The gdb-add-index script just invokes gdb and uses the save gdb-index command, but does a lot of other things too. I'd want to take a closer look at what all it was doing. I'm a little hesitant to start needing to invoke gdb as part of our build process. There is also ld.lld in the mix, which I think we can start using in more situations once we get to v4. We had disabled it for many cases in v3 due to bugs.

Comment by Max Hirschhorn [ 03/Mar/22 ]

Unfortunately, there isn't much we can do until GDB itself addresses this issue.

I'm not sure GDB is really to blame here. After reading through

1. The version number, currently 8. Versions 1, 2 and 3 are obsolete. Version 4 uses a different hashing function from versions 5 and 6. Version 6 includes symbols for inlined functions, whereas versions 4 and 5 do not. Version 7 adds attributes to the CU indices in the symbol table. Version 8 specifies that symbols from DWARF type units ('DW_TAG_type_unit') refer to the type unit’s symbol table and not the compilation unit (DW_TAG_comp_unit) using the type.

GDB will only read version 4, 5, or 6 indices by specifying set use-deprecated-index-sections on. GDB has a workaround for potentially broken version 7 indices so it is currently not flagged as deprecated.

https://sourceware.org/gdb/onlinedocs/gdb/Index-Section-Format.html

and tracking down https://www.sourceware.org/legacy-ml/gdb-patches/2013-08/msg00055.html and https://sourceware.org/bugzilla/show_bug.cgi?id=25234, I think the issue with the gold linker and how it generates the .gdb_index section format. Steps I wonder if we could take from here would be to either

Comment by Andrew Morrow (Inactive) [ 23/Aug/20 ]

Some very interesting notes here: https://sourceware.org/gdb/wiki/SymbolHandling. In particular, note 8 seems relevant to our situation. Unfortunately no workaround is suggested.

Comment by Daniel Gottlieb (Inactive) [ 23/Aug/20 ]

I first ran into this on a coredump from a relatively recent version on our standard shared library rhel 6.2 (linked separately). I reproduced this locally using clang with the following invocation:

alias build_hygienic_dynamic='/opt/mongodbtoolchain/v3/bin/python3 ./buildscripts/scons.py --link-model=dynamic --dbg --opt=on --variables-files=etc/scons/mongodbtoolchain_stable_clang.vars VARIANT_DIR=newninja MONGO_VERSION=$(git describe --abbrev=0 | sed -e "s/^r//") --modules="" MONGO_GIT_HASH="unknown" ICECC=icecc CCACHE=ccache --install-mode=hygienic --ninja=next generate-ninja NINJA_PREFIX=dynamic.hygienic'

max.hirschhorn was able to confirm the same problem on his existing mongod. I suspect the repro steps up top are sufficient to observe the same behavior for a wide variety of build settings.

Are you familiar with what gdb.lookup_type has to accomplish? When seeing the same problem on the shell (which required 6-7GB of memory before succeeding), the memory was freed and subsequent calls returned immediately. It's not clear to me that it's necessary for lookup_type to aggregate so much data to return its output. Where I'm going with this: is it plausible that what we're seeing is some unoptimized code path that is maybe improved upon in a later version of gdb?

Comment by Andrew Morrow (Inactive) [ 23/Aug/20 ]

daniel.gottlieb - Thanks for running those experiments. I was somewhat hopeful that we would only observe this behavior with the mongodbtoolchain because it would have suggested that SERVER-48291 was responsible, and we are working on a fix for that. However, since the behavior seems to also occur with the system toolchain, where SERVER-48291 is not a factor, that hypothesis seems eliminated. That leaves the less palatable hypothesis on the table that the issue is due to the sheer number of std::string instantiations across the large number of shared libraries in a dynamic build. We don't have a short term fix for reducing the number of libraries, though we do have longer term goal to do so. In the meantime, we will investigate where there are GCC settings around debug info that we can adjust, or whether there are additional startup commands we could execute with GDB to lessen the impact. Note that we are already using --Wl,--gdb-index.

Meanwhile could you provide details on exactly which build variant and branch, etc., you were using when you encountered this issue?

Comment by Daniel Gottlieb (Inactive) [ 23/Aug/20 ]

mongo::mutablebson::Document worked fine on the toolchain dynamic build and the system gcc. The cxx11::string against the system gcc consumed a quarter of my desktop memory before I interrupted the command.

Comment by Andrew Morrow (Inactive) [ 22/Aug/20 ]

daniel.gottlieb - I don't have another workaround at the moment but I do have an idea what may be happening. To confirm, can you please, on the same build where you observe the poor performance of gdb.lookup_type with std::string, see if you observe the same poor performance for gdb.lookup_type of some type in the mongo namespace? Ideally a simple non-template class with a primarily out-of-line definition like mongo::mutablebson::Document? Then, especially if the performance of gdb.lookup_type shows that it works OK for the mongo:: type, could you please repeat the experiment with both std::string and the mongo:: type, but this time on a build of the binaries produced with the system GCC rather than the mongodbtoolchain?

Comment by Daniel Gottlieb (Inactive) [ 22/Aug/20 ]

acm, I can presumably "workaround this" by not registering the "offending" pretty printers. That said, I'd highly appreciate an expedited triage process if it can result in a different workaround I can apply which doesn't sacrifice the ability to pretty print strings. I suspect you already have an idea of what's probably happening and might have some ideas of what we can try.

Generated at Thu Feb 08 05:22:45 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.