[SERVER-72273] Allow for dynamic loose build mode to prevent relinking Created: 20/Dec/22  Updated: 02/Feb/24

Status: Open
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Matthew Saltz (Inactive) Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File dynamic_mongod_ninja_db_raii_j1.log     File dynamic_mongod_ninja_db_raii_j300.log    
Assigned Teams:
Build
Participants:

 Description   

When I make a modification to src/mongo/db/db_raii.cpp and recompile the server, it generates 245 "compile tasks" and takes about 2 minutes. It seems like a change to one .cpp file shouldn't require all this extra work, especially since I'm using dynamic linking. Here's a sample of the tasks generated:

(env) ~/code/mongo$ ninja -j200 -v -f debug.ninja install-devcore
[1/245 (  0%) 28.073s] rm -f build/debug/mongo/db/db_raii.dyn.o; export CCACHE_NOCPP2='1';export CCACHE_PREFIX='/home/matthewsaltz/code/mongo/build/scons/icecream/debug/run-icecc.sh';export PATH='/opt/mongodbtoolchain/v4/bin:/usr/local/bin:/opt/bin:/bin:/usr/bin';/opt/mongodbtoolchain/v4/bin/ccache /opt/mongodbtoolchain/v4/bin/g++ @build/debug/mongo/db/db_raii.dyn.o.rsp
[2/245 (  0%) 29.422s] rm -f build/debug/mongo/db/libshard_role.so; export CCACHE_NOCPP2='1';export CCACHE_PREFIX='/home/matthewsaltz/code/mongo/build/scons/icecream/debug/run-icecc.sh';export PATH='/opt/mongodbtoolchain/v4/bin:/usr/local/bin:/opt/bin:/bin:/usr/bin';/usr/bin/icerun /opt/mongodbtoolchain/v4/bin/g++ @build/debug/mongo/db/libshard_role.so.rsp
[3/245 (  1%) 29.431s] rm -f build/install/lib/libshard_role.so; ln build/debug/mongo/db/libshard_role.so build/install/lib/libshard_role.so || install build/debug/mongo/db/libshard_role.so build/install/lib/libshard_role.so
[4/245 (  1%) 30.438s] rm -f build/debug/mongo/db/stats/liblatency_server_stats.so; export CCACHE_NOCPP2='1';export CCACHE_PREFIX='/home/matthewsaltz/code/mongo/build/scons/icecream/debug/run-icecc.sh';export PATH='/opt/mongodbtoolchain/v4/bin:/usr/local/bin:/opt/bin:/bin:/usr/bin';/usr/bin/icerun /opt/mongodbtoolchain/v4/bin/g++ @build/debug/mongo/db/stats/liblatency_server_stats.so.rsp
[5/245 (  2%) 30.441s] rm -f build/install/lib/liblatency_server_stats.so; ln build/debug/mongo/db/stats/liblatency_server_stats.so build/install/lib/liblatency_server_stats.so || install build/debug/mongo/db/stats/liblatency_server_stats.so build/install/lib/liblatency_server_stats.so
[6/245 (  2%) 30.452s] rm -f build/debug/mongo/db/storage/liboplog_cap_maintainer_thread.so; export CCACHE_NOCPP2='1';export CCACHE_PREFIX='/home/matthewsaltz/code/mongo/build/scons/icecream/debug/run-icecc.sh';export PATH='/opt/mongodbtoolchain/v4/bin:/usr/local/bin:/opt/bin:/bin:/usr/bin';/usr/bin/icerun /opt/mongodbtoolchain/v4/bin/g++ @build/debug/mongo/db/storage/liboplog_cap_maintainer_thread.so.rsp
[7/245 (  2%) 30.455s] rm -f build/install/lib/liboplog_cap_maintainer_thread.so; ln build/debug/mongo/db/storage/liboplog_cap_maintainer_thread.so build/install/lib/liboplog_cap_maintainer_thread.so || install build/debug/mongo/db/storage/liboplog_cap_maintainer_thread.so build/install/lib/liboplog_cap_maintainer_thread.so
[8/245 (  3%) 30.465s] rm -f build/debug/mongo/db/repl/libreplication_consistency_markers_impl.so; export CCACHE_NOCPP2='1';export CCACHE_PREFIX='/home/matthewsaltz/code/mongo/build/scons/icecream/debug/run-icecc.sh';export PATH='/opt/mongodbtoolchain/v4/bin:/usr/local/bin:/opt/bin:/bin:/usr/bin';/usr/bin/icerun /opt/mongodbtoolchain/v4/bin/g++ @build/debug/mongo/db/repl/libreplication_consistency_markers_impl.so.rsp
[9/245 (  3%) 30.469s] rm -f build/install/lib/libreplication_consistency_markers_impl.so; ln build/debug/mongo/db/repl/libreplication_consistency_markers_impl.so build/install/lib/libreplication_consistency_markers_impl.so || install build/debug/mongo/db/repl/libreplication_consistency_markers_impl.so build/install/lib/libreplication_consistency_markers_impl.so
[10/245 (  4%) 30.520s] rm -f build/debug/mongo/db/catalog/libvalidate_state.so; export CCACHE_NOCPP2='1';export CCACHE_PREFIX='/home/matthewsaltz/code/mongo/build/scons/icecream/debug/run-icecc.sh';export PATH='/opt/mongodbtoolchain/v4/bin:/usr/local/bin:/opt/bin:/bin:/usr/bin';/usr/bin/icerun /opt/mongodbtoolchain/v4/bin/g++ @build/debug/mongo/db/catalog/libvalidate_state.so.rsp
[11/245 (  4%) 30.524s] rm -f build/install/lib/libvalidate_state.so; ln build/debug/mongo/db/catalog/libvalidate_state.so build/install/lib/libvalidate_state.so || install build/debug/mongo/db/catalog/libvalidate_state.so build/install/lib/libvalidate_state.so

Here's the function I used to build debug.ninja:

   function buildninjaicdebug() {
       python ./buildscripts/scons.py  \
           --variables-files=etc/scons/mongodbtoolchain_stable_gcc.vars \
           MONGO_VERSION=$(git describe --abbrev=0 | tail -c+2) \
           --ssl --dbg --opt=off --link-model=dynamic \
           VARIANT_DIR=debug \
           --ninja generate-ninja ICECC=icecc CCACHE=ccache NINJA_PREFIX=debug
   }

For a full repro this should do it:

       python ./buildscripts/scons.py  \
           --variables-files=etc/scons/mongodbtoolchain_stable_gcc.vars \
           MONGO_VERSION=$(git describe --abbrev=0 | tail -c+2) \
           --ssl --dbg --opt=off --link-model=dynamic \
           VARIANT_DIR=debug \
           --ninja generate-ninja ICECC=icecc CCACHE=ccache NINJA_PREFIX=debug
       ninja -j200 -f debug.ninja install-devcore
       touch src/mongo/db/db_raii.cpp
       ninja -j200 -f debug.ninja install-devcore



 Comments   
Comment by Matt Broadstone [ 19/Jan/24 ]

Can we please reconsider this feature, or something else that addresses the original ask? It sounds like we have a few options, which might be:

  • add a new dynamic-loose profile
  • audit public LIBDEPS and determine if we can reduce the links between libraries
  • ??? something else

 

I'm making a small change to db_s_config_server_test and the link time is ~65s right now, which is just long enough for me to get distracted and read some other Jira ticket or Slack thread while waiting for my unit tests to recompile. Working with unit tests should be a joyful, fast, iterative experience. What can we do to move in that direction?

 

Comment by Daniel Moody [ 20/Dec/22 ]

I think a good solution here is a command-line option that allows very loose linking, like `--link-model=dynamic-loose`, where libraries don't get relinked just because the shared objects they depend on changed. This would skip any build-time checks for symbol resolution, so any undefined references would only show up at runtime.
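A sketch of what that might look like from the command line, assuming a hypothetical dynamic-loose value for --link-model (it does not exist today; the rest of the invocation mirrors the repro in the description):

    # Hypothetical: --link-model=dynamic-loose is not implemented; shown only to illustrate the proposal.
    python ./buildscripts/scons.py \
        --variables-files=etc/scons/mongodbtoolchain_stable_gcc.vars \
        --ssl --dbg --opt=off --link-model=dynamic-loose \
        --ninja generate-ninja ICECC=icecc CCACHE=ccache NINJA_PREFIX=debug
    # After touching src/mongo/db/db_raii.cpp, the goal would be for
    #   ninja -f debug.ninja install-devcore
    # to recompile db_raii.dyn.o and relink only libshard_role.so, skipping the other ~240 tasks.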

Comment by Daniel Moody [ 20/Dec/22 ]

I'd expect, for example, at least 10 parts to be processed in parallel, which would bring the recompile down from 180 seconds to ~20.

I don't know where you got that figure from; it depends on the depth of dependencies in the particular situation. Looking at the graph paths, I used networkx to print the longest dependency path between mongod and libshard_role (where db_raii is linked in), and it is 30 levels deep. That is most likely the bottleneck: those links must be done consecutively, as 30 separate link steps. In a static build the .a libraries are not dependent on each other, so more concurrency is possible in the context of libraries alone (note: not counting Programs).
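(For reference, a rough sketch of how one might reproduce that longest-path measurement; the GraphML path and node names below are assumptions and would need adjusting to wherever your libdeps graph is exported.)

# Sketch only: assumes the libdeps graph has been exported to GraphML; file path and node names are hypothetical.
python3 - <<'EOF'
import networkx as nx

G = nx.read_graphml("build/libdeps/libdeps.graphml")     # hypothetical location of the exported graph
src, dst = "mongod", "libshard_role.so"                  # hypothetical node names
# Longest simple path between the two nodes; this can be slow on dense graphs.
longest = max(nx.all_simple_paths(G, src, dst), key=len)
print(len(longest), "levels:", " -> ".join(longest))
EOF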

db_raii.cpp just happens to sit somewhat deep in the dependency tree. If you choose a different file it could be worse or better, depending on its location. For example, touching a source file in the shell:

(venv) Dec.20 03:04 ubuntu[mongo]: touch src/mongo/shell/mongo.cpp                                 
(venv) Dec.20 03:04 ubuntu[mongo]: time ninja -j200 -f debug.ninja install-devcore
[3/3] Installed build/install/bin/mongo
 
 
real	0m0.520s
user	0m0.404s
sys	0m0.116s 

The premise of the dynamic link model is not to improve incremental rebuilds. It was originally intended to reduce memory pressure and thereby allow greater linking concurrency. In the context of a full build we link hundreds of binaries, and in a static-link build each link can take somewhere between 5 GB and 10 GB of memory, which can easily OOM a system. The dynamic link model considerably reduces that memory use and is generally faster than the equivalent static link (because it can skip transitive symbol references), but in some cases it can require more link steps, depending on the situation.
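For a rough sense of the memory argument (the per-link figures are the 5-10 GB mentioned above; the concurrency level is an assumed example):

# assume 16 concurrent static links x 5-10 GB each  ≈  80-160 GB of RAM for linkers alone
# the equivalent dynamic links are considerably smaller, so far higher -j values stay feasible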

 

Comment by Daniel Gottlieb (Inactive) [ 20/Dec/22 ]

I ran an experiment doing just static linking; it takes ~13 seconds. Is there a reason to use dynamic linking if it's not effective in the incremental case?

dgottlieb@chimichurri ~/xgen/mongo[master]$ cat static_mongod_ninja_db_raii_j1.log 
[1/4 ( 25%) 0.838s] rm -f build/newninja/mongo/db/db_raii.o; export CCACHE_NOCPP2='1';export CCACHE_PREFIX='/home/dgottlieb/xgen/mongo/build/scons/icecream/newninja/run-icecc.sh';export ICECC_CLANG_REMOTE_CPP='1';export PATH='/opt/mongodbtoolchain/v4/bin:/usr/local/bin:/opt/bin:/bin:/usr/bin';/opt/mongodbtoolchain/v4/bin/ccache /opt/mongodbtoolchain/v4/bin/clang++ @build/newninja/mongo/db/db_raii.o.rsp
[2/4 ( 50%) 0.841s] rm -f build/newninja/mongo/db/libshard_role.a; touch build/newninja/mongo/db/libshard_role.a
[3/4 ( 75%) 12.177s] rm -f build/newninja/mongo/db/mongod; export CCACHE_NOCPP2='1';export CCACHE_PREFIX='/home/dgottlieb/xgen/mongo/build/scons/icecream/newninja/run-icecc.sh';export ICECC_CLANG_REMOTE_CPP='1';export PATH='/opt/mongodbtoolchain/v4/bin:/usr/local/bin:/opt/bin:/bin:/usr/bin';/bin/icerun /opt/mongodbtoolchain/v4/bin/clang++ @build/newninja/mongo/db/mongod.rsp
[4/4 (100%) 12.809s] rm -f bin/mongod; ln build/newninja/mongo/db/mongod bin/mongod || install build/newninja/mongo/db/mongod bin/mongod

Edit
The above experiment only touched the file; Ninja notices which file was touched but doesn't do any interesting work. Actually changing the contents of the file yields a different result, but it's still faster than dynamic linking:

dgottlieb@chimichurri ~/xgen/mongo[master]$ ninja -v -f enterprise.static.ninja -j 300 bin/mongod
[1/4 ( 25%) 25.080s] rm -f build/newninja/mongo/db/db_raii.o; export CCACHE_NOCPP2='1';export CCACHE_PREFIX='/home/dgottlieb/xgen/mongo/build/scons/icecream/newninja/run-icecc.sh';export ICECC_CLANG_REMOTE_CPP='1';export PATH='/opt/mongodbtoolchain/v4/bin:/usr/local/bin:/opt/bin:/bin:/usr/bin';/opt/mongodbtoolchain/v4/bin/ccache /opt/mongodbtoolchain/v4/bin/clang++ @build/newninja/mongo/db/db_raii.o.rsp
[2/4 ( 50%) 25.087s] rm -f build/newninja/mongo/db/libshard_role.a; touch build/newninja/mongo/db/libshard_role.a
[3/4 ( 75%) 42.391s] rm -f build/newninja/mongo/db/mongod; export CCACHE_NOCPP2='1';export CCACHE_PREFIX='/home/dgottlieb/xgen/mongo/build/scons/icecream/newninja/run-icecc.sh';export ICECC_CLANG_REMOTE_CPP='1';export PATH='/opt/mongodbtoolchain/v4/bin:/usr/local/bin:/opt/bin:/bin:/usr/bin';/bin/icerun /opt/mongodbtoolchain/v4/bin/clang++ @build/newninja/mongo/db/mongod.rsp
[4/4 (100%) 42.400s] rm -f bin/mongod; ln build/newninja/mongo/db/mongod bin/mongod || install build/newninja/mongo/db/mongod bin/mongod

The dynamic-linking experiment I ran did both: touching a file (the -j1 run) and editing the file (the -j300 run). For a better apples-to-apples comparison:

  • Touching a file + -j300:
    • Static: 12s
    • Dynamic: 80s
  • Editing a file + -j300:
    • Static: 43s
    • Dynamic: 115s
Comment by Daniel Gottlieb (Inactive) [ 20/Dec/22 ]

I've added my log files from ninja -v when recompiling after touching only db_raii.cpp

Comment by Daniel Gottlieb (Inactive) [ 20/Dec/22 ]

The link steps happen in parallel. The install steps (rm + ln) also are in parallel and are relatively fast (10s of milliseconds).

How parallel are they? The numbers aren't adding up.

I'm able to reproduce matthew.saltz@mongodb.com's observation. When I run with -j1 as a baseline I get ~3 minutes:

[271/271 (100%) 187.366s] rm -f bin/mongod; ln build/newninja/mongo/db/mongod bin/mongod || install build/newninja/mongo/db/mongod bin/mongod

When I run with -j300 (certainly larger than whatever pool size ninja is configured to use), that only brings the runtime down to ~2 minutes. I'd expect, for example, at least 10 parts to be processed in parallel, which would bring the recompile down from 180 seconds to ~20.

[271/271 (100%) 112.854s] rm -f bin/mongod; ln build/newninja/mongo/db/mongod bin/mongod || install build/newninja/mongo/db/mongod bin/mongod

I agree with the observation that 10s of milliseconds is typical. But I also see some 1+ second blips (from the -j1 logs):

[39/271 ( 14%) 47.282s] rm -f build/newninja/mongo/db/serverless/libshard_split_state_machine.so; export CCACHE_NOCPP2='1';export CCACHE_PREFIX='/home/dgottlieb/xgen/mongo/build/scons/icecream/newninja/run-icecc.sh';export ICECC_CLANG_REMOTE_CPP='1';export PATH='/opt/mongodbtoolchain/v4/bin:/usr/local/bin:/opt/bin:/bin:/usr/bin';/bin/icerun /opt/mongodbtoolchain/v4/bin/clang++ @build/newninja/mongo/db/serverless/libshard_split_state_machine.so.rsp
[40/271 ( 14%) 48.657s] rm -f build/newninja/mongo/db/repl/libtenant_migration_access_blocker.so; export CCACHE_NOCPP2='1';export CCACHE_PREFIX='/home/dgottlieb/xgen/mongo/build/scons/icecream/newninja/run-icecc.sh';export ICECC_CLANG_REMOTE_CPP='1';export PATH='/opt/mongodbtoolchain/v4/bin:/usr/local/bin:/opt/bin:/b
[41/271 ( 15%) 49.766s] rm -f build/newninja/mongo/db/libread_concern_d_impl.so; export CCACHE_NOCPP2='1';export CCACHE_PREFIX='/home/dgottlieb/xgen/mongo/build/scons/icecream/newninja/run-icecc.sh';export ICECC_CLANG_REMOTE_CPP='1';export PATH='/opt/mongodbtoolchain/v4/bin:/usr/local/bin:/opt/bin:/bin:/usr/bin';/bin/icerun /opt/mongodbtoolchain/v4/bin/clang++ @build/newninja/mongo/db/libread_concern_d_impl.so.rsp

To be clear about what my asks are here:

  • I am not asking for this to be a top-priority ticket to fix.
  • I *am* asking to understand what the expected parallelism is. I expect the back-of-the-envelope math to support the hypothesis (a rough sketch follows this list).
  • I'm also asking: what's the consequence of not doing all those rm + ln steps? What would a user have to do between changing files and/or changing scons/ninja flags such that running mongod wouldn't work correctly? The explanation about public libdeps helps inform what's getting rm + ln'ed when a source file changes, but I'm not picking up on why that's important. My mental model of shared-library builds is that changing one .cpp file should rewrite one .so file, and nothing else should be necessary to run the updated code.
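As a rough back-of-the-envelope sketch for the parallelism question above, using only figures already in this ticket (the ~30-level dependency path from the earlier comment and the timings in the attached logs):

# observed total at -j300:        ~113s   (last task in the -j300 log)
# db_raii.dyn.o recompile:         ~28s   (first task in the log)
# remaining ~85s over a ~30-link critical path  ≈ 3s per .so relink, executed strictly in sequence
# => the length of the dependency chain, not the -j level, sets the floor for this rebuild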
Comment by Daniel Moody [ 20/Dec/22 ]

Is there an option that lets the rm + ln/install steps happen in parallel? Also curious if there's a correctness concern there.

The link steps happen in parallel. The install steps (rm + ln) also are in parallel and are relatively fast (10s of milliseconds).

Comment by Daniel Gottlieb (Inactive) [ 20/Dec/22 ]

Is there an option that lets the rm + ln/install steps happen in parallel? Also curious if there's a correctness concern there.

Comment by Daniel Moody [ 20/Dec/22 ]

We use build-time symbol resolution and explicitly model these dependencies, relinking all the libraries to re-check that every symbol still resolves correctly so there are no issues at runtime. So if you change a library that other libraries link against, everything is rechecked to make sure that, when the binary runs, all symbol references are satisfied.

 

That said, the overuse of PUBLIC libdeps means there are unnecessary dependencies and links. We could also add a command-line option so that you can build without -z,defs (build-time symbol resolution) and drop those dependencies, so nothing is relinked besides the library whose cpp you changed.
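For illustration, a minimal standalone sketch (hypothetical foo.c, not MongoDB code) of what the -z,defs check does for a shared-library link:

cat > foo.c <<'EOF'
extern int bar(void);              /* expected to be provided by some other shared library */
int foo(void) { return bar(); }
EOF

gcc -shared -fPIC -o libfoo.so foo.c               # default: the undefined reference to 'bar' is allowed
gcc -shared -fPIC -Wl,-z,defs -o libfoo.so foo.c   # fails: every symbol must resolve at link time

A loose mode would effectively forgo the second check for local iteration, deferring any unresolved symbols to runtime as described in the earlier comment.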

Generated at Thu Feb 08 06:21:18 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.