[SERVER-48443] Compilation errors with GCC and ICECC 1.2+ Created: 27/May/20  Updated: 29/Oct/23  Resolved: 17/Jul/20

Status: Closed
Project: Core Server
Component/s: Build
Affects Version/s: None
Fix Version/s: 4.4.1, 4.7.0

Type: Bug Priority: Major - P3
Reporter: Svilen Mihaylov (Inactive) Assignee: Ryan Egesdahl (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File build.ninja     Text File error.txt    
Issue Links:
Backports
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Steps To Reproduce:

A minimal repro for this on an Ubuntu 20.04 spawn host is:

python3 ./buildscripts/scons.py --dbg=on --opt=off --link-model=dynamic --variables-files=etc/scons/mongodbtoolchain_stable_gcc.vars ICECC=icecc CCACHE=ccache build/debug/mongo/base/data_range.dyn.o

Sprint: Dev Platform 2020-06-15, Dev Platform 2020-06-29, Dev Platform 2020-07-13, Dev Platform 2020-07-27
Participants:

 Description   

 

After upgrading to Ubuntu 20.04 looks like ccache is broken. Disabling ccache from scons works.

The lack of ccache is however a significant productivity issue.

 

Ubuntu 20.04
ccache version 3.7.7
ICECC 1.3.1



 Comments   
Comment by Githook User [ 06/Aug/20 ]

Author:

{'name': 'Ryan Egesdahl', 'email': 'ryan.egesdahl@mongodb.com', 'username': 'deriamis'}

Message: SERVER-48443 Fix builds with Icecream 1.2+ and gcc 4.4+

A bug spotted in Icecream 1.2+ can cause build failures when building
with gcc. This is, in turn, due to a bug in GCC where the preprocessor
executed via `gcc -E` has different behavior than the one used
internally during compilation. We are working with Icecream, and GCC
to address these problems. For now, we work around the bugs.

(cherry picked from commit f69c932697877c4036aca066cf212eaad55be070)
Branch: v4.4
https://github.com/mongodb/mongo/commit/f3cd65f32a6d05803b3f3a8d2f5d32f6d9e95135

Comment by Githook User [ 17/Jul/20 ]

Author:

{'name': 'Ryan Egesdahl', 'email': 'ryan.egesdahl@mongodb.com', 'username': 'deriamis'}

Message: SERVER-48443 Fix builds with Icecream 1.2+ and gcc 4.4+

A bug spotted in Icecream 1.2+ can cause build failures when building
with gcc. This is, in turn, due to a bug in GCC where the preprocessor
executed via `gcc -E` has different behavior than the one used
internally during compilation. We are working with Icecream, and GCC
to address these problems. For now, we work around the bugs.

Comment by Githook User [ 17/Jul/20 ]

Author:

{'name': 'Ryan Egesdahl', 'email': 'ryan.egesdahl@mongodb.com', 'username': 'deriamis'}

Message: Revert "SERVER-48443 Fix builds with Icecream 1.2+ and gcc 4.4+"

This reverts commit 01dd381f3359c44bbc9338d91371d1ff823bb7d8.
Branch: master
https://github.com/mongodb/mongo/commit/4635320ea16bb9217605b7150461561fdded5e22

Comment by Svilen Mihaylov (Inactive) [ 15/Jul/20 ]

Thank you very much for fixing this!

Comment by Githook User [ 14/Jul/20 ]

Author:

{'name': 'Ryan Egesdahl', 'email': 'ryan.egesdahl@mongodb.com', 'username': 'deriamis'}

Message: SERVER-48443 Fix builds with Icecream 1.2+ and gcc 4.4+

A bug spotted in Icecream 1.2+ can cause build failures when building
with gcc. This is, in turn, due to a bug in GCC where the preprocessor
executed via `gcc -E` has different behavior than the one used
internally during compilation. We are working with Icecream, and GCC
to address these problems. For now, we work around the bugs.

Comment by Ryan Egesdahl (Inactive) [ 07/Jul/20 ]

After many false starts and dead ends, this ticket has become a cautionary tale about not believing the first thing you see. What's actually happening here has nothing to do with Icecream or ccache. It's actually a GCC bug:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88475

What's actually actually happening here is that we have a header somewhere in our source tree (exactly where is another question at the moment) that triggers the bug. Specifically, when one or more directives is processed by the GCC preprocessor with -fdirectives-only, it gets expanded into a form that is not further parseable by GCC. Clang does not have this problem because the corresponding flag we pass to it does not have the same behavior or the same issues.

The reason why this issue happens only with ccache and Icecream enabled in our build system is an accident of how we have them configured to work together versus separately. What makes this problem even more complicated to understand is the fact that both Icecream and ccache can trigger the behavior, multiple causal factors are in play, the exact relationships between the tools are murky, and the fact that it's possible to accidentally work around this problem and not see it again until you do another full rebuild.

In short, passing -fdirectives-only is the real triggering issue in our build system at the moment. I will work around the problem here. Whether and how we correct the source to fix the real cause will be a separate problem.

Comment by Ryan Egesdahl (Inactive) [ 26/Jun/20 ]

It looks like I was slightly wrong in the above explanation. The cause of breakage is accurate, but the mechanism wasn't quite so much. The problem was not, in fact, CCACHE_NOCPP2 not being set. Rather, it was -fdirectives-only not being set that did it. The CCACHE_NOCPP2 environment variable does not actually disable all preprocessing - it just disables a second preprocessor that is used to calculate a hash for the cache.

What seems to be happening is that ccache is using icecream as its preprocessor, and the newer version of icecream by default also does remote preprocessing. Because we use -fdirectives-only, this has been triggering a known issue with newer icecream versions: https://github.com/icecc/icecream/issues/451

There are two mechanisms of failure: first, -fdirectives-only causes problems for ccache when it's is not running its own preprocessor (and is using icecream instead) because it then tries to get icecream to compile it a second time, which fails for the above reason; second, running a preprocessor without -fdirectives-only causes ccache to send the preprocessed text it got back from icecream back for compiling, which triggers a second preprocessing step that GCC can't handle because it is expecting to do something like that via a precompiled header.

When I added -fpch-includes to the mix, I was actually causing ccache to intercept the compilation failure (because it has special processing for that flag) and try it again from scratch, which succeeded. I verified that by watching ccache errors for precompiled headers ticking upwards even though the compilation was succeeding. Icecream ignores the flag, so it was having no effect remotely. I was still passing in -fdirectives-only at this point, which is why cache was succeeding - the text returned from the remote icecream preprocessing matched what ccache was producing locally.

The recommended solution from both the Icecream and GCC developers is that we do no preprocessing remotely (and disable the -fdirectives-only flag). I've verified this works as well as doing preprocessing both locally and remotely with the flag. I am not sure which one is better for performance, but I lean towards the external recommendation here.

acm, daniel.moody - thoughts on this?

Comment by Ryan Egesdahl (Inactive) [ 26/Jun/20 ]

This behavior started in SERVER-40943 and was caused by adding ccache as a tool unconditionally. This was causing icecream.py to use ccache but without the CCACHE_NOCPP2 option we had set to prevent exactly this problem. I only noticed this during an unrelated SCons-only build where I had icecream enabled and ccache disabled and had turned on VERBOSE=1 to see what was actually being compiled. That's when I noticed ccache was being used even though I had not requested it.

This was actually a bad interaction between ccache and icecc preprocessing, which I will get to shortly. As for why we were getting different behavior only on newer versions of icecream, this commit to the icecream repo tells the rest of the story:

https://github.com/icecc/icecream/commit/e7d95da05ed78a051d45c9cfb00fc9f1539505ca

Essentially, before icecream 1.2, precompiled headers were not being detected properly anyway and were not being used remotely. This patch was committed in icecream version 1.1, so everyone using that version never noticed that CCACHE_NO_CPP2 wasn't set. Versions 1.0.1 and 1.3, however, had the correct behavior and CCACHE_NO_CPP2 mattered again. Interestingly, this commit tells us what my original proposed fix (adding "-fpch-preprocess" to the compiler flags) actually did:

https://github.com/icecc/icecream/commit/c56c19ea0264ef295a3f5824beb1ada48bd58682

The flow of information is that icecream was preprocessing the headers and then returning them to ccache, which would then also attempt to preprocess the same headers. Clang handles this gracefully, but GCC did not. The flag, which was dropped by icecream (thanks to the above commit) was still being used by ccache, which was allowing the compilation to succeed by telling GCC that a precompiled header was present.

The mess of logic for enabling ccache and icecream and then detecting whether they were enabled and which flags and environment options were being passed where were confounding a problem that was already difficult to pull apart. I'm going to correct it, simplify it, and document it better so we don't run into this problem again.

Comment by Andrew Morrow (Inactive) [ 23/Jun/20 ]

Notably, this does not happen on Ubuntu 18.04, where icecc is 1.1 and ccache is 3.4.1 On Ubuntu 20.04, we find icecc 1.3 and ccache 3.7.7. That doesn't really help us rule much out unfortunately, since both are newer versions. But it does narrow the range of versions in play.

Comment by Ryan Egesdahl (Inactive) [ 17/Jun/20 ]

I think I have finally isolated the problem. What seems to be happening is that the source is getting preprocessed by both the compiler and icecream (as documented here: https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Using_Icecream#Frequent_problems_debugging under "I get build failures to due -Werror"). Clang appears not to care that this has happened, but (newer?) GCC does for some reason. The fix is simple - since we know that using both icecream and ccache together results in a twice-compiled header, we can just apply the fix suggested here: https://ccache.dev/manual/3.7.9.html#_precompiled_headers

Doing the above fix seems to correct the problem, and I can verify that both ccache and icecream are in use. I'll test further and ensure it doesn't break anything else.

Conditions to reproduce:

  1. icecc 1.3.1 (or 1.1+)
  2. building with gcc
  3. using ccache

I was able to reproduce this with both Ninja and SCons builds in the above conditions.

Comment by Ryan Egesdahl (Inactive) [ 16/Jun/20 ]

It turns out that in the previous investigation, I was not actually using icecc at all because of the way I was setting the ICECC variable for SCons. However, I was able to perform several other tests that demonstrate where the problem likely is.

First, when I disable the ccache cache (using CCACHE_DISABLE=1), the problem does not appear. Second, when I add CCACHE_PREFIX_CPP=<icecc_path> to icecream.py, the source compiles with some warnings that look like this:

Compiling build/debug/mongo/db/pipeline/document_source_change_stream_close_cursor.dyn.o
In file included from src/mongo/db/query/explain.h:32,
                 from src/mongo/db/commands.h:47,
                 from build/debug/mongo/db/stats/server_write_concern_metrics_gen.cpp:16:
src/mongo/db/exec/plan_stage.h: In member function 'void mongo::PlanStage::dispose(mongo::OperationContext*)':
src/mongo/db/exec/plan_stage.h:296:62: error: self-comparison always evaluates to true [-Werror=tautological-compare]
             invariant(!_opCtx || opCtx == opCtx);
                                                              ^
cc1plus: all warnings being treated as errors
scons: *** [build/debug/mongo/db/stats/server_write_concern_metrics_gen.dyn.o] Error 1
In file included from src/mongo/db/query/explain.h:32,
                 from src/mongo/db/commands.h:47,
                 from src/mongo/db/commands/server_status.h:32,
                 from src/mongo/db/stats/server_read_concern_metrics.cpp:34:
src/mongo/db/exec/plan_stage.h: In member function 'void mongo::PlanStage::dispose(mongo::OperationContext*)':
src/mongo/db/exec/plan_stage.h:296:62: error: self-comparison always evaluates to true [-Werror=tautological-compare]
             invariant(!_opCtx || opCtx == opCtx);

This suggests that there is a problem in the handoff between the "local" and "remote" compilers at the preprocessing stage, and specifically that the remote preprocessor is somehow incorrect. The locale therefore seems to be a red herring here because I can verify that the environment passes through as I expect:

ubuntu@ip-10-122-5-51:~/workspace/mongo$ LANG="blah" ccache icerun sh -c 'env' | grep LANG
LANG=blah
ubuntu@ip-10-122-5-51:~/workspace/mongo$ LANG="blah" ccache icecc sh -c 'env' | grep LANG
LANG=blah

It now looks like this problem might be related to something acm is already working on (https://jira.mongodb.org/browse/SERVER-47598) where we have noticed some inconsistencies in the compiler that gets packaged for icecream runs. I suspect that fixing the problem there will fix the problem here, but I am going to investigate further to try and confirm that.

Comment by Ryan Egesdahl (Inactive) [ 16/Jun/20 ]

This seems to be in the interaction between ccache and icecream. Specifically, when ccache is enabled with icecream, we call ccache by setting CCACHE_PREFIX in icecream.py.

We probably didn't see this because it depends on the following factors being present at the same time:

  1. ccache version > 3.5.1 or 1.6
  2. environment has been "sanitized" (usually because SCons does this)
  3. ccache is being used with icecream

The combination of the above three items is unusual in our current environment because it only seems to show up in a specific use case on Ubuntu 20.04 right now. However, it's worth fixing because it represents behavior we have been depending on which changed out from underneath us.

My current hypothesis based on some testing is that icecream is being run with a different set of LANG environment variables than gcc is being run under via CCACHE, which may in turn be related to this bug which was fixed in ccache 3.5.1 and 3.6:

https://github.com/ccache/ccache/issues/327

If this is what is happening, it may be that this is causing some form of character encoding issues, the exact nature of which I currently do not understand. However, specifically setting LANG when icecream is run seems to "fix" the problem. I'm investigating further in this area.

Comment by Svilen Mihaylov (Inactive) [ 16/Jun/20 ]

Ok this is great. Its great we are validating gcc vs clang.

Comment by Andrew Morrow (Inactive) [ 16/Jun/20 ]

svilen.mihaylov - Our toolchain of record on Linux is GCC, macOS is Xcode clang, and Windows is MSVC. So Linux clang builds aren't how we produce our official release binaries. However, using clang for development is just fine, and we do have clang based linux builders on the waterfall, notably the *SAN builders. So you should have no issues using the toolchain clang on Linux for your development work.

Comment by Svilen Mihaylov (Inactive) [ 15/Jun/20 ]

Thank you very much for investigating. I have uploaded my build.ninja file.

 

I tried switching the vars from gcc to clang and it seems to be working!

Before I am completely satisfied with the resolution, I'd like to know how this affects (if at all) the binaries which are produced? Are we fully cross-compatible?

Comment by Andrew Morrow (Inactive) [ 15/Jun/20 ]

I was able to reproduce this on an Ubuntu 20.04 spawn host using the above command line, and then running ninja build/debug/mongo/base/data_range.dyn.o with a clean ccache, so there is definitely some sort of issue here.

I was able to make the issue go away by enabling only one of icecream or ccache at the same time, so that also helps - the issue is in the integration where we handle the case where both are enabled. However, and notably, the issue did not reproduce if I switched to using toolchain clang by replacing --variables-files=etc/scons/mongodbtoolchain_stable_gcc.vars with --variables-files=etc/scons/mongodbtoolchain_stable_clang.vars.

svilen.mihaylov - could you see if switching to clang works for you? In the meanwhile, we will continue to investigate why this seems not to work as expected.

Comment by Svilen Mihaylov (Inactive) [ 15/Jun/20 ]

Just tried on newest available toolchain version (revisions/4666020dcdac5e2c493adc9e1638877b50ee26c6/) with the same problem.

Comment by Svilen Mihaylov (Inactive) [ 15/Jun/20 ]

acm Is there a specific set of info you are interested in terms of environment setup?

Comment by Svilen Mihaylov (Inactive) [ 15/Jun/20 ]

Attached a text file showing the compile errors when using CCACHE.

Comment by Svilen Mihaylov (Inactive) [ 15/Jun/20 ]

Sure, here:

 

./buildscripts/scons.py --dbg=on --opt=off --link-model=dynamic --allocator=system CCFLAGS='-Wa,--compress-debug-sections -gsplit-dwarf' --variables-files=etc/scons/mongodbtoolchain_stable_gcc.vars MONGO_VERSION=$(git describe --abbrev=0 | tail -c+2) ICECC=icecc CCACHE=ccache --ninja --modules=enterprise build.ninja

Comment by Andrew Morrow (Inactive) [ 15/Jun/20 ]

svilen.mihaylov - Could you add complete repro instructions to this ticket? Local environment info, commands executed, error messages generated, etc. That would go a long way to helping us investigate.

Comment by Svilen Mihaylov (Inactive) [ 15/Jun/20 ]

Just did a quick re-testing and it is still broken.

Comment by Svilen Mihaylov (Inactive) [ 11/Jun/20 ]

Can I help with the investigation of this problem?

Generated at Thu Feb 08 05:17:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.