[SERVER-54375] Failing windows build with error LNK1106 Created: 07/Feb/21  Updated: 29/Oct/23  Resolved: 11/Feb/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Bug Priority: Major - P3
Reporter: Sam Mercier Assignee: Daniel Moody
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: 6278712e64dc58ce127858c3d59272c0.bad.obj, 6278712e64dc58ce127858c3d59272c0.good.obj
Issue Links:
    Depends
    Problem/Incident
        is caused by SERVER-54458 updated vendored scons to use uuid fo... Closed
    Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Dev Platform 2021-02-22
Participants:
Linked BF Score: 17

 Description   

Succeeding patch build – https://spruce.mongodb.com/patch/601dcf502fbabe23aeb93f4a
Original commit – https://github.com/mongodb/mongo/commit/6feae12fe29a4c921bdbf03dd8b1ae6d5dd27f92
Failing waterfall build (1) – https://evergreen.mongodb.com/version/mongodb_mongo_master_6feae12fe29a4c921bdbf03dd8b1ae6d5dd27f92
Failing waterfall build (2) – https://evergreen.mongodb.com/version/mongodb_mongo_master_0d804a26399c41e62a2d0a282120af0cc22b8959
Failing waterfall build (3) – https://evergreen.mongodb.com/version/mongodb_mongo_master_7664a855f33bbe7e0f77cee78cb07e564a4f0c4c
Revert commit – https://github.com/mongodb/mongo/commit/d77297f4a454073505741ae586c885e087b30165
Failing Patch Build (1) – https://spruce.mongodb.com/version/601f64433e8e86586c9b10f2/
Failing Patch Build (2) – https://spruce.mongodb.com/version/601f6a14d1fe07153cbe1f39/
Succeeding Patch Build (1) – https://spruce.mongodb.com/patch/60218a3ad1fe0749d2b1e3f6



 Comments   
Comment by Sam Mercier [ 11/Feb/21 ]

True. Furthermore, there are no guarantees that we've thought through every way in which bad data can enter the cache. And from experience we know it can take a while for the issue to dissipate.

Couldn't one use the hash instead of a UUID in order to get us closer to the world of (2) while still achieving the goal of (1) (and getting the ease of implementation)?
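
For illustration, a minimal sketch of the hash-named-tempfile idea in Python (push_to_cache and its signature are assumptions for this sketch, not the vendored scons API):

    import hashlib
    import os
    import shutil

    def push_to_cache(src_path, cachefile):
        # Derive the temp name from the file's content hash rather than a
        # pid or UUID: two writers racing with identical content collide
        # harmlessly, and the name itself doubles as a checksum.
        with open(src_path, 'rb') as f:
            digest = hashlib.md5(f.read()).hexdigest()
        tmp = cachefile + '.tmp.' + digest
        shutil.copy2(src_path, tmp)
        os.replace(tmp, cachefile)  # atomic rename within one filesystem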

Comment by Daniel Moody [ 10/Feb/21 ]

Okay, I'll start in on 1, but it should also be noted that implementing 2 allows scons to decide to rebuild the bad file locally, thereby bypassing the error.

Comment by Andrew Morrow (Inactive) [ 10/Feb/21 ]

Doing 1 first would help immediately prevent new instances of this issue, while doing 2 would just alert us when bad data has already been pushed. For me, that argues in favor of working up a fix for the pid tempfile thing in our vendored SCons pretty much right away.

Comment by Daniel Moody [ 10/Feb/21 ]

Well, there are two points to this:

  1. Prevent writing bad files to cache by fixing the pid tempfile thing.
  2. Prevent using bad files from the cache by adding cache retrieval verification (i.e., store the md5sum in the filename).

I have been playing around with the second point and it seems like it will work; both would be pretty easy to implement in our vendored scons (a rough sketch of the verification idea follows).
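
For illustration, a rough sketch of what point 2 (cache retrieval verification) could look like, assuming cache entries are named "<signature>-<md5>"; this is a hedged sketch, not the actual vendored-scons change:

    import hashlib
    import os

    def fetch_verified(cachedir, sig):
        # Re-hash the payload on retrieval; a mismatch is treated as a
        # cache miss (and the corrupt entry is evicted), so scons falls
        # back to rebuilding locally instead of linking a bad file.
        for name in os.listdir(cachedir):
            if not name.startswith(sig + '-'):
                continue
            path = os.path.join(cachedir, name)
            with open(path, 'rb') as f:
                digest = hashlib.md5(f.read()).hexdigest()
            if digest == name.split('-', 1)[1]:
                return path   # verified hit
            os.unlink(path)   # corrupt entry: evict it
        return None           # miss; caller rebuilds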

Comment by Andrew Morrow (Inactive) [ 10/Feb/21 ]

Given that we now think we understand how this happened, what should be the next steps for this ticket?

Comment by Andrew Morrow (Inactive) [ 10/Feb/21 ]

Oh, that is interesting. That could result in a true write collision, if I remember the prior discussion correctly. Perhaps we ought to get moving on a PR for per-scons-process UUIDs to use for tempfiles? I can't imagine that would be particularly difficult.
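
A minimal sketch of the per-scons-process UUID idea (the helper name is an assumption; the actual change landed as SERVER-54458):

    import uuid

    # One UUID per scons process. Unlike os.getpid(), this cannot collide
    # between two hosts or containers that share the same cache directory.
    _PROC_UUID = uuid.uuid4().hex

    def temp_name(cachefile):
        return cachefile + '.tmp' + _PROC_UUID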

Comment by Daniel Moody [ 10/Feb/21 ]

Another interesting note: the scons commands for those two builds ended up with the same pid (the system logs show pid 5544), which could cause the tempfiles scons uses for writing the cache dir file to be identical:

https://github.com/mongodb/mongo/blob/6feae12fe29a4c921bdbf03dd8b1ae6d5dd27f92/src/third_party/scons-3.1.2/scons-local-3.1.2/SCons/CacheDir.py#L107
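
In the vendored SCons 3.1.2, that line builds the temp name from the pid, roughly (the cache path below is an example, not taken from the logs):

    import os

    # Approximate excerpt from the vendored SCons/CacheDir.py CachePushFunc:
    cachefile = 'Z:/scons-cache/62/6278712e64dc58ce127858c3d59272c0'
    tempfile = cachefile + '.tmp' + str(os.getpid())

    # Two builds on different machines pushing the same cachefile can both
    # draw pid 5544; both then write the same ".tmp5544" name into the
    # shared cache dir and rename it over the final entry, a true write
    # collision that can leave a torn or corrupt object file behind.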

Comment by Daniel Moody [ 10/Feb/21 ]

Yeah, I am leaning towards a possible linker bug; file corruption is usually not so specific.

I can reproduce the error just by running dumpbin (the Windows analogue of readelf).

Currently I am trying to find the patch build that introduced the bad file to the cache to see if there is any more information there.

It's a bit harder to protect against bad files getting added to the cache, but one idea I had was delayed cache pushing: hold pushes until all of a node's parents are successfully built, or until the entire build finishes successfully (rough sketch below). This doesn't really protect items on which nothing depends (like the final binaries), but we use the nolinked option so those files don't come from the cache anyway?
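
For illustration, a very rough sketch of the delayed-push idea with invented names (queue the pushes, publish only on a clean build):

    import shutil

    _pending = []

    def queue_push(src_path, cachefile):
        # Remember the push instead of copying into the shared cache now.
        _pending.append((src_path, cachefile))

    def flush_pending(build_succeeded):
        # Publish to the cache only once the whole build has succeeded, so
        # a failed or interrupted build never leaves a bad file behind.
        if build_succeeded:
            for src, dst in _pending:
                shutil.copy2(src, dst)
        _pending.clear()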

Comment by Andrew Morrow (Inactive) [ 10/Feb/21 ]

The library it gets linked into contains just the one file. Can you reproduce the error by trying to link it into a library outside of SCons? Perhaps this is just a toolchain bug? Maybe the file is actually fine and the linker is broken? Maybe the file is broken because the compiler mis-compiled it and the bad file landed in cache? Maybe there is no error in the filesystem or the cache at all.

Comment by Daniel Moody [ 10/Feb/21 ]

Looking at the two binaries with a hex diff, I don't see any large-scale file corruption. I jumped through the file examining the hex, and throughout I could see ASCII string data correlating to the mongodb symbols, all the way to the end of the file. Of course, the good binary was built in a different directory with different flags on a different system, so the comparison is not great. Ideally I would rebuild the binary in evergreen on the same variant and extract it. Possibly I can run the build with the --force-cache --cache-disable options to force updating the bad file in the cache, and then ask brian.mccarthy to retrieve it for me again.

I also did a hex search for the offending address (0x65B6C3F8) and did not find it in either binary:

LANG=CXX grep -obUaP "\x65\xB6\xC3\xF8" /c/Users/Administrator/6278712e64dc58ce127858c3d59272c0
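
The same search in Python, also checking the little-endian byte order (x86 object files typically store addresses least-significant byte first); the path argument is a placeholder:

    import sys

    ADDR = bytes.fromhex('65B6C3F8')

    def scan(path):
        data = open(path, 'rb').read()
        for label, needle in (('big-endian', ADDR),
                              ('little-endian', ADDR[::-1])):
            off = data.find(needle)
            if off == -1:
                print(label + ': not found')
            else:
                print(label + ': found at offset', off)

    scan(sys.argv[1])  # e.g. /c/Users/Administrator/6278712e64dc58ce127858c3d59272c0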

Comment by Daniel Moody [ 08/Feb/21 ]

Uploaded the good and bad versions of mongo/db/catalog/index_builds_manager.obj. They are named with the build-signature hash used in the scons CacheDir.

I generated the good file by checking out and building commit 6feae12 on a windows-64-vs2019-large distro with the invocation:

C:\Python38\python.exe ./buildscripts/scons.py MONGO_DISTMOD=windows -j8 --win-version-min=win10 --separate-debug --cache=nolinked --cache-dir=Z:\scons-cache --install-action=hardlink --jlink=0.5 --cache-debug=scons_cache.log --debug=time --implicit-cache --build-fast-and-loose=on build\cached\mongo\db\catalog\index_builds_manager.lib
