[SERVER-64323] windows ldap tests failing with concurrency Created: 08/Mar/22  Updated: 29/Oct/23  Resolved: 10/Jun/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Daniel Moody Assignee: Sergey Galtsev (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-64322 reduce resmoke jobs for external_auth... Closed
is related to SERVER-67010 Investigate and resolve slow startup ... Closed
is related to SERVER-67094 Implement retry logic in LDAP Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Security 2022-05-16, Security 2022-05-30, Security 2022-06-13
Participants:

 Description   

The jobs for resmoke tests on windows for external auth suite were limited in SERVER-64322 to get the test passing again. This seems related to MSVS 2022. Possibly there is a new security feature related to MSVS 2022 where concurrent ldap request get limited. There needs to be a deeper investigation related to ldap and concurrency in the context of MSVS 2022.



 Comments   
Comment by Githook User [ 10/Jun/22 ]

Author:

{'name': 'sergey.galtsev', 'email': 'sergey.galtsev@mongodb.com', 'username': 'brushless-glitch'}

Message: SERVER-64323 limit number of external_auth jobs only on windows
Branch: master
https://github.com/mongodb/mongo/commit/5b0d49742e1dd9462262593fe844bf3e65bca1d7

Comment by Sergey Galtsev (Inactive) [ 07/Jun/22 ]

Plan to go forward is:

1. Windows will be running in single test mode, Linux will run multiple
2. SERVER-67094 opened to address intermittent retryable ldap errors

Comment by Sergey Galtsev (Inactive) [ 02/Jun/22 ]

I opened SERVER-67010 for I am transferring the ticket to Service Arch to decide what (if anything) should be done to startup sequence.

Comment by Billy Donahue [ 23/May/22 ]

Abseil does something that might help with the startup cost?

https://github.com/10gen/mongo/blob/master/src/third_party/abseil-cpp-master/abseil-cpp/absl/debugging/symbolize_win32.inc#L47

  // Symbols are not loaded until a reference is made requiring the
  // symbols be loaded. This is the fastest, most efficient way to use
  // the symbol handler.
  SymSetOptions(SYMOPT_DEFERRED_LOADS | SYMOPT_UNDNAME);
  if (!SymInitialize(process, nullptr, true)) {

Comment by Sergey Galtsev (Inactive) [ 23/May/22 ]

There is a fix available, however it is not possible to reliably verify it within our present test harness. While discussing situation with andrew.morrow@mongodb.com and matt.broadstone@mongodb.com, we decided to put the ticket on a back-burner up until the windows stack trace functionality testing is implemented. Linking to PM-2926, and must be revisited once a ticket is raised to verify specifically windows stack trace generation

Comment by Sergey Galtsev (Inactive) [ 20/May/22 ]

The difference between VS2019 and VS2022 builds was identified: SymInitializeW will take ~0.6 sec on VS2019 builds, and ~1.5 sec on VS2022. That seems to match tests in which mongod starting up and then immediately shutting down exhibited difference in run time of ~1 sec.

After commenting out SymbolHandler initialization code, I tried running external_auth 6 times in attempt to reproduce failures, and it was successful every time

Comment by Sergey Galtsev (Inactive) [ 12/May/22 ]

This is preliminary report on the subject, which I am posting to summarize what I found so far. I have not at this time found root cause for any of the issues described

Setting:

Distro: windows-vsCurrent-xlarge. Binaries compiled using following command lines:

python buildscripts\scons.py MSVC_VERSION=14.2 MONGO_DISTMOD=windows VARIANT_DIR=2019 ^
    CPPPATH="c:/sasl/include c:/snmp/include" LIBPATH="c:/sasl/lib c:/snmp/lib" ^
    install-core
rmdir build\2019\install /s/q
move build\install build\2019\install
 
python buildscripts\scons.py MSVC_VERSION=14.3 MONGO_DISTMOD=windows VARIANT_DIR=2022 ^
    CPPPATH="c:/sasl/include c:/snmp/include" LIBPATH="c:/sasl/lib c:/snmp/lib" ^
    install-core
rmdir build\2022\install /s/q
move build\install build\2022\install

Running vcvarsall.bat seems to not affect result of build and how resulting binaries behave

Analysis:

  • msvc env (14.2 vs 14.3) makes difference in following aspects:
    • fast test (ldap_authz_bind.js) completes in 3 to 4 sec when binaris are built using 14.2 vs 5-6 sec with 14.3
    • mongod startup time is consistently higher with 14.3. Mongod starting up and immediately exiting executes at ~1.9 sec when built using 14.2, but ~2.7 sec when built using 14.3. That might account for discrepancy in ldap_authz_bind.js
    • ldap_timeout_poolless.js also shows consistent discrepancy between 14.2 and 14.3 builds (~580 vs 610 seconds to complete)
  • msvc env (14.2 vs 14.3) makes no difference other aspects:
    • ldap calls seem to take same amount of time. unit test making actual ldap query performs at same speed with both 14.2 and 14.3 env
    • resulting binary is approximately same length
    • same libs linked

in other words, no obvious reason why one would be so consistently slower than another

Evergreen build:

  • setting MSVC_VERSION=14.3 instead of MSVC_VERSION=14.2 indeed has an effect of enabling often failures
  • errors observed fall into following broad categories:
    • minor discrepancy between expected and received timeouts, example: "assert.time failed timeout 30000ms took 30192ms"
    • "Failed to acquire LDAP group membership", attributable to "Slow query". Unclear why this is happening in 14.3 builds

Unanswered questions:

  • so far I couldn't identify root cause: what exactly causes slower startup, why LDAP timeouts go beyond expected, why group membership timeout but only on 14.3 builds.

Recommendation:

  • perhaps when testing the timeout we should account for machine possibly running concurrent payloads, or being busy with another task, and double anticipated timeout values. Logic is: we know that if we provide to LDAP timeout value of X seconds, the actual time the call returns will not be exactly X seconds. Measuring "too long" is very subjective, and therefore we should configure the tests with the envelopes which are by far above and beyond nominal value.
Generated at Thu Feb 08 06:00:04 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.