Core Server / SERVER-91466

Memory Usage in noPassthrough Suite increased significantly after upgrading to Windows Server 2022

    • Storage Execution
    • ALL

      While updating tests to run on Windows Server 2022 for MongoDB 8.0 platform support, several issues were discovered in the noPassthrough suite:

      https://spruce.mongodb.com/task/mongodb_mongo_master_enterprise_windows_all_feature_flags_required_noPassthrough_1_windows_enterprise_patch_d60231163ae986719f5b012c47fb065331fabdab_6669f1b564e1ae0007c8514b_24_06_12_19_07_21?execution=2&sortBy=STATUS&sortDir=ASC

      https://spruce.mongodb.com/task/mongodb_mongo_master_enterprise_windows_all_feature_flags_required_noPassthrough_1_windows_enterprise_patch_d60231163ae986719f5b012c47fb065331fabdab_6669f1b564e1ae0007c8514b_24_06_12_19_07_21/tests?execution=1&sortBy=STATUS&sortDir=ASC

      https://spruce.mongodb.com/task/mongodb_mongo_master_enterprise_windows_all_feature_flags_required_noPassthrough_1_windows_enterprise_patch_d60231163ae986719f5b012c47fb065331fabdab_6669f1b564e1ae0007c8514b_24_06_12_19_07_21/tests?execution=0&sortBy=STATUS&sortDir=ASC

      The commit this branch is based on does not have this issue; the only change is switching the Evergreen host distro from "windows-vsCurrent-large" (Windows Server 2019) to "windows-2022-large" (Windows Server 2022).

      The Windows Server version upgrade will use a workaround that decreases resmoke concurrency to avoid exhausting the system's memory, but it is still unclear why the upgrade caused memory usage to increase.
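
      For reference, a minimal sketch of what that workaround could look like, assuming resmoke's --jobs flag is used to cap concurrency (the Evergreen task may instead lower an expansion such as resmoke_jobs_max; the suite name and job count below are placeholders, not the actual change):

      # Run the suite with fewer concurrent resmoke jobs, so fewer mongod/mongos
      # processes are alive at once and peak memory stays lower.
      python buildscripts/resmoke.py run --suites=no_passthrough --jobs=4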

      max.hirschhorn@mongodb.com's analysis:

      The Evergreen timeout in execution #3 appears to be caused by slow resmoke logging, which led to the primary of the replica set relinquishing primary and then hitting fassert(7152000) because it could not complete the step-down quickly enough while the mongod was fsyncLocked.

      [js_test:sharded_pit_backup_restore_simple] d20846| 2024-06-13T01:49:41.751+01:00 I REPL 21809 [S] [ReplCoord-0] "Can't see a majority of the set, relinquishing primary"
      ...
      [js_test:sharded_pit_backup_restore_simple] d20846| 2024-06-13T01:50:11.832+01:00 F REPL 5675600 [S] [ReplCoord-0] "Time out exceeded waiting for RSTL, stepUp/stepDown is not possible thus calling abort() to allow cluster to progress","attr":{"lockRep":{"ReplicationStateTransition":{"acquireCount":{"W":1},"acquireWaitCount":{"W":1},"timeAcquiringMicros":{"W":30079690}}}}
      [js_test:sharded_pit_backup_restore_simple] d20846| 2024-06-13T01:50:11.832+01:00 F ASSERT 23089 [S] [ReplCoord-0] "Fatal assertion","attr":{"msgid":7152000,"file":"src\\mongo\\db\\repl\\replication_coordinator_impl.cpp","line":2964}

      https://parsley.mongodb.com/test/mongodb_mongo_master_enterprise_windows_all_feature_flags_required_noPassthrough_1_windows_enterprise_patch_d60231163ae986719f5b012c47fb065331fabdab_6669f1b564e1ae0007c8514b_24_06_12_19_07_21/2/af21249a209a8a57122acbfa50b9bb32?bookmarks=0,118966,137712,239798,242772&filters=10020846%255C%257C.%2A%255C%255BReplCoord-0%255C%255D&shareLine=0
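
      To make the fsyncLocked state above concrete, here is a hypothetical sketch of how a backup test holds the fsync lock around its file copy (pymongo is assumed; the port is taken from the d20846 prefix in the log and is illustrative only):

      # Assumed illustration of the fsync / fsyncUnlock admin commands a backup test
      # issues. While the node remains locked, the step-down in the excerpt above
      # cannot acquire the RSTL in time and the node eventually fasserts.
      from pymongo import MongoClient

      client = MongoClient("mongodb://localhost:20846")
      client.admin.command("fsync", lock=True)    # flush and block writes (fsyncLock)
      try:
          pass  # copy data files / take the backup here
      finally:
          client.admin.command("fsyncUnlock")     # release so a step-down can proceed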

      The Evergreen timeout in execution #2 appears to be caused by out_timeseries_cleans_up_bucket_collections.js, though I couldn't say why. The logs are incomplete for the other tests because the flush thread hit a MemoryError exception. Memory usage reaches ~100% at 22:36 UTC, but neither the system logs nor system_resource_info.json identifies what is consuming the excess memory. Notably, the memory of the processes listed sums to only 10-13 GB of the 33 GB available.
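
      As a rough illustration of that per-process accounting, a minimal sketch assuming the psutil package (the actual figures came from system_resource_info.json, not this script):

      # Sum per-process memory and compare it with total RAM; on Windows, rss
      # corresponds to the process working set. Memory that is not attributable to a
      # listed process (e.g. kernel/nonpaged pool, file cache) will not show up here.
      import psutil

      total = psutil.virtual_memory().total
      per_process = []
      for proc in psutil.process_iter(["name", "memory_info"]):
          mem = proc.info["memory_info"]
          if mem is None:
              continue
          per_process.append((proc.info["name"] or "?", mem.rss))

      accounted = sum(rss for _, rss in per_process)
      print(f"listed processes: {accounted / 2**30:.1f} GiB of {total / 2**30:.1f} GiB total")
      for name, rss in sorted(per_process, key=lambda p: p[1], reverse=True)[:10]:
          print(f"{name:30s} {rss / 2**20:8.0f} MiB")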

      The Evergreen failure in execution #1 has 7 of the 8 tests failing with "out of memory".

            Assignee: Unassigned
            Reporter: zack.winter@mongodb.com (Zack Winter)
            Votes: 0
            Watchers: 8
