Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 8.1.0-rc0, 8.0.5
Affects Version/s: None
Component/s: None
Labels: None
Assigned Teams: Networking & Observability
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v8.0
Sprint: Networking & Obs 2024-09-30, Networking & Obs 2024-10-14
Linked BF Score: 0
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Note: This bug fix ticket came out of BF-28781, which detected an unexpected intermittent failure during a shared search query.

The TaskExecutorCursor (TEC) is a class that lets mongod manage a cursor on mongot. Under the hood, it owns a TaskExecutor, which does the actual networking to mongot. When communication between mongod and mongot runs in pinning mode, this TaskExecutor (a virtual interface) is a concrete PinnedConnectionTaskExecutor (PCTE).

In practice, it's possible for two different TECs to share the same PCTE. This is because mongod can run cursor-establishing commands on mongot that open multiple cursors from a single command: generally, a meta cursor and an ordinary result cursor. Furthermore, either of these TECs can go out of scope or be destroyed with the expectation that the other can continue unaffected.

Currently, there is a race condition: if a TEC's destructor runs while that same TEC has an outstanding network operation over the PCTE, the destructor kills the entire pinned connection. This is a problem if the other TEC still expects to perform more network operations: its pinned connection is now closed, so its next attempt to talk to mongot produces an error.

The code for this fix should be very simple: remove the "|| _options.PinnedConnection" condition from this check here.
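The failure mode and the intended post-fix behavior can be modeled with a small, self-contained sketch. Everything in it is hypothetical: `FakePinnedExecutor` and `FakeCursor` are simplified stand-ins for the PCTE and the TEC, and the `shutdownPinnedOnDestroy` flag only loosely represents the effect of the condition being removed. It is not the real check or the real class layout.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for the PCTE: one pinned connection shared by several cursors.
class FakePinnedExecutor {
public:
    void shutdown() { _open = false; }

    // Simulates one round trip to mongot over the pinned connection.
    std::string run(const std::string& command) {
        if (!_open) {
            throw std::runtime_error("pinned connection is closed");
        }
        return "ok: " + command;
    }

private:
    bool _open = true;
};

// Hypothetical stand-in for a TEC.
class FakeCursor {
public:
    FakeCursor(std::shared_ptr<FakePinnedExecutor> executor, bool shutdownPinnedOnDestroy)
        : _executor(std::move(executor)), _shutdownPinnedOnDestroy(shutdownPinnedOnDestroy) {
        assert(_executor);
    }

    ~FakeCursor() {
        // Pre-fix behavior (modeled): an in-flight request at destruction time
        // takes down the executor that a sibling cursor still depends on.
        if (_requestInFlight && _shutdownPinnedOnDestroy) {
            _executor->shutdown();
        }
    }

    void startPrefetch() { _requestInFlight = true; }
    std::string getMore() { return _executor->run("getMore"); }

private:
    std::shared_ptr<FakePinnedExecutor> _executor;
    bool _shutdownPinnedOnDestroy;
    bool _requestInFlight = false;
};

// Returns true if the results cursor can still talk to "mongot" after the
// meta cursor is destroyed with a request outstanding.
bool siblingSurvives(bool shutdownPinnedOnDestroy) {
    auto executor = std::make_shared<FakePinnedExecutor>();
    FakeCursor results(executor, shutdownPinnedOnDestroy);
    {
        FakeCursor meta(executor, shutdownPinnedOnDestroy);
        meta.startPrefetch();
    }  // meta is destroyed here with its request still outstanding
    try {
        results.getMore();
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

In this model, `siblingSurvives(true)` reproduces the error path and `siblingSurvives(false)` shows the behavior the fix is meant to restore. The real race depends on timing between threads, which this single-threaded sketch deliberately leaves out.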

The complexity of this ticket is in testing that some specific cases still work as expected:

1) When a TEC is being destroyed while it has an outstanding network operation, the TEC can finish destruction and the operation can still come back without any errors.

2) A new operation can be enqueued on the PCTE while a different outstanding operation is in progress, and both can come back properly, even if the first operation's TEC is already destroyed.

Also, to reproduce this bug, the results TEC must be in a non-prefetching mode and the metadata cursor must be in a prefetching mode. To get this behavior in the 'sharded_sort.js' test, enable the 'featureFlagSearchBatchSizeTuning' flag (there may be other ways to get the same prefetching/non-prefetching state, but this is the only way I know of).

These cases should be forceable by installing the right sleeps in the right places (in mongod and the mongot mock). Reach out to george.wangensteen@mongodb.com or joseph.shalabi@mongodb.com for clarification on this ticket or for help producing the test cases, since we originally investigated this failure.
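As a rough illustration of what the two cases assert, here is a deterministic, single-threaded model. All names (`QueueingExecutor`, `PrefetchingCursor`, `runScenario`) are hypothetical and are not server APIs. An explicit `drain()` step stands in for the moment the delayed replies arrive, which a real test would force with sleeps in mongod and the mongot mock.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical FIFO model of a pinned connection: requests are answered in order.
class QueueingExecutor {
public:
    using Callback = std::function<void(const std::string&)>;

    void schedule(std::string request, Callback onReply) {
        _queue.push_back({std::move(request), std::move(onReply)});
    }

    // Delivers every pending reply, like a mock mongot answering after a delay.
    void drain() {
        while (!_queue.empty()) {
            Op op = std::move(_queue.front());
            _queue.pop_front();
            op.onReply("reply:" + op.request);
        }
    }

private:
    struct Op {
        std::string request;
        Callback onReply;
    };
    std::deque<Op> _queue;
};

// Hypothetical cursor that can issue a request and then be destroyed.
class PrefetchingCursor {
public:
    PrefetchingCursor(std::shared_ptr<QueueingExecutor> executor,
                      std::shared_ptr<std::vector<std::string>> replies)
        : _executor(std::move(executor)), _replies(std::move(replies)) {
        assert(_executor && _replies);
    }

    void request(const std::string& name) {
        // The callback captures only shared state, so it stays valid even if
        // this cursor is destroyed before the reply arrives.
        auto replies = _replies;
        _executor->schedule(name, [replies](const std::string& reply) {
            replies->push_back(reply);
        });
    }

private:
    std::shared_ptr<QueueingExecutor> _executor;
    std::shared_ptr<std::vector<std::string>> _replies;
};

std::vector<std::string> runScenario() {
    auto executor = std::make_shared<QueueingExecutor>();
    auto replies = std::make_shared<std::vector<std::string>>();
    PrefetchingCursor results(executor, replies);
    {
        PrefetchingCursor meta(executor, replies);
        meta.request("meta.getMore");
    }  // Case 1: meta is destroyed while its operation is outstanding.
    results.request("results.getMore");  // Case 2: enqueue behind that operation.
    executor->drain();
    return *replies;
}
```

Both replies should arrive in order with no error, even though the first operation's cursor no longer exists. A real test would need to check the same property against the actual PCTE under genuine concurrency.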

is depended on by

SERVER-93614 Make pinning connection between mongod and mongot the default

Closed

Assignee: Erin McNulty
Reporter: George Wangensteen (Inactive)
Participants: Erin McNulty, George Wangensteen, Githook User
Votes: 0
Watchers: 6

Created: Aug 14 2024 05:29:01 PM UTC
Updated: Jan 07 2025 10:18:54 PM UTC
Resolved: Sep 30 2024 06:51:15 PM UTC
Confidence Status Last Update: 17/Sep/24 2:11 PM
