Type: Task
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: None
Assigned Teams: Query Execution
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Consider the following scenario. A collection contains a document with _id: 123, and two separate clients are running. The following sequence occurs:

1. Client 1 begins a transaction.
2. Client 1 deletes _id: 123.
3. Client 1 inserts a new document with _id: 123 (but with a new RecordId).
4. Client 2 runs a findAndModify or updateOne targeting _id: 123 and attempts to set a field 'X'. (Client 2 conflicts with the running transaction and is in a retry loop.)
5. Client 1 commits its transaction.
6. Client 2's operation completes.

Today (in versions 4.4-7.0), Client 2's findAndModify does not update anything, even though there was a document with _id: 123 the entire time. This is permitted under read-committed semantics, since queries under read-committed isolation can miss rows entirely, but it is confusing.

What specifically causes Client 2's operation to "miss" the document?

Client 2's operation is an UPDATE -> IDHACK plan. The plan reads from the _id index, fetches the document, and then receives a WriteConflict while attempting to update it. When it gets the write conflict, the UpdateStage stashes the document it read from the stage below it. After a WriteConflict, we abort our WT transaction and start a new one at a new point in time. The UpdateStage then "recovers" its state (namely, its copy of the document it was trying to update and that document's RecordId): it re-fetches the document by RecordId and checks whether it still matches the filter.

Since the stashed RecordId no longer exists after Client 1 deleted it, no document is fetched. The UpdateStage then returns NEED_TIME, and in the subsequent call to work(), the IDHackStage returns EOF.
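The miss described above can be reproduced in a toy model. This is only a sketch under simplifying assumptions, not server code: `ToyCollection`, `update_by_id`, and `delete_and_reinsert` are hypothetical names, the write conflict is modeled as a callback that commits the other client's changes, and the filter re-check is reduced to an _id comparison.

```python
from itertools import count


class ToyCollection:
    """Toy model: records keyed by RecordId, plus an _id -> RecordId index."""

    def __init__(self):
        self.records = {}
        self.id_index = {}
        self._next_rid = count(1)

    def insert(self, doc):
        rid = next(self._next_rid)  # every insert gets a fresh RecordId
        self.records[rid] = dict(doc)
        self.id_index[doc["_id"]] = rid
        return rid

    def delete(self, _id):
        rid = self.id_index.pop(_id)
        del self.records[rid]


def update_by_id(coll, _id, field_name, value, concurrent_txn=None):
    """Mimic UPDATE -> IDHACK; returns the number of documents modified."""
    # IDHACK: a single seek on the _id index.
    rid = coll.id_index.get(_id)
    if rid is None:
        return 0
    if concurrent_txn is not None:
        # Modeled write conflict: the other transaction commits while we retry.
        concurrent_txn(coll)
        # Recover state: re-fetch by the stashed RecordId, not by _id.
        if rid not in coll.records or coll.records[rid]["_id"] != _id:
            # The IDHACK stage already returned its one document, so the
            # plan reaches EOF without updating anything.
            return 0
    coll.records[rid][field_name] = value
    return 1


def delete_and_reinsert(coll):
    """Client 1's transaction: same _id, new RecordId."""
    coll.delete(123)
    coll.insert({"_id": 123})


coll = ToyCollection()
coll.insert({"_id": 123})
modified = update_by_id(coll, 123, "X", 1, concurrent_txn=delete_and_reinsert)
print(modified)                          # 0
print(coll.records[coll.id_index[123]])  # {'_id': 123}
```

The document with _id: 123 is present before and after the operation, yet the update modifies nothing because the stashed RecordId points at a record that no longer exists.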

What are our known options? (We can add to this)

Update the documentation to make it clear that two documents with the same _id are not necessarily "the same document." Otherwise no change in server behavior. Today's behavior is allowed under read committed isolation, so while it's inconvenient, it's not a bug. There are also two workarounds:
1. Client 2 could use findAndModify and specify a sort. The sort acts as a sort+limit 1, and if the document that comes first in the sort order gets removed or no longer matches the predicate, we retry the entire operation via this code path.
2. Client 1 could update the document instead of deleting and re-inserting it, which would preserve its RecordId.
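From a driver, the two workarounds might look roughly like the sketch below. It assumes a pymongo-style collection object is passed in. The function names, the choice of sorting on _id, and the use of a replacement-style update are illustrative assumptions, not part of the attached repro. Whether a particular sort actually triggers the retry path is not confirmed here.

```python
def client2_update_with_sort(collection, doc_id):
    """Workaround 1 (sketch): findAndModify with a sort.

    pymongo exposes findAndModify-with-update as find_one_and_update.
    The sort key used here is an arbitrary choice for illustration.
    """
    return collection.find_one_and_update(
        {"_id": doc_id},
        {"$set": {"X": 1}},
        sort=[("_id", 1)],
    )


def client1_update_in_place(collection, doc_id, new_fields, session=None):
    """Workaround 2 (sketch): update in place instead of delete + insert.

    A replacement-style update keeps the same document, and so the same
    RecordId, rather than creating a new one.
    """
    replacement = {"_id": doc_id, **new_fields}
    return collection.replace_one({"_id": doc_id}, replacement, session=session)
```

Passing `session` lets Client 1 keep the in-place update inside its existing transaction.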

Change the behavior so that when a document is deleted and re-inserted (same _id, but a new RecordId), concurrent updates will succeed.
1. One idea MaxH had for this was to have the IDHack stage continue seeking/fetching even after it's returned a document, when beneath a write stage. Essentially, removing the limit 1 that's baked into it today (only when beneath a write stage).
2. Pass a flag via the UpdateParams indicating that the query is reading by _id and change this code to check that flag. This would cause the operation to behave just like findAndModify with a sort does today, without changing the IDHack stage.
3. ~~Make some more general change to the update code to retry when a conflict is hit and the document is later found to be missing.~~
  1. This would result in a perf hit for some scenarios, since it would cause operations that do not retry today to retry completely.
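The effect of the flag-based idea (item 2 above) can be illustrated with a small toy model. This is a sketch under assumptions, not the server implementation: `retry_whole_op` is a hypothetical stand-in for a flag passed via the UpdateParams, plain dicts stand in for the record store and the _id index, and the write conflict is modeled as a callback.

```python
def update_by_id(records, id_index, _id, changes,
                 on_conflict=None, retry_whole_op=False):
    """Toy UPDATE -> IDHACK; returns the number of documents modified."""
    rid = id_index.get(_id)  # single seek on the _id index
    if rid is None:
        return 0
    if on_conflict is not None:
        # Modeled write conflict: the concurrent transaction commits.
        on_conflict(records, id_index)
        if rid not in records:
            if not retry_whole_op:
                return 0  # today's behavior: the document is missed
            # Proposed behavior: restart from the index lookup.
            return update_by_id(records, id_index, _id, changes,
                                retry_whole_op=True)
    records[rid].update(changes)
    return 1


def delete_and_reinsert(records, id_index):
    """Client 1: delete _id 123, re-insert it under a new RecordId."""
    del records[id_index[123]]
    records[2] = {"_id": 123}
    id_index[123] = 2


def fresh_state():
    return {1: {"_id": 123}}, {123: 1}


records, id_index = fresh_state()
today = update_by_id(records, id_index, 123, {"X": 1},
                     on_conflict=delete_and_reinsert)

records2, id_index2 = fresh_state()
proposed = update_by_id(records2, id_index2, 123, {"X": 1},
                        on_conflict=delete_and_reinsert, retry_whole_op=True)

print(today, proposed)  # 0 1
print(records2[2])      # {'_id': 123, 'X': 1}
```

With the flag set, the operation redoes the index lookup, finds the new RecordId, and applies the update. Without it, the toy model reproduces today's miss.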

Repro
A repro script is attached below. It can be run with the following resmoke invocation:

python3 buildscripts/resmoke.py run --installDir build/install/bin --suites=replica_sets fam-repro-replset.js

fam-repro-replset.js
3 kB
Feb 05 2024 11:33:37 PM UTC

is related to: SERVER-86250 Consider changing findAndModify behavior when concurrent operation changes the sort key (Open)

Assignee: Evan Bergeron
Reporter: Ian Boros
Participants: Evan Bergeron, Ian Boros
Votes: 0
Watchers: 18

Created: Feb 05 2024 11:32:47 PM UTC
Updated: Jan 07 2025 04:18:32 PM UTC
