[SERVER-47646] Scope::_lastVersion optimization breaks with concurrent readers Created: 17/Apr/20  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Concurrency, JavaScript
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: David Percy Assignee: Backlog - Service Architecture
Resolution: Unresolved Votes: 0
Labels: sa-remove-fv-backlog-22
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Assigned Teams:
Service Arch
Operating System: ALL
Steps To Reproduce:

Check out this branch: https://github.com/mongodb/mongo/compare/v4.4...dpercy:BF-16735-parallel-systemjs#diff-48f56cf0f9676df3fcbc02748a95e4e6

Compile the server and run ./repro.sh. The script reruns resmoke in a loop until the test fails. For me it usually takes fewer than 20 iterations to reproduce.

Participants:
Linked BF Score: 10

Description

The scenario is two clients running in parallel: client A inserts a function into system.js then tries to call it; client B just runs a $where. They shouldn't interfere with each other because client B doesn't write to system.js. But somehow client A fails with a ReferenceError: the function it inserted is not defined.

This bug was revealed by a new test in 4.4, but I was able to repro on 4.2, so it may go back several versions.

The cause is related to an optimization in Scope::loadStored. Instead of reloading the system.js procedures on every call, it compares the Scope's _loadedVersion against a global atomic counter, _lastVersion, and skips the reload when the two match. When I disable this optimization, the bug goes away.
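
To make the optimization concrete, here is a minimal single-threaded sketch in Python. It is not the server's C++ code: FakeServer, store_function and the reloads counter are hypothetical names for illustration, and a plain dict stands in for the system.js collection. Only the idea (compare a per-scope loaded version against a global counter and skip the reload when they match) comes from this ticket.

```python
class FakeServer:
    """Hypothetical stand-in for the server-wide state."""

    def __init__(self):
        self.system_js = {}     # plays the role of the system.js collection
        self.last_version = 0   # plays the role of the global _lastVersion counter

    def store_function(self, name, body):
        # Write the document first, then bump the counter (the order described in this ticket).
        self.system_js[name] = body
        self.last_version += 1


class Scope:
    def __init__(self, server):
        self.server = server
        self.functions = {}
        self.loaded_version = -1   # never matches the counter, so the first call loads
        self.reloads = 0           # illustration only: counts how often a reload happened

    def load_stored(self):
        current = self.server.last_version
        if self.loaded_version == current:
            return                 # nothing changed since the last load: skip the reload
        self.functions = dict(self.server.system_js)
        self.loaded_version = current
        self.reloads += 1
```

With one client this behaves as intended: two consecutive load_stored() calls reload only once, and a reload happens again only after store_function() bumps the counter.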

I added log statements and found this order of events:
1. Client A bumps _lastVersion.
2. Client B reads the new value of _lastVersion.
3. Client B updates its Scope instance by deleting all the JS functions in it. It sets _loadedVersion = _lastVersion.
4. Client A gets that same Scope instance out of the pool.  It doesn't reload anything because _loadedVersion == _lastVersion.
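
The four steps above can be replayed deterministically in a small Python sketch. All names here are hypothetical, and the fact that client B's reads do not include the new document is simply assumed as an input (docs_seen_by_b is empty). The sketch only shows why a pooled Scope that was "reloaded" from a stale view is never corrected afterwards.

```python
class Scope:
    """Hypothetical pooled scope: loaded functions plus the version they were loaded at."""

    def __init__(self):
        self.functions = {}
        self.loaded_version = 0

    def load_stored(self, visible_docs, last_version):
        if self.loaded_version == last_version:
            return False                      # optimization: skip the reload
        self.functions = dict(visible_docs)   # drops every function not in visible_docs
        self.loaded_version = last_version
        return True


def replay_observed_order():
    last_version = 0
    pool = [Scope()]   # one-element scope pool shared by both clients

    # Client A has inserted "myFunc". Client B's reads do not see it (assumed here).
    docs_seen_by_a = {"myFunc": "function() { return 1; }"}
    docs_seen_by_b = {}

    last_version += 1                  # 1. client A bumps _lastVersion
    version_read_by_b = last_version   # 2. client B reads the new value
    scope = pool.pop()
    scope.load_stored(docs_seen_by_b, version_read_by_b)   # 3. B reloads from its stale view
    pool.append(scope)

    scope = pool.pop()                 # 4. client A gets the same Scope from the pool
    reloaded = scope.load_stored(docs_seen_by_a, last_version)
    return reloaded, "myFunc" in scope.functions
```

In this model replay_observed_order() returns (False, False): client A's call skips the reload because the versions match, and the function it inserted is missing from the scope, which corresponds to the ReferenceError.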

Step 3 is the surprising part: client A inserts the document before bumping the counter, so if client B reads the new counter value then why doesn't it read the inserted document?

I think what's happening must be something like this:
1. Client B begins a WiredTiger transaction.
2. Client A inserts into system.js.
3. Client A writes _lastVersion.
4. Client B reads the new value of _lastVersion.
5. Client B reads zero documents from system.js, because client B's WiredTiger snapshot is from before the insert.
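
The suspected interleaving can be modelled with a toy snapshot store in Python. This is a sketch under assumptions, not WiredTiger's API: SnapshotStore, begin and insert are hypothetical names, and a snapshot is modelled as a plain copy of the committed data taken when the transaction begins. The counter is an ordinary variable outside the store, so no snapshot covers it.

```python
class SnapshotStore:
    """Toy multi-version store: a transaction reads the data as of the moment it began."""

    def __init__(self):
        self._committed = {}

    def begin(self):
        # Snapshot = copy of the committed state at begin time.
        return dict(self._committed)

    def insert(self, key, value):
        # Auto-committed write; snapshots taken earlier do not see it.
        self._committed[key] = value


def interleaving():
    store = SnapshotStore()
    last_version = 0   # lives outside the store, so it is not part of any snapshot

    snapshot_b = store.begin()                             # 1. client B begins its transaction
    store.insert("myFunc", "function() { return 1; }")     # 2. client A inserts into system.js
    last_version += 1                                      # 3. client A writes _lastVersion
    version_seen_by_b = last_version                       # 4. client B reads the new counter value
    docs_seen_by_b = dict(snapshot_b)                      # 5. client B reads system.js from its snapshot
    return version_seen_by_b, docs_seen_by_b
```

Here interleaving() returns (1, {}): client B observes the bumped counter together with an empty collection, a combination that never existed as a committed state.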

The two clients are communicating using two different kinds of state: a WT collection and a global atomic counter. So WT doesn't see all the dependencies between the two clients, and it thinks it can serialize client B before client A.

I think one solution would be to store the _lastVersion state in WT somehow. Then client B would never see an inconsistent state where _lastVersion is bumped but the collection is still empty.
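
As a rough illustration of that idea, the Python sketch below keeps the version inside the same toy snapshot store as the documents, so both are read from one snapshot. This is only one possible reading of "store the _lastVersion state in WT somehow"; the names are hypothetical and nothing here reflects an actual server change.

```python
class SnapshotStore:
    """Toy store in which the version is part of the snapshotted state."""

    def __init__(self):
        self._functions = {}
        self._version = 0

    def begin(self):
        # A snapshot captures the documents and the version together.
        return {"system.js": dict(self._functions), "version": self._version}

    def insert_function(self, name, body):
        # The document and the version change in the same commit.
        self._functions[name] = body
        self._version += 1


class Scope:
    def __init__(self):
        self.functions = {}
        self.loaded_version = -1

    def load_stored(self, snapshot):
        # Both values come from the same snapshot, so they are always consistent.
        if self.loaded_version == snapshot["version"]:
            return False
        self.functions = dict(snapshot["system.js"])
        self.loaded_version = snapshot["version"]
        return True
```

If client B loads from a snapshot taken before the insert, the scope records the old version, so client A's later call with a newer snapshot sees a version mismatch and reloads.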

