[SERVER-39337] MigrationSourceManager can hit an invariant if initial lock acquisition timed out Created: 01/Feb/19  Updated: 29/Oct/23  Resolved: 05/Mar/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 4.1.7
Fix Version/s: 4.1.9

Type: Bug Priority: Major - P3
Reporter: Randolph Tan Assignee: Blake Oler
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
is related to SERVER-39091 startClone can trigger invariant fail... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

1. Run fsyncLock on the shard that will be the donor.
2. Run moveChunk.
3. Wait until a changelog entry with what: "moveChunk.error" shows up in the log. This indicates that the lock acquisition timed out.
4. Run fsyncUnlock. This allows MSM::_cleanup to grab the collection lock, which then triggers the invariant.

Sprint: Sharding 2019-02-25, Sharding 2019-03-11
Participants:
Linked BF Score: 10

 Description   

MigrationSourceManager (MSM) modifies _state outside the collection lock but updates the decorator inside the lock. So, if the lock acquisition times out, _cleanup can run with _state != kCreated while the decorator is still nullptr, which trips the invariant.
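
The ordering problem can be illustrated with a small, self-contained C++ model. This is only a sketch under stated assumptions: RacyMsm, CollectionSlot, the lockAvailable flag, and the helper function are hypothetical names invented for illustration, not the actual server code, and the lock timeout is modelled as a thrown exception.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical, heavily simplified model; these are not the real MongoDB classes.
enum class State { kCreated, kCloning };

// Stands in for the per-collection slot where the decorator pointer lives.
struct CollectionSlot {
    const void* registeredMsm = nullptr;
};

class RacyMsm {
public:
    explicit RacyMsm(CollectionSlot& slot) : _slot(slot) {}

    // lockAvailable == false models the collection lock acquisition timing out.
    void startClone(bool lockAvailable) {
        assert(_state == State::kCreated);
        _state = State::kCloning;  // state advanced before the lock is held
        if (!lockAvailable) {
            throw std::runtime_error("lock acquisition timed out");
        }
        _slot.registeredMsm = this;  // decorator set only once the lock is held
    }

    // Returns whether the assumption that cleanup asserts on actually holds.
    bool cleanupAssumptionHolds() const {
        if (_state == State::kCreated) {
            return true;  // nothing was installed, nothing to check
        }
        return _slot.registeredMsm == this;
    }

private:
    CollectionSlot& _slot;
    State _state = State::kCreated;
};

// Runs startClone followed by the cleanup check and reports the result.
bool assumptionHoldsAfterStartClone(bool lockAvailable) {
    CollectionSlot slot;
    RacyMsm msm(slot);
    try {
        msm.startClone(lockAvailable);
    } catch (const std::runtime_error&) {
        // The migration failed; cleanup still runs afterwards.
    }
    return msm.cleanupAssumptionHolds();
}
```

In this model, assumptionHoldsAfterStartClone(false) returns false: the state has already left kCreated, but the decorator was never set, which is the combination the description says _cleanup does not expect.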



 Comments   
Comment by Githook User [ 05/Mar/19 ]

Author: Blake Oler (BlakeIsBlake) <blake.oler@mongodb.com>

Message: SERVER-39337 Assume variables installed for chunk cloning only after state transitions to kCloning
Branch: master
https://github.com/mongodb/mongo/commit/5e9df07fa5d0fdb0a26706473f580dbbde1e4baa

Comment by Randolph Tan [ 20/Feb/19 ]

lgtm

Comment by Blake Oler [ 20/Feb/19 ]

I've been looking at the code, and I'm not seeing any general problems with building in the following assumption:

If we are still in kCreated, it is possible, but not guaranteed, for the MSM to be on the CSR and for the _cloneDriver to exist. To guarantee this, we will only transition to kCloning after both variables have been set.

If we are still in kCreated on cleanup, we simply check if either variable exists – if so, do the already-existing cleanup task. This is fine because we will have the collection lock at the time of cleanup. If we're past kCreated, then we're good and can assume the variables exist regardless.

renctan lgty?
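
The assumption proposed in this comment could look roughly like the following C++ sketch. It is not the committed fix (see the linked commit for that). FixedMsm, CollectionSlot, the _cloneDriverExists flag, and the two helper functions are hypothetical stand-ins chosen for illustration, and the lock timeout is again modelled as a thrown exception.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical, heavily simplified model; these are not the real MongoDB classes.
enum class State { kCreated, kCloning };

// Stands in for the CSR slot on which the MSM registers itself.
struct CollectionSlot {
    const void* registeredMsm = nullptr;
};

class FixedMsm {
public:
    explicit FixedMsm(CollectionSlot& slot) : _slot(slot) {}

    // lockAvailable == false models the collection lock acquisition timing out.
    void startClone(bool lockAvailable) {
        assert(_state == State::kCreated);
        if (!lockAvailable) {
            throw std::runtime_error("lock acquisition timed out");
        }
        // With the (modelled) lock held, install both variables first...
        _slot.registeredMsm = this;
        _cloneDriverExists = true;
        // ...and only then transition to kCloning.
        _state = State::kCloning;
    }

    // Returns true if there was anything to clean up.
    bool cleanup() {
        if (_state != State::kCreated) {
            // Past kCreated, both variables are assumed to exist.
            assert(_slot.registeredMsm == this && _cloneDriverExists);
        }
        bool didWork = false;
        // In kCreated the variables may or may not exist, so check each one.
        if (_slot.registeredMsm == this) {
            _slot.registeredMsm = nullptr;
            didWork = true;
        }
        if (_cloneDriverExists) {
            _cloneDriverExists = false;
            didWork = true;
        }
        return didWork;
    }

private:
    CollectionSlot& _slot;
    State _state = State::kCreated;
    bool _cloneDriverExists = false;  // stands in for the _cloneDriver pointer
};

// Lock acquisition times out, then cleanup runs.
bool cleanupAfterTimeout() {
    CollectionSlot slot;
    FixedMsm msm(slot);
    try {
        msm.startClone(false);
    } catch (const std::runtime_error&) {
        // The state is still kCreated here.
    }
    return msm.cleanup();
}

// startClone succeeds, then cleanup runs.
bool cleanupAfterSuccess() {
    CollectionSlot slot;
    FixedMsm msm(slot);
    msm.startClone(true);
    return msm.cleanup();
}
```

In this model a timed-out lock acquisition leaves the state at kCreated, so cleanup finds nothing installed and simply returns, while a successful startClone guarantees both variables exist by the time the state reads kCloning.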

 

Generated at Thu Feb 08 04:51:44 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.