[SERVER-53846] Clarify locking requirements in createCollectionForApplyOps Created: 15/Jan/21 Updated: 29/Oct/23 Resolved: 26/Jul/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 7.1.0-rc0 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Lingzhi Deng | Assignee: | Gregory Noma |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||
| Assigned Teams: |
Storage Execution
|
||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||
| Sprint: | Execution NAMR Team 2023-08-07 | ||||||||||||||||||||
| Participants: | |||||||||||||||||||||
| Description |
|
I was looking at how we make createCollection idempotent for oplog application. In the case there is a conflicting collection with the same namespace, we would rename the existing collection (newCollName) to a temporary collection (tmpName). But renameCollection requires a MODE_X lock on both the fromNss and the toNss. But based on what I saw in createCollectionForApplyOps, I don't see how we could hold a MODE_X lock on newCollName . Same for this rename. I don't see necessary locks being acquired in this function. I initially suspected this was a bug. But actually we won't hit the locking invariants in renameCollection because because isCollectionLockedForMode always returns true when !shouldConflictWithSecondaryBatchApplication(). And currently, the oplogApplier always sets shouldConflictWithSecondaryBatchApplication to false. If this is the assumption we need for calling createCollectionForApplyOps, we should add an invariant. The locking invariants in renameCollection were hit during tenant migrations. Tenant migration is similar to initial sync but it applies oplog entries on the primary node. So we don't set shouldConflictWithSecondaryBatchApplication to false like we do for normal replication. To work around the invariants mentioned above and unblock tenant migration, we could make tenant migration set shouldConflictWithSecondaryBatchApplication to false. But I think createCollectionForApplyOps should probably acquire the necessary locks on its own. So this is a ticket for that investigation and I am not sure if this would be a bug for applyOps (maybe not because applyOps takes global X lock for commands?). |
| Comments |
| Comment by Githook User [ 26/Jul/23 ] |
|
Author: {'name': 'Gregory Noma', 'email': 'gregory.noma@gmail.com', 'username': 'gregorynoma'}Message: |
| Comment by Lingzhi Deng [ 08/Feb/21 ] |
|
Just one more thing, we used to acquire proper locks in the function until |
| Comment by Dianna Hohensee (Inactive) [ 08/Feb/21 ] |
|
Yes, I definitely agree that the behavior is concerning, particularly the ease with which issues could arise from not realizing new code is unintentionally running without locks thanks to the bypass. I'm going to close this investigation as complete. This will be kept in mind for the future. Thanks for looking into it! |
| Comment by Lingzhi Deng [ 05/Feb/21 ] |
|
For the time being, I set shouldConflictWithSecondaryBatchApplication to false for tenant migration to work around this so it is not too urgent at the moment. But this was very confusing and I am afraid there might be other code path that could hit this. But I am ok with deferring this to later when we have the bandwidth. |
| Comment by Lingzhi Deng [ 05/Feb/21 ] |
|
What do you mean by "main applier thread runs X lock operations serially"? |
| Comment by Dianna Hohensee (Inactive) [ 05/Feb/21 ] |
|
Apparently, for secondary oplog application, the writer threads expect the main applier thread to hold locks for them. The main applier thread runs X lock operations serially, so no other writers are active in the system. Therefore, isCollectionLockedForMode returns true if shouldConflictWithSecondaryBatchApplication==false as a signal that is what is happening. lingzhi.deng, does tenant migration put the server into a similar mode such that concurrent writes are not a concern? Otherwise, not holding the appropriate locks on some thread, sounds like it could bee an problem. |