[SERVER-36508] _getNextSessionMods command should not hold locks on migration collection while querying the oplog Created: 07/Aug/18  Updated: 27/Aug/18  Resolved: 27/Aug/18

Status: Closed
Project: Core Server
Component/s: Replication, Sharding
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Spencer Brody (Inactive) Assignee: Randolph Tan
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-35367 Hold locks in fewer callers of waitFo... Closed
Sprint: Sharding 2018-08-13, Sharding 2018-08-27
Participants:

 Description   

In order to fix SERVER-35367, we need queries on the oplog to yield their locks while calling waitForAllEarlierOplogWritesToBeVisible(). Yielding locks doesn't work if there are nested acquisitions of the Global lock. Because _getNextSessionMods takes a lock on the collection being migrated and holds that lock while querying the oplog, the locks taken for the oplog query become a nested acquisition of the Global lock. This prevents the lock yielding, so waitForAllEarlierOplogWritesToBeVisible() ends up being called while locks are held.

The lock on the collection being migrated is only required to determine the starting oplog entry for walking back the oplog chain for the transaction. Once that has been decided, there's no intrinsic reason to keep holding the collection lock while querying the oplog. The problem is that the logic for walking back the oplog chain lives in the MigrationChunkClonerSource, whose lifetime is guarded by the collection lock on the collection being migrated, which makes this difficult to work around without substantial code changes.
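
To illustrate the nesting problem described above, here is a minimal, self-contained C++ sketch. It is a toy model only and does not use the real server classes: ToyLocker, ToyScopedLock, toyOplogQueryCanYield and the two query* functions are hypothetical names invented for this example, and the rule "a yield is only possible when the Global lock is held exactly once" is a simplification of the behavior stated in this ticket.

```cpp
#include <cassert>

// Hypothetical stand-in for per-operation lock state. This is NOT the
// server's real Locker; it only counts acquisitions of the Global lock.
class ToyLocker {
public:
    void lockGlobal() { ++globalDepth_; }
    void unlockGlobal() {
        assert(globalDepth_ > 0);
        --globalDepth_;
    }
    int globalDepth() const { return globalDepth_; }

    // Assumption taken from the ticket text: yielding does not work when
    // the Global lock has been acquired in a nested fashion.
    bool canYield() const { return globalDepth_ == 1; }

private:
    int globalDepth_ = 0;
};

// Hypothetical RAII guard: every collection-level or oplog-level lock in
// this toy model implies one acquisition of the Global lock.
class ToyScopedLock {
public:
    explicit ToyScopedLock(ToyLocker& locker) : locker_(locker) {
        locker_.lockGlobal();
    }
    ~ToyScopedLock() { locker_.unlockGlobal(); }
    ToyScopedLock(const ToyScopedLock&) = delete;
    ToyScopedLock& operator=(const ToyScopedLock&) = delete;

private:
    ToyLocker& locker_;
};

// Simulated oplog query: takes its own lock, then reports whether it could
// yield that lock before waiting for earlier oplog writes to become visible.
bool toyOplogQueryCanYield(ToyLocker& locker) {
    ToyScopedLock oplogLock(locker);
    return locker.canYield();
}

// Pattern described in the ticket: the collection lock is still held while
// the oplog is queried, so the oplog lock is a nested Global acquisition.
bool queryWhileHoldingCollectionLock(ToyLocker& locker) {
    ToyScopedLock collectionLock(locker);
    return toyOplogQueryCanYield(locker);
}

// Alternative ordering: hold the collection lock only long enough to pick
// the starting oplog entry, release it, and only then query the oplog.
bool queryAfterReleasingCollectionLock(ToyLocker& locker) {
    {
        ToyScopedLock collectionLock(locker);
        // ... decide the starting oplog entry here (omitted in this toy) ...
    }
    return toyOplogQueryCanYield(locker);
}
```

In this model, queryWhileHoldingCollectionLock reports that the oplog query cannot yield, while queryAfterReleasingCollectionLock reports that it can. The sketch deliberately ignores the lifetime issue with MigrationChunkClonerSource noted above, which is what makes the second ordering hard to apply in the actual code.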



 Comments   
Comment by Spencer Brody (Inactive) [ 24/Aug/18 ]

We fixed the deadlock that SERVER-35367 was trying to fix by doing SERVER-36534 instead, so this work isn't blocking anything or required for anything at the moment. This can probably be closed "Won't Fix".
