[SERVER-28427] GlobalLock with timeout can still block indefinitely Created: 22/Mar/17  Updated: 27/Aug/18  Resolved: 01/May/17

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: 3.4.2, 3.5.4
Fix Version/s: 3.4.5, 3.5.7

Type: Bug Priority: Critical - P2
Reporter: Judah Schvimer Assignee: Geert Bosch
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Duplicate
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v3.4
Sprint: Storage 2017-04-17, Storage 2017-05-08
Participants:
Case:

 Description   

There is a potential deadlock between the step down command and the noop writer. The step down command takes the global exclusive lock in S mode and then blocks on destroying the noop writer.

The noop writer takes the global exclusive lock in IX mode when it does writes. The destructor calls join, which won't return until the noop writer finishes its write.

To fix this we can:
1. Stop the noop writer's write in killAllUserOperations before we try to shut it down.
2. Stop the noop writer before we take the global lock, and restart it if we fail to step down.
3. Mark the operation context as killed in the noop writer destructor so that it stops trying to take the lock.
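
As a rough illustration of option 3, the sketch below uses a worker thread that retries a timed lock acquisition and checks a kill flag between attempts, so its destructor can join even while another thread holds the lock. All names here (NoopWriterSketch, the std::timed_mutex standing in for the global lock) are hypothetical and do not come from the MongoDB source; a plain mutex also ignores the S/IX mode distinction.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the noop writer; not MongoDB's actual class.
class NoopWriterSketch {
public:
    explicit NoopWriterSketch(std::timed_mutex& globalLock)
        : _globalLock(globalLock), _thread([this] { run(); }) {}

    ~NoopWriterSketch() {
        assert(_thread.joinable());
        // Mark the operation as killed first, so the worker stops retrying the lock...
        _killed.store(true);
        // ...and join() can return even if the lock is never released to us.
        _thread.join();
    }

    int writes() const { return _writes.load(); }

private:
    void run() {
        while (!_killed.load()) {
            // Bounded wait: give up after 1ms and re-check the kill flag.
            if (_globalLock.try_lock_for(std::chrono::milliseconds(1))) {
                ++_writes;  // stand-in for the noop write
                _globalLock.unlock();
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }
        }
    }

    std::timed_mutex& _globalLock;
    std::atomic<bool> _killed{false};
    std::atomic<int> _writes{0};
    std::thread _thread;  // declared last so it starts after the other members exist
};
```

If the worker instead blocked in an untimed lock call, the destructor would hang for as long as the caller held the lock, which is the deadlock described above.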



 Comments   
Comment by Githook User [ 06/May/17 ]

Author: Geert Bosch (GeertBosch, geert@mongodb.com)

Message: SERVER-28427 Implement timeouts for the TicketHolder

(cherry picked from commit 498df9ab853bb03514b8803b9b1f6c2b6900b533)

Conflicts:
src/mongo/db/concurrency/SConscript
src/mongo/db/concurrency/d_concurrency.cpp
src/mongo/db/concurrency/d_concurrency.h
src/mongo/db/concurrency/d_concurrency_test.cpp
src/mongo/db/repl/replication_coordinator_impl.cpp
Branch: v3.4
https://github.com/mongodb/mongo/commit/aa8ab6611d27a6a4b014d82a37eb658760fa7425

Comment by Githook User [ 01/May/17 ]

Author: Geert Bosch (GeertBosch, geert@mongodb.com)

Message: SERVER-28427 Implement timeouts for the TicketHolder
Branch: master
https://github.com/mongodb/mongo/commit/498df9ab853bb03514b8803b9b1f6c2b6900b533

Comment by Andy Schwerin [ 22/Mar/17 ]

I don't believe that the bug here is in the noop writer and stepdown. Rather, the problem is that the noop writer asks to wait up to 1ms to acquire the IX lock, but LockerImpl::lockGlobalBegin is not told that timeout, and its call to waitForTicket does not receive a timeout, either. The offending wait is here.
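
To make the point about the missing timeout concrete, here is a minimal sketch of a ticket holder whose wait accepts a timeout, and a caller that forwards it. The class and function names (SketchTicketHolder, waitForTicketFor, lockGlobalBeginSketch) are assumptions made for illustration; this is not the actual TicketHolder or LockerImpl code.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical counting ticket holder; not MongoDB's actual TicketHolder.
class SketchTicketHolder {
public:
    explicit SketchTicketHolder(int tickets) : _available(tickets) {
        assert(tickets >= 0);
    }

    // Untimed wait: blocks for as long as no ticket is free.
    void waitForTicket() {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait(lk, [this] { return _available > 0; });
        --_available;
    }

    // Timed wait: returns false if no ticket became free within `timeout`.
    bool waitForTicketFor(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lk(_mutex);
        if (!_cv.wait_for(lk, timeout, [this] { return _available > 0; })) {
            return false;
        }
        --_available;
        return true;
    }

    void release() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            ++_available;
        }
        _cv.notify_one();
    }

    int available() const {
        std::lock_guard<std::mutex> lk(_mutex);
        return _available;
    }

private:
    mutable std::mutex _mutex;
    std::condition_variable _cv;
    int _available;
};

// Hypothetical caller: forwards the consumer's timeout instead of dropping it.
bool lockGlobalBeginSketch(SketchTicketHolder& holder, std::chrono::milliseconds timeout) {
    return holder.waitForTicketFor(timeout);
}
```

With the timeout forwarded, a caller that asks to wait up to 1ms gets a failure back after roughly that long when no ticket is free, rather than blocking indefinitely.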

Prior to SERVER-22011, lockGlobalBegin could not block, and given how it is used in other scenarios, I'm not sure it should be legal for it to block. It certainly shouldn't be legal for it to block indefinitely when the consumer has attempted to supply a timeout.
