[SERVER-27048] Fix recursive lock issue leading to deadlock or crash in LegacySession Created: 15/Nov/16  Updated: 19/Nov/16  Resolved: 16/Nov/16

Status: Closed
Project: Core Server
Component/s: Networking
Affects Version/s: None
Fix Version/s: 3.4.0-rc4, 3.5.1

Type: Bug Priority: Blocker - P1
Reporter: ADAM Martin (Inactive) Assignee: ADAM Martin (Inactive)
Resolution: Done Votes: 0
Labels: platforms-hocr
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Steps To Reproduce:

Race conditions are hard to reproduce.

Sprint: Platforms 2016-11-21
Participants:
Linked BF Score: 0

 Description   

The `TransportLayerLegacy::endAllSessions` function takes a lock `_sessionsMutex`. This lock is also taken by the `TransportLayerLegacy::_destroy` method, which is called indirectly by the `TransportLayerLegacy::LegacySession::~LegacySession()` destructor.

Within `endAllSessions`, the `_sessions` list of weak pointers is one-by-one promoted by taking a shared pointer to it, then processed, then the shared pointer is discarded.

This leads to a pair of difficult to reproduce race conditions on `endAllSessions`:

1: The failing case in BF-4102, where the shared pointer is created, making the ref-count of its object at least 2. Other threads dispose of their shared pointers to this object, leaving only the shared pointer which was promoted from the weak pointer behind. That shared pointer will go out of scope at the end of the loop iteration processing it, thus invoking the destructor. That destructor will indirectly call `TransportLayerLegacy::_destroy`, which will attempt to take the lock. Recursively taking a lock in C++'s mutex class is undefined behavior. Typical implementations will either deadlock, or throw `std::system_error` (as encouraged by the standard, but this is non-normative behavior.) Thus, the attempt to take the lock in the `_destroy` function will throw an exception, which is the precise observed behavior in BF-4102.

2. An entry in the weak pointer list has expired, due to the last true pointers to it being destroyed. Promoting the weak pointer will fail, giving a nullptr value for the shared_ptr. In this case, the code skips over empty promoted pointers, thus having no failing actions.



 Comments   
Comment by Githook User [ 16/Nov/16 ]

Author:

{u'username': u'adamlsd', u'name': u'ADAM David Alan Martin', u'email': u'adam.martin@10gen.com'}

Message: SERVER-27048 Fix recursive lock issue in transport.

The LegacySession teardown code has a race where promoted weak
pointers would be the last owners of a type which needs to hold
a lock in destruction. That same lock is held by the LegacySession
teardown code, thus leading to a deadlock or detectable recursive
locking situation.

(cherry picked from commit 2fae4242b9e8256da203639895d1ecd3fe8e2794)
Branch: v3.4
https://github.com/mongodb/mongo/commit/b7472174d000fa00db7827aeadb9fe17738ecfd9

Comment by Githook User [ 16/Nov/16 ]

Author:

{u'username': u'adamlsd', u'name': u'ADAM David Alan Martin', u'email': u'adam.martin@10gen.com'}

Message: SERVER-27048 Fix recursive lock issue in transport.

The LegacySession teardown code has a race where promoted weak
pointers would be the last owners of a type which needs to hold
a lock in destruction. That same lock is held by the LegacySession
teardown code, thus leading to a deadlock or detectable recursive
locking situation.
Branch: master
https://github.com/mongodb/mongo/commit/2fae4242b9e8256da203639895d1ecd3fe8e2794

Generated at Thu Feb 08 04:14:01 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.