[SERVER-33660] Once getMores include lsid, sharded aggregations with $mergeCursors can hang
Created: 05/Mar/18  Updated: 29/Oct/23  Resolved: 06/Mar/18

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: 3.7.3

Type: Bug
Priority: Major - P3
Reporter: Charlie Swanson
Assignee: Charlie Swanson
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-24978 Second batches in aggregation framewo... Closed
Related
related to SERVER-33683 Allow aggregation $mergeCursors stage... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Query 2018-03-12
Participants:

 Description   

A deadlock is induced on the SessionCatalog:

  1. The operation performing the merging half of the pipeline checks out the Session for that lsid.
  2. That operation includes a $mergeCursors, which performs getMores on remote hosts, one of which is the same host performing the $mergeCursors.
  3. Once the getMore includes the lsid, the getMore sent to the local host will attempt to check out the same Session, blocking on a mutex in the SessionCatalog while the merging operation still holds that Session.
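
The three steps above can be reproduced in miniature. The following is only a toy sketch in Python, not the server's C++ code: the `SessionCatalog` class, `check_out`/`check_in`, `local_get_more`, and `merging_aggregate` are hypothetical stand-ins, and a timeout is used so the example reports the blocked checkout instead of hanging forever.

```python
import threading


class SessionCatalog:
    """Toy stand-in for the SessionCatalog: one exclusive checkout per lsid."""

    def __init__(self):
        self._guard = threading.Lock()
        self._sessions = {}

    def _lock_for(self, lsid):
        with self._guard:
            return self._sessions.setdefault(lsid, threading.Lock())

    def check_out(self, lsid, timeout):
        # Returns True if the session was checked out within `timeout` seconds.
        return self._lock_for(lsid).acquire(timeout=timeout)

    def check_in(self, lsid):
        self._lock_for(lsid).release()


def local_get_more(catalog, lsid, timeout):
    """Step 3: the getMore carries the lsid, so it tries to check out the session."""
    got = catalog.check_out(lsid, timeout)
    if got:
        catalog.check_in(lsid)
    return got


def merging_aggregate(catalog, lsid, timeout=0.2):
    """Steps 1 and 2: check out the session, then issue a getMore to this same host."""
    if not catalog.check_out(lsid, timeout):
        raise RuntimeError("could not check out session")
    try:
        result = {}

        def run_get_more():
            result["ok"] = local_get_more(catalog, lsid, timeout)

        worker = threading.Thread(target=run_get_more)
        worker.start()
        worker.join()  # the merger waits on the getMore while still holding the session
        return result["ok"]
    finally:
        catalog.check_in(lsid)
```

Calling `merging_aggregate(SessionCatalog(), "lsid-1")` returns `False` here: the getMore never obtains the session because the merger holds it for the whole call. Without the timeout, that wait would never end, which is the hang described in this ticket.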

As a short term fix, we should do the following:

  1. Only check out the Session if the operation includes a transaction number.
  2. Ban aggregations with a transaction number on mongos.
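
As a rough illustration of these two rules, here is a sketch in Python that treats commands as plain dictionaries. The function names, the exception type, and the error message are assumptions made for the example; only the field names `aggregate`, `getMore`, `lsid`, and `txnNumber` come from the ticket and the command format.

```python
def should_check_out_session(command):
    """Rule 1: only operations that carry a txnNumber check out the Session."""
    return "txnNumber" in command


def validate_on_mongos(command):
    """Rule 2: mongos rejects an aggregate that carries a txnNumber."""
    if "aggregate" in command and "txnNumber" in command:
        # Hypothetical error; the real server's error code and text may differ.
        raise ValueError("aggregate with txnNumber is not allowed on mongos")


# A getMore with only an lsid no longer touches the Session,
# so it cannot block behind the merging operation.
get_more = {"getMore": 123, "collection": "coll", "lsid": {"id": "abc"}}
needs_session = should_check_out_session(get_more)
```

Under rule 1 the getMore issued by $mergeCursors skips the checkout, and rule 2 keeps the merging aggregation itself from ever holding the Session on behalf of a transaction number.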

As a long term fix, we will investigate not using getMores over the network for what are really local reads. If that proves difficult, we will have to re-evaluate.



 Comments   
Comment by Charlie Swanson [ 06/Mar/18 ]

A workaround has been pushed here to prevent the hang. Future work to improve the system to handle aggregations within a transaction will be tracked by SERVER-33683.

Note for drivers: we have banned attaching a "txnNumber" to an aggregation against mongos. We don't plan for this ban to be permanent, and in fact hope to remove it before 4.0. We want to make sure that drivers are not sending an aggregate with a "txnNumber" just in case we don't get a chance to do the work necessary to remove the restriction. We believe drivers will only attach this for retryable writes, but let us know if this is not the case.

Comment by Githook User [ 06/Mar/18 ]

Author: Charlie Swanson (cswanson310) <charlie.swanson@mongodb.com>

Message: SERVER-33660 Only check out session for operations with a txnNumber
Branch: master
https://github.com/mongodb/mongo/commit/c3d7752c263d532315671b862695d4316c877ae3

Comment by James Wahlin [ 05/Mar/18 ]

A session yielding mechanism is definitely worth discussion. I have some concern that it would make session state management more difficult. Also, we would need to hold locks during yielding for multi-statement transactions and, at present, for snapshot reads as well.

Comment by Kaloian Manassiev [ 05/Mar/18 ]

While the idea of making local getMores not go over the network is definitely something that should be considered, I think there is a more fundamental problem here with session state. Does it make sense to introduce a "session yielding" mechanism, where operations that have reached a state where they need to perform remote operations can check the session back in (provided perhaps that they aren't holding any locks), then perform the remote operation, assuming they might have to come back?

james.wahlin?
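
To make the "session yielding" idea concrete, here is a minimal Python sketch. It is purely hypothetical: `CheckedOutSession` and `yielded` are invented names, a plain lock stands in for the session checkout, and the sketch ignores the concerns raised in the reply (locks held during the yield, session state changing while yielded).

```python
import threading
from contextlib import contextmanager


class CheckedOutSession:
    """Holds an exclusive session checkout, modeled here as a plain lock."""

    def __init__(self, session_lock):
        self._lock = session_lock
        self._lock.acquire()  # check the session out

    @contextmanager
    def yielded(self):
        """Check the session in around a remote call, then check it out again."""
        self._lock.release()
        try:
            yield
        finally:
            # Coming back: this may wait if another operation took the session.
            self._lock.acquire()

    def check_in(self):
        self._lock.release()
```

An operation would wrap its remote getMore in `with session.yielded():`, so a getMore that lands on the same host can check out the session in the meantime.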
