[SERVER-33660] Once getMores include lsid, sharded aggregations with $mergeCursors can hang Created: 05/Mar/18 Updated: 29/Oct/23 Resolved: 06/Mar/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Aggregation Framework |
| Affects Version/s: | None |
| Fix Version/s: | 3.7.3 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Charlie Swanson | Assignee: | Charlie Swanson |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Query 2018-03-12 |
| Participants: | |
| Description |
|
A deadlock is induced on the SessionCatalog:
As a short-term fix, we should do the following:
As a long-term fix, we will investigate not using getMores over the network for what are really local reads. If that proves difficult, we will have to re-evaluate. |
| Comments |
| Comment by Charlie Swanson [ 06/Mar/18 ] |
|
A workaround has been pushed here to prevent the hang. Future work to improve the system to handle aggregations within a transaction will be tracked by
Note for drivers: we have banned attaching a "txnNumber" to an aggregation against mongos. We don't plan for this ban to be permanent, and in fact hope to remove it before 4.0. We want to make sure that drivers are not sending an aggregate with a "txnNumber", just in case we don't get a chance to do the work necessary to remove the restriction. We believe drivers will only attach this for retryable writes, but let us know if this is not the case. |
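To make the note for drivers concrete, here is a minimal sketch of a driver-side guard that never attaches a "txnNumber" to an aggregate. The helper name `attach_txn_number` and the plain-dict command representation are assumptions made for illustration; they are not taken from any driver's API.

```python
def attach_txn_number(command: dict, txn_number: int) -> dict:
    """Return a copy of `command` with a txnNumber, except for aggregates.

    Hypothetical helper: assumes the command is a dict whose first key is
    the command name, which is the usual convention for command documents.
    """
    command_name = next(iter(command))
    if command_name == "aggregate":
        # Aggregations against mongos must not carry a txnNumber.
        return dict(command)
    return {**command, "txnNumber": txn_number}
```

With this guard, an `insert` command gains the field while an `aggregate` command is passed through unchanged.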
| Comment by Githook User [ 06/Mar/18 ] |
|
Author: {'email': 'charlie.swanson@mongodb.com', 'name': 'Charlie Swanson', 'username': 'cswanson310'} Message: |
| Comment by James Wahlin [ 05/Mar/18 ] |
|
A session yielding mechanism is definitely worth discussion. I have some concern that it would make session state management more difficult. Also, we would need to hold locks during yielding for multi-statement transactions and, at present, for snapshot reads as well. |
| Comment by Kaloian Manassiev [ 05/Mar/18 ] |
|
While the idea of making local {{getMore}}s not go over the network is definitely something that should be considered, I think there is a more fundamental problem here with session state. Does it make sense to introduce a "session yielding" mechanism, where operations that have reached a state where they need to perform remote operations can check the session back in (provided, perhaps, that they aren't holding any locks), then perform the remote operation, assuming it might have to come back? |
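The "session yielding" idea can be illustrated with a toy model. This is only a sketch under stated assumptions, not the server's implementation. `ToySessionCatalog`, `remote_get_more`, and `run_aggregation` are hypothetical names, and the catalog is assumed to allow one checkout per session id at a time. The simulated getMore carries the same lsid as the operation that issues it, and a timeout stands in for the hang.

```python
import threading
from contextlib import contextmanager


class ToySessionCatalog:
    """Hypothetical catalog: each session id can be checked out once at a time."""

    def __init__(self):
        self._cond = threading.Condition()
        self._checked_out = set()

    def check_out(self, lsid, timeout=None):
        """Block until the session is free; return False if the wait times out."""
        with self._cond:
            free = self._cond.wait_for(
                lambda: lsid not in self._checked_out, timeout
            )
            if not free:
                return False
            self._checked_out.add(lsid)
            return True

    def check_in(self, lsid):
        with self._cond:
            self._checked_out.discard(lsid)
            self._cond.notify_all()

    @contextmanager
    def yielded(self, lsid):
        """Check the session back in around a remote call, then re-acquire it."""
        self.check_in(lsid)
        try:
            yield
        finally:
            self.check_out(lsid)


def remote_get_more(catalog, lsid, timeout=0.2):
    """Simulate a getMore with the same lsid, served on another thread."""
    served = []

    def handler():
        got = catalog.check_out(lsid, timeout)
        if got:
            catalog.check_in(lsid)
        served.append(got)

    worker = threading.Thread(target=handler)
    worker.start()
    worker.join()
    return served[0]


def run_aggregation(catalog, lsid, use_yield):
    """Hold the session, issue the getMore, and report whether it was served."""
    catalog.check_out(lsid)
    try:
        if use_yield:
            with catalog.yielded(lsid):
                return remote_get_more(catalog, lsid)
        return remote_get_more(catalog, lsid)
    finally:
        catalog.check_in(lsid)
```

In this model, `run_aggregation(catalog, lsid, use_yield=False)` returns `False` because the getMore cannot check out a session its caller still holds, while `use_yield=True` lets it proceed. The sketch ignores locks entirely, so it does not address the concern about holding locks while yielding.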