[SERVER-53723] The getMore command should not take a nested AutoGetCollectionForReadCommand instance for agg cursors when waitAfterPinningCursorBeforeGetMoreBatch is enabled Created: 12/Jan/21  Updated: 29/Oct/23  Resolved: 02/Feb/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Bug Priority: Major - P3
Reporter: Dianna Hohensee (Inactive) Assignee: David Storch
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-53732 Add invariant that AutoGetCollectionF... Closed
Related
related to SERVER-53701 GetMore cmd's CommonMongodProcessInte... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Query 2021-02-08
Participants:
Linked BF Score: 25

 Description   

CommonMongodProcessInterface::getIndexStats() instantiates an AutoGetCollectionForReadCommand instance. That code can be called by the getMore command per this stack trace in the linked BF.

.........
"Frame","attr":{"frame":{"a":"7FA593352F9C","b":"7FA593347000","o":"BF9C","s":"_ZN5mongo24AutoGetCollectionForReadC2EPNS_16OperationContextERKNS_21NamespaceStringOrUUIDENS_25AutoGetCollectionViewModeENS_6Date_tE","s+":"3C"}}[js_test:kill_pinned_cursor] 2021-01-08T14:59:41.264+0000 d20521| 2021-01-08T14:59:41.262+00:00 I  CONTROL  31445   [conn53] 
"Frame","attr":{"frame":{"a":"7FA593357962","b":"7FA593347000","o":"10962","s":"_ZN5mongo35AutoGetCollectionForReadCommandBaseINS_24AutoGetCollectionForReadEEC2EPNS_16OperationContextERKNS_21NamespaceStringOrUUIDENS_25AutoGetCollectionViewModeENS_6Date_tENS_16AutoStatsTracker7LogModeE","s+":"32"}}[js_test:kill_pinned_cursor] 2021-01-08T14:59:41.265+0000 d20521| 2021-01-08T14:59:41.262+00:00 I  CONTROL  31445   [conn53] 
"Frame","attr":{"frame":{"a":"7FA595B5FA0E","b":"7FA595B41000","o":"1EA0E","s":"_ZN5mongo28CommonMongodProcessInterface13getIndexStatsEPNS_16OperationContextERKNS_15NamespaceStringENS_10StringDataEb","s+":"15E"}}[js_test:kill_pinned_cursor] 2021-01-08T14:59:41.265+0000 d20521| 2021-01-08T14:59:41.262+00:00 I  CONTROL  31445   [conn53] 
"Frame","attr":{"frame":{"a":"7FA592F52B7D","b":"7FA592E2B000","o":"127B7D","s":"_ZN5mongo24DocumentSourceIndexStats9doGetNextEv","s+":"ED"}}[js_test:kill_pinned_cursor] 2021-01-08T14:59:41.265+0000 d20521| 2021-01-08T14:59:41.262+00:00 I  CONTROL  31445   [conn53] 
"Frame","attr":{"frame":{"a":"7FA592F011E4","b":"7FA592E2B000","o":"D61E4","s":"_ZN5mongo14DocumentSource7getNextEv","s+":"44"}}[js_test:kill_pinned_cursor] 2021-01-08T14:59:41.265+0000 d20521| 2021-01-08T14:59:41.262+00:00 I  CONTROL  31445   [conn53] 
"Frame","attr":{"frame":{"a":"7FA592F4BE03","b":"7FA592E2B000","o":"120E03","s":"_ZN5mongo19DocumentSourceGroup10initializeEv","s+":"63"}}[js_test:kill_pinned_cursor] 2021-01-08T14:59:41.265+0000 d20521| 2021-01-08T14:59:41.262+00:00 I  CONTROL  31445   [conn53] 
"Frame","attr":{"frame":{"a":"7FA592F4D0A9","b":"7FA592E2B000","o":"1220A9","s":"_ZN5mongo19DocumentSourceGroup9doGetNextEv","s+":"A9"}}[js_test:kill_pinned_cursor] 2021-01-08T14:59:41.265+0000 d20521| 2021-01-08T14:59:41.265+00:00 I  CONTROL  31445   [conn53] 
"Frame","attr":{"frame":{"a":"7FA592F011E4","b":"7FA592E2B000","o":"D61E4","s":"_ZN5mongo14DocumentSource7getNextEv","s+":"44"}}[js_test:kill_pinned_cursor] 2021-01-08T14:59:41.265+0000 d20521| 2021-01-08T14:59:41.265+00:00 I  CONTROL  31445   [conn53] 
"Frame","attr":{"frame":{"a":"7FA592FBDDEC","b":"7FA592E2B000","o":"192DEC","s":"_ZN5mongo8Pipeline7getNextEv","s+":"3C"}}[js_test:kill_pinned_cursor] 2021-01-08T14:59:41.266+0000 d20521| 2021-01-08T14:59:41.265+00:00 I  CONTROL  31445   [conn53] 
"Frame","attr":{"frame":{"a":"7FA59416CCBB","b":"7FA594013000","o":"159CBB","s":"_ZN5mongo20PlanExecutorPipeline8_getNextEv","s+":"2B"}}[js_test:kill_pinned_cursor] 2021-01-08T14:59:41.266+0000 d20521| 2021-01-08T14:59:41.266+00:00 I  CONTROL  31445   [conn53] 
"Frame","attr":{"frame":{"a":"7FA59416CD8F","b":"7FA594013000","o":"159D8F","s":"_ZN5mongo20PlanExecutorPipeline15getNextDocumentEPNS_8DocumentEPNS_8RecordIdE","s+":"4F"}}[js_test:kill_pinned_cursor] 2021-01-08T14:59:41.266+0000 d20521| 2021-01-08T14:59:41.266+00:00 I  CONTROL  31445   [conn53] 
"Frame","attr":{"frame":{"a":"7FA59416CF0E","b":"7FA594013000","o":"159F0E","s":"_ZN5mongo20PlanExecutorPipeline7getNextEPNS_7BSONObjEPNS_8RecordIdE","s+":"EE"}}[js_test:kill_pinned_cursor] 2021-01-08T14:59:41.266+0000 d20521| 2021-01-08T14:59:41.266+00:00 I  CONTROL  31445   [conn53] 
"Frame","attr":{"frame":{"a":"7FA58E1EF66B","b":"7FA58E139000","o":"B666B","s":"_ZN5mongo12_GLOBAL__N_110GetMoreCmd10Invocation28acquireLocksAndIterateCursorEPNS_16OperationContextEPNS_3rpc21ReplyBuilderInterfaceEPNS_13CursorManagerERNS_15ClientCursorPinEPNS_5CurOpE.isra.1022","s+":"105B"}}[js_test:kill_pinned_cursor] 2021-01-08T14:59:41.273+0000 d20521| 2021-01-08T14:59:41.273+00:00 I  CONTROL  31445   [conn53] "Frame","attr":{"frame":{"a":"7FA58E1F153A","b":"7FA58E139000","o":"B853A","s":"_ZN5mongo12_GLOBAL__N_110GetMoreCmd10Invocation3runEPNS_16OperationContextEPNS_3rpc21ReplyBuilderInterfaceE","s+":"19A"}}
.........

The invariant problem showing this stack trace has been fixed by SERVER-53701. However, the invariant was only possible to hit if the AutoGetCollectionForReadCommand instance was nested under a higher level lock helper instance. AutoGetCollectionForReadCommand contains an AutoStatsTracker and handles CurrentOp and Top entry updates: there's already a higher level AutoStatsTracker for the code path in question.

I suspect the getMore command has this AutoGetCollectionForRead instance instantiated when calling into CommonMongodProcessInterface::getIndexStats().



 Comments   
Comment by Githook User [ 02/Feb/21 ]

Author:

{'name': 'David Storch', 'email': 'david.storch@mongodb.com', 'username': 'dstorch'}

Message: SERVER-53723 Prevent failpoint from mistakenly acquiring db_raii helper during getMore on agg cursor
Branch: master
https://github.com/mongodb/mongo/commit/e0602a389d6a9d842b73fd9b22c9046b046131d7

Comment by David Storch [ 29/Jan/21 ]

dianna.hohensee, thanks for bringing this to our attention! There is probably an argument to be made that getIndexStats() should switch from AutoGetCollectionForReadCommandMaybeLockFree to AutoGetCollectionForReadMaybeLockFree. My understanding of the difference is that the former has an AutoStatsTracker for setting up some state related to Top and CurOp, whereas the latter does not have an AutoStatsTracker. However, I don't think that is quite the right fix here.

The reason is that CommonMongodProcessInterface::getIndexStats() is only called from aggregate cursors, not find cursors. When a getMore is run against an aggregate cursor, it actually does not acquire the AutoGetCollectionForReadCommandMaybeLockFree here. (This is because PlanExecutors for the aggregation subsystem have a "locks internally" lock policy.)

So if this is the case, then how did the invariant you observed trip, indicating that two AutoGet* objects were held at the same time? I believe the reason is that the failing test, kill_pinned_cursor.js, turns on the "waitAfterPinningCursorBeforeGetMoreBatch". It looks like there is a mistake in the implementation of this failpoint. It tries to release and reacquire the AutoGet*, but this will have the unintended consequence of acquiring the AutoGet* if it is not held in the first place. The AutoGetCollectionForReadMaybeLockFree should never be acquired here for a getMore against an agg cursor. I will propose a change that fixes the failpoint's implementation and send it to you for review.

Generated at Thu Feb 08 05:31:41 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.