[SERVER-37838] stepDown during a getMore followed by an OP_KILL_CURSORS can crash the server Created: 30/Oct/18  Updated: 29/Oct/23  Resolved: 29/Nov/18

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 4.0.3
Fix Version/s: 4.0.5, 4.1.6

Type: Bug Priority: Major - P3
Reporter: Robert Stam Assignee: Ian Boros
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.0
Steps To Reproduce:

While there is an ongoing change stream operation with an active cursor run rs.stepDown() on the primary.

This occurred while manually testing the C# driver support for change stream recovery after elections. It's possible there is a driver bug... but whether or not there is the server should not abort.

Sprint: Query 2018-12-03, Query 2018-12-17
Participants:
Linked BF Score: 33

 Description   

After running rs.stepDown() on the primary the primary crashes with the following output at the tail of the log:

2018-10-30T16:04:18.438-0400 I NETWORK  [conn71] received client metadata from 192.168.0.122:63244 conn71: { driver: { name: "mongo-csharp-driver", version: "0.0.0.0" }, os: { type: "Windows", name: "Microsoft Windows 10.0.17134", architecture: "x86_32", version: "10.0.17134" }, platform: ".NET Framework 4.7.3190.0" }
2018-10-30T16:04:18.439-0400 F -        [conn71] Invariant failure pin == ErrorCodes::CursorNotFound || pin == ErrorCodes::Unauthorized src\mongo\db\cursor_manager.cpp 186
2018-10-30T16:04:18.443-0400 F -        [conn71]
***aborting after invariant() failure
2018-10-30T16:04:18.547-0400 F -        [conn71] Got signal: 22 (SIGABRT).
mongod.exe    ...\src\mongo\util\stacktrace_windows.cpp(244)                                   mongo::printStackTrace+0x67x
mongod.exe    ...\src\mongo\util\signal_handlers_synchronous.cpp(182)                          mongo::`anonymous namespace'::printSignalAndBacktrace+0x116x
mongod.exe    ...\src\mongo\util\signal_handlers_synchronous.cpp(238)                          mongo::`anonymous namespace'::abruptQuit+0x133x
ucrtbase.dll                                                                                   raise+0x475x
ucrtbase.dll                                                                                   abort+0x49x
mongod.exe    ...\src\mongo\util\assert_util.cpp(104)                                          mongo::invariantFailed+0x410x
mongod.exe    ...\src\mongo\db\cursor_manager.cpp(186)                                         mongo::`anonymous namespace'::GlobalCursorIdCache::killCursor+0x208x
mongod.exe    ...\src\mongo\db\cursor_manager.cpp(371)                                         mongo::CursorManager::killCursorGlobalIfAuthorized+0x73x
mongod.exe    ...\src\mongo\db\service_entry_point_common.cpp(1127)                            mongo::`anonymous namespace'::receivedKillCursors+0x613x
mongod.exe    ...\src\mongo\db\service_entry_point_common.cpp(1339)                            mongo::ServiceEntryPointCommon::handleRequest+0x1313x
mongod.exe    ...\src\mongo\db\service_entry_point_mongod.cpp(123)                             mongo::ServiceEntryPointMongod::handleRequest+0x67x
mongod.exe    ...\src\mongo\transport\service_state_machine.cpp(378)                           mongo::ServiceStateMachine::_processMessage+0x479x
mongod.exe    ...\src\mongo\transport\service_state_machine.cpp(440)                           mongo::ServiceStateMachine::_runNextInGuard+0x207x
mongod.exe    ...\src\mongo\transport\service_state_machine.cpp(479)                           <lambda_b23af5efc3b61ab25bff0c3bcd13382b>::operator()+0x109x
mongod.exe    ...\src\mongo\transport\service_executor_synchronous.cpp(116)                    mongo::transport::ServiceExecutorSynchronous::schedule+0x283x
mongod.exe    ...\src\mongo\transport\service_state_machine.cpp(483)                           mongo::ServiceStateMachine::_scheduleNextWithGuard+0x236x
mongod.exe    ...\src\mongo\transport\service_state_machine.cpp(303)                           mongo::ServiceStateMachine::_sourceCallback+0x289x
mongod.exe    ...\src\mongo\transport\service_state_machine.cpp(259)                           <lambda_5e72c1d9f04187bee457482c0afa7487>::operator()+0x185x
mongod.exe    ...\src\mongo\util\future.h(788)                                                 mongo::future_details::Future<mongo::Message>::getAsync<<lambda_5e72c1d9f04187bee457482c0afa7487> >+0x103x
mongod.exe    ...\src\mongo\transport\service_state_machine.cpp(254)                           mongo::ServiceStateMachine::_sourceMessage+0x192x
mongod.exe    ...\src\mongo\transport\service_state_machine.cpp(436)                           mongo::ServiceStateMachine::_runNextInGuard+0x242x
mongod.exe    ...\src\mongo\transport\service_state_machine.cpp(479)                           <lambda_b23af5efc3b61ab25bff0c3bcd13382b>::operator()+0x109x
mongod.exe    ...\src\mongo\transport\service_executor_synchronous.cpp(133)                    <lambda_472996f9e6b00ec91d31b43a6cde81f7>::operator()+0x217x
mongod.exe    c:\program files (x86)\microsoft visual studio 14.0\vc\include\thr\xthread(247)  std::_LaunchPad<std::unique_ptr<std::tuple<std::function<void __cdecl(void)> >,std::default_delete<std::tuple<std::function<void __cdecl(void)> > > > >::_Run+0x133x
mongod.exe    c:\program files (x86)\microsoft visual studio 14.0\vc\include\thr\xthread(210)  std::_Pad::_Call_func+0x9x
ucrtbase.dll                                                                                   o_ceil+0x78x
KERNEL32.DLL                                                                                   BaseThreadInitThunk+0x20x
2018-10-30T16:04:18.553-0400 F CONTROL  [conn71] *** unhandled exception 0x0000000E at 0x00007FFDB0F9A388, terminating
2018-10-30T16:04:18.553-0400 F CONTROL  [conn71] *** stack trace for unhandled exception:
2018-10-30T16:04:18.556-0400 I -        [conn71] KERNELBASE.dll                                                                                   RaiseException+0x104x
mongod.exe      ...\src\mongo\util\signal_handlers_synchronous.cpp(239)                          mongo::`anonymous namespace'::abruptQuit+0x157x
ucrtbase.dll                                                                                     raise+0x475x
ucrtbase.dll                                                                                     abort+0x49x
mongod.exe      ...\src\mongo\util\assert_util.cpp(104)                                          mongo::invariantFailed+0x410x
mongod.exe      ...\src\mongo\db\cursor_manager.cpp(186)                                         mongo::`anonymous namespace'::GlobalCursorIdCache::killCursor+0x208x
mongod.exe      ...\src\mongo\db\cursor_manager.cpp(371)                                         mongo::CursorManager::killCursorGlobalIfAuthorized+0x73x
mongod.exe      ...\src\mongo\db\service_entry_point_common.cpp(1127)                            mongo::`anonymous namespace'::receivedKillCursors+0x613x
mongod.exe      ...\src\mongo\db\service_entry_point_common.cpp(1339)                            mongo::ServiceEntryPointCommon::handleRequest+0x1313x
mongod.exe      ...\src\mongo\db\service_entry_point_mongod.cpp(123)                             mongo::ServiceEntryPointMongod::handleRequest+0x67x
mongod.exe      ...\src\mongo\transport\service_state_machine.cpp(378)                           mongo::ServiceStateMachine::_processMessage+0x479x
mongod.exe      ...\src\mongo\transport\service_state_machine.cpp(440)                           mongo::ServiceStateMachine::_runNextInGuard+0x207x
mongod.exe      ...\src\mongo\transport\service_state_machine.cpp(479)                           <lambda_b23af5efc3b61ab25bff0c3bcd13382b>::operator()+0x109x
mongod.exe      ...\src\mongo\transport\service_executor_synchronous.cpp(116)                    mongo::transport::ServiceExecutorSynchronous::schedule+0x283x
mongod.exe      ...\src\mongo\transport\service_state_machine.cpp(483)                           mongo::ServiceStateMachine::_scheduleNextWithGuard+0x236x
mongod.exe      ...\src\mongo\transport\service_state_machine.cpp(303)                           mongo::ServiceStateMachine::_sourceCallback+0x289x
mongod.exe      ...\src\mongo\transport\service_state_machine.cpp(259)                           <lambda_5e72c1d9f04187bee457482c0afa7487>::operator()+0x185x
mongod.exe      ...\src\mongo\util\future.h(788)                                                 mongo::future_details::Future<mongo::Message>::getAsync<<lambda_5e72c1d9f04187bee457482c0afa7487> >+0x103x
mongod.exe      ...\src\mongo\transport\service_state_machine.cpp(254)                           mongo::ServiceStateMachine::_sourceMessage+0x192x
mongod.exe      ...\src\mongo\transport\service_state_machine.cpp(436)                           mongo::ServiceStateMachine::_runNextInGuard+0x242x
mongod.exe      ...\src\mongo\transport\service_state_machine.cpp(479)                           <lambda_b23af5efc3b61ab25bff0c3bcd13382b>::operator()+0x109x
mongod.exe      ...\src\mongo\transport\service_executor_synchronous.cpp(133)                    <lambda_472996f9e6b00ec91d31b43a6cde81f7>::operator()+0x217x
mongod.exe      c:\program files (x86)\microsoft visual studio 14.0\vc\include\thr\xthread(247)  std::_LaunchPad<std::unique_ptr<std::tuple<std::function<void __cdecl(void)> >,std::default_delete<std::tuple<std::function<void __cdecl(void)> > > > >::_Run+0x133x
mongod.exe      c:\program files (x86)\microsoft visual studio 14.0\vc\include\thr\xthread(210)  std::_Pad::_Call_func+0x9x
ucrtbase.dll                                                                                     o_ceil+0x78x
KERNEL32.DLL                                                                                     BaseThreadInitThunk+0x20x
2018-10-30T16:04:18.561-0400 I CONTROL  [conn71] writing minidump diagnostic file C:\mongodb\v4.2018-10-30T20-04-18.mdmp
2018-10-30T16:04:18.674-0400 I REPL     [rsBackgroundSync] sync source candidate: Fractal:27017
2018-10-30T16:04:18.675-0400 I ASIO     [RS] Connecting to Fractal:27017
2018-10-30T16:04:18.678-0400 F CONTROL  [conn71] *** immediate exit due to unhandled exception



 Comments   
Comment by Githook User [ 12/Dec/18 ]

Author:

{'name': 'Ian Boros', 'email': 'ian.boros@10gen.com'}

Message: SERVER-37838 prevent crash with getMore and simultaneous stepdown
Branch: v4.0
https://github.com/mongodb/mongo/commit/3a76585f4e5d53fc380e0c0db44fd06f45dcdb24

Comment by Githook User [ 30/Nov/18 ]

Author:

{'name': 'Ian Boros', 'email': 'ian.boros@10gen.com'}

Message: SERVER-37838 add correct tags to test
Branch: master
https://github.com/mongodb/mongo/commit/f3acfdec219b7a9122b4954a269c89e3604416d2

Comment by Githook User [ 29/Nov/18 ]

Author:

{'name': 'Ian Boros', 'email': 'ian.boros@10gen.com'}

Message: SERVER-37838 prevent crash with getMore and simultaneous stepdown
Branch: master
https://github.com/mongodb/mongo/commit/b7300d0e9fac1df82468958b4b8eeda27a8a8c15

Comment by Ian Boros [ 19/Nov/18 ]

Okay, my above theory was mostly right, but here's a more thorough explanation in case I need to remember what happened:
1) We start an aggregation, and get a cursor ID
2) We do a getMore, and the getMore gets passed any checkForInterrupt() calls (I imagine this window of time is pretty small, but the race is there).
3) A stepdown happens. The operation is marked as killed.
4) The getMore "completes," and checks its cursor back in. At this point, the CursorManager sees that the operation was killed, and then marks the cursor as killed so that subsequent operations on the cursor will fail. The InterruptedDueToStepDown error code is stashed inside the cursor's executor.
5) Someone runs an OP_KILL_CURSORS on that cursor ID, on the stepped-down node.
6) As part of pinning the cursor to kill it, we see that its executor was killed. We clean up the cursor and return the stashed error code, InterruptedDueToStepDown. As a result, this invariant fails.

This can be triggered without a $changeStream stage, which is a little scary. So far, the only event I've found which can trigger this is a stepdown. The following other events happening during a getMore, after the last checkForInterrupt() don't trigger the issue:

killCursors command: The cursor will be disposed of immediately once it's checked back in, here.
killOp Same as above

Dropping a collection: Ownership of the cursor will be transferred to the operation using the cursor, and it will be cleaned up. Therefore, when the OP_KILL_CURSORS goes to check out the cursor, it won't be there so it will just return a CursorNotFound error.
Dropping an index: Same as above

 

This won't affect v3.6 because cursors could not be killed while they were checked out.

Comment by Ian Whalen (Inactive) [ 09/Nov/18 ]

need to check whether this affects branches older than 4.0.

Generated at Thu Feb 08 04:47:10 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.