[SERVER-23073] Aggregation checks the output collection's sharding state without collection lock Created: 10/Mar/16  Updated: 13/Aug/16  Resolved: 18/Jul/16

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 3.2.4, 3.3.2
Fix Version/s: 3.3.11

Type: Bug Priority: Critical - P2
Reporter: Kaloian Manassiev Assignee: Charlie Swanson
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-24664 Get rid of calls to ShardingState::ge... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Query 12 (04/04/16), Query 17 (07/15/16), Query 18 (08/05/16)
Participants:

 Description   

The aggregation framework checks the output collection's sharding state without holding collection lock. This might cause it to see inconsistent state if the collection is currently being made sharded.

This was caught by changing the version checking code to go through the newly introduced CollectionShardingState class instead of accessing the sharded collection metadata directly.

While this is not a pressing issue it would be nice to fix it so we can switch to storing the per-collection sharding information as decoration on the collection itself.

This is the call stack in question:

[MongoDFixture:job0] 2016-03-10T15:59:04.172-0500 I -        [conn2] Invariant failure txn->lockState()->isCollectionLockedForMode(ns, MODE_IS) src\mongo\db\s\collection_sharding_state.cpp 64
[MongoDFixture:job0] 2016-03-10T15:59:04.172-0500 I -        [conn2]
[MongoDFixture:job0]
[MongoDFixture:job0] ***aborting after invariant() failure
[MongoDFixture:job0]
[MongoDFixture:job0]
[MongoDFixture:job0] 2016-03-10T15:59:04.992-0500 I CONTROL  [conn2] mongod.exe     ...\src\mongo\util\stacktrace_windows.cpp(174)                                   mongo::printStackTrace+0x5b
[MongoDFixture:job0] 2016-03-10T15:59:04.992-0500 I CONTROL  [conn2] mongod.exe     ...\src\mongo\util\signal_handlers_synchronous.cpp(181)                          mongo::`anonymous namespace'::printSignalAndBacktrace+0x84
[MongoDFixture:job0] 2016-03-10T15:59:04.992-0500 I CONTROL  [conn2] mongod.exe     ...\src\mongo\util\signal_handlers_synchronous.cpp(238)                          mongo::`anonymous namespace'::abruptQuit+0x39
[MongoDFixture:job0] 2016-03-10T15:59:04.992-0500 I CONTROL  [conn2] MSVCR120D.dll                                                                                   raise+0x35f
[MongoDFixture:job0] 2016-03-10T15:59:04.992-0500 I CONTROL  [conn2] MSVCR120D.dll                                                                                   abort+0x40
[MongoDFixture:job0] 2016-03-10T15:59:04.993-0500 I CONTROL  [conn2] mongod.exe     ...\src\mongo\util\assert_util.cpp(154)                                          mongo::invariantFailed+0x13d
[MongoDFixture:job0] 2016-03-10T15:59:04.993-0500 I CONTROL  [conn2] mongod.exe     ...\src\mongo\db\s\collection_sharding_state.cpp(64)                             mongo::CollectionShardingState::get+0xc5
[MongoDFixture:job0] 2016-03-10T15:59:04.993-0500 I CONTROL  [conn2] mongod.exe     ...\src\mongo\db\s\collection_sharding_state.cpp(59)                             mongo::CollectionShardingState::get+0x3a
[MongoDFixture:job0] 2016-03-10T15:59:04.993-0500 I CONTROL  [conn2] mongod.exe     ...\src\mongo\db\pipeline\pipeline_d.cpp(87)                                     mongo::`anonymous namespace'::MongodImplementation::isSharded+0x42
[MongoDFixture:job0] 2016-03-10T15:59:04.993-0500 I CONTROL  [conn2] mongod.exe     ...\src\mongo\db\pipeline\document_source_out.cpp(63)                            mongo::DocumentSourceOut::prepTempCollection+0x184
[MongoDFixture:job0] 2016-03-10T15:59:04.993-0500 I CONTROL  [conn2] mongod.exe     ...\src\mongo\db\pipeline\document_source_out.cpp(129)                           mongo::DocumentSourceOut::getNext+0x159
[MongoDFixture:job0] 2016-03-10T15:59:04.993-0500 I CONTROL  [conn2] mongod.exe     ...\src\mongo\db\exec\pipeline_proxy.cpp(126)                                    mongo::PipelineProxyStage::getNextBson+0x8d
[MongoDFixture:job0] 2016-03-10T15:59:04.993-0500 I CONTROL  [conn2] mongod.exe     ...\src\mongo\db\exec\pipeline_proxy.cpp(72)                                     mongo::PipelineProxyStage::doWork+0x178
[MongoDFixture:job0] 2016-03-10T15:59:04.993-0500 I CONTROL  [conn2] mongod.exe     ...\src\mongo\db\exec\plan_stage.cpp(43)                                         mongo::PlanStage::work+0x6d
[MongoDFixture:job0] 2016-03-10T15:59:04.993-0500 I CONTROL  [conn2] mongod.exe     ...\src\mongo\db\query\plan_executor.cpp(403)                                    mongo::PlanExecutor::getNextImpl+0x789
[MongoDFixture:job0] 2016-03-10T15:59:04.993-0500 I CONTROL  [conn2] mongod.exe     ...\src\mongo\db\query\plan_executor.cpp(323)                                    mongo::PlanExecutor::getNext+0x7c
[MongoDFixture:job0] 2016-03-10T15:59:04.993-0500 I CONTROL  [conn2] mongod.exe     ...\src\mongo\db\commands\pipeline_command.cpp(95)                               mongo::handleCursorCommand+0x23d
[MongoDFixture:job0] 2016-03-10T15:59:04.994-0500 I CONTROL  [conn2] mongod.exe     ...\src\mongo\db\commands\pipeline_command.cpp(306)                              mongo::PipelineCommand::run+0xe8a
[MongoDFixture:job0] 2016-03-10T15:59:04.994-0500 I CONTROL  [conn2] mongod.exe     ...\src\mongo\db\dbcommands.cpp(1493)                                            mongo::Command::run+0xc70
[MongoDFixture:job0] 2016-03-10T15:59:04.994-0500 I CONTROL  [conn2] mongod.exe     ...\src\mongo\db\dbcommands.cpp(1356)                                            mongo::Command::execCommand+0xfd4
[MongoDFixture:job0] 2016-03-10T15:59:04.994-0500 I CONTROL  [conn2] mongod.exe     ...\src\mongo\db\commands.cpp(498)                                               mongo::runCommands+0x743
[MongoDFixture:job0] 2016-03-10T15:59:04.994-0500 I CONTROL  [conn2] mongod.exe     ...\src\mongo\db\instance.cpp(307)                                               mongo::`anonymous namespace'::receivedRpc+0x23c
[MongoDFixture:job0] 2016-03-10T15:59:04.994-0500 I CONTROL  [conn2] mongod.exe     ...\src\mongo\db\instance.cpp(531)                                               mongo::assembleResponse+0x877
[MongoDFixture:job0] 2016-03-10T15:59:04.994-0500 I CONTROL  [conn2] mongod.exe     ...\src\mongo\db\db.cpp(174)                                                     mongo::MyMessageHandler::process+0x158
[MongoDFixture:job0] 2016-03-10T15:59:05.006-0500 I CONTROL  [conn2] mongod.exe     ...\src\mongo\util\net\message_server_port.cpp(231)                              mongo::PortMessageServer::handleIncomingMsg+0x509
[MongoDFixture:job0] 2016-03-10T15:59:05.016-0500 I CONTROL  [conn2] mongod.exe     c:\program files (x86)\microsoft visual studio 12.0\vc\include\functional(1150)  std::_Bind<1,void * __ptr64,void * __ptr64 (__cdecl*const)(void * __ptr64),mongo::`anonymous namespace'::MessagingPortWithHandler * __ptr64>::_Do_call<,0>+0x6e
[MongoDFixture:job0] 2016-03-10T15:59:05.036-0500 I CONTROL  [conn2] mongod.exe     c:\program files (x86)\microsoft visual studio 12.0\vc\include\functional(1138)  std::_Bind<1,void * __ptr64,void * __ptr64 (__cdecl*const)(void * __ptr64),mongo::`anonymous namespace'::MessagingPortWithHandler * __ptr64>::operator()<>+0x56
[MongoDFixture:job0] 2016-03-10T15:59:05.048-0500 I CONTROL  [conn2] mongod.exe     c:\program files (x86)\microsoft visual studio 12.0\vc\include\functional(1150)  std::_Bind<0,void,std::_Bind<1,void * __ptr64,void * __ptr64 (__cdecl*const)(void * __ptr64),mongo::`anonymous namespace'::MessagingPortWithHandler * __ptr64> >::_Do_call<>+0x35
[MongoDFixture:job0] 2016-03-10T15:59:05.056-0500 I CONTROL  [conn2] mongod.exe     c:\program files (x86)\microsoft visual studio 12.0\vc\include\functional(1138)  std::_Bind<0,void,std::_Bind<1,void * __ptr64,void * __ptr64 (__cdecl*const)(void * __ptr64),mongo::`anonymous namespace'::MessagingPortWithHandler * __ptr64> >::operator()<>+0x56
[MongoDFixture:job0] 2016-03-10T15:59:05.074-0500 I CONTROL  [conn2] mongod.exe     c:\program files (x86)\microsoft visual studio 12.0\vc\include\thr\xthread(196)  std::_LaunchPad<std::_Bind<0,void,std::_Bind<1,void * __ptr64,void * __ptr64 (__cdecl*const)(void * __ptr64),mongo::`anonymous namespace'::MessagingPortWithHandler * __ptr64> > >::_Run+0x51
[MongoDFixture:job0] 2016-03-10T15:59:05.091-0500 I CONTROL  [conn2] mongod.exe     c:\program files (x86)\microsoft visual studio 12.0\vc\include\thr\xthread(188)  std::_LaunchPad<std::_Bind<0,void,std::_Bind<1,void * __ptr64,void * __ptr64 (__cdecl*const)(void * __ptr64),mongo::`anonymous namespace'::MessagingPortWithHandler * __ptr64> > >::_Go+0x28
[MongoDFixture:job0] 2016-03-10T15:59:05.099-0500 I CONTROL  [conn2] MSVCP120D.dll                                                                                   std::_Pad::_Release+0xd9
[MongoDFixture:job0] 2016-03-10T15:59:05.107-0500 I CONTROL  [conn2] MSVCR120D.dll                                                                                   beginthreadex+0x1f5
[MongoDFixture:job0] 2016-03-10T15:59:05.113-0500 I CONTROL  [conn2] MSVCR120D.dll                                                                                   endthreadex+0x1d7
[MongoDFixture:job0] 2016-03-10T15:59:05.121-0500 I CONTROL  [conn2] KERNEL32.DLL                                                                                    BaseThreadInitThunk+0x22
[MongoDFixture:job0] 2016-03-10T15:59:05.128-0500 F -        [conn2] Got signal: 22 (SIGABRT).
[MongoDFixture:job0] 2016-03-10T15:59:05.136-0500 I CONTROL  [conn2] writing minidump diagnostic file E:\workspace\mongo\mongod.2016-03-10T20-59-05.mdmp



 Comments   
Comment by Githook User [ 18/Jul/16 ]

Author:

{u'username': u'cswanson310', u'name': u'Charlie Swanson', u'email': u'cswanson310@gmail.com'}

Message: SERVER-23073 Reliably detect concurrent changes to $out collection.

A $out stage will write to a temporary collection, and then rename into
the target collection. If the original collection is sharded, has its
options changed, or changes its indexes during processing, the
aggregation should fail.

This change removes any race conditions around detecting these changes.
Branch: master
https://github.com/mongodb/mongo/commit/e420c60e6c6fcdf1ba0d65d58e41ab85c09c0f51

Comment by Githook User [ 18/Jul/16 ]

Author:

{u'username': u'cswanson310', u'name': u'Charlie Swanson', u'email': u'cswanson310@gmail.com'}

Message: SERVER-23073 Check if either collection is sharded before renaming.

This was previously checked at the mongos level. This check now also
happens on the mongod.
Branch: master
https://github.com/mongodb/mongo/commit/8a487d846954ff28df781bab07257694cacfd6aa

Comment by Charlie Swanson [ 20/Jun/16 ]

Sure, I've put it in our next iteration.

Comment by Kaloian Manassiev [ 20/Jun/16 ]

The linked ticket is in code review because we are able to remove all usages of ShardingState::getCollectionMetadata except for the one in the aggregation framework. We cannot resolve SERVER-24664 before this ticket has been fixed, that's why there is no fix version.

Nobody from the sharding team understands the aggregation framework locking rules sufficiently to make this change, so it is preferable that you guys fix it

Can you please triage it and let me know whether fixing it in the next iteration sounds feasible. Otherwise we might need to implement some kind of workaround.

Comment by Charlie Swanson [ 20/Jun/16 ]

So does that mean this should be in "3.3 Required"? It looks like the linked ticket is in code review, but without a FixVersion... Is this work that the sharding team could do?

Comment by Kaloian Manassiev [ 20/Jun/16 ]

As part of the changes for improving the RangeDeleter and parallel migrations, we want to have a single place to obtain the per-collection sharding metadata and to have this place enforce the proper collection locks, so we don't have to have a separate mutex for synchronization (see SERVER-24664). This is the only code path, which accesses collection metadata without proper locks.

Comment by Ian Whalen (Inactive) [ 22/Apr/16 ]

kaloian.manassiev can you offer any context on why did you put this straight into required? does it need to be done for 3.4?

Generated at Thu Feb 08 04:02:17 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.