[SERVER-23944] Failure to commit chunk migration due to shutdown should not fassert
Created: 27/Apr/16  Updated: 29/Aug/18  Resolved: 23/Sep/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.2.5, 3.3.5
Fix Version/s: 3.3.14

Type: Bug
Priority: Major - P3
Reporter: Kaloian Manassiev
Assignee: Kaloian Manassiev
Resolution: Done
Votes: 0
Labels: bkp, neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Duplicate
is duplicated by SERVER-26009 Shutdown gracefully when migration is... Closed
is duplicated by SERVER-26145 Shutdown during move chunk commit fas... Closed
is duplicated by SERVER-28808 (v3.2) applyChunkOpsDeprecated should... Closed
Related
related to SERVER-23150 Donor shard crashes in commitMigratio... Closed
is related to SERVER-30611 Failure to commit chunk migration due... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 16 (06/24/16), Sharding 2016-10-10
Linked BF Score: 18

 Description   

If the chunk migration commit code fails to apply the metadata change transaction to the config server, it makes a best-effort attempt to determine whether the operation was actually applied. If this check fails for any reason, we currently terminate the server to avoid data corruption or loss.

Before terminating the server, we should check whether it is shutting down and, if so, skip the fatal assertion.

[js_test:multi_mongos2] 2016-04-27T00:18:01.698+0000 d20511| 2016-04-27T00:18:01.195+0000 I -        [conn8] Fatal assertion 34431 CallbackCanceled: Callback canceled
[js_test:multi_mongos2] 2016-04-27T00:18:01.698+0000 d20511| 2016-04-27T00:18:01.195+0000 I -        [conn8]
[js_test:multi_mongos2] 2016-04-27T00:18:01.698+0000 d20511|
[js_test:multi_mongos2] 2016-04-27T00:18:01.699+0000 d20511| ***aborting after fassert() failure
[js_test:multi_mongos2] 2016-04-27T00:18:01.699+0000 d20511|
[js_test:multi_mongos2] 2016-04-27T00:18:01.699+0000 d20511|
[js_test:multi_mongos2] 2016-04-27T00:18:01.700+0000 d20511| 2016-04-27T00:18:01.198+0000 W SHARDING [signalProcessingThread] error encountered while cleaning up distributed ping entry for ip-10-45-46-73:20511:1461716245:2082352190 :: caused by :: ShutdownInProgress: Shutdown in progress
[js_test:multi_mongos2] 2016-04-27T00:18:01.701+0000 d20511| 2016-04-27T00:18:01.198+0000 I CONTROL  [signalProcessingThread] now exiting
[js_test:multi_mongos2] 2016-04-27T00:18:01.701+0000 d20511| 2016-04-27T00:18:01.198+0000 I NETWORK  [signalProcessingThread] shutdown: going to close listening sockets...
[js_test:multi_mongos2] 2016-04-27T00:18:01.701+0000 d20511| 2016-04-27T00:18:01.198+0000 I NETWORK  [signalProcessingThread] closing listening socket: 13
[js_test:multi_mongos2] 2016-04-27T00:18:01.702+0000 d20511| 2016-04-27T00:18:01.198+0000 I NETWORK  [signalProcessingThread] closing listening socket: 14
[js_test:multi_mongos2] 2016-04-27T00:18:01.703+0000 d20511| 2016-04-27T00:18:01.198+0000 I NETWORK  [signalProcessingThread] removing socket file: /tmp/mongodb-20511.sock
[js_test:multi_mongos2] 2016-04-27T00:18:01.703+0000 d20511| 2016-04-27T00:18:01.198+0000 I NETWORK  [signalProcessingThread] shutdown: going to flush diaglog...
[js_test:multi_mongos2] 2016-04-27T00:18:01.703+0000 d20511| 2016-04-27T00:18:01.198+0000 I STORAGE  [signalProcessingThread] WiredTigerKVEngine shutting down
[js_test:multi_mongos2] 2016-04-27T00:18:01.704+0000 d20511| 2016-04-27T00:18:01.203+0000 F -        [conn8] Got signal: 6 (Aborted).
[js_test:multi_mongos2] 2016-04-27T00:18:01.704+0000 d20511|
[js_test:multi_mongos2] 2016-04-27T00:18:01.704+0000 d20511|  0x15edd22 0x15eca49 0x15ed332 0x3cd2e0f7e0 0x3cd2a32625 0x3cd2a33e05 0x1573be1 0x121095d 0x1216283 0xc86bbb 0xc88933 0x11caa60 0xdc5eb5 0x9f4e5a 0x1597cd1 0x3cd2e07aa1 0x3cd2ae893d
[js_test:multi_mongos2] 2016-04-27T00:18:01.705+0000 d20511| ----- BEGIN BACKTRACE -----
[js_test:multi_mongos2] 2016-04-27T00:18:01.736+0000 d20511|  mongod(mongo::printStackTrace(std::ostream&)+0x32) [0x15edd22]
[js_test:multi_mongos2] 2016-04-27T00:18:01.736+0000 d20511|  mongod(+0x11ECA49) [0x15eca49]
[js_test:multi_mongos2] 2016-04-27T00:18:01.736+0000 d20511|  mongod(+0x11ED332) [0x15ed332]
[js_test:multi_mongos2] 2016-04-27T00:18:01.737+0000 d20511|  libpthread.so.0(+0xF7E0) [0x3cd2e0f7e0]
[js_test:multi_mongos2] 2016-04-27T00:18:01.737+0000 d20511|  libc.so.6(gsignal+0x35) [0x3cd2a32625]
[js_test:multi_mongos2] 2016-04-27T00:18:01.738+0000 d20511|  libc.so.6(abort+0x175) [0x3cd2a33e05]
[js_test:multi_mongos2] 2016-04-27T00:18:01.738+0000 d20511|  mongod(mongo::fassertFailedWithStatus(int, mongo::Status const&)+0xB1) [0x1573be1]
[js_test:multi_mongos2] 2016-04-27T00:18:01.740+0000 d20511|  mongod(mongo::MigrationSourceManager::commitDonateChunk(mongo::OperationContext*)+0x343D) [0x121095d]
[js_test:multi_mongos2] 2016-04-27T00:18:01.740+0000 d20511|  mongod(+0xE16283) [0x1216283]
[js_test:multi_mongos2] 2016-04-27T00:18:01.741+0000 d20511|  mongod(mongo::Command::run(mongo::OperationContext*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*)+0x80B) [0xc86bbb]
[js_test:multi_mongos2] 2016-04-27T00:18:01.742+0000 d20511|  mongod(mongo::Command::execCommand(mongo::OperationContext*, mongo::Command*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*)+0x8B3) [0xc88933]
[js_test:multi_mongos2] 2016-04-27T00:18:01.743+0000 d20511|  mongod(mongo::runCommands(mongo::OperationContext*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*)+0x260) [0x11caa60]
[js_test:multi_mongos2] 2016-04-27T00:18:01.743+0000 d20511|  mongod(mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&)+0xB35) [0xdc5eb5]
[js_test:multi_mongos2] 2016-04-27T00:18:01.744+0000 d20511|  mongod(mongo::MyMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*)+0xEA) [0x9f4e5a]
[js_test:multi_mongos2] 2016-04-27T00:18:01.744+0000 d20511|  mongod(mongo::PortMessageServer::handleIncomingMsg(void*)+0x311) [0x1597cd1]
[js_test:multi_mongos2] 2016-04-27T00:18:01.745+0000 d20511|  libpthread.so.0(+0x7AA1) [0x3cd2e07aa1]
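
The check proposed in this description can be sketched roughly as follows. This is a simplified Python model, not the server's C++ code: `Status`, `ErrorCode`, `FatalAssertion`, and `handle_commit_failure` are hypothetical names introduced for illustration, and `NETWORK_TIMEOUT` is only a stand-in for "some other error". Treating `CallbackCanceled` as a shutdown error follows the log above and the commit message on this ticket.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ErrorCode(Enum):
    OK = auto()
    NETWORK_TIMEOUT = auto()       # stand-in for any non-shutdown failure
    SHUTDOWN_IN_PROGRESS = auto()
    CALLBACK_CANCELED = auto()


# Codes this sketch assumes mean "the node is going down".
SHUTDOWN_ERRORS = frozenset(
    {ErrorCode.SHUTDOWN_IN_PROGRESS, ErrorCode.CALLBACK_CANCELED}
)


class FatalAssertion(Exception):
    """Stands in for fassert(), which aborts the real process."""


@dataclass
class Status:
    code: ErrorCode
    reason: str = ""

    def is_ok(self):
        return self.code is ErrorCode.OK


def handle_commit_failure(verify_status, shutting_down):
    """Decide what to do after the best-effort check of a failed commit.

    verify_status: result of checking whether the metadata change was applied.
    shutting_down: whether this node has already begun shutting down.
    """
    if verify_status.is_ok():
        # The outcome of the commit is known, so nothing fatal happened.
        return verify_status

    if shutting_down or verify_status.code in SHUTDOWN_ERRORS:
        # The check failed only because the node is going away: report an
        # error to the caller instead of aborting the process.
        return Status(
            ErrorCode.SHUTDOWN_IN_PROGRESS,
            "commit outcome unknown, node is shutting down: "
            + verify_status.reason,
        )

    # Outcome unknown while the node keeps running: terminate, as today.
    raise FatalAssertion(verify_status.reason)
```

With this shape, the `CallbackCanceled` status from the log would be returned to the caller as a shutdown error, while an unexplained failure on a running node would still be fatal.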



 Comments   
Comment by Githook User [ 23/Sep/16 ]

Author: Kaloian Manassiev (kaloianm) <kaloian.manassiev@mongodb.com>

Message: SERVER-23944 Expect CallbackCanceled as shutdown error during chunk commit
Branch: master
https://github.com/mongodb/mongo/commit/1955b0542e68b31b1e93f99980316817bd1e4416

Comment by Kaloian Manassiev [ 19/Sep/16 ]

Author: Kaloian Manassiev (kaloianm) <kaloian.manassiev@mongodb.com>

Message: SERVER-26145 Do not fassert at shutdown during move chunk commit
Branch: master
https://github.com/mongodb/mongo/commit/f5ee90d15c8f7f723049924d6acc804c6790e278

Comment by Dianna Hohensee (Inactive) [ 19/Sep/16 ]

Can we backport this too? It's a v3.2 problem, too. https://jira.mongodb.org/browse/BF-1936

Comment by Kaloian Manassiev [ 19/Sep/16 ]

Author: Kaloian Manassiev (kaloianm) <kaloian.manassiev@mongodb.com>

Message: SERVER-26145 Do not fassert at shutdown during move chunk commit
Branch: master
https://github.com/mongodb/mongo/commit/dece96ba0408446ca034a7c897ab890a80d901ef

Comment by Dianna Hohensee (Inactive) [ 14/Sep/16 ]

After discussion, we are leaning toward staying in the critical section if the log write gets a shutdown error, either by retrying the log write in an infinite loop with a sleep between attempts, or by somehow keeping the critical section flag set and then returning. Shutdown cannot be halted once it has commenced. If a shutdown error is received on the refresh command, we can just clear the metadata, because the shard will have acquired the latest optime from the remote log command, and any other process that needs the metadata will correctly reload it. If the optime were stale, the reload would potentially be stale too, so we can't let anything happen with a stale optime.
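
The retry-loop variant of the approach in this comment could look roughly like the sketch below. It is a simplified Python model rather than server code: `ShutdownError`, `finish_migration_commit`, and the three callbacks are hypothetical names, and the one-second retry interval is an arbitrary assumption.

```python
import time


class ShutdownError(Exception):
    """Stands in for a shutdown-related error status (hypothetical name)."""


def finish_migration_commit(write_log, refresh_metadata, clear_metadata,
                            sleep=time.sleep, retry_interval_secs=1.0):
    """Run the post-commit steps and report how they ended."""
    # Step 1: the log write. A shutdown error here must not release the
    # critical section, so keep retrying. The loop ends only when the write
    # succeeds (in a real server, shutdown would end the process instead).
    attempts = 0
    while True:
        attempts += 1
        try:
            write_log()
            break
        except ShutdownError:
            sleep(retry_interval_secs)

    # Step 2: the metadata refresh. By now the log write has succeeded, so
    # the latest optime is known and it is safe to drop the cached metadata
    # and let the next reader reload it.
    try:
        refresh_metadata()
    except ShutdownError:
        clear_metadata()
        return {"log_attempts": attempts, "metadata": "cleared"}
    return {"log_attempts": attempts, "metadata": "refreshed"}
```

The two steps are handled differently on purpose: only the log write blocks progress, because everything after it depends on the optime that the write returns.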

Comment by Dianna Hohensee (Inactive) [ 01/Jul/16 ]

kaloian.manassiev What do we want the shard to do if it gets interrupted by a shutdown in the refresh logic?

Comment by Dianna Hohensee (Inactive) [ 15/Jun/16 ]

The patch for this fix should also include a JS test to make sure the fix works: set up a moveChunk, set some failpoints, and shut down the servers without hitting an fassert.

Generated at Thu Feb 08 04:04:54 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.