[SERVER-27900] Shutdown can get stuck behind any thread doing ShardRegistry::reload Created: 02/Feb/17  Updated: 05/Apr/17  Resolved: 05/Apr/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.4.2, 3.5.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kaloian Manassiev Assignee: Kaloian Manassiev
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
duplicates SERVER-27691 ServiceContext::setKillAllOperations ... Closed
Operating System: ALL
Sprint: Sharding 2017-03-27, Sharding 2017-04-17
Participants:
Linked BF Score: 0

 Description   

The ShardRegistry::reload call spawns a thread to refresh the list of shards from the config server. Because this thread runs with its own OperationContext, it ends up calling ReplicationCoordinatorImpl::waitUntilOpTimeForRead without any timeout.

As a result, the shutdown sequence gets stuck: replication cannot make progress and advance the opTime because the server is shutting down, while the reload operation cannot proceed because it is waiting for the opTime to advance.

The underlying reason is that replication is the last entry in the shutdown sequence, so it never gets invoked in the scenario above, and waitUntilOpTimeForRead remains blocked permanently.
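
For illustration, the following is a minimal, standalone C++ sketch of the deadlock pattern described above. It is not MongoDB code: ReplState, waitUntilOpTime and reloadThread are hypothetical stand-ins for ReplicationCoordinatorImpl::waitUntilOpTimeForRead, the ShardRegistry::reload thread and the replication shutdown step.

// Simplified sketch of the deadlock (hypothetical names, not actual MongoDB code).
#include <condition_variable>
#include <mutex>
#include <thread>

struct ReplState {
    std::mutex mtx;
    std::condition_variable cv;
    long opTime = 0;

    // Analogue of waitUntilOpTimeForRead without a timeout: blocks until the
    // opTime advances, with no deadline and no interruption check.
    void waitUntilOpTime(long target) {
        std::unique_lock<std::mutex> lk(mtx);
        cv.wait(lk, [&] { return opTime >= target; });
    }
};

int main() {
    ReplState repl;

    // Analogue of the ShardRegistry::reload thread: waits for an opTime that
    // can no longer advance once the server starts shutting down.
    std::thread reloadThread([&] { repl.waitUntilOpTime(1); });

    // The shutdown sequence waits for this work to finish *before* replication
    // shutdown runs, so nothing ever signals the condition variable: deadlock.
    reloadThread.join();  // blocks forever
    // Replication shutdown (which would wake or interrupt the waiter) is the
    // last step of the sequence and is never reached.
}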



 Comments   
Comment by Kaloian Manassiev [ 05/Apr/17 ]

Fixed as a result of SERVER-27691.

Comment by Kaloian Manassiev [ 29/Mar/17 ]

One proposed solution was to mask the checkForInterrupt calls so that they check nothing when interruption has been disallowed for the OperationContext. However, this has the side effect of also ignoring operation deadlines for internal threads, which would make the semantics of the "allowInterrupt" setting even harder to explain.

Instead, we will go all the way and mark all threads as interruptible by default in master, and deal with the aftermath of potentially having to fix places that do not properly handle interruptions or exceptions being thrown.
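
To illustrate the tradeoff, the following is a minimal C++ sketch (not MongoDB code; OpCtx, checkForInterrupt and interruptibleWait here are hypothetical stand-ins) of an interruptible-by-default wait: the wait periodically re-checks a kill flag and a deadline, so masking the check for "non-interruptible" threads would also mask the deadline, whereas making every thread interruptible lets shutdown break such waits at the cost of callers having to handle the resulting exception.

// Sketch of an interruptible wait (hypothetical names, not actual MongoDB code).
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <stdexcept>

struct OpCtx {
    std::atomic<bool> killed{false};
    std::chrono::steady_clock::time_point deadline =
        std::chrono::steady_clock::time_point::max();

    // Analogue of checkForInterrupt(): throws if the operation was killed or
    // its deadline has passed. Masking this call for internal threads would
    // also mask the deadline check, which is the complexity noted above.
    void checkForInterrupt() {
        if (killed.load() || std::chrono::steady_clock::now() >= deadline)
            throw std::runtime_error("operation interrupted");
    }
};

// Interruptible wait: instead of blocking on cv.wait(lk, pred) indefinitely,
// wake up periodically and re-check for interruption so the wait cannot
// outlive shutdown.
template <typename Pred>
void interruptibleWait(OpCtx& opCtx, std::condition_variable& cv,
                       std::unique_lock<std::mutex>& lk, Pred pred) {
    while (!pred()) {
        opCtx.checkForInterrupt();
        cv.wait_for(lk, std::chrono::milliseconds(100));
    }
}

In the real server the interruption surfaces as an exception or error status that each caller must handle, which is the "aftermath" referred to above.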

Comment by Judah Schvimer [ 03/Feb/17 ]

The node does step down here. Do we expect the query to fail when the stepDown occurs, rather than keep blocking?

[js_test:commands_that_write_accept_wc_configRS] 2017-01-19T18:17:53.497+0000 c20266| 2017-01-19T18:17:53.497+0000 I CONTROL  [eventTerminate] shutdown event signaled, will terminate after current cmd ends
...
[js_test:commands_that_write_accept_wc_configRS] 2017-01-19T18:18:01.430+0000 c20266| 2017-01-19T18:18:01.430+0000 I REPL     [ReplicationExecutor] can't see a majority of the set, relinquishing primary
[js_test:commands_that_write_accept_wc_configRS] 2017-01-19T18:18:01.430+0000 c20266| 2017-01-19T18:18:01.430+0000 I REPL     [ReplicationExecutor] Stepping down from primary in response to heartbeat
[js_test:commands_that_write_accept_wc_configRS] 2017-01-19T18:18:01.430+0000 c20266| 2017-01-19T18:18:01.430+0000 I REPL     [replExecDBWorker-0] transition to SECONDARY
[js_test:commands_that_write_accept_wc_configRS] 2017-01-19T18:18:01.430+0000 c20266| 2017-01-19T18:18:01.430+0000 I NETWORK  [replExecDBWorker-0] legacy transport layer closing all connections
[js_test:commands_that_write_accept_wc_configRS] 2017-01-19T18:18:01.430+0000 c20266| 2017-01-19T18:18:01.430+0000 I NETWORK  [replExecDBWorker-0] Skip closing connection for connection # 37
[js_test:commands_that_write_accept_wc_configRS] 2017-01-19T18:18:01.430+0000 c20266| 2017-01-19T18:18:01.430+0000 I NETWORK  [replExecDBWorker-0] Skip closing connection for connection # 21
[js_test:commands_that_write_accept_wc_configRS] 2017-01-19T18:18:01.430+0000 c20266| 2017-01-19T18:18:01.430+0000 I NETWORK  [replExecDBWorker-0] Skip closing connection for connection # 20

Comment by Kaloian Manassiev [ 02/Feb/17 ]

benety.goh, in the scenario above, even though ReplicationCoordinatorImpl::shutdown was not called: if the remaining nodes had already shut down, this node should have stepped down, and if the other nodes had not yet shut down, the opTime should have advanced. Do you have any idea why neither happened?
