[SERVER-27072] Config Server shuts down during startup Created: 16/Nov/16  Updated: 14/Dec/16  Resolved: 14/Dec/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.2.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Darshan Shah Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

[oasprod@chi-ppe-oas019 ~]$ uname -a
Linux chi-ppe-oas019 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
[oasprod@chi-ppe-oas019 ~]$ cat /etc/red
redhat-lsb/ redhat-release
[oasprod@chi-ppe-oas019 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.1 (Maipo)
[oasprod@chi-ppe-oas019 ~]$


Attachments: Text File mongodb_configdb_bxb-ppe-oas002.log     Text File mongodb_configdb_bxb-ppe-oas012.log     Text File mongodb_configdb_chi-ppe-oas019.log    
Operating System: ALL
Participants:

 Description   

On a sharded cluster running MongoDB 3.2.5 with WiredTiger, the CSRS nodes simply shut down by themselves during startup.
Attaching the log files of the 3 CSRS nodes, with the following details:

  1. bxb-ppe-oas002 is the primary member of CSRS writing to local disk
  2. chi-ppe-oas019 is the secondary member of CSRS writing to local disk
  3. bxb-ppe-oas012 is the hidden member of CSRS writing to NFS

Here are the server details:

[oasprod@chi-ppe-oas019 ~]$ uname -a
Linux chi-ppe-oas019 3.10.0-229.el7.x86_64 #1 SMP Thu Jan 29 18:37:38 EST 2015 x86_64 x86_64 x86_64 GNU/Linux
[oasprod@chi-ppe-oas019 ~]$ cat /etc/red
redhat-lsb/ redhat-release
[oasprod@chi-ppe-oas019 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.1 (Maipo)
[oasprod@chi-ppe-oas019 ~]$



 Comments   
Comment by Kelsey Schubert [ 14/Dec/16 ]

Hi darshan.shah@interactivedata.com,

This "transition to SECONDARY" message is a bit misleading: the node hadn't actually become a secondary, it was merely attempting to step down as primary. As you only had a single-node replica set at the time, it could not become a secondary. The attempted transition to secondary happened because a shutdown was triggered.

Since the log messages indicate that this shutdown was initiated by an external process, my recommendation would be to investigate what is sending these shutdown commands.
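
For example, you may be able to identify the host that issued the shutdown by matching the connection id on the "terminating" line against the corresponding "connection accepted" line earlier in the same log. A rough sketch only; the log path is a placeholder, and 11 should be replaced with the connection number shown on your own terminating line:

grep "terminating, shutdown command received" /path/to/mongodb_configdb.log
# if the line above names [conn11], look up where that connection came from
grep "connection accepted from .* #11 " /path/to/mongodb_configdb.log

The second grep should show the remote IP:port that opened that connection, i.e. the client that sent the shutdown command.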

Kind regards,
Thomas

Comment by Darshan Shah [ 21/Nov/16 ]

Yes, it is quite frequent and happens most of the time when the script is called from cron.
I re-checked and could not find any script issuing a shutdown to the config servers.
It happens on a different set of servers in the DEV environment as well.
Attaching a snippet of the logs that shows this behavior. Strangely, these are the observations:

  1. primary config server transitions to secondary BEFORE the termination command is received.
  2. the secondary is syncing from the hidden member instead of the primary (see the command sketch after this list)
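
One way to confirm which member each node is syncing from is to query rs.status() on each config server. A rough sketch; port 29102 matches the log snippets below, and the host defaults to localhost:

mongo --port 29102 --eval 'rs.status().members.forEach(function(m) { print(m.name + "  " + m.stateStr + "  syncingTo: " + (m.syncingTo || "-")); })'

In 3.2 the syncingTo field should show the current sync source for each secondary, which would confirm whether the chain really goes through the hidden member.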

PRIMARY:
2016-11-20T19:31:48.426-0500 I REPL     [ReplicationExecutor] This node is bxb-dev-oas030:29102 in the config
2016-11-20T19:31:48.426-0500 I REPL     [ReplicationExecutor] transition to STARTUP2
2016-11-20T19:31:48.426-0500 I REPL     [ReplicationExecutor] Starting replication applier threads
2016-11-20T19:31:48.427-0500 I REPL     [ReplicationExecutor] transition to RECOVERING
2016-11-20T19:31:48.428-0500 I REPL     [ReplicationExecutor] transition to SECONDARY
2016-11-20T19:31:48.428-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to bxb-dev-oas027:29102; HostUnreachable: Connection refused
2016-11-20T19:31:48.429-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to bxb-dev-oas026:29102; HostUnreachable: Connection refused
2016-11-20T19:31:48.430-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to bxb-dev-oas027:29102; HostUnreachable: Connection refused
2016-11-20T19:31:48.430-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to bxb-dev-oas026:29102; HostUnreachable: Connection refused
2016-11-20T19:31:48.431-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to bxb-dev-oas027:29102; HostUnreachable: Connection refused
2016-11-20T19:31:48.432-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to bxb-dev-oas026:29102; HostUnreachable: Connection refused
2016-11-20T19:31:53.434-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to bxb-dev-oas027:29102
2016-11-20T19:31:53.435-0500 I REPL     [ReplicationExecutor] Member bxb-dev-oas027:29102 is now in state SECONDARY
2016-11-20T19:31:53.436-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to bxb-dev-oas026:29102; HostUnreachable: Connection refused
2016-11-20T19:31:53.437-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to bxb-dev-oas026:29102; HostUnreachable: Connection refused
2016-11-20T19:31:53.438-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to bxb-dev-oas026:29102; HostUnreachable: Connection refused
2016-11-20T19:31:58.440-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to bxb-dev-oas026:29102
2016-11-20T19:31:58.441-0500 I REPL     [ReplicationExecutor] Member bxb-dev-oas026:29102 is now in state SECONDARY
2016-11-20T19:31:58.967-0500 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2016-11-20T19:31:58.967-0500 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
2016-11-20T19:31:58.967-0500 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
2016-11-20T19:31:58.968-0500 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 24
2016-11-20T19:31:58.968-0500 I REPL     [ReplicationExecutor] transition to PRIMARY
2016-11-20T19:31:58.969-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to bxb-dev-oas026:29102
2016-11-20T19:31:58.969-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to bxb-dev-oas026:29102
2016-11-20T19:31:59.429-0500 I REPL     [rsSync] transition to primary complete; database writes are now permitted
2016-11-20T19:32:16.166-0500 I REPL     [ReplicationExecutor] transition to SECONDARY
2016-11-20T19:32:16.166-0500 I COMMAND  [conn11] terminating, shutdown command received
2016-11-20T19:32:16.166-0500 I FTDC     [conn11] Shutting down full-time diagnostic data capture
2016-11-20T19:32:16.167-0500 I REPL     [conn11] Stopping replication applier threads
2016-11-20T19:32:17.430-0500 I CONTROL  [conn11] now exiting
 
 
HIDDEN:
2016-11-20T19:31:57.046-0500 I REPL     [ReplicationExecutor] Starting replication applier threads
2016-11-20T19:31:57.046-0500 I REPL     [ReplicationExecutor] transition to RECOVERING
2016-11-20T19:31:57.047-0500 I REPL     [ReplicationExecutor] transition to SECONDARY
2016-11-20T19:31:57.048-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to bxb-dev-oas030:29102
2016-11-20T19:31:57.048-0500 I REPL     [ReplicationExecutor] Member bxb-dev-oas030:29102 is now in state SECONDARY
2016-11-20T19:31:57.049-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to bxb-dev-oas027:29102
2016-11-20T19:31:57.049-0500 I REPL     [ReplicationExecutor] Member bxb-dev-oas027:29102 is now in state SECONDARY
2016-11-20T19:32:02.049-0500 I REPL     [ReplicationExecutor] Member bxb-dev-oas030:29102 is now in state PRIMARY
2016-11-20T19:32:03.048-0500 I REPL     [ReplicationExecutor] syncing from: bxb-dev-oas030:29102
2016-11-20T19:32:03.051-0500 I REPL     [SyncSourceFeedback] setting syncSourceFeedback to bxb-dev-oas030:29102
2016-11-20T19:32:03.052-0500 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to bxb-dev-oas030:29102
2016-11-20T19:32:16.166-0500 I REPL     [ReplicationExecutor] could not find member to sync from
2016-11-20T19:32:16.166-0500 I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to bxb-dev-oas030:29102
2016-11-20T19:32:16.166-0500 I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-11-20T19:32:17.430-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to bxb-dev-oas030:29102; HostUnreachable: Connection reset by peer
 
 
SECONDARY:
2016-11-20T19:31:53.991-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to bxb-dev-oas026:29102; HostUnreachable: Connection refused
2016-11-20T19:31:58.986-0500 I REPL     [ReplicationExecutor] Member bxb-dev-oas030:29102 is now in state PRIMARY
2016-11-20T19:31:58.993-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to bxb-dev-oas026:29102
2016-11-20T19:31:58.993-0500 I REPL     [ReplicationExecutor] Member bxb-dev-oas026:29102 is now in state SECONDARY
2016-11-20T19:32:04.986-0500 I REPL     [ReplicationExecutor] syncing from: bxb-dev-oas026:29102
2016-11-20T19:32:04.988-0500 I REPL     [SyncSourceFeedback] setting syncSourceFeedback to bxb-dev-oas026:29102
2016-11-20T19:32:04.990-0500 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to bxb-dev-oas026:29102
2016-11-20T19:32:16.988-0500 I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to bxb-dev-oas030:29102
2016-11-20T19:32:16.988-0500 I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-11-20T19:32:17.430-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to bxb-dev-oas030:29102; HostUnreachable: Connection reset by peer
2016-11-20T19:32:17.431-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to bxb-dev-oas030:29102; HostUnreachable: Connection refused
2016-11-20T19:32:17.432-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to bxb-dev-oas030:29102; HostUnreachable: Connection refused
2016-11-20T19:32:19.435-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to bxb-dev-oas030:29102; HostUnreachable: Connection refused
2016-11-20T19:32:19.436-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to bxb-dev-oas030:29102; HostUnreachable: Connection refused
2016-11-20T19:32:19.437-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to bxb-dev-oas030:29102; HostUnreachable: Connection refused
2016-11-20T19:32:19.993-0500 I REPL     [ReplicationExecutor] could not find member to sync from
2016-11-20T19:32:19.994-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to bxb-dev-oas030:29102; HostUnreachable: Connection refused
2016-11-20T19:32:19.995-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to bxb-dev-oas030:29102; HostUnreachable: Connection refused
2016-11-20T19:32:19.996-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to bxb-dev-oas030:29102; HostUnreachable: Connection refused
2016-11-20T19:32:21.165-0500 I COMMAND  [conn7] terminating, shutdown command received
2016-11-20T19:32:21.165-0500 I FTDC     [conn7] Shutting down full-time diagnostic data capture
2016-11-20T19:32:21.166-0500 I REPL     [conn7] Stopping replication applier threads
2016-11-20T19:32:21.166-0500 I STORAGE  [conn5] got request after shutdown()
2016-11-20T19:32:22.597-0500 I STORAGE  [conn2] got request after shutdown()

Comment by Ramon Fernandez Marina [ 21/Nov/16 ]

darshan.shah@interactivedata.com, the logs for all nodes show the following messages:

2016-11-16T13:41:06.538-0500 I COMMAND  [conn2] terminating, shutdown command received
2016-11-16T13:41:10.843-0500 I COMMAND  [conn17] terminating, shutdown command received
2016-11-16T13:41:17.140-0500 I COMMAND  [conn18] terminating, shutdown command received

I don't think this message can appear unless there is external intervention to shut down the node, but I'm not certain, so I need to dig a little deeper here.

Given the timing, it seems the nodes were shut down externally on purpose. Is this a chronic issue? Does it happen every time the nodes are started up?
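
For example, it may be worth checking cron and the service manager on those hosts for anything that stops mongod around these timestamps. A rough sketch for a RHEL 7 host; the paths, timestamp, and unit name are assumptions to adjust for your environment:

crontab -l                                                                  # per-user cron jobs
sudo grep -ril "mongo" /etc/cron.d /etc/cron.daily /etc/cron.hourly /var/spool/cron 2>/dev/null
systemctl list-units --all | grep -i mongo                                  # any systemd unit managing mongod
sudo journalctl --since "2016-11-16 13:40" | grep -iE "mongod|shutdown"     # system activity around the shutdowns above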
