[SERVER-10811] Secondary thinks we are down Created: 18/Sep/13  Updated: 11/Jul/16  Resolved: 04/Nov/13

Status: Closed
Project: Core Server
Component/s: Networking, Replication, Sharding
Affects Version/s: 2.4.4
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Dwayne Bull Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu
Sharded replica set


Participants:

 Description   

Over the last few months I've been getting this error, and upgrading to newer versions didn't help. Last night it happened twice, so I thought it was about time I posted something. Here is the log from the primary; during this window all queries throw an error.

Wed Sep 18 05:16:44.428 [conn701870] command admin.$cmd command: { writebacklisten: ObjectId('52302fdfc47aee5088985eb0') } ntoreturn:1 keyUpdates:0 reslen:44 300000ms
Wed Sep 18 05:17:21.627 [rsHealthPoll] DBClientCursor::init call() failed
Wed Sep 18 05:17:21.685 [rsHealthPoll] replSet info db5 is down (or slow to respond):
Wed Sep 18 05:17:21.686 [rsHealthPoll] replSet member db5 is now in state DOWN
Wed Sep 18 05:17:22.103 [rsHealthPoll] DBClientCursor::init call() failed
Wed Sep 18 05:17:22.103 [rsHealthPoll] replset info db9 heartbeat failed, retrying
Wed Sep 18 05:17:23.975 [ReplicaSetMonitorWatcher] Socket recv() timeout ip:port
Wed Sep 18 05:17:23.975 [ReplicaSetMonitorWatcher] SocketException: remote: ip:port error: 9001 socket exception [3] server [ip:port]
Wed Sep 18 05:17:23.976 [ReplicaSetMonitorWatcher] DBClientCursor::init call() failed
Wed Sep 18 05:17:25.193 [conn702234] command admin.$cmd command: { writebacklisten: ObjectId('52303e31f00d8943bc8388e0') } ntoreturn:1 keyUpdates:0 reslen:44 300000ms
Wed Sep 18 05:17:27.208 [ReplicaSetMonitorWatcher] trying reconnect to db8
Wed Sep 18 05:17:27.208 [rsHealthPoll] replset info db9 thinks that we are down
Wed Sep 18 05:17:27.208 [rsHealthPoll] replset info db5 thinks that we are down
Wed Sep 18 05:17:27.210 [rsHealthPoll] replSet member db5 is up
Wed Sep 18 05:17:27.211 [rsHealthPoll] replSet member db5 is now in state SECONDARY
Wed Sep 18 05:17:27.214 [ReplicaSetMonitorWatcher] reconnect db8 ok
Wed Sep 18 05:17:28.051 [conn702172] command admin.$cmd command: { writebacklisten: ObjectId('52303e2223cd5188967ef7c5') } ntoreturn:1 keyUpdates:0 reslen:44 300000ms
Wed Sep 18 05:17:29.212 [rsHealthPoll] replset info db5 thinks that we are down
Wed Sep 18 05:17:29.212 [rsHealthPoll] replset info db9 thinks that we are down
Wed Sep 18 05:17:29.212 [rsHealthPoll] replSet member db9 is now in state PRIMARY
Wed Sep 18 05:17:31.213 [rsHealthPoll] replSet member db9 is now in state SECONDARY
Wed Sep 18 05:17:31.893 [conn697777] command admin.$cmd command: { writebacklisten: ObjectId('522f491bbf08221ed0427b16') } ntoreturn:1 keyUpdates:0 reslen:44 300000ms
Wed Sep 18 05:17:45.111 [conn619965] command admin.$cmd command: { writebacklisten: ObjectId('51ba0ac770d333f140193082') } ntoreturn:1 keyUpdates:0 reslen:44 300000ms



 Comments   
Comment by Ranjay Krishna [ 04/Nov/13 ]

Thank you for notifying us about the fix. Please let us know if the problem comes back.

Comment by Dwayne Bull [ 12/Oct/13 ]

I seem to have found a fix to this issue.
This only happened when the web servers running php5-fpm were configured to spawn php5-fpm processes dynamically rather than statically.
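
For anyone hitting the same symptom, the dynamic-versus-static choice described here corresponds to the `pm` directive in the php5-fpm pool configuration. The fragment below is only a sketch: the file path assumes the default pool file on Ubuntu, and the `pm.max_children` value is illustrative and not taken from this report.

```ini
; /etc/php5/fpm/pool.d/www.conf (assumed default Ubuntu pool file)
[www]
; Keep a fixed number of worker processes instead of spawning them on demand.
pm = static
; Illustrative value only; size this to the web server's memory and load.
pm.max_children = 20
```

php5-fpm has to be restarted (for example with `sudo service php5-fpm restart`) before the new process manager setting takes effect.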
