This morning, I had this problem on the mongos only. The shards themselves seemed fine. It started out with tons of assert errors in the mongos.log:
Mon May 2 09:00:02 [WriteBackListener] SocketException: remote: error: 9001 socket exception [0]
Mon May 2 09:00:02 [WriteBackListener] DBClientCursor::init call() failed
Mon May 2 09:00:02 [WriteBackListener] WriteBackListener exception : std::exception
Mon May 2 09:00:02 [conn3] MessagingPort recv() errno:104 Connection reset by peer 10.120.14.79:27018
Mon May 2 09:00:02 [conn3] SocketException: remote: error: 9001 socket exception [1]
Mon May 2 09:00:02 [conn3] AssertionException in process: dbclient error communicating with server: mongodb05.example.com:27018
Mon May 2 09:00:02 [conn3] MessagingPort recv() errno:104 Connection reset by peer 10.120.14.79:27018
Mon May 2 09:00:02 [conn3] SocketException: remote: error: 9001 socket exception [1]
Mon May 2 09:00:02 [conn3] AssertionException in process: dbclient error communicating with server: mongodb05.example.com:27018
Mon May 2 09:00:02 BackgroundJob starting:
Mon May 2 09:00:03 [conn1] MessagingPort recv() errno:104 Connection reset by peer 10.120.14.79:27018
Mon May 2 09:00:03 [conn1] SocketException: remote: error: 9001 socket exception [1]
Mon May 2 09:00:03 [conn1] AssertionException in process: dbclient error communicating with server: mongodb05.example.com:27018
Mon May 2 09:00:03 [WriteBackListener] creating new connection to:mongodb05.example.com:27018
Mon May 2 09:00:03 BackgroundJob starting:
Mon May 2 09:00:04 [conn4] setShardVersion 2 2/mongodb05.example.com:27018,mongodb04.example.com:27018,mongodb04.example.com:27018 example.stats
{ setShardVersion: "example.stats", configdb: "mongodb-config03.example.com:27019,mongodb-config02.example.co...", version: Timestamp 80000|1,
serverID: ObjectId('4dbb3143ce3b1b921b814a98'), shard: "2", shardHost: "2/mongodb05.example.com:27018,mongodb04.example.com:27018" }
0x1f601d0
Mon May 2 09:00:04 [conn4] MessagingPort recv() errno:104 Connection reset by peer 10.120.14.79:27018
Mon May 2 09:00:04 [conn4] SocketException: remote: error: 9001 socket exception [1]
Mon May 2 09:00:04 [conn4] DBClientCursor::init call() failed
Mon May 2 09:00:04 [conn4] AssertionException in process: DBClientBase::findOne: transport error: mongodb05.example.com:27018 query:
{ setShardVersion: "example.stats", configdb: "mongodb-config03.example.com:27019,mongodb-config02.example.co...", version: Timestamp 80000|1, serverID: ObjectId('4dbb3143ce
3b1b921b814a98'), shard: "2", shardHost: "2/mongodb05.example.com:27018,mongodb04.example.com:27018" }
Mon May 2 09:00:04 BackgroundJob starting:
Mon May 2 09:00:05 BackgroundJob starting:
Mon May 2 09:00:06 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb10.example.com:27018
{ setName: "4", ismaster: true, secondary: false, hosts: [ "mongodb10.example.com:27018", "mongodb11.example.com:27018" ], arbiters: [ "mongodb12.example.com:27018" ], maxBsonObjectSize: 16777216
, ok: 1.0 }
checking replica set: 1
Mon May 2 09:00:06 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb01.example.com:27018
{ setName: "1", ismaster: true, secondary: false, hosts: [ "mongodb01.example.com:27018", "mongodb02.example.com:27018" ], arbiters: [ "mongodb03.example.com:27018" ], maxBsonObjectSize: 16777216
, ok: 1.0 }
checking replica set: 2
Mon May 2 09:00:06 [ReplicaSetMonitorWatcher] MessagingPort recv() errno:104 Connection reset by peer 10.120.14.79:27018
Mon May 2 09:00:06 [ReplicaSetMonitorWatcher] SocketException: remote: error: 9001 socket exception [1]
Mon May 2 09:00:06 [ReplicaSetMonitorWatcher] DBClientCursor::init call() failed
Mon May 2 09:00:06 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: caught exception mongodb05.example.com:27018 DBClientBase::findOne: transport error: mongodb05.example.com:27018 query:
{ ismaster: 1 }
Mon May 2 09:00:06 [ReplicaSetMonitorWatcher] _check : 2/mongodb05.example.com:27018,mongodb04.example.com:27018,mongodb04.example.com:27018
Mon May 2 09:00:06 [ReplicaSetMonitorWatcher] trying reconnect to mongodb05.example.com:27018
Mon May 2 09:00:06 BackgroundJob starting:
Mon May 2 09:00:06 [ReplicaSetMonitorWatcher] reconnect mongodb05.example.com:27018 ok
Mon May 2 09:00:06 [ReplicaSetMonitorWatcher] ERROR: ReplicaSetMonitor::_checkConnection: mongodb05.example.com:27018
{ setName: "2", ismaster: false, secondary: true, hosts: [ "mongodb05.example.com:27018", "mongodb04.example.com:27018" ], arbiters: [ "mongodb06.example.com:27018" ], maxBsonObjectSize: 1
6777216, ok: 1.0 }
MessagingPort::call() wrong id got:1c0d6d3e expect:1c0fd17b
toSend op: 2004
response msgid:2368415436
response len: 258
response op: 1
farEnd: 10.84.55.204:27018
Mon May 2 09:00:06 [ReplicaSetMonitorWatcher] Assertion failure false util/message.cpp 512
0x50f09e 0x520131 0x519b6e 0x51ad84 0x547254 0x5595f0 0x53ae5a 0x54af51 0x546afc 0x54a10c 0x53bfd5 0x54fd78 0x5507fd 0x550b92 0x550e44 0x556045 0x5035ab 0x504e64 0x69ec30 0x7f73620eb9ca
/usr/bin/mongos(_ZN5mongo12sayDbContextEPKc+0xae) [0x50f09e]
/usr/bin/mongos(_ZN5mongo8assertedEPKcS1_j+0x111) [0x520131]
/usr/bin/mongos(_ZN5mongo13MessagingPort4recvERKNS_7MessageERS1_+0x32e) [0x519b6e]
/usr/bin/mongos(_ZN5mongo13MessagingPort4callERNS_7MessageES2_+0x34) [0x51ad84]
/usr/bin/mongos(_ZN5mongo18DBClientConnection4callERNS_7MessageES2_bPSs+0x34) [0x547254]
/usr/bin/mongos(_ZN5mongo14DBClientCursor4initEv+0x150) [0x5595f0]
/usr/bin/mongos(_ZN5mongo12DBClientBase5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii+0x32a) [0x53ae5a]
/usr/bin/mongos(_ZN5mongo18DBClientConnection5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii+0xa1) [0x54af51]
/usr/bin/mongos(_ZN5mongo17DBClientInterface7findOneERKSsRKNS_5QueryEPKNS_7BSONObjEi+0x7c) [0x546afc]
/usr/bin/mongos(_ZN5mongo20DBClientWithCommands10runCommandERKSsRKNS_7BSONObjERS3_i+0x8c) [0x54a10c]
/usr/bin/mongos(_ZN5mongo20DBClientWithCommands8isMasterERbPNS_7BSONObjE+0x75) [0x53bfd5]
/usr/bin/mongos(_ZN5mongo17ReplicaSetMonitor16_checkConnectionEPNS_18DBClientConnectionERSsb+0x88) [0x54fd78]
/usr/bin/mongos(_ZN5mongo17ReplicaSetMonitor6_checkEv+0x14d) [0x5507fd]
/usr/bin/mongos(_ZN5mongo17ReplicaSetMonitor5checkEv+0x62) [0x550b92]
/usr/bin/mongos(_ZN5mongo17ReplicaSetMonitor8checkAllEv+0x284) [0x550e44]
/usr/bin/mongos(_ZN5mongo24ReplicaSetMonitorWatcher3runEv+0x55) [0x556045]
/usr/bin/mongos(_ZN5mongo13BackgroundJob7jobBodyEN5boost10shared_ptrINS0_9JobStatusEEE+0x12b) [0x5035ab]
/usr/bin/mongos(_ZN5boost6detail11thread_dataINS_3_bi6bind_tIvNS_4_mfi3mf1IvN5mongo13BackgroundJobENS_10shared_ptrINS7_9JobStatusEEEEENS2_5list2INS2_5valueIPS7_EENSD_ISA_EEEEEEE3runEv+0x74) [0x504e64]
/usr/bin/mongos(thread_proxy+0x80) [0x69ec30]
/lib/libpthread.so.0(+0x69ca) [0x7f73620eb9ca]
When I look at the mongodb.log at the same time, I see this:
Mon May 2 09:00:02 [conn78767] replSet info stepping down as primary secs=60
Mon May 2 09:00:02 [conn78767] replSet relinquishing primary state
Mon May 2 09:00:02 [conn78767] replSet SECONDARY
Mon May 2 09:00:02 [conn78767] replSet closing client sockets after reqlinquishing primary
Mon May 2 09:00:02 [conn78767] SocketException in connThread, closing client connection
Mon May 2 09:00:02 [conn78771] assertion 13435 not master and slaveok=false ns:example.ads query:
{ adId: "hal_livingSocial_cookbook_dfa_baselineRotation06" }
Mon May 2 09:00:02 [conn78771] ntoskip:0 ntoreturn:-1
9:00 GMT is when I run backups on my slaves (mongodb04 is the master and mongodb05 is the slave). The method I'm using to back up is described here:
http://conigliaro.org/2010/10/28/simple-automated-backups-for-mongodb-replica-sets/
Basically, I make sure that the backup host is a slave before running the backup:
// myState 1 = PRIMARY, 2 = SECONDARY
if (rs.status()['myState'] == 1) {
    print('Host is master (stepping down)');
    rs.stepDown();
    // wait until this node has actually dropped back to SECONDARY
    while (rs.status()['myState'] != 2) {
        sleep(1000);
    }
}
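A rough sketch of something I'm thinking about adding to that check (not verified, and the 60-second cap is an arbitrary number I picked): after stepping down, also wait until some other member of the set reports itself as primary before the dump starts, so the backup doesn't begin while the set is still mid-election.
// Sketch only: wait for another member to become PRIMARY (state 1) while this
// node stays SECONDARY (state 2). The 60-second cap is just an example value.
var waited = 0;
while (waited < 60) {
    var status = rs.status();
    var havePrimary = false;
    for (var i = 0; i < status.members.length; i++) {
        if (status.members[i].state == 1) {
            havePrimary = true;
        }
    }
    if (havePrimary && status.myState == 2) {
        break;
    }
    sleep(1000);
    waited++;
}
if (waited >= 60) {
    print('Timed out waiting for a new primary');
}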
Apparently, failover just isn't working properly. Everything looks fine in the mongodb.log after the stepdown (aside from a brief period where queries keep hitting the slave for some reason), but the mongos processes on my app servers get stuck in a state that blocks subsequent queries until I restart them.
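One diagnostic I'm considering (not something shown in the logs above) is asking the stuck mongos which shard hosts it still has pooled connections to; connPoolStats is a standard admin command, but whether it actually exposes the stale state here is just my assumption.
// Connect the mongo shell to the stuck mongos (not a shard) and run:
// connPoolStats reports the shard hosts this router has pooled connections to;
// I'm assuming a stale entry for the stepped-down member would show up here.
db.getSisterDB('admin').runCommand({ connPoolStats: 1 })
So far, restarting mongos is the only thing that clears it.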