[SERVER-2933] [ReplSetHealthPollTask] Assertion failure false util/message.cpp 512 Created: 12/Apr/11  Updated: 12/Jul/16  Resolved: 02/May/11

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 1.8.1
Fix Version/s: 1.9.1

Type: Bug Priority: Major - P3
Reporter: Michael Conigliaro Assignee: Kristina Chodorow (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 10.04 on EC2


Issue Links:
Depends
Operating System: Linux
Participants:

 Description   

Many thousands of these in master mongodb.log:

Tue Apr 12 07:05:23 [ReplSetHealthPollTask] MessagingPort recv() errno:104 Connection reset by peer 10.120.14.79:27018
Tue Apr 12 07:05:23 [ReplSetHealthPollTask] SocketException: remote: error: 9001 socket exception [1]
Tue Apr 12 07:05:23 [ReplSetHealthPollTask] DBClientCursor::init call() failed
Tue Apr 12 07:05:23 [ReplSetHealthPollTask] replSet info mongodb05.example.com:27018 is down (or slow to respond): DBClientBase::findOne: transport error: 10.120.14.79:27018 query: { ping: 1 }

Tue Apr 12 07:05:25 [ReplSetHealthPollTask] ERROR: MessagingPort::call() wrong id got:7c2a3e4f expect:7c2d5a40
toSend op: 2004
response msgid:2117972757
response len: 129
response op: 1
farEnd: 10.120.14.79:27018
Tue Apr 12 07:05:25 [ReplSetHealthPollTask] Assertion failure false util/message.cpp 512
0x54e8de 0x55f971 0x5593ae 0x55a5c4 0x586a94 0x598e30 0x57a69a 0x58a791 0x58633c 0x58994c 0x6e1bcb 0x6e5871 0x5650f3 0x542deb 0x5446a4 0x8b6a40 0x7f0a30bee9ca 0x7f0a3019d70d
/usr/bin/mongod(_ZN5mongo12sayDbContextEPKc+0xae) [0x54e8de]
/usr/bin/mongod(_ZN5mongo8assertedEPKcS1_j+0x111) [0x55f971]
/usr/bin/mongod(_ZN5mongo13MessagingPort4recvERKNS_7MessageERS1_+0x32e) [0x5593ae]
/usr/bin/mongod(_ZN5mongo13MessagingPort4callERNS_7MessageES2_+0x34) [0x55a5c4]
/usr/bin/mongod(_ZN5mongo18DBClientConnection4callERNS_7MessageES2_bPSs+0x34) [0x586a94]
/usr/bin/mongod(_ZN5mongo14DBClientCursor4initEv+0x150) [0x598e30]
/usr/bin/mongod(_ZN5mongo12DBClientBase5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii+0x32a) [0x57a69a]
/usr/bin/mongod(_ZN5mongo18DBClientConnection5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii+0xa1) [0x58a791]
/usr/bin/mongod(_ZN5mongo17DBClientInterface7findOneERKSsRKNS_5QueryEPKNS_7BSONObjEi+0x7c) [0x58633c]
/usr/bin/mongod(_ZN5mongo20DBClientWithCommands10runCommandERKSsRKNS_7BSONObjERS3_i+0x8c) [0x58994c]
/usr/bin/mongod(_ZN5mongo16requestHeartbeatESsSsSsRNS_7BSONObjEiRib+0x6db) [0x6e1bcb]
/usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask6doWorkEv+0x221) [0x6e5871]
/usr/bin/mongod(_ZN5mongo4task4Task3runEv+0x33) [0x5650f3]
/usr/bin/mongod(_ZN5mongo13BackgroundJob7jobBodyEN5boost10shared_ptrINS0_9JobStatusEEE+0x12b) [0x542deb]
/usr/bin/mongod(_ZN5boost6detail11thread_dataINS_3_bi6bind_tIvNS_4_mfi3mf1IvN5mongo13BackgroundJobENS_10shared_ptrINS7_9JobStatusEEEEENS2_5list2INS2_5valueIPS7_EENSD_ISA_EEEEEEE3runEv+0x74) [0x5446a4]
/usr/bin/mongod(thread_proxy+0x80) [0x8b6a40]
/lib/libpthread.so.0(+0x69ca) [0x7f0a30bee9ca]
/lib/libc.so.6(clone+0x6d) [0x7f0a3019d70d]

Many thousands of these in arbiter mongodb.log:

Tue Apr 12 07:05:27 [ReplSetHealthPollTask] MessagingPort recv() errno:104 Connection reset by peer 10.120.14.79:27018
Tue Apr 12 07:05:27 [ReplSetHealthPollTask] SocketException: remote: error: 9001 socket exception [1]
Tue Apr 12 07:05:27 [ReplSetHealthPollTask] DBClientCursor::init call() failed
Tue Apr 12 07:05:27 [ReplSetHealthPollTask] replSet info mongodb05.example.com:27018 is down (or slow to respond): DBClientBase::findOne: transport error: 10.120.14.79:27018 query: { ping: 1 }

Tue Apr 12 07:05:29 [ReplSetHealthPollTask] ERROR: MessagingPort::call() wrong id got:3897648b expect:38976a03
toSend op: 2004
response msgid:2117972758
response len: 129
response op: 1
farEnd: 10.120.14.79:27018
Tue Apr 12 07:05:29 [ReplSetHealthPollTask] Assertion failure false util/message.cpp 512
0x54e8de 0x55f971 0x5593ae 0x55a5c4 0x586a94 0x598e30 0x57a69a 0x58a791 0x58633c 0x58994c 0x6e1bcb 0x6e5871 0x5650f3 0x542deb 0x5446a4 0x8b6a40 0x7fe632d719ca 0x7fe63232070d
/usr/bin/mongod(_ZN5mongo12sayDbContextEPKc+0xae) [0x54e8de]
/usr/bin/mongod(_ZN5mongo8assertedEPKcS1_j+0x111) [0x55f971]
/usr/bin/mongod(_ZN5mongo13MessagingPort4recvERKNS_7MessageERS1_+0x32e) [0x5593ae]
/usr/bin/mongod(_ZN5mongo13MessagingPort4callERNS_7MessageES2_+0x34) [0x55a5c4]
/usr/bin/mongod(_ZN5mongo18DBClientConnection4callERNS_7MessageES2_bPSs+0x34) [0x586a94]
/usr/bin/mongod(_ZN5mongo14DBClientCursor4initEv+0x150) [0x598e30]
/usr/bin/mongod(_ZN5mongo12DBClientBase5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii+0x32a) [0x57a69a]
/usr/bin/mongod(_ZN5mongo18DBClientConnection5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii+0xa1) [0x58a791]
/usr/bin/mongod(_ZN5mongo17DBClientInterface7findOneERKSsRKNS_5QueryEPKNS_7BSONObjEi+0x7c) [0x58633c]
/usr/bin/mongod(_ZN5mongo20DBClientWithCommands10runCommandERKSsRKNS_7BSONObjERS3_i+0x8c) [0x58994c]
/usr/bin/mongod(_ZN5mongo16requestHeartbeatESsSsSsRNS_7BSONObjEiRib+0x6db) [0x6e1bcb]
/usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask6doWorkEv+0x221) [0x6e5871]
/usr/bin/mongod(_ZN5mongo4task4Task3runEv+0x33) [0x5650f3]
/usr/bin/mongod(_ZN5mongo13BackgroundJob7jobBodyEN5boost10shared_ptrINS0_9JobStatusEEE+0x12b) [0x542deb]
/usr/bin/mongod(_ZN5boost6detail11thread_dataINS_3_bi6bind_tIvNS_4_mfi3mf1IvN5mongo13BackgroundJobENS_10shared_ptrINS7_9JobStatusEEEEENS2_5list2INS2_5valueIPS7_EENSD_ISA_EEEEEEE3runEv+0x74) [0x5446a4]
/usr/bin/mongod(thread_proxy+0x80) [0x8b6a40]
/lib/libpthread.so.0(+0x69ca) [0x7fe632d719ca]
/lib/libc.so.6(clone+0x6d) [0x7fe63232070d]

The only way I've found to resolve this is to restart the servers that are logging these messages.
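For context: the assertion fires inside MessagingPort::recv()/call() when the reply read off the socket does not answer the request that was just sent. Every wire-protocol message starts with a 16-byte header (messageLength, requestID, responseTo, opCode; all little-endian int32), and call() requires the reply's responseTo to equal the request's requestID. In the log above, toSend op: 2004 is OP_QUERY and response op: 1 is OP_REPLY, so a heartbeat query was paired with a reply left over from an earlier exchange. Below is a minimal JavaScript (Node.js) sketch of that pairing check; readExact is a hypothetical helper that returns the next N bytes from the socket as a Buffer:

// Parse the standard 16-byte wire-protocol header.
function parseHeader(buf) {
  return {
    messageLength: buf.readInt32LE(0),
    requestID:     buf.readInt32LE(4),
    responseTo:    buf.readInt32LE(8),
    opCode:        buf.readInt32LE(12),
  };
}

// Hypothetical checkReply(): read one reply and enforce the invariant that
// MessagingPort::call() asserts on. If a stale reply from an abandoned
// request is still buffered on the socket, responseTo will not match.
function checkReply(readExact, sentRequestID) {
  const header = parseHeader(readExact(16));
  if (header.responseTo !== sentRequestID) {
    throw new Error('wrong id got:' + (header.responseTo >>> 0).toString(16) +
                    ' expect:' + (sentRequestID >>> 0).toString(16));
  }
  return header;
}

Once a connection is in this state, each subsequent call on it would read the previous call's reply, which would explain the endless stream of identical assertions until the process is restarted.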



 Comments   
Comment by Kristina Chodorow (Inactive) [ 02/May/11 ]

I think that this should be fixed by SERVER-3014 in most cases, and by the above changes in the remaining cases.

Comment by auto [ 02/May/11 ]

Author: Kristina Chodorow (kchodorow, kristina@10gen.com)

Message: fix recursive loop in sending data SERVER-2933
Branch: master
https://github.com/mongodb/mongo/commit/7a3a04458403713ee9cda6d4d978ebd5a25c5208
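
The commit above fixes a recursive loop in the send path; the patch itself is not reproduced here. Independent of that specific fix, the "wrong id" symptom is what you get when a connection is reused after a call failed mid-exchange, leaving an unread reply in the socket buffer. A hedged sketch of the usual defensive pattern (not necessarily what the commit does), with a hypothetical openConnection() factory returning an object with call() and close():

// Never reuse a connection once a call on it has failed mid-exchange:
// the socket's read buffer is in an unknown state, and the next call()
// would pair its request ID against a stale reply.
class SafeConnection {
  constructor(openConnection) {
    this._open = openConnection;
    this._conn = openConnection();
  }
  call(msg) {
    try {
      return this._conn.call(msg);
    } catch (err) {
      try { this._conn.close(); } catch (ignored) {}
      this._conn = this._open();  // reconnect with a clean buffer
      throw err;                  // still surface the original failure
    }
  }
}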

Comment by Michael Conigliaro [ 02/May/11 ]

This morning, I had this problem on the mongos only. The shards themselves seemed fine. It started out with tons of assert errors in the mongos.log:

Mon May 2 09:00:02 [WriteBackListener] SocketException: remote: error: 9001 socket exception [0]
Mon May 2 09:00:02 [WriteBackListener] DBClientCursor::init call() failed
Mon May 2 09:00:02 [WriteBackListener] WriteBackListener exception : std::exception
Mon May 2 09:00:02 [conn3] MessagingPort recv() errno:104 Connection reset by peer 10.120.14.79:27018
Mon May 2 09:00:02 [conn3] SocketException: remote: error: 9001 socket exception [1]
Mon May 2 09:00:02 [conn3] AssertionException in process: dbclient error communicating with server: mongodb05.example.com:27018
Mon May 2 09:00:02 [conn3] MessagingPort recv() errno:104 Connection reset by peer 10.120.14.79:27018
Mon May 2 09:00:02 [conn3] SocketException: remote: error: 9001 socket exception [1]
Mon May 2 09:00:02 [conn3] AssertionException in process: dbclient error communicating with server: mongodb05.example.com:27018
Mon May 2 09:00:02 BackgroundJob starting:
Mon May 2 09:00:03 [conn1] MessagingPort recv() errno:104 Connection reset by peer 10.120.14.79:27018
Mon May 2 09:00:03 [conn1] SocketException: remote: error: 9001 socket exception [1]
Mon May 2 09:00:03 [conn1] AssertionException in process: dbclient error communicating with server: mongodb05.example.com:27018
Mon May 2 09:00:03 [WriteBackListener] creating new connection to:mongodb05.example.com:27018
Mon May 2 09:00:03 BackgroundJob starting:
Mon May 2 09:00:04 [conn4] setShardVersion 2 2/mongodb05.example.com:27018,mongodb04.example.com:27018,mongodb04.example.com:27018 example.stats { setShardVersion: "example.stats", configdb: "mongodb-config03.example.com:27019,mongodb-config02.example.co...", version: Timestamp 80000|1, serverID: ObjectId('4dbb3143ce3b1b921b814a98'), shard: "2", shardHost: "2/mongodb05.example.com:27018,mongodb04.example.com:27018" }

0x1f601d0
Mon May 2 09:00:04 [conn4] MessagingPort recv() errno:104 Connection reset by peer 10.120.14.79:27018
Mon May 2 09:00:04 [conn4] SocketException: remote: error: 9001 socket exception [1]
Mon May 2 09:00:04 [conn4] DBClientCursor::init call() failed
Mon May 2 09:00:04 [conn4] AssertionException in process: DBClientBase::findOne: transport error: mongodb05.example.com:27018 query: { setShardVersion: "example.stats", configdb: "mongodb-config03.example.com:27019,mongodb-config02.example.co...", version: Timestamp 80000|1, serverID: ObjectId('4dbb3143ce3b1b921b814a98'), shard: "2", shardHost: "2/mongodb05.example.com:27018,mongodb04.example.com:27018" }

Mon May 2 09:00:04 BackgroundJob starting:
Mon May 2 09:00:05 BackgroundJob starting:
Mon May 2 09:00:06 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb10.example.com:27018 { setName: "4", ismaster: true, secondary: false, hosts: [ "mongodb10.example.com:27018", "mongodb11.example.com:27018" ], arbiters: [ "mongodb12.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

checking replica set: 1
Mon May 2 09:00:06 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb01.example.com:27018 { setName: "1", ismaster: true, secondary: false, hosts: [ "mongodb01.example.com:27018", "mongodb02.example.com:27018" ], arbiters: [ "mongodb03.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

checking replica set: 2
Mon May 2 09:00:06 [ReplicaSetMonitorWatcher] MessagingPort recv() errno:104 Connection reset by peer 10.120.14.79:27018
Mon May 2 09:00:06 [ReplicaSetMonitorWatcher] SocketException: remote: error: 9001 socket exception [1]
Mon May 2 09:00:06 [ReplicaSetMonitorWatcher] DBClientCursor::init call() failed
Mon May 2 09:00:06 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: caught exception mongodb05.example.com:27018 DBClientBase::findOne: transport error: mongodb05.example.com:27018 query: { ismaster: 1 }

Mon May 2 09:00:06 [ReplicaSetMonitorWatcher] _check : 2/mongodb05.example.com:27018,mongodb04.example.com:27018,mongodb04.example.com:27018
Mon May 2 09:00:06 [ReplicaSetMonitorWatcher] trying reconnect to mongodb05.example.com:27018
Mon May 2 09:00:06 BackgroundJob starting:
Mon May 2 09:00:06 [ReplicaSetMonitorWatcher] reconnect mongodb05.example.com:27018 ok
Mon May 2 09:00:06 [ReplicaSetMonitorWatcher] ERROR: ReplicaSetMonitor::_checkConnection: mongodb05.example.com:27018 { setName: "2", ismaster: false, secondary: true, hosts: [ "mongodb05.example.com:27018", "mongodb04.example.com:27018" ], arbiters: [ "mongodb06.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }
MessagingPort::call() wrong id got:1c0d6d3e expect:1c0fd17b
toSend op: 2004
response msgid:2368415436
response len: 258
response op: 1
farEnd: 10.84.55.204:27018
Mon May 2 09:00:06 [ReplicaSetMonitorWatcher] Assertion failure false util/message.cpp 512
0x50f09e 0x520131 0x519b6e 0x51ad84 0x547254 0x5595f0 0x53ae5a 0x54af51 0x546afc 0x54a10c 0x53bfd5 0x54fd78 0x5507fd 0x550b92 0x550e44 0x556045 0x5035ab 0x504e64 0x69ec30 0x7f73620eb9ca
/usr/bin/mongos(_ZN5mongo12sayDbContextEPKc+0xae) [0x50f09e]
/usr/bin/mongos(_ZN5mongo8assertedEPKcS1_j+0x111) [0x520131]
/usr/bin/mongos(_ZN5mongo13MessagingPort4recvERKNS_7MessageERS1_+0x32e) [0x519b6e]
/usr/bin/mongos(_ZN5mongo13MessagingPort4callERNS_7MessageES2_+0x34) [0x51ad84]
/usr/bin/mongos(_ZN5mongo18DBClientConnection4callERNS_7MessageES2_bPSs+0x34) [0x547254]
/usr/bin/mongos(_ZN5mongo14DBClientCursor4initEv+0x150) [0x5595f0]
/usr/bin/mongos(_ZN5mongo12DBClientBase5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii+0x32a) [0x53ae5a]
/usr/bin/mongos(_ZN5mongo18DBClientConnection5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii+0xa1) [0x54af51]
/usr/bin/mongos(_ZN5mongo17DBClientInterface7findOneERKSsRKNS_5QueryEPKNS_7BSONObjEi+0x7c) [0x546afc]
/usr/bin/mongos(_ZN5mongo20DBClientWithCommands10runCommandERKSsRKNS_7BSONObjERS3_i+0x8c) [0x54a10c]
/usr/bin/mongos(_ZN5mongo20DBClientWithCommands8isMasterERbPNS_7BSONObjE+0x75) [0x53bfd5]
/usr/bin/mongos(_ZN5mongo17ReplicaSetMonitor16_checkConnectionEPNS_18DBClientConnectionERSsb+0x88) [0x54fd78]
/usr/bin/mongos(_ZN5mongo17ReplicaSetMonitor6_checkEv+0x14d) [0x5507fd]
/usr/bin/mongos(_ZN5mongo17ReplicaSetMonitor5checkEv+0x62) [0x550b92]
/usr/bin/mongos(_ZN5mongo17ReplicaSetMonitor8checkAllEv+0x284) [0x550e44]
/usr/bin/mongos(_ZN5mongo24ReplicaSetMonitorWatcher3runEv+0x55) [0x556045]
/usr/bin/mongos(_ZN5mongo13BackgroundJob7jobBodyEN5boost10shared_ptrINS0_9JobStatusEEE+0x12b) [0x5035ab]
/usr/bin/mongos(_ZN5boost6detail11thread_dataINS_3_bi6bind_tIvNS_4_mfi3mf1IvN5mongo13BackgroundJobENS_10shared_ptrINS7_9JobStatusEEEEENS2_5list2INS2_5valueIPS7_EENSD_ISA_EEEEEEE3runEv+0x74) [0x504e64]
/usr/bin/mongos(thread_proxy+0x80) [0x69ec30]
/lib/libpthread.so.0(+0x69ca) [0x7f73620eb9ca]

When I look at the mongodb.log at the same time, I see this:

Mon May 2 09:00:02 [conn78767] replSet info stepping down as primary secs=60
Mon May 2 09:00:02 [conn78767] replSet relinquishing primary state
Mon May 2 09:00:02 [conn78767] replSet SECONDARY
Mon May 2 09:00:02 [conn78767] replSet closing client sockets after reqlinquishing primary
Mon May 2 09:00:02 [conn78767] SocketException in connThread, closing client connection
Mon May 2 09:00:02 [conn78771] assertion 13435 not master and slaveok=false ns:example.ads query: { adId: "hal_livingSocial_cookbook_dfa_baselineRotation06" }

Mon May 2 09:00:02 [conn78771] ntoskip:0 ntoreturn:-1

9:00 GMT is when I run backups on my slaves (mongodb04 is the master and mongodb05 is the slave). The method I'm using to back up is described here:

http://conigliaro.org/2010/10/28/simple-automated-backups-for-mongodb-replica-sets/

Basically, I make sure that the backup host is a slave before running the backup:

if (rs.status()['myState'] == 1) {
  print('Host is master (stepping down)');
  rs.stepDown();
  while (rs.status()['myState'] != 2) {
    sleep(1000);
  }
}
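
One caveat with the script above: it returns as soon as the local node reports myState == 2 (SECONDARY), which can be before any other member has been elected primary, so the backup (and client traffic) can race the election. A hedged variant for the mongo shell that also waits until some member of the set reports state 1 (PRIMARY); the members/state fields are as reported by rs.status():

if (rs.status()['myState'] == 1) {
  print('Host is master (stepping down)');
  rs.stepDown();
  // Wait for this node to become SECONDARY...
  while (rs.status()['myState'] != 2) {
    sleep(1000);
  }
  // ...and for some member to become PRIMARY before starting the backup.
  var havePrimary = function() {
    return rs.status()['members'].some(function(m) { return m.state == 1; });
  };
  while (!havePrimary()) {
    sleep(1000);
  }
}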

Apparently, failover is just not working properly. Everything looks good in the mongodb.log after the stepdown (aside from a brief period where queries continue hitting the slave for some reason), but the mongos on my app servers seem to get stuck in a state that blocks subsequent queries until I restart them.

Comment by Michael Conigliaro [ 28/Apr/11 ]

For what it's worth, I'm seeing very similar exceptions in my mongos logs too:

Thu Apr 28 12:39:38 [conn6264] ERROR: MessagingPort::call() wrong id got:9992efa4 expect:9992efd9
toSend op: 2004
response msgid:1389108114
response len: 258
response op: 1
farEnd: 10.76.57.113:27018
Thu Apr 28 12:39:38 [conn6264] Assertion failure false util/message.cpp 512
0x50f09e 0x520131 0x519b6e 0x51ad84 0x547254 0x5595f0 0x53ae5a 0x54af51 0x546afc 0x54a10c 0x53bfd5 0x54fd78 0x5507fd 0x5510a2 0x55134d 0x552cb6 0x53b41d 0x5332d2 0x575665 0x575886
/usr/bin/mongos(_ZN5mongo12sayDbContextEPKc+0xae) [0x50f09e]
/usr/bin/mongos(_ZN5mongo8assertedEPKcS1_j+0x111) [0x520131]
/usr/bin/mongos(_ZN5mongo13MessagingPort4recvERKNS_7MessageERS1_+0x32e) [0x519b6e]
/usr/bin/mongos(_ZN5mongo13MessagingPort4callERNS_7MessageES2_+0x34) [0x51ad84]
/usr/bin/mongos(_ZN5mongo18DBClientConnection4callERNS_7MessageES2_bPSs+0x34) [0x547254]
/usr/bin/mongos(_ZN5mongo14DBClientCursor4initEv+0x150) [0x5595f0]
/usr/bin/mongos(_ZN5mongo12DBClientBase5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii+0x32a) [0x53ae5a]
/usr/bin/mongos(_ZN5mongo18DBClientConnection5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii+0xa1) [0x54af51]
/usr/bin/mongos(_ZN5mongo17DBClientInterface7findOneERKSsRKNS_5QueryEPKNS_7BSONObjEi+0x7c) [0x546afc]
/usr/bin/mongos(_ZN5mongo20DBClientWithCommands10runCommandERKSsRKNS_7BSONObjERS3_i+0x8c) [0x54a10c]
/usr/bin/mongos(_ZN5mongo20DBClientWithCommands8isMasterERbPNS_7BSONObjE+0x75) [0x53bfd5]
/usr/bin/mongos(_ZN5mongo17ReplicaSetMonitor16_checkConnectionEPNS_18DBClientConnectionERSsb+0x88) [0x54fd78]
/usr/bin/mongos(_ZN5mongo17ReplicaSetMonitor6_checkEv+0x14d) [0x5507fd]
/usr/bin/mongos(_ZN5mongo17ReplicaSetMonitor9getMasterEv+0x72) [0x5510a2]
/usr/bin/mongos(_ZN5mongo18DBClientReplicaSet11checkMasterEv+0x3d) [0x55134d]
/usr/bin/mongos(_ZN5mongo18DBClientReplicaSet7connectEv+0x16) [0x552cb6]
/usr/bin/mongos(_ZNK5mongo16ConnectionString7connectERSs+0x1cd) [0x53b41d]
/usr/bin/mongos(_ZN5mongo16DBConnectionPool3getERKSs+0xd2) [0x5332d2]
/usr/bin/mongos(_ZN5mongo15ShardConnection5_initEv+0x305) [0x575665]
/usr/bin/mongos(_ZN5mongo15ShardConnectionC1ERKSsS2_+0x76) [0x575886]
Thu Apr 28 12:39:38 [conn6264] ReplicaSetMonitor::_checkConnection: caught exception mongodb01.example.com:27018 assertion util/message.cpp:512

Comment by Erez Zarum [ 28/Apr/11 ]

I'm having these issues as well: random mongos crashing, and random shards (with replica sets) crashing.
I am running a big cluster in production and am having doubts right now about what to do.
This looks critical to me, since the processes simply crash.

Comment by Michael Conigliaro [ 19/Apr/11 ]

I have more info about this. Last night, the master and slaves in my replica set basically went offline due to high load (an unrelated issue, I suspect). I ended up having to restart one of the servers, and when it came back up, it started replaying its journal. This took longer than expected, so when Monit noticed that MongoDB wasn't responding, it restarted the service. It was at this point that I started seeing the above exceptions on the arbiter of my replica set.
