[SERVER-6867] Too stale to catch up and 9001 socket exception Created: 27/Aug/12  Updated: 12/Sep/12  Resolved: 12/Sep/12

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.0.6, 2.0.7
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Jianfeng Xu Assignee: Gary Murakami
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS release 5.5 (Final)
mongodb-linux-x86_64-2.0.6.tgz
mongodb-linux-x86_64-2.0.7.tgz


Attachments: Text File mongod_master.log, Text File mongod_secondary.log
Operating System: Linux
Participants:

 Description   

Hi all

There is a replica set with 2 mongod data nodes and 1 arbiter. mongostat shows an average of about 2000 updates per second on the master node.

Oplog size (all mongod nodes): 10 GB
Master syncdelay: 100s
Secondary syncdelay: 150s
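
For context, a minimal sketch of how a node with these settings might be started; the --dbpath and --logpath values are placeholders, and only the set name, port, oplog size, and syncdelay come from this report:

mongod --replSet younimsgset --port 27032 --oplogSize 10000 --syncdelay 100 \
    --dbpath /data/db --logpath /var/log/mongod.log --fork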

At the request peak, the secondary node reports "syncing to: 10.136.24.88:27032" on the MongoDB web interface; a moment later it reports "too stale to catch up". The network itself is fine, since all nodes are on the same LAN, but I found the following in the secondary log:

Mon Aug 27 13:10:37 [rsSync] Socket recv() timeout 10.136.24.88:27032
Mon Aug 27 13:10:37 [rsSync] SocketException: remote: 10.136.24.88:27032 error: 9001 socket exception [3] server [10.136.24.88:27032]
Mon Aug 27 13:10:37 [rsSync] DBClientCursor::init call() failed
Mon Aug 27 13:10:38 [rsSync] replSet syncing to: 10.136.24.88:27032

I also found that the optime of the secondary's oplog is not updated while the secondary node reports "syncing to: 10.136.24.88:27032" on the MongoDB web interface.
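
For reference, the newest applied entry can be read straight from the secondary's oplog to confirm the stalled optime; a minimal mongo-shell sketch:

SECONDARY> use local
SECONDARY> db.oplog.rs.find().sort({ $natural : -1 }).limit(1)  // newest (last applied) oplog entry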

I reproduced this on both MongoDB 2.0.6 and 2.0.7.
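
The lag itself can also be computed from rs.status(); a rough sketch against the 2.0 output shown in the comments below, where optime.t is milliseconds since the epoch:

SECONDARY> var s = rs.status()
SECONDARY> var primary = s.members.filter(function (m) { return m.state === 1; })[0]
SECONDARY> var me = s.members.filter(function (m) { return m.self; })[0]
SECONDARY> print("replication lag (secs): " + (primary.optime.t - me.optime.t) / 1000)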



 Comments   
Comment by Ian Whalen (Inactive) [ 12/Sep/12 ]

Glad to hear it, Jianfeng. Please reopen if you do encounter this again in 2.2.

Comment by Jianfeng Xu [ 10/Sep/12 ]

It was solved by upgrading to version 2.2.0. Thanks!

Comment by Jianfeng Xu [ 03/Sep/12 ]

Hi, I tried reconfiguring my replica set, but the problem is the same. I have attached the logs from all nodes. Please diagnose this problem. Thank you!
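
My understanding is that a member that is genuinely too stale cannot catch up through reconfiguration alone and needs a full resync; a sketch of the standard procedure, with placeholder paths:

# on the stale secondary
kill <mongod-pid>                                  # stop the member cleanly
rm -rf /data/db/*                                  # wipe its data files (placeholder dbpath)
mongod --replSet younimsgset --port 27032 --dbpath /data/db   # restart; initial sync repopulates the node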

Comment by Gary Murakami [ 30/Aug/12 ]

@Ian - given my schedule, I won't be able to start looking at this until next Wed.

Comment by Jianfeng Xu [ 28/Aug/12 ]

################################################
mongostat messages:

[root@younimsg-mongodb-89 ~]# mongostat --host=10.136.24.88:27032
connected to: 10.136.24.88:27032
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn set repl time
0 0 1965 0 0 2 0 20g 20.5g 17.6g 0 95.5 0 1|63 1|64 432k 1k 69 younimsgset M 17:12:09
0 0 1954 0 0 3 0 20g 20.5g 17.6g 0 95.6 0 0|63 1|64 430k 1k 69 younimsgset M 17:12:10
0 0 1979 0 0 2 0 20g 20.5g 17.6g 0 95.5 0 0|63 1|64 435k 1k 69 younimsgset M 17:12:11
0 0 1958 0 0 2 0 20g 20.5g 17.6g 0 95.7 0 0|63 1|64 430k 1k 69 younimsgset M 17:12:12
0 0 1970 0 0 2 0 20g 20.5g 17.6g 0 95.6 0 0|63 1|64 433k 1k 69 younimsgset M 17:12:13
0 0 1963 0 0 2 0 20g 20.5g 17.6g 0 95.6 0 0|63 1|64 432k 1k 69 younimsgset M 17:12:14
0 0 1970 0 0 3 0 20g 20.5g 17.6g 0 95.6 0 0|63 1|64 433k 1k 69 younimsgset M 17:12:15
0 0 1969 0 0 2 0 20g 20.5g 17.6g 0 95.6 0 0|63 1|64 432k 1k 69 younimsgset M 17:12:16
0 0 1962 0 0 2 0 20g 20.5g 17.6g 0 95.6 0 0|63 1|64 431k 1k 69 younimsgset M 17:12:17

################################################
master node messages:

PRIMARY> rs.status()
{
    "set" : "younimsgset",
    "date" : ISODate("2012-08-28T09:12:52Z"),
    "myState" : 1,
    "members" : [
        {
            "_id" : 0,
            "name" : "10.136.24.88:27032",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "optime" : { "t" : 1346145172000, "i" : 1230 },
            "optimeDate" : ISODate("2012-08-28T09:12:52Z"),
            "self" : true
        },
        {
            "_id" : 2,
            "name" : "10.136.24.89:37032",
            "health" : 1,
            "state" : 7,
            "stateStr" : "ARBITER",
            "uptime" : 88389,
            "optime" : { "t" : 0, "i" : 0 },
            "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
            "lastHeartbeat" : ISODate("2012-08-28T09:12:51Z"),
            "pingMs" : 0
        },
        {
            "_id" : 3,
            "name" : "10.136.24.89:27032",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 88301,
            "optime" : { "t" : 1346139793000, "i" : 970 },
            "optimeDate" : ISODate("2012-08-28T07:43:13Z"),
            "lastHeartbeat" : ISODate("2012-08-28T09:12:51Z"),
            "pingMs" : 0
        }
    ],
    "ok" : 1
}
PRIMARY> db.printReplicationInfo()
configured oplog size: 10000MB
log length start to end: 23900secs (6.64hrs)
oplog first event time: Tue Aug 28 2012 10:34:40 GMT+0800 (CST)
oplog last event time: Tue Aug 28 2012 17:13:00 GMT+0800 (CST)
now: Tue Aug 28 2012 17:13:00 GMT+0800 (CST)
PRIMARY> db.printSlaveReplicationInfo()
source: 10.136.24.89:37032
syncedTo: Thu Jan 01 1970 08:00:00 GMT+0800 (CST)
= 1346145186 secs ago (373929.22hrs)
source: 10.136.24.89:27032
syncedTo: Tue Aug 28 2012 15:43:13 GMT+0800 (CST)
= 5393 secs ago (1.5hrs)

################################################
secondary node messages:

SECONDARY> rs.status()
{
    "set" : "younimsgset",
    "date" : ISODate("2012-08-28T09:14:17Z"),
    "myState" : 2,
    "syncingTo" : "10.136.24.88:27032",
    "members" : [
        {
            "_id" : 0,
            "name" : "10.136.24.88:27032",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 88376,
            "optime" : { "t" : 1346145255000, "i" : 1245 },
            "optimeDate" : ISODate("2012-08-28T09:14:15Z"),
            "lastHeartbeat" : ISODate("2012-08-28T09:14:15Z"),
            "pingMs" : 0
        },
        {
            "_id" : 2,
            "name" : "10.136.24.89:37032",
            "health" : 1,
            "state" : 7,
            "stateStr" : "ARBITER",
            "uptime" : 88376,
            "optime" : { "t" : 0, "i" : 0 },
            "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
            "lastHeartbeat" : ISODate("2012-08-28T09:14:15Z"),
            "pingMs" : 0
        },
        {
            "_id" : 3,
            "name" : "10.136.24.89:27032",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "optime" : { "t" : 1346139793000, "i" : 970 },
            "optimeDate" : ISODate("2012-08-28T07:43:13Z"),
            "self" : true
        }
    ],
    "ok" : 1
}
SECONDARY> db.printReplicationInfo()
configured oplog size: 10000MB
log length start to end: 24108secs (6.7hrs)
oplog first event time: Tue Aug 28 2012 09:01:25 GMT+0800 (CST)
oplog last event time: Tue Aug 28 2012 15:43:13 GMT+0800 (CST)
now: Tue Aug 28 2012 17:14:20 GMT+0800 (CST)
SECONDARY> db.printSlaveReplicationInfo()
source: 10.136.24.89:37032
syncedTo: Thu Jan 01 1970 08:00:00 GMT+0800 (CST)
= 1346145262 secs ago (373929.24hrs)
source: 10.136.24.89:27032
syncedTo: Tue Aug 28 2012 15:43:13 GMT+0800 (CST)
= 5469 secs ago (1.52hrs)

################################################
master node log:
Tue Aug 28 17:11:29 [initandlisten] connection accepted from 10.136.24.89:40425 #208968
Tue Aug 28 17:11:30 [conn204217] getmore local.oplog.rs query: { ts: { $gte: new Date(5781371733873262939) } } cursorid:879668613180316522 nreturned:17974 reslen:4194382 679383ms
Tue Aug 28 17:11:30 [conn204217] SocketException handling request, closing client connection: 9001 socket exception [2] server [10.136.24.89:31540]
Tue Aug 28 17:11:47 [initandlisten] connection accepted from 10.136.24.89:40426 #208969
################################################
secondary node log:
Tue Aug 28 17:10:06 [conn6250] end connection 127.0.0.1:8040
Tue Aug 28 17:10:11 [rsSync] Socket recv() timeout 10.136.24.88:27032
Tue Aug 28 17:10:11 [rsSync] SocketException: remote: 10.136.24.88:27032 error: 9001 socket exception [3] server [10.136.24.88:27032]
Tue Aug 28 17:10:11 [rsSync] replSet syncThread: 10278 dbclient error communicating with server: 10.136.24.88:27032
Tue Aug 28 17:10:20 [conn6248] end connection 10.136.24.89:32296

################################################

Thanks for your reply! Please check the messages I posted above. Any suggestions are welcome.

Comment by Scott Hernandez (Inactive) [ 27/Aug/12 ]

Please post mongostat numbers from the primary during this time, and also rs.status(), db.printReplicationInfo() and db.printSlaveReplicationInfo() from both nodes.

Comment by Jianfeng Xu [ 27/Aug/12 ]

the "9001 socket exception" error is reporting one time per ten minutes on secondary node.

Mon Aug 27 13:40:40 [rsSync] SocketException: remote: 10.136.24.88:27032 error: 9001 socket exception [3] server [10.136.24.88:27032]
Mon Aug 27 13:40:40 [rsSync] DBClientCursor::init call() failed
Mon Aug 27 13:40:41 [rsSync] replSet syncing to: 10.136.24.88:27032
Mon Aug 27 13:41:01 [conn3980] end connection 10.136.24.88:15884
............
Mon Aug 27 13:50:41 [rsSync] SocketException: remote: 10.136.24.88:27032 error: 9001 socket exception [3] server [10.136.24.88:27032]
Mon Aug 27 13:50:41 [rsSync] DBClientCursor::init call() failed
Mon Aug 27 13:50:42 [rsSync] replSet syncing to: 10.136.24.88:27032
Mon Aug 27 13:51:01 [conn4020] end connection 10.136.24.88:15926

Master node error:
Mon Aug 27 15:00:04 [conn201151] query local.oplog.rs query: { ts: { $gte: new Date(5781165171716130065) } } nscanned:102 nreturned:101 reslen:23531 11376977ms
Mon Aug 27 15:00:04 [conn201151] SocketException handling request, closing client connection: 9001 socket exception [2] server [10.136.24.100:21728]
Mon Aug 27 15:00:07 [conn201194] query local.oplog.rs query: { ts: { $gte: new Date(5781165171716130065) } } nscanned:102 nreturned:101 reslen:23531 10778593ms
Mon Aug 27 15:00:07 [conn201194] SocketException handling request, closing client connection: 9001 socket exception [2] server [10.136.24.100:21769]
Mon Aug 27 15:00:07 [conn201237] query local.oplog.rs query: { ts: { $gte: new Date(5781165171716130065) } } nscanned:102 nreturned:101 reslen:23531 10177833ms
Mon Aug 27 15:00:07 [conn201237] SocketException handling request, closing client connection: 9001 socket exception [2] server [10.136.24.100:21810]
Mon Aug 27 15:00:08 [conn201280] query local.oplog.rs query: { ts: { $gte: new Date(5781165171716130065) } } nscanned:102 nreturned:101 reslen:23531 9577827ms
Mon Aug 27 15:00:08 [conn201280] SocketException handling request, closing client connection: 9001 socket exception [2] server [10.136.24.100:21851]
.......................
Mon Aug 27 15:56:32 [conn202100] Assertion: 15889:FindingStartCursor::recoverFromYield() failed to recover
Mon Aug 27 15:56:32 [conn202100] assertion 15889 FindingStartCursor::recoverFromYield() failed to recover ns:local.oplog.rs query:{ ts: { $gte: new Date(5781165171716130065) } }
Mon Aug 27 15:56:32 [conn202100] query local.oplog.rs query: { ts: { $gte: new Date(5781165171716130065) } } exception: FindingStartCursor::recoverFromYield() failed to recover code:15889 reslen:102 1543311ms
Mon Aug 27 15:56:32 [conn202100] SocketException handling request, closing client connection: 9001 socket exception [2] server [10.136.24.100:22645]
Mon Aug 27 15:56:36 [conn202211] Assertion: 15889:FindingStartCursor::recoverFromYield() failed to recover
Mon Aug 27 15:56:36 [conn202211] assertion 15889 FindingStartCursor::recoverFromYield() failed to recover ns:local.oplog.rs query:{ ts: { $gte: new Date(5781165171716130065) } }
Mon Aug 27 15:56:36 [conn202211] query local.oplog.rs query: { ts: { $gte: new Date(5781165171716130065) } } exception: FindingStartCursor::recoverFromYield() failed to recover code:15889 reslen:102 10149ms
Mon Aug 27 15:56:36 [conn202211] end connection 10.136.24.100:22752
Mon Aug 27 15:56:36 [conn202210] end connection 10.136.24.89:27979
Mon Aug 27 15:56:36 [initandlisten] connection accepted from 10.136.24.89:27981 #202213
Mon Aug 27 15:56:41 [conn202057] Assertion: 15889:FindingStartCursor::recoverFromYield() failed to recover
Mon Aug 27 15:56:41 [conn202057] assertion 15889 FindingStartCursor::recoverFromYield() failed to recover ns:local.oplog.rs query:{ ts: { $gte: new Date(5781165171716130065) } }
Mon Aug 27 15:56:41 [conn202057] query local.oplog.rs query: { ts: { $gte: new Date(5781165171716130065) } } exception: FindingStartCursor::recoverFromYield() failed to recover code:15889 reslen:102 2153040ms
Mon Aug 27 15:56:41 [conn202057] SocketException handling request, closing client connection: 9001 socket exception [2] server [10.136.24.100:22602]
Mon Aug 27 15:56:41 [conn202057] SocketException handling request, closing client connection: 9001 socket exception [2] server [10.136.24.100:22602]
Mon Aug 27 15:56:47 [initandlisten] connection accepted from 10.136.24.100:22755 #202214
