[SERVER-6872] scoped connection not being returned to the pool Created: 27/Aug/12  Updated: 15/Feb/13  Resolved: 02/Sep/12

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Anton V. Volokhov Assignee: Jeremy Mikola
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

mongodb 2.0.6


Issue Links:
Duplicate
duplicates SERVER-5157 Cursor id is being stored at mongos a... Closed
Operating System: Linux
Participants:

 Description   

MongoDB configuration: 2 shards with 3 instances each, 3 config servers, a remote mongos, a 52 Mbit network between mongos and the mongods, 1 Gbit between the mongods.

I'm trying to retrieve data from MongoDB via a single .find() query in the mongo console.
Code:

var i = 0;
var cursor = db.load.find();
while (cursor.hasNext()) { cursor.next(); i++; }
print(i);

While attempts to do this on the local machine always succeed, the query through the remote mongos fails repeatedly at approximately the same document. The document sizes appear to be homogeneous.

console output:
1670457
getMore: cursor didn't exist on server, possible restart or timeout?

db.load.count():
3236686

db.stats():

mongos> db.stats(1024*1024);
{
    "raw" : {
        "rset01/load1ft:27017,load2ft:27017,load3ft:27017" : {
            "db" : "test",
            "collections" : 4,
            "objects" : 1670511,
            "avgObjSize" : 5729.003803027936,
            "dataSize" : 9127,
            "storageSize" : 9734,
            "numExtents" : 40,
            "indexes" : 4,
            "indexSize" : 376,
            "fileSize" : 16313,
            "nsSizeMB" : 16,
            "ok" : 1
        },
        "rset02/load1gt:27017,load2gt:27017,load3gt:27017" : {
            "db" : "test",
            "collections" : 3,
            "objects" : 1566193,
            "avgObjSize" : 5903.711271854746,
            "dataSize" : 8818,
            "storageSize" : 9223,
            "numExtents" : 39,
            "indexes" : 2,
            "indexSize" : 348,
            "fileSize" : 14266,
            "nsSizeMB" : 16,
            "ok" : 1
        }
    },
    "objects" : 3236704,
    "avgObjSize" : 0.005544220293236576,
    "dataSize" : 17945,
    "storageSize" : 18957,
    "numExtents" : 79,
    "indexes" : 6,
    "indexSize" : 724,
    "fileSize" : 30579,
    "ok" : 1
}

mongos logs:

Mon Aug 27 18:49:16 [conn1] want cursor : 1975918310978336011
Mon Aug 27 18:49:16 [conn1] CursorCache::get id: 1975918310978336011
Mon Aug 27 18:49:16 [conn1] scoped connection to load1ft:27017 not being returned to the pool
Mon Aug 27 18:49:16 [conn1] AssertionException while processing op type : 2005 to : testing.load :: caused by :: 13127 getMore: cursor didn't exist on server, possible restart or timeout?



 Comments   
Comment by Anton V. Volokhov [ 01/Sep/12 ]

I've been told it is related to https://jira.mongodb.org/browse/SERVER-5157

Comment by Anton V. Volokhov [ 29/Aug/12 ]

Output from mongos during the last run:
Tue Aug 28 20:06:20 [conn3] Request::process ns: test.$cmd msg id:6 attempt: 0
Tue Aug 28 20:06:20 [conn3] single query: test.$cmd { count: "load", query: {} } ntoreturn: 1 options : 4
Tue Aug 28 20:06:20 [conn3] have to set shard version for conn: 0x23433e0 ns:test.load my last seq: 0 current: 2 version: 86|0 manager: 0x7f0694004c90
Tue Aug 28 20:06:20 [conn3] setShardVersion rset01 load1ft:27017 test.load

{ setShardVersion: "test.load", configdb: "load1ft:27018", version: Timestamp 86000|0, serverID: ObjectId('503cb063fed7a82fa1518d62'), shard: "rset01", shardHost: "rset01/load1ft:27017,load2ft:27017,load3ft:27017" }

0x23433e0
Tue Aug 28 20:06:20 [conn3] setShardVersion success:

{ oldVersion: Timestamp 0|0, ok: 1.0 }

Tue Aug 28 20:06:20 [conn3] have to set shard version for conn: 0x2344d20 ns:test.load my last seq: 0 current: 2 version: 86|1 manager: 0x7f0694004c90
Tue Aug 28 20:06:20 [conn3] setShardVersion rset02 load2gt:27017 test.load

{ setShardVersion: "test.load", configdb:load1ft:27018", version: Timestamp 86000|1, serverID: ObjectId('503cb063fed7a82fa1518d62'), shard: "rset02", shardHost: "rset02/load1gt:27017,load2gt:27017,load3gt:27017" }

0x2344d20
Tue Aug 28 20:06:20 [conn3] setShardVersion success:

{ oldVersion: Timestamp 0|0, ok: 1.0 }

Tue Aug 28 20:06:20 [conn3] slave '' is not initialized or invalid
Tue Aug 28 20:06:20 [conn3] dbclient_rs getSlave rset02/load1gt:27017,load2gt:27017,load3gt:27017
Tue Aug 28 20:06:20 BackgroundJob starting: ConnectBG
Tue Aug 28 20:06:20 [conn3] Request::process ns: test.load msg id:7 attempt: 0
Tue Aug 28 20:06:20 [conn3] shard query: test.load { query: {}, $snapshot: true }
Tue Aug 28 20:06:20 [conn3] shard query servers: 2
Tue Aug 28 20:06:20 [conn3] ParallelSortClusteredCursor::init server:rset01/load1ft:27017,load2ft:27017,load3ft:27017 ns:test.load query:{ query: {}, $snapshot: true } _fields:{} options: 20
Tue Aug 28 20:06:20 [conn3] ParallelSortClusteredCursor::init server:rset02/load1gt:27017,load2gt:27017,load3gt:27017 ns:test.load query:{ query: {}, $snapshot: true } _fields:{} options: 20
Tue Aug 28 20:06:21 [conn3] cursor type: ParallelSort
Tue Aug 28 20:06:21 [conn3] hasMore: 1 sendMore: 1 cursorMore: 1 ntoreturn: 0 num: 100 wouldSendMoreIfHad: 1 id:3399191084891024289 totalSent: 0
Tue Aug 28 20:06:22 [conn3] storing cursor : 3399191084891024289
...
Tue Aug 28 21:48:46 [conn3] Request::process ns: test.load msg id:5864 attempt: 0
Tue Aug 28 21:48:46 [conn3] want cursor : 3399191084891024289
Tue Aug 28 21:48:46 [conn3] CursorCache::get id: 3399191084891024289
Tue Aug 28 21:48:46 [conn3] hasMore: 1 sendMore: 1 cursorMore: 1 ntoreturn: 0 num: 776 wouldSendMoreIfHad: 1 id:3399191084891024289 totalSent: 3235793
Tue Aug 28 21:48:46 [conn3] Request::process ns: test.load msg id:5865 attempt: 0
Tue Aug 28 21:48:46 [conn3] want cursor : 3399191084891024289
Tue Aug 28 21:48:46 [conn3] CursorCache::get id: 3399191084891024289
Tue Aug 28 21:48:46 [conn3] hasMore: 0 sendMore: 1 cursorMore: 0 ntoreturn: 0 num: 117 wouldSendMoreIfHad: 1 id:3399191084891024289 totalSent: 3236569
Tue Aug 28 21:48:46 [conn3] Socket recv() conn closed? 127.0.0.1:39734
Tue Aug 28 21:48:46 [conn3] SocketException: remote: 127.0.0.1:39734 error: 9001 socket exception [0] server [127.0.0.1:39734]
Tue Aug 28 21:48:46 [conn3] end connection 127.0.0.1:39734

Comment by Anton V. Volokhov [ 28/Aug/12 ]

No, the process wasn't stopped.

Some more info from the previous run:
1. We never issued getMore on the killed cursor, except at the moment the exception occurred.
2. The last getMore on load1ft was at 18:21:
Mon Aug 27 18:21:50 [conn197472] getmore test.load cursorid:8060402827378091386 nreturned:671 reslen:3861713 4ms
3. In the interval 18:18 - 18:49 we were receiving data from the second shard.

The timing of this run is even stranger:
Tue Aug 28 16:04:37 [clientcursormon] killing old cursor 5889331906261155351 test.load idle:600254ms
Tue Aug 28 18:48:02 [conn203588] getMore: cursorid not found test.load 5889331906261155351
Tue Aug 28 18:48:02 [conn203588] getmore test.load cursorid:5889331906261155351 ntoreturn:3 reslen:20 0ms
and we still issued getMore on this cursor only once, at 18:48:02.

The rest of the log is identical to the previous run's.

Comment by Scott Hernandez (Inactive) [ 28/Aug/12 ]

Can you post the server logs from that run?

Mon Aug 27 18:18:36 [clientcursormon] killing old cursor 6099439472920059228 test.load idle:604000ms
mem (MB) res:24408 virt:128207 mapped:63717
Mon Aug 27 18:49:16 [conn197395] getMore: cursorid not found test.load 6099439472920059228

These events are very far apart; are you sure you are not pausing somewhere during your cursor next() loop?

Comment by Anton V. Volokhov [ 28/Aug/12 ]

The same behaviour with your query:
mongos> print(db.load.find().batchSize(3).itcount());
1670500
getMore: cursor didn't exist on server, possible restart or timeout?

And the same errors on the mongods.

Comment by Scott Hernandez (Inactive) [ 28/Aug/12 ]

Please test that with the batchSize; it will return batches from the cursor with 3 docs at a time.
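
Applied to the original loop from the description, that test would look roughly like this (a sketch; only the batchSize(3) call is new, the rest is the loop from the report):

var i = 0;
var cursor = db.load.find().batchSize(3);   // fetch only 3 documents per batch from the server
while (cursor.hasNext()) { cursor.next(); i++; }
print(i);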

Comment by Anton V. Volokhov [ 28/Aug/12 ]

Logs on load1ft:
grep 6099439472920059228 mongodb.log #dead cursor id

Mon Aug 27 18:18:36 [clientcursormon] killing old cursor 6099439472920059228 test.load idle:604000ms
mem (MB) res:24408 virt:128207 mapped:63717
Mon Aug 27 18:49:16 [conn197395] getMore: cursorid not found test.load 6099439472920059228
Mon Aug 27 18:49:16 [conn197395] getmore test.load cursorid:6099439472920059228 reslen:20 0ms

Is there any way to increase the cursor timeout, or do we have to split the data and retrieve it in pieces?
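
If it comes to that, something like the following is what I have in mind (a rough sketch only; the no-timeout variant assumes the shell's DBQuery.Option.noTimeout flag is honoured through mongos, the piecewise variant relies only on the default _id index, and the piece size of 10000 is arbitrary):

// Variant A: ask the server not to time the cursor out while idle
var c = db.load.find().addOption(DBQuery.Option.noTimeout);
var n = 0;
while (c.hasNext()) { c.next(); n++; }
print(n);

// Variant B: retrieve the data in pieces, resuming from the last _id seen
var lastId = null, total = 0;
while (true) {
    var q = (lastId === null) ? {} : { _id: { $gt: lastId } };
    var batch = db.load.find(q).sort({ _id: 1 }).limit(10000).toArray();
    if (batch.length === 0) break;
    lastId = batch[batch.length - 1]._id;
    total += batch.length;
}
print(total);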

All instances in rset01 and the primary in rset02 got the following error in the interval 18:49:15 - 18:49:20:
Mon Aug 27 18:49:18 BackgroundJob starting: ConnectBG
Mon Aug 27 18:49:19 [conn197808] Socket recv() conn closed? x.x.x.x:53160
Mon Aug 27 18:49:19 [conn197808] SocketException: remote: x.x.x.x:53160 error: 9001 socket exception [0] server [x.x.x.x:53160]
Mon Aug 27 18:49:19 [conn197808] end connection x.x.x.x:53160
The primary shard for the test database is rset02.

Comment by Scott Hernandez (Inactive) [ 28/Aug/12 ]

Are there any messages on the mongods (shards) at this time? This sounds like a cursor timeout (cursors time out after 10 minutes of inactivity) while processing the documents in the loop, likely related to network latency.

mongos> print(db.load.find().batchSize(3).itcount());
// Note that itcount() does the same thing as your loop; run the line below (without the "()") to see its implementation:
mongos> db.load.find().itcount

Comment by Anton V. Volokhov [ 27/Aug/12 ]

Well, almost the same document: ±1000 over 10 runs.
