[SERVER-5845] Cursor can get deleted because of timeout at the finish stage of a sharded map reduce Created: 15/May/12  Updated: 11/Jul/16  Resolved: 30/Aug/12

Status: Closed
Project: Core Server
Component/s: MapReduce
Affects Version/s: 2.1.1
Fix Version/s: 2.3.0

Type: Improvement Priority: Critical - P2
Reporter: Azat Khuzhin Assignee: Randolph Tan
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongo_mr_domain_failed_ob3.log    
Issue Links:
Related
related to SERVER-6906 Potential cursor timeout at reduce st... Closed
Participants:

 Description   

When I run one of MR job, the second time fail with error:

{"errmsg":"exception: getMore: cursor didn't exist on server, possible restart or timeout?","code":13127,"ok":0}

And by logs, I see that in final stage (when it iterate over "CollectionByWhatMapReduceRun_IntNumber"), it not handle connections

// conn178752 - MR job connection
...
Mon May 14 17:48:35 [conn178752] getmore Database.CollectionByWhatMapReduceRun_IntNumber cursorid:4899920544814319828 ntoreturn:0 keyUpdates:0 nreturned:4316 reslen:5621110 142ms
...
Mon May 14 17:48:51 [initandlisten] connection accepted from 127.0.0.1:33015 #559035 (446 connections now open)
Mon May 14 17:48:51 [initandlisten] connection accepted from 127.0.0.1:33016 #559036 (447 connections now open)
...
Mon May 14 18:23:14 [conn178752]                52200/108636    48%
...
Mon May 14 18:23:14 [conn180633] SocketException handling request, closing client connection: 9001 socket exception [2] server [127.0.0.1:60746]
...
Mon May 14 18:24:11 [initandlisten] connection refused because too many open connections: 819
Mon May 14 18:24:11 [initandlisten] connection accepted from 127.0.0.1:59086 #754273 (820 connections now open)
...
Mon May 14 18:24:16 [conn178752] getMore: cursorid not found Database.CollectionByWhatMapReduceRun_IntNumber 4899920544814319828

And after - number of connection decrements

But another type of MR job is runs okay.

Any ideas?



 Comments   
Comment by auto [ 30/Aug/12 ]

Author:

{u'date': u'2012-08-30T15:54:30-07:00', u'name': u'Randolph Tan', u'email': u'randolph@10gen.com'}

Message: SERVER-5845 Cursor can get deleted because of timeout at the finish stage of a sharded map reduce
Branch: master
https://github.com/mongodb/mongo/commit/374289bb6e043b16fe9bf613bc18e56f6df14a81

Comment by Azat Khuzhin [ 06/Jun/12 ]

See my pull request
https://github.com/mongodb/mongo/pull/231

I'v add only QueryOption_NoCursorTimeout

But need to add yielding ( maybe only if nonAtomic is set? )
And maybe better to use just "update"/"insert" instead of "upsert", because it run "findOne" command twice^
first in "State::postProcessCollectionNonAtomic"
second in "Helpers::upsert"

Comment by Azat Khuzhin [ 06/Jun/12 ]

And why you are using upsert, if you already check exist it or not in "State::postProcessCollectionNonAtomic"?

Comment by Azat Khuzhin [ 06/Jun/12 ]

Line mongo/db/commands/mr.cpp:507 (State::postProcessCollectionNonAtomic) must have QueryOption_NoCursorTimeout
And must yielding

Comment by Azat Khuzhin [ 06/Jun/12 ]

It fails at stage
"m/r: reduce post processing 1154/55614 2"
after 20 seconds
"m/r: reduce post processing 1159/55614 2%"
It merge "inc" collection with "out" collection (but too slow, "out" collection has ~2.5 million of rows, not so much I think)

Db is locked "lockType" : "W", but I use "nonAtomic" flag, when running mapreduce
CPU usage 100%
No io activity

New clients arrived, but not served, so connections accumulated
And the it fails

Comment by Azat Khuzhin [ 05/Jun/12 ]

Attach log
3k lines before starting MR job
some lines after failed MR job

Comment by Randolph Tan [ 05/Jun/12 ]

It should be fine because the cursor has been initialized with QueryOption_NoCursorTimeout. It would be best to post the entire log, but 3k lines before might be enough. The grep output might be enough too.

Comment by Azat Khuzhin [ 05/Jun/12 ]

If I grep by "conn34700" is it enough?

Comment by Randolph Tan [ 05/Jun/12 ]

Can you post the complete new log?

Comment by Azat Khuzhin [ 05/Jun/12 ]

About cursor already not exist:

Maybe because of https://github.com/mongodb/mongo/blob/a105cf0bb3b8bc8aac662441f53deae5e4e83462/src/mongo/db/commands/mr.cpp#L760
And if there is 100 objects (~16mb) = 1600mb ?

And maybe better to yield on every 100mb ?

Comment by Azat Khuzhin [ 05/Jun/12 ]

Now I have the same error (maybe because of data set increase)

Sun May 27 17:13:15 [conn34700] getMore: cursorid not found seoquake_ad_api.tmp.mr.ad_statistics_6 4879762645212462063
Sun May 27 17:13:15 [conn34700] mr failed, removing collection :: caused by :: 13127 getMore: cursor didn't exist on server, possible restart or timeout?

And what about data inconsistency?

Comment by Azat Khuzhin [ 04/Jun/12 ]

But in my case no sharding is used

Comment by Randolph Tan [ 30/May/12 ]

I have confirmed that this is a bug and this happens because the cursor used by the shard doesn't use notimeout when querying over the map reduce results of individual shards (when mapreduce.shardedfinish is called).

Comment by Azat Khuzhin [ 28/May/12 ]

BTW if it iterate over "CollectionByWhatMapReduceRun_IntNumber" is it write to "out" collection?
If so, and it fails - than this cause inconsistency

Comment by Azat Khuzhin [ 25/May/12 ]

This cursor create in mapreduce. I can't set noTimeout flag for it.
But it already create with it

auto_ptr<ClientCursor> cursor( new ClientCursor( QueryOption_NoCursorTimeout , temp , _config.incLong.c_str() ) );

Comment by Randolph Tan [ 24/May/12 ]

The server cleans up cursors that has been idle for 10 minutes unless you set the no timeout flag. Assuming that you posted everything for conn178752 between 17:48:35 and 18:24:16, the cursor has been untouched for more than 10 mins.

Comment by Azat Khuzhin [ 17/May/12 ]

I think that it can be because of too many connections.
And when mongo kill it (timeout ?) it kills this cursor, or something like this

Now I add replicaset and I successfully run this MR job

Generated at Thu Feb 08 03:10:02 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.