[SERVER-4329] uncaught exception in mapreduce causes mongod to terminate
Created: 19/Nov/11  Updated: 11/Jul/16  Resolved: 21/Nov/11

Status: Closed
Project: Core Server
Component/s: MapReduce
Affects Version/s: None
Fix Version/s: 2.1.0

Type: Bug
Priority: Major - P3
Reporter: Antoine Girbal
Assignee: Antoine Girbal
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-3923 Compact command on secondary with non... Closed
Operating System: ALL
Participants:

 Description   

Root cause:

Tue Nov 15 08:00:21 [conn202500] ERROR: Uncaught std::exception: could not initialize cursor across all shards because : socket exception @ prod1/aboutmemng-m01.db.aol.com:27017,aboutmemng-d01.db.aol.com:27017,ec2-184-73-54-43.compute-1.amazonaws.com:27017, terminating

Looks like map/reduce isn't handling certain socket errors correctly.



 Comments   
Comment by Antoine Girbal [ 21/Nov/11 ]

This commit should also prevent db termination:

commit 617e9ff8ec1ecd134c7e6e42c85983ff8873a30d
Author: Mathias Stearn <mathias@10gen.com>
Date: Mon Oct 24 15:20:03 2011 -0400

Catch DBException separate from std::exception SERVER-4137

Comment by Antoine Girbal [ 21/Nov/11 ]

Actually, this is fixed by this commit:

commit d869bd9bb787707eefd650c6b59ecfdd2686d9d4
Author: Kristina <kristina@10gen.com>
Date: Thu Sep 22 12:35:44 2011 -0400

try/catch around all command calls SERVER-3923

Comment by Antoine Girbal [ 21/Nov/11 ]

Actually, it's easy to reproduce on the 2.0 line.
With 2.0.1, I stopped one of the shards right before the parallel cursor was initialized, and got the primary mongod to terminate:
Mon Nov 21 13:39:10 [conn3] ERROR: Uncaught std::exception: could not initialize cursor across all shards because : socket exception @ localhost:27018, terminating

I could not reproduce the termination on HEAD, so I still have to figure out which change fixed it.

Comment by Antoine Girbal [ 19/Nov/11 ]

I cannot reproduce this issue easily.
If I stop the other shard while the primary shard is in that code, the code throws a DBException which is properly caught at top level.
It seems that this commit introduced a fix:

commit 4d8ee4cc7c4d32ace1b1cab403dd429d9467a677
Author: gregs <greg@10gen.com>
Date: Thu Jun 9 15:41:21 2011 -0400

parallel cursor recover gracefully from replica set and other errors SERVER-2481

Comment by Antoine Girbal [ 19/Nov/11 ]

A try/catch can easily be added at that spot, but it would be better to have a general solution so that a missing try/catch cannot terminate mongod.
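
As a rough illustration of what such a general solution could look like, here is a minimal sketch of a single guard around every command call. It is not the actual server code: the `DBException` class below is a simplified stand-in, and `CommandResult`, `runCommandGuarded`, and the error codes are hypothetical names chosen for this example.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>

// Simplified stand-in for the server's DBException (assumption, not the real class).
class DBException : public std::runtime_error {
public:
    DBException(const std::string& msg, int code)
        : std::runtime_error(msg), _code(code) {}
    int code() const { return _code; }

private:
    int _code;
};

// Hypothetical result type reported back to the client.
struct CommandResult {
    bool ok;
    int code;
    std::string errmsg;
};

// Runs a command body and turns any exception into an error result, so that
// nothing escapes to the top level and terminates the process.
CommandResult runCommandGuarded(const std::function<void()>& body) {
    try {
        body();
        return {true, 0, ""};
    } catch (const DBException& e) {
        // Expected database errors keep their own error code.
        return {false, e.code(), e.what()};
    } catch (const std::exception& e) {
        // Anything else (e.g. an error surfacing as a plain std::exception)
        // is reported with a placeholder code instead of killing the process.
        return {false, -1, std::string("unexpected std::exception: ") + e.what()};
    } catch (...) {
        return {false, -1, "unknown exception"};
    }
}
```

With a wrapper of this kind, a command that throws, for example `runCommandGuarded([] { throw std::runtime_error("socket exception"); })`, returns a failed result to the caller rather than ending the process. Catching `DBException` before `std::exception` keeps expected errors distinguishable from unexpected ones.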
