[SERVER-15334] Map/Reduce jobs stall when database is taking writes Created: 19/Sep/14  Updated: 01/Apr/15  Resolved: 01/Apr/15

Status: Closed
Project: Core Server
Component/s: MapReduce
Affects Version/s: 2.4.10
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Alex Piggott Assignee: Ramon Fernandez Marina
Resolution: Done Votes: 0
Labels: map_reduce, mapreduce
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

Appears to be:

Start a long-ish running indexed map/reduce job (or several)
Delete all of the matching records while the job(s) are still running

Participants:

 Description   

Because of a race condition in my code, I can delete a set of docs over which I am in the middle of running a map/reduce job (precise details below, though I don't believe they are relevant). The map/reduce query is indexed.

My expectation is that the map/reduce job would just "ignore" records that were deleted.

Instead, something odder seems to happen: the jobs last far longer than they should and we see performance degradation.

For example, here's a db.currentOp:

                {
                        "opid" : "replica_set1:406504931",
                        "active" : true,
                        "secs_running" : 560,
                        "op" : "query",
                        "ns" : "doc_metadata.metadata",
                        "query" : {
                                "$msg" : "query not recording (too large)"
                        },
                        "client_s" : "10.10.90.42:41453",
                        "desc" : "conn881377",
                        "threadId" : "0x6790e940",
                        "connectionId" : 881377,
                        "locks" : {
                                "^doc_metadata" : "R"
                        },
                        "waitingForLock" : true,
                        "msg" : "m/r: (1/3) emit phase M/R: (1/3) Emit Progress: 2899/1 289900%",
                        "progress" : {
                                "done" : 2899,
                                "total" : 1
                        },

Check out the Emit Progress...

There were a few of these; they all ran for 20 minutes or so (the number of docs being deleted was small, in the few-thousand range) before eventually cleaning themselves up.

Bonus worry: I have a similar case in which I run a map/reduce over the entire collection (several tens of millions of documents) while documents are continually being added and removed. Should I worry, or is this an edge case that happens when a high percentage of the query set is removed?

(Details:
Thread 1: 1a) update a bunch of docs to have field:DELETE_ME
Thread 1: 2a) run a map/reduce job to count some of their attributes prior to deletion
Thread 2: 1b) update a bunch more docs to have field:DELETE_ME
Thread 2: 2b) run a map/reduce job to count some of their attributes prior to deletion
Thread 1: 3a) Remove all docs with field:DELETE_ME
Thread 2: 3b) Remove all docs with field:DELETE_ME
)
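
A minimal, self-contained version of that interleaving, using the legacy 2.x Java driver (not the actual application code: doc_metadata.metadata and field:DELETE_ME are from this ticket, while the "batch" selector, the sourceKey field, the map/reduce bodies, and the mr_counts output collection are illustrative stand-ins):

// Sketch only: names other than doc_metadata.metadata and field:DELETE_ME are stand-ins.
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MapReduceCommand;
import com.mongodb.MongoClient;

public class MapReduceDeleteRace {
    static final String MAP = "function() { emit(this.sourceKey, 1); }";
    static final String REDUCE = "function(k, vals) { return Array.sum(vals); }";

    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);
        final DBCollection coll = client.getDB("doc_metadata").getCollection("metadata");
        final BasicDBObject flagged = new BasicDBObject("field", "DELETE_ME");

        Runnable worker = new Runnable() {
            public void run() {
                // 1) flag this thread's batch of docs for deletion
                coll.update(new BasicDBObject("batch", Thread.currentThread().getName()),
                            new BasicDBObject("$set", flagged), false, true);
                // 2) indexed map/reduce over the flagged docs to count attributes pre-delete
                coll.mapReduce(MAP, REDUCE, "mr_counts",
                               MapReduceCommand.OutputType.REPLACE, flagged);
                // 3) remove *all* flagged docs -- this can delete docs the other thread's
                //    still-running map/reduce job is iterating over
                coll.remove(flagged);
            }
        };

        Thread t1 = new Thread(worker, "batch-1");
        Thread t2 = new Thread(worker, "batch-2");
        t1.start(); t2.start();
        t1.join();  t2.join();
        client.close();
    }
}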



 Comments   
Comment by Alex Piggott [ 01/Apr/15 ]

Thanks for the update!

I haven't seen it (or at least noticed it) since I put in some workarounds to minimize the probability of the bulk delete and the M/R happening at the same time. I mainly wanted to make sure you were aware it was happening and could look at fixing it in a later version - sounds like mission accomplished!

Comment by Ramon Fernandez Marina [ 01/Apr/15 ]

Hi apiggott@ikanow.com, it seems we let this ticket fall through the cracks – very sorry about that.

After testing on my end I think the root of the issue is the following: the total emit work is calculated at the beginning of the job, but if more documents are inserted before the job completes, the percentage will go above 100%. I'm also able to see the apparent performance problem: if documents are being inserted, the write lock will prevent readers from making progress, so the mapReduce job will appear to be stuck.
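
For concreteness, the status line's percentage is just done/total, so once the total no longer matches the documents actually being emitted the figure runs past 100%. A trivial sketch (not the actual server code) reproducing the "2899/1 289900%" line from the currentOp output in the description:

// Illustration only (assumption about the mechanism, not server code).
public class ProgressMeterSketch {
    public static void main(String[] args) {
        long totalAtStart = 1;   // emit work counted for the job (value from the currentOp above)
        long done = 2899;        // emits actually performed so far
        double pct = 100.0 * done / totalAtStart;
        // prints "Emit Progress: 2899/1 289900%", matching the msg field in the description
        System.out.printf("Emit Progress: %d/%d %.0f%%%n", done, totalAtStart, pct);
    }
}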

Note that newer versions of MongoDB have finer locking granularity, so if you need to be able to write to one collection while doing a mapReduce on another you may want to consider upgrading to MongoDB 3.0.

Regards,
Ramón.

Comment by Alex Piggott [ 14/Nov/14 ]

Anyone planning to look at this? I've personally worked around it, but it seems like there's a reasonably serious bug that doesn't require a particularly outlandish scenario to occur...

Comment by Alex Piggott [ 19/Sep/14 ]

Here are the map/reduce jobs being run

https://github.com/IKANOW/Infinit.e/blob/master/core/infinit.e.processing.generic.library/src/com/ikanow/infinit/e/processing/generic/aggregation/AggregationUtils_scriptlets.xml

via:
// equality query on the (indexed) url_ field
BasicDBObject mrQuery = new BasicDBObject(DocumentPojo.url_, uuid);
// REPLACE-mode output; several concurrent jobs write to the same outCollection (see note below)
DbManager.getDocument().getMetadata().mapReduce(mapScript, reduceScript, outCollection, OutputType.REPLACE, mrQuery);

Hmm, so one other potential race condition is that I have multiple jobs running against the same output collection (and, as mentioned above, with the same query) - I couldn't find anything in the documentation that explicitly states that would be a problem.
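
For reference, this is roughly the pattern in question, sketched with the plain legacy driver (collection and query are from this ticket; the map/reduce bodies and the shared output name are illustrative). With REPLACE output the finished results are swapped in as the target collection, so if two such jobs overlap, whichever completes last is the only output that survives:

// Sketch of the shared-output pattern; map/reduce bodies and output name are stand-ins.
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MapReduceCommand.OutputType;
import com.mongodb.MongoClient;

public class SharedOutputCollectionSketch {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);
        DBCollection metadata = client.getDB("doc_metadata").getCollection("metadata");
        BasicDBObject query = new BasicDBObject("field", "DELETE_ME");   // same query in every job

        String map = "function() { emit(this.sourceKey, 1); }";
        String reduce = "function(k, vals) { return Array.sum(vals); }";

        // Two jobs, same query, same REPLACE target; in the application these run
        // concurrently from different threads rather than back to back.
        metadata.mapReduce(map, reduce, "outCollection", OutputType.REPLACE, query);  // job A
        metadata.mapReduce(map, reduce, "outCollection", OutputType.REPLACE, query);  // job B

        client.close();
    }
}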

Comment by Alex Piggott [ 19/Sep/14 ]

Someone else reported something similar on stack overflow: http://stackoverflow.com/questions/24312785/mongodb-mapreduce-causes-error-error-during-query

Comment by Alex Piggott [ 19/Sep/14 ]

The title should read "degrade performance", not "kill performance". Sorry, got carried away!
