[SERVER-5471] Crash during map reduce with lots of queued reads Created: 01/Apr/12  Updated: 15/Aug/12  Resolved: 20/Apr/12

Status: Closed
Project: Core Server
Component/s: MapReduce
Affects Version/s: 2.0.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Nic Cottrell (Personal) Assignee: Tad Marshall
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: RHEL5

Attachments: Text File mongod-crash.txt    
Operating System: Linux

 Description   

The mongod process crashed towards the end of a slow map/reduce job.



 Comments   
Comment by Nic Cottrell (Personal) [ 06/Apr/12 ]

Hi Tad,

In fact, I don't know exactly which params were used in that M/R job, but my cron runs very similar queries every hour and I haven't had any similar crashes since then, even with the profiler on. So this may in fact be impossible to reproduce. Sorry!

Nic.

Comment by Tad Marshall [ 06/Apr/12 ]

Hi Nic,

Would it be possible to turn off profiling temporarily and try the MapReduce again? This would help us narrow down where the crash is coming from. If the MapReduce produces the desired results and there is no crash, that would be very informative.
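In the mongo shell, something like this should do it (a sketch; run it against whichever database the job uses):

// turn the profiler off for this database
db.setProfilingLevel(0)

// ... re-run the MapReduce job here ...

// restore slow-operation profiling afterwards
db.setProfilingLevel(1)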

These two lines from the stack trace suggest that it was the profiler that caused the crash, not the MapReduce:

/usr/bin/mongod(_ZNK5mongo7OpDebug6appendERKNS_5CurOpERNS_14BSONObjBuilderE+0x2bf) [0x894f3f]
/usr/bin/mongod(_ZN5mongo7profileERKNS_6ClientERNS_5CurOpE+0x218) [0x778758]

Once we know for sure that it is the profiler that is crashing (if that turns out to be the case), we will know where to focus debugging.

Thanks!

Tad

Comment by Nic Cottrell (Personal) [ 03/Apr/12 ]

Hi Tad,

Yeah, I use INLINE as output for almost everything since it seems much faster and I'm usually well within the 16MB limit.

I install from your official yum repo for Linux, and I'm pretty sure it was 2.0.4 at the time of the crash.

Yeah, I have profiling enabled; in fact I have:

quiet = true
profile = 1
journal = true

# oplog = 0
cpu = true
notablescan = true
rest = true
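
If it's useful, the shell can confirm the active level:

db.getProfilingLevel()   // should return 1, given "profile = 1" above
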
Comment by Tad Marshall [ 03/Apr/12 ]

Hi Nic,

Thanks for the info.

I'm used to seeing MapReduce expressed in JavaScript, in the mongo shell, for example. The syntax for issuing a MapReduce command from the shell is described at http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-Overview and it gives a view into how the MongoDB server sees MapReduce: MapReduce is a server command, received as a query against a pseudo-collection named "$cmd".

None of this matters to your issue, but for the server to have executed a MapReduce operation for you it had to have received a command telling it to do that. This command could have had options, perhaps including an optional "out" parameter that controls what happens to the output of the MapReduce operation.
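
To make that concrete, a command of that shape, as issued from the shell, looks roughly like this (all names here are placeholders, not your actual code):

db.runCommand({
    mapreduce: "someCollection",   // placeholder collection name
    map: mapFunction,              // placeholder map function
    reduce: reduceFunction,        // placeholder reduce function
    query: { /* optional filter */ },
    out: { inline: 1 }             // the optional "out" parameter
})

The shell delivers this to the server as a query against the "$cmd" pseudo-collection, roughly db.getCollection("$cmd").findOne({ mapreduce: ..., ... }).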

Can you post the Java code that is invoking the MapReduce operation? That might give us an idea of how it could be structured to avoid the memory limit.

You mentioned "inline" ... as documented at http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-Outputoptions , "Note that this option is possible only when the result set fits within the 16MB limit of a single document."

Regarding the crash, are you running a build downloaded from mongodb.org or one from another source?

Also, I see mention of "profile" and "OpDebug" in the stack trace, and I'm wondering if you had profiling enabled when the crash occurred. It is possible (if profiling was active) that the actual crash wasn't in the MapReduce operation itself but in the profiler attempting to log the long-running operation. The stack traces in your mongod-crash.txt log show no indication of JavaScript. If we know that the version you were running is exactly our 2.0.4 Linux version from mongodb.org, then we can use the addresses from the stack to see exactly where we crashed.

Thanks!

Tad

Comment by Nic Cottrell (Personal) [ 03/Apr/12 ]

Hi Tad!

I'm not sure what you mean by the MapReduce command. I'm using the Java library via Morphia, which uses the MapReduceCommand class, I believe.

I don't know exactly what was going on during the crash - it was some batch process running at the time. I believe the query would have been something like:

{"indices.text": {$in: [.....]}, lc: "eng", group: "all"}

and the scope field params would have been

{"targetTerms": [...]}

and the [...] array would have been 500-1000 short strings, probably all sorts of high Unicode.

Here's an example object from that collection:

{ "_id" : ObjectId( "4e2c9d243e7a4af060b57cf2" ),
  "avgWL" : 0,
  "createDate" : Date( 1232409600000 ),
  "date" : Date( 1232454035000 ),
  "group" : "all",
  "indices" : [ 
    { "text" : "日",
      "textLc" : "日",
      "pos" : 4 }, 
    { "text" : "され",
      "textLc" : "され",
      "pos" : 12 }, 
    { "text" : "いる",
      "textLc" : "いる",
      "pos" : 15 }, 
    { "text" : "更新状",
      "textLc" : "更新状",
      "pos" : 21 }, 
    { "text" : "更新状況",
      "textLc" : "更新状況",
      "pos" : 21 }, 
    { "text" : "状況",
      "textLc" : "状況",
      "pos" : 23 }, 
    { "text" : "クされ",
      "textLc" : "クされ",
      "pos" : 11 }, 
    { "text" : "ンクされ",
      "textLc" : "ンクされ",
      "pos" : 10 }, 
    { "text" : "ンク",
      "textLc" : "ンク",
      "pos" : 10 }, 
    { "text" : "るペ",
      "textLc" : "るペ",
      "pos" : 16 }, 
    { "text" : "されている",
      "textLc" : "されている",
      "pos" : 12 }, 
    { "text" : "ている",
      "textLc" : "ている",
      "pos" : 14 }, 
    { "text" : "て",
      "textLc" : "て",
      "pos" : 14 }, 
    { "text" : "ク",
      "textLc" : "ク",
      "pos" : 11 }, 
    { "text" : "されてい",
      "textLc" : "されてい",
      "pos" : 12 }, 
    { "text" : "てい",
      "textLc" : "てい",
      "pos" : 14 }, 
    { "text" : "れてい",
      "textLc" : "れてい",
      "pos" : 13 }, 
    { "text" : "クされてい",
      "textLc" : "クされてい",
      "pos" : 11 }, 
    { "text" : "い",
      "textLc" : "い",
      "pos" : 15 }, 
    { "text" : "ンクされて",
      "textLc" : "ンクされて",
      "pos" : 10 }, 
    { "text" : "クさ",
      "textLc" : "クさ",
      "pos" : 11 }, 
    { "text" : "新状",
      "textLc" : "新状",
      "pos" : 22 }, 
    { "text" : "から",
      "textLc" : "から",
      "pos" : 7 }, 
    { "text" : "状",
      "textLc" : "状",
      "pos" : 23 }, 
    { "text" : "れて",
      "textLc" : "れて",
      "pos" : 13 }, 
    { "text" : "況",
      "textLc" : "況",
      "pos" : 24 }, 
    { "text" : "る",
      "textLc" : "る",
      "pos" : 16 }, 
    { "text" : "月",
      "textLc" : "月",
      "pos" : 2 }, 
    { "text" : "更",
      "textLc" : "更",
      "pos" : 21 }, 
    { "text" : "れ",
      "textLc" : "れ",
      "pos" : 13 }, 
    { "text" : "ンクさ",
      "textLc" : "ンクさ",
      "pos" : 10 }, 
    { "text" : "ら",
      "textLc" : "ら",
      "pos" : 8 }, 
    { "text" : "れている",
      "textLc" : "れている",
      "pos" : 13 }, 
    { "text" : "ン",
      "textLc" : "ン",
      "pos" : 10 }, 
    { "text" : "クされて",
      "textLc" : "クされて",
      "pos" : 11 }, 
    { "text" : "されて",
      "textLc" : "されて",
      "pos" : 12 }, 
    { "text" : "ペ",
      "textLc" : "ペ",
      "pos" : 17 }, 
    { "text" : "ているペ",
      "textLc" : "ているペ",
      "pos" : 14 }, 
    { "text" : "ジ",
      "textLc" : "ジ",
      "pos" : 19 }, 
    { "text" : "新",
      "textLc" : "新",
      "pos" : 22 }, 
    { "text" : "更新",
      "textLc" : "更新",
      "pos" : 21 }, 
    { "text" : "の",
      "textLc" : "の",
      "pos" : 20 }, 
    { "text" : "さ",
      "textLc" : "さ",
      "pos" : 12 }, 
    { "text" : "リ",
      "textLc" : "リ",
      "pos" : 9 }, 
    { "text" : "Wikipedia",
      "textLc" : "wikipedia",
      "pos" : 28 }, 
    { "text" : "ー",
      "textLc" : "ー",
      "pos" : 18 }, 
    { "text" : "新状況",
      "textLc" : "新状況",
      "pos" : 22 }, 
    { "text" : "か",
      "textLc" : "か",
      "pos" : 7 }, 
    { "text" : "いるペ",
      "textLc" : "いるペ",
      "pos" : 15 }, 
    { "text" : "れているペ",
      "textLc" : "れているペ",
      "pos" : 13 } ],
  "indicesSize" : 50,
  "kodoId" : 7564212,
  "l" : { "$ref" : "Language",
    "$id" : "jap" },
  "lc" : "jap",
  "length" : 37,
  "lengthPlain" : 37,
  "owner" : 1,
  "pageCount" : 0,
  "pattern" : false,
  "random" : 13415051,
  "sig" : "psWsWp_WWsWWWWWWWWsWsWWWW___W",
  "tHash" : -8020904180978289261,
  "tLcHash" : -7390523363417393741,
  "tLcLeft" : "\"6月5日\" からリンクされてい",
  "tLeft" : "\"6月5日\" からリンクされてい",
  "text" : "\"6月5日\" からリンクされているページの更新状況 - Wikipedia",
  "textLc" : "\"6月5日\" からリンクされているページの更新状況 - wikipedia",
  "tlc" : 0,
  "topicCount" : 0,
  "type" : "PLAIN",
  "user" : { "$ref" : "User",
    "$id" : ObjectId( "4df3712e71123b7848e81894" ) },
  "v" : 0,
  "wordCount" : 23,
  "wrongLanguage" : false }

Hope this helps!

Comment by Tad Marshall [ 02/Apr/12 ]

Can you post the mapReduce command itself, including all arguments, and also post a sample of one of the documents being found by your query (and used in the map function)?

There is a separate problem in that the server crashed after reporting the BSON size assertion twice: trying to generate a BSON object that is too large should not crash the server, and does not do so in other cases. So, we'd like to help you get a working map/reduce and also find out what caused the crash.

Thanks!

Comment by Nic Cottrell (Personal) [ 02/Apr/12 ]

Yeah - it's quite likely that I hit some limit for an inline map reduce, but each object in the result set should have been really small. I think this was the map/reduce code running at the time of the crash:

map = function () {
    for (i in targetTerms) {
        var targetTerm = targetTerms[i];
        for (j in this.indices) {
            var index = this.indices[j];
            if (targetTerm === index.textLc) {
                emit(index.text, 1);
            }
        }
    }
}

reduce = function (prev, curr) {
    var result = 0;
    for (i in curr) {
        result += curr[i];
    }
    return result;
}

The candidate set from the query was about 200k objects; the average object size in that collection is about 2487 bytes.
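
For reference, assembled from the pieces above, the invocation was presumably something along these lines (the collection name is a placeholder; I don't have the exact command):

db.runCommand({
    mapreduce: "texts",   // placeholder collection name
    map: map,
    reduce: reduce,
    query: { "indices.text": { $in: [ /* 500-1000 terms */ ] }, lc: "eng", group: "all" },
    scope: { targetTerms: [ /* the same terms, visible to map() */ ] },
    out: { inline: 1 }
})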

Comment by Tad Marshall [ 01/Apr/12 ]

It looks like the map/reduce exceeded the maximum size of a BSON object:

Sun Apr 1 00:56:56 [conn920] Assertion: 13548:BufBuilder grow() > 64MB
0x584102 0x504fd9 0x894f3f 0x778758 0x88e0cc 0xaa33f6 0x637407 0x3083c0677d 0x30834d325d
/usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x112) [0x584102]
/usr/bin/mongod [0x504fd9]
/usr/bin/mongod(_ZNK5mongo7OpDebug6appendERKNS_5CurOpERNS_14BSONObjBuilderE+0x2bf) [0x894f3f]
/usr/bin/mongod(_ZN5mongo7profileERKNS_6ClientERNS_5CurOpE+0x218) [0x778758]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x9fc) [0x88e0cc]
/usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x76) [0xaa33f6]
/usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287) [0x637407]
/lib64/libpthread.so.0 [0x3083c0677d]
/lib64/libc.so.6(clone+0x6d) [0x30834d325d]

Is this a reasonable result of your map/reduce job (to produce a document of this size at some point)? If so, you have hit a built-in limit. If not, can you post your map/reduce code and/or a description of what you are doing so we could see where the problem might be?
