[SERVER-15552] Errors writing to temporary collections during mapReduce command execution should be operation-fatal Created: 07/Oct/14  Updated: 11/Jul/16  Resolved: 25/Nov/14

Status: Closed
Project: Core Server
Component/s: MapReduce
Affects Version/s: 2.6.4
Fix Version/s: 2.6.6, 2.8.0-rc2

Type: Bug Priority: Major - P3
Reporter: Kamal Gajendran Assignee: J Rassi
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: dump.tgz, logs.tgz
Issue Links:
related to SERVER-16308 Emitting arrays as ids in the map() f... (Backlog)
Operating System: ALL
Backport Completed:
Steps To Reproduce:

Running the Map/Reduce job below puts the MongoDB instance in this state every time; it is reproducible.

mapFunc = function() {
    var k = [this.index[2], this.index[0], this.index[1]];
    var v = { 'Count': 1, 'TotalWeight': this.value['Volume'] };
    emit(k, v);
};
reduceFunc = function(key, emits) {
    total = { 'Count': 0, 'TotalWeight': 0.0 };
    for (var i in emits) {
        total['Count'] += 1;
        total['TotalWeight'] += emits[i]['TotalWeight'];
    }
    return total;
};
db.RawData.mapReduce(mapFunc, reduceFunc, 'Weights')
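For context, the map function reads this.index (a 3-element array) and this.value['Volume'], so documents of roughly the following shape reproduce the failure; the field values are illustrative and not taken from the attached dump. Any two documents whose index arrays share an element appear to collide on the temp collection's unique _id index (in the log excerpt below, the duplicate element is "009020").

// Hypothetical sample data for ModelDatabase.RawData, matching the fields
// that mapFunc reads above; values are made up for illustration only.
db.RawData.insert({ index: ["20111028", "0088", "009020"], value: { Volume: 4.0 } });
db.RawData.insert({ index: ["20111029", "0091", "009020"], value: { Volume: 3.0 } });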

Participants:

 Description   

An error in a map-reduce job crashes the secondary servers and prevents the secondaries from starting again. I know what the error in my map function is that causes the job to fail, but that shouldn't leave my MongoDB instance in an irrecoverable state. The primary is up and running, but it has stepped down to a secondary, since it is the only member of the set still running.

The map function uses an array as the key, which is not supported. The unique _id index constraint ends up being enforced on the last element of the array, which is not unique. Once I change the key to a document or a concatenated string, it works just fine.
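A minimal sketch of that workaround, reusing the field names from the original mapFunc (the name mapFuncFixed and the '|' separator are illustrative, not from the report):

mapFuncFixed = function() {
    // Build a single string key from the three index components instead of
    // emitting an array, so each emitted _id is a scalar value.
    var k = this.index[2] + '|' + this.index[0] + '|' + this.index[1];
    var v = { 'Count': 1, 'TotalWeight': this.value['Volume'] };
    emit(k, v);
};
db.RawData.mapReduce(mapFuncFixed, reduceFunc, 'Weights');

Emitting a document key (as Thomas suggests in the comments below) works as well; the important part is not emitting an array.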

Every time I try to start a secondary server, I get the same "duplicate key error index" error and it crashes. I had to wipe the secondaries and let MongoDB do a clean resync, which caused significant downtime.

This looks to be a MongoDB bug. I am running 4 shards, each backed by a 3-member replica set. On both secondary hosts, all 4 shard server processes crashed with the same error.

Any help is greatly appreciated. If there is a way to recover from the current state, please let me know as well.

thanks!

2014-10-07T00:28:32.159+0000 [conn18913] end connection 172.31.15.135:55897 (9 connections now open)
2014-10-07T00:28:32.159+0000 [initandlisten] connection accepted from 172.31.15.135:55905 #18915 (10 connections now open)
2014-10-07T00:28:32.160+0000 [conn18915]  authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-10-07T00:28:40.150+0000 [repl writer worker 1] ERROR: writer worker caught exception:  :: caused by :: 11000 insertDocument :: caused by :: 11000 E11000 duplicate key error index: ModelDatabase.tmp.mr.RawData_0.$_id_  dup key: { : "009020" } on: { ts: Timestamp 1412641720000|2, h: -267785287631189678, v: 2, op: "i", ns: "ModelDatabase.tmp.mr.RawData_0", o: { _id: [ "20111028", "0088", "009020" ], value: { Count: 6.0, TotalWeight: 7.0 } } }
2014-10-07T00:28:40.150+0000 [repl writer worker 1] Fatal Assertion 16360
2014-10-07T00:28:40.150+0000 [repl writer worker 1] 



 Comments   
Comment by Githook User [ 25/Nov/14 ]

Author:

Jason Rassi (jrassi) <rassi@10gen.com>

Message: SERVER-15552 mapReduce failure to insert to temp ns should abort op

(cherry picked from commit a4d077c775d8322c9e59313c3618fe73ac85e925)
Branch: v2.6
https://github.com/mongodb/mongo/commit/2fb5b67d280b5aa1f196d9f0afe802120bc22a56

Comment by Githook User [ 25/Nov/14 ]

Author:

Jason Rassi (jrassi) <rassi@10gen.com>

Message: SERVER-15552 mapReduce failure to insert to temp ns should abort op
Branch: master
https://github.com/mongodb/mongo/commit/a4d077c775d8322c9e59313c3618fe73ac85e925
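Per the commit message, a failed insert into the temp namespace now aborts the mapReduce operation itself (fixed in 2.6.6 and 2.8.0-rc2). A hedged way to confirm the behavior on a fixed version, reusing the repro functions from Steps To Reproduce (the shell helper throws when the command fails):

// On a fixed version, the original repro should fail with an error returned
// to the client instead of the bad insert reaching the secondaries.
try {
    db.RawData.mapReduce(mapFunc, reduceFunc, 'Weights');
} catch (e) {
    print('mapReduce aborted: ' + e);
}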

Comment by Kamal Gajendran [ 08/Oct/14 ]

Hi Thomas, thanks for looking into this bug. The workaround works just fine for us.

best, Kamal

Comment by Thomas Rueckstiess [ 08/Oct/14 ]

Hi Kamal,

Thanks for reporting. I'm able to reproduce the issue with the map/reduce job you provided and some sample data I created. One of our map/reduce developers will take a closer look. In the meantime, the workaround is (as you already discovered) to emit documents rather than arrays. A full resync of the secondaries is the best way to recover them; they would otherwise keep trying to replicate the invalid operation from the oplog, and complex manual intervention would be necessary.

Thanks,
Thomas
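
For anyone hitting the same state on an affected version, a rough outline of the full resync Thomas describes (generic steps; adapt to your deployment):

1. Make sure the affected secondary's mongod is stopped (it may already have exited after the fatal assertion).
2. Move aside or delete the contents of that member's dbpath on disk.
3. Restart mongod with its usual configuration; the member performs an initial sync from a healthy member and rejoins the replica set.
4. Repeat for the other affected secondaries, ideally one at a time.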
