[SERVER-15552] Errors writing to temporary collections during mapReduce command execution should be operation-fatal Created: 07/Oct/14 Updated: 11/Jul/16 Resolved: 25/Nov/14
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | MapReduce |
| Affects Version/s: | 2.6.4 |
| Fix Version/s: | 2.6.6, 2.8.0-rc2 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kamal Gajendran | Assignee: | J Rassi |
| Resolution: | Done | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Issue Links: |
|
| Operating System: | ALL |
| Backport Completed: | |
| Steps To Reproduce: | Running the map-reduce job below puts the MongoDB instance in this state every time; it is reproducible.
|
| Participants: | |
| Description |
|
An error in a map-reduce job crashes the secondary servers and prevents the secondaries from starting again. I know which error in my map function causes the job to fail, but that shouldn't leave my MongoDB instance in an irrecoverable state. The primary is up and running, but it has stepped down to a secondary since it's the only replica that's running.

The map function uses a list as a key, which is not supported: the unique index constraint ends up being enforced on the last element of the list, which is not unique. Once I change the key to a dictionary or a concatenated string, it works just fine. Every time I try starting a secondary server, it crashes with the same "duplicate key error index" error. I had to wipe out the secondaries and let MongoDB do a clean sync, which came with significant downtime.

This looks to be a MongoDB bug. I am running a 3-member replica set environment with 4 shards. All 4 shard servers on both secondaries crashed with the same error. Any help is greatly appreciated. If there is a way to recover from the current state, please let me know as well. Thanks!
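To make the failure mode concrete, here is a minimal, self-contained sketch of why an array-valued key can trip a unique index. This is an illustrative simulation, not MongoDB's actual index code: it assumes (as MongoDB's multikey indexes do) that an array value is unwound into one index entry per element, so two keys that share any element collide, while scalar string keys do not.

```javascript
// Simplified model of a unique multikey index (illustrative only).
// An array value produces one index entry per element; a scalar
// produces a single entry.
function indexKeys(value) {
  return Array.isArray(value) ? value : [value];
}

// Insert a value, enforcing uniqueness across all generated entries.
// Note: unlike a real database, a failed insert here is not rolled back.
function insertUnique(index, value) {
  for (const k of indexKeys(value)) {
    const entry = JSON.stringify(k);
    if (index.has(entry)) {
      throw new Error("duplicate key error index: " + entry);
    }
    index.add(entry);
  }
}

// Array keys: the shared element "u1" is indexed twice and collides.
const arrayKeyIndex = new Set();
insertUnique(arrayKeyIndex, ["2014-10-07", "u1"]); // ok
let duplicateKeyError = false;
try {
  insertUnique(arrayKeyIndex, ["2014-10-08", "u1"]); // "u1" already indexed
} catch (e) {
  duplicateKeyError = true;
}

// Concatenated-string keys (the reporter's workaround): each key is a
// single distinct entry, so no collision occurs.
const stringKeyIndex = new Set();
insertUnique(stringKeyIndex, "2014-10-07:u1");
insertUnique(stringKeyIndex, "2014-10-08:u1");
```

Under this model, the two array keys collide even though they differ as wholes, which matches the reporter's observation that uniqueness appeared to be enforced on individual list elements rather than on the full key.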
|
| Comments |
| Comment by Githook User [ 25/Nov/14 ] |
|
Author: Jason Rassi (jrassi) <rassi@10gen.com> Message: (cherry picked from commit a4d077c775d8322c9e59313c3618fe73ac85e925) |
| Comment by Githook User [ 25/Nov/14 ] |
|
Author: Jason Rassi (jrassi) <rassi@10gen.com> Message: |
| Comment by Kamal Gajendran [ 08/Oct/14 ] |
|
Hi Thomas,

Thanks for looking into this bug. The workaround works just fine for us.

Best,
Kamal |
| Comment by Thomas Rueckstiess [ 08/Oct/14 ] |
|
Hi Kamal,

Thanks for reporting. I'm able to reproduce the issue with the map/reduce job you provided and some sample data I created. One of our map/reduce developers will take a closer look.

In the meantime, the workaround is (as you already discovered) to emit documents rather than lists.

A full resync of the secondaries is the best way to recover them; otherwise they would keep trying to replicate the invalid operation from the oplog, and complex manual intervention would be necessary.

Thanks,
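The workaround can be sketched as follows. The field names (`userId`, `day`) are illustrative, not taken from the reporter's actual job, and a minimal stand-in for the mongo shell's `emit()` is included so the snippet runs anywhere:

```javascript
// Stand-in for the mongo shell's emit(): collect emitted key/value pairs.
const emitted = [];
function emit(key, value) { emitted.push({ key: key, value: value }); }

// Problematic shape: the key is an array, which the mapReduce temp
// collection cannot store safely (see the duplicate key errors above).
function mapWithArrayKey() {
  emit([this.userId, this.day], 1);
}

// Workaround: emit a document (or a concatenated string) as the key.
function mapWithDocumentKey() {
  emit({ userId: this.userId, day: this.day }, 1);
}

// Both map functions applied to the same hypothetical input document.
const doc = { userId: "u1", day: "2014-10-07" };
mapWithArrayKey.call(doc);    // key: ["u1", "2014-10-07"]
mapWithDocumentKey.call(doc); // key: { userId: "u1", day: "2014-10-07" }
```

Note that `function` declarations (rather than arrow functions) are required here so that `this` binds to the input document, mirroring how the mongo shell invokes map functions.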