[SERVER-4987] new bigMapReduce.js issue Created: 16/Feb/12  Updated: 11/Jul/16  Resolved: 09/Mar/12

Status: Closed
Project: Core Server
Component/s: MapReduce, Sharding
Affects Version/s: 2.1.0
Fix Version/s: 2.1.1

Type: Bug Priority: Major - P3
Reporter: Greg Studer Assignee: Randolph Tan
Resolution: Done Votes: 0
Labels: buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-5086 getLastError failed to synchronize th... Closed
is related to SERVER-5077 sharded getLastError doesn't handle c... Closed
Operating System: ALL
Participants:

 Description   

Documents are counted incorrectly in the mapreduce output, not in the mapreduce input.

http://buildbot.mongodb.org/builders/OS%20X%2010.5%2032-bit/builds/3364/steps/test_3/logs/stdio/text



 Comments   
Comment by auto [ 09/Mar/12 ]

Author: Randolph Tan (email: randolph@10gen.com)

Message: Modified SERVER-4987 and added more comments to properly setup this test.
Branch: master
https://github.com/mongodb/mongo/commit/2d91311153e841bc66be1c600966da5663e2cc57

Comment by auto [ 03/Mar/12 ]

Author: Randolph Tan (email: randolph@10gen.com)

Message: Test for SERVER-4987
Branch: master
https://github.com/mongodb/mongo/commit/915ef6cd2ef940ccb263d697b71ccf770dfda189

Comment by Randolph Tan [ 24/Feb/12 ]

Investigation result:

The test failed because the getLastError call that is supposed to synchronize, and so make sure that prior inserts made it to the shards, allowed some failed inserts to slip through. As a result, the map reduce jobs were not able to get all of their inputs.

The write failures occurred because of shard version changes during background chunk migrations by the balancer. Normally, these failures are handled via writebacks, which are fetched either by a background thread or when getLastError is called. The bug occurs when an insert triggers a chunk split during a migration and the insert fails. In that case, an uncaught StaleConfigException causes the connection not to be put back into the pool. When the getLastError command is then called, it uses a new connection and does not catch the pending writebacks, since this information is kept in a thread-local variable of the previous connection.
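
To make the failure mode above easier to follow, here is a minimal toy model in Python. It is not the server code: the names Pool, Connection, insert, get_last_error and the handle_stale flag are all hypothetical, and pending writebacks are attached directly to the connection object as a simplification of the thread-local bookkeeping described above. It only illustrates how a connection that is never returned to the pool can hide pending writebacks from a later getLastError.

```python
class StaleConfigError(Exception):
    """Hypothetical stand-in for the server's StaleConfigException."""


class Connection:
    def __init__(self, conn_id):
        self.conn_id = conn_id
        # Simplification: writeback bookkeeping lives on the connection.
        self.pending_writebacks = []


class Pool:
    """Toy connection pool: reuse an idle connection, else create one."""

    def __init__(self):
        self.idle = []
        self.created = 0

    def get(self):
        if self.idle:
            return self.idle.pop()
        self.created += 1
        return Connection(self.created)

    def release(self, conn):
        self.idle.append(conn)


def insert(pool, doc, stale, handle_stale):
    """Send one insert; a stale shard version queues a writeback and raises."""
    conn = pool.get()
    try:
        if stale:
            conn.pending_writebacks.append(doc)
            raise StaleConfigError("shard version changed")
    except StaleConfigError:
        if not handle_stale:
            # Buggy path: the connection is never returned to the pool.
            return
    pool.release(conn)


def get_last_error(pool):
    """Return the writebacks visible on whichever connection the pool hands out."""
    conn = pool.get()
    flushed = list(conn.pending_writebacks)
    conn.pending_writebacks.clear()
    pool.release(conn)
    return flushed


def run(handle_stale):
    pool = Pool()
    insert(pool, {"_id": 1}, stale=False, handle_stale=handle_stale)
    insert(pool, {"_id": 2}, stale=True, handle_stale=handle_stale)
    return get_last_error(pool)


print("exception unhandled:", run(handle_stale=False))
print("exception handled:  ", run(handle_stale=True))
```

In this model, when the exception is left unhandled the getLastError call receives a fresh connection and sees no pending writebacks, so the failed insert is never retried. When the exception is handled and the connection goes back to the pool, the same call sees the queued write.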

Comment by auto [ 24/Feb/12 ]

Author: Greg Studer (login: gregstuder, email: greg@10gen.com)

Message: SERVER-4987 if using sharded connections, always need to handle StaleConfigExceptions
Branch: master
https://github.com/mongodb/mongo/commit/f8eee92420b445224ab4efb52a0ae0ad39ff6ecc

Comment by Greg Studer [ 22/Feb/12 ]

New failure - http://buildbot.mongodb.org/builders/Linux%2064-bit%20v8/builds/3032/steps/test_3/logs/stdio/text

The issue is that writebacks (46 of them) are not being processed during the GLE; they are returned much later, after two m/r jobs have completely finished. We suspect the writes are still in flight when the GLE is called and are therefore not yet registered as queued, which can happen if there are multiple connections to a shard.

Comment by Greg Studer [ 16/Feb/12 ]

... i.e. the number of docs is wrong after mapreduce. Previous failures were on the raw inserts to the collection, before any processing.

Generated at Thu Feb 08 03:07:33 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.