[SERVER-7926] Map Reduce with sharded output can apply reduce on duplicate documents if a migration happened Created: 13/Dec/12  Updated: 29/Jan/18  Resolved: 29/Jan/18

Status: Closed
Project: Core Server
Component/s: MapReduce, Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ian Whalen (Inactive) Assignee: Randolph Tan
Resolution: Done Votes: 4
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mrShardedOutput.js     Text File mrShardedOutput_with_cmdResult.txt    
Issue Links:
Depends
depends on SERVER-7210 Consolidate internal mongos interfaces Closed
Duplicate
is duplicated by SERVER-14269 Map Reduce fails with duplicate-key w... Closed
is duplicated by SERVER-15024 mapReduce output to sharded collectio... Closed
Related
related to DOCS-9834 MapReduce output on sharded clusters Closed
is related to SERVER-22611 ChunkManager refresh can occasionally... Closed
Operating System: ALL
Sprint: Sharding 2018-02-12
Participants:
Case:

 Description   

Buildbot failure:

http://buildbot.mongodb.org/builders/Linux%2032-bit/builds/5402

If a migration happened on the map reduce job's output collection right before mapreduce.shardedfinish is called, there is a possibility that one of the shards still thinks it owns a chunk that it has already donated to another shard.
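A minimal reproduction sketch of the symptom (the collection names, document counts, and map/reduce functions below are illustrative assumptions, not an excerpt from the attached mrShardedOutput.js): run mapReduce into a sharded output collection while the balancer is migrating chunks of the previous run's output, and compare input and output counts.

 // Hypothetical reproduction sketch: mapReduce with sharded output racing the balancer.
 var st = new ShardingTest({shards: 2, mongos: 1});
 var testDB = st.s.getDB("test");

 st.s.adminCommand({enablesharding: "test"});
 st.s.adminCommand({shardcollection: "test.srcSharded", key: {_id: 1}});

 for (var i = 0; i < 10000; ++i) {
     testDB.srcSharded.insert({_id: i, val: i % 100});
 }

 var mapFn = function() { emit(this._id, 1); };
 var reduceFn = function(key, values) { return Array.sum(values); };

 // A run races its shardedfinish phase against balancer migrations of a previous
 // run's output chunks. If a donor shard's view of chunk ownership is stale,
 // reduce can be applied to the same document twice and the output count no
 // longer matches the input count.
 var res = testDB.srcSharded.mapReduce(mapFn, reduceFn,
                                       {out: {replace: "mrShardedOut", sharded: true}});
 assert.eq(testDB.srcSharded.count(), testDB.mrShardedOut.count(), tojson(res));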



 Comments   
Comment by Randolph Tan [ 23/Jan/18 ]

This should be fixed by SERVER-22611, which made the mapReduceFinish command use the catalogCache instead of accessing the ChunkManager directly.

Comment by Randolph Tan [ 18/Dec/12 ]

Oops. My eyes must be fooling me. The timestamps for the moveChunks were before the drop collection command, so they should have been deleted. Have to dig further. I also found a very easy way to make the test fail, simply by adding these lines to the test:

for (var iter = 0; iter < numIterations; ++iter) {
    if (iter == 0) {
        st.stopBalancer();
    }
    else {
        st.startBalancer();
    }
    // ... rest of the iteration body unchanged ...
}

Comment by Randolph Tan [ 17/Dec/12 ]

Based on the logs, it looks like there is a bug in the finish stage. This is what happened:

1. Drop output collection on shard0.
2. Balancing round moves chunks from shard1 to shard0 <---- // the balancer is adding new docs to shard0!!!!
3. Drop output collection on shard1.
4. Proceed...

Note: the chunks being moved in step #2 are from the output of the previous iteration (a simplified sketch of the interleaving follows below).
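A simplified shell-level sketch of that window (illustrative only: the real drops are issued by mongos inside the mapreduce.shardedfinish path, not from a test script):

 // Simplified illustration of the race (not the actual mongos code): the old
 // output collection is dropped shard-by-shard, so a balancer round that runs
 // in between can migrate old-output chunks back into an already-cleaned shard.
 st.shard0.getCollection("test.mrShardedOut").drop();   // step 1: drop on shard0
 // <-- balancer round here moves old-output chunks shard1 -> shard0 (step 2)
 st.shard1.getCollection("test.mrShardedOut").drop();   // step 3: drop on shard1
 // step 4: shard0 now holds stale documents from the previous iteration's output,
 // so the finish phase can apply reduce to duplicate documents.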

 m30000| Mon Dec 17 15:32:20.698 [conn6] CMD: drop test.mrShardedOut
 
 m30999| Mon Dec 17 15:32:24.883 [Balancer]  ns: test.mrShardedOut going to move { _id: "test.mrShardedOut-_id_ObjectId('50cf812d256383d556ab59ba')", lastmod: Timestamp 11000|1, lastmodEpoch: ObjectId('50cf81365ec0810ee359b56b'), ns: "test.mrShardedOut", min: { _id: ObjectId('50cf812d256383d556ab59ba') }, max: { _id: ObjectId('50cf812d256383d556ab5b88') }, shard: "shard0001" } from: shard0001 to: shard0000 tag []
 
 m30999| Mon Dec 17 15:32:27.392 [Balancer]  ns: test.mrShardedOut going to move { _id: "test.mrShardedOut-_id_ObjectId('50cf812d256383d556ab59ba')", lastmod: Timestamp 11000|1, lastmodEpoch: ObjectId('50cf81365ec0810ee359b56b'), ns: "test.mrShardedOut", min: { _id: ObjectId('50cf812d256383d556ab59ba') }, max: { _id: ObjectId('50cf812d256383d556ab5b88') }, shard: "shard0001" } from: shard0001 to: shard0000 tag []
 
 m30999| Mon Dec 17 15:32:44.215 [Balancer] moving chunk ns: test.mrShardedOut moving ( ns:test.mrShardedOutshard: shard0001:localhost:30001lastmod: 11|1||000000000000000000000000min: { _id: ObjectId('50cf812d256383d556ab59ba') }max: { _id: ObjectId('50cf812d256383d556ab5b88') }) shard0001:localhost:30001 -> shard0000:localhost:30000
 
 m30999| Mon Dec 17 15:32:28.612 [Balancer]  ns: test.mrShardedOut going to move { _id: "test.mrShardedOut-_id_ObjectId('50cf812d256383d556ab59ba')", lastmod: Timestamp 11000|1, lastmodEpoch: ObjectId('50cf81365ec0810ee359b56b'), ns: "test.mrShardedOut", min: { _id: ObjectId('50cf812d256383d556ab59ba') }, max: { _id: ObjectId('50cf812d256383d556ab5b88') }, shard: "shard0001" } from: shard0001 to: shard0000 tag []
 
 m30999| Mon Dec 17 15:32:36.674 [Balancer]  ns: test.mrShardedOut going to move { _id: "test.mrShardedOut-_id_ObjectId('50cf812d256383d556ab59ba')", lastmod: Timestamp 11000|1, lastmodEpoch: ObjectId('50cf81365ec0810ee359b56b'), ns: "test.mrShardedOut", min: { _id: ObjectId('50cf812d256383d556ab59ba') }, max: { _id: ObjectId('50cf812d256383d556ab5b88') }, shard: "shard0001" } from: shard0001 to: shard0000 tag []
 
 m30999| Mon Dec 17 15:32:46.609 [Balancer]  ns: test.mrShardedOut going to move { _id: "test.mrShardedOut-_id_ObjectId('50cf812d256383d556ab59ba')", lastmod: Timestamp 11000|1, lastmodEpoch: ObjectId('50cf81365ec0810ee359b56b'), ns: "test.mrShardedOut", min: { _id: ObjectId('50cf812d256383d556ab59ba') }, max: { _id: ObjectId('50cf812d256383d556ab5b88') }, shard: "shard0001" } from: shard0001 to: shard0000 tag []
 
 m30999| Mon Dec 17 15:33:00.570 [Balancer]  ns: test.mrShardedOut going to move { _id: "test.mrShardedOut-_id_ObjectId('50cf812d256383d556ab59ba')", lastmod: Timestamp 11000|1, lastmodEpoch: ObjectId('50cf81365ec0810ee359b56b'), ns: "test.mrShardedOut", min: { _id: ObjectId('50cf812d256383d556ab59ba') }, max: { _id: ObjectId('50cf812d256383d556ab5b88') }, shard: "shard0001" } from: shard0001 to: shard0000 tag []
 
 m30001| Mon Dec 17 15:33:02.005 [conn3] CMD: drop test.mrShardedOut

// moveChunk cleanup: 4620 docs deleted -> the observed mismatch is 4619 docs!!!!!

 m30001| Mon Dec 17 15:32:02.426 [cleanupOldData-50cf8142c94e4981dc6c1b1d] moveChunk deleted 461 documents for test.mrShardedOut from { _id: MinKey } -> { _id: ObjectId('50cf812d256383d556ab497c') }
 m30001| Mon Dec 17 15:32:04.704 [cleanupOldData-50cf8143c94e4981dc6c1b21] moveChunk deleted 462 documents for test.mrShardedOut from { _id: ObjectId('50cf812d256383d556ab497c') } -> { _id: ObjectId('50cf812d256383d556ab4b4a') }
 m30001| Mon Dec 17 15:32:04.995 [cleanupOldData-50cf8144c94e4981dc6c1b25] moveChunk deleted 462 documents for test.mrShardedOut from { _id: ObjectId('50cf812d256383d556ab4b4a') } -> { _id: ObjectId('50cf812d256383d556ab4d18') }
 m30001| Mon Dec 17 15:32:06.443 [cleanupOldData-50cf8146c94e4981dc6c1b29] moveChunk deleted 462 documents for test.mrShardedOut from { _id: ObjectId('50cf812d256383d556ab4d18') } -> { _id: ObjectId('50cf812d256383d556ab4ee6') }
 m30001| Mon Dec 17 15:32:07.203 [cleanupOldData-50cf8147c94e4981dc6c1b2d] moveChunk deleted 462 documents for test.mrShardedOut from { _id: ObjectId('50cf812d256383d556ab4ee6') } -> { _id: ObjectId('50cf812d256383d556ab50b4') }
 m30001| Mon Dec 17 15:32:08.431 [cleanupOldData-50cf8148c94e4981dc6c1b31] moveChunk deleted 462 documents for test.mrShardedOut from { _id: ObjectId('50cf812d256383d556ab50b4') } -> { _id: ObjectId('50cf812d256383d556ab5282') }
 m30001| Mon Dec 17 15:32:11.595 [cleanupOldData-50cf814bc94e4981dc6c1b35] moveChunk deleted 462 documents for test.mrShardedOut from { _id: ObjectId('50cf812d256383d556ab5282') } -> { _id: ObjectId('50cf812d256383d556ab5450') }
 m30001| Mon Dec 17 15:32:13.301 [cleanupOldData-50cf814cc94e4981dc6c1b39] moveChunk deleted 462 documents for test.mrShardedOut from { _id: ObjectId('50cf812d256383d556ab5450') } -> { _id: ObjectId('50cf812d256383d556ab561e') }
 m30001| Mon Dec 17 15:32:14.871 [cleanupOldData-50cf814dc94e4981dc6c1b3d] moveChunk deleted 462 documents for test.mrShardedOut from { _id: ObjectId('50cf812d256383d556ab561e') } -> { _id: ObjectId('50cf812d256383d556ab57ec') }
 m30001| Mon Dec 17 15:32:16.290 [cleanupOldData-50cf8150c94e4981dc6c1b41] moveChunk deleted 462 documents for test.mrShardedOut from { _id: ObjectId('50cf812d256383d556ab57ec') } -> { _id: ObjectId('50cf812d256383d556ab59ba') }
 m30001| Mon Dec 17 15:33:02.165 [cleanupOldData-50cf817ec94e4981dc6c1b52] moveChunk deleted 1 documents for test.mrShardedOut from { _id: ObjectId('50cf8141256383d556ac320e') } -> { _id: MaxKey }
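Tallying the per-range deletion counts from the cleanupOldData lines above confirms the 4620 figure in the annotation (a quick shell check; the array values are simply copied from the log):

 // Deletion counts from the moveChunk cleanup log lines above.
 var deleted = [461, 462, 462, 462, 462, 462, 462, 462, 462, 462, 1];
 var total = deleted.reduce(function(a, b) { return a + b; });   // 4620
 print(total + " documents deleted, vs. a 4619-document count mismatch");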
