[SERVER-20150] Chunk migration locks are constantly blocking map/reduce Created: 26/Aug/15  Updated: 11/Sep/15  Resolved: 11/Sep/15

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.0.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Alex Assignee: Sam Kleinman (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

Hi.

Please see SERVER-20149 for our deployment details.

We're seeing another problem with the shard balancer. After the upgrade, our app failed to perform map/reduce on sharded collections with the following error:

[conn26] the collection metadata could not be locked for mapreduce, already locked by { _id: "<db>.<collection>", process: "db03:27017:1440329966:296879767", state: 2, ts: ObjectId('55da2f929342a2275e4eb52f'), when: new Date(1440362386369), who: "db03:27017:1440329966:296879767:conn5087:1423199152", why: "migrating chunk [{ : MinKey }, { : MaxKey }) in <db>.<collection>" }
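
For reference, the lock document shown in the error lives in config.locks on the config servers and can be inspected from a mongos shell; something like the following sketch (the namespace is just a placeholder for our real collection):

    use config
    db.locks.find({ _id: "<db>.<collection>" }).pretty()
    // state: 2 means the lock is currently held; the "why" field names the
    // chunk migration that holds it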

I've checked the status of the shard and found that a chunk migration is pending from Shard3 to Shard1, since there is an imbalance in the chunk distribution:

      chunks:
                                shard1     713
                                shard2     715
                                shard3     812
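
For completeness, the per-shard chunk counts above can also be pulled directly from config.chunks via a mongos shell; a rough sketch (again, the namespace is a placeholder):

    use config
    db.chunks.aggregate([
      { $match: { ns: "<db>.<collection>" } },          // chunks for one namespace
      { $group: { _id: "$shard", chunks: { $sum: 1 } } } // count per shard
    ])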

I've checked db.opStatus on the shard1 primary node and found that the migration is blocked by a secondary node, because it is doing its initial sync. We decided to stop the initial sync to give the primary node time to accept chunks from shard3. But after ~2 hours only 2 chunks had actually been migrated and our collection was still locked, so we decided to stop the balancer to allow our app to run again.
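
To stop the balancer we used the standard shell helpers, roughly as follows (exact invocation from memory):

    sh.setBalancerState(false)   // disable the balancer
    sh.getBalancerState()        // confirm it now reports false
    sh.isBalancerRunning()       // wait until any in-flight migration finishes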

Is this by design, or did something really go wrong during the upgrade process? We haven't seen this issue on our 2.4 installation.



 Comments   
Comment by Alex [ 11/Sep/15 ]

Hi Sam,

1) Are you sure it is expected behaviour for the shard balancer to lock a collection for hours (2 hours in our case)?
2) We believe that this is a regression: we haven't had this problem on 2.4.

Otherwise, this means that regardless of what is written in the docs, we simply can't use a sharded collection as map/reduce output, since it will likely be locked by the balancer.
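
For context, the kind of job that fails for us looks roughly like this (map/reduce functions simplified, names are placeholders):

    db.getCollection("<collection>").mapReduce(
      function() { emit(this.key, 1); },                      // illustrative map
      function(key, values) { return Array.sum(values); },    // illustrative reduce
      { out: { reduce: "<output collection>", sharded: true } }
    )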

Comment by Sam Kleinman (Inactive) [ 11/Sep/15 ]

Sorry for the delay in getting back to you.

After discussing this with the teams that work on sharding, it looks like this is in fact the expected behavior: mapReduce requires the distributed lock when running with sharded output, to prevent chunk migrations from interfering with the output of the map/reduce operation.
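
If the lock contention is a problem for your workload, one option you could consider (sketched below, with your namespace as a placeholder) is to disable balancing for just that collection while the job runs, and re-enable it afterwards:

    sh.disableBalancing("<db>.<collection>")   // pause chunk migrations for this namespace
    // ... run the mapReduce job ...
    sh.enableBalancing("<db>.<collection>")    // resume balancing afterwards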

I hope this makes sense, and sorry for any confusion.

Regards,
sam

Comment by Alex [ 26/Aug/15 ]

Oopsie, task should be SERVER-20149
