[SERVER-16770] mongos blocks db during shardCollection Created: 08/Jan/15  Updated: 03/Jan/18  Resolved: 20/Dec/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.4.14, 2.6.12, 2.8.0-rc4, 3.0.14
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kevin Pulo Assignee: Kaloian Manassiev
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Operating System: ALL
Steps To Reproduce:
  1. Start a small sharded cluster, e.g.:

    mlaunch init --single --sharded 2 --config 3 --mongos 2 --port 37592 --smallfiles

  2. Add some data, such that shardCollection will want to create a large number of initial chunks:

    sh.stopBalancer();
    db.getSiblingDB("config").settings.update( { _id: "chunksize" }, { value: 1 } );
    sh.enableSharding("test");
    var s = (new Array(512*1024+1)).join("x");
    for (i = 0; i < 7157; i++) db.test.insert( { i: i, s: s } );
    db.test.ensureIndex( { i: 1 } );

  3. Shard the collection:

    sh.shardCollection("test.test", { i: 1 } )

  4. While shardCollection is in progress, open a different connection to the same mongos and try any operation on the database whose collection is being sharded, e.g. any or all of:

    db.test.count()
    db.test.findOne()
    db.test.insert({})
    db.foo.count()
    db.foo.findOne()
    db.foo.insert({})

    Any of these commands will block until shardCollection completes.
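To make the blocking in step 4 visible, each probe can be timed from the second connection. The helper below is a minimal sketch; `timeOp` is a hypothetical name, not part of the mongo shell.

```javascript
// Use the mongo shell's print() when available, console.log otherwise.
var log = (typeof print === "function") ? print : console.log;

// Hypothetical helper: run fn(), report how long it took, return the details.
function timeOp(label, fn) {
  var start = Date.now();
  var result = fn();
  var elapsedMs = Date.now() - start;
  log(label + ": " + elapsedMs + " ms");
  return { label: label, elapsedMs: elapsedMs, result: result };
}

// In the mongo shell, on the second connection to the same mongos:
//   timeOp("test.test count", function () { return db.test.count(); });
//   timeOp("test.foo count",  function () { return db.foo.count(); });
```

A probe that reports roughly the remaining shardCollection run time, rather than a few milliseconds, was blocked.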

Participants:
Case:

Description
Symptoms

While shardCollection is running, accesses to the same database (via the same mongos) block until shardCollection finishes, at which point they run. The database remains accessible via other mongos instances, and on the affected mongos, accesses to other databases are unaffected.

Impact

shardCollection can take a long time to run: with 3 config servers, initial chunks are created at a rate of about 10/s in 2.4/2.6, and about 40/s in 2.8. For a collection that needs thousands of initial chunks, the database in question is therefore inaccessible for minutes via the mongos that is doing the shardCollection.
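As a rough worked example, the rates above can be turned into an expected blocking time for the 7157 chunks produced by the repro. This is a back-of-the-envelope sketch; `estimateMinutes` is a hypothetical helper, and the rates are the approximate figures quoted above.

```javascript
// Hypothetical helper: minutes needed to create numChunks at a steady rate.
function estimateMinutes(numChunks, chunksPerSecond) {
  return numChunks / chunksPerSecond / 60;
}

var numChunks = 7157;                              // chunk count from the repro
var minutes24 = estimateMinutes(numChunks, 10);    // 2.4/2.6 at about 10 chunks/s
var minutes28 = estimateMinutes(numChunks, 40);    // 2.8 at about 40 chunks/s

var log = (typeof print === "function") ? print : console.log;
log("2.4/2.6: about " + minutes24.toFixed(1) + " min");  // about 11.9 min
log("2.8:     about " + minutes28.toFixed(1) + " min");  // about 3.0 min
```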

Results

When running the repro, the shardCollection takes about 13 mins to do the initial chunk splits on 2.4/2.6, and about 3 mins on 2.8 (with 3 config servers).

2015-01-08T15:40:46.105+1100 I COMMAND  [conn5] CMD: shardcollection: { shardCollection: "test.test", key: { i: 1.0 } }
2015-01-08T15:40:46.105+1100 I SHARDING [conn5] enable sharding on: test.test with shard key: { i: 1.0 }
2015-01-08T15:40:46.105+1100 I SHARDING [conn5] about to log metadata event: { _id: "genique-2015-01-08T04:40:46-54ae0a4e284cf6acc610c196", server: "genique", clientAddr: "N/A", time: new Date(1420692046105), what: "shardCollection.start", ns: "test.test", details: { shardKey: { i: 1.0 }, collection: "test.test", primary: "shard01:genique:37594", initShards: [], numChunks: 1 } }
2015-01-08T15:40:46.123+1100 I SHARDING [conn5] going to create 7157 chunk(s) for: test.test using new epoch 54ae0a4e284cf6acc610c197
2015-01-08T15:43:17.497+1100 I SHARDING [conn5] ChunkManager: time to load chunks for test.test: 76ms sequenceNumber: 3 version: 1|7156||54ae0a4e284cf6acc610c197 based on: (empty)
2015-01-08T15:43:17.554+1100 I SHARDING [conn5] about to log metadata event: { _id: "genique-2015-01-08T04:43:17-54ae0ae5284cf6acc610c198", server: "genique", clientAddr: "N/A", time: new Date(1420692197554), what: "shardCollection", ns: "test.test", details: { version: "1|7156||54ae0a4e284cf6acc610c197" } }
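The chunk creation rate for this run can be read off the log timestamps. The sketch below assumes the "going to create 7157 chunk(s)" line marks the start of chunk creation and the "ChunkManager: time to load chunks" line marks roughly its end; `parseLogTime` is a hypothetical helper.

```javascript
// Hypothetical helper: parse a log timestamp such as
// "2015-01-08T15:40:46.123+1100" into milliseconds since the epoch.
// The "+1100" offset is rewritten to "+11:00" so Date.parse sees ISO 8601.
function parseLogTime(ts) {
  return Date.parse(ts.replace(/([+-]\d{2})(\d{2})$/, "$1:$2"));
}

var started = parseLogTime("2015-01-08T15:40:46.123+1100");
var finished = parseLogTime("2015-01-08T15:43:17.497+1100");

var elapsedSeconds = (finished - started) / 1000;  // 151.374 s, about 2.5 min
var chunksPerSecond = 7157 / elapsedSeconds;       // roughly 47 chunks/s
```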

While the shardCollection is running, any of the six test commands from step 4 will block until it completes (when run against the "test" db, on the mongos which is doing the shardCollection).

By contrast,

  • The actions do not block if done on the same mongos, but a different db (even one living on the same shard as the affected db).
  • The actions do not block if done on another mongos (either the same db or a different db).
  • The actions do not block if done directly on any shard (for any db).

Hypothesis

It's as if the db lock is being held by the mongos, although I can't imagine why it would need to do that. Unfortunately it's not possible to use currentOp to introspect what's happening inside the shardCollection on the mongos (SERVER-18094).

Workaround

Use a separate, dedicated mongos for the purposes of running shardCollection.
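A minimal sketch of the workaround from the mongo shell follows. `shardViaDedicatedMongos` is a hypothetical helper, and the address in the usage comment is an assumption about where the second mongos from the repro cluster listens. The `connect` parameter exists only so the sketch can be exercised without a cluster.

```javascript
// Hypothetical helper: run shardCollection through one specific mongos, so
// that only that router stalls while the initial chunks are created.
function shardViaDedicatedMongos(host, ns, key, connect) {
  // Default to the mongo shell's Mongo() connection constructor.
  connect = connect || function (h) { return new Mongo(h); };
  var admin = connect(host).getDB("admin");
  // shardCollection is an admin command issued against a mongos.
  return admin.runCommand({ shardCollection: ns, key: key });
}

// In the mongo shell (address is an assumption, adjust to your cluster):
//   shardViaDedicatedMongos("localhost:37593", "test.test", { i: 1 });
```

Application traffic keeps using the other mongos instances, which per the observations above are not blocked.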



Comments
Comment by Kaloian Manassiev [ 20/Dec/16 ]

The cause of this was that the initial chunk creation was done while holding the DBConfig cache mutex. This no longer applies from version 3.2 onwards, where collection sharding is performed independently.
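The toy model below illustrates why such a mutex would produce exactly the reported symptoms. It is not mongos code: `ToyRouter` and its methods are hypothetical, and it assumes the mutex is scoped to one database's cached entry inside one mongos process, which is what the observations suggest.

```javascript
// Database name of a namespace such as "test.test".
function dbOf(ns) {
  return ns.split(".")[0];
}

// Toy stand-in for one mongos: tracks which database entries are locked.
function ToyRouter() {
  this.busyDbs = Object.create(null);
}
ToyRouter.prototype.startShardCollection = function (ns) {
  this.busyDbs[dbOf(ns)] = true;   // lock held for the whole chunk creation
};
ToyRouter.prototype.finishShardCollection = function (ns) {
  delete this.busyDbs[dbOf(ns)];
};
ToyRouter.prototype.wouldBlock = function (ns) {
  return this.busyDbs[dbOf(ns)] === true;
};

var mongosA = new ToyRouter();
var mongosB = new ToyRouter();
mongosA.startShardCollection("test.test");

var sameDb = mongosA.wouldBlock("test.foo");        // true: same db, same mongos
var otherDb = mongosA.wouldBlock("other.foo");      // false: different db
var otherRouter = mongosB.wouldBlock("test.test");  // false: different mongos

mongosA.finishShardCollection("test.test");
var afterwards = mongosA.wouldBlock("test.foo");    // false: lock released
```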
