[SERVER-58719] drop collection / database hang on sharded cluster Created: 21/Jul/21  Updated: 14/Oct/21  Resolved: 14/Oct/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 5.0.0
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: adrien petel
Assignee: Edwin Zhou
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:
  • create a sharded cluster
  • connect to the mongos with mongo or mongosh
  • create a new collection
  • drop the collection

 

[direct: mongos] admin> use test
switched to db test
[direct: mongos] test> db.createCollection("new", {})
{ ok: 1 }
[direct: mongos] test> db.new.drop()
   -> hangs indefinitely

 

Participants:

 Description   

Since MongoDB 5.0.0, on a sharded cluster with 3 config servers, 2 shards, and 1 mongos, dropping a collection or a database hangs indefinitely without succeeding.

On Ubuntu 20.04

[direct: mongos] admin> sh.status()
shardingVersion
{
  _id: 1,
  minCompatibleVersion: 5,
  currentVersion: 6,
  clusterId: ObjectId("60f7e17702142e807fc3033d")
}
---
shards
[
  {
    _id: 'shard0000',
    host: 'localhost:27021',
    state: 1,
    topologyTime: Timestamp(2, 1626857864)
  },
  {
    _id: 'shard0001',
    host: 'localhost:27022',
    state: 1,
    topologyTime: Timestamp(2, 1626857876)
  }
]
---
active mongoses
[ { '5.0.0': 1 } ]
---
autosplit
{ 'Currently enabled': 'yes' }
---
balancer
{
  'Currently running': 'no',
  'Currently enabled': 'yes',
  'Failed balancer rounds in last 5 attempts': 0,
  'Migration Results for the last 24 hours': { '342': "Failed with error 'aborted', from shard0000 to shard0001" }
}
---
databases
[
  {
    database: { _id: 'config', primary: 'config', partitioned: true },
    collections: {
      'config.system.sessions': {
        shardKey: { _id: 1 },
        unique: false,
        balancing: true,
        chunkMetadata: [ { shard: 'shard0000', nChunks: 1024 } ],
        chunks: [
          'too many chunks to print, use verbose if you want to force print'
        ],
        tags: []
      }
    }
  }
]
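
The shard entries in the output above list each host as a bare host:port. A shard that was added as a replica set is instead listed as replSetName/host:port[,host:port...], so the missing prefix is a visible hint of the standalone-shard deployment that the comments in this ticket identify as the cause. The check can be sketched in JavaScript as below. This is only an illustration: findStandaloneShards is a hypothetical helper, not part of MongoDB or mongosh, and the replica set name "rs0" and its ports are invented for comparison.

```javascript
// Hypothetical helper (not part of MongoDB or mongosh): given shard documents
// shaped like the entries printed by sh.status(), return the _id of every
// shard whose host string has no "<replSetName>/" prefix.
function findStandaloneShards(shards) {
  return shards
    .filter((shard) => !String(shard.host).includes("/"))
    .map((shard) => shard._id);
}

// Shard entries copied from the sh.status() output in this report.
const reportedShards = [
  { _id: "shard0000", host: "localhost:27021", state: 1 },
  { _id: "shard0001", host: "localhost:27022", state: 1 },
];

// A replica-set-backed shard for comparison; "rs0" and the ports are made up.
const replicaSetShard = {
  _id: "shard0002",
  host: "rs0/localhost:27031,localhost:27032",
};

// Both shards from this report are flagged; the replica-set shard is not.
console.log(findStandaloneShards(reportedShards));
console.log(findStandaloneShards([replicaSetShard]));
```

On a live cluster, the same function could presumably be pasted into mongosh and fed the real shard list with findStandaloneShards(db.getSiblingDB("config").shards.find().toArray()).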

 

 



 Comments   
Comment by Edwin Zhou [ 14/Oct/21 ]

Thanks for following up felix2626, I'll go ahead and resolve this issue.

Best,
Edwin

Comment by adrien petel [ 13/Oct/21 ]

Hi @Edwin Zhou,

 

Using --replSet when creating the shards fixed the issue. Thanks for pointing it out.

 

Comment by Edwin Zhou [ 11/Oct/21 ]

Hi felix2626,

We still need additional information to diagnose the problem. If this is still an issue for you, would you please let us know whether you are having trouble creating a sharded cluster whose shards are replica sets?

Best,
Edwin

Comment by Edwin Zhou [ 22/Sep/21 ]

Hi felix2626,

Thanks for your report. In your deploy.sh script, it appears that you're creating the shard nodes as standalone nodes, i.e., the command is missing the --replSet flag. Since MongoDB v3.6, each shard must be deployed as a replica set.

SERVER-27383 implements guardrails in v5.0.3 that prohibit using --shardsvr without --replSet.

Can you modify your script to use --replSet when creating shards and let us know if the error persists?

For additional guidance on deploying a sharded cluster, please visit our documentation

Best,
Edwin

Comment by Eric Sedor [ 28/Jul/21 ]

Thanks felix2626, we'll take a look.

Comment by adrien petel [ 28/Jul/21 ]

Hi @eric.sedor,

 

I've uploaded the logs and diagnostic.data files

Comment by adrien petel [ 23/Jul/21 ]

Hi Eric,

Here are the steps I use to set up the cluster. This script worked with all previous versions of MongoDB (from 3.6 to 4.4).

 

 

git clone https://github.com/feliixx/mongodbShardedCluster.git
cd mongodbShardedCluster
./deploy.sh config.txt /data/db

 

If that's not enough, I'll send the logs and diagnostic data when I get back to my workstation.

 

 

 

Comment by Eric Sedor [ 21/Jul/21 ]

Hi felix2626,

I am not able to reproduce this from scratch, so I suspect an issue with the cluster itself, possibly related to the chunk migration failures seen in sh.status().

To investigate this as a possible bug, we'd like information from the following nodes in the cluster:

  • The mongos where you run the command
  • The config server primary
  • The primary shard for the database containing the collection that is created/dropped

For each of these nodes, please archive (tar or zip) the mongod.log file covering a hanging collection drop attempt, and the $dbpath/diagnostic.data directory (the contents are described here)

The specific time (with timezone) of the hanging attempt will also be helpful.

I've created a secure upload portal for you. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Eric
