[SERVER-44052] Inconsistencies in sharded collections Created: 16/Oct/19  Updated: 25/Feb/20  Resolved: 25/Feb/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.0.2, 4.0.4, 4.0.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Stephen Paul Adithela Assignee: Carl Champain (Inactive)
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File alerts_20191030_stats.txt     Text File shardcolcreation.txt    
Issue Links:
Duplicate
duplicates SERVER-32198 Missing collection metadata on the sh... Closed
Related
related to SERVER-43689 getShardDistribution() incorrectly sh... Closed
is related to SERVER-5160 Handle all failed shardCollection com... Closed
Operating System: ALL
Participants:

 Description   

Mongos is reporting inconsistent information for sharded collections.

Our Setup:

A DB cluster with three shards. Each shard is a PSA (primary-secondary-arbiter) replica set. Three config servers.

Mongo Versions: 4.0.X

Scenario:

Mongos reports that collection alerts_20191026 is not sharded, but config.chunks shows three chunks for that namespace, distributed across different shards.

mongos> db.alerts_20191026.getShardDistribution()

Collection hitron.alerts_20191026 is not sharded.

mongos> use config

switched to db config

mongos> db.chunks.find({ns: "hitron.alerts_20191026"})

{ "_id" : "hitron.alerts_20191026-lineId_-3074457345618258602", "lastmod" : Timestamp(2, 0), "lastmodEpoch" : ObjectId("5da687c8a664d0a846cf713f"), "ns" : "hitron.alerts_20191026", "min" : \{ "lineId" : NumberLong("-3074457345618258602") }, "max" : { "lineId" : NumberLong("3074457345618258602") }, "shard" : "rs2", "history" : [ { "validAfter" : Timestamp(1571194825, 261), "shard" : "rs2" }, { "validAfter" : Timestamp(1571194824, 926), "shard" : "rs1" } ] }

{ "_id" : "hitron.alerts_20191026-lineId_3074457345618258602", "lastmod" : Timestamp(3, 0), "lastmodEpoch" : ObjectId("5da687c8a664d0a846cf713f"), "ns" : "hitron.alerts_20191026", "min" : \{ "lineId" : NumberLong("3074457345618258602") }, "max" : { "lineId" : { "$maxKey" : 1 } }, "shard" : "rs3", "history" : [ { "validAfter" : Timestamp(1571194825, 616), "shard" : "rs3" }, { "validAfter" : Timestamp(1571194824, 926), "shard" : "rs1" } ] }

{ "_id" : "hitron.alerts_20191026-lineId_MinKey", "lastmod" : Timestamp(3, 1), "lastmodEpoch" : ObjectId("5da687c8a664d0a846cf713f"), "ns" : "hitron.alerts_20191026", "min" : { "lineId" : { "$minKey" : 1 } }, "max" : { "lineId" : NumberLong("-3074457345618258602") }, "shard" : "rs1", "history" : [ { "validAfter" : Timestamp(1571194824, 926), "shard" : "rs1" } ] }

mongos> db.alerts_20191026.stats().nchunks
1


mongos> sh.status()

hitron.alerts_20191026
shard key: { "lineId" : "hashed" }
unique: false
balancing: true
chunks:
rs1 1
rs2 1
rs3 1
{ "lineId" : { "$minKey" : 1 } } -->> { "lineId" : NumberLong("-3074457345618258602") } on : rs1 Timestamp(3, 1)
{ "lineId" : NumberLong("-3074457345618258602") } -->> { "lineId" : NumberLong("3074457345618258602") } on : rs2 Timestamp(2, 0)
{ "lineId" : NumberLong("3074457345618258602") } -->> { "lineId" : { "$maxKey" : 1 } } on : rs3 Timestamp(3, 0)


 Comments   
Comment by Carl Champain (Inactive) [ 25/Feb/20 ]

Hi,

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Thanks,
Carl
 

Comment by Carl Champain (Inactive) [ 03/Dec/19 ]

Hi sadithela@assia-inc.com, icruz,

Thanks for the additional details on the topology of your sharded cluster.
In order to recreate your issue, we need some more information and context. Can you please provide:

  1. The names of the database commands you run?
  2. The order in which you run them?
  3. Where in the topology they are run?
Comment by Isaac Cruz [ 25/Nov/19 ]

Hi Carl,

We have a sharded cluster with 4 shards, each consisting of a replica set with 2 data-bearing members + 1 arbiter.

We also have 2 separate app servers, with our Java app connecting to a local mongos on each of them. Every day these Java apps create collections for upcoming days (several collections per day), and this is where we shard the collections. When this issue happens, it affects all collections created at that point. One thing that may be relevant (we are not sure whether it matters) is that we create these collections in parallel from the app servers: different collections, but calling shardCollection at the same time on different mongos instances.

Comment by Carl Champain (Inactive) [ 20/Nov/19 ]

Hi sadithela@assia-inc.com, icruz,

Thanks for your patience.

We think that the issue explained in SERVER-32198 is the cause of the behavior described here. To put it simply, the mongos and the mongod have outdated metadata. For example, the shard version is set to unsharded for the alerts_20191111 collection (even though it is in fact sharded), so the output of getShardDistribution() or stats() returns an incorrect state. If only one mongos or mongod were outdated, the most recent one would tell the other to update its version, and this problem wouldn't happen. Importantly, SERVER-32198 is an open ticket, meaning that it is still being examined, so please watch it for updates.

We weren't able to reproduce the issue that causes the mongod to have out-of-date metadata, but we'd be happy to try again. To do so, we would need a complete list of steps, plus the topology of your sharded cluster.
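The refresh logic described here can be modeled abstractly. Below is a minimal sketch in plain JavaScript; it is an illustrative model only, not the server's actual implementation, and the `whoMustRefresh` helper and the version shape are invented for illustration:

```javascript
// Illustrative model: each node tracks a shard version as (epoch, major,
// minor). On a request, router and shard compare versions and the stale
// side refreshes its metadata from the config servers.
function compareShardVersions(a, b) {
  if (a.epoch !== b.epoch) return NaN; // different epochs: incomparable
  if (a.major !== b.major) return a.major - b.major;
  return a.minor - b.minor;
}

function whoMustRefresh(routerVersion, shardVersion) {
  const cmp = compareShardVersions(routerVersion, shardVersion);
  if (Number.isNaN(cmp)) return "both";
  if (cmp < 0) return "router";
  if (cmp > 0) return "shard";
  return "nobody"; // versions match: no refresh is triggered
}
```

In the broken state described above, both sides hold the same stale "unsharded" view, so the comparison lands in the "nobody" case and neither node ever refreshes.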

Kind regards,
Carl

Comment by Isaac Cruz [ 12/Nov/19 ]

Sorry for the late reply; the response we get from Java is "ok: 1", no error.

Comment by Carl Champain (Inactive) [ 11/Nov/19 ]

Hi sadithela@assia-inc.com,

Have you found out what the java responses are?

Thanks,
Carl 

Comment by Stephen Paul Adithela [ 02/Nov/19 ]

Hi Carl,

I have uploaded a zip file called "alerts_20191111" to your secure uploader. It should contain all the mongos, config server, and mongod logs. It should also have the shardVersion output.

Regarding the Java responses, I have to follow up with my colleagues.

Sorry for the late reply.

Thanks,

Stephen

Comment by Carl Champain (Inactive) [ 29/Oct/19 ]

Hi again sadithela@assia-inc.com, icruz,

To help us look more into how the deployment reaches this state:

1. Can you please run getShardVersion() on the shards that have chunks for the alerts_20191030 collection (or any collection affected by the described behavior) and share the output?

2. In the Java code, what response(s) are returned by the shardCollection commands?

3. Can you provide the logs from the mongos, shard primary and config servers?

Ideally, please provide all of these for the same improperly sharded collection.

Please upload your files to our secure uploader here. Only MongoDB engineers can view these files, and they will expire after a period of time.

Thank you,
Carl

Comment by Carl Champain (Inactive) [ 28/Oct/19 ]

Hi sadithela@assia-inc.com, icruz,

Very sorry for the confusion about the Java code. I re-opened the ticket for additional investigation.

You mentioned that you dropped and re-sharded the collection; I want to make sure that you are aware of SERVER-17397, which provides a way to do so properly.

Back to your initial issue, we are currently investigating what the cause might be and are attempting to reproduce the described behavior. We will keep you updated and will reach out if questions come up.

Kind regards,
Carl

Comment by Isaac Cruz [ 24/Oct/19 ]

That difference is due to the shardCollection command vs. the sh.shardCollection helper, which have different syntaxes.

The only differences between the Java code and the shell are:

  1. In Java we use the numInitialChunks parameter, while in the shell we use the default.
  2. In Java, several servers issue shardCollection commands (for different collections) at the same time. We are not sure whether this concurrent shardCollection could cause the bug.
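The syntax difference between the helper and the raw command can be illustrated. Below is a minimal sketch in plain JavaScript of how the sh.shardCollection() helper arguments map onto the shardCollection command document that a driver sends directly; the `buildShardCollectionCommand` helper is invented for illustration:

```javascript
// Sketch of the mapping from sh.shardCollection(ns, key, unique, options)
// helper arguments onto the raw shardCollection command document.
function buildShardCollectionCommand(ns, key, unique = false, options = {}) {
  return Object.assign({ shardCollection: ns, key: key, unique: unique }, options);
}

// Roughly equivalent to:
//   sh.shardCollection('hitron.alerts_20191030', {lineId: 'hashed'},
//                      false, {numInitialChunks: 3})
const cmd = buildShardCollectionCommand(
  "hitron.alerts_20191030",
  { lineId: "hashed" },
  false,
  { numInitialChunks: 3 }
);
```

The shell helper and a driver-issued command document should therefore describe the same operation, which is Isaac's point: the difference is syntax, not semantics.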

Please reopen this ticket because it is indeed a server bug: no matter what we do from the client, the DB should not end up with an inconsistent sharding configuration.

Comment by Carl Champain (Inactive) [ 24/Oct/19 ]

Hi sadithela@assia-inc.com,

 

In the Java code, you are using key as the hashed key:

new BasicDBObject("shardCollection", collection.getFullName())
                         .append("key", new BasicDBObject(shardKey, "hashed"))
                         .append("numInitialChunks", shardCount))

 

This is different from the shell code in which you are using lineId as the hashed key:

 
sh.shardCollection('hitron.alerts_20191030',{lineId: 'hashed'},false,{numInitialChunks: 3})

 

It seems that the Java code should be:

new BasicDBObject("shardCollection", collection.getFullName())
                         .append("lineId", new BasicDBObject(shardKey, "hashed"))
                         .append("numInitialChunks", shardCount))

 

That said, the SERVER project is for bugs and feature suggestions for the MongoDB server. As this ticket does not appear to be a bug, I will now close it. If you need further assistance troubleshooting, I encourage you to ask our community by posting on the mongodb-user group or on Stack Overflow with the mongodb tag.

 

Kind regards,
Carl

Comment by Stephen Paul Adithela [ 23/Oct/19 ]

Hi Carl,

From mongo cli:

sh.shardCollection('hitron.alerts_20191030',{lineId: 'hashed'},false,{numInitialChunks: 3})

From java code:

Attached code snippet in file: shardcolcreation.txt

Comment by Carl Champain (Inactive) [ 23/Oct/19 ]

Hi sadithela@assia-inc.com,

Thanks for sharing the stats.
I still need more details to determine what is happening. Can you please share some sample code showing how you created the collection via the Java driver and via the Shell?

Comment by Stephen Paul Adithela [ 22/Oct/19 ]

Hi Carl,

Since this collection (alerts_20191026) has to be sharded for our production systems to work properly, we dropped it and re-created it as a sharded collection from the mongo CLI. The collection was originally created by the Java driver in our software.

Right now, that collection's shard distribution output:

Shard rs1 at rs1/hitron-db-01a:27018,hitron-db-01b:27018
data : 0B docs : 0 chunks : 1
estimated data per chunk : 0B
estimated docs per chunk : 0

Shard rs2 at rs2/hitron-db-02b:27018,hitron-db-02c:27018
data : 0B docs : 0 chunks : 1
estimated data per chunk : 0B
estimated docs per chunk : 0

Shard rs3 at rs3/hitron-db-03a:27018,hitron-db-03b:27018
data : 0B docs : 0 chunks : 1
estimated data per chunk : 0B
estimated docs per chunk : 0

Totals
data : 0B docs : 0 chunks : 3
Shard rs1 contains NaN% data, NaN% docs in cluster, avg obj size on shard : NaNGiB
Shard rs2 contains NaN% data, NaN% docs in cluster, avg obj size on shard : NaNGiB
Shard rs3 contains NaN% data, NaN% docs in cluster, avg obj size on shard : NaNGiB

This is the output we expect when we create a sharded collection.

The collections created from our software using the Java driver are still facing the same issue as described in this ticket.

Here are the results for a similar collection (alerts_20191030):

mongos> db.alerts_20191030.getShardDistribution()
Collection hitron.alerts_20191030 is not sharded.

mongos> db.chunks.find({ns: "hitron.alerts_20191030"})
{ "_id" : "hitron.alerts_20191030-lineId_-3074457345618258602", "lastmod" : Timestamp(2, 0), "lastmodEpoch" : ObjectId("5dabcdd2a664d0a8465ae63b"), "ns" : "hitron.alerts_20191030", "min" : \{ "lineId" : NumberLong("-3074457345618258602") }, "max" : { "lineId" : NumberLong("3074457345618258602") }, "shard" : "rs2", "history" : [ { "validAfter" : Timestamp(1571540435, 16), "shard" : "rs2" }, { "validAfter" : Timestamp(1571540434, 306), "shard" : "rs1" } ] }
{ "_id" : "hitron.alerts_20191030-lineId_3074457345618258602", "lastmod" : Timestamp(3, 0), "lastmodEpoch" : ObjectId("5dabcdd2a664d0a8465ae63b"), "ns" : "hitron.alerts_20191030", "min" :
{ "lineId" : NumberLong("3074457345618258602") }, "max" : { "lineId" : { "$maxKey" : 1 } }, "shard" : "rs3", "history" : [ { "validAfter" : Timestamp(1571540435, 208), "shard" : "rs3" }, { "validAfter" : Timestamp(1571540434, 306), "shard" : "rs1" } ] }

{ "_id" : "hitron.alerts_20191030-lineId_MinKey", "lastmod" : Timestamp(3, 1), "lastmodEpoch" : ObjectId("5dabcdd2a664d0a8465ae63b"), "ns" : "hitron.alerts_20191030", "min" : { "lineId" : { "$minKey" : 1 } }, "max" : { "lineId" : NumberLong("-3074457345618258602") }, "shard" : "rs1", "history" : [ { "validAfter" : Timestamp(1571540434, 306), "shard" : "rs1" } ] }

Also, as you mentioned, I am attaching the collStats output of the alerts_20191030 collection to this ticket: alerts_20191030_stats.txt

Please let me know if you are in need of any other info.

Thank you.

Useful Info:

mongo-java-driver version: 2.14.3

mongo-async-driver version: 2.0.2


Comment by Carl Champain (Inactive) [ 21/Oct/19 ]

Hi sadithela@assia-inc.com,

Thank you for the report.
Can you please run

db.runCommand({ collStats : "alerts_20191026" })

in mongos and share the output here? This will help me better understand what is happening.

Kind regards,
Carl

Generated at Thu Feb 08 05:04:51 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.