[SERVER-24892] "Creating first chunks failed: Data inconsistency detected amongst config servers" when using 3.2.3+ without replica set config servers Created: 05/Jul/16  Updated: 11/Dec/17  Resolved: 13/Jul/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.2.3
Fix Version/s: 3.2.9

Type: Bug Priority: Major - P3
Reporter: Akira Kurogane Assignee: Randolph Tan
Resolution: Done Votes: 0
Labels: code-only
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Use versions 3.2.3 - 3.2.7 (<= 3.2.2 not tested yet)

The sharded cluster must use the standard <= v3.0 setup of three config servers that are not running as a replica set. The number and type of shards, and the number of mongos nodes, appear to be irrelevant.

The collection being sharded should have a large initial number of chunks, to make the shardCollection command run for a long time. 30 secs or more seems to make it very probable; maybe 60 secs or higher will guarantee it. My repro was done with a 1GB collection in combination with a 1MB chunkSize, i.e. having > 1,000 chunks.
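
As a rough sanity check on the "> 1,000 chunks" figure, the arithmetic can be sketched as below. The function name estimateInitialChunks is hypothetical, and the result is only a ballpark: the real initial chunk count depends on how the shard key values are distributed, not just on total size.

```javascript
// Ballpark estimate only (hypothetical helper, not a MongoDB API).
// Divides the collection size by the configured chunk size.
function estimateInitialChunks(dataSizeMB, chunkSizeMB) {
  if (!(dataSizeMB >= 0) || !(chunkSizeMB > 0)) {
    throw new Error("data size must be >= 0 and chunk size must be > 0");
  }
  // A sharded collection always has at least one chunk.
  return Math.max(1, Math.ceil(dataSizeMB / chunkSizeMB));
}

// Figures from the repro: 1GB collection, 1MB chunkSize.
var estimated = estimateInitialChunks(1024, 1);
```

With the default 64MB chunkSize the same collection would give an estimate of only 16 chunks, which is probably why the problem is hard to hit without lowering chunkSize.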

Sprint: Sharding 17 (07/15/16)
Participants:
Case:

 Description   

During the initial sharding of a collection, the error shown below can occur. While the initial chunk documents are being inserted into config.chunks, the asynchronously running data consistency checking thread can observe an inconsistent view of the config db, and it then raises the "Data inconsistency detected amongst config servers" error.

mongos> sh.shardCollection("test.foo", { "x": 1 })
{
    "ok" : 0,
    "errmsg" : "Creating first chunks failed: Data inconsistency detected amongst config servers",
    "code" : 132
}

This stops the shardCollection command at the point of having inserted some fraction of chunk documents into config.chunks, but no document into config.collections. So if the same command is attempted again then the following error appears:

{ "ok" : 0, "errmsg" : "collection test.foo already sharded with 834 chunks.", "code" : 23 }
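
The half-finished state described above (chunk documents present, no config.collections entry) can be checked with something like the sketch below. describeShardingState is a hypothetical helper, not part of the shell. It assumes configDb behaves like db.getSiblingDB("config") on a 3.2 mongos, where config.chunks documents carry an "ns" field and config.collections documents use the namespace as "_id". It only reads metadata and changes nothing.

```javascript
// Hypothetical diagnostic helper; read-only.
// configDb is assumed to behave like db.getSiblingDB("config").
function describeShardingState(configDb, ns) {
  var chunkCount = configDb.chunks.count({ ns: ns });
  var collDoc = configDb.collections.findOne({ _id: ns });
  var state;
  if (collDoc) {
    state = "registered";   // config.collections entry exists
  } else if (chunkCount > 0) {
    state = "partial";      // chunks were inserted but the collection entry was not
  } else {
    state = "not-sharded";  // no sharding metadata at all
  }
  return { ns: ns, chunkCount: chunkCount, state: state };
}

// Possible usage in the mongo shell connected to a mongos:
//   describeShardingState(db.getSiblingDB("config"), "test.foo")
```

In the situation from this report the expected result would be state "partial" with chunkCount 834.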

The likelihood of having this timing collision with the consistency checking action seems to be very low if you only have a few chunks to insert. I could only reproduce when I had > 1,000 chunks, which in the environment I was using caused the shardCollection command to run for > 30 secs.
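
To illustrate the kind of timing collision meant here, the following is a toy model only, not MongoDB code. It assumes each chunk document reaches the three config servers one after another, and that a checker compares the servers' contents at some arbitrary write step. All names (isConsistent, runInsert, checkAtStep) are made up for the illustration.

```javascript
// Toy model: three in-memory "config servers" represented as arrays.
function isConsistent(servers) {
  var first = JSON.stringify(servers[0]);
  return servers.every(function (s) { return JSON.stringify(s) === first; });
}

// Inserts chunkCount documents, writing each one to the servers in turn.
// The checker fires once, right after write step number checkAtStep.
function runInsert(chunkCount, checkAtStep) {
  var servers = [[], [], []];
  var step = 0;
  var observed = null; // stays null if the checker never fires during the insert
  for (var i = 0; i < chunkCount; i++) {
    for (var s = 0; s < servers.length; s++) {
      servers[s].push({ chunk: i });
      step++;
      if (step === checkAtStep) {
        observed = isConsistent(servers);
      }
    }
  }
  return { observed: observed, finalConsistent: isConsistent(servers) };
}

var midWrite = runInsert(2, 1);    // checker fires after only server 0 has chunk 0
var betweenDocs = runInsert(2, 3); // checker fires after all three have chunk 0
```

In this model midWrite.observed is false while midWrite.finalConsistent is true: the checker sees a mismatch even though the servers agree once the insert finishes. The more write steps an insert has, the more checker runs can land inside such a window.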

Changing the config servers to a replica set (per these instructions) was the only way I could consistently avoid this error while sharding the large collections in my test.



 Comments   
Comment by Randolph Tan [ 10/Nov/17 ]

No other steps are needed

Comment by Pelech Ilja [ 10/Nov/17 ]

Are there any other steps necessary in addition to the upgrade and reconfiguration of the config servers? (i.e. removing records from config.chunks or anything else)

Comment by Githook User [ 13/Jul/16 ]

Author: Randolph Tan (renctan) <randolph@10gen.com>

Message: SERVER-24892 Revert to old behavior of SCCC consistency checking
Branch: v3.2
https://github.com/mongodb/mongo/commit/c81896b30186741b4443eaa0624a314d3543cdc5

Comment by Kaloian Manassiev [ 05/Jul/16 ]

From looking at the code, it appears that the check happens more frequently in 3.2: it runs on each metadata write, whereas in 3.0 it ran only during auto-balancing and auto-split.

In the case reported, the error itself is benign (i.e., there isn't actually any data inconsistency between the config servers), but it is problematic, because it fails the entire shardCollection command.

Unfortunately, there is no workaround for this problem since there is no way to disable the SCCC consistency checker thread.

Comment by Ramon Fernandez Marina [ 05/Jul/16 ]

Thanks for the detailed report akira.kurogane, we're investigating.
