[SERVER-10662] Sharding stopped working on a collection Created: 02/Sep/13  Updated: 04/Apr/23  Resolved: 16/Dec/13

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.2.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Anthony Pastor Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: crash, mongos, sharding
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux debian 6.0.5


Operating System: Linux
Participants:

 Description   

Hello,

We experienced a problem on our MongoDB sharded cluster.
We use MongoDB 2.2.3 on Debian Linux 6.0.5.

One of the 3 config servers failed recently.
The server came back online a few hours later.
In the mongos log file we observed:

Thu Aug 29 10:50:00 [CheckConfigServers] ERROR: config servers 172.16.16.1:27019 and 172.16.18.1:27019 differconfig servers 172.16.16.1:27019 and 172.16.18.1:27019 differconfig servers 172.16.16.1:27019 and 172.16.18.1:27019 differconfig servers 172.16.16.1:27019 and 172.16.18.1:27019 differconfig servers not in sync! config servers 172.16.16.1:27019 and 172.16.18.1:27019 differ
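
To narrow down which part of the config metadata differs, the per-collection hashes from each config server can be compared. The following is a minimal sketch in plain JavaScript, assuming each input object has the shape returned by running the `dbHash` command against a config server's `config` database (a `collections` map of collection name to md5). The helper name `differingCollections` and the sample hash values are hypothetical placeholders, not output from this cluster.

```javascript
// Return the names of collections whose md5 differs (or is missing on one
// side) between two dbHash-style results.
function differingCollections(resultA, resultB) {
  const names = new Set([
    ...Object.keys(resultA.collections),
    ...Object.keys(resultB.collections),
  ]);
  return [...names]
    .filter((name) => resultA.collections[name] !== resultB.collections[name])
    .sort();
}

// Hypothetical sample results: the md5 strings are placeholders only.
const server1Hash = {
  host: "172.16.16.1:27019",
  collections: { chunks: "aaa", collections: "bbb", shards: "ccc" },
};
const server2Hash = {
  host: "172.16.18.1:27019",
  collections: { chunks: "ddd", collections: "bbb", shards: "ccc" },
};

console.log(differingCollections(server1Hash, server2Hash)); // [ 'chunks' ]
```

With real hashes in place of the placeholders, an empty result would suggest the two config servers hold the same metadata.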

To recover from this state, we did the following:

1) Disabled the balancer with:
sh.setBalancerState(false)

2) Stopped the mongodb-conf daemon on the failed server (server1) and on a second server (server2). We left the third server and its config server running.

3) Rsynced the configdb data from server2 to server1

4) Restarted the mongodb-conf daemon on server2: OK

5) Restarted the mongodb-conf daemon on server1: OK

6) Re-enabled the balancer with:

sh.setBalancerState(true)

Everything seemed OK, but we now see this error in the logs:

[Balancer] caught exception while doing balance: not sharded:rawlogs.raw_log

The collections seem to be present, but sharding is not working:

mongos> db.collections.find()
{ "_id" : "rawlogs.raw_log", "lastmod" : ISODate("1970-01-16T19:08:22.332Z"), "dropped" : false, "key" : { "_id" : 1 }, "unique" : false, "lastmodEpoch" : ObjectId("515a78325c52d82fad24aa03") }
{ "_id" : "rawlogs.raw_log_ghost", "lastmod" : ISODate("1970-01-16T22:04:08.874Z"), "dropped" : false, "key" : { "_id" : 1 }, "unique" : false, "lastmodEpoch" : ObjectId("51fbaf295c52d82fad24eccb") }
mongos>

mongos> db.raw_log.stats()
{
    "sharded" : false,
    "primary" : "shard1",
    "ns" : "rawlogs.raw_log",
    "count" : 2380607210,
    "size" : 1044708269072,
    "avgObjSize" : 438.84109259334724,
    "storageSize" : NumberLong("1116861681584"),
    "numExtents" : 541,
    "nindexes" : 1,
    "lastExtentSize" : 2146426864,
    "paddingFactor" : 1,
    "systemFlags" : 1,
    "userFlags" : 0,
    "totalIndexSize" : 80006256176,
    "indexSizes" : { "_id_" : 80006256176 },
    "ok" : 1
}
mongos>

But the sharding configuration still appears to be in place, as it was before:

mongos> sh.status()
--- Sharding Status ---
  sharding version: { "_id" : 1, "version" : 3 }
  shards:
    { "_id" : "shard1", "host" : "shard1/172.16.19.1:27018,172.16.19.2:27018" }
    { "_id" : "shard2", "host" : "shard2/172.16.19.3:27018,172.16.19.4:27018" }
    { "_id" : "shard3", "host" : "shard3/172.16.19.5:27018,172.16.19.6:27018" }
    { "_id" : "shard4", "host" : "shard4/172.16.19.7:27018,172.16.19.8:27018" }
    { "_id" : "shard5", "host" : "shard5/172.16.19.10:27018,172.16.19.9:27018" }
  databases:
    { "_id" : "admin", "partitioned" : false, "primary" : "config" }
    { "_id" : "rawlogs", "partitioned" : true, "primary" : "shard1" }
      rawlogs.raw_log chunks:
        shard1 24615
        shard2 8279
        shard3 3314
        shard4 3498
        shard5 10263
        too many chunks to print, use verbose if you want to force print
      rawlogs.raw_log_ghost chunks:
        shard1 368
        shard3 277
        shard2 277
        shard4 414
        shard5 1162
        too many chunks to print, use verbose if you want to force print
    { "_id" : "tempstats", "partitioned" : false, "primary" : "shard1" }
    { "_id" : "test", "partitioned" : false, "primary" : "shard3" }
    { "_id" : "stats", "partitioned" : false, "primary" : "shard5" }
    { "_id" : "rawlog", "partitioned" : false, "primary" : "shard5" }

mongos>

We also retried the procedure (recovering config server 1 from config server 2) with the mongos stopped.
This did not help either. When we restarted the mongos and re-enabled the balancer, the logs showed:

Fri Aug 30 10:11:36 [Balancer] warning: got invalid chunk version 1|0||521f0c0563b2cfc94d8fad9b in document { _id: "rawlogs.raw_log-_id_MinKey", lastmod: Timestamp 1000|0, lastmodEpoch: ObjectId('521f0c0563b2cfc94d8fad9b'), ns: "rawlogs.raw_log", min: { _id: MinKey }, max: { _id: BinData }, shard: "shard1" } when trying to load differing chunks at version 0|0||515a78325c52d82fad24aa03
Fri Aug 30 10:11:36 [Balancer] warning: major change in chunk information found when reloading rawlogs.raw_log, previous version was 0|0||515a78325c52d82fad24aa03
Fri Aug 30 10:11:36 [Balancer] ChunkManager: time to load chunks for rawlogs.raw_log: 48ms sequenceNumber: 2 version: 0|0||000000000000000000000000 based on: (empty)
Fri Aug 30 10:11:36 [Balancer] warning: no chunks found for collection rawlogs.raw_log, assuming unsharded
Fri Aug 30 10:11:36 [Balancer] ChunkManager: time to load chunks for rawlogs.raw_log_ghost: 31ms sequenceNumber: 3 version: 707|393||51fbaf295c52d82fad24eccb based on: (empty)
Fri Aug 30 10:11:36 [Balancer] distributed lock 'balancer/mycompt.local:27021:1377850265:1804289383' unlocked.
Fri Aug 30 10:11:36 [Balancer] scoped connection to 172.16.16.1:27019,172.16.18.1:27019,172.16.18.2:27019 not being returned to the pool
Fri Aug 30 10:11:36 [Balancer] caught exception while doing balance: not sharded:rawlogs.raw_log
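
The first warning above reports a chunk document whose epoch differs from the epoch the mongos had loaded for the collection. The following is a minimal sketch, in plain JavaScript, that parses both version strings copied from the log and compares their epochs. It also decodes the creation time embedded in each epoch, relying on the fact that the first four bytes of an ObjectId are a Unix timestamp in seconds. The function names `parseChunkVersion` and `objectIdTime` are illustrative helpers, not part of the mongo shell.

```javascript
// Parse a chunk version string of the form "major|minor||epoch",
// as printed in the mongos log lines above.
function parseChunkVersion(text) {
  const match = /^(\d+)\|(\d+)\|\|([0-9a-f]{24})$/.exec(text);
  if (match === null) {
    throw new Error("unrecognized chunk version: " + text);
  }
  return { major: Number(match[1]), minor: Number(match[2]), epoch: match[3] };
}

// The first 4 bytes (8 hex digits) of an ObjectId hold its creation
// time as seconds since the Unix epoch.
function objectIdTime(hex) {
  return new Date(parseInt(hex.slice(0, 8), 16) * 1000);
}

// Both strings are copied verbatim from the log.
const chunkVersion = parseChunkVersion("1|0||521f0c0563b2cfc94d8fad9b");
const loadedVersion = parseChunkVersion("0|0||515a78325c52d82fad24aa03");

console.log("epochs match:", chunkVersion.epoch === loadedVersion.epoch);
console.log("chunk epoch created:", objectIdTime(chunkVersion.epoch).toISOString());
console.log("loaded epoch created:", objectIdTime(loadedVersion.epoch).toISOString());
```

Run with Node.js, this reports that the epochs do not match: the chunk's epoch was generated on 2013-08-29 (UTC), while the epoch the mongos had loaded dates from 2013-04-02. Whether this means the chunk metadata was recreated around the time of the incident is only an assumption and would need to be checked against the config database itself.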

We don't want sharding to start over from scratch.
We would like to restore the previous state (as it was before config server 1 failed).

We have already tried to refresh the mongos with:
db.adminCommand({ flushRouterConfig: 1 })
without any improvement.

Unfortunately we haven't preserved the contents of the crashed config server that we replaced.

Any ideas on how to resume sharding, please?



 Comments   
Comment by Stennie Steneker (Inactive) [ 16/Dec/13 ]

Hi Anthony,

Apologies for the delayed response on this issue. The SERVER project is only intended for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion please post on the mongodb-user group (http://groups.google.com/group/mongodb-user) or Stack Overflow / ServerFault.

Given you reported this issue some time ago, I'm going to assume that you have since found a solution.

Regards,
Stephen
