[SERVER-6870] sh.status() fails with decode failed. probably invalid utf-8 string [???] Created: 27/Aug/12  Updated: 11/Jul/16  Resolved: 28/Aug/12

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.2.0-rc2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Edmunds Kalnins Assignee: Scott Hernandez (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 10.04 64bit


Attachments: File dump-configsvr.tar.gz     File dump-configvr-01.tar.gz     File dump-configvr-02.tar.gz    
Operating System: Linux
Participants:

 Description   

I am running a shard with two members, 3 configservers and one mongos. I am currently in a state where on mongos sh.status() always fails. The error message is:

Mon Aug 27 13:58:35 decode failed. probably invalid utf-8 string [???]
Mon Aug 27 13:58:35 why: TypeError: malformed UTF-8 character sequence at offset 0
TypeError: malformed UTF-8 character sequence at offset 0

The same error is also printed when I use the config database and issue db.databases.find({}); I can see 3 databases listed before the error appears. db.databases.find({}).count() shows that there are 5 records there. When connected to the shard members, there are 4 databases on each (and they match, by name at least).
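For reference, this is roughly the shell sequence against mongos that shows the mismatch described above (a sketch of the commands, not a verbatim transcript):

> use config
> db.databases.find({})           // prints 3 documents, then throws the UTF-8 decode error
> db.databases.find({}).count()   // reports 5 documents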

I got into this state during some heavy modifications to the sharding setup, which might be relevant:

I ran db.runCommand({removeshard: "shard0000"}) to remove a shard that was running very low on disk space. Unfortunately, the free space ran out, the mongod process froze/crashed, and it had to be restarted (this was actually because of the post-cleanup.XXXX.bson files being created; I now bravely delete them by hand). The draining process then hung, with the log displaying:

Mon Aug 27 10:57:16 [conn4] about to log metadata event: { _id: "mongoimg-2012-08-27T07:57:16-19", server: "mongoimg", clientAddr: "192.168.100.40:36263", time: new Date(1346054236814), what: "moveChunk.from", ns: "project.fs.chunks", details: { min: { files_id: ObjectId('4f8323e4af8cd13414001317') }, max: { files_id: ObjectId('4f8414daae8cd11f7200004a') }, step1 of 6: 0, note: "aborted" } }

This repeated every 6 seconds. To solve this, I set the shard's draining status to false and restarted the config servers (more than once), then re-enabled draining by issuing db.runCommand({removeshard: "shard0000"}) again and restarted the config servers one by one (this was repeated several times because one of the config servers was moved to a different IP and the repeating log message wouldn't go away).
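For completeness, this is roughly what that draining reset looked like from the mongos shell (only a sketch of the usual way to clear the flag via config.shards, not a verbatim transcript of what I ran):

> use config
> db.shards.update({ _id: "shard0000" }, { $set: { draining: false } })   // clear the draining flag on the shard document
> use admin
> db.runCommand({ removeshard: "shard0000" })   // re-issue to start draining again; repeated calls report remaining chunks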

So, currently the shard seems to be draining normally, but sh.status() always fails.

Nothing interesting in mongos.log; the string "UTF8" doesn't appear. The only thing I see that might be related is:

Mon Aug 27 11:45:03 [Balancer] moveChunk result: { who: { _id: "project.fs.chunks", process: "mongoimg:27017:1345997938:164881649", state: 2, ts: ObjectId('503b20b8706f005dbbff04b8'), when: new Date(1346052280689), who: "mongoimg:27017:1345997938:164881649:conn28:1347786322", why: "migrate-{ files_id: ObjectId('4f8323e4af8cd13414001317') }" }, errmsg: "the collection metadata could not be locked with lock migrate-{ files_id: MinKey }", ok: 0.0 }



 Comments   
Comment by Scott Hernandez (Inactive) [ 28/Aug/12 ]

I would suggest upgrading to 2.2.0 everywhere when it is released within the next week. Also, since we were not able to find where this started or what inserted the bad UTF-8 data, I can't offer you steps to keep it from happening, but I would suggest limiting access to the mongos/config servers/shards on your network to make sure no errant users or processes accidentally write bad data there.

Comment by Edmunds Kalnins [ 28/Aug/12 ]

OK, thanks, dropping and recreating databases collection solved the issue.

Comment by Scott Hernandez (Inactive) [ 27/Aug/12 ]

Yep, looks like invalid UTF-8 in your databases collection:

bsondump dump-configsvr/config/databases.bson 
{ "_id" : "Pi?\u0001", "partitioned" : false, "primary" : "shard0001" }
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "otrapuse", "partitioned" : true, "primary" : "shard0000" }
{ "_id" : "test", "partitioned" : false, "primary" : "shard0000" }
{ "_id" : "???\u0001", "partitioned" : false, "primary" : "shard0001" }
5 objects found

You can fix this by removing all documents (dropping the collection) and re-saving them:

> use config
> db.databases.drop()
> db.databases.save({ "_id" : "admin", "partitioned" : false, "primary" : "config" })
> db.databases.save({ "_id" : "otrapuse", "partitioned" : true, "primary" : "shard0000" })
> db.databases.save({ "_id" : "test", "partitioned" : false, "primary" : "shard0000" })
//restart all mongos instances
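After restarting the mongos instances, a quick sanity check along these lines should confirm the fix (a sketch; the exact output depends on your databases):

> use config
> db.databases.find()    // should now list only the three re-saved entries
> sh.status()            // should no longer throw the UTF-8 decode error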

I would schedule a maintenance window to do this in, in case you have any unforeseen issues.

Comment by Edmunds Kalnins [ 27/Aug/12 ]

Configserver dumps.

Note that what I called the "project" database is actually called "otrapuse".

Comment by Edmunds Kalnins [ 27/Aug/12 ]

The shard that is set to be drained (shard0000) was upgraded from 1.8, I think. The other shard (shard0001) is brand new, running 2.2.0-rc1, as is mongos. One of the config servers is running 2.0, and the other two are on 2.2.0-rc. Basically, we had been running unsharded for a year or so, upgrading when new releases came out. We started running out of disk space, so last week we began moving to a sharded setup, setting up new instances with 2.2.0-rcX (except one config server, which shares a binary with another not-yet-upgraded mongod).

db.getMongo().getDBs()

// shard0000:
{
	"databases" : [
		{
			"name" : "local",
			"sizeOnDisk" : 1291845632,
			"empty" : false
		},
		{
			"name" : "project",
			"sizeOnDisk" : 107809079296,
			"empty" : false
		},
		{
			"name" : "admin",
			"sizeOnDisk" : 1,
			"empty" : true
		},
		{
			"name" : "config",
			"sizeOnDisk" : 1,
			"empty" : true
		}
	],
	"totalSize" : 109100924928,
	"ok" : 1
}
 
// shard0001:
{
	"databases" : [
		{
			"name" : "project",
			"sizeOnDisk" : 53611593728,
			"empty" : false
		},
		{
			"name" : "admin",
			"sizeOnDisk" : 1,
			"empty" : true
		},
		{
			"name" : "config",
			"sizeOnDisk" : 1,
			"empty" : true
		},
		{
			"name" : "local",
			"sizeOnDisk" : 1,
			"empty" : true
		}
	],
	"totalSize" : 53611593728,
	"ok" : 1
}

BTW the "project" collection contains only fs.files and fs.chunks collections, with fs.chunks being sharded.

Comment by Scott Hernandez (Inactive) [ 27/Aug/12 ]

Have these servers always been running 2.2.0-rc0, or were they upgraded from an earlier version? If so, can you describe the system's life from its start until now?

Can you please provide a mongodump of the config db and attach it to this issue? If you have three then one from each would be good.

Can you run this on each shard please:

db.getMongo().getDBs()
