[SERVER-16787] Balancer not working Created: 09/Jan/15  Updated: 22/Jan/15  Resolved: 22/Jan/15

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.4.10, 2.4.11
Fix Version/s: None

Type: Question Priority: Critical - P2
Reporter: Jiangcheng Wu Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: HTML File mongoconfig-servers-collections-diff    
Participants:

 Description   

Hi,
My MongoDB cluster has 2 shards, both of which are replica sets. I replaced the members of both replica sets with new nodes (adding the new members, then removing the old ones), and then migrated all config servers to 3 new machines that keep the same hostnames. Since then the balancer no longer works.
Cluster layout: 18 x mongos + 2 x shards (replica sets) + 3 x config servers

The mongod, mongos, and config server binaries are all version 2.4.11.

From a mongos shell:

mongos> sh.getBalancerState()
false
mongos> sh.startBalancer()
 
assert.soon failed: function (){
        var lock = db.getSisterDB( "config" ).locks.findOne({ _id : lockId })
 
        if( state == false ) return ! lock || lock.state == 0
        if( state == true ) return lock && lock.state == 2
        if( state == undefined ) return (beginTS == undefined && lock) ||
                                        (beginTS != undefined && ( !lock || lock.ts + "" != beginTS + "" ) )
        else return lock && lock.state == state
    }, msg:Waited too long for lock balancer to change to state undefined
Error: Printing Stack Trace
    at printStackTrace (src/mongo/shell/utils.js:37:15)
    at doassert (src/mongo/shell/assert.js:6:5)
    at Function.assert.soon (src/mongo/shell/assert.js:110:60)
    at Function.sh.waitForDLock (src/mongo/shell/utils_sh.js:156:12)
    at Function.sh.waitForBalancer (src/mongo/shell/utils_sh.js:250:12)
    at Function.sh.startBalancer (src/mongo/shell/utils_sh.js:131:8)
    at (shell):1:4
Sat Oct 11 00:30:46.841 assert.soon failed: function (){
        var lock = db.getSisterDB( "config" ).locks.findOne({ _id : lockId })
 
        if( state == false ) return ! lock || lock.state == 0
        if( state == true ) return lock && lock.state == 2
        if( state == undefined ) return (beginTS == undefined && lock) ||
                                        (beginTS != undefined && ( !lock || lock.ts + "" != beginTS + "" ) )
        else return lock && lock.state == state
    }, msg:Waited too long for lock balancer to change to state undefined at src/mongo/shell/assert.js:7
mongos> 
mongos> sh.getBalancerState()
true
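
The assert.soon failure above comes from the shell helper itself: judging by the stack trace, sh.startBalancer() first sets stopped: false in config.settings and then waits for the balancer lock to change state, and only that wait timed out, which is why sh.getBalancerState() still reports true afterwards. A sketch of the same two steps done by hand (roughly what the 2.4 shell helper does under the hood):

// set the balancer to enabled in config.settings
db.getSiblingDB("config").settings.update(
    { _id : "balancer" },
    { $set : { stopped : false } },
    true        // upsert
)
// then check the balancer lock yourself instead of waiting on the helper's timeout
db.getSiblingDB("config").locks.findOne({ _id : "balancer" })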

but the balancer is not actually moving any chunks; the most recent migration is from September 2014:

mongos> sh._lastMigration()
{
	"_id" : "np1-2014-09-13T17:03:27-541478df0ef1637eaeaee578",
	"server" : "np1",
	"clientAddr" : "10.124.57.210:41684",
	"time" : ISODate("2014-09-13T17:03:27.697Z"),
	"what" : "moveChunk.from",
	"ns" : "wnkyjknp2ahihwx5p9xwcjdtosnuvo4n8z2n43tmbe4v875r.PlayerData",
	"details" : {
		"min" : {
			"deviceId" : "8859ABE4-5CAA-460F-BC8F-3EC00B4D3124"
		},
		"max" : {
			"deviceId" : "8895206B-F8BD-4EB2-AE84-2401B3BA0BC0"
		},
		"step1 of 6" : 0,
		"step2 of 6" : 256,
		"step3 of 6" : 5955,
		"step4 of 6" : 26640,
		"step5 of 6" : 1357,
		"step6 of 6" : 0
	}
}

balancer configs:

mongos> db.settings.find()
{ "_id" : "chunksize", "value" : 64 }
{ "_id" : "balancer", "activeWindow" : { "start" : "1:00", "stop" : "5:00" }, "stopped" : false }

Some other information:

mongos> db.locks.find( { _id : "balancer" } ).pretty()
{
        "_id" : "balancer",
        "process" : "app1:27020:1418872912:1804289383",
        "state" : 0,
        "ts" : ObjectId("54aeefd05d996bd55e5ba6d4"),
        "when" : ISODate("2015-01-08T21:00:00.485Z"),
        "who" : "app1:27020:1418872912:1804289383:Balancer:846930886",
        "why" : "doing balance round"
}
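
Here state 0 just means the lock is currently free; the balancer briefly takes it (state 1 while acquiring, 2 while held) for each round and releases it again. A sketch for spotting any distributed locks that are stuck held, plus the most recent lock pings:

// any distributed locks currently held
db.locks.find( { state : { $gt : 0 } } ).pretty()
// most recent pings from the processes that take locks
db.lockpings.find().sort( { ping : -1 } ).limit( 5 )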

Log of the mongos on app1 around that time (the server's local timezone is GMT+8):

Fri Jan  9 01:02:29.998 [Balancer] distributed lock 'balancer/app1:27020:1418872912:1804289383' acquired, ts : 54aeb8255d996bd55e5ba2b7
Fri Jan  9 01:02:30.271 [Balancer] distributed lock 'balancer/app1:27020:1418872912:1804289383' unlocked. 
Fri Jan  9 01:02:48.465 [Balancer] warning: distributed lock 'balancer/app1:27020:1418872912:1804289383 did not propagate properly. :: caused by :: 8017 update not consistent  ns: config.locks query: { _id: "balancer", state: 0, ts: ObjectId('54aeb837a3c711e134a45a99') } update: { $set: { state: 1, who: "app1:27020:1418872912:1804289383:Balancer:846930886", process: "app1:27020:1418872912:1804289383", when: new Date(1420736568376), why: "doing balance round", ts: ObjectId('54aeb8385d996bd55e5ba2b8') } } gle1: { updatedExisting: true, n: 1, lastOp: Timestamp 1420736568000|6, connectionId: 2525, waited: 13, err: null, ok: 1.0 } gle2: { updatedExisting: false, n: 0, lastOp: Timestamp 1420736550000|3, connectionId: 6503, waited: 1, err: null, ok: 1.0 }
Fri Jan  9 01:02:48.469 [Balancer] lock update won, completing lock propagation for 'balancer/app1:27020:1418872912:1804289383'
Fri Jan  9 01:02:48.534 [Balancer] distributed lock 'balancer/app1:27020:1418872912:1804289383' acquired, ts : 54aeb8385d996bd55e5ba2b8
Fri Jan  9 01:02:48.808 [Balancer] distributed lock 'balancer/app1:27020:1418872912:1804289383' unlocked.
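
The 8017 "update not consistent" warning shows the lock update matching on one config server but not on another (compare gle1 and gle2 above), so the config servers were disagreeing at that moment. A sketch for comparing the balancer lock document on each config server directly, using the hostnames from the dbHash output below:

// fetch the balancer lock document from each config server
[ "mgcfg1:27019", "mgcfg2:27019", "mgcfg3:27019" ].forEach(function (host) {
    print(host);
    printjson(new Mongo(host).getDB("config").locks.findOne({ _id : "balancer" }));
});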

I also ran dbHash on all three config servers:

# server 1
configsvr> db.runCommand({dbHash:1})
{
        "numCollections" : 13,
        "host" : "mgcfg1:27019",
        "collections" : {
                "changelog" : "375ecc22738a993cd372979de468318c",
                "chunks" : "7c226ccb8703980fb2435af7752fab6b",
                "collections" : "e8b0c16d75a6eadec3298e452f5107a7",
                "databases" : "ef13a57ab08f66b4b5c424ce80382a79",
                "lockpings" : "e9a28df83cfefcf4a85b18b43427cd91",
                "locks" : "84c4b019ffbee93a834a510d61f993ff",
                "mongos" : "0b6d8e7277be46909e413c411b257741",
                "settings" : "d4d9768de4d9c3b5168bbab7681201c1",
                "shards" : "24c99c3de50fd10349ec93df4ed5e3fa",
                "tags" : "d41d8cd98f00b204e9800998ecf8427e",
                "version" : "f53c462a5f204958b665e49f27da5a14"
        },
        "md5" : "51bf3433e738fbb7a049e6b6337dd48e",
        "timeMillis" : 0,
        "fromCache" : [
                "config.changelog",
                "config.chunks",
                "config.collections",
                "config.databases",
                "config.locks",
                "config.settings",
                "config.shards",
                "config.tags",
                "config.version"
        ],
        "ok" : 1
}
 
# server2
configsvr> db.runCommand({dbHash:1})
{
        "numCollections" : 13,
        "host" : "mgcfg2:27019",
        "collections" : {
                "changelog" : "375ecc22738a993cd372979de468318c",
                "chunks" : "7c226ccb8703980fb2435af7752fab6b",
                "collections" : "e8b0c16d75a6eadec3298e452f5107a7",
                "databases" : "ef13a57ab08f66b4b5c424ce80382a79",
                "lockpings" : "8f0240a565553a35152e3ad5ce1bad36",
                "locks" : "84c4b019ffbee93a834a510d61f993ff",
                "mongos" : "e63914c568c08ea4dd1261e71cf1e125",
                "settings" : "d4d9768de4d9c3b5168bbab7681201c1",
                "shards" : "24c99c3de50fd10349ec93df4ed5e3fa",
                "tags" : "d41d8cd98f00b204e9800998ecf8427e",
                "version" : "f53c462a5f204958b665e49f27da5a14"
        },
        "md5" : "5201c1a258315e583dbcbfc04d99f5d2",
        "timeMillis" : 0,
        "fromCache" : [
                "config.changelog",
                "config.chunks",
                "config.collections",
                "config.databases",
                "config.locks",
                "config.settings",
                "config.shards",
                "config.tags",
                "config.version"
        ],
        "ok" : 1
}
 
# server 3
configsvr> db.runCommand({dbHash:1})
{
        "numCollections" : 13,
        "host" : "mgcfg3:27019",
        "collections" : {
                "changelog" : "375ecc22738a993cd372979de468318c",
                "chunks" : "7c226ccb8703980fb2435af7752fab6b",
                "collections" : "e8b0c16d75a6eadec3298e452f5107a7",
                "databases" : "ef13a57ab08f66b4b5c424ce80382a79",
                "lockpings" : "26e5c9c4bf4a3cb541e42fb10e6f7bec",
                "locks" : "84c4b019ffbee93a834a510d61f993ff",
                "mongos" : "f3410089e23f3c9dde90db27427f9260",
                "settings" : "d4d9768de4d9c3b5168bbab7681201c1",
                "shards" : "24c99c3de50fd10349ec93df4ed5e3fa",
                "tags" : "d41d8cd98f00b204e9800998ecf8427e",
                "version" : "f53c462a5f204958b665e49f27da5a14"
        },
        "md5" : "e6f192b490340f5562efa423049607f8",
        "timeMillis" : 0,
        "fromCache" : [
                "config.changelog",
                "config.chunks",
                "config.collections",
                "config.databases",
                "config.locks",
                "config.settings",
                "config.shards",
                "config.tags",
                "config.version"
        ],
        "ok" : 1
}
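
In the output above, changelog, chunks, collections, databases, locks, settings, shards, tags, and version hash identically on all three servers; only lockpings and mongos differ. A small sketch, assuming the same hostnames, that prints which config collections disagree:

// compare per-collection dbHash results across the three config servers
var hosts = [ "mgcfg1:27019", "mgcfg2:27019", "mgcfg3:27019" ];
var hashes = hosts.map(function (h) {
    return new Mongo(h).getDB("config").runCommand({ dbHash : 1 }).collections;
});
Object.keys(hashes[0]).forEach(function (coll) {
    var same = hashes[0][coll] === hashes[1][coll] && hashes[1][coll] === hashes[2][coll];
    print(coll + " : " + (same ? "match" : "MISMATCH"));
});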

I have attached the config.mongos and config.lockpings collections (the mongoconfig-servers-collections-diff attachment) in case they help.

I don't know what went wrong or how to fix it; any reply would be appreciated.



 Comments   
Comment by Ramon Fernandez Marina [ 22/Jan/15 ]

Since the issue has disappeared I'm now resolving this ticket. If this happens again feel free to re-open it.

Regards,
Ramón.

Comment by Jiangcheng Wu [ 16/Jan/15 ]

After adding a new shard, the balancer suddenly started working again:

mongos> sh._lastMigration()
{
        "_id" : "chat7-2015-01-15T20:23:33-54b821c54d95a35bf844c478",
        "server" : "chat7",
        "clientAddr" : "10.10.16.120:59610",
        "time" : ISODate("2015-01-15T20:23:33.510Z"),
        "what" : "moveChunk.from",
        "ns" : "wnkyjknp2ahihwx5p9xwcjdtosnuvo4n8z2n43tmbe4v875r.PlayerData",
        "details" : {
                "min" : {
                        "deviceId" : "89C10E71-AE3B-46D6-B3C8-8BE78412744C"
                },
                "max" : {
                        "deviceId" : "89c4a50c893db689d5e8146ec80977c1"
                },
                "step1 of 6" : 0,
                "step2 of 6" : 413,
                "step3 of 6" : 121,
                "step4 of 6" : 17419,
                "step5 of 6" : 572,
                "step6 of 6" : 0
        }
}

Comment by Jiangcheng Wu [ 13/Jan/15 ]

It looks like the balancer acquires the lock just fine:

mongos> db.locks.find( { _id : "balancer" } ).pretty()
{
        "_id" : "balancer",
        "process" : "mgsh31:27020:1414921091:1804289383",
        "state" : 2,
        "ts" : ObjectId("54b557ef87131372d22bfc8d"),
        "when" : ISODate("2015-01-13T17:37:51.645Z"),
        "who" : "mgsh31:27020:1414921091:1804289383:Balancer:846930886",
        "why" : "doing balance round"
}
mongos> sh.isBalancerRunning()
true
mongos> db.locks.find( { _id : "balancer" } ).pretty()
{
        "_id" : "balancer",
        "process" : "mgsh31:27020:1414921091:1804289383",
        "state" : 1,
        "ts" : ObjectId("54b5580987131372d22bfc8f"),
        "when" : ISODate("2015-01-13T17:38:17.255Z"),
        "who" : "mgsh31:27020:1414921091:1804289383:Balancer:846930886",
        "why" : "doing balance round"
}

but config.changelog shows no new moveChunk entries:

mongos> db.changelog.find({ what : /^moveChunk/ }).sort({ time : -1 }).limit( 10 )
{ "_id" : "np1-2014-09-13T17:03:27-541478df0ef1637eaeaee578", "server" : "np1", "clientAddr" : "10.124.57.210:41684", "time" : ISODate("2014-09-13T17:03:27.697Z"), "what" : "moveChunk.from", "ns" : "wnkyjknp2ahihwx5p9xwcjdtosnuvo4n8z2n43tmbe4v875r.PlayerData", "details" : { "min" : { "deviceId" : "8859ABE4-5CAA-460F-BC8F-3EC00B4D3124" }, "max" : { "deviceId" : "8895206B-F8BD-4EB2-AE84-2401B3BA0BC0" }, "step1 of 6" : 0, "step2 of 6" : 256, "step3 of 6" : 5955, "step4 of 6" : 26640, "step5 of 6" : 1357, "step6 of 6" : 0 } }
{ "_id" : "np1-2014-09-13T17:03:27-541478df0ef1637eaeaee576", "server" : "np1", "clientAddr" : "10.124.57.210:41684", "time" : ISODate("2014-09-13T17:03:27.286Z"), "what" : "moveChunk.commit", "ns" : "wnkyjknp2ahihwx5p9xwcjdtosnuvo4n8z2n43tmbe4v875r.PlayerData", "details" : { "min" : { "deviceId" : "8859ABE4-5CAA-460F-BC8F-3EC00B4D3124" }, "max" : { "deviceId" : "8895206B-F8BD-4EB2-AE84-2401B3BA0BC0" }, "from" : "SHARD02", "to" : "SHARD03" } }
{ "_id" : "chat1-2014-09-13T17:03:27-541478df18f5452362059cf1", "server" : "chat1", "clientAddr" : ":27017", "time" : ISODate("2014-09-13T17:03:27.142Z"), "what" : "moveChunk.to", "ns" : "wnkyjknp2ahihwx5p9xwcjdtosnuvo4n8z2n43tmbe4v875r.PlayerData", "details" : { "min" : { "deviceId" : "8859ABE4-5CAA-460F-BC8F-3EC00B4D3124" }, "max" : { "deviceId" : "8895206B-F8BD-4EB2-AE84-2401B3BA0BC0" }, "step1 of 5" : 0, "step2 of 5" : 0, "step3 of 5" : 26562, "step4 of 5" : 0, "step5 of 5" : 999 } }
{ "_id" : "np1-2014-09-13T17:02:53-541478bd0ef1637eaeaee575", "server" : "np1", "clientAddr" : "10.124.57.210:41684", "time" : ISODate("2014-09-13T17:02:53.528Z"), "what" : "moveChunk.start", "ns" : "wnkyjknp2ahihwx5p9xwcjdtosnuvo4n8z2n43tmbe4v875r.PlayerData", "details" : { "min" : { "deviceId" : "8859ABE4-5CAA-460F-BC8F-3EC00B4D3124" }, "max" : { "deviceId" : "8895206B-F8BD-4EB2-AE84-2401B3BA0BC0" }, "from" : "SHARD02", "to" : "SHARD03" } }
{ "_id" : "np1-2014-09-13T17:02:48-541478b80ef1637eaeaee573", "server" : "np1", "clientAddr" : "10.124.57.210:41684", "time" : ISODate("2014-09-13T17:02:48.326Z"), "what" : "moveChunk.from", "ns" : "wnkyjknp2ahihwx5p9xwcjdtosnuvo4n8z2n43tmbe4v875r.PlayerData", "details" : { "min" : { "deviceId" : "88205348-749B-4D29-B458-8D382088DAE8" }, "max" : { "deviceId" : "8859ABE4-5CAA-460F-BC8F-3EC00B4D3124" }, "step1 of 6" : 0, "step2 of 6" : 968, "step3 of 6" : 9484, "step4 of 6" : 24590, "step5 of 6" : 833, "step6 of 6" : 0 } }
{ "_id" : "np1-2014-09-13T17:02:47-541478b70ef1637eaeaee571", "server" : "np1", "clientAddr" : "10.124.57.210:41684", "time" : ISODate("2014-09-13T17:02:47.570Z"), "what" : "moveChunk.commit", "ns" : "wnkyjknp2ahihwx5p9xwcjdtosnuvo4n8z2n43tmbe4v875r.PlayerData", "details" : { "min" : { "deviceId" : "88205348-749B-4D29-B458-8D382088DAE8" }, "max" : { "deviceId" : "8859ABE4-5CAA-460F-BC8F-3EC00B4D3124" }, "from" : "SHARD02", "to" : "SHARD03" } }
{ "_id" : "chat1-2014-09-13T17:02:47-541478b718f5452362059cf0", "server" : "chat1", "clientAddr" : ":27017", "time" : ISODate("2014-09-13T17:02:47.395Z"), "what" : "moveChunk.to", "ns" : "wnkyjknp2ahihwx5p9xwcjdtosnuvo4n8z2n43tmbe4v875r.PlayerData", "details" : { "min" : { "deviceId" : "88205348-749B-4D29-B458-8D382088DAE8" }, "max" : { "deviceId" : "8859ABE4-5CAA-460F-BC8F-3EC00B4D3124" }, "step1 of 5" : 0, "step2 of 5" : 0, "step3 of 5" : 23414, "step4 of 5" : 0, "step5 of 5" : 1175 } }
{ "_id" : "np1-2014-09-13T17:02:12-541478940ef1637eaeaee570", "server" : "np1", "clientAddr" : "10.124.57.210:41684", "time" : ISODate("2014-09-13T17:02:12.614Z"), "what" : "moveChunk.start", "ns" : "wnkyjknp2ahihwx5p9xwcjdtosnuvo4n8z2n43tmbe4v875r.PlayerData", "details" : { "min" : { "deviceId" : "88205348-749B-4D29-B458-8D382088DAE8" }, "max" : { "deviceId" : "8859ABE4-5CAA-460F-BC8F-3EC00B4D3124" }, "from" : "SHARD02", "to" : "SHARD03" } }
{ "_id" : "np1-2014-09-13T17:02:06-5414788e0ef1637eaeaee56e", "server" : "np1", "clientAddr" : "10.124.57.211:39062", "time" : ISODate("2014-09-13T17:02:06.608Z"), "what" : "moveChunk.from", "ns" : "wnkyjknp2ahihwx5p9xwcjdtosnuvo4n8z2n43tmbe4v875r.PlayerData", "details" : { "min" : { "deviceId" : "87F15A64-E4C8-47FE-895E-4393B2C2D6B3" }, "max" : { "deviceId" : "88205348-749B-4D29-B458-8D382088DAE8" }, "step1 of 6" : 0, "step2 of 6" : 218, "step3 of 6" : 168, "step4 of 6" : 36888, "step5 of 6" : 444, "step6 of 6" : 0 } }
{ "_id" : "np1-2014-09-13T17:02:06-5414788e0ef1637eaeaee56c", "server" : "np1", "clientAddr" : "10.124.57.211:39062", "time" : ISODate("2014-09-13T17:02:06.265Z"), "what" : "moveChunk.commit", "ns" : "wnkyjknp2ahihwx5p9xwcjdtosnuvo4n8z2n43tmbe4v875r.PlayerData", "details" : { "min" : { "deviceId" : "87F15A64-E4C8-47FE-895E-4393B2C2D6B3" }, "max" : { "deviceId" : "88205348-749B-4D29-B458-8D382088DAE8" }, "from" : "SHARD02", "to" : "SHARD03" } }

Maybe something goes wrong before the balancer decides to execute a moveChunk command.
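
If the lock is acquired and released but nothing new appears in the changelog, one possibility is that the balancer simply finds nothing it is allowed to move, e.g. the chunk distribution is already within threshold or the candidate chunks are flagged jumbo. A sketch for checking chunk counts per shard and any jumbo flags on that collection (run from a mongos against the config database):

// chunk counts per shard for the collection from the changelog above
db.chunks.aggregate([
    { $match : { ns : "wnkyjknp2ahihwx5p9xwcjdtosnuvo4n8z2n43tmbe4v875r.PlayerData" } },
    { $group : { _id : "$shard", chunks : { $sum : 1 } } }
])
// chunks the balancer will skip because they are marked jumbo
db.chunks.find( { ns : "wnkyjknp2ahihwx5p9xwcjdtosnuvo4n8z2n43tmbe4v875r.PlayerData", jumbo : true } ).count()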

Comment by Jiangcheng Wu [ 09/Jan/15 ]

P.S. We use ntpdate to keep the clocks in sync:

# m h  dom mon dow   command
5 */2 * * * /usr/sbin/ntpdate 0.asia.pool.ntp.org  && /sbin/hwclock -w
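
Since distributed locking is sensitive to clock skew between the config servers, it might also be worth comparing the clocks directly rather than trusting the cron entry. A sketch, reusing the config server hostnames from above:

// print each config server's idea of the current time
[ "mgcfg1:27019", "mgcfg2:27019", "mgcfg3:27019" ].forEach(function (host) {
    var status = new Mongo(host).getDB("admin").runCommand({ serverStatus : 1 });
    print(host + " localTime: " + status.localTime);
});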
