[SERVER-34498] mongo balancer often won't start back up
Created: 16/Apr/18 Updated: 27/Oct/23 Resolved: 16/Apr/18
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Matthew Kruse | Assignee: | Kaloian Manassiev |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Description |
sh.status() output
No locks
This collection is sharded so it ought to be doing something. It's done nothing in over 4 days and I have multiple collections out of balance that it should be working on.

mongos> db.chunks.aggregate([ { $match: {ns: "some.collection"}}, { $group:{ _id: "$shard", cnt: { $sum: 1}} }, {$sort: {"_id": 1}} ]);
{ "_id" : "rs0", "cnt" : 647 }
{ "_id" : "rs1", "cnt" : 648 }
{ "_id" : "rs2", "cnt" : 648 }
{ "_id" : "rs3", "cnt" : 648 }
{ "_id" : "rs4", "cnt" : 648 }
{ "_id" : "rs5", "cnt" : 648 }
{ "_id" : "rs6", "cnt" : 649 }
{ "_id" : "rs7", "cnt" : 648 }
{ "_id" : "rs8", "cnt" : 642 }
{ "_id" : "rs9", "cnt" : 591 }

mongos> printjson(sh.isBalancerRunning()); printjson(sh.isBalancerRunning()); new ISODate(); printjson(sh.startBalancer()); new ISODate(); db.version()
2018-04-16T10:19:51.289-0700 E QUERY [thread1] Error: assert.soon failed, msg:Waited too long for lock balancer to change to state undefined :
mongos> db.version()
| Comments |
| Comment by Matthew Kruse [ 16/Apr/18 ] |

Seeing chunks move now as well.
| Comment by Matthew Kruse [ 16/Apr/18 ] |

Ah that worked with a 3.6 shell, thanks.
| Comment by Kaloian Manassiev [ 16/Apr/18 ] |

From the call stack it looks like you are using an older shell. The 3.6 shell doesn't have any calls to waitForDLock and directly invokes the command: https://github.com/mongodb/mongo/blob/r3.6.2/src/mongo/shell/utils_sh.js#L169
Can you please try with the 3.6 shell?
| Comment by Matthew Kruse [ 16/Apr/18 ] |

I think it's more complicated than that. I've been running
It always returns this error
| Comment by Kaloian Manassiev [ 16/Apr/18 ] |

The mode:off setting means that the balancer is disabled and this value is written when sh.stopBalancer() is run. This is a new setting we introduced in 3.4, but we masked it behind the balancerStart/balancerStop/balancerStatus commands so it doesn't need to be inspected directly. It looks like a bug that sh.getBalancerState() doesn't consult the balancerStatus command and looks directly at the settings. You can re-enable the balancer by running sh.startBalancer(). In the meantime I will file a separate ticket to fix getBalancerState.
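
A minimal sketch of the check described in this comment: ask the balancerStatus command for the balancer mode instead of reading config.settings directly. The helper name isBalancerEnabled and its runAdminCommand parameter are hypothetical, introduced only for illustration; in the mongo shell the parameter would be something like function (cmd) { return db.adminCommand(cmd); }. The stub reply is made-up sample data shaped like a balancerStatus response, not output from this cluster.

```javascript
// Hypothetical helper: reports whether the balancer is enabled according to
// the balancerStatus command, rather than the raw config.settings document.
// runAdminCommand is an assumed injection point so the logic can be tried
// without a live cluster.
function isBalancerEnabled(runAdminCommand) {
  var res = runAdminCommand({ balancerStatus: 1 });
  if (!res || res.ok !== 1) {
    throw new Error("balancerStatus failed: " + JSON.stringify(res));
  }
  // "full" means the balancer is enabled; "off" is what sh.stopBalancer()
  // leaves behind, as explained above.
  return res.mode === "full";
}

// Made-up reply imitating a cluster where the balancer was stopped.
function stubStopped(cmd) {
  return { mode: "off", inBalancerRound: false, numBalancerRounds: 0, ok: 1 };
}

// Made-up reply imitating a cluster where the balancer is enabled.
function stubRunning(cmd) {
  return { mode: "full", inBalancerRound: true, numBalancerRounds: 12, ok: 1 };
}

var stoppedResult = isBalancerEnabled(stubStopped);
var runningResult = isBalancerEnabled(stubRunning);
console.log("stopped cluster enabled?", stoppedResult);
console.log("running cluster enabled?", runningResult);
```

Passing the command runner as a parameter is only a convenience for this sketch; against a real mongos the same check reduces to inspecting the mode field of the balancerStatus reply.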
| Comment by Matthew Kruse [ 16/Apr/18 ] |
| Comment by Kaloian Manassiev [ 16/Apr/18 ] |

Can you please let me know what are the contents of the config.settings collection?
If the balancer is not explicitly stopped, then it is possible that there is a balancer window configured.
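
To illustrate the balancer-window possibility, here is a rough sketch of how a window stored in the balancer document of config.settings could be evaluated. It assumes the window is stored as activeWindow with start and stop strings in HH:MM form; the function names, the sample document, and the choice to treat the stop time as exclusive are assumptions of this sketch, not the server's actual implementation.

```javascript
// Converts "HH:MM" (or "H:MM") into minutes since midnight.
function parseHHMM(s) {
  var m = /^(\d{1,2}):(\d{2})$/.exec(s);
  if (!m) throw new Error("bad time: " + s);
  var hours = Number(m[1]);
  var minutes = Number(m[2]);
  if (hours > 23 || minutes > 59) throw new Error("bad time: " + s);
  return hours * 60 + minutes;
}

// Hypothetical helper: returns true when nowHHMM falls inside the window in
// settingsDoc, or when no window is configured at all.
function inActiveWindow(settingsDoc, nowHHMM) {
  if (!settingsDoc || !settingsDoc.activeWindow) {
    return true; // no window configured, so no time restriction applies
  }
  var start = parseHHMM(settingsDoc.activeWindow.start);
  var stop = parseHHMM(settingsDoc.activeWindow.stop);
  var now = parseHHMM(nowHHMM);
  if (start <= stop) {
    return now >= start && now < stop; // window within a single day
  }
  return now >= start || now < stop; // window wraps past midnight
}

// Made-up sample document: a window from 23:00 to 6:00.
var sampleSettings = { _id: "balancer", activeWindow: { start: "23:00", stop: "6:00" } };

var insideAtNight = inActiveWindow(sampleSettings, "02:30");
var insideAtNoon = inActiveWindow(sampleSettings, "12:00");
var insideNoWindow = inActiveWindow({ _id: "balancer" }, "12:00");
console.log(insideAtNight, insideAtNoon, insideNoWindow);
```

With a window like the sample one, a balancer that looks idle during the day would simply be waiting for the window to open, which is why the contents of config.settings are worth checking first.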
| Comment by Matthew Kruse [ 16/Apr/18 ] |

The mongoc primary is just repeating those above messages over and over now.
| Comment by Matthew Kruse [ 16/Apr/18 ] |

Did that
The primary mongo c thinks balancing is disabled. Right after I pulled the config entries above, I ran this on mongos
| Comment by Matthew Kruse [ 16/Apr/18 ] |

The only abnormal thing I've come across is that the logging partition filled up on the server that one of the mongo c processes was running on. Cloud Manager health checks didn't trip and I could still log into the server and run queries. I restarted the mongo c process and cleared up the disk space, and I still see this on a mongos:
| Comment by Kaloian Manassiev [ 16/Apr/18 ] |

Hi mkruse@adobe.com,
Would it be possible to increase the sharding component log level on the primary node of the config server, let it run for a few minutes and then attach the logs? This is the command:
Given the distribution above, you are right that the balancer should be moving chunks to shard rs9.
-Kal.
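
The reasoning about the distribution can be sketched as a small calculation over the chunk counts from the description. The helper name chunkSpread is hypothetical, and the threshold of 8 is an assumption of this sketch (the documented migration threshold for collections with 80 or more chunks, as far as I recall); check it against the documentation for the server version in use.

```javascript
// Chunk counts per shard, copied from the aggregation output in the description.
var chunkCounts = [
  { _id: "rs0", cnt: 647 },
  { _id: "rs1", cnt: 648 },
  { _id: "rs2", cnt: 648 },
  { _id: "rs3", cnt: 648 },
  { _id: "rs4", cnt: 648 },
  { _id: "rs5", cnt: 648 },
  { _id: "rs6", cnt: 649 },
  { _id: "rs7", cnt: 648 },
  { _id: "rs8", cnt: 642 },
  { _id: "rs9", cnt: 591 }
];

// Hypothetical helper: finds the most and least loaded shards and the gap
// in chunk count between them.
function chunkSpread(counts) {
  if (counts.length === 0) throw new Error("no shards given");
  var least = counts[0];
  var most = counts[0];
  counts.forEach(function (c) {
    if (c.cnt < least.cnt) least = c;
    if (c.cnt > most.cnt) most = c;
  });
  return { most: most._id, least: least._id, gap: most.cnt - least.cnt };
}

// ASSUMPTION: migration threshold for a collection with 80 or more chunks.
var MIGRATION_THRESHOLD = 8;

var spread = chunkSpread(chunkCounts);
var shouldBalance = spread.gap >= MIGRATION_THRESHOLD;
console.log(JSON.stringify(spread), "should balance:", shouldBalance);
```

For the counts above, the least loaded shard is rs9 and the gap to the most loaded shard is well beyond the assumed threshold, which matches the expectation that chunks should be moving to rs9.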
| Comment by Matthew Kruse [ 16/Apr/18 ] |

I've seen the balancer get hung exactly like this on 3.6, 3.4, 3.2 and 3.0. I'd expect the balancer process to recognize a hung state and resolve itself quickly. It doesn't seem capable of doing that in all balancer failure cases. Prior to the balancer getting hung in this state, I disabled it, ran a bunch of moveChunk commands and attempted to turn it back on.
| Comment by Matthew Kruse [ 16/Apr/18 ] |

I've also manually deleted the balancer lock in the config.locks collection. That didn't help either.
| Comment by Matthew Kruse [ 16/Apr/18 ] |

I've bounced all mongo s and mongo c in the cluster, no luck getting whatever is broken to resolve itself.