|
Hello,
I think our first priority here should be to get your cluster upgraded to 2.4 and eventually 2.6, and I think its not unreasonable to assume that a case SERVER-14996 is at play. After the upgrade, if we still see the issue, we can re-investigate.
- Can you verify that all three config servers are running and active? Without all three, the config data becomes read-only, and it's impossible to create new locks or remove old ones; in that situation chunk migrations cannot succeed either. If all three are active and you still cannot stop the balancer, you can shut down the last config server to effectively stop migrations, which is the important part of the upgrade process.
- Do you see evidence that the balancer is actually migrating chunks? If you see active chunk migrations in the logs, you can wait for them to finish or fail because of the limited availability of the config servers. You could also step down the primary of one of the shards to terminate the migration. Once there are no active migrations and the config data is read-only, no new migrations can start.
- You should also check that your config servers are not out of sync with each other. (A rough sketch of these checks follows this list.)
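From the mongo shell, these checks might look roughly like this (the config server hostnames below are placeholders, not taken from this cluster):

// Hypothetical config server addresses -- substitute your own three.
var configServers = ["cfg1.example.net:27019", "cfg2.example.net:27019", "cfg3.example.net:27019"];

// 1. Is each config server reachable, and do their config databases hash the same?
configServers.forEach(function (host) {
    var conn = new Mongo(host);
    printjson(conn.getDB("admin").runCommand({ping: 1}));     // reachability
    print(conn.getDB("config").runCommand({dbHash: 1}).md5);  // should be identical on all three
});

// 2. From a mongos: the balancer lock and the most recent migration activity.
db.getSisterDB("config").locks.find({_id: "balancer"}).forEach(printjson);
db.getSisterDB("config").changelog.find({what: /moveChunk/}).sort({time: -1}).limit(5).forEach(printjson);

// 3. If a migration appears stuck, stepping down the donor shard's primary aborts it
//    (run on that shard's primary):
// rs.stepDown();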
Once you've completed these steps, you can follow the upgrade process, first to 2.4 and then to 2.6 and possibly to 3.0.
I hope this helps. If you have problems with the upgrade process, you can contact support. If you still see this error after you're done upgrading, I think it makes sense to open a new ticket. I will close this ticket for now.
Regards,
sam
|
|
Here are the logs from 08 Dec 2014, when the issue started: 666.txt
We have Splunk and the logs were exported from there.
Splunk Request (we filtered some output):
lock host="*.live" AND NOT successfully sourcetype="mongodbrouterbidb" AND NOT "metadata lock is taken"
|
|
|
P.S.: Unfortunately, I cannot provide logs for you because they were rotated.
Can I find something in MMS?
|
|
Ramon Fernandez, as Alex said - yes, we replaced the config servers as described in the MongoDB manual. During the operation only one server was down.
As for the upgrade - we are going to upgrade to 2.4.12 for now, and maybe to 2.6.* a little later.
I'm afraid the upgrade from 2.2 to 2.4 won't be successful because of this issue with the balancer and the lock we have.
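One way to double-check that nothing is actually migrating before attempting the upgrade is to look at the shards directly rather than at the lock document; a minimal sketch to run against the primary of each shard (the exact shape of the in-progress entries varies between versions):

// List in-progress operations and pick out anything that mentions moveChunk.
db.currentOp().inprog.forEach(function (op) {
    if (tojson(op).indexOf("moveChunk") !== -1) {
        printjson(op);   // a genuine in-flight migration would show up here
    }
});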
|
|
We encountered this issue after we replaced the config servers due to a hardware upgrade imposed by AWS. We are interested in an upgrade, but at this point we are mainly interested in understanding what the potential workarounds may be. There is no need for us to run the balancer, but we can't seem to turn it off, as mentioned above.
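For reference, disabling the balancer from a mongos normally just writes a flag into config.settings; a minimal sketch of the equivalent direct write is below. Note that this only prevents new balancing rounds, it does not release a lock that is already held:

// Disable balancing via the helper...
sh.setBalancerState(false);
// ...or by writing the flag directly (third argument = upsert).
db.getSisterDB("config").settings.update({_id: "balancer"}, {$set: {stopped: true}}, true);

// Confirm the flag and the balancer's current activity.
printjson(db.getSisterDB("config").settings.findOne({_id: "balancer"}));
print(sh.getBalancerState());    // should now be false
print(sh.isBalancerRunning());   // may stay true until the current round ends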
|
|
andrey@idle-games.com, this could be the same situation reported in SERVER-14996, where if the first config server goes down while the balancer is running, it can result in stale locks which don't get released even after the config server is brought back up. Have you had any issues with your first config server? Can you post the logs for that server?
In SERVER-14996 there's a script to trigger the issue, but I haven't tested it in 2.2.3. Since the issue does appear on 2.4 but not on 2.6, it's not clear at this point if there will be any further work on 2.4, but it's unlikely any tentative fixes for 2.4 would be backported to 2.2. Would you be interested in considering an upgrade to a later version of MongoDB if this is an issue for you?
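In case it helps the diagnosis, the lock document and its last heartbeat can be pulled from a mongos like this (just the standard config collections, nothing specific to this ticket):

// The balancer lock: state 0 = free, 2 = held; "process" names the mongos that owns it.
var lock = db.getSisterDB("config").locks.findOne({_id: "balancer"});
printjson(lock);

// Lock holders heartbeat into config.lockpings; a very old ping points at a stale holder.
if (lock) {
    printjson(db.getSisterDB("config").lockpings.findOne({_id: lock.process}));
}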
|
|
Dan Pasette - it seems like you marked the issue as "debugging with submitter", but there is no request from anyone.
How can we move forward with this?
|
|
Tried this:
mongos> sh.setBalancerState(true)
mongos> db.getSisterDB("config").settings.find({_id:"balancer"})
{ "_id" : "balancer", "stopped" : false }
mongos> db.getSisterDB("config").settings.find({_id:"balancer"})
{ "_id" : "balancer", "stopped" : false }
mongos> sh.getBalancerState()
true
mongos> sh.isBalancerRunning()
true
mongos> sh.stopBalancer()
Waiting for active hosts...
Waiting for active host celery009:37017 to recognize new settings... (ping : Thu Dec 25 2014 08:57:21 GMT-0800 (PST))
Waited for active ping to change for host celery009:37017, a migration may be in progress or the host may be down.
Waiting for the balancer lock...
|
in the logs on celery009:
Thu Dec 25 08:57:21 [Balancer] forcing lock 'balancer/atc005:37017:1415292838:1804289383' because elapsed time 1477385281 > takeover time 900000
Thu Dec 25 08:57:22 [Balancer] lock 'balancer/atc005:37017:1415292838:1804289383' successfully forced
Thu Dec 25 08:57:22 [Balancer] distributed lock 'balancer/celery009:37017:1411743445:1804289383' acquired, ts : 549c41f2c16e738217f43164
Thu Dec 25 08:57:22 [Balancer] ns: live.sng_log going to move { _id: "live.sng_log-shard_key_ObjectId('2335745e4962db9522f7d164')", lastmod: Timestamp 5000|382, lastmodEpoch: ObjectId('000000000000000000000000'), ns: "live.sng_log", min: { shard_key: ObjectId('2335745e4962db9522f7d164') }, max: { shard_key: ObjectId('23ee6fd74e961fbe1f8c1fe3') }, shard: "mongo-livelog-c" } from: mongo-livelog-c to: mongo-livelog-a tag []
Thu Dec 25 08:57:22 [Balancer] ns: live.player_segments going to move { _id: "live.player_segments-shard_hash_ObjectId('555555555555555555555555')", lastmod: Timestamp 4000|3472, lastmodEpoch: ObjectId('51faa782e67a4928ba6ae95b'), ns: "live.player_segments", min: { shard_hash: ObjectId('555555555555555555555555') }, max: { shard_hash: ObjectId('5572465d72b783872a330dda') }, shard: "mongo-livelog-b" } from: mongo-livelog-b to: mongo-livelog-c tag []
Thu Dec 25 08:57:22 [Balancer] ns: live.player_chat_2 going to move { _id: "live.player_chat_2-shard_hash_MinKey", lastmod: Timestamp 3000|2, lastmodEpoch: ObjectId('51b787cb3bea320fe276bcc7'), ns: "live.player_chat_2", min: { shard_hash: MinKey }, max: { shard_hash: ObjectId('000002bd63830cf94ca58b68') }, shard: "mongo-livelog-a" } from: mongo-livelog-a to: mongo-livelog-b tag []
Thu Dec 25 08:57:22 [Balancer] moving chunk ns: live.sng_log moving ( ns:live.sng_log at: mongo-livelog-c:mongo-livelog-c/172.30.64.132:27018,172.30.71.231:27018 lastmod: 5|382||000000000000000000000000 min: { shard_key: ObjectId('2335745e4962db9522f7d164') } max: { shard_key: ObjectId('23ee6fd74e961fbe1f8c1fe3') }) mongo-livelog-c:mongo-livelog-c/172.30.64.132:27018,172.30.71.231:27018 -> mongo-livelog-a:mongo-livelog-a/mongo-livelog-a-1:27018,mongo-livelog-a-2:27018
|
So atc005 disappeared, but now celery009 is there and cannot be removed.
|
mongos> db.getSisterDB("config").settings.find({_id:"balancer"})
{ "_id" : "balancer", "stopped" : true }