[SERVER-7821] MongoS blocks all requests to sharded collection Created: 03/Dec/12  Updated: 08/Mar/13  Resolved: 24/Feb/13

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.2.1
Fix Version/s: None

Type: Bug Priority: Blocker - P1
Reporter: Grégoire Seux Assignee: David Hows
Resolution: Incomplete Votes: 0
Labels: mongos
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux Centos5/6


Attachments: File changelog_us     PNG File deadlock.png     Text File lock.log     File mongoS.gz     Zip Archive mongod-shard4-shard6-logs.zip     Text File moveChunk_abort_ version_issue_20121211.txt     File sh_status.out     Zip Archive shard4_log_extract.zip    
Issue Links:
Related
is related to SERVER-7922 All operations blocked on one sharded... Closed
Operating System: Linux
Participants:

 Description   

Three times in the last 48 hours, all our mongoS processes were deadlocked serving requests to a sharded collection.
Using an out-of-prod mongoS against the same sharded collection was blocked too.

During the lock, it was still possible to run db.currentOp() and show collections on the sharded database.

Restarting all mongoS processes was the only way to get out of this.

Attached is a cleaned log of one mongoS during the last failure.



 Comments   
Comment by David Hows [ 14/Dec/12 ]

Hi Klébert,

Thanks for all the details.

I've been looking into your MMS instance and have found a few things that need your attention.

I cannot find the 2 SV nodes for set6 in MMS; one of the two SV nodes appears to be the primary, and there is currently no replication data for the two NY nodes.

From what I've found in your logs, it appears that the migrations from Shard4 to Shard6 are failing when attempting to confirm that the migrated data was successfully written to a majority of the destination's secondaries; in this case that means having written it to both the NY and SV sites, and the migration is failing on this replication timeout.

My suggestion for now would be to look at enabling secondaryThrottle on your balancer by issuing the following commands:

use config
db.settings.update( { "_id" : "balancer" }, { $set : { "_secondaryThrottle" : true } }, true )

secondaryThrottle will change the migration settings such that when migrating a chunk your system should only attempt to replicate to one node rather than a majority.
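To double-check the flag afterwards, something like this from the mongo shell, via any mongos, should do (just a sketch):

use config
db.settings.find({ "_id" : "balancer" })
// expect a document like { "_id" : "balancer", "_secondaryThrottle" : true }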

Cheers,

David

Comment by Klébert Hodin [ 13/Dec/12 ]

David,

My answers :

Do you have MMS installed?
Yes. https://mms.10gen.com/host/list/4f8d732587d1d86fa8b99c12

Is there any replication lag between your nodes in Shard6?
No lag in this shard.

Can you give me a little background on your clusters setup and layout?

  • 8 shards
  • each shard is a 4-server replica set (2 DCs, 2 hosts per DC)
  • all on MongoDB 2.2.2
  • the failing collection contains only 11,469 docs.
  • the collection is blocked (no find or update) every morning until we restart the shard4 mongod

Can you provide the output of sh.status()?
File attached.
The problem occurred before adding the 8th shard.

Comment by David Hows [ 13/Dec/12 ]

Hi Klébert,

From 06:46:37 till 06:47:08 your shards were in the critical section of migration, which subsequently failed by timing out after 30 seconds. This is likely the cause of the unresponsive behaviour at that time.

The migration appears to have failed as it could not adequately replicate the full migration to the three secondaries it attempted to ensure replication on.

Do you have MMS installed? Is there any replication lag between your nodes in Shard6? Can you give me a little background on your clusters setup and layout? Can you provide the output of sh.status()?
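For the replication-lag question, something like the following from a mongo shell connected to the Shard6 primary would show how far each secondary is behind (just a sketch; member names will differ):

rs.status()                      // member states and optimes
rs.printSlaveReplicationInfo()   // seconds each secondary is behind the primary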

Cheers,

David

Comment by Klébert Hodin [ 12/Dec/12 ]

Hi David,

This issue happened again this morning, no update nor find could be done on counters.statistics collection.
You can find attached logs of this morning of these 2 shards, issue starts at Wed Dec 12 06:47:09.

Klébert

Comment by David Hows [ 12/Dec/12 ]

Hi Klébert,

It looks like that freeze is due to the system being in the "critical section". From the small log snippet you have sent I can see your system exiting the critical section at 05:50:48, but I cannot see when it was entered; could you add more log details?

From the logs I can also see that the migration failed, as it was not accepted by the "to" shard.

Tue Dec 11 05:50:48 [conn370166] moveChunk migrate commit not accepted by TO-shard: { active: true, ns: "counters.statistics", from: "shard4/mdbcis4-01-ny.criteo.prod:27021,mdbcis4-01-sv.criteo.prod:27021,mdbcis4-02-ny.criteo.prod:27021,mdbcis4-02-sv.criteo.prod:27021", min: { _id: "7f8f7d68-5ad5-49e6-92a6-aff7181d6c7e" }, max: { _id: "bfbef656-d8c3-4742-89fc-716a4b3d4c54" }, shardKeyPattern: { _id: 1 }, state: "fail", errmsg: "", counts: { cloned: 19935, clonedBytes: 8525712, catchup: 0, steady: 0 }, errmsg: "", ok: 0.0 } resetting shard version to: 0|0||000000000000000000000000

Can you attach logs from Shard4 and Shard6 around this time, as these appear to be the "from" and "to" shards?

Cheers,

David

Comment by Klébert Hodin [ 11/Dec/12 ]

Hi David,

It happened again this morning at 5:50 am (more info in log file attached).

Comment by Grégoire Seux [ 11/Dec/12 ]

Hello David,

We don't have the freeze anymore. Since the upgrade (2.2.1 => 2.2.2) the issue has shifted to being unable to update some collections. Changelog attached.

Comment by David Hows [ 11/Dec/12 ]

Hi Klébert, Grégoire,

I can see that there is a large migration which subsequently fails and this looks to have changed your chunk version.

Restarting the replica set would have fixed the issue as this would have cleared the mongod's internal cache.
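As a side note (this is an assumption on my part about where the stale state lives, not something verified against your cluster): if the stale chunk version is cached on the mongos side rather than on the mongod, the router's cached config can usually be dropped without a full restart:

// sketch: run against each mongos
db.adminCommand({ flushRouterConfig : 1 })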

Do you still get these freezes subsequent to the restart?

Would you be able to attach the contents of the changelog collection in the config database on your mongos?
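A simple way to pull that from the shell would be something like this (a sketch, run through any mongos; adjust the limit as needed):

use config
db.changelog.find().sort({ time : -1 }).limit(100).forEach(printjson)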

Cheers,

David

Comment by Klébert Hodin [ 10/Dec/12 ]

This issue seems to be linked to a mongod log message: "warning: aborted moveChunk because official version less than mine?"

It first appears at line 14,862.

Comment by Grégoire Seux [ 10/Dec/12 ]

The upgrade shifted the issue. Now some collections have trouble updating documents. Restarting the replica set fixes the issue.

Comment by Grégoire Seux [ 04/Dec/12 ]

Our DBA observed that it is probably linked to another mongoD issue, https://jira.mongodb.org/browse/SERVER-7034, because it happens on one mongoD at roughly the same time as this issue.
Contrary to what Spencer said, this happens very frequently and cannot be considered acceptable for our production environment.

We will upgrade to 2.2.2 today and see what the result is.

Comment by Grégoire Seux [ 04/Dec/12 ]

Here is the log from last night. The same behavior occurred around 2:35 am.

Comment by Grégoire Seux [ 04/Dec/12 ]

Hello David,

We use mongo 2.2.1.

findOne() on the sharded collection hangs for more than 5 minutes (and judging by MMS, indefinitely).
findOne() on an unsharded database works as usual.

I tried db.currentOp() the last time it occurred (without saving the output), but I'll try to capture it next time.

Comment by David Hows [ 04/Dec/12 ]

Hi Grégoire,

Can you confirm which version of mongo you are using?

Would you be able to post a little more context around the log you have provided? The lines in the current log don't show much.

Could you attach the output of db.currentOp() from the time when the system is unresponsive?
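To capture that output without having to copy it by hand, something along these lines should work the next time it hangs (a sketch; host and port are placeholders, and the true argument to currentOp includes idle connections and system operations):

mongo <mongos-host>:<port> --eval "printjson(db.currentOp(true))" > currentOp.out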

What happens if you try to do a simple findOne() on your system when this behaviour occurs? Does it just hang? Does the client get an error?

Thanks,

David

Comment by Grégoire Seux [ 03/Dec/12 ]

sed s/[dead]lock/unresponsive behavior/
because I don't know if it is a deadlock or something else.
