[SERVER-7821] MongoS blocks all requests to sharded collection Created: 03/Dec/12 Updated: 08/Mar/13 Resolved: 24/Feb/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 2.2.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker - P1 |
| Reporter: | Grégoire Seux | Assignee: | David Hows |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | mongos |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Linux CentOS 5/6 |
| Operating System: | Linux |
| Participants: | Grégoire Seux, Klébert Hodin, David Hows |
| Description |
|
Three times in the last 48 hours, all our mongoS processes were deadlocked on serving requests to a sharded collection. During the lock it was still possible to run db.currentOp() and show collections on the sharded database. Restarting all mongoS processes was the only way to get out of this. Attached is a cleaned log of one mongoS during the last failure. |
| Comments |
| Comment by David Hows [ 14/Dec/12 ] | ||
|
Hi Klébert, Thanks for all the details. I've been looking into your MMS instance and have found a few things that need your attention. I cannot find the 2 SV nodes for set6 in MMS; one of the two SV nodes appears to be the primary, and there is currently no replication data for the two NY nodes. From what I've found in your logs, it appears that the migrations from Shard4 to Shard6 are failing when attempting to confirm that the migration was successfully written to a majority of the shard's secondaries; in this case that means having written it to both the NY and SV sites, and the migration is failing on this replication timeout. My suggestion for now would be to look at enabling secondaryThrottle on your balancer by issuing the following commands (see the sketch below):
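The exact commands were not preserved in this export; what follows is a minimal sketch, assuming the documented 2.2-era approach of setting _secondaryThrottle on the balancer document in the config database, run from a mongo shell connected to a mongos:

```javascript
// Sketch only: enable secondaryThrottle on the balancer (2.2-era approach).
// Run from a mongo shell connected to a mongos.
var cfg = db.getSiblingDB("config");
cfg.settings.update(
    { _id: "balancer" },
    { $set: { _secondaryThrottle: true } },
    true   // upsert, in case the balancer settings document does not exist yet
);
```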
secondaryThrottle will change the migration settings such that, when migrating a chunk, your system should only attempt to replicate to one node rather than to a majority. Cheers, David | ||
| Comment by Klébert Hodin [ 13/Dec/12 ] | ||
|
David, my answers: Do you have MMS installed? Is there any replication lag between your nodes in Shard6? Can you give me a little background on your cluster's setup and layout?
Can you provide the output of sh.status()? | ||
| Comment by David Hows [ 13/Dec/12 ] | ||
|
Hi Klébert, From 06:46:37 until 06:47:08 your shards were in the critical section of a migration, which subsequently failed by timing out after 30 seconds. This is likely the cause of the unresponsive behaviour at that time. The migration appears to have failed because it could not replicate the full migration to the three secondaries it attempted to ensure replication on. Do you have MMS installed? Is there any replication lag between your nodes in Shard6? Can you give me a little background on your cluster's setup and layout? Can you provide the output of sh.status()? Cheers, David | ||
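A minimal sketch of how the requested diagnostics could be gathered, assuming the standard 2.2 shell helpers (sh.status() against a mongos, the replication helpers against the Shard6 primary):

```javascript
// Against a mongos: dump the sharding metadata (shards, databases, chunk ranges).
sh.status();

// Against the Shard6 primary: check member states and how far the secondaries lag.
rs.status();
rs.printSlaveReplicationInfo();
```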
| Comment by Klébert Hodin [ 12/Dec/12 ] | ||
|
Hi David, This issue happened again this morning; no update or find could be done on the counters.statistics collection. Klébert | ||
| Comment by David Hows [ 12/Dec/12 ] | ||
|
Hi Klébert, It looks like that freeze is due to the system being in the "critical section". From the small log snippet you have sent I can see your system exiting the critical section at 05:50:48, but I cannot see when it was entered; could you add more log details? From the logs I can also see that the migration failed as it was not accepted by the "to" shard.
Can you attach logs from Shard4 and Shard6 around this time, as these appear to be the "to" and "from" shards? Cheers, David | ||
| Comment by Klébert Hodin [ 11/Dec/12 ] | ||
|
Hi David, It happened again this morning at 5:50 am (more details in the attached log file). | ||
| Comment by Grégoire Seux [ 11/Dec/12 ] | ||
|
Hello David, we don't have the freeze anymore. Since the upgrade (2.2.1 => 2.2.2) the issue has shifted to being unable to update some collections. Changelog attached. | ||
| Comment by David Hows [ 11/Dec/12 ] | ||
|
Hi Klébert, Grégoire, I can see that there is a large migration which subsequently fails, and this looks to have changed your chunk version. Restarting the replica set would have fixed the issue, as this would have cleared the mongod's internal cache. Do you still get these freezes subsequent to the restart? Would you be able to attach the contents of the changelog collection in the config database on your mongos? Cheers, David | ||
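A minimal sketch of how the requested changelog contents could be extracted, assuming a mongo shell connected to a mongos (the sort field and the limit of 50 are illustrative choices, not from the ticket):

```javascript
// Dump the most recent sharding changelog entries from the config database.
var cfg = db.getSiblingDB("config");
cfg.changelog.find().sort({ time: -1 }).limit(50).forEach(printjson);
```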
| Comment by Klébert Hodin [ 10/Dec/12 ] | ||
|
This issue seems to be linked to one mongod log message: "warning: aborted moveChunk because official version less than mine?" It first appears at line 14862. | ||
| Comment by Grégoire Seux [ 10/Dec/12 ] | ||
|
The upgrade moved the issue. Now some collections have trouble updating documents. Restarting the replica set fixes the issue. | ||
| Comment by Grégoire Seux [ 04/Dec/12 ] | ||
|
Our DBA observed that it is probably linked to another issue on mongoD, https://jira.mongodb.org/browse/SERVER-7034, because it happens on one mongoD at roughly the same time as this issue. We will upgrade to 2.2.2 today and see what the result is. | ||
| Comment by Grégoire Seux [ 04/Dec/12 ] | ||
|
Here is the log from last night. The same behaviour occurred around 2:35 am. | ||
| Comment by Grégoire Seux [ 04/Dec/12 ] | ||
|
Hello David, we use mongo 2.2.1. findOne() on the sharded collection hangs for more than 5 minutes (and, judging by MMS, indefinitely). I ran db.currentOp() the last time it occurred (without saving the output), but I'll try to capture it next time. | ||
| Comment by David Hows [ 04/Dec/12 ] | ||
|
Hi Grégoire, Can you confirm which version of mongo you are using? Would you be able to post a little more context around the log you have provided? The lines in the current log don't show much. Could you attach the output of db.currentOp() from the time when the system is unresponsive? What happens if you try to do a simple findOne() on your system when this behaviour occurs? Does it just hang? Does the client get an error? Thanks, David | ||
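A minimal sketch of the checks David asks for, run from a mongo shell connected to a mongos while the system is unresponsive ("mydb" and "mycoll" are placeholder names, not from the ticket):

```javascript
// Capture the in-progress operations so the output can be attached to the ticket.
printjson(db.currentOp());

// Try a simple read against the affected sharded collection:
// does it hang, return a document, or raise an error on the client?
db.getSiblingDB("mydb").mycoll.findOne();
```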
| Comment by Grégoire Seux [ 03/Dec/12 ] | ||
|
sed s/[dead]lock/unresponsive behaviour/ |