[SERVER-7922] All operations blocked on one sharded collection Created: 13/Dec/12 Updated: 11/Jul/16 Resolved: 23/Jan/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Performance, Sharding, Stability |
| Affects Version/s: | 2.2.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Klébert Hodin | Assignee: | David Hows |
| Resolution: | Done | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Linux CentOS 5/6 |
| Operating System: | Linux | ||||||||||||
| Description |
|
Every morning since last week, all operations on one sharded collection have been failing.

Application-side errors:

'. (Response was { "err" : "setShardVersion failed host: mdbcis4-01-sv.criteo.prod:27021 { oldVersion: Timestamp 0|0, oldVersionEpoch: ObjectId('000000000000000000000000'), ns: \"counters.statistics\", version: Timestamp 4000|3, versionEpoch: ObjectId('000000000000000000000000'), globalVersion: Timestamp 6000|0, globalVersionEpoch: ObjectId('000000000000000000000000'), reloadConfig: true, errmsg: \"shard global version for collection is higher than trying to set to 'counters.statistics'\", ok: 0.0 }", "code" : 10429, "n" : 0, "ok" : 1.0 }). : MongoDB.Driver.SafeModeResult SendMessage(MongoDB.Driver.Internal.MongoRequestMessage, MongoDB.Driver.SafeMode)

Server-side errors:

Restarting mongod unblocks operations until the next morning.

Attached are the logs of the servers involved in the moveChunk process (shard4 to shard6), the sh_status output, and the changelog collection output.

Link to our MMS dashboard: https://mms.10gen.com/host/list/4f8d732587d1d86fa8b99c12 |
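For readers following the error above, here is a minimal Python sketch of how the two version fields in the setShardVersion response relate to each other. It only compares the values as they are printed in the error (`version: Timestamp 4000|3` against `globalVersion: Timestamp 6000|0`); the names `ChunkVersion`, `parse_version`, and `is_stale` are hypothetical and this is not mongod's actual implementation.

```python
import re
from typing import NamedTuple


class ChunkVersion(NamedTuple):
    """Hypothetical holder for the two numbers printed as 'Timestamp a|b'."""
    major: int
    minor: int


_VERSION_RE = re.compile(r"Timestamp (\d+)\|(\d+)")


def parse_version(text: str) -> ChunkVersion:
    """Parse a string such as 'Timestamp 4000|3' as printed in the error."""
    match = _VERSION_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return ChunkVersion(int(match.group(1)), int(match.group(2)))


def is_stale(requested: ChunkVersion, shard_global: ChunkVersion) -> bool:
    """True when the requested version is lower than the shard's global one.

    Assumption: versions order by the first number, then the second,
    which is what tuple comparison does.
    """
    return requested < shard_global


# Values copied from the error message in the description.
requested = parse_version("Timestamp 4000|3")
shard_global = parse_version("Timestamp 6000|0")
print(is_stale(requested, shard_global))
```

With the values from the error, the requested version is lower than the shard's global version, which matches the errmsg "shard global version for collection is higher than trying to set to".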
| Comments |
| Comment by Ian Whalen (Inactive) [ 29/Apr/13 ] |
|
Hi Javi, please open a new ticket in the SERVER project with any further details you can include. |
| Comment by Javi Martin [ 29/Apr/13 ] |
|
Hi! During the last 2 weeks I have had the same problem 4 times with the same MongoDB version... Is removing data and re-syncing the only way to solve this problem? Is there anything else I can do to solve it without re-syncing? Regards, |
| Comment by Klébert Hodin [ 23/Jan/13 ] |
|
I don't think so. Klébert |
| Comment by David Hows [ 23/Jan/13 ] |
|
Hi Klebert, Is there anything more we can do for you on this issue? It seems your re-sync has solved the problem and the older logs were unavailable. Cheers, David |
| Comment by Klébert Hodin [ 11/Jan/13 ] |
|
I mean there has been no issue since we resynced the primary mongod instances of shards 4 and 6. |
| Comment by David Hows [ 11/Jan/13 ] |
|
Hi Klebert, What data did you remove? Or do you just mean the resync? |
| Comment by Klébert Hodin [ 10/Jan/13 ] |
|
No it didn't. |
| Comment by David Hows [ 10/Jan/13 ] |
|
Hi Klebert, Has this issue continued to occur since your restart? If so, can you attach a log for one of the instances that this issue occurs on, from the time of the restart until after you have seen the issue occur? Thanks, David |
| Comment by Klébert Hodin [ 28/Dec/12 ] |
|
Hi David, It happened 15 days ago, we no longer have these log files. Klébert |
| Comment by David Hows [ 28/Dec/12 ] |
|
Hi Klebert, Can you attach the logs for the 24 hours following the restart? They should show the initial kickoff of the failed migration, which we were unable to see in the last set of logs. Cheers, David |
| Comment by Klébert Hodin [ 26/Dec/12 ] |
|
Yes we tried. It only fixed the issue for 24h (until it happened again the morning after). |
| Comment by Eliot Horowitz (Inactive) [ 25/Dec/12 ] |
|
Have you tried restarting mongod on mdbcis4-01-sv.criteo.prod:27021? |
| Comment by Klébert Hodin [ 24/Dec/12 ] |
|
Hi David, Any updates on this issue? Thanks, Klébert |
| Comment by Grégoire Seux [ 19/Dec/12 ] |
|
Hi David, resyncing servers (http://docs.mongodb.org/manual/administration/replica-sets/#resyncing-a-member-of-a-replica-set) is a hammer to kill a fly, but it solves most weird issues in MongoDB, so we use it quite often. |
| Comment by Klébert Hodin [ 19/Dec/12 ] |
|
http://docs.mongodb.org/manual/administration/replica-sets/#replica-set-auto-resync-stale-member |
| Comment by David Hows [ 19/Dec/12 ] |
|
Thanks Klébert, I'm just trying to follow what occurred to cause those versioning errors. What do you mean by full-resync'd? Cheers, David |
| Comment by Klébert Hodin [ 18/Dec/12 ] |
|
Hi David, Here's a dump of the config DBs. This bug hasn't shown up again since we did a full resync of shards 4 and 6. |
| Comment by David Hows [ 18/Dec/12 ] |
|
Hi Klébert, Grégoire, Thanks for the logs. I've been looking through these extra logs and can see a few interesting things relating to how shard versions are calculated. I'd like to confirm what is going on within your config servers. Can you please dump the config database from all three of your config servers and attach it to the ticket? Additionally, do you have any logs prior to Fri Dec 7 04:02:19? Thanks, David |
| Comment by Klébert Hodin [ 17/Dec/12 ] |
|
Older logs from shard4. |
| Comment by Grégoire Seux [ 14/Dec/12 ] |
|
I'll give you all logs and things on Monday |
| Comment by Grégoire Seux [ 14/Dec/12 ] |
|
Hello David, |
| Comment by David Hows [ 14/Dec/12 ] |
|
Hi Grégoire, I've been following up on this with your logs from the other ticket. Would it be possible to get some earlier logs from shard4? Currently the log starts at Wed Dec 12 04:02:17, in the middle of a chunk migration. I would like to see what happens at the start of this migration. Cheers, David |
| Comment by Grégoire Seux [ 13/Dec/12 ] |
|
No, this collection is upsert only (and some reads of course). |
| Comment by Scott Hernandez (Inactive) [ 13/Dec/12 ] |
|
Are you dropping the collection and recreating it at any point? Or restoring from a backup of the whole cluster? |