[SERVER-7922] All operations blocked on one sharded collection Created: 13/Dec/12  Updated: 11/Jul/16  Resolved: 23/Jan/13

Status: Closed
Project: Core Server
Component/s: Performance, Sharding, Stability
Affects Version/s: 2.2.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Klébert Hodin Assignee: David Hows
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux CentOS 5/6


Attachments: File changelog     File config_dump.tar     Zip Archive mongod-shard4-shard6-logs.zip     Zip Archive mongod-shard4_20121211.zip     File sh_status    
Issue Links:
Related
related to SERVER-7821 MongoS blocks all requests to sharded... Closed
related to SERVER-7034 timeouts for all connections in migra... Closed
Operating System: Linux
Participants:

 Description   

Every morning since last week, all operations on one sharded collection have been failing.

Here are the application-side errors:
setShardVersion failed host: mdbcis4-01-sv.criteo.prod:27021

{ oldVersion: Timestamp 0|0, oldVersionEpoch: ObjectId('000000000000000000000000'), ns: "counters.statistics", version: Timestamp 4000|3, versionEpoch: ObjectId('000000000000000000000000'), globalVersion: Timestamp 6000|0, globalVersionEpoch: ObjectId('000000000000000000000000'), reloadConfig: true, errmsg: "shard global version for collection is higher than trying to set to 'counters.statistics'", ok: 0.0 }

'. (Response was { "err" : "setShardVersion failed host: mdbcis4-01-sv.criteo.prod:27021

{ oldVersion: Timestamp 0|0, oldVersionEpoch: ObjectId('000000000000000000000000'), ns: \"counters.statistics\", version: Timestamp 4000|3, versionEpoch: ObjectId('000000000000000000000000'), globalVersion: Timestamp 6000|0, globalVersionEpoch: ObjectId('000000000000000000000000'), reloadConfig: true, errmsg: \"shard global version for collection is higher than trying to set to 'counters.statistics'\", ok: 0.0 }

", "code" : 10429, "n" : 0, "ok" : 1.0 }). : MongoDB.Driver.SafeModeResult SendMessage(MongoDB.Driver.Internal.MongoRequestMessage, MongoDB.Driver.SafeMode)

Server-side errors:
warning: aborted moveChunk because official version less than mine?: official 5|1||000000000000000000000000 mine: 6|0||000000000000000000000000

Restarting mongod unlocks operations until next morning.

Attached: the logs of the servers involved in the moveChunk process (shard4 to shard6), the sh.status() output, and the changelog collection output.
In the logs, the issue starts at Wed Dec 12 06:47:09 and ends at Wed Dec 12 09:30:00, after the restart.

Link to our MMS dashboard: https://mms.10gen.com/host/list/4f8d732587d1d86fa8b99c12
The problem occurred before we added the 8th shard and seems to be linked to these previous bugs: https://jira.mongodb.org/browse/SERVER-7034 and https://jira.mongodb.org/browse/SERVER-7821
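
For anyone chasing the same symptom, here is a rough way to compare the version a shard has cached with what the config metadata says, and to refresh the routers without a full restart. This is only a sketch: the shard host and the counters.statistics namespace come from the errors above, while the config server and mongos hosts/ports are placeholders.

# Ask the shard primary which version it has cached for the collection
mongo mdbcis4-01-sv.criteo.prod:27021/admin --eval 'printjson(db.adminCommand({ getShardVersion: "counters.statistics" }))'

# Compare with the highest chunk version recorded in the config metadata (config server host/port are placeholders)
mongo cfg1.example:27019/config --eval 'printjson(db.chunks.find({ ns: "counters.statistics" }).sort({ lastmod: -1 }).limit(1).next())'

# If only a mongos has stale metadata cached, flushing it there may help without restarting mongod (mongos host is a placeholder)
mongo mongos1.example:27017/admin --eval 'printjson(db.adminCommand({ flushRouterConfig: 1 }))'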



 Comments   
Comment by Ian Whalen (Inactive) [ 29/Apr/13 ]

Hi Javi, please open a new ticket in the SERVER project with any further details you can include.

Comment by Javi Martin [ 29/Apr/13 ]

Hi!

During the last 2 weeks I have had the same problem 4 times with the same MongoDB version... Is removing data and re-syncing the only way to solve this problem? Is there anything else I can do to solve it without re-syncing?

Regards,

Comment by Klébert Hodin [ 23/Jan/13 ]

I don't think so.
I'll open a new bug if necessary.

Klébert

Comment by David Hows [ 23/Jan/13 ]

Hi Klebert,

Is there anything more we can do for you on this issue?

It seems your re-sync has solved the problem and the older logs were unavailable.

Cheers,

David

Comment by Klébert Hodin [ 11/Jan/13 ]

I mean there has been no issue since we resync'd the primary mongods of shards 4 and 6.

Comment by David Hows [ 11/Jan/13 ]

Hi Klebert,

What data did you remove? Or do you just mean the resync?

Comment by Klébert Hodin [ 10/Jan/13 ]

No it didn't.
It seems removing data and restarting mongod fixed this issue.

Comment by David Hows [ 10/Jan/13 ]

Hi Klebert,

Has this issue occurred again since your restart, and has it continued since then?

If so, can you attach the log for one of the affected instances, covering the period from the restart until after you have seen the issue occur?

Thanks,

David

Comment by Klébert Hodin [ 28/Dec/12 ]

Hi David,

It happened 15 days ago; we no longer have those log files.

Klébert

Comment by David Hows [ 28/Dec/12 ]

Hi Klebert,

Can you attach the logs for the 24hrs following the restart?

It should show the initial kickoff of the failed migration, which we were unable to see in the last set of logs.
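
(A quick way to locate that kickoff in a mongod log; the log path below is an assumption, adjust it to your configuration:)

# Donor-side migration activity is logged under "moveChunk", the recipient side under "migrate"
grep -n "moveChunk" /var/log/mongo/mongod.log | head
grep -n "migrate" /var/log/mongo/mongod.log | head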

Cheers,

David

Comment by Klébert Hodin [ 26/Dec/12 ]

Yes, we tried. It only fixed the issue for 24h (until it happened again the next morning).

Comment by Eliot Horowitz (Inactive) [ 25/Dec/12 ]

Have you tried restarting mongod on mdbcis4-01-sv.criteo.prod:27021?

Comment by Klébert Hodin [ 24/Dec/12 ]

Hi David,

Any updates on this issue?

Thanks,

Klébert

Comment by Grégoire Seux [ 19/Dec/12 ]

Hi David,

Resyncing servers (http://docs.mongodb.org/manual/administration/replica-sets/#resyncing-a-member-of-a-replica-set) is a hammer to catch a fly, but it solves most weird issues in MongoDB, so we use it quite often.
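
(A minimal sketch of such a resync, assuming the member uses /var/lib/mongo as its dbpath, runs under the mongod user, and is managed by the mongod init service; step it down first if it is currently the primary:)

# Stop the member to be resynced (rs.stepDown() first if it is the primary)
sudo service mongod stop
# Move the old data files aside and recreate an empty dbpath (paths and user are assumptions)
sudo mv /var/lib/mongo /var/lib/mongo.old
sudo mkdir /var/lib/mongo
sudo chown mongod:mongod /var/lib/mongo
# On restart the member rejoins the replica set and performs a full initial sync
sudo service mongod start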

Comment by Klébert Hodin [ 19/Dec/12 ]

http://docs.mongodb.org/manual/administration/replica-sets/#replica-set-auto-resync-stale-member

Comment by David Hows [ 19/Dec/12 ]

Thanks Klébert,

I'm just trying to follow what occurred to cause those versioning errors.

What do you mean by full-resync'd?

Cheers,

David

Comment by Klébert Hodin [ 18/Dec/12 ]

Hi David,

Here's a dump of the config DBs.
We don't have any logs prior to Fri Dec 7 04:02:19.

This bug hasn't shown up again since we fully resynced shards 4 and 6.

Comment by David Hows [ 18/Dec/12 ]

Hi Klébert, Grégoire,

Thanks for the logs. I've been looking through these extra logs and can see a few interesting things relating to how shard versions are calculated.

I'd like to confirm what is going on within your config servers. Can you please dump the config database from all three of your config servers and attach it to the ticket?
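
(Something along these lines, run once per config server, should produce what we need; the host names and port below are placeholders:)

# Dump the config database from each of the three config servers
mongodump --host cfg1.example --port 27019 --db config --out cfg1_config_dump
mongodump --host cfg2.example --port 27019 --db config --out cfg2_config_dump
mongodump --host cfg3.example --port 27019 --db config --out cfg3_config_dump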

Additionally, do you have any logs prior to Fri Dec 7 04:02:19?

Thanks,

David

Comment by Klébert Hodin [ 17/Dec/12 ]

Older logs from shard4.

Comment by Grégoire Seux [ 14/Dec/12 ]

I'll give you all the logs and other details on Monday.

Comment by Grégoire Seux [ 14/Dec/12 ]

Hello David,
Could you request everything you need at once?

Comment by David Hows [ 14/Dec/12 ]

Hi Grégoire,

I've been following up on this with your logs from the other ticket.

Would it be possible to get some earlier logs from shard4? Currently the log starts at Wed Dec 12 04:02:17, in the middle of a chunk migration. I would like to see what happened at the start of this migration.

Cheers,

David

Comment by Grégoire Seux [ 13/Dec/12 ]

No, this collection is upsert-only (plus some reads, of course).
No backup restoring

Comment by Scott Hernandez (Inactive) [ 13/Dec/12 ]

Are you dropping the collection and recreating it at any point? Or restoring from a backup of the whole cluster?
