[SERVER-6781] Document in the wrong shard. Created: 16/Aug/12  Updated: 15/Feb/13  Resolved: 05/Sep/12

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.0.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Robert Jobson Assignee: Spencer Brody (Inactive)
Resolution: Duplicate Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux Ubuntu x86_64 GNU/Linux, accessed from Perl code


Attachments: Zip Archive mongodb_router.log.zip    
Issue Links:
Duplicate
duplicates SERVER-4604 inserts need better handling of versi... Closed
Related
Operating System: Linux
Participants:

 Description   

Documents are getting put into the wrong chunk/shard.

Example:

mongos> db.chunks.find({ "_id" : /listing_fields217_559606/ }, { "shard" : 1, "max" : 1, "min" : 1 });
{ "_id" : "homes_didx.listing_fields217_559606-hash_fk_id_\"9c715a8c\"", "min" : { "hash_fk_id" : "9c715a8c" }, "max" : { "hash_fk_id" : { $maxKey : 1 } }, "shard" : "shard0000" }
{ "_id" : "homes_didx.listing_fields217_559606-hash_fk_id_MinKey", "min" : { "hash_fk_id" : { $minKey : 1 } }, "max" : { "hash_fk_id" : "9c715a8c" }, "shard" : "shard0001" }

Connecting directly to shard "shard0000":
> db.listing_fields217_559606.find({}, { fk_listing_id : 1, hash_fk_id : 1 });
{ "_id" : ObjectId("502cf2eb4a2ec3a9630004ef"), "hash_fk_id" : "df82643a", "fk_listing_id" : "57558495" }
{ "_id" : ObjectId("502cf2eb9b14704161000b07"), "hash_fk_id" : "d0a7226e", "fk_listing_id" : "49808274" }
{ "_id" : ObjectId("502cf2ec9e27f6c451000a21"), "hash_fk_id" : "9c715a8c", "fk_listing_id" : "50225871" }
{ "_id" : ObjectId("502cf2ecc8e9f12752000a5b"), "hash_fk_id" : "cba357e7", "fk_listing_id" : "60142564" }
{ "_id" : ObjectId("502cf2ec72c5698975000b7c"), "hash_fk_id" : "aee39eaf", "fk_listing_id" : "60229041" }
{ "_id" : ObjectId("502cf2ed9cbffc6562000731"), "hash_fk_id" : "b1ff0637", "fk_listing_id" : "60115293" }
{ "_id" : ObjectId("502cf2ed47202c0a620007fa"), "hash_fk_id" : "c92a3f44", "fk_listing_id" : "60115290" }
{ "_id" : ObjectId("502cf2ee4e1e7558750007b9"), "hash_fk_id" : "09fefa68", "fk_listing_id" : "60156237" }
{ "_id" : ObjectId("502cf2effbb5c34a52000883"), "hash_fk_id" : "e56dadbb", "fk_listing_id" : "60227778" }
{ "_id" : ObjectId("502cf2f116641ca851000af0"), "hash_fk_id" : "dece11d2", "fk_listing_id" : "60245783" }
{ "_id" : ObjectId("502cf2f124860e6652000923"), "hash_fk_id" : "ab485082", "fk_listing_id" : "60245784" }
{ "_id" : ObjectId("502cf2f4f1d370f150000cfa"), "hash_fk_id" : "da66478e", "fk_listing_id" : "60249679" }

The document

{ "_id" : ObjectId("502cf2ee4e1e7558750007b9"), "hash_fk_id" : "09fefa68", "fk_listing_id" : "60156237" }

will not be returned by the router and is effectively lost.
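
For reference, a quick way to confirm which chunk the stray document belongs to is to compare its shard key value against the chunk boundary in the shell (a minimal check, assuming plain lexicographic comparison of these hex strings matches the chunk range ordering):

> "09fefa68" < "9c715a8c"
true

So the document falls in the { $minKey : 1 } .. "9c715a8c" chunk owned by shard0001, yet it was found on shard0000.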

Another example:

mongos> db.chunks.find({ "_id" : /listing_fields2002_558075/ }, { "shard" : 1, "max" : 1, "min" : 1 });
{ "_id" : "homes_didx.listing_fields2002_558075-hash_fk_id_\"79aae84e\"", "min" : { "hash_fk_id" : "79aae84e" }, "max" : { "hash_fk_id" : { $maxKey : 1 } }, "shard" : "shard0000" }
{ "_id" : "homes_didx.listing_fields2002_558075-hash_fk_id_MinKey", "min" : { "hash_fk_id" : { $minKey : 1 } }, "max" : { "hash_fk_id" : "79aae84e" }, "shard" : "shard0001" }

Again, directly on shard0000:
> db.listing_fields2002_558075.find({}, { hash_fk_id : 1 });
{ "_id" : ObjectId("502c25a6b0e521034400182c"), "hash_fk_id" : "a644d3b6" }
{ "_id" : ObjectId("502c25a69dac33355300192e"), "hash_fk_id" : "da0b5d5f" }
{ "_id" : ObjectId("502c25a74a52eb911a001904"), "hash_fk_id" : "8e99b228" }
{ "_id" : ObjectId("502c25a7060c531e52001d54"), "hash_fk_id" : "79aae84e" }
{ "_id" : ObjectId("502c25a87100fffd52001729"), "hash_fk_id" : "426a4c58" }
{ "_id" : ObjectId("502c25a966efa23943001f88"), "hash_fk_id" : "fdfd2540" }
{ "_id" : ObjectId("502c25aa81b3e55e19001c3d"), "hash_fk_id" : "8e2a987c" }
{ "_id" : ObjectId("502c25aab39a049b52001ca9"), "hash_fk_id" : "7b8cfd62" }
{ "_id" : ObjectId("502c25aa350b025a43001b0f"), "hash_fk_id" : "bbbb5771" }
{ "_id" : ObjectId("502c25aaa2b66cfb19001951"), "hash_fk_id" : "9fbf4b04" }
{ "_id" : ObjectId("502c25aa7c1d19001e00056d"), "hash_fk_id" : "966bb669" }
{ "_id" : ObjectId("502c25abfee20d0919001aeb"), "hash_fk_id" : "97ea1fa9" }

The document

{ "_id" : ObjectId("502c25a87100fffd52001729"), "hash_fk_id" : "426a4c58" }

will not be returned by the router and is effectively lost.

Let me know what additional information I can provide.



 Comments   
Comment by Spencer Brody (Inactive) [ 27/Aug/12 ]

Looking at the config dump you attached, it seems that a migration finished at the exact moment that the documents in question were created. I believe you are hitting SERVER-4604, a race condition in which a document inserted while its chunk is being moved can, in rare cases, be left stranded on the donor shard. This has been fixed for the upcoming 2.2 release, but unfortunately it cannot be backported to the 2.0 series since it's a major code change.

A workaround for this is to keep the balancer off during heavy inserts, then turn it back on to rebalance while inserts are paused. We apologize for this bug; 2.2 should be released very soon with the fix included.
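
For example, a rough sketch of toggling the balancer from a mongos shell (assuming the sh.setBalancerState()/sh.getBalancerState() helpers are available in your shell version; the same can be done by updating the "balancer" document in the config database's settings collection):

mongos> sh.setBalancerState(false)   // disable chunk migrations before the heavy inserts
mongos> sh.getBalancerState()        // verify: should now report false
mongos> // ... run the bulk insert workload ...
mongos> sh.setBalancerState(true)    // re-enable balancing once inserts are paused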

Comment by Robert Jobson [ 27/Aug/12 ]

We used to, but we ran into trouble with another bug, so the collection names are now all unique. That change was made before this error came up. Specifically, these records should be in those collections; the errors that alerted us to this problem occurred because we cannot find them.

Note that we have upgraded to the latest stable version, 2.0.7, since this ticket was opened. Also, we no longer have logs from the 15th-16th, and we have had sharding turned off since this incident came to light.

Comment by Spencer Brody (Inactive) [ 27/Aug/12 ]

Hmm... Do you recreate any of these collections with the same name after they've been dropped?

Do you have the mongod logs from whichever nodes were PRIMARY on the 15th/16th from that time frame on all shards?

Comment by Robert Jobson [ 27/Aug/12 ]

No. I considered this possibility when you first suggested that records could be leftovers from migrations, and verified that we do not delete from these collections. As far as I know we never delete from these collections; we drop them on successful completion.

Comment by Spencer Brody (Inactive) [ 27/Aug/12 ]

Do you delete documents from the collection(s) in question? Is it possible that those documents have simply been deleted as part of normal operation? As I mentioned before, failed migrations can leave "orphan" documents behind on shards that don't own that chunk range, and from the config dump you attached, it looks like you do have a lot of failed migrations going on. If a stale version of those documents was left behind on one shard, and the up-to-date version was then deleted from the shard that actually owned that document's chunk, you could end up in a situation like this, where a document shows up when querying the shard directly but doesn't exist when querying through the mongos, because it has actually been deleted.

Comment by Robert Jobson [ 20/Aug/12 ]

Config dump added to the private ticket as requested:

SUPPORT-323 (Community Private): "SERVER-6781 private - documents in the wrong shard."
https://jira.mongodb.org/browse/SUPPORT-323

Comment by Robert Jobson [ 17/Aug/12 ]

Attached a zipped copy of the mongos logs from the 15th and 16th.

Comment by Robert Jobson [ 17/Aug/12 ]

We only just noticed this yesterday.

> ObjectId("502cf2ee4e1e7558750007b9").getTimestamp();
ISODate("2012-08-16T13:17:34Z")

ObjectId("502c25a87100fffd52001729").getTimestamp();
ISODate("2012-08-15T22:41:44Z")

We rotate the logs daily; I will attach the appropriate two days.

I'll get back to you early next week about the dump.

Comment by Spencer Brody (Inactive) [ 17/Aug/12 ]

How long ago did this start showing up? Were there any failures or problems with the cluster before you noticed this that may be related?

Do you have mongos logs covering the period from before the problem manifested to after it did? Ideally with a few hours of context on either side of when the problem showed up.

Finally, can you attach a dump of your config database, taken by running mongodump against a config server?
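
A minimal form of that dump, with a placeholder hostname/port standing in for one of your config servers:

mongodump --host config1.example.com --port 27019 --db config --out config_dump/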

If you'd rather not attach that to this publicly viewable ticket, you can create a ticket in the "Community Private" project, attach the logs/dump there, then post a link to the Community Private ticket here. Tickets in the Community Private project are only viewable by the reporter and employees of 10gen.

Comment by Robert Jobson [ 17/Aug/12 ]

These are not duplicates; they only exist on the wrong shard. Firstly, they do not show up in queries through the router (mongos), and secondly, querying the correct shard directly does not return them. We first noticed this problem because the .count() returned by the router did not match the number of records returned, and the process designed to use these records could not find them and returned errors (a sketch of that count check follows the examples below).

Example 1

shard0000
> db.listing_fields217_559606.find({ "hash_fk_id" : "09fefa68" }, { hash_fk_id : 1 });
{ "_id" : ObjectId("502cf2ee4e1e7558750007b9"), "hash_fk_id" : "09fefa68" }

shard0001
> db.listing_fields217_559606.find({ "hash_fk_id" : "09fefa68" }, { hash_fk_id : 1 });
<nothing returned>

Example 2

shard0000
> db.listing_fields2002_558075.find({ "hash_fk_id" : "426a4c58" }, { hash_fk_id : 1 });
{ "_id" : ObjectId("502c25a87100fffd52001729"), "hash_fk_id" : "426a4c58" }

shard0001
> db.listing_fields2002_558075.find({ "hash_fk_id" : "426a4c58" }, { hash_fk_id : 1 });
<nothing returned>
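
For completeness, the count check mentioned above looks roughly like the following (a sketch against one of the collections above; compare the two numbers as seen through the router):

mongos> db.listing_fields217_559606.count()                  // server-side count via the router
mongos> db.listing_fields217_559606.find().toArray().length  // documents actually returned via the router

When those two disagree, we go looking on the individual shards as in the examples above.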

Comment by Spencer Brody (Inactive) [ 17/Aug/12 ]

Sometimes documents can be left behind on a shard they don't belong to due to failed migrations. When this happens, however, the document should still remain on the shard that it belongs on; these orphans are basically out-of-date duplicate data. When you find a document living on the wrong shard, can you try querying the other shard for a document with the same _id and confirm that the document on the wrong shard is just a duplicate?
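
For example, something along these lines for the first document from the description (substituting whichever _id is relevant), run directly against shard0001, the shard that owns the chunk covering "09fefa68":

> db.listing_fields217_559606.find({ "_id" : ObjectId("502cf2ee4e1e7558750007b9") })

If that returns a copy of the document, the one on shard0000 is just a stale orphan left by a failed migration; if nothing comes back, the document really is stranded on the wrong shard.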
