[SERVER-32761] Missing document Created: 18/Jan/18  Updated: 27/Oct/23  Resolved: 21/Mar/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.4.10
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Tudor Aursulesei Assignee: Kaloian Manassiev
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

I've moved some data manually, using moveChunk, and a document is missing. I can find it manually, by querying a shard which has it, but when I make the same query through mongos it's not there. The balancer is now disabled.

mongos> db.smtp_in_auth.find({"_id": ObjectId('5a590cf4016548d302f5eae0')})
mongos> db.smtp_in_auth.find({"_id": ObjectId('5a590cf4016548d302f5eae0')}).explain()
{
        "queryPlanner" : {
                "mongosPlannerVersion" : 1,
                "winningPlan" : {
                        "stage" : "SINGLE_SHARD",
                        "shards" : [
                                {
                                        "shardName" : "rs4",
                                        "connectionString" : "rs4/XXX:10104,XXY:10104",
                                        "serverInfo" : {
                                                "host" : "XYZ",
                                                "port" : 10104,
                                                "version" : "3.4.10",
                                                "gitVersion" : "078f28920cb24de0dd479b5ea6c66c644f6326e9"
                                        },
                                        "plannerVersion" : 1,
                                        "namespace" : "db.col",
                                        "indexFilterSet" : false,
                                        "parsedQuery" : {
                                                "_id" : {
                                                        "$eq" : ObjectId("5a590cf4016548d302f5eae0")
                                                }
                                        },
                                        "winningPlan" : {
                                                "stage" : "SHARDING_FILTER",
                                                "inputStage" : {
                                                        "stage" : "IDHACK"
                                                }
                                        },
                                        "rejectedPlans" : [ ]
                                }
                        ]
                }
        },
        "ok" : 1
}
mongos>
 
rs4:PRIMARY> use db
switched to db db
rs4:PRIMARY> db.col.findOne({"_id": ObjectId('5a590cf4016548d302f5eae0')})
null
 
rs1:PRIMARY> use db
switched to db db
rs1:PRIMARY> db.col.findOne({"_id": ObjectId('5a590cf4016548d302f5eae0')})
{ "_id" : ObjectId("5a590cf4016548d302f5eae0") }

So the cluster thinks that this _id is located on rs4, when it actually resides on rs1. I know from experience that restarting everything a couple of times fixes this issue, but I'd rather not do that.
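
For what it's worth, the chunk ownership that mongos routes from can also be checked directly in the config metadata. A minimal sketch against the mongos, assuming the shard key is _id and using the db.col namespace from the explain output above:

mongos> use config
mongos> // show each chunk of db.col with its key range and owning shard
mongos> db.chunks.find({ ns: "db.col" }, { min: 1, max: 1, shard: 1 }).sort({ min: 1 })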



 Comments   
Comment by Kaloian Manassiev [ 21/Mar/18 ]

Hi thestick613,

You are right, I was confused by the ticket description. This is most likely an orphaned document on rs1, which was moved to rs4 at some point in the past but whose cleanup on rs1 didn't complete. Because of this it is still visible there.
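
For completeness, orphaned documents like this can be removed by running the cleanupOrphaned command (linked below in this thread) against the primary of the shard that still holds the stale copy. A rough sketch following the loop from that documentation, run from a shell connected to the rs1 primary and using the db.col namespace from this ticket:

// run on the rs1 primary
var nextKey = {};
while (nextKey != null) {
    // each call removes orphaned documents from one contiguous range and reports where it stopped
    var result = db.adminCommand({ cleanupOrphaned: "db.col", startingFromKey: nextKey });
    if (result.ok != 1) {
        printjson(result);   // failure or timeout; inspect and retry
        break;
    }
    nextKey = result.stoppedAtKey;
}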

Since this behaviour is expected, I am closing this ticket as 'Works as Designed'.

Best regards,
-Kal.

Comment by Tudor Aursulesei [ 15/Mar/18 ]

I've read about this for a while. Isn't this just an orphaned document?

https://docs.mongodb.com/manual/reference/glossary/#term-orphaned-document
https://docs.mongodb.com/manual/reference/command/cleanupOrphaned/

Comment by Kaloian Manassiev [ 09/Mar/18 ]

Hi thestick613,

"There are still some documents that I'm able to find on a shard but unable to find on the full cluster."

If this problem is still reproducible, can you please provide us with the following information:

  1. One such query which exhibits this problem, and the results it returns both when run against mongos (where it doesn't return results) and against the shard primary (where it does).
  2. The explain output of this query against the same mongos.
  3. The output of the getShardVersion command, both against the same mongos and against the shard primary. This is the command: {{ db.adminCommand({getShardVersion: "<db>.<collection>"}) }} (see the examples after this list).
  4. Dump of your config database.
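
For item 3, a minimal example of the two getShardVersion calls, using the db.col namespace from the explain output above:

mongos> db.adminCommand({ getShardVersion: "db.col" })
rs4:PRIMARY> db.adminCommand({ getShardVersion: "db.col" })

For item 4, the config database can be dumped with, for example, mongodump --host <mongos host>:<port> --db config.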

Thank you in advance.

Best regards,
-Kal.

Comment by Tudor Aursulesei [ 08/Mar/18 ]

The script isn't very reliable, because documents disappear between looking them up on the shard and looking them up through the 'cluster'/mongos. I've changed it to iterate over all documents on a shard and then look each one up with find().limit(1).explain(). If the query planner/router gives me a different shard than the one I'm currently processing, I run moveChunk. I've found a few more 'zombie' documents using this approach. I'm still not sure I've fixed all the discrepancies, because documents come and go.
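
For reference, a rough sketch of what such a script might look like in the mongo shell, assuming the shard key is _id; the shard name, host and the db.col namespace below are the placeholders used elsewhere in this ticket:

// run from a shell connected to the mongos
var shardName = "rs4";
var shardConn = new Mongo("XXX:10104");                       // direct connection to the rs4 primary
var shardColl = shardConn.getDB("db").getCollection("col");   // collection as seen on the shard
var mongosColl = db.getSiblingDB("db").getCollection("col");  // same collection as seen through mongos

shardColl.find({}, { _id: 1 }).forEach(function(doc) {
    // ask the router which shard it would target for this _id
    var exp = mongosColl.find({ _id: doc._id }).limit(1).explain();
    var target = exp.queryPlanner.winningPlan.shards[0].shardName;
    if (target != shardName) {
        // the router points elsewhere; move the chunk onto the shard that actually has the document
        sh.moveChunk("db.col", { _id: doc._id }, shardName);
    }
});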

Comment by Tudor Aursulesei [ 06/Mar/18 ]

No, I wasn't running with slaveOk=true.
There is nothing relevant about this in the logs.

There are still some documents that I'm able to find on a shard but unable to find on the full cluster. I know that querying specific shards directly in a sharded cluster is not a good idea. I've been running a script which tries to find such occurrences, and when it finds one, it runs moveChunk so that the missing document ends up on the shard where it was found.

Comment by Kaloian Manassiev [ 23/Feb/18 ]

Hi thestick613,

My apologies for the delayed response.

Running moveChunk with the balancer enabled, while not recommended, should not cause any correctness problems.

From the explain output, I see that the version you were using is 3.4.10. Please correct me if I am wrong (I also updated the ticket's affected version field).

I have a couple of follow-up questions:

  • Are you running the find command with slaveOk=true by any chance (an example of how slaveOk is typically enabled is shown after this list)? Up until version 3.6.0, using slaveOk turns off routing information validation on the shards and may result in incorrect routing.
  • Do you happen to still have the logs from around the time of this incident from rs1 and rs4 primaries' and also from the mongos, which could not find the document? If so, can you please upload them?
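
For reference, slaveOk is what gets enabled when the shell is told it may read from secondaries, typically with one of the following:

rs.slaveOk()                  // shell helper, applies to the current connection
db.getMongo().setSlaveOk()    // equivalent connection-level setting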

Thank you in advance.

Best regards,
-Kal.

Comment by Tudor Aursulesei [ 18/Jan/18 ]

I might have triggered this myself, by running moveChunk with the balancer enabled. I'm currently trying to find documents that exist on a replica set but are missing from the sharded cluster, and I'm running moveChunk with the found _id towards the replica set that should have them.

Comment by Tudor Aursulesei [ 18/Jan/18 ]

I've managed to fix this by issuing two new moveChunk commands from Python:

# assuming pm is a pymongo.MongoClient connected to the mongos
print pm.admin.command('moveChunk', '%s.%s' % (database, collection), find={'_id': bson.ObjectId('5a590cf4016548d302f5eae0')}, to='rs4')
print pm.admin.command('moveChunk', '%s.%s' % (database, collection), find={'_id': bson.ObjectId('5a590cf4016548d302f5eae0')}, to='rs1')

I'm not sure which one of them did the trick, but it's okay now. Restarting the mongo config servers didn't do anything.
