[SERVER-30708] _id index returning more than one document with same _id in aggregations and counts. Created: 17/Aug/17  Updated: 23/Jul/18  Resolved: 20/Jun/18

Status: Closed
Project: Core Server
Component/s: Aggregation Framework, Sharding
Affects Version/s: 3.4.6
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Lucas Assignee: Kelsey Schubert
Resolution: Done Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 14.04


Attachments: PNG File aggregation_gui1.png     PNG File aggregation_gui2.png     PNG File aggregation_shell.png     File explain_primary_aggregation.json     File explain_primary_find.json     File explain_secondary_aggregation.json     File explain_secondary_find.json     File queries.7z     File sh.status.tar.gz    
Issue Links:
Related
related to SERVER-3645 Sharded collection counts (on primary... Closed
related to SERVER-5931 Secondary reads in sharded clusters n... Closed
Operating System: ALL
Steps To Reproduce:

I don't know. This was found by one of my employees, and I don't know how to explain it to my customers or even to my colleagues.

Participants:

 Description   

I've found a strange behavior in one of my sharded clusters.

I will keep this report brief because I really don't know how to explain this. I just need to know what kind of data is needed to debug this behavior and help fix it as soon as possible, because it is impacting some of my customers.

Basically, I have a sharded cluster IN PRODUCTION with 7 shards (each with a replication factor of 3), 6 mongos routers and 3 config servers.

All those nodes were created using binary version 3.4.4, but they have already passed through different versions (trying to escape another bug): 3.4.0, 3.4.3 and 3.4.6. Currently on version 3.4.6.

My shard key is a hashed index on a custom string field named "pid". This field is basically an object identifier, but a given value can repeat thousands of times, for example.

My database has 500 million documents.

All insertions in this database are performed by applications using Java driver version 3.4.2. We had some old applications using Java driver 3.2.2 many months ago, but this cluster was created just a month ago and I really don't believe those applications added any documents to this cluster. The example below concerns one document added to the database with Java driver 3.4.2.

Explaining the problem with commands:

This find in the mongo shell returns only one document:

mongos> db.investigation_cards.find({_id : ObjectId("5988e4ea8584c230ad486e43")}, {_id : 1, pid : 1})
{ "_id" : ObjectId("5988e4ea8584c230ad486e43"), "pid" : "10155585763579589_1923551154567182" }

This count, however, returns 2 in the mongo shell:

mongos> db.investigation_cards.count({_id : ObjectId("5988e4ea8584c230ad486e43")})
2

What is really impacting my customers is that if I execute the aggregation shown below in my application, some documents are returned twice. In one MongoDB GUI we can see the same result, but in others the document isn't returned twice (see the aggregation_* pictures).

db.investigation_cards.aggregate(
[
	{ "$match" : { _id : ObjectId("5988e4ea8584c230ad486e43")}},
	{ "$project" : { _id : 1, "pid" : 1}},
]
)
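The find/count discrepancy above is consistent with an orphaned copy of the document sitting on a shard that does not own its chunk: in 3.4, count simply sums each shard's local count (orphans included), while a sharding-aware primary filters find results by chunk ownership. A minimal Python simulation of that routing behavior (all shard contents and chunk ownership below are made up for illustration):

```python
# Simulation of how a mongos-level count can disagree with find when an
# orphaned copy of a document exists on a shard that does not own its chunk.
# All shard contents and ownership below are hypothetical.

doc_id = "5988e4ea8584c230ad486e43"

# Each shard's local collection contents (shard_0 holds an orphan copy).
shards = {
    "shard_0": [{"_id": doc_id}],   # orphan left behind by a failed migration
    "shard_3": [{"_id": doc_id}],   # legitimate owner of the chunk
}

# Which shard actually owns the document's chunk, per the config metadata.
owner = {doc_id: "shard_3"}

def mongos_count(_id):
    # count() just sums each shard's local count; orphans are included.
    return sum(sum(1 for d in docs if d["_id"] == _id)
               for docs in shards.values())

def mongos_find(_id):
    # A sharding-aware primary filters out documents whose chunk it does
    # not own, so only the owning shard returns the document.
    return [d for name, docs in shards.items()
            for d in docs if d["_id"] == _id and owner[_id] == name]

print(mongos_count(doc_id))       # 2
print(len(mongos_find(doc_id)))   # 1
```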



 Comments   
Comment by Kelsey Schubert [ 21/Mar/18 ]

Hi lucasoares,

To clarify, as there are a few different issues going on here:

  • SERVER-5931 resolves the issue of queries returning orphan documents that are being run on sharded secondaries in 3.6 and later.
  • SERVER-3645 resolves the issue of counts with a query predicate on sharded collection being incorrect (even when executed against a primary) in the upcoming 4.0 release and later.
  • SERVER-33753 is open to track the issue where counts without a query predicate on a sharded collection may be inaccurate (even when executed against a primary).

Starting in MongoDB 4.0, I would expect that any count command performed with a predicate would return accurate results regardless of whether it is executed against a primary or a secondary.

Please let me know if you have any additional questions or if this ticket can be resolved.

Thank you,
Kelsey

Comment by Lucas [ 20/Mar/18 ]

I read on SERVER-5931 that it only tracks counts without any query criteria; is that correct?

This ticket is tracking count() - which uses metadata for collection to quickly get the total count of documents.

If so, SERVER-5931 doesn't track our problem, since we are getting wrong results from counts with a specified query.

Comment by Lucas [ 20/Mar/18 ]

Sorry for the delay.

Yes, you are correct.

We are getting wrong reads in counts/finds/aggregations on secondaries (you said this is fine now in 3.6) and wrong count results on primaries. Both are tracked in the issues you mentioned before. Thanks.

Comment by Kelsey Schubert [ 16/Feb/18 ]

Hi lucasoares,

Thank you for the additional details. The behavior you describe in your latest comment is expected and tracked in SERVER-3645 and SERVER-5931 (fixed in MongoDB 3.6). Therefore, I'd suggest that you consider upgrading to MongoDB 3.6. In MongoDB 3.4 and below, only primaries are sharding-aware and will filter out orphans.

In your original ticket description, you mention that aggregation results on the primary may return duplicated documents. My understanding is that you haven't observed this behavior since the original report. Is this correct?

Thanks again,
Kelsey

Comment by Lucas [ 06/Feb/18 ]

explain_primary_find.json explain_secondary_aggregation.json explain_secondary_find.json sh.status.tar.gz explain_primary_aggregation.json

Hello kelsey.schubert.

The aggregation appears to return wrong results only in secondary reads.

We are constantly facing this issue here. We now have a customized migration window (only from 00:00 to 06:00) to minimize this problem, but our customers are complaining about wrong results in finds and aggregations when reading from secondaries.

I will again provide several outputs, both for the explains and the sharding status.

To find a query with duplicated _id documents, I executed this query:

db.investigation_cards.aggregate({
	$match: {
		account_id: 6347511650320384,
		universe_id: 6513148473376768
	}
}, {
	$group: {
		_id: '$_id',
		count: {
			$sum: 1
		}
	}
}, {
	$match: {
		count: {
			$gt: 1
		}
	}
})
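The pipeline above groups by _id and keeps only the groups whose count exceeds 1. The same logic, replicated over plain Python dicts with made-up sample documents:

```python
# Pure-Python equivalent of the $group/$match pipeline above: group documents
# by _id, count occurrences, and keep ids that appear more than once.
# The sample documents are hypothetical.
from collections import Counter

docs = [
    {"_id": "a", "account_id": 1, "universe_id": 2},
    {"_id": "b", "account_id": 1, "universe_id": 2},
    {"_id": "a", "account_id": 1, "universe_id": 2},  # duplicate (orphan copy)
]

counts = Counter(d["_id"] for d in docs)                 # $group + $sum: 1
duplicates = {k: v for k, v in counts.items() if v > 1}  # $match: count > 1

print(duplicates)  # {'a': 2}
```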

I found many documents, but I will use this object to explain and execute queries:

{ "_id" : ObjectId("5a21945c11a29609886cdffb"), "pid" : "1660268284693453997_5939071950" }

I will use a 'mongos' to connect and execute all operations.

Count in primary results in two documents.
Count in secondary results in two documents.

Find in primary results in one document.
Find in secondary results in two documents.

Aggregation in primary results in one document.
Aggregation in secondary results in two documents.

Files uploaded:

explain_secondary_find.json
explain_primary_find.json
Find explain in secondary and primary using this command:

db.investigation_cards.find({ "_id" : ObjectId("5a21945c11a29609886cdffb")}).explain("executionStats")

explain_secondary_aggregation.json
explain_primary_aggregation.json
Aggregation explain in secondary and primary using this command:

db.investigation_cards.aggregate([{$match : { "_id" : ObjectId("5a21945c11a29609886cdffb")}}, {$group : {_id : '$_id', count : {$sum : 1}}}], {explain : true})

sh.status.tar.gz
Sharding status of my cluster.

Additional info:

  1. Our environment is complex and we are constantly updating our database (several applications updating documents).
  2. I saw many aborted migrations. I'm not sure why those migrations are failing. We have a 10 Gbps network and those migrations should be fast enough not to fail.
  3. We added two more nodes over a month ago. Since then we have had many more issues with duplications.

Comment by Kelsey Schubert [ 26/Dec/17 ]

Hi lucasoares,

Thank you for providing this information; we're tracking the issue you're observing with the counts in SERVER-3645. Unfortunately, that ticket does not explain the behavior you previously observed with the aggregation commands. If you observe this behavior again, would you please provide the explain output again so we can continue to investigate that issue?

Thanks,
Kelsey

Comment by Lucas [ 21/Dec/17 ]

Hello. I found duplicated documents here, and I will execute find, count and aggregation (and their respective explains) with primary and secondary read preference. As a reminder:
shard_key = pid (hashed)
unique index = account_id, universe_id, pid, channel

The results are in the uploaded file called "queries.7z".

All queries will be executed against the mongos with secondary and primary read preference.

This time only counts returned wrong results with primary read preference. I really don't know why.

Thanks

Comment by Kelsey Schubert [ 21/Dec/17 ]

Hi lucasoares,

I'd like to investigate this behavior in more detail:

Our system has an annotation system where any customer can apply update operations on their documents. When you asked me about secondary reads, I realized that with secondary reads even find operations returned duplicated documents, but with primary reads only counts and aggregations get wrong results (find returns only correct documents).

I assumed these queries were being run through the mongos, and so I would like the explain from the mongos with primary read preference. Please let me know if my assumption is mistaken.

Thank you,
Kelsey

Comment by Lucas [ 19/Dec/17 ]

Hello kelsey.schubert, I will stop my script and I will wait for duplicated documents to execute the explain command.

But just to clarify: do you want me to execute the explain command against the primary of the "wrong" replica set? Or against the mongos but with primary read preference?

Thanks.

Comment by Kelsey Schubert [ 14/Dec/17 ]

Hi lucasoares,

Would you please provide the explain output of the aggregation and find commands executed against the primary?

While we expect that stale data may be returned from a sharded secondary, I would not expect the aggregation commands to return orphan documents on the primary. Thus far we haven't been able to reproduce this specific behavior, but the explain output may help us understand why these commands are providing different results.

Thank you,
Kelsey

Comment by Lucas [ 06/Dec/17 ]

Hello, do you have any update on this issue? I can't add more shards to my cluster because this would cause a huge negative impact on my system.

I'm thinking about an oplog reader to monitor the cluster's migration operations, and then automatically remove the duplicated documents from the original shard.

Comment by Lucas [ 15/Nov/17 ]

Hello again!

I've found some quite interesting news about this issue.

Last week I added two new shards to my cluster. Many customers of my analytics software were receiving wrong and inconsistent data because of the duplicates. Basically, when we added those new shards this problem got worse, and now we know more about it.

Our system has an annotation system where any customer can apply update operations on their documents. When you asked me about secondary reads, I realized that with secondary reads even find operations returned duplicated documents, but with primary reads only counts and aggregations get wrong results (find returns only correct documents).

In my environment we have several application servers constantly updating documents in our database, and if an update happens simultaneously with the balancer, a duplicate is created, making our data inconsistent.

We tried to run the cleanupOrphaned command, but it cleans those duplicates slowly and hurt our server performance. We developed an application to clean up those duplicates without this command, using these operations:
1. Perform a find operation to find all documents with the same _id.
2. Remove all documents with that _id, retaining the newest document in application memory.
3. Insert the newest document with the same _id back into the collection.
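The three steps above can be sketched as follows. This runs against plain dicts rather than a live collection, and the 'updated_at' field used to pick the newest copy is a hypothetical stand-in for whatever field marks document freshness in the real application:

```python
# Sketch of the three-step cleanup described above, run against plain dicts
# instead of a live collection. 'updated_at' is a hypothetical freshness
# field; the real application would use its own.

docs = [
    {"_id": "x", "updated_at": 1, "pid": "p1"},
    {"_id": "x", "updated_at": 5, "pid": "p1"},  # newest copy, must survive
    {"_id": "y", "updated_at": 2, "pid": "p2"},
]

def deduplicate(collection):
    # Step 1: find all copies per _id; steps 2-3: keep only the newest.
    newest = {}
    for doc in collection:
        cur = newest.get(doc["_id"])
        if cur is None or doc["updated_at"] > cur["updated_at"]:
            newest[doc["_id"]] = doc
    return list(newest.values())

result = deduplicate(docs)
print(result)  # two documents: the newest "x" and the single "y"
```

Against a real sharded cluster, the remove and re-insert would need to go through a mongos so the re-inserted document lands on the shard that owns its chunk.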

With that we were able to remove all duplicated documents from our database, but duplicates are still reappearing (because we have 2 new shards still in the balancing process).

I stopped the balancer and created a balancer window outside normal working hours. Unfortunately this will not fix the entire problem, since we have applications updating our documents during those hours as well.

This behavior is really unacceptable and may cause problems every time we try to scale our database with new shards.

This problem doesn't occur only when we add more shards to our cluster; doing that just makes it worse.

Comment by Lucas [ 24/Oct/17 ]

1. Hello. I connected through the mongo shell to execute those commands, and I haven't executed any command with secondary reads.

2. I executed the command db.adminCommand({flushRouterConfig:1}) and the count still returns 2 documents.

3. I will not use the cleanup command for now because I'm in the middle of an internal database operation at my company and I can't afford to spend time on this right now, since it could impact my shard performance.

Comment by Kelsey Schubert [ 05/Sep/17 ]

Hi lucasoares,

Thank you for the additional information, we're looking into the data you've provided.

  • Would you please clarify whether you're using slaveOK (secondary reads) for the commands that return duplicate documents?
  • Would you please execute db.adminCommand({flushRouterConfig:1}) on the mongos running these commands and let us know if it resolves the issue?
  • If it does not, we recommend using the cleanupOrphaned command to remove these documents as we continue to investigate this issue on our side.
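For reference, the documented usage pattern for cleanupOrphaned is to run it on each shard primary and loop on the returned stoppedAtKey until it comes back null. A Python sketch of that loop, with a stubbed admin_command standing in for a live shard primary (the canned responses are hypothetical):

```python
# Simulation of the documented cleanupOrphaned loop: the command runs on a
# shard PRIMARY and returns a 'stoppedAtKey' to resume from; you repeat
# until it returns null. The stub below replaces a live connection, and its
# canned responses are hypothetical.

responses = iter([
    {"ok": 1, "stoppedAtKey": {"pid": -9000000000000000000}},
    {"ok": 1, "stoppedAtKey": {"pid": 0}},
    {"ok": 1, "stoppedAtKey": None},  # no more ranges left to clean
])

def admin_command(cmd):
    # Stub: a real deployment would issue db.adminCommand(cmd) via a driver
    # connected directly to the shard primary (not through mongos).
    return next(responses)

next_key = {}
passes = 0
while next_key is not None:
    result = admin_command({"cleanupOrphaned": "mydb.investigation_cards",
                            "startingFromKey": next_key})
    next_key = result["stoppedAtKey"]
    passes += 1

print(passes)  # 3
```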

Kind regards,
Thomas

Comment by Lucas [ 05/Sep/17 ]

Hello. Sorry for the delay.

I haven't tested this issue in every mentioned version, just the final one.

Repl_0:PRIMARY> db.investigation_cards.getShardVersion()
{
        "configServer" : "ClusterConfigs/config1:7308,config2:7308,config3:7308",
        "inShardedMode" : false,
        "mine" : Timestamp(0, 0),
        "global" : Timestamp(9276, 8637),
        "ok" : 1
}

And the mongodump you requested was uploaded to the secure portal.

Is there any workaround to fix this without impacting the performance of my cluster?

Comment by Kelsey Schubert [ 24/Aug/17 ]

Hi lucasoares,

Thanks for the additional information. The orphan document may be on Repl_0 as the result of a failed migration or an unclean shutdown. In this case, the aggregation query targets each shard, since the $match stage does not contain the shard key prefix. However, despite the query being sent to each shard, we do not expect the aggregation query to return the orphan document.

We're working to reproduce this issue. To help us continue to investigate, would you please answer the following questions?

I've created a secure portal for you to use to upload the dump.

Thank you,
Thomas

Comment by Lucas [ 22/Aug/17 ]

Hello anonymous.user.

Here you are:
Shard containing an affected document:

{ "pid" : NumberLong("-9192494240312909235") } -->> { "pid" : NumberLong("-9191783001783409169") } on : HIDDENClusterRepl_3 Timestamp(9254, 0) 

Connecting directly to the primary of HIDDENClusterRepl_3 and executing queries:

HIDDENClusterRepl_3:PRIMARY> db.investigation_cards.find({_id : ObjectId("5988e4ea8584c230ad486e43")}, {_id : 1, pid : 1})
{ "_id" : ObjectId("5988e4ea8584c230ad486e43"), "pid" : "10155585763579589_1923551154567182" }

HIDDENClusterRepl_3:PRIMARY> db.investigation_cards.aggregate([{ "$match" : { _id : ObjectId("5988e4ea8584c230ad486e43")}}, { "$project" : { _id : 1, "pid" : 1}}])
{ "_id" : ObjectId("5988e4ea8584c230ad486e43"), "pid" : "10155585763579589_1923551154567182" }

I found the inconsistent shard by executing those same commands on every single shard. Even though the hash points to Repl_3, another shard has the same document. Look:

Repl_0 (Inconsistent one):

HIDDENClusterRepl_0:PRIMARY> db.investigation_cards.find({_id : ObjectId("5988e4ea8584c230ad486e43")}, {_id : 1, pid : 1})
{ "_id" : ObjectId("5988e4ea8584c230ad486e43"), "pid" : "10155585763579589_1923551154567182" }
 
HIDDENClusterRepl_0:PRIMARY> db.investigation_cards.count({_id : ObjectId("5988e4ea8584c230ad486e43")})
1
 
HIDDENClusterRepl_0:PRIMARY> db.investigation_cards.aggregate( [ { "$match" : { _id : ObjectId("5988e4ea8584c230ad486e43")}}, { "$project" : { _id : 1, "pid" : 1}}, ] )
{ "_id" : ObjectId("5988e4ea8584c230ad486e43"), "pid" : "10155585763579589_1923551154567182" }

Repl_1 (OK)

HIDDENClusterRepl_1:PRIMARY> db.investigation_cards.find({_id : ObjectId("5988e4ea8584c230ad486e43")}, {_id : 1, pid : 1})
HIDDENClusterRepl_1:PRIMARY> db.investigation_cards.count({_id : ObjectId("5988e4ea8584c230ad486e43")})
0
HIDDENClusterRepl_1:PRIMARY> db.investigation_cards.aggregate( [ { "$match" : { _id : ObjectId("5988e4ea8584c230ad486e43")}}, { "$project" : { _id : 1, "pid" : 1}}, ] )

Repl_2 (OK)

HIDDENClusterRepl_2:PRIMARY> db.investigation_cards.find({_id : ObjectId("5988e4ea8584c230ad486e43")}, {_id : 1, pid : 1})
HIDDENClusterRepl_2:PRIMARY> db.investigation_cards.count({_id : ObjectId("5988e4ea8584c230ad486e43")})
0
HIDDENClusterRepl_2:PRIMARY> db.investigation_cards.aggregate( [ { "$match" : { _id : ObjectId("5988e4ea8584c230ad486e43")}}, { "$project" : { _id : 1, "pid" : 1}}, ] )

Repl_3 (OK, with the element)

HIDDENClusterRepl_3:PRIMARY> db.investigation_cards.find({_id : ObjectId("5988e4ea8584c230ad486e43")}, {_id : 1, pid : 1})
{ "_id" : ObjectId("5988e4ea8584c230ad486e43"), "pid" : "10155585763579589_1923551154567182" }
HIDDENClusterRepl_3:PRIMARY> db.investigation_cards.count({_id : ObjectId("5988e4ea8584c230ad486e43")})
1
HIDDENClusterRepl_3:PRIMARY> db.investigation_cards.aggregate( [ { "$match" : { _id : ObjectId("5988e4ea8584c230ad486e43")}}, { "$project" : { _id : 1, "pid" : 1}}, ] )
{ "_id" : ObjectId("5988e4ea8584c230ad486e43"), "pid" : "10155585763579589_1923551154567182" }

Repl_4 (OK)

HIDDENClusterRepl_4:PRIMARY> db.investigation_cards.find({_id : ObjectId("5988e4ea8584c230ad486e43")}, {_id : 1, pid : 1})
HIDDENClusterRepl_4:PRIMARY> db.investigation_cards.count({_id : ObjectId("5988e4ea8584c230ad486e43")})
0
HIDDENClusterRepl_4:PRIMARY> db.investigation_cards.aggregate( [ { "$match" : { _id : ObjectId("5988e4ea8584c230ad486e43")}}, { "$project" : { _id : 1, "pid" : 1}}, ] )

Repl_5 (OK)

HIDDENClusterRepl_5:PRIMARY> db.investigation_cards.find({_id : ObjectId("5988e4ea8584c230ad486e43")}, {_id : 1, pid : 1})
HIDDENClusterRepl_5:PRIMARY> db.investigation_cards.count({_id : ObjectId("5988e4ea8584c230ad486e43")})
0
HIDDENClusterRepl_5:PRIMARY> db.investigation_cards.aggregate( [ { "$match" : { _id : ObjectId("5988e4ea8584c230ad486e43")}}, { "$project" : { _id : 1, "pid" : 1}}, ] )

Repl_6 (OK)

HIDDENClusterRepl_6:PRIMARY> db.investigation_cards.find({_id : ObjectId("5988e4ea8584c230ad486e43")}, {_id : 1, pid : 1})
HIDDENClusterRepl_6:PRIMARY> db.investigation_cards.count({_id : ObjectId("5988e4ea8584c230ad486e43")})
0
HIDDENClusterRepl_6:PRIMARY> db.investigation_cards.aggregate( [ { "$match" : { _id : ObjectId("5988e4ea8584c230ad486e43")}}, { "$project" : { _id : 1, "pid" : 1}}, ] )

mongos (duplication in counts and in some occasions, in the aggregations)

mongos> db.investigation_cards.find({_id : ObjectId("5988e4ea8584c230ad486e43")}, {_id : 1, pid : 1})
{ "_id" : ObjectId("5988e4ea8584c230ad486e43"), "pid" : "10155585763579589_1923551154567182" }
 
mongos> db.investigation_cards.count({_id : ObjectId("5988e4ea8584c230ad486e43")})
2
 
mongos> db.investigation_cards.aggregate( [ { "$match" : { _id : ObjectId("5988e4ea8584c230ad486e43")}}, { "$project" : { _id : 1, "pid" : 1}}, ] )
{ "_id" : ObjectId("5988e4ea8584c230ad486e43"), "pid" : "10155585763579589_1923551154567182" }

How did that document get there even with a hashed 'pid' index?

Comment by Kelsey Schubert [ 21/Aug/17 ]

Hi lucasoares,

I'd like to determine whether these documents reside on the same node. To do this, would you please execute the following steps:

  1. Identify the shard containing an affected document:
    • Determine the hash of the pid field of an affected document (I've computed it already for the document in your description)
      • Start a temporary local mongod with the following configuration

        setParameter:
                enableTestCommands: 1
        

      • Execute the following command

        db.runCommand({_hashBSONElement:"10155585763579589_1923551154567182"})
        {
        	"key" : "10155585763579589_1923551154567182",
        	"seed" : 0,
        	"out" : NumberLong("-9192485313660080638"),
        	"ok" : 1
        }
        

    • From a mongos connected to your sharded cluster, execute sh.status(true) and compare the ranges to the hash computed above to determine the shard containing the document
  2. Connect directly to the primary of the shard containing the document (not using mongos)
  3. Rerun the two commands in the description and provide the output in this ticket

    db.investigation_cards.find({_id : ObjectId("5988e4ea8584c230ad486e43")}, {_id : 1, pid : 1})
    db.investigation_cards.aggregate(
    [
    	{ "$match" : { _id : ObjectId("5988e4ea8584c230ad486e43")}},
    	{ "$project" : { _id : 1, "pid" : 1}},
    ]
    )
    

This information will allow us to focus our investigation on whether this behavior is the result of orphan documents.
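Step 1c amounts to a range lookup: find the chunk whose [min, max) interval contains the computed hash. A small Python sketch of that lookup, using the single chunk range and hash value that appear elsewhere in this ticket (a real sh.status(true) lists many ranges):

```python
# Find which chunk owns a hashed shard key value by binary search over the
# chunk lower bounds. The single (min, max, shard) range and the hash value
# below are taken from this ticket; a real cluster has many such ranges.
import bisect

# Chunk ranges sorted by min; max is exclusive.
chunks = [
    (-9192494240312909235, -9191783001783409169, "HIDDENClusterRepl_3"),
]

def owning_shard(hash_value):
    mins = [c[0] for c in chunks]
    i = bisect.bisect_right(mins, hash_value) - 1
    if i >= 0 and chunks[i][0] <= hash_value < chunks[i][1]:
        return chunks[i][2]
    return None  # hash falls outside every listed range

print(owning_shard(-9192485313660080638))  # HIDDENClusterRepl_3
```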

Thanks for your help,
Thomas

Generated at Thu Feb 08 04:24:44 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.