[SERVER-25791] Unique index becomes inconsistent with collection data Created: 25/Aug/16  Updated: 18/Apr/17  Resolved: 29/Mar/17

Status: Closed
Project: Core Server
Component/s: Index Maintenance, Querying, WiredTiger, Write Ops
Affects Version/s: 3.2.8
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Šimun Mikecin Assignee: Eric Milkie
Resolution: Duplicate Votes: 3
Labels: uic
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File repro.sh    
Issue Links:
Duplicate
duplicates SERVER-28546 Documents can erroneously be unindexe... Closed
Related
Operating System: ALL
Steps To Reproduce:

Try scenario described in the description above.

Participants:

 Description   

I'm using an unique index:

    {
        "v" : 1,
        "unique" : true,
        "key" : {
            "state.agent" : 1
        },
        "name" : "state.agent_1",
        "ns" : "live.tasks.active",
        "partialFilterExpression" : {
            "state.value" : "assigned"
        },
        "background" : true
    }

My applications works in a way that many times an update (using $set and $unset) is sent that should be rejected with an ""E11000 duplicate key error collection: live.tasks.active index: state.agent_1" because of duplicate value in the above index.
After running for ~20 days I realized that above index has become inconsistent with collection data in a way that a query using _id show's a document that should be in this index:

db.tasks.active.find({_id:ObjectId("57bea0f21ebfd0b31f32da1c")})

but a query that shows all data in the index doesn't show that document:

db.tasks.active.find({"state.value":"assigned"}, {"state.agent": 1})

I have tried to do a db.tasks.active.reIndex(), but it failed with a message that db.tasks.active contains duplicate values for an above index so that above index cannot be re-added.
It seems that because of an error sometimes a duplicate index value is allowed to enter into a collection document instead of returing ""E11000 duplicate key error collection: live.tasks.active index: state.agent_1".
I suspect that after this an index becomes inconsistent: isn't updated everytime as it should.

Most times I do get E11000, but from time to time update doesn't return an error (as it should) and updates a document making a duplicate index key entry that is not visible in the index itself.



 Comments   
Comment by Eric Milkie [ 29/Mar/17 ]

We discovered a problem with partial indexes that could explain the behavior you are seeing. Please follow the linked ticket for further updates.

Comment by Eric Milkie [ 27/Mar/17 ]

Hi sime,
I've been attempting to reproduce this behavior but haven't been successful yet.
Are you still experiencing this issue? If possible, could you run a query to get how many documents are indexed by your partial index, and how many documents ought to be indexed by your partial index? You can use an index hint to determine this. My suspicion is that a subset of documents that should be going into the index are not, even though the documents pass the partial filter expression.

Comment by Šimun Mikecin [ 07/Sep/16 ]

Thomas,

1. I can confirm that there were no unclean shutdowns
2. We are using 3 drivers (2 for PHP and one for C++):

PHP: mongo
MongoDB Support => enabled
Version => 1.6.10
Streams Support => enabled
SSL Support => enabled
Supported Authentication Mechanisms
MONGODB-CR => enabled
SCRAM-SHA-1 => enabled
MONGODB-X509 => enabled
GSSAPI (Kerberos) => disabled
PLAIN => disabled
Directive => Local Value => Master Value
mongo.allow_empty_keys => 0 => 0
mongo.chunk_size => 261120 => 261120
mongo.cmd => $ => $
mongo.default_host => localhost => localhost
mongo.default_port => 27017 => 27017
mongo.is_master_interval => 15 => 15
mongo.long_as_object => 0 => 0
mongo.native_long => 1 => 1
mongo.ping_interval => 5 => 5

PHP: mongodb
mongodb support => enabled
mongodb version => 1.1.6
mongodb stability => stable
libmongoc version => 1.3.5
libbson version => 1.3.5

C++: libmongodb-2.4.14-1.el6.x86_64 (RPM package from RHEL 6)

3. I can confirm that we are not using bulk upserts, but we are doing updates that affect multiple documents.

Comment by Kelsey Schubert [ 06/Sep/16 ]

Hi sime,

Thank you for uploading the files, we are still working on reproducing this issue to understand the root cause.

  1. Would you please confirm that no unclean shutdowns occurred between 2016-08-25T14:27:01 and when this issue was observed?
  2. Which driver are you using? Which specific version?
  3. Does your application perform bulk upserts?

Manual intervention is required to resolve this inconsistent state. I would recommend the following approach:

  1. run an aggregation command to identify the documents that violate the unique constraint
  2. delete or modify the duplicate documents to resolve the duplicate key error
  3. reindex the unique index

Please let me know if you have any questions about this approach.

Thank you,
Thomas

Comment by Šimun Mikecin [ 29/Aug/16 ]

I have just uploaded all log files I have and content of diagnostic.data to the upload portal.

Currently, there is still invalid situation: query that uses an unique index doesn't return the same number of documents as a query that doesn't use an index.

Comment by Kelsey Schubert [ 26/Aug/16 ]

Hi sime,

Thank you for dump, I have moved it to secure upload portal where it will be only visible to MongoDB employees investigating the issue, and I've confirmed documents exist that violate the unique index's restriction.

It appears the problematic documents were inserted between 2016-08-25T14:27:01.000Z and 2016-08-25T14:27:03.000Z.

So we can continue to investigate, would you please upload the complete logs and an archive of the diagnostic.data to the upload portal?

As Ramon mentioned, if you are able to provide code that reproduces this issue it would greatly speed our investigation.

Kind regards,
Thomas

Comment by Šimun Mikecin [ 26/Aug/16 ]

Last night we did an upgrade to version 3.2.9.
Today I see some inconsistency with unique index.
I run the same query: once without using any index and the other time using default unique index.
Queries do not return the same number of documents (16 vs 14), which means that unique index is not consistent with data in collection.
As you can see from the results below, two documents would have a duplicate value in the unique index. So obviously an update operation that adds a duplicate value to a document was allowed instead of returning E11000 for those two.

Query with not using an index:

db.getCollection('tasks.active').find({"state.value":"assigned"}, {"state.agent":1}).sort({"state.agent":1}).hint( { $natural: 1 } )
Results:
/* 1 */
{
    "_id" : ObjectId("57bef6a8e8bb8f5dd12364fb"),
    "state" : {
        "agent" : "hrabtj"
    }
}
 
/* 2 */
{
    "_id" : ObjectId("57bf0037e8bb8fc1d42364a6"),
    "state" : {
        "agent" : "hrafnv"
    }
}
 
/* 3 */
{
    "_id" : ObjectId("57bfd7d3b5913263e54c0c98"),
    "state" : {
        "agent" : "hrafrl"
    }
}
 
/* 4 */
{
    "_id" : ObjectId("57bfd9bbb59132e042324a82"),
    "state" : {
        "agent" : "hraglt"
    }
}
 
/* 5 */
{
    "_id" : ObjectId("57bef6a6e8bb8f5dd12364d0"),
    "state" : {
        "agent" : "hragst"
    }
}
 
/* 6 */
{
    "_id" : ObjectId("57bf0036e8bb8fc1d4236491"),
    "state" : {
        "agent" : "hrajct"
    }
}
 
/* 7 */
{
    "_id" : ObjectId("57bf0037e8bb8fc1d4236499"),
    "state" : {
        "agent" : "hrajct"
    }
}
 
/* 8 */
{
    "_id" : ObjectId("57bfd988b591327443057de3"),
    "state" : {
        "agent" : "hralcv"
    }
}
 
/* 9 */
{
    "_id" : ObjectId("57bfd9c5b591326d2607aab0"),
    "state" : {
        "agent" : "hralnl"
    }
}
 
/* 10 */
{
    "_id" : ObjectId("57bef6a8e8bb8f5dd12364fe"),
    "state" : {
        "agent" : "hramrg"
    }
}
 
/* 11 */
{
    "_id" : ObjectId("57bf0035e8bb8fc1d4236464"),
    "state" : {
        "agent" : "hrapcc"
    }
}
 
/* 12 */
{
    "_id" : ObjectId("57bf0036e8bb8fc1d423647c"),
    "state" : {
        "agent" : "hrapcc"
    }
}
 
/* 13 */
{
    "_id" : ObjectId("57bfda19b591327442557b1a"),
    "state" : {
        "agent" : "hrarml"
    }
}
 
/* 14 */
{
    "_id" : ObjectId("57bfd977b5913263e54c0cc9"),
    "state" : {
        "agent" : "hrasvt"
    }
}
 
/* 15 */
{
    "_id" : ObjectId("57bfda0db591325c691ba134"),
    "state" : {
        "agent" : "hrathm"
    }
}
 
/* 16 */
{
    "_id" : ObjectId("57bfd97fb591325c691ba0b4"),
    "state" : {
        "agent" : "hrazvi"
    }
}

Query with using an index:

db.getCollection('tasks.active').find({"state.value":"assigned"}, {"state.agent":1}).sort({"state.agent":1})
Results:
/* 1 */
{
    "_id" : ObjectId("57bef6a8e8bb8f5dd12364fb"),
    "state" : {
        "agent" : "hrabtj"
    }
}
 
/* 2 */
{
    "_id" : ObjectId("57bf0037e8bb8fc1d42364a6"),
    "state" : {
        "agent" : "hrafnv"
    }
}
 
/* 3 */
{
    "_id" : ObjectId("57bfd7d3b5913263e54c0c98"),
    "state" : {
        "agent" : "hrafrl"
    }
}
 
/* 4 */
{
    "_id" : ObjectId("57bfd9bbb59132e042324a82"),
    "state" : {
        "agent" : "hraglt"
    }
}
 
/* 5 */
{
    "_id" : ObjectId("57bef6a6e8bb8f5dd12364d0"),
    "state" : {
        "agent" : "hragst"
    }
}
 
/* 6 */
{
    "_id" : ObjectId("57bf0036e8bb8fc1d4236491"),
    "state" : {
        "agent" : "hrajct"
    }
}
 
/* 7 */
{
    "_id" : ObjectId("57bfd988b591327443057de3"),
    "state" : {
        "agent" : "hralcv"
    }
}
 
/* 8 */
{
    "_id" : ObjectId("57bfd9c5b591326d2607aab0"),
    "state" : {
        "agent" : "hralnl"
    }
}
 
/* 9 */
{
    "_id" : ObjectId("57bef6a8e8bb8f5dd12364fe"),
    "state" : {
        "agent" : "hramrg"
    }
}
 
/* 10 */
{
    "_id" : ObjectId("57bf0036e8bb8fc1d423647c"),
    "state" : {
        "agent" : "hrapcc"
    }
}
 
/* 11 */
{
    "_id" : ObjectId("57bfda19b591327442557b1a"),
    "state" : {
        "agent" : "hrarml"
    }
}
 
/* 12 */
{
    "_id" : ObjectId("57bfd977b5913263e54c0cc9"),
    "state" : {
        "agent" : "hrasvt"
    }
}
 
/* 13 */
{
    "_id" : ObjectId("57bfda0db591325c691ba134"),
    "state" : {
        "agent" : "hrathm"
    }
}
 
/* 14 */
{
    "_id" : ObjectId("57bfd97fb591325c691ba0b4"),
    "state" : {
        "agent" : "hrazvi"
    }
}

I have attached a mongodump of this collection at approximatelly the same time this queries were run (be aware that at the same time this colection was constantly updated by other applications).

Comment by Šimun Mikecin [ 25/Aug/16 ]

Since this was a production environment that had to be resolved ASAP: I had to replace this collection with another collection (renamed to the name of the old collection) and recreate indexes. We haven't found a problem since (this was done today).
So unfortunatelly I don't have a problematic collection anymore.

But I have previous results of querying an oplog.rs collections that will give you an idea of document schema and sample $set and $unset operations runned by application code (since documents contain real customer data I've had to replace those with XXX):

db.oplog.rs.find({'ts' : {$gte : Timestamp(1454939212, 1)}, "ns":"live.tasks.active", 'op' : 'u', 'o2._id' : ObjectId("57977bd4e8bb8f7a9f6ce3d4")})
 
/* 1 */
{
    "ts" : Timestamp(6322296562424741, 23),
    "t" : NumberLong(3),
    "h" : NumberLong(-4990470266453813779),
    "v" : 2,
    "op" : "u",
    "ns" : "live.tasks.active",
    "o2" : {
        "_id" : ObjectId("57977bd4e8bb8f7a9f6ce3d4")
    },
    "o" : {
        "$set" : {
            "state.agent" : "hrajct",
            "state.value" : "assigned",
            "doContact.0.assignedTo" : "hrajct",
            "state.modifiedAt" : ISODate("2016-08-24T09:39:38.163+02:00")
        }
    }
}
 
/* 2 */
{
    "ts" : Timestamp(6322296579604611, 49),
    "t" : NumberLong(3),
    "h" : NumberLong(-8043877984795584249),
    "v" : 2,
    "op" : "u",
    "ns" : "live.tasks.active",
    "o2" : {
        "_id" : ObjectId("57977bd4e8bb8f7a9f6ce3d4")
    },
    "o" : {
        "$set" : {
            "state.modifiedAt" : ISODate("2016-08-24T09:39:42.350+02:00"),
            "doContact" : [ 
                {
                    "type" : "mobile",
                    "dest" : "XXX",
                    "startTime" : "0830",
                    "endTime" : "1930"
                }, 
                {
                    "type" : "mobile",
                    "dest" : "XXX",
                    "startTime" : "0830",
                    "endTime" : "1930"
                }, 
                {
                    "type" : "fixed",
                    "dest" : "XXX",
                    "startTime" : "0830",
                    "endTime" : "1930"
                }, 
                {
                    "type" : "mobile",
                    "dest" : "XXX",
                    "startTime" : "0830",
                    "endTime" : "1930"
                }
            ],
            "contacted" : [ 
                {
                    "value" : {
                        "type" : "mobile",
                        "dest" : "XXX",
                        "startTime" : "0830",
                        "endTime" : "1930",
                        "assignedTo" : "hrajct"
                    },
                    "meta" : {
                        "paired" : false,
                        "status" : "A",
                        "at" : ISODate("2016-08-24T09:39:39.000+02:00"),
                        "sessionId" : ObjectId("57bd4f3b1ebfd0b31f3093b7"),
                        "assignmentIndex" : 0
                    }
                }
            ],
            "assignments.0.agent" : "hrajct",
            "assignments.0.startTime" : ISODate("2016-08-24T09:39:38.000+02:00"),
            "assignments.0.endTime" : ISODate("2016-08-24T09:39:42.000+02:00"),
            "assignments.0.nextContactTime" : ISODate("2016-08-24T09:49:42.000+02:00"),
            "assignments.0.priority" : 20,
            "assignments.0.action" : "",
            "state.value" : "waiting",
            "order.nextContactTime" : ISODate("2016-08-24T09:49:42.000+02:00")
        }
    }
}
 
/* 3 */
{
    "ts" : Timestamp(6322300445075177, 27),
    "t" : NumberLong(3),
    "h" : NumberLong(-9091646345984665981),
    "v" : 2,
    "op" : "u",
    "ns" : "live.tasks.active",
    "o2" : {
        "_id" : ObjectId("57977bd4e8bb8f7a9f6ce3d4")
    },
    "o" : {
        "$set" : {
            "state.value" : "assigned",
            "doContact.0.assignedTo" : "hrajct",
            "state.modifiedAt" : ISODate("2016-08-24T09:54:42.381+02:00")
        }
    }
}
 
/* 4 */
{
    "ts" : Timestamp(6322300449370144, 95),
    "t" : NumberLong(3),
    "h" : NumberLong(-1261390581929134473),
    "v" : 2,
    "op" : "u",
    "ns" : "live.tasks.active",
    "o2" : {
        "_id" : ObjectId("57977bd4e8bb8f7a9f6ce3d4")
    },
    "o" : {
        "$set" : {
            "state.modifiedAt" : ISODate("2016-08-24T09:54:43.901+02:00"),
            "state.value" : "waiting"
        },
        "$unset" : {
            "state.agent" : true,
            "doContact.0.assignedTo" : true
        }
    }
}
 
db.oplog.rs.find({'ts' : {$gte : Timestamp(1454939212, 1)}, "ns":"live.tasks.active",
$or: [{'o2._id' : ObjectId("57bd5217e8bb8ffe142364cf")},
      {'o._id' : ObjectId("57bd5217e8bb8ffe142364cf")}]})
 
/* 1 */
 
{
    "ts" : Timestamp(6322299714930737, 7),
    "t" : NumberLong(3),
    "h" : NumberLong(2008123626750443334),
    "v" : 2,
    "op" : "i",
    "ns" : "live.tasks.active",
    "o" : {
        "_id" : ObjectId("57bd5217e8bb8ffe142364cf"),
        "campaignId" : ObjectId("57bd42d7e8bb8f70b8236460"),
        "userDefined" : {
            "customerId" : "5ccc1c1d98f082d5a1528a91942f90af34cdb3d5",
            "customerName" : "XXX",
            "relations" : [ 
                {
                    "type" : "TOPIC_REL",
                    "own_model" : "Telephone_Model_Dialer_Queue",
                    "own_backend" : "Sql",
                    "own_degree" : "sibling",
                    "related_model" : "CPM_Model_Topic",
                    "related_backend" : "Sql",
                    "related_id" : "1764022"
                }
            ],
            "topicId" : "1764022"
        },
        "order" : {
            "priority" : 20,
            "nextContactTime" : ISODate("2016-08-24T09:51:51.000+02:00")
        },
        "doContact" : [ 
            {
                "type" : "mobile",
                "dest" : "XXX",
                "startTime" : "0830",
                "endTime" : "1930"
            }, 
            {
                "type" : "mobile",
                "dest" : "XXX",
                "startTime" : "0830",
                "endTime" : "1930"
            }, 
            {
                "type" : "mobile",
                "dest" : "XXX",
                "startTime" : "0830",
                "endTime" : "1930"
            }, 
            {
                "type" : "fixed",
                "dest" : "XXX",
                "startTime" : "0830",
                "endTime" : "1930"
            }, 
            {
                "type" : "mobile",
                "dest" : "XXX",
                "startTime" : "0830",
                "endTime" : "1930"
            }
        ],
        "assignments" : []
    }
}
 
/* 2 */
{
    "ts" : Timestamp(6322299714930737, 8),
    "t" : NumberLong(3),
    "h" : NumberLong(8093883448311754725),
    "v" : 2,
    "op" : "u",
    "ns" : "live.tasks.active",
    "o2" : {
        "_id" : ObjectId("57bd5217e8bb8ffe142364cf")
    },
    "o" : {
        "$set" : {
            "endTime" : ISODate("2017-12-31T01:00:00.000+02:00")
        }
    }
}
 
/* 3 */
{
    "ts" : Timestamp(6322299714930737, 27),
    "t" : NumberLong(3),
    "h" : NumberLong(-7093720264812335756),
    "v" : 2,
    "op" : "u",
    "ns" : "live.tasks.active",
    "o2" : {
        "_id" : ObjectId("57bd5217e8bb8ffe142364cf")
    },
    "o" : {
        "$set" : {
            "state.value" : "waiting",
            "state.modifiedAt" : ISODate("2016-08-24T09:51:52.443+02:00"),
            "insertedAt" : ISODate("2016-08-24T09:51:52.443+02:00")
        }
    }
}
 
/* 4 */
{
    "ts" : Timestamp(6322364590411743, 47),
    "t" : NumberLong(3),
    "h" : NumberLong(1934933769302960743),
    "v" : 2,
    "op" : "u",
    "ns" : "live.tasks.active",
    "o2" : {
        "_id" : ObjectId("57bd5217e8bb8ffe142364cf")
    },
    "o" : {
        "$set" : {
            "state.agent" : "hrasnm",
            "state.value" : "assigned",
            "doContact.0.assignedTo" : "hrasnm",
            "state.modifiedAt" : ISODate("2016-08-24T14:03:37.708+02:00")
        }
    }
}

Comment by Ramon Fernandez Marina [ 25/Aug/16 ]

simun.mikecin@me.com, is it possible for you to share code that reproduces the behavior you describe?

Can you send us a sample of the $set and $unset update operations? Also, can you upload a few sample problematic documents? For example, the one with _id: ObjectId("57bea0f21ebfd0b31f32da1c") and the output of db.tasks.active.find({"state.value":"assigned"}, {"state.agent": 1}) above.

Alternatively, if you can share your whole dbpath or a mongodump of the affected collection that may make things easier: I've created a safe, secure upload portal for that.

Thanks,
Ramón.

Comment by Šimun Mikecin [ 25/Aug/16 ]

Just for info: this is a Replica Set with only one server in the Replica Set (because we need oplog.rs).
I think this could be related to https://jira.mongodb.org/browse/SERVER-23688

Generated at Thu Feb 08 04:10:14 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.