ISSUE SUMMARY
Creating a hashed index on a field imposes the constraint that this field cannot contain an array. This constraint is not correctly enforced in some versions of the server if the hashed index is against a dotted field, and the array is present mid-path in the to-be-indexed document.
For example, a hashed index {"a.b": "hashed"} would incorrectly index documents having an array at "a", instead of throwing an error and rejecting the write operation. Hashed indexes are typically only used to support a shard key, and validation on mongos prevents these invalid documents from being inserted or created via an update. But there are still plausible cases in which corruption of the hashed index may have occurred:
- If a pre-existing collection is sharded with a hashed shard key against a dotted path (e.g. {"a.b": "hashed"} in the example above), a hashed index must be created on the collection. If the collection already contains documents which violate the array constraint, then those documents will be indexed incorrectly. Validation on mongos will not be able to detect this, though future attempts to insert invalid documents will be rejected.
- If the user created the hashed index on a single replica set, or bypassed mongos and wrote documents directly to a mongod in a sharded cluster, then any documents of the invalid form outlined above may be silently mis-indexed.
Users running on a sharded cluster who created their hashed index on an empty collection and who have not bypassed mongos to write documents directly to a shard will not be affected by this issue.
USER IMPACT
Documents containing an array at any component of the index key path other than the terminal one will be indexed incorrectly, which can lead to missing query results.
RECOVERY STEPS
Users can determine if their hashed indexes have been corrupted by this issue by running the validate command on the corresponding collection, after upgrading to a minor version that contains the fix.
To address the existing corruption, users will need to either delete all the illegal documents or update them such that the resulting documents no longer have an array at any point along the index path. Users can find documents which may have an illegal array using a {$type: 'array'} predicate. The documents identified by the {$type: 'array'} query should then be deleted or updated by _id.
Note that users can only update a shard key value on version 4.2. For 4.0 and older versions, users will have to delete the documents. Following deletion, the documents may be reformatted to eliminate the illegal array paths and then re-inserted.
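The {$type: 'array'} search described above can be sketched as a small helper that builds one predicate per non-terminal prefix of the indexed path. This is an illustrative sketch only: the function name buildMidPathArrayFilter is hypothetical and not part of the server or shell, and it assumes 3.6-or-later semantics in which {$type: 'array'} matches a field that is itself an array.

```javascript
// Hypothetical helper (not a MongoDB API): builds a filter matching documents
// that have an array at any non-terminal component of a dotted index path.
function buildMidPathArrayFilter(indexPath) {
  const parts = indexPath.split(".");
  const clauses = [];
  // Stop before the last component: this issue concerns arrays mid-path,
  // not at the terminal path component.
  for (let i = 1; i < parts.length; i++) {
    const prefix = parts.slice(0, i).join(".");
    clauses.push({ [prefix]: { $type: "array" } });
  }
  // A path without dots has no mid-path component to check.
  return clauses.length === 0 ? null : { $or: clauses };
}

// Assumed usage in the mongo shell, collecting _id values for later
// delete-or-update by _id:
//   db.c.find(buildMidPathArrayFilter("a.b"), { _id: 1 })
```

For the index {"a.b": "hashed"} the helper produces {$or: [{a: {$type: "array"}}]}; for a deeper path such as "a.b.c" it also checks "a.b".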
AFFECTED VERSIONS
This issue affects all supported versions prior to 4.2.2, 4.0.14, 3.6.16, and 3.4.24.
FIX VERSION
The fix will be included in 4.2.2, 4.0.14, 3.6.16 and 3.4.24.
Original Description
Creating a hashed index on a field imposes the constraint that this field cannot contain an array:
> db.c.drop()
> db.c.createIndex({a: "hashed"})
> db.c.insert({a: [1]})
WriteResult({ "nInserted" : 0, "writeError" : { "code" : 16766, "errmsg" : "Error: hashed indexes do not currently support array values" } })
This constraint is not correctly enforced if the hashed index is against a dotted field, and the array is present mid-path in the to-be-indexed document:
> db.c.drop()
> db.c.createIndex({"a.b": "hashed"})
> db.c.insert({a: [{b: 1}]})
WriteResult({ "nInserted" : 1 }) // Instead of succeeding, this should result in an error!
The key generation implementation calls dotted_path_support::extractElementAtPath(), which returns an empty BSONElement if there is an array along the path. In downstream code, this empty BSONElement causes us to insert a null key into the index. The result is a corrupt index that can lead to missing query results:
> db.c.find({"a.b": 1})
// This query should return the document, but it returns nothing!
> db.c.dropIndexes()
{ "nIndexesWas" : 2, "msg" : "non-_id indexes dropped for collection", "ok" : 1 }
> db.c.find({"a.b": 1})
{ "_id" : ObjectId("5da76ea3365c4b34d3b15c76"), "a" : [ { "b" : 1 } ] }
Note that we get the correct query result only after dropping the corrupt index.
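The key-generation behavior described above can be illustrated with a toy model in plain JavaScript. This is not the server's C++ code: the functions toyExtractAtPath and toyIndexKey are hypothetical, plain objects stand in for BSON, hashing is skipped, and the model captures only what the description states, namely that an array along the path yields an empty element, which is then indexed as null.

```javascript
// Toy model only: mimics the described behavior of path extraction, where an
// array encountered while descending the path produces "nothing".
function toyExtractAtPath(doc, path) {
  let current = doc;
  for (const part of path.split(".")) {
    // Cannot descend into an array or a non-object value.
    if (current === null || typeof current !== "object" || Array.isArray(current)) {
      return undefined;
    }
    current = current[part];
  }
  // A terminal array is returned as-is; the description says that case is
  // already rejected correctly, so it is not modeled here.
  return current;
}

// Toy key generation: an empty extraction result becomes a null key
// (the real index would store a hash; hashing is omitted for clarity).
function toyIndexKey(doc, path) {
  const element = toyExtractAtPath(doc, path);
  return element === undefined ? null : element;
}

const good = { a: { b: 1 } };
const bad = { a: [{ b: 1 }] };
console.log(toyIndexKey(good, "a.b")); // 1
console.log(toyIndexKey(bad, "a.b"));  // null
```

In this model, an index lookup for the value 1 would never reach the second document, because its stored key is null rather than a key derived from 1. That mirrors the missing result from db.c.find({"a.b": 1}) shown above.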
Although this is both an index corruption and a query correctness issue, the issue cannot be encountered when the hashed index is supporting the shard key – shard key fields cannot be arrays. The primary use case for hashed indexes is hashed sharding, so this may be an uncommon issue for hashed indexes that exist in the wild.
I have only tested 4.2.0 and a recent version of master, but I suspect that this bug affects all stable versions. The incorrect key generation code has not been substantially altered recently.
Related to: SERVER-44571 Documents involved in SERVER-44050 corruption scenario cannot be updated or deleted after upgrade (Closed)