[SERVER-48471] Hashed indexes may be incorrectly marked multikey and be ineligible as a shard key Created: 28/May/20  Updated: 29/Oct/23  Resolved: 08/Jan/21

Status: Closed
Project: Core Server
Component/s: Sharding, Storage
Affects Version/s: 4.2.0, 4.4.0
Fix Version/s: 4.9.0, 4.2.12, 4.4.4

Type: Bug Priority: Major - P3
Reporter: Gavin AIken Assignee: Louis Williams
Resolution: Fixed Votes: 5
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongod.log     File mongod.log - config server     Text File mongos.log     File mongos.log.gz    
Issue Links:
Backports
Depends
Related
related to SERVER-50792 Return more useful errors when a shar... Closed
is related to SERVER-49984 Skip side writes if both keys and mul... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4, v4.2
Sprint: Execution Team 2020-09-07, Execution Team 2021-01-11
Participants:
Case:

 Description   

If a hashed index build runs concurrently with any collection writes, the index may be incorrectly marked multikey. As a result, the index cannot be used as the shard key for a sharded collection.
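One way to check for the incorrect flag (an illustrative sketch, not part of the original ticket): the multikey state of an index is reported in the IXSCAN stage of an explain plan that uses it. The database, collection, and index names below are taken from the reporter's examples later in this ticket; the query value is arbitrary.

// Illustrative check: force a plan over the hashed index and inspect the plan stage.
use metric
db.diagnosticsmetrics.find({ item: "example" }).hint("item_hashed").explain()
// The IXSCAN stage under queryPlanner.winningPlan includes an "isMultiKey" field;
// "isMultiKey" : true on a hashed index indicates the incorrect flag described above.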

 

Original title: shardCollection fails with "couldn't find valid index for shard key" despite index existing

Original description:

I'm attempting to shard a number of existing collections. They were all created and populated in a standalone mongod which has since been converted to a sharded cluster. Some of the collections have sharded successfully; others have failed with the above error. A possible cause is that the indexes were still being built (with the background: true option) when the shardCollection command was first run, because the script used did not wait for the background creation to complete.
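For reference, here is a hedged sketch (not from the original report) of how such a script could wait for in-progress index builds before sharding, by polling currentOp for active createIndexes operations. The database and collection names are taken from the examples later in this ticket, and sufficient privileges for $currentOp with allUsers are assumed.

// Hypothetical helper: block until no createIndexes operations are reported, then shard.
function waitForIndexBuilds() {
    while (db.getSiblingDB("admin").aggregate([
               { $currentOp: { allUsers: true } },
               { $match: { "command.createIndexes": { $exists: true } } }
           ]).itcount() > 0) {
        sleep(1000);  // mongo shell helper: pause one second between polls
    }
}

waitForIndexBuilds();
sh.shardCollection("metric.diagnosticsmetrics", { item: "hashed" });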



 Comments   
Comment by Githook User [ 12/Jan/21 ]

Author:

{'name': 'Louis Williams', 'email': 'louis.williams@mongodb.com', 'username': 'louiswilliams'}

Message: SERVER-48471 Hashed indexes may be incorrectly marked multikey and be ineligible as a shard key

(cherry picked from commit 8fa255bc05cd99ff649bd0ef9407417b7e439b5e)
(cherry picked from commit 39f10d9d3bf7783593dbc7083d1afaec7369c4ca)
Branch: v4.2
https://github.com/mongodb/mongo/commit/858d7caf8b79fd43a2054f4960ab205208699b0f

Comment by Githook User [ 08/Jan/21 ]

Author:

{'name': 'Louis Williams', 'email': 'louis.williams@mongodb.com', 'username': 'louiswilliams'}

Message: SERVER-48471 Hashed indexes may be incorrectly marked multikey and be ineligible as a shard key

(cherry picked from commit 8fa255bc05cd99ff649bd0ef9407417b7e439b5e)
Branch: v4.4
https://github.com/mongodb/mongo/commit/39f10d9d3bf7783593dbc7083d1afaec7369c4ca

Comment by Githook User [ 08/Jan/21 ]

Author:

{'name': 'Louis Williams', 'email': 'louis.williams@mongodb.com', 'username': 'louiswilliams'}

Message: SERVER-48471 Hashed indexes may be incorrectly marked multikey and be ineligible as a shard key
Branch: master
https://github.com/mongodb/mongo/commit/8fa255bc05cd99ff649bd0ef9407417b7e439b5e

Comment by Louis Williams [ 05/Jan/21 ]

It appears as though hashed indexes (actually, all non-B-tree indexes) can be incorrectly marked multikey when they are not actually multikey. This is due to a bug in hybrid index builds and manifests when concurrent writes are received during an index build. Even if an index only has trivially-empty multikey paths originating from concurrent writes, we still mark the index as multikey. I reproduced this locally by inserting a document into a collection while building a hashed index.

In general, an index is allowed to be marked multikey even when it doesn't contain any multikey records. Indexes become multikey when the first multikey record is inserted, and once an index is multikey it stays that way forever. Hashed indexes, however, may never be multikey at any point.

Code detail:

After an index build completes, we set the multikey state based on whether multikey paths were recorded for intercepted documents. This is incorrect: we should only set the multikey state if those paths are also non-trivial. This bug only manifests because we have an optimization that bypasses updating multikey paths for regular (B-tree) indexes.
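For reference, a rough reproduction sketch along the lines described above (assumed steps, not the exact ones used); the collection name repro is illustrative:

// Session 1: seed enough documents that the hashed index build takes a while, then build it.
var docs = [];
for (var i = 0; i < 1000000; i++) {
    docs.push({ item: i });
    if (docs.length === 10000) {
        db.repro.insertMany(docs);
        docs = [];
    }
}
db.repro.createIndex({ item: "hashed" });

// Session 2: while session 1's createIndex() is still running, perform a concurrent write.
db.repro.insert({ item: "concurrent write" });

// On affected versions, the hashed index may now report "isMultiKey" : true in explain output,
// and sharding on { item: "hashed" } fails with "couldn't find valid index for shard key".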

Comment by Eric Milkie [ 08/Sep/20 ]

In order to diagnose this, we created SERVER-50792, which will provide more detail about why an appropriate index doesn't seem to exist for the to-be-sharded collection; we'll backport that change at least as far back as 4.2.

Comment by Žygimantas Stauga [ 31/Aug/20 ]

I just did an annual exercise to resync all shard replica set members. After that, I was able to shard a collection which previously got this error. It's something to work with until the fix is released.

Comment by Žygimantas Stauga [ 25/Aug/20 ]

We have the same issue. We were able to shard the collection in our development and testing environments, but not in production. We just upgraded from 4.2.x to 4.4.0 and unfortunately the issue is still there. After the upgrade we tried dropping and recreating the index, but that didn't help. Dropping the collection is not an option for us.

Several collections are affected.

Comment by Evgeny Ivanov [ 29/Jul/20 ]

Have the same error.

My steps:

  1. The collection product already exists and is populated. An index on SiteID ({ "SiteID" : 1 }, { "name" : "ix_SiteID" }) exists. SiteID is also included in the index ({ "ImageHash" : 1, "SiteID" : 1 }, { "name" : "ix_ImageHash_SiteID" }).
  2. Drop the index: db.product.dropIndex("ix_SiteID")
  3. Recreate the index: db.product.createIndex({ "SiteID" : "hashed" }, { "name" : "ix_SiteID_hashed" })
  4. Shard the collection: sh.shardCollection("myDB.product", { "SiteID" : "hashed" })

The above error occurred.

Comment by Sam Dufel [ 16/Jul/20 ]

Is there any chance this will get fixed this year? I have a number of large collections on my production cluster that need to be sharded, and this has been blocking me since the forced upgrade to 4.2.

Comment by Sam Dufel [ 30/Jun/20 ]

I have been having the same issue since upgrading to 4.2.

When I attempt to shard a collection, it shows the following in the mongos log:

2020-06-30T19:21:53.426+0000 I SH_REFR [ConfigServerCatalogCacheLoader-1446] Refresh for collection <namespace> took 0 ms and found the collection is not sharded

The shardCollection command fails with the following:

{
	"ok" : 0.0,
	"errmsg" : "couldn't find valid index for shard key",
	"code" : 96,
	"codeName" : "OperationFailed",
	"operationTime" : Timestamp(1593544913, 31),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1593544913, 31),
		"signature" : {
			"hash" : { "$binary" : "uoptv0Gw4rQlAGFu9mXmlawRrZs=", "$type" : "00" },
			"keyId" : NumberLong(6782612259353919519)
		}
	}
}

Comment by Carl Champain (Inactive) [ 22/Jun/20 ]

Hi gavin.aiken@netcuras.com,

We're passing this ticket along to the appropriate team for further investigation. Updates will be posted on this ticket as they happen.

Thank you,
Carl
 

Comment by Gavin AIken [ 18/Jun/20 ]

Some more information to add - I have about 9 different collections, with the same basic schema and indexes, which all failed to shard on the same key, with the same error.

 

This morning I recreated one of those collections using mongoexport to copy the data out, then I dropped the collection, recreated it using mongoimport, and rebuilt the indexes. After waiting for the indexes to finish building, I re-ran the shard command which was failing, and it succeeded.

 

So it seems that somehow the server has cached the fact that the collections in question did not have the correct index in place when the shard command was first run, and that dropping and recreating the collection resets that.

 

 

With this workaround I should be able to resolve the problem by performing the same export/drop/import routine on all the affected collections and move forwards.
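For reference, a sketch of that export/drop/import routine (assumed commands, not necessarily the exact ones used), using the database and collection names from this ticket; host names and file paths are illustrative:

// 1. From the OS shell, export the data:
//      mongoexport --host <mongos-host> --db metric --collection diagnosticsmetrics --out diagnosticsmetrics.json
// 2. In the mongo shell, drop the collection:
use metric
db.diagnosticsmetrics.drop()
// 3. From the OS shell, re-import the data:
//      mongoimport --host <mongos-host> --db metric --collection diagnosticsmetrics --file diagnosticsmetrics.json
// 4. Rebuild the indexes (the hashed one plus any others the collection had), wait for the
//    builds to complete, then shard:
db.diagnosticsmetrics.createIndex({ item: "hashed" })
sh.shardCollection("metric.diagnosticsmetrics", { item: "hashed" })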

 

I will keep one collection in the state it is in for any further testing you want me to perform in order to get to the bottom of this...

 

Comment by Gavin AIken [ 17/Jun/20 ]

OK no problem, have done that. These were the steps I took:

  • rotated the log files on the mongos, both shard mongods, and the config mongod.
  • connected to the mongos using the mongo shell, ran sh.status(), then dropped and re-created the index, ran the sh.shardCollection command (which failed with the same error), and then ran sh.status() again.
  • connected directly to the 2 shards, used the db in question, and ran db.diagnosticsmetrics.getIndexes()
  • grabbed the log files from the 4 servers as mentioned above

I have uploaded the 4 log files, and the output from the 3 different mongo shell sessions, to the secure upload portal as requested. Hopefully the file names are self-explanatory:

mongo shell output - mongos.txt
mongo shell output - shard - slc-stage-mongo11.txt
mongo shell output - shard - slc-stage-mongo12.txt
mongod.log-slc-stage-mongo11
mongod.log-slc-stage-mongo12
mongod.log-slc-stage-mongoc11
mongos.log

Comment by Carl Champain (Inactive) [ 16/Jun/20 ]

Hi gavin.aiken@netcuras.com,

To investigate a specific issue as a bug we would want to understand in detail what has happened:

  • You mentioned that the log files have rotated, so do you think you could reproduce the behavior and share the full mongod.log and mongos.log files?
  • Please share the output of sh.status()
  • Please connect directly to each shard and share the outputs of getIndexes()

We've created a secure upload portal for you. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Kind regards,
Carl
 

Comment by Gavin AIken [ 10/Jun/20 ]

Looks like I can't upload the full log files. I don't have the mongod.log covering the time when I first ran the sharding command; those log files have already been rotated and removed, so that's impossible. I do have the full mongos.log, but even gzipped it is too big to upload. I have removed some lines from it which were obviously irrelevant, particularly the ones for connections opening and closing such as those below, and uploaded the rest as mongos.log.gz:

 

2020-06-10T12:57:33.694-0600 I ACCESS [conn7837] Successfully authenticated as principal monitor on admin from client 172.28.16.20:60013
2020-06-10T12:57:33.738-0600 I NETWORK [conn7837] end connection 172.28.16.20:60013 (7 connections now open)
2020-06-10T12:57:34.686-0600 I NETWORK [listener] connection accepted from 172.28.16.20:60025 #7838 (8 connections now open)
2020-06-10T12:57:34.686-0600 I NETWORK [conn7838] received client metadata from 172.28.16.20:60025 conn7838: { driver: { name: "nodejs", version: "3.5.7" }, os: { type: "Linux", name: "linux", architecture: "x64", version: "3.10.0-327.3.1.el7.x86_64" }, platform: "Node.js v10.13.0, LE (legacy)" }

Comment by Gavin AIken [ 10/Jun/20 ]

I've just added a section of the mongod.log for the primary config server as well, in case that is useful.

Comment by Gavin AIken [ 10/Jun/20 ]

I've attached sections of the log files where I ran the sh.shardCollection command a few seconds after the start of the logs, and then included all the lines for a few minutes afterwards. Let me know if you need complete log files for the last few days instead and I can provide them ASAP.

Comment by Gavin AIken [ 10/Jun/20 ]

Do you want the whole files for the last few days, or just an extract from the files when I run the above commands?

 

I tried watching the logs when I run the above and the only relevant line I see is this:

 

2020-06-10T12:50:20.352-0600 I SH_REFR [ConfigServerCatalogCacheLoader-3273] Refresh for collection metric.diagnosticsmetrics took 1 ms and found the collection is not sharded

Comment by Carl Champain (Inactive) [ 09/Jun/20 ]

Hi gavin.aiken@netcuras.com,

To help us investigate what's happening, can you please provide the mongod.log and mongos.log files covering this behavior?

Thank you,
Carl

 

Comment by Gavin AIken [ 07/Jun/20 ]

It is definitely not a typo. The script worked successfully for about half of the collections with the same schema, and failed on others. However, for what it's worth, here is the output from the manual commands showing that it fails:

mongos> use metric
switched to db metric
 
mongos> db.diagnosticsmetrics.createIndex({ item: 'hashed' })
{
	"raw" : {
		"slc-stage-mongo11:27017" : {
			"numIndexesBefore" : 4,
			"numIndexesAfter" : 4,
			"note" : "all indexes already exist",
			"ok" : 1
		}
	},
	"ok" : 1,
	"operationTime" : Timestamp(1591549268, 1),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1591549268, 1),
		"signature" : {
			"hash" : BinData(0,"x8qUBOA9PgeFqQUSGBOOUbGn1aA="),
			"keyId" : NumberLong("6829344041860071455")
		}
	}
}
 
mongos> sh.shardCollection('metric.diagnosticsmetrics', { item: 'hashed' })
{
	"ok" : 0,
	"errmsg" : "couldn't find valid index for shard key",
	"code" : 96,
	"codeName" : "OperationFailed",
	"operationTime" : Timestamp(1591549304, 4),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1591549304, 4),
		"signature" : {
			"hash" : BinData(0,"Gd/VpV/fzSc0qF+t1TmxbGlVt+E="),
			"keyId" : NumberLong("6829344041860071455")
		}
	}
}

 

Comment by Dmitry Agranat [ 07/Jun/20 ]

Hi gavin.aiken@netcuras.com, thank you for the report.

Could you reproduce this issue manually (w/o using your script) and post the results here? In addition, please post all commands including index creation and sh.shardCollection. I suspect there is indeed some issue with the script or a typo.

Thanks,
Dima

Comment by Gavin AIken [ 30/May/20 ]

I tried dropping and recreating the hashed index on "item" on the "diagnosticsmetrics" collection, waiting for the index build to complete before trying to shard the collection again, but I still get the same error, "couldn't find valid index for shard key".

I have checked the log on the mongos, both mongod, and the config server mongod, and see no relevant errors when I run the command.

Comment by Gavin AIken [ 28/May/20 ]

Sorry, I accidentally created this issue before I had finished filling in all the fields and there doesn't seem to be any way for me to edit it now?

 

Here's an example of a collection which is exhibiting the problem:

 

mongos> use metric
switched to db metric
mongos> db.diagnosticsmetrics.getIndices()
[
	{
		"v" : 2,
		"key" : {
			"_id" : 1
		},
		"name" : "_id_",
		"ns" : "metric.diagnosticsmetrics"
	},
	{
		"v" : 2,
		"key" : {
			"company" : 1,
			"timeMs" : 1
		},
		"name" : "company_1_timeMs_1",
		"ns" : "metric.diagnosticsmetrics",
		"background" : true
	},
	{
		"v" : 2,
		"unique" : true,
		"key" : {
			"item" : 1,
			"timeMs" : 1
		},
		"name" : "item_1_timeMs_1",
		"ns" : "metric.diagnosticsmetrics",
		"background" : true
	},
	{
		"v" : 2,
		"key" : {
			"item" : "hashed"
		},
		"name" : "item_hashed",
		"ns" : "metric.diagnosticsmetrics",
		"background" : true
	}
]
mongos> sh.shardCollection('metric.diagnosticsmetrics', { item: 'hashed' })
{
	"ok" : 0,
	"errmsg" : "couldn't find valid index for shard key",
	"code" : 96,
	"codeName" : "OperationFailed",
	"operationTime" : Timestamp(1591705370, 5),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1591705370, 5),
		"signature" : {
			"hash" : BinData(0,"aBxREKsRZ8JLyK5PCYJ05KVWTWw="),
			"keyId" : NumberLong("6829344041860071455")
		}
	}
}

 

This is on version 4.2.6 on RedHat Linux.
