[SERVER-48471] Hashed indexes may be incorrectly marked multikey and be ineligible as a shard key Created: 28/May/20 Updated: 29/Oct/23 Resolved: 08/Jan/21
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding, Storage |
| Affects Version/s: | 4.2.0, 4.4.0 |
| Fix Version/s: | 4.9.0, 4.2.12, 4.4.4 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Gavin AIken | Assignee: | Louis Williams |
| Resolution: | Fixed | Votes: | 5 |
| Labels: | sharding-wfbf-day |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.4, v4.2 |
| Sprint: | Execution Team 2020-09-07, Execution Team 2021-01-11 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |

If an index build on a hashed index runs concurrently with any collection writes, the index may be incorrectly marked multikey. As a result, that index cannot be used as the shard key of a sharded collection.

Original title: shardCollection fails with "couldn't find valid index for shard key" despite index existing

Original description: I'm attempting to shard a number of existing collections. They were all created and populated in a standalone mongod which has now been converted to a sharded cluster. Some of the collections have sharded successfully; others have failed with the above error. A possible cause is that the indexes were still being built (with the background: true option) when the shardCollection command was first run, because the script used did not wait for the background creation to complete.
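For illustration, a minimal mongo shell sketch of the intended sequence: build the hashed index, confirm no index build is still in progress, and only then shard. The metric.diagnosticsmetrics namespace and the item field are taken from later comments on this ticket; everything else is an assumed example, not the reporter's actual script.

{code:javascript}
// Sketch only: "metric.diagnosticsmetrics" and "item" come from later comments
// on this ticket; the rest is an assumed example.
const coll = db.getSiblingDB("metric").diagnosticsmetrics;

sh.enableSharding("metric");

// Build the hashed index that will back the shard key.
coll.createIndex({ item: "hashed" }, { background: true });

// If the index is created elsewhere (e.g. by a script), confirm that no index
// build is still running and that the index is present before sharding.
db.adminCommand({
    currentOp: true,
    $or: [
        { op: "command", "command.createIndexes": { $exists: true } },
        { op: "none", msg: /^Index Build/ }
    ]
});
coll.getIndexes();

// Only shard once the hashed index build has finished.
sh.shardCollection("metric.diagnosticsmetrics", { item: "hashed" });
{code}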
| Comments |
| Comment by Githook User [ 12/Jan/21 ] |

Author: {'name': 'Louis Williams', 'email': 'louis.williams@mongodb.com', 'username': 'louiswilliams'}
Message: (cherry picked from commit 8fa255bc05cd99ff649bd0ef9407417b7e439b5e)
| Comment by Githook User [ 08/Jan/21 ] |

Author: {'name': 'Louis Williams', 'email': 'louis.williams@mongodb.com', 'username': 'louiswilliams'}
Message: (cherry picked from commit 8fa255bc05cd99ff649bd0ef9407417b7e439b5e)
| Comment by Githook User [ 08/Jan/21 ] |

Author: {'name': 'Louis Williams', 'email': 'louis.williams@mongodb.com', 'username': 'louiswilliams'}
Message:
| Comment by Louis Williams [ 05/Jan/21 ] |

It appears that hashed indexes (in fact, all non-btree indexes) can be incorrectly marked multikey when they are not actually multikey. This is due to a bug in hybrid index builds and manifests when concurrent writes are received during an index build. Even if the multikey paths recorded from concurrent writes are trivially empty, we still mark the index as multikey. I reproduced this locally by inserting a document into a collection while building a hashed index.

In general, it is acceptable for an index to be flagged as multikey even when it doesn't contain any multikey records. Indexes become multikey when the first multikey record is inserted, and once an index is multikey it stays that way forever. Hashed indexes, however, may never be multikey at any point.

Code detail: after an index build is complete, we set the multikey state based on whether multikey paths were recorded on intercepted documents. This is incorrect: we should only set the multikey state if those paths are also non-trivial. This bug only manifests because we make an optimization to bypass updating multikey paths for regular b-tree indexes.
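A rough mongo shell sketch of that reproduction, offered as illustration only: the reproTest database, collection name, and document counts are made up, and the concurrent insert is assumed to land while the index build started in the first session is still running.

{code:javascript}
// Illustrative only: database, collection, and field names are hypothetical.
const coll = db.getSiblingDB("reproTest").hashedMultikey;

// Seed enough documents that the hashed index build takes a noticeable time.
let batch = [];
for (let i = 0; i < 1000000; i++) {
    batch.push({ item: i });
    if (batch.length === 10000) {
        coll.insertMany(batch);
        batch = [];
    }
}

// Shell session 1: start the hashed index build.
coll.createIndex({ item: "hashed" });

// Shell session 2, while the build above is still running: the concurrent write
// is intercepted by the hybrid index build and, before the fix, flips the
// index's multikey flag even though no array values are involved.
coll.insertOne({ item: 1000001 });

// Sharding on the now incorrectly-multikey hashed index then fails with
// "couldn't find valid index for shard key".
sh.enableSharding("reproTest");
sh.shardCollection("reproTest.hashedMultikey", { item: "hashed" });
{code}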
| Comment by Eric Milkie [ 08/Sep/20 ] |

In order to diagnose this, we created
| Comment by Žygimantas Stauga [ 31/Aug/20 ] |

I just did an annual exercise of resyncing all shard replica set members, and now I was able to shard a collection that previously returned this error. It's something to work with until the fix is released.
| Comment by Žygimantas Stauga [ 25/Aug/20 ] |

We have the same issue. We were able to shard the collection in our development and testing environments, but not in production. We just upgraded from 4.2.x to 4.4.0 and unfortunately the issue is still there. After the upgrade we tried dropping and recreating the index, but that didn't help. Dropping the collection is not an option for us. Several collections are affected.
| Comment by Evgeny Ivanov [ 29/Jul/20 ] |

I have the same error. My steps:
The above error occurred.
| Comment by Sam Dufel [ 16/Jul/20 ] |

Is there any chance this will get fixed this year? I have a number of large collections on my production cluster that need to be sharded, and this has been blocking me since the forced upgrade to 4.2.
| Comment by Sam Dufel [ 30/Jun/20 ] |

I have been having the same issue since upgrading to 4.2. When I attempt to shard a collection, it shows the following in the mongos log:

2020-06-30T19:21:53.426+0000 I SH_REFR [ConfigServerCatalogCacheLoader-1446] Refresh for collection <namespace> took 0 ms and found the collection is not sharded

The shardCollection command fails with the following:
| Comment by Carl Champain (Inactive) [ 22/Jun/20 ] |

We're passing this ticket along to the appropriate team for further investigation. Updates will be posted on this ticket as they happen. Thank you,
| Comment by Gavin AIken [ 18/Jun/20 ] |

Some more information to add: I have about 9 different collections, with the same basic schema and indexes, which all failed to shard on the same key with the same error.

This morning I recreated one of those collections: I used mongoexport to copy the data out, dropped the collection, recreated it using mongoimport, and rebuilt the indexes. After waiting for the indexes to finish building, I re-ran the shard command which had been failing, and it succeeded.

So it seems that the server has somehow cached the fact that the collections in question did not have the correct index in place when the shard command was first run, and that dropping and recreating the collection resets that.

With this workaround I should be able to resolve the problem by performing the same export/drop/import routine on all the affected collections and move forward.

I will keep one collection in its current state for any further testing you want me to perform in order to get to the bottom of this...
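A sketch of the mongo shell side of that export/drop/import routine; the mongoexport and mongoimport steps themselves run from the operating-system shell and are not shown, and the namespace below reuses the diagnosticsmetrics example from other comments rather than the reporter's actual collections.

{code:javascript}
const coll = db.getSiblingDB("metric").diagnosticsmetrics;

// 1. Export the data with mongoexport (from the OS shell), then drop the
//    collection; dropping also discards the indexes and their multikey flags.
coll.drop();

// 2. Re-import the data with mongoimport (from the OS shell).

// 3. Rebuild the indexes and wait for the builds to complete.
coll.createIndex({ item: "hashed" });

// 4. Re-run the previously failing shardCollection.
sh.shardCollection("metric.diagnosticsmetrics", { item: "hashed" });
{code}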
| Comment by Gavin AIken [ 17/Jun/20 ] |

OK, no problem, I have done that. These were the steps I took:

I have uploaded the 4 log files, and the output from the 3 different mongo shell sessions, to the secure upload portal as requested. Hopefully the file names are self-explanatory:
| Comment by Carl Champain (Inactive) [ 16/Jun/20 ] |

To investigate a specific issue as a bug we would want to understand in detail what has happened:

We've created a secure upload portal for you. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time. Kind regards,
| Comment by Gavin AIken [ 10/Jun/20 ] |

It looks like I can't upload the full log files. I don't have the mongod.log for the time when I first ran the sharding command; those log files have already been rotated and removed, so that's impossible. I do have the full mongos.log, but even gzipped it is too big to upload. I have removed some lines from it which were obviously irrelevant, particularly the ones for connections opening and closing such as these below, and uploaded the rest as mongos.log.gz:
| Comment by Gavin AIken [ 10/Jun/20 ] |

I've just added a section of the mongod.log for the primary config server as well, in case that is useful.
| Comment by Gavin AIken [ 10/Jun/20 ] |

I've attached sections of the log files where I ran the sh.shardCollection command a few seconds after the start of the logs, and then included all the lines for a few minutes afterwards. Let me know if you need complete log files for the last few days instead and I can provide them ASAP.
| Comment by Gavin AIken [ 10/Jun/20 ] |

Do you want the whole files for the last few days, or just an extract covering when I run the above commands?

I tried watching the logs while running the above, and the only relevant line I see is this:

2020-06-10T12:50:20.352-0600 I SH_REFR [ConfigServerCatalogCacheLoader-3273] Refresh for collection metric.diagnosticsmetrics took 1 ms and found the collection is not sharded
| Comment by Carl Champain (Inactive) [ 09/Jun/20 ] |

To help us investigate what's happening, can you please provide the mongod.log and mongos.log files covering this behavior? Thank you,
| Comment by Gavin AIken [ 07/Jun/20 ] |

It is definitely not a typo. The script worked successfully for about half of the collections with the same schema, and failed on others. However, for what it's worth, here is the output from the manual commands showing that it fails:
| Comment by Dmitry Agranat [ 07/Jun/20 ] |

Hi gavin.aiken@netcuras.com, thank you for the report. Could you reproduce this issue manually (without using your script) and post the results here? In addition, please post all commands, including index creation and sh.shardCollection. I suspect there is indeed some issue with the script or a typo. Thanks,
| Comment by Gavin AIken [ 30/May/20 ] |

I tried dropping and recreating the hashed index on "item" on the "diagnosticsmetrics" collection, waiting for the index build to complete before trying to shard the collection again, but I still get the same error, "couldn't find valid index for shard key". I have checked the logs on the mongos, both mongods, and the config server mongod, and see no relevant errors when I run the command.
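For reference, that drop-and-recreate attempt corresponds to shell commands along these lines (a sketch reusing the names from this comment). Per the analysis in the 05/Jan/21 comment above, a rebuilt index can end up flagged multikey again if writes arrive while the rebuild is in progress, which would explain why this did not help.

{code:javascript}
const coll = db.getSiblingDB("metric").diagnosticsmetrics;

// Drop and rebuild the hashed index backing the intended shard key.
coll.dropIndex({ item: "hashed" });
coll.createIndex({ item: "hashed" });

// Retry sharding once the rebuild has completed. On affected versions this can
// still fail if concurrent writes arrived during the rebuild, because the
// rebuilt index may again be flagged multikey.
sh.shardCollection("metric.diagnosticsmetrics", { item: "hashed" });
{code}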
| Comment by Gavin AIken [ 28/May/20 ] |

Sorry, I accidentally created this issue before I had finished filling in all the fields, and there doesn't seem to be any way for me to edit it now.

Here's an example of a collection which is exhibiting the problem:

This is on version 4.2.6 on RedHat Linux.