[SERVER-60041] different UUID in config.system.sessions causes failed chunk moves Created: 17/Sep/21 Updated: 01/May/23 Resolved: 01/May/23
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.0.27 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kay Agahd | Assignee: | Randolph Tan |
| Resolution: | Incomplete | Votes: | 1 |
| Labels: | sharding, sharding-docs, sharding-nyc-subteam1 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: | |
| Operating System: | ALL |
| Sprint: | Sharding 2022-02-21, Sharding 2022-03-07, Sharding NYC 2022-03-21, Sharding NYC 2022-04-04, Sharding NYC 2022-04-18, Sharding 2022-05-02, Sharding NYC 2022-05-16, Sharding NYC 2022-05-30, Sharding NYC 2022-06-13, Sharding 2022-06-27, Sharding 2022-07-11, Sharding 2022-07-25, Sharding 2022-08-08, Sharding 2022-08-22, Sharding 2022-09-05, Sharding 2022-09-19, Sharding 2022-10-03, Sharding 2022-10-17, Sharding NYC 2022-10-31, Sharding NYC 2022-11-14, Sharding NYC 2022-11-28, Sharding 2022-12-12, Sharding NYC 2022-12-26, Sharding NYC 2023-01-09, Sharding NYC 2023-01-23, Sharding NYC 2023-02-06, Sharding NYC 2023-02-20, Sharding NYC 2023-03-06, Sharding NYC 2023-03-20, Sharding NYC 2023-04-03, Sharding NYC 2023-04-17, Sharding NYC 2023-05-01, Sharding NYC 2023-05-15 |
| Participants: | |
| Story Points: | 5 |
| Description |
We are running a 3-shard cluster on the latest MongoDB v4 (v4.0.27). In the logs of one shard we constantly see that chunks from the config.system.sessions collection can't be moved due to differing collection UUIDs:
The log suggests dropping the collection, but MongoDB's documentation clearly states not to drop collections from the config database. We cannot even execute a find on this collection, because the user is not authorized:
However, the user has the following roles:
Obviously, the log's suggestion to drop the collection does not work, and it also contradicts MongoDB's documentation. |
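For illustration only (a hypothetical sketch, not taken from the report; run via mongos as the reporter's user), the kind of read that fails with an authorization error, and a way to inspect which roles the connection actually has:

```js
// The read that fails with "not authorized" when run as the cluster user:
db.getSiblingDB("config").system.sessions.findOne();

// Inspect the roles and privileges the authenticated user actually has:
db.runCommand({ connectionStatus: 1, showPrivileges: true });
```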
| Comments |
| Comment by Randolph Tan [ 11/Feb/22 ] |
It is not advisable to touch config.system.sessions, but this case is extraordinary, and this may be the only way to get out of the situation. I suspect the reason the user you were using to touch the config collection is getting authorization errors is that the user is configured for the cluster. Such a user only works when connected directly to mongos. If you are connecting directly to the shard, you will need a different user to be able to perform privileged tasks. If you're using keyFile authentication, you can check this doc that explains how to add local users to the shard you're trying to connect to. If you're using x509, you can check this doc explaining how to connect to the shard with authentication. Once you have confirmed that you can perform operations on config.system.sessions, you can perform the steps below.
Note that this procedure can potentially cause active sessions to expire. It can also cause session-related errors until the config server recreates config.system.sessions. |
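A minimal sketch of that kind of shard-local remediation, assuming a keyFile cluster and assuming the remediation is to drop the shard's stale copy of the collection; the user name, password, and the decision to drop are placeholders/assumptions, not the comment's exact steps:

```js
// 1. If the shard has no shard-local users yet, create one via the
//    localhost exception, connected directly to the shard's primary:
//      use admin
//      db.createUser({ user: "shardAdmin", pwd: "<secret>", roles: ["root"] })

// 2. Reconnect to the shard's primary as that shard-local user and verify
//    that this shard's local UUID really differs from the other shards:
const cfg = db.getSiblingDB("config");
printjson(cfg.getCollectionInfos({ name: "system.sessions" })[0].info.uuid);

// 3. Drop the stale local copy; per the comment above, the config server
//    will recreate config.system.sessions (expect transient session-related
//    errors until it does):
cfg.system.sessions.drop();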
| Comment by Edwin Zhou [ 08/Oct/21 ] |
Thank you for your patience as I further investigate this issue, and for following up with your discussion on our community forums. I agree that MongoDB should not be able to reach a configuration where the config.system.sessions collection has inconsistent UUIDs across shards. I'm passing this on to the appropriate team to investigate steps to remediate the issue. |
| Comment by Konstantin Manchev [ 07/Oct/21 ] |
This happened to me as well, after a sharded cluster upgrade from 3.6.10 to 4.0.16 ... |
| Comment by Kay Agahd [ 07/Oct/21 ] |
It looks like other users have encountered this issue as well. To overcome it, they simply dropped the collection, even though that contradicts MongoDB's documentation. See: https://www.mongodb.com/community/forums/t/cannot-create-collection-config-system-sessions-because-we-already-have-an-identic/125229 |
| Comment by Álvaro López López [ 07/Oct/21 ] |
Hi Edwin and Kay, I agree with Kay that this seems to be a bug present in MongoDB 4.0.x or earlier versions. We have exactly the same issue, and our operations were carried out following MongoDB's recommendations. I suppose we need a way to recreate the collection in the affected shards with the proper UUID, in a "clean" way verified by MongoDB, since the mismatched copies seem to be "previous incarnations" created at some uncertain point in time. Unfortunately, we haven't been able to reproduce it in a test cluster either. Edwin, could you check on the MongoDB side how this can be done? Thanks in advance, Álvaro. |
| Comment by Kay Agahd [ 07/Oct/21 ] |
Hi edwin.zhou, we do believe this is a MongoDB bug, as a database system should never end up in such a state, even if a user made an unintentional mistake. As I already wrote, we have always followed MongoDB's documentation and are therefore not aware of any misconfiguration caused by our actions. The database system itself should prevent getting into such a state. Also, the log message contradicts MongoDB's documentation; at the very least, that should be fixed. However, as you suggested, I posted the same issue on the MongoDB Developer Community Forums. |
| Comment by Edwin Zhou [ 06/Oct/21 ] |
For that question we'd like to encourage you to start by asking our community for help by posting on the MongoDB Developer Community Forums. If the discussion there leads you to suspect a bug in the MongoDB server, then we'd want to investigate it as a possible bug here in the SERVER project. Best, |
| Comment by Kay Agahd [ 06/Oct/21 ] |
Hi edwin.zhou, the cluster has been running since 2013. During this time it has been migrated to newer servers multiple times, with some shards added or removed from time to time. MongoDB and OS versions have also been upgraded continuously. Obviously, something went wrong during these changes, even though we believe we always followed the procedures recommended and documented by MongoDB. We understand that you want to reproduce the problem in order to fix the root cause, but unfortunately we don't know what might have caused the differing collection UUIDs among the shards. So the question is rather: how do we get out of this misconfiguration? Thanks for your help! |
| Comment by Edwin Zhou [ 06/Oct/21 ] |
Hi kay.agahd@idealo.de and aloplop85@gmail.com, thanks for your report and description of this behavior. I haven't been able to reproduce it given the information you've provided so far. When adding a new shard to a cluster, MongoDB drops the new shard's system.sessions collection during the add process. I suspect there may have been a misconfiguration during the add-shard process. Can you provide the steps that occurred prior to observing this behavior? Best, |
| Comment by Álvaro López López [ 23/Sep/21 ] |
I confirm that it also occurs in the 4.2.14 Community version, after upgrading a sharded cluster with two shards from 4.0.23 Community, and it is not possible to delete documents in that collection. The following trace appears almost every 10 seconds:
Then, it is possible to check the differing UUIDs in both shards for "ns" : "config.system.sessions":
However, in the config DB:
|
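For reference, a minimal sketch of the kind of UUID check described above, assuming direct connections to each shard's primary and to the config server:

```js
// On each shard's primary: the shard's local catalog UUID for the collection.
db.getSiblingDB("config")
  .getCollectionInfos({ name: "system.sessions" })[0].info.uuid;

// On the config server (or via mongos): the UUID recorded in the sharding
// metadata for "config.system.sessions".
db.getSiblingDB("config")
  .collections.findOne({ _id: "config.system.sessions" }).uuid;

// A shard whose local UUID does not match the metadata UUID is the one
// whose chunk migrations fail as described in this ticket.
```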