[SERVER-22462] Autosplitting failure caused by stale config in runCommand Created: 04/Feb/16  Updated: 06/Dec/22  Resolved: 28/Jul/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.2.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Goffert van Gool Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Duplicate Votes: 6
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File repro.js    
Issue Links:
Duplicate
duplicates SERVER-28418 make the split command on mongod retu... Closed
is duplicated by SERVER-23500 could not autosplit collection :: cau... Closed
Related
related to SERVER-24148 splitVector should check if given chu... Closed
Assigned Teams:
Sharding
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:
Case:

 Description   

We are running multiple sharded mongo clusters, and recently one of our clusters started having an autosplitting issue.

Our mongos processes have been logging the following messages:

I SHARDING [conn14835] sharded connection to shard1/mongo-blob-1:27017,mongo-blob-2:27017 not being returned to the pool
W SHARDING [conn14835] could not autosplit collection database_name.collection_name :: caused by :: 9996 stale config in runCommand ( ns : database_name.collection_name, received : 2|4||56b053c081c73af0480d60fe, wanted : 2|7||56b053c081c73af0480d60fe, recv )

These messages always appear together and seem related. Only one of our clusters is affected. The warning appears for several databases and collections, while for others autosplitting seems to remain functional.

I have tried restarting each mongod and mongos process in this specific cluster, but nothing changed. I cannot find any issues with the config servers for this cluster either. We have a replicated config server setup (the 3.2 default).

Any advice on how to proceed? I assume this issue indicates that something is wrong with my config cluster. Are there any diagnostic commands available to check config cluster health? I would prefer not to resync my config cluster, as that would cause downtime for my service. Could simply restarting the config servers be sufficient?
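(For reference, two standard checks here would be rs.status(), run against the config server replica set to report member health, and sh.status(), run on a mongos to summarize the cluster metadata. A minimal shell session, with hypothetical hostnames and ports:)

// On a config server (config replica set member):
//   mongo --host mongo-config-1 --port 27019
rs.status()     // replica set member health

// On a mongos:
//   mongo --host mongos-1 --port 27017
sh.status()     // sharding overview, including chunk distribution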

I welcome any advice.



 Comments   
Comment by jiang chao [ 04/Nov/17 ]

Hi,

I got the same issue.
My application stopped writing data into MongoDB for hours until I restarted mongos.
Is it possible to run out of connections because of this issue? I ask because the log shows: sharded connection to rep_1/VCM-16-1:8885,VCM-16-2:8885,VCM-16-3:8885 not being returned to the pool

Comment by Esha Maharishi (Inactive) [ 23/Jun/17 ]

Note that this issue was recently fixed on master and backported to 3.4 for the upcoming 3.4.6 (see linked issue SERVER-28418).

Comment by Randolph Tan [ 15/Dec/16 ]

Attached a repro script to the ticket that demonstrates a similar problem. Note: the script is not written to throw an error when the bug manifests; instead, inspecting the shard logs will reveal multiple instances of "splitChunk cannot find chunk [{ x: MinKey },{ x: MaxKey }) to split, the chunk boundaries may be stale".
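(A hypothetical sketch of the kind of repro described above, using the ShardingTest harness from the server's jstests; the attached repro.js is the authoritative version. The idea is to make one mongos's chunk metadata stale by splitting through a second mongos, then drive inserts through the stale one:)

// Start a cluster with one shard and two mongos processes.
var st = new ShardingTest({ shards: 1, mongos: 2 });
assert.commandWorked(st.s0.adminCommand({ enableSharding: 'test' }));
assert.commandWorked(st.s0.adminCommand({ shardCollection: 'test.user', key: { x: 1 } }));
// Split through the second mongos so the first mongos's chunk view goes stale.
assert.commandWorked(st.s1.adminCommand({ split: 'test.user', middle: { x: 0 } }));
// Inserts through the stale mongos can trigger an autosplit attempt that sends
// outdated chunk boundaries to the shard; check the shard logs for the
// "chunk boundaries may be stale" message.
var coll = st.s0.getDB('test').user;
for (var i = 0; i < 1000; i++) {
    coll.insert({ x: i, padding: new Array(1024).join('x') });
}
st.stop();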

Comment by Ramon Fernandez Marina [ 19/Apr/16 ]

Hi anthony.pastor, sorry you're running into this and thanks for your offer to help. The issue is understood (see Randolph's response above) and does not affect correctness. We'd like to fix it in this development cycle, so feel free to watch this ticket for updates.

Cheers,
Ramón.

Comment by Anthony Pastor [ 19/Apr/16 ]

Hi,

We have the same issue.
Do you need any logs, commands to be run, etc. to help you investigate this?

Regards.

Comment by Randolph Tan [ 08/Apr/16 ]

Note: this warning message appears more often in v3.2 because mongos now explicitly attaches the chunk versions to the splitChunk command.

Comment by Randolph Tan [ 05/Feb/16 ]

Hi,

There is nothing wrong with the config servers. The mongos that is logging the warning is just a little stale compared to the other mongos. I also found a bug in the autosplit path where mongos does not try to update its metadata when getting this stale error. In the meantime, flushRouterConfig should flush the metadata in mongos and force it to refresh, so you don't need to restart the mongos (see the example below). This is only a temporary band-aid until the mongos becomes stale again. Note that this does not affect correctness, as the metadata is only stale with respect to the chunk boundaries, not with respect to where the data resides.

Thanks!
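(For reference, flushRouterConfig is a standard admin command run against each mongos; a minimal mongo shell invocation, with connection details left to the reader:)

// Connect a mongo shell to the affected mongos, then force a metadata refresh:
db.adminCommand({ flushRouterConfig: 1 })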
