We have been periodically experiencing what appears to be a deadlock in our MongoDB setup. Our setup (currently all on one physical machine):
3x mongod shard servers, wiredtiger
3x mongod config servers
We currently have 14 databases, most with 5 or 6 collections each. The system is write-heavy, as we're collecting time-series data. Total data size according to "show dbs" is currently around 1.7 TB.
What happens is that, seemingly at random, the system grinds to a halt. It can run for anywhere from a few hours to, at the longest, around 5 days before this happens. The machine itself remains perfectly responsive, but all mongo processes sit at 0% CPU and the load drops to 0. When running "db.currentOp()" we see a lot of operations with a long "microsecs_running", typically all with an "opid" attribute referencing the same shard. Running something like "show dbs" in this state hangs.
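For anyone trying to reproduce the diagnosis, here is a minimal sketch of how the "db.currentOp()" output could be summarised to see which shard the long-running operations point at. It assumes the output has already been loaded as a Python dict (for example via a driver) with the usual "inprog" list, and that "opid" is a string of the form "shardName:number" as reported through mongos. The function name, the 60 second threshold, and the shard names in the sample are made up for illustration.

```python
from collections import Counter


def stuck_ops_by_shard(current_op, min_microsecs=60_000_000):
    """Count long-running operations per shard from a currentOp-style dict.

    min_microsecs is an arbitrary illustrative threshold (60 seconds).
    """
    counts = Counter()
    for op in current_op.get("inprog", []):
        # Skip operations that have not been running long enough.
        if op.get("microsecs_running", 0) < min_microsecs:
            continue
        opid = op.get("opid")
        # Assumption: via mongos, opid looks like "shard0001:12345".
        # Anything else (e.g. a plain integer) is bucketed as "unknown".
        if isinstance(opid, str) and ":" in opid:
            shard = opid.split(":", 1)[0]
        else:
            shard = "unknown"
        counts[shard] += 1
    return counts


# Hypothetical sample resembling a trimmed currentOp result.
sample = {
    "inprog": [
        {"opid": "shard0001:101", "microsecs_running": 7_200_000_000},
        {"opid": "shard0001:102", "microsecs_running": 3_600_000_000},
        {"opid": "shard0000:55", "microsecs_running": 1_500},
        {"opid": 42, "microsecs_running": 90_000_000},
    ]
}

if __name__ == "__main__":
    for shard, count in stuck_ops_by_shard(sample).most_common():
        print(shard, count)
```

If nearly all of the long-running operations land on one shard name, that is the process that ends up needing the restart in our case.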
A "kill -9" of the shard in question stops the process. After starting it up again, everything resumes normally without our having to touch any of the other processes. It is always one of the mongod shard processes, never mongos or a mongod config server.
We enabled level 2 verbosity logging to hopefully get a better view into what was going on. I am including an example of the "db.currentOp()" as well as the end of the log from when the process last hung. Please note that while there are "COLLSCANS" present in the currentOp output here, we have since tried indexing everything to cover all of our queries and the issue persists.
Please let me know if I can provide any more information or help. This is obviously quite a big issue for us, as it happens about once every 1-2 days and requires human intervention to resolve.