Type: Bug
Resolution: Done
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 3.0.7
Component/s: Sharding, WiredTiger
Labels: RF
Operating System: ALL
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

We have been periodically experiencing what appears to be a deadlock in our MongoDB setup. Our setup (currently all on one physical machine):

CentOS 6.6
MongoDB 3.0.7
3x mongod shard servers, WiredTiger
3x mongod config servers
1x mongos

We currently have 14 databases, most with 5 or 6 collections each. The system is write-heavy, as we're collecting timeseries data. Total data size right now, according to "show dbs", is around 1.7 TB.

What happens is that, seemingly at random, the system grinds to a halt. It can happen after a few hours of running; the longest stretch without a hang was around 5 days. The machine remains perfectly responsive, but all mongo processes are using 0% CPU and the load drops to 0, all the usual signs you'd expect when nothing is making progress. Running "db.currentOp()" shows a lot of operations with a long "microsecs_running", typically all with an "opid" attribute referencing one of the shards. Attempting something like "show dbs" hangs in this state.
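As a rough illustration of how the "db.currentOp()" output described above can be narrowed down to the affected shard, here is a minimal sketch. It assumes the output has been saved and parsed as plain JSON (so "microsecs_running" is an ordinary number), and that "opid" values seen through mongos are strings of the form "<shard name>:<number>". The function name "summarizeStuckOps", the threshold, and the sample data are all hypothetical and are not taken from the attached files.

```javascript
// Hypothetical helper: group long-running operations by the shard named
// in their opid. Input is assumed to be the plain-JSON form of the
// document returned by db.currentOp() when run through mongos.
function summarizeStuckOps(currentOp, minMicros) {
  var byShard = {};
  (currentOp.inprog || []).forEach(function (op) {
    var micros = op.microsecs_running;
    // Skip operations without a numeric running time or below the threshold.
    if (typeof micros !== "number" || micros < minMicros) {
      return;
    }
    // Assumption: opid looks like "shard0001:12345". Anything without a
    // "<name>:" prefix is counted under "unknown".
    var opid = String(op.opid);
    var sep = opid.indexOf(":");
    var shard = sep > 0 ? opid.slice(0, sep) : "unknown";
    if (!byShard[shard]) {
      byShard[shard] = { count: 0, maxMicros: 0 };
    }
    byShard[shard].count += 1;
    byShard[shard].maxMicros = Math.max(byShard[shard].maxMicros, micros);
  });
  return byShard;
}

// Made-up sample data, for illustration only.
var sample = {
  inprog: [
    { opid: "shard0001:101", op: "query", microsecs_running: 4200000000 },
    { opid: "shard0001:102", op: "insert", microsecs_running: 3900000000 },
    { opid: "shard0002:7", op: "query", microsecs_running: 1500 }
  ]
};

// Report operations that have been running for more than 60 seconds.
var summary = summarizeStuckOps(sample, 60 * 1000 * 1000);
console.log(JSON.stringify(summary, null, 2));
```

If nearly all long-running operations fall under a single shard name, that matches the pattern described here, where one mongod shard process is hung while the others are idle.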

A "kill -9" of the shard in question will stop the process. After starting it up again, everything resumes normally without our having to touch any of the other processes. It is always one of the mongod shard processes that hangs, never mongos or a mongod config server.

We enabled verbosity level 2 logging in the hope of getting a better view into what was going on. I am including an example of the "db.currentOp()" output as well as the end of the log from when the process last hung. Please note that while there are "COLLSCAN" entries present in the currentOp output here, we have since tried indexing everything to cover all of our queries, and the issue persists.

Please let me know if I can provide any more information or help. This is obviously quite a big issue for us, as it happens about once every 1-2 days and requires human intervention to resolve.

Attachments:
- currentOp.json (91 kB, Nov 23 2015 01:48:58 PM UTC)
- hung_shard.log (7 kB, Nov 23 2015 01:48:58 PM UTC)
- parsed_iostat.log (686 kB, Nov 30 2015 02:28:55 PM UTC)
- parsed_ss.log (14.27 MB, Nov 30 2015 02:28:55 PM UTC)
- Screen Shot 2016-01-20 at 3.09.29 pm.png (74 kB, Jan 20 2016 04:13:34 AM UTC)

Assignee: Ramon Fernandez Marina
Reporter: Dan Doyle
Participants: Bruce Lucas, Dan Doyle, Ramon Fernandez Marina, Stennie Steneker
Votes: 0
Watchers: 16

Created: Nov 23 2015 01:48:58 PM UTC
Updated: Feb 01 2016 10:48:39 PM UTC
Resolved: Feb 01 2016 10:48:39 PM UTC
