- Type: Bug
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: 3.0.6
- Component/s: JavaScript
- Operating System: ALL
I manage a sharded cluster for my company. It is offered to clients as a free cluster: they provision a database and can use it (with some limitations) from their applications.
I moved from 2.6 to 3.0.6 a week ago (on Thursday 2015-09-24), and ever since I have seen this strange behavior: after being elected primary, a node lasts a few hours (between 2 and 5) and then crashes.
The crash is a segmentation fault.
We have systemd restart the node automatically; in the meantime, a new node is elected primary, runs for a few more hours, then crashes, another one is elected primary, and so on.
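For reference, this is roughly how we check which node is currently primary from the mongo shell (a sketch, not our actual monitoring; it only assumes a connection to any member of the replica set):

{code:javascript}
// Sketch: print the current primary as reported by the replica set.
// rs.status() can be run from a connection to any member.
var primary = rs.status().members.filter(function (m) {
    return m.stateStr === "PRIMARY";
})[0];
print("current primary: " + (primary ? primary.name : "none"));
{code}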
The cluster is composed of 3 config servers, 3 mongos, and 5 mongod, with all 5 mongod in a single replica set serving a single shard.
The 5 mongod are 2 arbiters and 3 data nodes.
The 3 data nodes are 1 MMAPv1 node and 2 WiredTiger nodes.
All 3 data nodes crash a few hours after being elected primary.
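For reference, a sketch of what our replica set configuration looks like (the set name and hostnames are placeholders, not our real ones):

{code:javascript}
// Hypothetical rs.conf() output: 3 data nodes plus 2 arbiters.
{
    "_id" : "rs0",
    "version" : 1,
    "members" : [
        { "_id" : 0, "host" : "data1.example.com:27017" },  // MMAPv1 data node
        { "_id" : 1, "host" : "data2.example.com:27017" },  // WiredTiger data node
        { "_id" : 2, "host" : "data3.example.com:27017" },  // WiredTiger data node
        { "_id" : 3, "host" : "arb1.example.com:27017", "arbiterOnly" : true },
        { "_id" : 4, "host" : "arb2.example.com:27017", "arbiterOnly" : true }
    ]
}
{code}

(The storage engine is a mongod startup option, --storageEngine, so it does not appear in rs.conf; the comments above just note which node runs which engine.)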
I attached the log of a primary, starting 30 seconds before the segfault happens.
/sys/kernel/mm/transparent_hugepage/defrag does not exist on 2 of the 3 servers, and I set it to "never" on the third one.
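If it helps, this is how I checked the setting from the legacy mongo shell running locally on each server (cat() is a shell-native helper; writing "never" was done as root outside of mongo):

{code:javascript}
// Sketch: read the THP defrag setting from the mongo shell on the local host.
// The file is absent on 2 of the 3 servers, so guard against a read error.
var defragPath = "/sys/kernel/mm/transparent_hugepage/defrag";
try {
    print(defragPath + ": " + cat(defragPath));
} catch (e) {
    print(defragPath + " does not exist on this host");
}
{code}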