[SERVER-28422] Cluster stuck because replication heartbeat does not detect hanging members Created: 21/Mar/17 Updated: 31/May/17 Resolved: 22/Mar/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Stability |
| Affects Version/s: | 3.2.8, 3.4.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | VictorGP | Assignee: | Ramon Fernandez Marina |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Operating System: | ALL | ||||||||
| Steps To Reproduce: | I couldn't reproduce the exact IO issue we experienced, because the baremetal setup is complex and I would probably need the same hardware (disks, RAID controller, etc.). However, I managed to reproduce the exact same symptoms using NFS (setup for Ubuntu):
1 - Set up a simple NFS server exporting an empty directory: https://help.ubuntu.com/community/SettingUpNFSHowTo
2 - Install nfs-common and mount the NFS directory.
3 - Create a replicaset of 2 members and an arbiter. One of the members, the one that will be PRIMARY, has its storage.dbPath pointing to the NFS directory: /mongodata1
4 - Write data to the replicaset and check that everything works as expected (test primary switches, etc.).
5 - While the member whose dbPath points to the NFS directory is the PRIMARY, stop the NFS daemon on the NFS server.
6 - Keep writing to the replicaset. You will be able to write for some time (probably because it is still using the file system cache), but a 'show dbs' hangs, and after some seconds the writes hang too.
This can be extended to a sharded cluster: create another replicaset, add the shards using mongos, shard a collection, and write data to that collection through mongos. At this point, the whole cluster is unresponsive.
Another important thing to note: the IO locking issue we had once happened on a secondary member. This also left the whole replicaset stuck, and therefore the whole cluster too. I couldn't reproduce that case with NFS.
I've reproduced this in 3.2.8 and in 3.4.2. |
||||||||
| Participants: | |||||||||
| Description |
|
We've hit a bug that has made our entire MongoDB cluster (15 baremetal replicasets of 2 members + arbiter each) unresponsive several times. Whenever an issue makes the mongod process hang, the cluster gets stuck too; such a hang should be detected by the replication heartbeat and trigger a primary switch. In our case, IO issues left the mongod process blocked waiting for IO and unresponsive: no queries could be run against that member, because they all hung on the IO wait. According to the documentation, the heartbeat is a ping. I'm not sure what kind of ping, but it is not enough to detect a bad member: if the problem is IO (one of the main problems in databases), the ping and even a TCP connection still work. Also, when a member is completely stuck and unresponsive, it is probably worth considering removing it from the replicaset rather than just transitioning it to secondary, because all the replication threads between the primary and the secondary will hang. |
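To illustrate the distinction drawn above between "the process answers a ping" and "the process can still do IO", here is a minimal Python sketch of a storage probe with a deadline. It is not how mongod's heartbeat works and is not a proposed patch; the function names (`run_with_deadline`, `write_and_fsync`, `storage_is_responsive`) and the 5-second default are hypothetical, chosen only for illustration. The idea is simply that a health check which writes and fsyncs a small file under the data directory would fail when the storage hangs, while a network-level ping would keep succeeding.

```python
import os
import tempfile
import threading
import time


def run_with_deadline(fn, timeout_s):
    """Run fn() in a daemon thread; return True only if it succeeds in time.

    A thread blocked on hung storage cannot be cancelled, so the worker is
    left running and the caller is merely told that the deadline passed.
    """
    finished = threading.Event()
    outcome = {"ok": False}

    def worker():
        try:
            fn()
            outcome["ok"] = True
        except OSError:
            pass  # an IO error also counts as "not healthy"
        finally:
            finished.set()

    threading.Thread(target=worker, daemon=True).start()
    return finished.wait(timeout_s) and outcome["ok"]


def write_and_fsync(directory):
    """Write a small temp file under `directory` and force it to storage."""
    fd, path = tempfile.mkstemp(dir=directory, prefix="io-probe-")
    try:
        os.write(fd, b"probe")
        os.fsync(fd)  # blocks if the underlying storage is hanging
    finally:
        os.close(fd)
        os.unlink(path)


def storage_is_responsive(directory, timeout_s=5.0):
    """Hypothetical health check: can we write and fsync within timeout_s?"""
    return run_with_deadline(lambda: write_and_fsync(directory), timeout_s)
```

A monitoring script could call `storage_is_responsive("/mongodata1")` periodically and raise an alert, or step the member down, when it returns False. On a hung mount the worker thread may stay blocked indefinitely, which is why the check reports failure by deadline instead of waiting for the write to return.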
| Comments |
| Comment by VictorGP [ 22/Mar/17 ] |
|
Yes, it looks like the same issue. I will comment there.
| Comment by Ramon Fernandez Marina [ 22/Mar/17 ] |
|
I believe the root cause is the same one as described in Regards, |