[SERVER-39371] Mongo cluster got stuck at "could not find member to sync from" Created: 04/Feb/19  Updated: 08/Feb/19  Resolved: 08/Feb/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Hongkai Wu [X] Assignee: Eric Sedor
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2019-02-06 at 9.59.07 AM.png     Zip Archive diagnostic.data.10.132.203.84.zip     PDF File log.pdf     Zip Archive mongod.10.132.203.84.log.zip    
Participants:

 Description   

I have a mongo cluster with 5 nodes. The primary somehow "half-died": from the outside, the cluster had no primary, but inside the cluster some nodes could still talk to it for a while before reporting connection failures. After a secondary failed to connect to the primary, it tried to choose another member to sync from and could not. The cluster stayed stuck there until an engineer got involved and forcibly removed the dead member and promoted a new primary.
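For context, a minimal sketch of the kind of manual intervention described above, run against a surviving secondary (hostnames are placeholders; a forced reconfig bypasses the usual majority requirement and should be a last resort):

    # Inspect member health from a surviving secondary.
    mongo --host <surviving-secondary> --eval 'rs.status().members.forEach(function (m) { print(m.name, m.stateStr, m.health); })'

    # Remove the unreachable member from the replica set configuration so the
    # remaining members can hold an election; force:true is needed when no primary is reachable.
    mongo --host <surviving-secondary> --eval '
      var c = rs.conf();
      c.members = c.members.filter(function (m) { return m.host !== "<dead-primary>:27017"; });
      rs.reconfig(c, { force: true });'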

Version: 3.4

Cluster size: 5 nodes

mongo config:

systemLog:
  destination: file
  logAppend: true
  path: /var/log/mongodb/mongod.log
  quiet: true

net:
  http:
    enabled: true

# how the process runs
processManagement:
  fork: true  # fork and run in background
  pidFilePath: /var/run/mongodb/mongod.pid  # location of pidfile

storage:
  engine: 'wiredTiger'
  dbPath: /mnt/mongodb

replication:
  replSetName: snapshot-catalog-mongo
  oplogSizeMB: 24000

operationProfiling:
  slowOpThresholdMs: 10000
  mode: off

The mongod daemon runs as a systemd service. The service file is configured as follows:

[Unit]
Description=MongoDB Database Server
After=network.target
Documentation=https://docs.mongodb.org/manual

[Service]
User=pure
Group=pure
Environment="OPTIONS=-f /etc/mongodb.conf"
ExecStartPre=/opt/pure/setup-mongo.sh
ExecStart=/usr/bin/mongod $OPTIONS
PermissionsStartOnly=true
PIDFile=/var/run/mongodb/mongod.pid
Type=forking

# file size
LimitFSIZE=infinity
# cpu time
LimitCPU=infinity
# virtual memory size
LimitAS=infinity
# open files
LimitNOFILE=64000
# processes/threads
LimitNPROC=64000
# locked memory
LimitMEMLOCK=infinity
# total threads (user+kernel)
TasksMax=infinity
TasksAccounting=false
# Recommended limits for mongod as specified in
# http://docs.mongodb.org/manual/reference/ulimit/#recommended-settings
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
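For completeness, the service defined above would typically be installed and inspected with standard systemd commands (a usage sketch; the unit name mongod.service is assumed from the file's contents):

    # Reload unit files after editing, enable at boot, and start the service.
    sudo systemctl daemon-reload
    sudo systemctl enable mongod.service
    sudo systemctl start mongod.service

    # Check service state and recent log output around the incident window.
    systemctl status mongod.service
    journalctl -u mongod.service --since "2019-01-16 00:30" --until "2019-01-16 02:00"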

Timeline:

1. [2019-01-16 ~1am] The primary node goes down. None of the secondary nodes react to this; they all stay as secondaries, so the cluster cannot take any traffic.

From the outside, the cluster lost its primary at 00:55AM.

(Screenshot: replication lag during the 1-2am time window.)

However, inside the mongo cluster, some nodes were still in sync with the primary. The last node to lose its connection to the primary did so at 01:20AM.
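As a reference, a minimal sketch of how replication lag could be inspected at the time (the hostname is a placeholder and the helper name is the 3.4-era one):

    # Show how far each secondary is behind the primary's oplog.
    mongo --host <any-reachable-member> --eval 'rs.printSlaveReplicationInfo()'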

 

Primary node's last log before crashing:

2019-01-16T08:50:20.412+0000 I NETWORK [conn391702] received client metadata from 10.132.205.155:50350 conn391702: { driver: { name: "mongo-java-driver", version: "unknown" }, os: { type: "Linux", name: "Linux", architecture: "amd64", version: "4.15.0-1021-aws" }, platform: "Java/Oracle Corporation/1.8.0_191-8u191-b12-0ubuntu0.18.04.1-b12" }
2019-01-16T08:50:25.410+0000 I NETWORK [conn391703] received client metadata from 10.132.211.83:54270 conn391703: { driver: { name: "PyMongo", version: "3.5.1" }, os: { type: "Linux", name: "Ubuntu 16.04 xenial", architecture: "x86_64", version: "4.4.0-1072-aws" }, platform: "CPython 2.7.14.final.0" }

Secondary nodes' logs are attached below:

log.pdf
 
 

Questions

1. What's the reason that primary node crashed?

Its log only printed a lot of binary <0x00> bytes at the end, and that's it. What happened?

2. What's the reason that secondary nodes cannot select a new primary?

The secondaries keep saying they "could not find a member to sync from". Is the cluster in a strange state where the primary can still talk to the secondaries but the secondaries cannot talk to the primary? In that case the secondaries would still see the primary as alive and therefore not elect a new primary, yet they could not sync data from it.
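For what it's worth, a minimal sketch of how this suspicion could be checked from any reachable member (the hostname is a placeholder; field names are the 3.4-era ones from rs.status()):

    # For each member, print its state, the last heartbeat message, and the current sync source.
    mongo --host <any-reachable-member> --eval '
      rs.status().members.forEach(function (m) {
        print(m.name, m.stateStr,
              "heartbeat:", m.lastHeartbeatMessage || "",
              "syncingTo:", m.syncingTo || "");
      })'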



 Comments   
Comment by Eric Sedor [ 08/Feb/19 ]

You are most welcome Wu. If this happens again, any information about what happened with the disk will be helpful!

Comment by Hongkai Wu [X] [ 07/Feb/19 ]

Interesting. I cannot find any related info in syslog or dmesg. I'm happy at least to know this is some issue related to writing to nvme0n1. If I run into the same issue again, I will give you more information. Is there anything else we can dig into? If not, I think we can close the ticket!

Thanks again for your help!

Comment by Eric Sedor [ 06/Feb/19 ]

We are a little surprised to hear that, given that write metrics out of the WiredTiger cache seem to correspond to writes to nvme1n1.

However, what we are able to say is that it looks like the trouble corresponds to issues with nvme0n1. Can you look into the syslog and dmesg for the time period to see if anything stands out?
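A minimal sketch of commands that could surface such errors, assuming a standard Linux host and an incident window around 2019-01-16 01:00 (paths and the grep patterns are illustrative):

    # Kernel messages mentioning the NVMe device or I/O errors.
    dmesg -T | grep -iE 'nvme0n1|i/o error|blk_update_request'

    # System log entries from the incident window.
    grep -iE 'nvme0n1|error' /var/log/syslog* | grep 'Jan 16'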

Comment by Hongkai Wu [X] [ 06/Feb/19 ]

Thanks for taking a look, Eric. All the data files, indexes, etc. are on that volume. We set the storage dbPath to /mnt; nvme0n1 is mounted at /mnt.

Comment by Eric Sedor [ 06/Feb/19 ]

Thank you Kyle! Can you let us know what is on nvme0n1? We're curious because our initial look at the diagnostic.data suggests significant utilization of that volume.

Comment by Hongkai Wu [X] [ 05/Feb/19 ]

diagnostic.data.10.132.203.84.zip

mongod.10.132.203.84.log.zip

Hi @Eric, thanks for your reply. Unfortunately I lost all of that day's diagnostic.data on the secondary nodes, but I did find diagnostic.data on the crashed primary node. I attached it and the mongod log above.

Kyle

Comment by Eric Sedor [ 04/Feb/19 ]

Hello and thanks for the detail so far.

Would you please archive (tar or zip) the $dbpath/diagnostic.data directories for all 5 nodes and attach them to this ticket?

As well, can you provide the logs of the crash from the Primary?

Thank you!
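For reference, a minimal sketch of archiving a node's diagnostic.data directory as requested above, assuming the dbPath and log path from the configuration in the description (adjust paths and naming as needed):

    # Archive the FTDC diagnostic data for one node; repeat on each member.
    tar -czf diagnostic.data.$(hostname -i).tar.gz -C /mnt/mongodb diagnostic.data

    # Compress a copy of the mongod log referenced in the systemLog configuration.
    gzip -k /var/log/mongodb/mongod.log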
