Type: Question
Resolution: Done
Priority: Major - P3
Affects Version/s: None
Component/s: Replication
I have a MongoDB cluster with 5 nodes. The primary somehow "half-died": from outside, the cluster had no primary, but inside the cluster some nodes could still talk to it for a while before reporting connection failures. After a secondary failed to connect to the primary, it tried to choose another member to sync from and could not, and it stayed stuck there until an engineer got involved to forcibly remove the dead member and promote a new primary.
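For illustration only, a minimal sketch of what such a forced recovery can look like, assuming pymongo and made-up host names (the exact commands the engineer ran are not recorded in this ticket):

# Hypothetical recovery sketch, not the exact steps used in this incident.
# Force-reconfigure the replica set from a surviving secondary after the
# primary "half-dies"; host names are made up.
from pymongo import MongoClient

# Connect directly to a surviving secondary (no primary is reachable, so a
# replica-set connection string would not work here).
client = MongoClient("mongo-node-2.example.com", 27017, directConnection=True)

# Fetch the current config, drop the unreachable member, bump the version.
cfg = client.admin.command("replSetGetConfig")["config"]
cfg["members"] = [m for m in cfg["members"]
                  if m["host"] != "mongo-node-1.example.com:27017"]
cfg["version"] += 1

# A forced reconfig does not require a primary; the remaining members can
# then hold a normal election among themselves.
client.admin.command({"replSetReconfig": cfg, "force": True})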
Version: 3.4
Cluster size: 5 nodes
mongod config (/etc/mongodb.conf):
systemLog:
  destination: file
  logAppend: true
  path: /var/log/mongodb/mongod.log
  quiet: true

net:
  http:
    enabled: true

# how the process runs
processManagement:
  fork: true  # fork and run in background
  pidFilePath: /var/run/mongodb/mongod.pid  # location of pidfile

storage:
  engine: 'wiredTiger'
  dbPath: /mnt/mongodb

replication:
  replSetName: snapshot-catalog-mongo
  oplogSizeMB: 24000

operationProfiling:
  slowOpThresholdMs: 10000
  mode: off
The mongod daemon runs as a systemd service. The unit file is configured as follows:
[Unit]
Description=MongoDB Database Server
After=network.target
Documentation=https://docs.mongodb.org/manual
[Service]
User=pure
Group=pure
Environment="OPTIONS=-f /etc/mongodb.conf"
ExecStartPre=/opt/pure/setup-mongo.sh
ExecStart=/usr/bin/mongod $OPTIONS
PermissionsStartOnly=true
PIDFile=/var/run/mongodb/mongod.pid
Type=forking
# file size
LimitFSIZE=infinity
# cpu time
LimitCPU=infinity
# virtual memory size
LimitAS=infinity
# open files
LimitNOFILE=64000
# processes/threads
LimitNPROC=64000
# locked memory
LimitMEMLOCK=infinity
# total threads (user+kernel)
TasksMax=infinity
TasksAccounting=false
# Recommended limits for mongod as specified in
# http://docs.mongodb.org/manual/reference/ulimit/#recommended-settings
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
Timeline:
1. [2019-01-16 ~1am] The primary node went down. None of the secondary nodes detected this; they all stayed in the SECONDARY state, so the cluster could not take any traffic.
From outside, the cluster lost its primary at 00:55 AM.
Replication lag graph for the 1-2 AM time window (attached).
Inside the cluster, however, some nodes were still in sync; the last node to lose its connection to the primary did so at 01:20 AM.
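(For reference, a rough sketch of how per-member replication lag can be read from replSetGetStatus with pymongo; the host name is hypothetical and this script is not part of the original report.)

# Rough replication-lag readout via replSetGetStatus; host name is made up.
from pymongo import MongoClient

client = MongoClient("mongo-node-2.example.com", 27017, directConnection=True,
                     serverSelectionTimeoutMS=2000)
status = client.admin.command("replSetGetStatus")

# Use the newest optime among members as the reference point (normally the primary's).
newest = max(m["optimeDate"] for m in status["members"] if "optimeDate" in m)
for m in status["members"]:
    if "optimeDate" in m:
        lag = (newest - m["optimeDate"]).total_seconds()
        print("%-40s %-10s lag=%.0fs" % (m["name"], m["stateStr"], lag))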
The primary node's last log lines before crashing:
2019-01-16T08:50:20.412+0000 I NETWORK [conn391702] received client metadata from 10.132.205.155:50350 conn391702: { driver:
, os: { type: "Linux", name: "Linux", architecture: "amd64", version: "4.15.0-1021-aws" }, platform: "Java/Oracle Corporation/1.8.0_191-8u191-b12-0ubuntu0.18.04.1-b12" }
2019-01-16T08:50:25.410+0000 I NETWORK [conn391703] received client metadata from 10.132.211.83:54270 conn391703: { driver:
, os: { type: "Linux", name: "Ubuntu 16.04 xenial", architecture: "x86_64", version: "4.4.0-1072-aws" }, platform: "CPython 2.7.14.final.0" }
The secondary nodes' logs are attached below.
Questions
1. What caused the primary node to crash?
Its log only contains a long run of binary <0x00> bytes and nothing else. What happened?
2. Why were the secondary nodes unable to elect a new primary?
The secondaries kept saying they "could not find a member to sync from". Could the cluster be in a strange state where the primary can still talk to the secondaries but the secondaries cannot talk to the primary? The secondaries would then still consider the primary alive and not elect a new one, yet be unable to sync data from it.
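For what it's worth, here is a hedged diagnostic sketch (pymongo, hypothetical host names, not from the original report) of the kind of per-member check that would confirm or rule out such an asymmetric state: each secondary is asked what it is syncing from and when it last received a heartbeat from each other member.

# Ask each surviving secondary what it currently thinks about the other
# members and what it is syncing from; host names are hypothetical.
from pymongo import MongoClient

SECONDARIES = ["mongo-node-%d.example.com" % i for i in range(2, 6)]

for host in SECONDARIES:
    try:
        c = MongoClient(host, 27017, directConnection=True,
                        serverSelectionTimeoutMS=2000)
        status = c.admin.command("replSetGetStatus")
    except Exception as exc:
        print("%s: unreachable (%s)" % (host, exc))
        continue
    # 'syncingTo' is the field name on 3.4; newer servers report 'syncSourceHost'.
    print(host, "sync source:", status.get("syncSourceHost") or status.get("syncingTo"))
    for m in status["members"]:
        if m.get("self"):
            continue
        print("  sees %s as %s, last heartbeat received %s"
              % (m["name"], m["stateStr"], m.get("lastHeartbeatRecv")))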