Type: Question
Resolution: Done
Priority: Major - P3
Affects Version/s: None
Component/s: Replication
I have a MongoDB cluster with 5 nodes. The primary somehow "half-died": from outside, the cluster had no primary, but inside the cluster some nodes could still talk to it for a while before reporting connection failures. After a secondary failed to connect to the primary, it tried to choose another member to sync from and could not, and it stayed stuck there until an engineer got involved to forcibly remove the dead member and promote a new primary.
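For illustration only, a minimal sketch of what such a forced recovery can look like, assuming pymongo and made-up host names (the exact commands the engineer ran are not recorded in this ticket):

# Hypothetical recovery sketch, not the exact steps used in this incident.
# Force-reconfigure the replica set from a surviving secondary after the
# primary "half-dies"; host names are made up.
from pymongo import MongoClient

# Connect directly to a surviving secondary (no primary is reachable, so a
# replica-set connection string would not work here).
client = MongoClient("mongo-node-2.example.com", 27017, directConnection=True)

# Fetch the current config, drop the unreachable member, bump the version.
cfg = client.admin.command("replSetGetConfig")["config"]
cfg["members"] = [m for m in cfg["members"]
                  if m["host"] != "mongo-node-1.example.com:27017"]
cfg["version"] += 1

# A forced reconfig does not require a primary; the remaining members can
# then hold a normal election among themselves.
client.admin.command({"replSetReconfig": cfg, "force": True})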
Version: 3.4
Cluster size: 5 nodes
mongod config (/etc/mongodb.conf):
systemLog:
  destination: file
  logAppend: true
  path: /var/log/mongodb/mongod.log
  quiet: true

net:
  http:
    enabled: true

# how the process runs
processManagement:
  fork: true  # fork and run in background
  pidFilePath: /var/run/mongodb/mongod.pid  # location of pidfile

storage:
  engine: 'wiredTiger'
  dbPath: /mnt/mongodb

replication:
  replSetName: snapshot-catalog-mongo
  oplogSizeMB: 24000

operationProfiling:
  slowOpThresholdMs: 10000
  mode: off
The mongod daemon runs as a systemd service. The unit file is configured as follows:
[Unit]
Description=MongoDB Database Server
After=network.target
Documentation=https://docs.mongodb.org/manual
[Service]
User=pure
Group=pure
Environment="OPTIONS=-f /etc/mongodb.conf"
ExecStartPre=/opt/pure/setup-mongo.sh
ExecStart=/usr/bin/mongod $OPTIONS
PermissionsStartOnly=true
PIDFile=/var/run/mongodb/mongod.pid
Type=forking
# file size
LimitFSIZE=infinity
# cpu time
LimitCPU=infinity
# virtual memory size
LimitAS=infinity
# open files
LimitNOFILE=64000
# processes/threads
LimitNPROC=64000
# locked memory
LimitMEMLOCK=infinity
# total threads (user+kernel)
TasksMax=infinity
TasksAccounting=false
# Recommended limits for mongod as specified in
# http://docs.mongodb.org/manual/reference/ulimit/#recommended-settings
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
Timeline:
1. [2019-01-16 ~1am] The primary node went down. None of the secondary nodes detected this; they all stayed in the SECONDARY state, so the cluster could not take any traffic.
From outside, the cluster lost its primary at 00:55 AM.
Replication lag graph for the 1-2 AM time window (attached).
Inside the cluster, however, some nodes were still in sync; the last node to lose its connection to the primary did so at 01:20 AM.
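(For reference, a rough sketch of how per-member replication lag can be read from replSetGetStatus with pymongo; the host name is hypothetical and this script is not part of the original report.)

# Rough replication-lag readout via replSetGetStatus; host name is made up.
from pymongo import MongoClient

client = MongoClient("mongo-node-2.example.com", 27017, directConnection=True,
                     serverSelectionTimeoutMS=2000)
status = client.admin.command("replSetGetStatus")

# Use the newest optime among members as the reference point (normally the primary's).
newest = max(m["optimeDate"] for m in status["members"] if "optimeDate" in m)
for m in status["members"]:
    if "optimeDate" in m:
        lag = (newest - m["optimeDate"]).total_seconds()
        print("%-40s %-10s lag=%.0fs" % (m["name"], m["stateStr"], lag))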
The primary node's last log lines before crashing:
2019-01-16T08:50:20.412+0000 I NETWORK [conn391702] received client metadata from 10.132.205.155:50350 conn391702: { driver:
, os: { type: "Linux", name: "Linux", architecture: "amd64", version: "4.15.0-1021-aws" }, platform: "Java/Oracle Corporation/1.8.0_191-8u191-b12-0ubuntu0.18.04.1-b12" }
2019-01-16T08:50:25.410+0000 I NETWORK [conn391703] received client metadata from 10.132.211.83:54270 conn391703: { driver:
, os: { type: "Linux", name: "Ubuntu 16.04 xenial", architecture: "x86_64", version: "4.4.0-1072-aws" }, platform: "CPython 2.7.14.final.0" }
The secondary nodes' logs are attached below.
Questions
1. What caused the primary node to crash?
Its log only contains a long run of binary <0x00> bytes and nothing else. What happened?
2. Why were the secondary nodes unable to elect a new primary?
The secondaries kept saying they "could not find a member to sync from". Could the cluster be in a strange state where the primary can still talk to the secondaries but the secondaries cannot talk to the primary? The secondaries would then still consider the primary alive and not elect a new one, yet be unable to sync data from it.
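For what it's worth, here is a hedged diagnostic sketch (pymongo, hypothetical host names, not from the original report) of the kind of per-member check that would confirm or rule out such an asymmetric state: each secondary is asked what it is syncing from and when it last received a heartbeat from each other member.

# Ask each surviving secondary what it currently thinks about the other
# members and what it is syncing from; host names are hypothetical.
from pymongo import MongoClient

SECONDARIES = ["mongo-node-%d.example.com" % i for i in range(2, 6)]

for host in SECONDARIES:
    try:
        c = MongoClient(host, 27017, directConnection=True,
                        serverSelectionTimeoutMS=2000)
        status = c.admin.command("replSetGetStatus")
    except Exception as exc:
        print("%s: unreachable (%s)" % (host, exc))
        continue
    # 'syncingTo' is the field name on 3.4; newer servers report 'syncSourceHost'.
    print(host, "sync source:", status.get("syncSourceHost") or status.get("syncingTo"))
    for m in status["members"]:
        if m.get("self"):
            continue
        print("  sees %s as %s, last heartbeat received %s"
              % (m["name"], m["stateStr"], m.get("lastHeartbeatRecv")))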