- Type: Bug
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
Greetings, we need assistance determining the cause of an issue with our MongoDB cluster.
Cluster settings:
- 3 nodes
- Runs on VMware vCenter VMs
- MongoDB shell version v4.2.2
Static hostname: m1-prod-vm-db-mongo02
Icon name: computer-vm
Chassis: vm
Machine ID: 8edfb1744b9947f4a12093a8259d93d0
Boot ID: aed3637a32ca449798827711859ac700
Virtualization: vmware
Operating System: Red Hat Enterprise Linux Server 7.6 (Maipo)
CPE OS Name: cpe:/o:redhat:enterprise_linux:7.6:GA:server
Kernel: Linux 3.10.0-957.27.2.el7.x86_64
Architecture: x86-64
While running a script that updates data in the collections (update.txt), one of the replica set nodes failed.
To repair the cluster we performed the following:
- Changed /etc/mongod.conf on the failed node: switched the port to 27018 and commented out the replication settings so that the node runs in standalone mode.
- Executed the same script on the failed node in standalone mode; it finished successfully.
- Synced the collections with data from the healthy replica node using this script:
#!/bin/bash
# Usage: <script> BEGIN_DATE LAST_DATE (dates in YYYY-MM-DD form; LAST_DATE itself is not processed)
BEGIN_DATE=$1
LAST_DATE=$2
LOCAL_DATABASE=audit_prod
REMOTE_DATABASE=audit_prod
REMOTE_HOST=m1-prod-vm-db-mongo02
CURRENT_DATE=$BEGIN_DATE
# Collections are named by date; copy them one day at a time.
while [ "$CURRENT_DATE" != "$LAST_DATE" ]; do
    echo "Current date: $CURRENT_DATE"
    # Stream the dump of one collection over ssh into mongorestore on the standalone node.
    mongodump --db="${LOCAL_DATABASE}" --collection="${CURRENT_DATE}" --archive | ssh "${REMOTE_HOST}" -T "mongorestore --archive --port=27018"
    echo "Finished syncing date: $CURRENT_DATE"
    CURRENT_DATE=$(date -d "${CURRENT_DATE} +1 day" +%F)
done
- Synced the oplog collection from the healthy replica to the failed one:
mongodump -d local -c oplog.rs -o /data/dump/oplog
scp -rp /data/dump/ <username>@<hostname>:/data/dump/
mongorestore -vvvv -d local --port=27018 /data/dump/
- Restored the original configuration on the failed node.
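A note on the date loop in the collection sync script: it stops only when the current date is exactly equal to LAST_DATE, so a malformed or earlier end date would make it run indefinitely. The following is a minimal sketch of a bounded alternative, assuming GNU date (as shipped with RHEL 7) and collection names in YYYY-MM-DD form. The function name date_range is hypothetical and not part of the script we ran.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): print every date
# from $1 up to, but not including, $2, one per line, in YYYY-MM-DD form.
# Assumes GNU date, which understands "-d '<date> +1 day'".
date_range() {
    local current last
    # Normalise both arguments and fail early on malformed input.
    current=$(date -d "$1" +%F) || return 1
    last=$(date -d "$2" +%F) || return 1
    # ISO dates sort lexicographically, so a string comparison is enough.
    # The loop also terminates when the range is empty or reversed.
    while [[ "$current" < "$last" ]]; do
        echo "$current"
        current=$(date -d "$current +1 day" +%F)
    done
}
```

Each printed date could then be passed as the --collection value to the same mongodump | mongorestore pipeline, for example with `date_range "$BEGIN_DATE" "$LAST_DATE" | while read -r day; do ...; done`.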
The logs from the time of the failure are attached (logs.txt).