[SERVER-64034] Replica node failure due to script execution Created: 28/Feb/22  Updated: 02/Jun/22  Resolved: 29/Mar/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Anush Chinoian Assignee: Edwin Zhou
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: logs.txt, mongod.conf, update.txt
Operating System: ALL
Participants:

 Description   

Greetings, we need help determining the cause of an issue with our MongoDB cluster.

Cluster settings:

  • 3 nodes
  • Run on VMware vCenter VMs
  • MongoDB shell version v4.2.2

Static hostname: m1-prod-vm-db-mongo02
Icon name: computer-vm
Chassis: vm
Machine ID: 8edfb1744b9947f4a12093a8259d93d0
Boot ID: aed3637a32ca449798827711859ac700
Virtualization: vmware
Operating System: Red Hat Enterprise Linux Server 7.6 (Maipo)
CPE OS Name: cpe:/o:redhat:enterprise_linux:7.6:GA:server
Kernel: Linux 3.10.0-957.27.2.el7.x86_64
Architecture: x86-64

While running a script that updates data in collections (update.txt), one of the replica nodes failed.

To repair the cluster we performed the following:

  1. Changed /etc/mongod.conf on the failed node: switched the port to 27018 and commented out the replication settings so the node runs in standalone mode.
  2. Ran the same script on the failed node in standalone mode; it finished successfully.
  3. Synced the collections with data from the healthy replica node using this script:

    #!/bin/bash
    # Stream one collection per day to the remote node; each collection name is a date in %F (YYYY-MM-DD) format.
    BEGIN_DATE=$1
    LAST_DATE=$2
    LOCAL_DATABASE=audit_prod
    REMOTE_DATABASE=audit_prod   # defined but not used below
    REMOTE_HOST=m1-prod-vm-db-mongo02
    CURRENT_DATE=$BEGIN_DATE
    # The loop stops when CURRENT_DATE reaches LAST_DATE, so LAST_DATE itself is not copied.
    while [ "$CURRENT_DATE" != "$LAST_DATE" ]
    do
      echo "Current date: $CURRENT_DATE"
      mongodump --db="${LOCAL_DATABASE}" --collection="${CURRENT_DATE}" --archive | ssh "${REMOTE_HOST}" -T "mongorestore --archive --port=27018"
      echo "Sync finished for date: $CURRENT_DATE"
      CURRENT_DATE=$(date -d "${CURRENT_DATE} +1 day" +%F)
    done

  4. Synced the oplog collection from the healthy replica to the failed one:

    mongodump -d local -c oplog.rs -o /data/dump/oplog
    scp -rp /data/dump/ <username>@<hostname>:/data/dump/
    mongorestore -vvvv -d local --port=27018 /data/dump/

  5. Restored the original configuration on the failed node.
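
The date loop in the collection sync script can be checked with a dry run before any data is copied. The sketch below is only an illustration: `list_sync_dates` is a hypothetical helper that is not part of the attached scripts, and it assumes GNU `date` and collection names in YYYY-MM-DD form. It prints the dates the loop would visit, without calling mongodump.

```shell
#!/bin/bash
# list_sync_dates BEGIN LAST
# Hypothetical helper (not from the ticket): print each date from BEGIN
# up to, but not including, LAST. Requires GNU date for "-d ... +1 day".
list_sync_dates() {
  local current=$1 last=$2
  # String comparison orders YYYY-MM-DD dates correctly and, unlike a
  # plain != test, ends the loop when LAST is earlier than BEGIN.
  while [[ "$current" < "$last" ]]; do
    echo "$current"
    current=$(date -d "$current +1 day" +%F)
  done
}
```

For example, `list_sync_dates 2022-02-19 2022-02-22` (dates chosen only for illustration) lists 2022-02-19, 2022-02-20 and 2022-02-21, which shows that the end date itself is excluded.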

Logs from the time of the failure are attached (logs.txt).



 Comments   
Comment by Edwin Zhou [ 29/Mar/22 ]

Hi anush.chinoian@gmail.com,

We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Best,
Edwin

Comment by Edwin Zhou [ 10/Mar/22 ]

Hi anush.chinoian@gmail.com,

Is this problem still occurring? We look forward to receiving the requested data if this is still an issue for you.

Best,
Edwin

Comment by Dmitry Agranat [ 03/Mar/22 ]

anush.chinoian@gmail.com, unfortunately, without the logs and diagnostic.data it is not possible to determine the cause of the issue. You can use this Jira search to inspect all potential causes for the reported issue.

Comment by Anush Chinoian [ 03/Mar/22 ]

Thanks for the reply. If the problem persists, we will upload diagnostic information and logs without personal data.

But based on the current incident, can you tell what could have been the cause in principle (for example, one of the bugs fixed in later versions)?

Comment by Dmitry Agranat [ 01/Mar/22 ]

Thanks for the update anush.chinoian@gmail.com, if this happens again, please save the requested data and upload it to the secure uploader.

Comment by Anush Chinoian [ 01/Mar/22 ]

We are running version 4.2.2.

Comment by Anush Chinoian [ 28/Feb/22 ]

Sorry, but I can't upload the full mongod.log files because we use MongoDB as an audit database and the logs contain personal data. The problem was discovered on 21.02, but diagnostic.data files are only available for 24.02 and later.

Comment by Dmitry Agranat [ 28/Feb/22 ]

Hi anush.chinoian@gmail.com,

Does the issue occur when MongoDB is running the latest 4.2 release, which is currently 4.2.18?
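
A quick way to tell whether a node is behind a given patch release is a version-aware comparison. The sketch below is an assumption-laden illustration: `version_lt` is a hypothetical helper, and it relies on `sort -V` from GNU coreutils. A plain string comparison would misorder these versions, because "4.2.2" sorts after "4.2.18" character by character.

```shell
#!/bin/bash
# version_lt A B
# Hypothetical helper: succeed (exit 0) when version A is strictly older
# than version B. Uses GNU sort -V for version-aware ordering.
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}
```

For example, `version_lt 4.2.2 4.2.18 && echo "upgrade available"` prints the message for the version reported in this ticket.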

In order to investigate this issue, we will need some additional information. Would you please archive (tar or zip) the full mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) from all members of this replica set covering the time of this event and upload them to this support uploader location?

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.
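
The archive requested above could be produced on each member along these lines. This is a minimal sketch, not an official procedure: `archive_diagnostics` is a hypothetical helper, and the log directory, dbpath and output file are placeholders to be replaced with the values from each member's mongod.conf.

```shell
#!/bin/bash
# archive_diagnostics LOG_DIR DBPATH OUT
# Hypothetical helper: bundle the mongod.log files (including rotated
# ones matching mongod.log*) and DBPATH/diagnostic.data into OUT (.tar.gz).
# All three arguments are placeholders for the member's real paths.
archive_diagnostics() {
  local log_dir=$1 dbpath=$2 out=$3
  # tar drops the leading "/" from absolute member names and says so on
  # stderr; that message is informational.
  tar -czf "$out" "$log_dir"/mongod.log* "$dbpath/diagnostic.data"
}
```

A call might look like `archive_diagnostics /var/log/mongodb /var/lib/mongo /tmp/member1.tar.gz`, where all three paths are assumptions rather than values taken from this ticket.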

Thanks,
Dima

Generated at Thu Feb 08 05:59:20 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.