SERVER-64034 (Core Server)

Replica node failure due to script execution


Details

    • Bug
    • Status: Closed
    • Major - P3
    • Resolution: Done

    Description

      Greetings, we need assistance in determining the cause of a failure in our MongoDB cluster.

      Cluster settings:

      • 3 nodes
      • Run on VMware vCenter VMs
      • MongoDB shell version v4.2.2

      Static hostname: m1-prod-vm-db-mongo02
      Icon name: computer-vm
      Chassis: vm
      Machine ID: 8edfb1744b9947f4a12093a8259d93d0
      Boot ID: aed3637a32ca449798827711859ac700
      Virtualization: vmware
      Operating System: Red Hat Enterprise Linux Server 7.6 (Maipo)
      CPE OS Name: cpe:/o:redhat:enterprise_linux:7.6:GA:server
      Kernel: Linux 3.10.0-957.27.2.el7.x86_64
      Architecture: x86-64

      While running a script that updates data in our collections (update.txt), one of the replica set nodes failed.

      To repair the cluster we performed the following:

      1. Changed /etc/mongod.conf on the failed node: switched the port to 27018 and commented out the replication settings so that it runs in standalone mode.
      2. Executed the same script on the failed node in standalone mode; it finished successfully.
      3. Synced the collections with data from the healthy replica node using this script:

        #!/bin/bash
        # Arguments: first and last date (YYYY-MM-DD); the last date itself is not synced.
        BEGIN_DATE=$1
        LAST_DATE=$2
        LOCAL_DATABASE=audit_prod
        REMOTE_DATABASE=audit_prod
        REMOTE_HOST=m1-prod-vm-db-mongo02
        CURRENT_DATE=$BEGIN_DATE
        # Each collection is named after a date; copy them one day at a time.
        while [ "$CURRENT_DATE" != "$LAST_DATE" ]; do
          echo "Current date: $CURRENT_DATE"
          # Stream the dump over ssh straight into the standalone node on port 27018.
          mongodump --db="${LOCAL_DATABASE}" --collection="${CURRENT_DATE}" --archive | ssh "${REMOTE_HOST}" -T "mongorestore --archive --port=27018"
          echo "Finished syncing date: $CURRENT_DATE"
          CURRENT_DATE=$(date -d "${CURRENT_DATE} +1 day" +%F)
        done

      4. Synced the oplog collection from the healthy replica to the failed one:

        mongodump -d local -c oplog.rs -o /data/dump/oplog

        scp -rp /data/dump/ <username>@<hostname>:/data/dump/

        mongorestore -vvvv -d local --port=27018 /data/dump/

      5. Restored the initial config on the failed node.
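
      Before running a per-day sync like the script above, it may help to preview which date-named collections the loop would touch. The following is only a sketch: `list_sync_dates` is a hypothetical helper that is not part of the original script. It assumes GNU `date` and dates in YYYY-MM-DD form, and like the loop above it treats the end date as exclusive. It also rejects an end date earlier than the start date, because a `!=` comparison would otherwise never terminate.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original report: prints every date
# from BEGIN (inclusive) to LAST (exclusive), one per line.
# Assumes GNU date and YYYY-MM-DD input.
list_sync_dates() {
    local current last
    # Normalize both arguments; GNU date exits non-zero on an invalid date.
    current=$(date -d "$1" +%F) || return 1
    last=$(date -d "$2" +%F) || return 1

    # ISO dates sort lexicographically, so a string comparison is enough.
    if [[ "$current" > "$last" ]]; then
        echo "start date $current is after end date $last" >&2
        return 1
    fi

    while [[ "$current" != "$last" ]]; do
        echo "$current"
        current=$(date -d "$current +1 day" +%F)
    done
}
```

      For example, `list_sync_dates 2022-01-30 2022-02-02` prints 2022-01-30, 2022-01-31 and 2022-02-01, which are the collection names the sync loop would pass to mongodump for that range.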

      The logs captured during the failure are attached (logs.txt).

      Attachments

        1. logs.txt
          31 kB
        2. mongod.conf
          0.9 kB
        3. update.txt
          0.6 kB

          People

            edwin.zhou@mongodb.com Edwin Zhou
            anush.chinoian@gmail.com Anush Chinoian