Mongo cluster got stuck at "could not find member to sync from"


    • Type: Question
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Replication

      I have a mongo cluster with 5 nodes. The primary somehow "half-died": from outside, the cluster had no primary, but inside the cluster some nodes could still talk to it for a while before their connections failed too. After a secondary failed to connect to the primary, it tried to choose another member to sync from and could not. The cluster stayed stuck there until an engineer got involved and force-deleted/promoted nodes.

      Version: 3.4

      Cluster size: 5 nodes

      mongo config:

      systemLog:
        destination: file
        logAppend: true
        path: /var/log/mongodb/mongod.log
        quiet: true

      net:
        http:
          enabled: true

      # how the process runs
      processManagement:
        fork: true  # fork and run in background
        pidFilePath: /var/run/mongodb/mongod.pid  # location of pidfile

      storage:
        engine: 'wiredTiger'
        dbPath: /mnt/mongodb

      replication:
        replSetName: snapshot-catalog-mongo
        oplogSizeMB: 24000

      operationProfiling:
        slowOpThresholdMs: 10000
        mode: off
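For scale: the configured 24000 MB oplog bounds how long a disconnected secondary can fall behind before it must fully resync. A rough sketch of that window, assuming a hypothetical write rate (the 500 MB/h figure is illustrative, not measured from this cluster):

```python
def oplog_window_hours(oplog_size_mb: float, write_rate_mb_per_hour: float) -> float:
    """Rough oplog retention window: how many hours of writes fit in the
    oplog before the oldest entries are overwritten."""
    return oplog_size_mb / write_rate_mb_per_hour

# Configured oplog is 24000 MB; with an assumed 500 MB/h write rate:
print(oplog_window_hours(24000, 500))  # -> 48.0 hours
```

At that assumed rate, a secondary stuck without a sync source from ~1 AM would have roughly two days before falling off the oplog, which matches why manual intervention (rather than automatic resync) was needed here.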

      The mongo daemon runs as a systemd service. The unit file is configured as follows:

      [Unit]
      Description=MongoDB Database Server
      After=network.target
      Documentation=https://docs.mongodb.org/manual

      [Service]
      User=pure
      Group=pure
      Environment="OPTIONS=-f /etc/mongodb.conf"
      ExecStartPre=/opt/pure/setup-mongo.sh
      ExecStart=/usr/bin/mongod $OPTIONS
      PermissionsStartOnly=true
      PIDFile=/var/run/mongodb/mongod.pid
      Type=forking
      # file size
      LimitFSIZE=infinity
      # cpu time
      LimitCPU=infinity
      # virtual memory size
      LimitAS=infinity
      # open files
      LimitNOFILE=64000
      # processes/threads
      LimitNPROC=64000
      # locked memory
      LimitMEMLOCK=infinity
      # total threads (user+kernel)
      TasksMax=infinity
      TasksAccounting=false
      # Recommended limits for mongod as specified in
      # http://docs.mongodb.org/manual/reference/ulimit/#recommended-settings
      Restart=on-failure
      RestartSec=5

      [Install]
      WantedBy=multi-user.target
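The limit directives above can be confirmed from inside a running process. A minimal sketch using Python's standard resource module (check_limits is a hypothetical helper for illustration, not part of MongoDB or this deployment):

```python
import resource

def check_limits():
    """Return the current soft limits for the two resources the unit file
    caps at 64000: open files (LimitNOFILE) and processes (LimitNPROC)."""
    return {
        "open_files": resource.getrlimit(resource.RLIMIT_NOFILE)[0],
        "processes": resource.getrlimit(resource.RLIMIT_NPROC)[0],
    }

print(check_limits())
```

Running something like this inside the service's environment (e.g. via ExecStartPre) would verify that systemd actually applied the limits the unit file requests.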

      Timeline:

      1. [2019-01-16 ~1am] The primary node goes down. The secondary nodes do not realize this and choose to stay secondary, so the cluster cannot take any traffic.

      From outside, the cluster lost its primary at 00:55 AM.

      [Screenshot: replication lag in the 1-2 AM window]

      However, inside the mongo cluster, some nodes stayed in sync for a while; the last node lost its connection to the primary at 01:20 AM.

       

      Primary node's last log before crashing:

      2019-01-16T08:50:20.412+0000 I NETWORK [conn391702] received client metadata from 10.132.205.155:50350 conn391702: { driver: { name: "mongo-java-driver", version: "unknown" }, os: { type: "Linux", name: "Linux", architecture: "amd64", version: "4.15.0-1021-aws" }, platform: "Java/Oracle Corporation/1.8.0_191-8u191-b12-0ubuntu0.18.04.1-b12" }
      2019-01-16T08:50:25.410+0000 I NETWORK [conn391703] received client metadata from 10.132.211.83:54270 conn391703: { driver: { name: "PyMongo", version: "3.5.1" }, os: { type: "Linux", name: "Ubuntu 16.04 xenial", architecture: "x86_64", version: "4.4.0-1072-aws" }, platform: "CPython 2.7.14.final.0" }

      The secondary nodes' logs are attached below as log.pdf.

      Questions

      1. Why did the primary node crash?

      Its log only printed a lot of binary bytes (<0x00>) and nothing else. What happened?

      2. Why couldn't the secondary nodes elect a new primary?

      The secondaries keep saying they "could not find a member to sync from". Is the cluster in a strange state where the primary can still talk to the secondaries but the secondaries cannot talk back to the primary? In that case the secondaries would still see the primary as alive and not elect a new one, yet be unable to sync data from it.
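For intuition on question 2: a secondary picks its sync source from members it can reach whose oplog is ahead of its own. The sketch below is a deliberately simplified model of that selection, not MongoDB 3.4's actual algorithm (which also weighs ping times, chaining settings, and heartbeat state); it shows how a half-dead primary that is ahead of everyone but refuses sync connections leaves every secondary with no candidate:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Member:
    name: str
    optime: int        # last applied op, simplified to an integer
    reachable: bool    # can we open a sync connection to it?

def choose_sync_source(my_optime: int, members: List[Member]) -> Optional[str]:
    """Pick a reachable member that is ahead of us; returning None models
    the 'could not find member to sync from' state."""
    candidates = [m for m in members if m.reachable and m.optime > my_optime]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m.optime).name

# Half-dead primary: ahead of everyone, but unreachable for sync connections.
members = [
    Member("primary", 150, reachable=False),
    Member("secondary-1", 100, reachable=True),
    Member("secondary-2", 100, reachable=True),
]
print(choose_sync_source(100, members))  # -> None: stuck, as in the ticket
```

Under this model the secondaries are at the same optime as each other, so even though they can reach one another, no peer is ahead of them, and the only member that is ahead cannot be connected to, which reproduces the stuck state described above.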

        1. diagnostic.data.10.132.203.84.zip (37.14 MB)
        2. log.pdf (143 kB)
        3. mongod.10.132.203.84.log.zip (8.35 MB)
        4. Screen Shot 2019-02-06 at 9.59.07 AM.png (130 kB)

            Assignee:
            Eric Sedor
            Reporter:
            Hongkai Wu [X]