Core Server / SERVER-7507

Random mongos failure to contact whole cluster

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Critical - P2
    • None
    • Affects Version/s: 2.2.0
    • Labels:
    • Environment:
      AWS, Ubuntu 12.04.1 LTS
      2x shards (each shard consists of 2x replicas and 1x arbiter)

      2x app servers (each running mongos)
      1x background worker (running mongos)
    • Linux


      During routine operation of our mongo cluster, the mongos process on one of our app servers became unresponsive (confirmed by ssh'ing to the app server, running mongo, and running 'show dbs').
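
      The manual check described above (open a mongo shell against the local mongos and list databases) could be automated as a watchdog probe. The following is only a minimal sketch under assumptions: the host, port, timeout, and the names MONGO_CMD and probe are hypothetical, and it assumes the mongo shell is on the PATH. It only detects hangs and failures to connect, not the underlying cause.

```python
import subprocess
import sys

# Hypothetical defaults: host, port, and the shell binary name are assumptions.
# listDatabases is the admin command behind the shell's 'show dbs' helper.
MONGO_CMD = [
    "mongo", "--quiet", "--host", "localhost", "--port", "27017",
    "--eval", "printjson(db.adminCommand('listDatabases'))",
]


def probe(cmd=MONGO_CMD, timeout_s=5.0):
    """Return True if cmd exits with status 0 within timeout_s seconds."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The command hung, which is how an unresponsive mongos would appear.
        print("probe timed out", file=sys.stderr)
        return False
    except OSError:
        # The binary is missing or could not be started.
        return False
    return result.returncode == 0
```

      A monitoring job on each host that runs mongos could call probe() periodically and alert, or restart mongos, when it returns False.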

      Attached is the mongos.log file from when the issue started, until after mongos was manually restarted and recovered. The machine maintained full network connectivity during this time, and DNS names were resolving in shell.

      During this time, the other app server and background worker show clean mongos.logs (just acquiring and unlocking the distributed lock).

      How can we prevent this happening in future? This kind of failure is critical for us, and I'm happy to help debug/diagnose it further.

        1. mongo_send_error.tar.gz
          5.81 MB
        2. mongos.log
          61 kB
        3. mongos-2.log
          36 kB

            randolph@mongodb.com Randolph Tan
            noizwaves noizwaves