Core Server / SERVER-7507

Random mongos failure to contact whole cluster

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Incomplete
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Labels:
    • Environment:
      AWS, Ubuntu 12.04.1 LTS
      2x shards (each shard consists of 2x replicas and 1x arbiter)

      2x app servers (each running mongos)
      1x background worker (running mongos)
    • Operating System:
      Linux

      Description

      Hi,

      During routine operation of our mongo cluster, the mongos process on one of our app servers became unresponsive. We confirmed this by SSHing into the app server, starting the mongo shell, and running 'show dbs'.

      Attached is the mongos.log file covering the period from when the issue started until after mongos was manually restarted and recovered. The machine maintained full network connectivity during this time, and DNS names were resolving from the shell.

      During this time, the mongos.log files on the other app server and the background worker were clean (only entries for acquiring and unlocking the distributed lock).

      How can we prevent this from happening in the future? This kind of failure is critical for us, and I'm happy to help debug/diagnose it further.
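
      The manual check described above (open the mongo shell against the local mongos and list databases) could be automated as a watchdog. The following is only a minimal sketch, not something taken from this report: the function names, the 10-second timeout, the default host and port, and the exact --eval expression are all assumptions. It runs a command with a deadline and treats a hang, a missing binary, or a non-zero exit status as "unresponsive".

```python
import subprocess


def command_responds(argv, timeout_s=10.0):
    """Return True if argv exits with status 0 within timeout_s seconds.

    A hang (timeout), a missing executable, or a non-zero exit status
    all count as "not responding".
    """
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed if it exceeds this
        )
    except subprocess.TimeoutExpired:
        return False  # command hung, as the manual check would have
    except OSError:
        return False  # executable not found or not runnable
    return result.returncode == 0


def mongos_check_argv(host="localhost", port=27017):
    """Build a mongo shell command line that lists databases via mongos.

    Assumption: the legacy `mongo` shell is on PATH. The --eval script
    asks mongos for the database list and exits with status 1 if the
    reply is not ok.
    """
    script = "if (!db.adminCommand({listDatabases: 1}).ok) quit(1)"
    return [
        "mongo", "--quiet",
        "--host", host,
        "--port", str(port),
        "--eval", script,
    ]
```

      A cron job or monitoring agent on each app server could call command_responds(mongos_check_argv()) and raise an alert (or restart mongos) when it returns False. This only detects the condition; it does not explain or prevent it.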

        Attachments

        1. mongo_send_error.tar.gz (5.81 MB)
        2. mongos.log (61 kB)
        3. mongos-2.log (36 kB)

            People

            Assignee:
            renctan Randolph Tan
            Reporter:
            noizwaves noizwaves