Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Duplicate
    • Affects Version/s: 2.0.3
    • Fix Version/s: None
    • Component/s: Replication/Pairing
    • Labels:
    • Environment:
      Linux 3.0.18
    • Backport:
      No
    • Operating System:
      Linux
    • Bug Type:
      Stability
    • # Replies:
      14
    • Last comment by Customer:
      true

      Description

      We newly initiated a replica set, but the would-be secondary never leaves the "RECOVERING" state: the mongod process is killed by the oom-killer in the middle of the resync (seemingly in its last step, while building secondary indexes, to be precise), and the resync starts from scratch every time.

      Journaling is turned on, and vm.overcommit_memory is set to 1, as suggested before.

      Right now, we are testing "echo -17 > /proc/`cat /var/run/mongodb.pid`/oom_adj" (and "swapoff -a"), but every trial takes hours.
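      Since each trial takes hours, it may help to confirm that the settings mentioned above are actually in effect before starting one. The following is a minimal Python sketch, assuming the PID file path /var/run/mongodb.pid from the command above. The helper names `read_int` and `oom_settings` and the `proc_root` parameter are hypothetical, introduced only for illustration.

```python
from pathlib import Path


def read_int(path):
    """Return the integer stored in a procfs-style file, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def oom_settings(pid_file="/var/run/mongodb.pid", proc_root="/proc"):
    """Collect the overcommit policy and the oom_adj value of the mongod process.

    pid_file is the path used in the report; proc_root is a hypothetical
    parameter so the function can be pointed at a test directory.
    """
    root = Path(proc_root)
    pid = read_int(pid_file)
    settings = {
        "pid": pid,
        # 1 means the kernel always allows overcommit.
        "overcommit_memory": read_int(root / "sys" / "vm" / "overcommit_memory"),
        "oom_adj": None,
    }
    if pid is not None:
        # -17 exempts the process from the oom-killer on kernels that honor oom_adj.
        settings["oom_adj"] = read_int(root / str(pid) / "oom_adj")
    return settings
```

      Calling `oom_settings()` with no arguments reads the live values. A `None` entry means the corresponding file could not be read, for example because mongod is not running or the PID file is stale.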

      The data size is 10x larger than physical memory, so it seems unlikely that simply doubling the RAM would fix the problem, especially as the oom-killer's heuristics are rather unpredictable.

      I'd like to know what triggers this failure, and what I should keep in mind.

      What should we do to get resync done?

        Issue Links

          Activity

          Siddharth Singh (Inactive)
          added a comment -

          Unable to reproduce so far. Tried with a ~3.5 GB index on a VM with 1 GB of memory. Will update as soon as I get a repro case.

          Siddharth Singh (Inactive)
          added a comment - edited

          No repro yet. Tried it with index sizes of 3.5, 9, and 21 GB on a 1 GB machine and didn't see anything. I will go ahead and close this one for now as "Unable to Reproduce". Please feel free to reopen if you come across a reproducible test case. Thanks for reporting this.

          Kenn Ejima
          added a comment -

          "swapoff -a" is only there to make those experiments succeed or fail quickly, that's all. I didn't want them to run for days or weeks while thrashing.

          What is your swap size? In our case, the swap size was 256 MB (the Linode default).

          Siddharth Singh (Inactive)
          added a comment -

          I had turned swap off as well, to try to repro it faster. Didn't see the repro though. It took a while for the secondaries to sync, but it completed successfully.

          Dan Pasette
          added a comment -

          This was found and fixed by SERVER-6414. Going into 2.0.7 and 2.2.0.


            People

            • Votes:
              1
              Watchers:
              2

              Dates

              • Created:
                Updated:
                Resolved:
                Days since reply:
                1 year, 40 weeks, 3 days ago
                Date of 1st Reply: