Core Server / SERVER-17074

Sharded Replicaset - replicas fall behind (3.0.0-rc6)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Duplicate
    • Affects Version/s: 3.0.0-rc6
    • Fix Version/s: None
    • Component/s: Replication, WiredTiger
    • Labels:
      None
    • Environment:
      Centos 6
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      Linux
    • Steps To Reproduce:

      Start a sharded replica set with, say, 2 shards (we run 8), each shard a replica set with 1 primary and 1 secondary.
      Pump in 4k updates/sec (each update is a push/pop on a 4 KB doc).
      Watch the secondary show 0 updates/sec in mongostat while replication delay (via MMS) climbs. Occasionally a large burst of updates goes through the secondary, then stops again, but net replication delay always increases with time.

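      The push/pop update described above can be sketched as follows. This is a hedged reconstruction: the collection layout, the field name `events`, and the array cap of 100 are all assumptions (the report only says each update is a push/pop on a ~4 KB document), and it uses MongoDB's `$push` with `$slice` modifier as one idiomatic way to push while discarding old elements in a single update.

      ```python
      # Sketch of the repro workload's update shape (field names are assumptions).
      def make_push_pop_update(new_event, cap=100):
          """Build a MongoDB update document that appends one element to the
          'events' array and trims it to the most recent `cap` elements,
          keeping the document size roughly constant (~4 KB in the report)."""
          return {
              "$push": {
                  "events": {
                      "$each": [new_event],  # append the new element
                      "$slice": -cap,        # keep only the last `cap` elements
                  }
              }
          }

      update = make_push_pop_update({"ts": 1422302568, "payload": "x" * 64})
      # With pymongo this would be issued as:
      #   coll.update_one({"_id": doc_id}, update)
      ```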

      Description

      We're seeing our secondaries unable to keep up with the primary in a peculiar way.

      Previously we were on 2.6 and replication worked fine; nothing has changed since then except upgrading to 3.0.0-rc6.

      I see (via mongostat) primaries receiving approx. 4k updates/sec each, times 8 shards; secondaries show 0 updates/sec. I stop the secondary daemon, wipe its data directory, and restart. The resync starts and runs properly, catching up and going into 'SEC' state in mongostat. This lasts only a few seconds before updates/sec on the secondary drops to 0; the primary is still doing 4k updates/sec.
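      The replication delay reported by MMS can also be spot-checked directly from `replSetGetStatus` (`rs.status()`) output. A minimal sketch, assuming the standard `name`, `stateStr`, and `optimeDate` fields of that command's member documents; the sample member data below is made up for illustration:

      ```python
      from datetime import datetime

      def replication_lag_seconds(members):
          """Given replSetGetStatus-style member docs, return each secondary's
          lag in seconds behind the primary's last applied optime."""
          primary = next(m for m in members if m["stateStr"] == "PRIMARY")
          return {
              m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
              for m in members
              if m["stateStr"] == "SECONDARY"
          }

      # Hypothetical members: secondary 90 seconds behind the primary.
      members = [
          {"name": "shard1a:27017", "stateStr": "PRIMARY",
           "optimeDate": datetime(2015, 1, 26, 14, 3, 0)},
          {"name": "shard1b:27017", "stateStr": "SECONDARY",
           "optimeDate": datetime(2015, 1, 26, 14, 1, 30)},
      ]
      lag = replication_lag_seconds(members)  # {'shard1b:27017': 90.0}
      ```

      In the behavior described here, a working secondary would show this number shrinking after resync; instead it grows without bound.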

      Logs on the secondaries show many messages like these:

      2015-01-26T14:02:48.942-0600 I QUERY    [conn193] killcursors  keyUpdates:0 writeConflicts:0 numYields:0 11777ms
      2015-01-26T14:02:48.942-0600 I QUERY    [conn109] killcursors  keyUpdates:0 writeConflicts:0 numYields:0 11717ms
      2015-01-26T14:02:48.943-0600 I QUERY    [conn133] killcursors  keyUpdates:0 writeConflicts:0 numYields:0 11702ms
      2015-01-26T14:02:48.943-0600 I QUERY    [conn206] killcursors  keyUpdates:0 writeConflicts:0 numYields:0 11691ms
      2015-01-26T14:02:48.943-0600 I QUERY    [conn156] killcursors  keyUpdates:0 writeConflicts:0 numYields:0 11681ms
      2015-01-26T14:03:01.363-0600 I NETWORK  [conn218] end connection 10.235.67.65:18027 (113 connections now open)

      I've upgraded several times through rc4, rc5, and rc6, and am now running the nightly; all show the same behavior.

      Note this is a very write-intensive application. Data is stored on SSDs and journals on spinning disk, but moving the journals to SSD hasn't helped.

      People

      • Votes: 0
      • Watchers: 8