Core Server / SERVER-17074

Sharded Replicaset - replicas fall behind (3.0.0-rc6)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Duplicate
    • Affects Version/s: 3.0.0-rc6
    • Fix Version/s: None
    • Component/s: Replication, WiredTiger
    • Labels:
      None
    • Environment:
      Centos 6
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      Linux
    • Steps To Reproduce:

      Start a sharded replica set with, say, 2 shards (we run 8), each shard a replica set with 1 primary and 1 secondary.
      Pump in 4k updates/sec (each update is a push/pop on a 4 KB doc).
      Watch the secondary show 0 updates/sec in mongostat while replication delay (via MMS) climbs. Occasionally a large burst of updates goes through the secondary, then stops again, but net replication delay always increases with time.

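      The push/pop update described above can be sketched as follows. This is a hedged reconstruction: the collection layout, the field name `events`, and the array cap of 100 are all assumptions (the report only says each update is a push/pop on a ~4 KB document), and it uses MongoDB's `$push` with `$slice` modifier as one idiomatic way to push while discarding old elements in a single update.

      ```python
      # Sketch of the repro workload's update shape (field names are assumptions).
      def make_push_pop_update(new_event, cap=100):
          """Build a MongoDB update document that appends one element to the
          'events' array and trims it to the most recent `cap` elements,
          keeping the document size roughly constant (~4 KB in the report)."""
          return {
              "$push": {
                  "events": {
                      "$each": [new_event],  # append the new element
                      "$slice": -cap,        # keep only the last `cap` elements
                  }
              }
          }

      update = make_push_pop_update({"ts": 1422302568, "payload": "x" * 64})
      # With pymongo this would be issued as:
      #   coll.update_one({"_id": doc_id}, update)
      ```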

      Description

      We're seeing our secondaries unable to keep up with the primary in a peculiar way.

      Previously we were on 2.6 and replication worked fine; nothing has changed since then except upgrading to 3.0.0-rc6.

      I see (via mongostat) primaries receiving approx. 4k updates/sec each, times 8 shards; secondaries show 0 updates/sec. I stop the secondary daemon, wipe its data directory, and restart. The resync starts and runs properly, catching up and going into 'SEC' state in mongostat. This lasts only a few seconds before updates/sec on the secondary drops to 0; the primary is still doing 4k updates/sec.
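      The replication delay reported by MMS can also be spot-checked directly from `replSetGetStatus` (`rs.status()`) output. A minimal sketch, assuming the standard `name`, `stateStr`, and `optimeDate` fields of that command's member documents; the sample member data below is made up for illustration:

      ```python
      from datetime import datetime

      def replication_lag_seconds(members):
          """Given replSetGetStatus-style member docs, return each secondary's
          lag in seconds behind the primary's last applied optime."""
          primary = next(m for m in members if m["stateStr"] == "PRIMARY")
          return {
              m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
              for m in members
              if m["stateStr"] == "SECONDARY"
          }

      # Hypothetical members: secondary 90 seconds behind the primary.
      members = [
          {"name": "shard1a:27017", "stateStr": "PRIMARY",
           "optimeDate": datetime(2015, 1, 26, 14, 3, 0)},
          {"name": "shard1b:27017", "stateStr": "SECONDARY",
           "optimeDate": datetime(2015, 1, 26, 14, 1, 30)},
      ]
      lag = replication_lag_seconds(members)  # {'shard1b:27017': 90.0}
      ```

      In the behavior described here, a working secondary would show this number shrinking after resync; instead it grows without bound.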

      Logs on the secondaries show many messages like these:

      2015-01-26T14:02:48.942-0600 I QUERY    [conn193] killcursors  keyUpdates:0 writeConflicts:0 numYields:0 11777ms
      2015-01-26T14:02:48.942-0600 I QUERY    [conn109] killcursors  keyUpdates:0 writeConflicts:0 numYields:0 11717ms
      2015-01-26T14:02:48.943-0600 I QUERY    [conn133] killcursors  keyUpdates:0 writeConflicts:0 numYields:0 11702ms
      2015-01-26T14:02:48.943-0600 I QUERY    [conn206] killcursors  keyUpdates:0 writeConflicts:0 numYields:0 11691ms
      2015-01-26T14:02:48.943-0600 I QUERY    [conn156] killcursors  keyUpdates:0 writeConflicts:0 numYields:0 11681ms
      2015-01-26T14:03:01.363-0600 I NETWORK  [conn218] end connection 10.235.67.65:18027 (113 connections now open)

      I've upgraded several times through rc4, rc5, and rc6, and am now running the nightly; all show the same behavior.

      Note this is a very write-intensive application. Data is stored on SSDs and journals on spinning disk, but moving the journals to SSD hasn't helped.

      People

      • Votes: 0
      • Watchers: 8