[SERVER-17074] Sharded Replicaset - replicas fall behind (3.0.0-rc6) Created: 27/Jan/15  Updated: 09/Jul/16  Resolved: 29/Jan/15

Status: Closed
Project: Core Server
Component/s: Replication, WiredTiger
Affects Version/s: 3.0.0-rc6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kevin J. Rice Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS 6


Issue Links:
Duplicate
duplicates SERVER-16921 WT oplog bottleneck on secondary Closed
Backwards Compatibility: Fully Compatible
Operating System: Linux
Steps To Reproduce:

Start a sharded replica set with, say, 2 shards (we have 8), each shard a replica set with 1 primary and 1 secondary.
Pump in 4k updates/sec (each update is a push/pop on a 4 KB doc); a load-generation sketch in pymongo follows these steps.
Watch the secondaries show 0 updates/sec in mongostat and the replication delay (via MMS) climb. Occasionally large bursts of updates will go through a secondary, then stop again, but the net replication delay always increases over time.
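For illustration, here is a minimal load-generation sketch of the push/pop workload described above, using the pymongo 3.x API. The connection string, database/collection names, document shape, and pacing are assumptions for illustration, not taken from the original report (which, given the date, likely used pymongo 2.x).

import time
from pymongo import MongoClient

# Placeholder mongos address and namespace; point at your own sharded collection.
client = MongoClient("mongodb://mongos-host:27017")
coll = client["testdb"]["events"]

PAYLOAD = "x" * 4096          # roughly 4 KB of filler per pushed element
TARGET_OPS_PER_SEC = 4000     # approximate update rate from the report

i = 0
while True:
    start = time.time()
    for _ in range(TARGET_OPS_PER_SEC):
        # Each update pushes a ~4 KB element and pops the oldest one,
        # keeping each document's size roughly constant (push/pop pattern).
        coll.update_one(
            {"_id": i % 1000},
            {"$push": {"items": PAYLOAD}, "$pop": {"items": -1}},
            upsert=True,
        )
        i += 1
    # Crude pacing: sleep away whatever is left of the one-second budget.
    elapsed = time.time() - start
    if elapsed < 1.0:
        time.sleep(1.0 - elapsed)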

Participants:

Description

We're seeing our replica set secondaries unable to keep up with the primaries in a peculiar way.

Previously we were on 2.6 and replication worked fine; nothing has changed since then except upgrading to 3.0.0-rc6.

I see (via mongostat) each of the 8 primaries getting approx. 4k updates/sec, while the secondaries show 0 updates/sec. If I stop the secondary daemon, wipe its data directory, and restart it, the resync starts and runs properly, catching up and going into 'SEC' mode in mongostat. This lasts only a few seconds before the updates/sec on the secondary drops back to 0, while the primary is still doing 4k updates/sec.

Logs on the secondaries show lots of messages like these:

2015-01-26T14:02:48.942-0600 I QUERY    [conn193] killcursors  keyUpdates:0 writeConflicts:0 numYields:0 11777ms
2015-01-26T14:02:48.942-0600 I QUERY    [conn109] killcursors  keyUpdates:0 writeConflicts:0 numYields:0 11717ms
2015-01-26T14:02:48.943-0600 I QUERY    [conn133] killcursors  keyUpdates:0 writeConflicts:0 numYields:0 11702ms
2015-01-26T14:02:48.943-0600 I QUERY    [conn206] killcursors  keyUpdates:0 writeConflicts:0 numYields:0 11691ms
2015-01-26T14:02:48.943-0600 I QUERY    [conn156] killcursors  keyUpdates:0 writeConflicts:0 numYields:0 11681ms
2015-01-26T14:03:01.363-0600 I NETWORK  [conn218] end connection 10.235.67.65:18027 (113 connections now open)

I've upgraded several times through rc4, rc5, and rc6, and am now even running the nightly build; all show the same behavior.

Note this is a very write-intensive application. Data is stored on SSDs and journals on spinning disk, but I've tried moving the journals to SSD and it hasn't helped.



Comments
Comment by Eliot Horowitz (Inactive) [ 29/Jan/15 ]

I believe this is the same issue as SERVER-16921, which was fixed in rc7.
Please note that to get the full fix you'll need to resync the WiredTiger node or run a --repair.
Please let us know if that doesn't resolve it.
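For reference, the repair can be run against a stopped node by pointing mongod at its data directory; the path below is just a placeholder:

# Run only while the node is shut down; replace /data/db with the node's actual dbpath.
mongod --dbpath /data/db --repair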

Comment by Kevin J. Rice [ 27/Jan/15 ]

I've found that I had a bunch of ancillary processes doing many more updates and inserts on this replica set, connecting via pymongo with read preference 'nearest'.
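For context, connecting with a 'nearest' read preference in pymongo looks roughly like the sketch below (pymongo 3.x API; host names and database name are placeholders). With this setting, reads may be routed to secondaries, adding query load to the nodes that are also trying to apply the oplog.

from pymongo import MongoClient, ReadPreference

# Placeholder hosts; with 'nearest', reads may go to whichever member has the lowest ping time.
client = MongoClient(
    "mongodb://host1:27017,host2:27017/?replicaSet=rs0",
    readPreference="nearest",
)

# A per-database read preference can also be set explicitly:
db = client.get_database("testdb", read_preference=ReadPreference.NEAREST)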

When I turned off these other processes, the replica set caught up. Note these were not long-running queries; I ran the standard kill-long-running-queries JavaScript to killOp() those. So it was just a bunch of queries each running under 1 minute.
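The kill-long-running-queries script itself isn't attached; a rough pymongo equivalent is sketched below. It assumes pymongo 2.x/3.x (Database.current_op() was removed in pymongo 4.0) and a server that accepts the killOp admin command; the threshold and connection string are placeholders.

from pymongo import MongoClient

client = MongoClient("mongodb://host1:27017")
MAX_SECS = 60  # kill anything that has been running longer than this

# currentOp-style snapshot of in-progress operations.
ops = client.admin.current_op().get("inprog", [])
for op in ops:
    if op.get("secs_running", 0) > MAX_SECS and op.get("op") == "query":
        # Sends {killOp: 1, op: <opid>} to the admin database.
        client.admin.command("killOp", op=op["opid"])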

So: revise the reproduction steps to include many large-volume queries (read preference: nearest) pulling back lots of records via a non-shard-key index while issuing updates by shard key, plus similar queries that insert new records.
