[SERVER-17074] Sharded Replicaset - replicas fall behind (3.0.0-rc6) Created: 27/Jan/15 Updated: 09/Jul/16 Resolved: 29/Jan/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, WiredTiger |
| Affects Version/s: | 3.0.0-rc6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kevin J. Rice | Assignee: | Unassigned |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | CentOS 6 |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | Linux |
| Steps To Reproduce: | Start a sharded replica set with, say, 2 shards (we have 8), each with 1 primary and 1 secondary; a sketch of wiring up such a setup follows this table. |
| Participants: | |
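The reproduction step above is terse, so here is a minimal pymongo sketch, not from the original report, of wiring up such a cluster. It assumes mongod (`--shardsvr --replSet`) and mongos processes are already running on the hypothetical example.net hosts, and the database, collection, and shard-key names are illustrative only.

```python
from pymongo import MongoClient

# Hypothetical addresses; each shard is a 2-member replica set (1 primary,
# 1 secondary once the election settles), matching the reproduction step.
SHARDS = {"shard0": ["s0a.example.net:27018", "s0b.example.net:27018"],
          "shard1": ["s1a.example.net:27018", "s1b.example.net:27018"]}
MONGOS = "mongos.example.net:27017"

# Initiate each shard's replica set from its first member.
for name, members in SHARDS.items():
    config = {"_id": name,
              "members": [{"_id": i, "host": h} for i, h in enumerate(members)]}
    MongoClient(members[0]).admin.command("replSetInitiate", config)

# Register the replica sets as shards and shard an example collection.
mongos = MongoClient(MONGOS)
for name, members in SHARDS.items():
    mongos.admin.command("addShard", "%s/%s" % (name, ",".join(members)))
mongos.admin.command("enableSharding", "testdb")
mongos.admin.command("shardCollection", "testdb.events", key={"user_id": 1})
```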
| Description |
|
We're seeing our replica set unable to keep up with the primary in a peculiar way. Previously we were on 2.6 and replication worked fine; nothing has changed since then except upgrading to 3.0.0-rc6. Via mongostat I see primaries getting approx. 4k updates/sec each, times 8 shards, while secondaries show 0 updates/sec. I stop the replica daemon, wipe the data directory, and restart. The resync starts and executes properly, catching up and going into 'SEC' mode in mongostat. This lasts only a few seconds before the updates/sec on the secondary drops to 0, while the primary is still doing 4k updates/sec. Logs on the secondaries show lots of messages of this kind:
I've upgraded several times through rc4, rc5, and rc6, and am now even running the nightly build; all show the same behavior. Note this is a very write-intensive application. Data is stored on SSDs and journals on spinning disk, but I've tried moving the journals to SSD and it hasn't helped. |
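The description measures the stall only through mongostat. As a rough illustration (not part of the original ticket), a sketch like the following, assuming direct connections to one member of each shard's replica set, would show how far each secondary trails its primary via replSetGetStatus:

```python
from pymongo import MongoClient

# Hypothetical hosts: one member of each shard's replica set.
SHARD_MEMBERS = ["s0a.example.net:27018", "s1a.example.net:27018"]

def replication_lag_seconds(host):
    """Seconds each secondary trails the primary, assuming a primary is up."""
    status = MongoClient(host).admin.command("replSetGetStatus")
    primary_optime = max(m["optimeDate"] for m in status["members"]
                         if m["stateStr"] == "PRIMARY")
    return {m["name"]: (primary_optime - m["optimeDate"]).total_seconds()
            for m in status["members"] if m["stateStr"] == "SECONDARY"}

for host in SHARD_MEMBERS:
    print(host, replication_lag_seconds(host))
```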
| Comments |
| Comment by Eliot Horowitz (Inactive) [ 29/Jan/15 ] |
|
I believe this is the same issue as in |
| Comment by Kevin J. Rice [ 27/Jan/15 ] |
|
I've found that I had a bunch of ancillary processes doing many more updates and inserts on this replica set, connecting via pymongo with read preference nearest. When I turned off these other processes, the replica set caught up. Note there were no long-running queries; I ran the standard kill-long-running-queries javascript to killOp() those, so it was just a bunch of queries each running under 1 minute. So: revise the reproduction steps to include many large-volume queries (read preference nearest) pulling back lots of records via a non-shard-key index while issuing shard-key updates, plus querying likewise and inserting new records. |
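A minimal sketch of that mixed workload, with hypothetical mongos address, database, collection, shard key (user_id), and non-shard-key indexed field (batch_id) that are not part of the ticket; the real setup ran many such processes concurrently:

```python
import random
from pymongo import MongoClient, ReadPreference

# Hypothetical mongos address and names; assumes testdb.events is sharded on
# user_id and has a secondary index on batch_id.
client = MongoClient("mongos.example.net:27017")
coll = client.testdb.get_collection("events",
                                    read_preference=ReadPreference.NEAREST)

while True:
    # Large read via the non-shard-key index, routed with readPreference
    # nearest, so it often lands on a secondary.
    batch = list(coll.find({"batch_id": random.randint(0, 100)}).limit(5000))

    # Concurrent write pressure on the primaries: shard-key updates plus
    # a steady stream of inserts.
    for doc in batch:
        coll.update_one({"user_id": doc["user_id"]},
                        {"$set": {"processed": True}})
    coll.insert_one({"user_id": random.randint(0, 10**6),
                     "batch_id": random.randint(0, 100),
                     "processed": False})
```

With read preference nearest the large scans can land on a secondary at the same moment it is trying to apply the primary's ~4k updates/sec, which is consistent with the observation that the secondaries caught up once these extra processes were stopped.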