[SERVER-27952] Replication fails under heavy load - Oplog timeout should be configurable Created: 08/Feb/17 Updated: 01/Mar/17 Resolved: 01/Mar/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.6.0, 3.4.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Michael Brenden | Assignee: | Unassigned |
| Resolution: | Duplicate | Votes: | 2 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
|
Renamed from https://jira.mongodb.org/browse/SERVER-19605.

FROM INITIAL REPORT

We just encountered a situation where all secondaries in two of our replica sets had ceased replication and were 1-2 days behind the primary. This appears to have been caused in part by the fact that the initial oplog query from SECONDARY->PRIMARY times out after 30 seconds, but the oplog query takes > 5 minutes to run. Some searching led me to this JIRA.

FROM SUBSEQUENT REPORT

This just bit us in one of our production replica sets. The primary was sufficiently busy, our oplog sufficiently large, and we inserted enough documents in a short enough period of time, that scanning the table to respond to the secondary started taking not only more than 30 seconds, but upwards of 200 seconds. Our secondary stopped syncing because of the receive timeouts and stayed that way until I compiled a custom mongod binary with tcp_timeout set to 600 seconds in oplogreader.h and ran that binary on the secondary, after which it happily started synchronizing with the primary again.

FROM SUBSEQUENT REPORT

We have been hitting this problem continually for the last several months. Our cluster is currently 1+3 on 2.6.12, running on the biggest, baddest AWS IO-optimized instances. Our data size is about 3 TB per cluster, ingesting about 220-250K records/min; average record size is 500 bytes (range 200-1000 bytes).

FROM SUBSEQUENT REPORT

A/ Our "ingestion" into the cluster includes inserts, updates AND deletes (yes, we are working to rework this part and not do a search and delete).

FROM SUBSEQUENT REPORT

Shut down all failing secondaries.
}. Last fetched optime (with hash): { ts: Timestamp 1486533979000|316, t: 7 }[-4870081738431509318]. Restarts remaining: 3 |
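The ">5 minute" initial oplog query can be timed directly. A minimal sketch of the kind of query the secondary's oplog reader issues, assuming pymongo, a placeholder primary hostname, and the last-fetched timestamp taken from the log excerpt above:

```python
# Rough timing of the kind of initial oplog query a secondary issues against the
# primary: find the first oplog entry at or after the last fetched optime.
# Assumptions: pymongo is installed, "primary-host" is a placeholder, and the
# Timestamp values come from the "Last fetched optime" log excerpt above.
import time
from bson.timestamp import Timestamp
from pymongo import MongoClient

client = MongoClient("mongodb://primary-host:27017")
oplog = client.local["oplog.rs"]
last_fetched = Timestamp(1486533979, 316)

start = time.time()
cursor = oplog.find({"ts": {"$gte": last_fetched}}).sort("$natural", 1).limit(1)
doc = next(iter(cursor), None)
print("first matching oplog entry returned after %.1f s" % (time.time() - start))
```

On an idle primary this returns almost immediately; under the write load described above, the time printed here is effectively what has to fit inside the 30-second socket timeout.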
| Comments |
| Comment by Spencer Brody (Inactive) [ 01/Mar/17 ] |
|
Hi, |
| Comment by Michael Brenden [ 28/Feb/17 ] |
|
Yes, but only our "weak" secondary sees load average going to ~4 (4-core CPU), and only during the mass inserts. From above: "For the first few seconds after the start of the mass inserts (and during the course of the mass inserts, which take several minutes), the strong and weak secondaries fall behind by no more than 15 seconds and sporadically catch up. However, after about 30 seconds, the weak secondary begins to fall behind and rarely catches up (though we have seen it catch up a few times); the strong secondary keeps falling behind by 15 seconds or so and then catching up, repeat, repeat, and ends up fully caught up after the mass insert of up to 2 million documents. CPU load on the PRIMARY (32 cores) never really changes during this entire time; CPU load on the strong secondaries also never changes much; but CPU load on the weak secondary goes to 4 or higher (4 cores), then settles down after the mass insert on the PRIMARY ends and as failures continue to be logged on the weak secondary." |
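The fall-behind/catch-up cycle described here is easy to watch by sampling replication lag while the mass insert runs. A minimal sketch, assuming pymongo, a placeholder connection string, and the member/field names reported by replSetGetStatus on 2.6/3.4-era servers:

```python
# Sample replication lag every 5 seconds by comparing each SECONDARY's last
# applied optime to the PRIMARY's, as reported by replSetGetStatus.
# The connection string is a placeholder; run this while the mass insert is active.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

while True:
    status = client.admin.command("replSetGetStatus")
    primary = next(m for m in status["members"] if m["stateStr"] == "PRIMARY")
    lags = {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in status["members"]
        if m["stateStr"] == "SECONDARY"
    }
    print(time.strftime("%H:%M:%S"), lags)
    time.sleep(5)
```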
| Comment by Søren Laursen [ 28/Feb/17 ] |
|
I have these problems on a sharded cluster consisting of 3 shards, each backed by a 3-server replica set. We have not loaded data into the cluster for the last 7 days, but some of the replica set servers are starting to hit the wall with a load of 5 on a lot of queries like:

numYields:0 reslen:264 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 105747798 }}, Database: { acquireCount: { R: 1 } } } protocol:op_query 105749ms
numYields:0 reslen:293 locks:{ Global: { acquireCount: { r: 8 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 386168866 }}, Database: { acquireCount: { r: 3 } } } protocol:op_query 386173ms
, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 35148776 }}, Database: { acquireCount: { r: 1 }}, Collection: { acquireCount: { r: 1 } } } protocol:op_query 35150ms
numYields:0 reslen:293 locks:{ Global: { acquireCount: { r: 8 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 86066703 }}, Database: { acquireCount: { r: 3 } } } protocol:op_query 86071ms
, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 346507407 }}, Database: { acquireCount: { r: 1 }}, Collection: { acquireCount: { r: 1 } } } protocol:op_query 346507ms

10.230.80.203 is the mongos server, and there is nothing to see in its log. I have attached a screenshot of the Nagios monitoring.

Do you also see an increased load on the servers?

Best regards
Søren |
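The timeAcquiringMicros figures in those lines already account for the stalls: the lock waits run to hundreds of seconds, far beyond the 30-second oplog socket timeout. A small sketch (plain regex over the quoted log text, no MongoDB connection assumed) that converts them to seconds:

```python
# Extract timeAcquiringMicros values from slow-query log lines like the ones
# quoted above and convert them to seconds, for comparison against the
# 30-second oplog socket timeout. The sample line is copied from the comment.
import re

line = ('numYields:0 reslen:293 locks:{ Global: { acquireCount: { r: 8 }, '
        'acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 386168866 }}, '
        'Database: { acquireCount: { r: 3 } } } protocol:op_query 386173ms')

for micros in re.findall(r"timeAcquiringMicros: \{ \w: (\d+)", line):
    print("waited %.0f s acquiring a lock" % (int(micros) / 1e6))  # -> waited 386 s
```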
| Comment by Michael Brenden [ 09/Feb/17 ] |
|
We are also seeing this pattern in the PRIMARY log – 10.8.0.10 is the IP of a 'weak' slave (one that is prone to falling behind after mass del/insert batch, and rarely (almost never) catching up) – pattern repeats every 20 seconds:

(1) 2017-02-09T01:08:45.061-0600 I NETWORK [thread1] connection accepted from 10.8.0.10:36848 #28738 (1 connection now open)
(2) 2017-02-09T01:08:45.061-0600 I NETWORK [conn28738] received client metadata from 10.8.0.10:36848 conn28738: { driver: { name: "NetworkInterfaceASIO-RS", version: "3.4.2" }, os: { type: "Linux", name: "PRETTY_NAME="deb7"", architecture: "x86_64", version: "Kernel 3.2.0-4-amd64" }}
(3) 2017-02-09T01:09:05.126-0600 I - [conn28738] AssertionException handling request, closing client connection: 6 socket exception [SEND_ERROR] for 10.8.0.10:36848
(4) 2017-02-09T01:09:05.126-0600 I - [conn28738] end connection 10.8.0.10:36848 (18 connections now open) |
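Lines (1) and (3) can be reduced to a connection lifetime, which matches the ~20-second pattern: the connection from the weak secondary is accepted and then dropped with SEND_ERROR about 20 seconds later. A small sketch (plain regex over log lines in the format quoted above, no server access assumed):

```python
# Given PRIMARY log lines in the format quoted above, report how long each
# connection from the weak secondary lived before the SEND_ERROR closed it.
# The two sample lines are copied from the comment; only deltas are computed,
# so the timezone offset in the timestamp is ignored.
import re
from datetime import datetime

log = """2017-02-09T01:08:45.061-0600 I NETWORK [thread1] connection accepted from 10.8.0.10:36848 #28738 (1 connection now open)
2017-02-09T01:09:05.126-0600 I - [conn28738] AssertionException handling request, closing client connection: 6 socket exception [SEND_ERROR] for 10.8.0.10:36848"""

def ts(line):
    # first token is the timestamp; drop the "-0600" offset before parsing
    return datetime.strptime(line.split()[0][:23], "%Y-%m-%dT%H:%M:%S.%f")

opened = {}
for line in log.splitlines():
    m = re.search(r"connection accepted from \S+ #(\d+)", line)
    if m:
        opened[m.group(1)] = ts(line)
    m = re.search(r"\[conn(\d+)\].*SEND_ERROR", line)
    if m and m.group(1) in opened:
        delta = ts(line) - opened[m.group(1)]
        print("conn%s closed after %.1f s" % (m.group(1), delta.total_seconds()))  # ~20.1 s
```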