[SERVER-27952] Replication fails under heavy load - Oplog timeout should be configurable Created: 08/Feb/17  Updated: 01/Mar/17  Resolved: 01/Mar/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.6.0, 3.4.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Michael Brenden Assignee: Unassigned
Resolution: Duplicate Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File screenshot-1.png    
Issue Links:
Duplicate
duplicates SERVER-19605 Oplog timeout should be configurable Closed
Operating System: ALL
Participants:

 Description   

Renaming of https://jira.mongodb.org/browse/SERVER-19605

FROM INITIAL REPORT

We just encountered a situation where all secondaries in two of our replica sets had ceased replication and were 1-2 days behind the primary. This appears to have been caused in part by the fact that the initial oplog query from SECONDARY->PRIMARY times out after 30 seconds, while the oplog query itself takes more than 5 minutes to run. Some searching led me to SERVER-6733, where the timeout was reduced from 10 minutes to 30 seconds.
As a workaround, we are building a custom binary with an increased oplog timeout so that the initial oplog query is allowed to complete and so our secondaries have a chance to catch up.
Ideally, this value would be configurable with a flag or configuration option to avoid the need to recompile, and to allow users to customize the timeout for their particular situation.
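For anyone trying to confirm they are hitting this, here is a rough diagnostic sketch (Python/PyMongo; the hostname and starting timestamp are placeholders) that times the kind of find on local.oplog.rs a syncing secondary issues. It only approximates the server's behavior; it is not the internal code path:

# Time an oplog find on the primary, starting from a (placeholder) last
# applied optime, to see whether it would blow past the 30-second timeout.
import time

from bson.timestamp import Timestamp
from pymongo import MongoClient

primary = MongoClient("mongodb://primary.example.net:27017")

last_applied = Timestamp(1486464780, 1)  # placeholder for the secondary's last applied optime

start = time.time()
cursor = primary.local["oplog.rs"].find({"ts": {"$gte": last_applied}}).sort("$natural", 1).limit(1)
first_doc = next(cursor, None)
elapsed = time.time() - start

print("found entry: %s, returned in %.1f s" % (first_doc is not None, elapsed))
if elapsed > 30:
    print("this query would not complete within the 30-second timeout")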

FROM SUBSEQUENT REPORT

This just bit us in one of our production replica sets. The primary was sufficiently busy, our oplog sufficiently large, and we inserted enough documents in a short enough period of time that scanning the oplog to respond to the secondary started taking not just more than 30 seconds, but upwards of 200 seconds. Our secondary stopped syncing because of the receive timeouts and stayed that way until I compiled a custom mongod binary with tcp_timeout set to 600 seconds in oplogreader.h and ran that binary on the secondary, after which it happily started synchronizing with the primary again.

FROM SUBSEQUENT REPORT

We have been hitting this problem continually for the last several months. Our cluster is currently 1+3, on 2.6.12, running on the largest AWS I/O-optimized instances. Our data size is about 3 TB per cluster, ingesting about 220-250K records/min; average record size is 500 bytes (range 200-1000 bytes).
As Jonathan mentions above, ours is also a pretty busy cluster with high ingestion rates, and our oplog is sized to match. What we find is the following:
1/ The cluster is happy and replicating with lag < 5 seconds.
2/ A sudden burst of ingestion occurs - roughly 1.1-1.5x the steady-state rate.
3/ getMore requests are delayed and their response times jump above 30 seconds. This causes the secondaries to hit the read timeout and start afresh.
4/ Starting afresh unfortunately seems to trigger a complete oplog scan on the primary, which takes on the order of minutes given the ingestion rate. This query also times out on the secondaries after 30 seconds.
5/ Each secondary repeatedly times out and issues a new query every 30 seconds. The primary accumulates these queries and gets slower and slower.
At that point, unless ingestion is stopped entirely, there is no way to recover. We ended up with a huge number of queries pending on the primary; we had to stop the entire ingestion pipeline and kill all the long-running oplog queries before things started back up - but the cluster was 18 hours behind.
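The "kill all the long-running oplog queries" step above can be scripted; a minimal sketch of the same idea in Python/PyMongo follows. It assumes a server version on which the currentOp and killOp admin commands are available (the equivalent shell helpers are db.currentOp() and db.killOp()); the hostname and threshold are placeholders:

# Find long-running operations against local.oplog.rs on the primary and kill them.
from pymongo import MongoClient

primary = MongoClient("mongodb://primary.example.net:27017")
THRESHOLD_SECS = 120  # treat anything slower than this as stuck (placeholder)

for op in primary.admin.command("currentOp")["inprog"]:
    if op.get("ns") == "local.oplog.rs" and op.get("secs_running", 0) > THRESHOLD_SECS:
        print("killing opid %s after %s s" % (op["opid"], op["secs_running"]))
        primary.admin.command("killOp", op=op["opid"])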
Observations/Questions:
1/ The recovery step a secondary takes when a getMore request fails seems to cause even more load on the primary. There must be design choices available here to handle these errors more gracefully.
2/ Oplog full-scan queries kept increasing in duration in the above state (while ingestion was still happening). Is there some tuning that can be done to service these queries faster, since they seem to be critical for a secondary to get started?
3/ Is there any reason why the oplog queries from secondary to primary don't have a query-level timeout (e.g. maxTimeMS in 2.6.x; see the small illustration below) when the underlying connection read timeout is 30 seconds?
In the meantime we will follow Jonathan's approach of bumping the hard-coded timeout to get more mileage.
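For clarity, the "query-level timeout" in 3/ refers to the maxTimeMS option that can be attached to an individual operation so the server itself aborts it; a trivial Python/PyMongo illustration, with the hostname and collection as placeholders:

# Ask the server to abort this particular query after 60 seconds of execution,
# independently of the client's socket-level read timeout.
from pymongo import MongoClient

client = MongoClient("mongodb://primary.example.net:27017")
docs = list(client["testdb"]["mycoll"].find({}).max_time_ms(60000))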

FROM SUBSEQUENT REPORT

A/ Our "ingestion" in to the cluster includes inserts, updates AND deletes (yes, we are working to rework this part and not do a search and delete).
B/ When the sum of records (or documents) added/changed/deleted crosses around 370K/min, it alters the getmore responses so much so that it jumps past 30 seconds.
C/ In the most recent event, with inserts and updates running around 150K/min, a delete operation deleted about 650K records in 132 seconds. That was enough to push getmore response over the 30 second limit.
In the past sudden surge of inserts also have triggered similar behavior. Once this happened, with deletes continuing for another 10-15 minutes (we have a monitoring script that stops deletes if replication lag increases beyond a threshold) it was enough for oplog full scan queries to take a long time and from there everything went downhill even though all the deletes had stopped and only inserts were happening at a constant clip.
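The lag-based circuit breaker mentioned above can be as simple as polling replSetGetStatus; here is a hedged sketch of that idea in Python/PyMongo (not the actual script), with the hostname, threshold, and pause/resume hooks as placeholders:

# Pause the delete workload while replication lag exceeds a threshold.
import time

from pymongo import MongoClient

client = MongoClient("mongodb://primary.example.net:27017")
LAG_THRESHOLD_SECS = 60  # placeholder threshold

def max_replication_lag_secs():
    members = client.admin.command("replSetGetStatus")["members"]
    # Assumes a primary currently exists in the set.
    primary_optime = next(m["optimeDate"] for m in members if m["stateStr"] == "PRIMARY")
    secondary_optimes = [m["optimeDate"] for m in members if m["stateStr"] == "SECONDARY"]
    if not secondary_optimes:
        return 0.0
    return max((primary_optime - t).total_seconds() for t in secondary_optimes)

while True:
    lag = max_replication_lag_secs()
    if lag > LAG_THRESHOLD_SECS:
        print("replication lag %.0f s exceeds %d s, pausing deletes" % (lag, LAG_THRESHOLD_SECS))
        # pause_deletes()   # placeholder hook into the ingestion pipeline
    else:
        pass                # resume_deletes()  # placeholder hook
    time.sleep(10)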

FROM SUBSEQUENT REPORT

Shut down all failing secondaries.
Flushed one "weak" secondary and did full resync.
On weak secondary: Moved all data indexes to another drive. Also moved local and journal dirs to another drive.
Kept "strong" (non-failing) secondaries running.
printSlaveReplicationInfo() on PRIMARY shows this "weak" secondary and all strong secondaries are 0 seconds behind PRIMARY.
Small operations (delete, update, insert) work fine - hundreds or thousands of them, over hours, no problem.
So, everything looks fine so far.
Truncated a collection with over 100k documents, then immediately ran a mass document insert (batched 100 documents at a time), inserting maybe 500 to 1,000 documents per second on the PRIMARY. This is when the trouble always begins.
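For reference, a minimal reproduction sketch (Python/PyMongo) of that load pattern, with the hostname, collection name, document shape, and counts as placeholders rather than the real workload:

# Drop/recreate a collection, then bulk-insert documents in batches of 100.
from pymongo import MongoClient

client = MongoClient("mongodb://primary.example.net:27017")
coll = client["testdb"]["mass_insert_repro"]

coll.drop()  # stand-in for the "truncate" step

BATCH_SIZE = 100
TOTAL_DOCS = 2000000  # up to ~2 million documents, as described below

batch = []
for i in range(TOTAL_DOCS):
    batch.append({"seq": i, "payload": "x" * 500})  # placeholder ~500-byte documents
    if len(batch) == BATCH_SIZE:
        coll.insert_many(batch)
        batch = []
if batch:
    coll.insert_many(batch)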
For the first few seconds after the mass inserts start (and during the course of the inserts, which take several minutes), both the strong and weak secondaries fall no more than about 15 seconds behind and sporadically catch up. After about 30 seconds, however, the weak secondary begins to fall behind and rarely catches up (though we have seen it catch up a few times); the strong secondaries keep falling about 15 seconds behind, catching up, and repeating, and end up fully caught up after a mass insert of up to 2 million documents. CPU load on the PRIMARY (32 cores) never really changes during this entire time, and CPU load on the strong secondaries also changes little; but CPU load on the weak secondary goes to 4 or higher (4 cores) and then settles down after the mass insert on the PRIMARY ends, while failures continue to be logged on the weak secondary.
Errors logged on the weak secondary (repeating every 10 seconds):
[replication-2] Restarting oplog query due to error: ExceededTimeLimit: Operation timed out, request was RemoteCommand 13296 - target:10.11.22.33:27017 db:local expDate:2017-02-08T01:25:02.627-0500 cmd:{ getMore: 15975151664, collection: "oplog.rs", maxTimeMS: 5000, term: 7, lastKnownCommittedOpTime: { ts: Timestamp 1486464780000|35813, t: 7 } }. Last fetched optime (with hash): { ts: Timestamp 1486533979000|316, t: 7 }[-4870081738431509318]. Restarts remaining: 3



 Comments   
Comment by Spencer Brody (Inactive) [ 01/Mar/17 ]

Hi,
The feature request for making the oplog query timeout configurable has already been filed as SERVER-19605, therefore I am going to close this ticket as a duplicate of that one. There seems to be a separate issue described on this ticket which is that some of your secondaries are able to keep up fine while others are not. Investigating the difference in behavior between your secondaries would be a more involved discussion and would be better conducted in a MongoDB support channel such as the mongodb-user group or Stack Overflow with the mongodb tag. See also our Technical Support page for additional support resources.

Comment by Michael Brenden [ 28/Feb/17 ]

Yes, but only our "weak" secondary sees its load average go to ~4 (on a 4-core CPU), and only during the mass inserts.

From above: "For first few seconds after start of mass inserts (and during the course of mass inserts, which takes several minutes), strong and weak secondaries fall behind no longer than 15 seconds and sporadically catch up. However, after about 30 seconds, the weak secondary begins to fall behind and rarely catches up (though we have seen it catch up a few times); strong secondary keeps falling behind by 15 seconds or so, and then catches up, repeat, repeat, ends up all caught up after mass insert, of up to 2 million documents. CPU load on PRIMARY (32 cores) never really changes during this entire time; CPU load on strong secondaries also never changes much; but CPU load on weak secondary goes to 4 or higher (4 cores) then settles down, after mass insert on PRIMARY ends and as failures continue to log into weak secondary."

Comment by Søren Laursen [ 28/Feb/17 ]

I am seeing these problems on a sharded cluster consisting of 3 shards, each backed by a replica set of 3 servers.

We have not loaded data into the cluster for the last 7 days, but some of the replica set servers are starting to hit the wall with a load of 5, logging a lot of:
2017-02-28T09:50:02.747+0100 I COMMAND [conn15345] command ais-arctic command: dbStats { dbstats: 1 } numYields:0 reslen:264 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 105747798 } }, Database: { acquireCount: { R: 1 } } } protocol:op_query 105749ms
2017-02-28T09:50:02.747+0100 I COMMAND [conn15315] command admin.$cmd command: listDatabases { listDatabases: 1 } numYields:0 reslen:293 locks:{ Global: { acquireCount: { r: 8 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 386168866 } }, Database: { acquireCount: { r: 3 } } } protocol:op_query 386173ms
2017-02-28T09:50:02.747+0100 I - [conn15315] AssertionException handling request, closing client connection: 6 socket exception [SEND_ERROR] for 10.230.80.203:20464
2017-02-28T09:50:02.747+0100 I COMMAND [conn15364] command local.system.replset command: find { find: "system.replset", filter: {}, limit: 1, singleBatch: true } planSummary: COLLSCAN keysExamined:0 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:973 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 35148776 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_query 35150ms
2017-02-28T09:50:02.748+0100 I COMMAND [conn15347] command admin.$cmd command: listDatabases { listDatabases: 1 } numYields:0 reslen:293 locks:{ Global: { acquireCount: { r: 8 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 86066703 } }, Database: { acquireCount: { r: 3 } } } protocol:op_query 86071ms
2017-02-28T09:50:02.749+0100 I COMMAND [conn15322] command local.system.replset command: find { find: "system.replset", filter: {}, limit: 1, singleBatch: true } planSummary: COLLSCAN keysExamined:0 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:973 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 346507407 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_query 346507ms
2017-02-28T09:50:02.749+0100 I - [conn15322] AssertionException handling request, closing client connection: 6 socket exception [SEND_ERROR] for 10.230.80.203:20524

10.230.80.203 is the mongos server, and there is nothing to see in its log.

Attached is a screenshot of the Nagios monitoring.

Do you also see an increased load on the servers?

Best regards

Søren

Comment by Michael Brenden [ 09/Feb/17 ]

We are also seeing this pattern in the PRIMARY log – 10.8.0.10 is the IP of a 'weak' secondary (one that is prone to falling behind after a mass delete/insert batch and rarely, almost never, catching up) – the pattern repeats every 20 seconds:

(1) 2017-02-09T01:08:45.061-0600 I NETWORK [thread1] connection accepted from 10.8.0.10:36848 #28738 (1 connection now open)

(2) 2017-02-09T01:08:45.061-0600 I NETWORK [conn28738] received client metadata from 10.8.0.10:36848 conn28738: { driver: { name: "NetworkInterfaceASIO-RS", version: "3.4.2" }, os: { type: "Linux", name: "PRETTY_NAME="deb7"", architecture: "x86_64", version: "Kernel 3.2.0-4-amd64" } }

(3) 2017-02-09T01:09:05.126-0600 I - [conn28738] AssertionException handling request, closing client connection: 6 socket exception [SEND_ERROR] for 10.8.0.10:36848

(4) 2017-02-09T01:09:05.126-0600 I - [conn28738] end connection 10.8.0.10:36848 (18 connections now open)
