[SERVER-44881] Giant slow oplog entries are being logged in full Created: 29/Nov/19  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Replication
Affects Version/s: 4.2.1
Fix Version/s: None

Type: Improvement Priority: Minor - P4
Reporter: Michael Smith Assignee: Backlog - Replication Team
Resolution: Unresolved Votes: 0
Labels: former-quick-wins
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-43069 Ops that are slow to apply don't cond... Closed
related to SERVER-44948 High cache pressure in 4.2 replica se... Closed
related to SERVER-32146 Log slow oplog entry application Closed
Assigned Teams:
Replication
Participants:

 Description   

From the release notes:

Starting in version 4.2 (also available starting in 4.0.6), secondary members of a replica set now log oplog entries that take longer than the slow operation threshold to apply.

This seems reasonable, but it logs the entire contents of a batch insert, which has adverse performance and privacy impacts. I get tons of log lines like this:

2019-11-28T22:51:51.423+0000 I REPL [repl-writer-worker-2] warning: log line attempted (234kB) over max size (10kB), printing beginning and end ... applied op: CRUD { ts: [ Timestamp(1574981509, 1165), Timestamp(1574981509, 1184), Timestamp(1574981509, 1228), Timestamp(1574981509, 1235), Timestamp(1574981509, 1242), Timestamp(1574981509, 1285), Timestamp(1574981509, 1302), Timestamp(1574981509, 1318), Timestamp(1574981509, 1324), Timestamp(1574981509, 1341), Timestamp(1574981509, 1375), Timestamp(1574981509, 1376), Timestamp(1574981509, 1380), Timestamp(1574981510, 8), Timestamp(1574981510, 45), Timestamp(1574981510, 71), Timestamp(1574981510, 97), Timestamp(1574981510, 101), Timestamp(1574981510, 156), Timestamp(1574981510, 161), Timestamp(1574981510, 188), Timestamp(1574981510, 211), Timestamp(1574981510, 221), Timestamp(1574981510, 256), Timestamp(1574981510, 263), Timestamp(1574981510, 272), Timestamp(1574981510, 282), Timestamp(1574981510, 285), Timestamp(1574981510, 304), Timestamp(1574981510, 314), Timestamp(1574981510, 324), Timestamp(1574981510, 328), Timestamp(1574981510, 336), Timestamp(1574981510, 345), Timestamp(1574981510, 370), Timestamp(1574981510, 399), Timestamp(1574981510, 414), Timestamp(1574981510, 416), Timestamp(1574981510, 419), Timestamp(1574981510, 444), Timestamp(1574981510, 473), Timestamp(1574981510, 504), Timestamp(1574981510, 571), Timestamp(1574981510, 582), Timestamp(1574981510, 604), Timestamp(1574981510, 621), Timestamp(1574981510, 641), Timestamp(1574981510, 644), Timestamp(1574981510, 646), Timestamp(1574981510, 682), Timestamp(1574981510, 683), Timestamp(1574981510, 687), Timestamp(1574981510, 688), Timestamp(1574981510, 689), Timestamp(1574981510, 693), Timestamp(1574981510, 698), Timestamp(1574981510, 702), Timestamp(1574981510, 704), Timestamp(1574981510, 709), Timestamp(1574981510, 711), Timestamp(1574981510, 731), Timestamp(1574981510, 738), Timestamp(1574981510, 757), Timestamp(1574981510, 771) ], t: [ 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62, 62 ], o: [ { _id: ObjectId('5de04f852f19ef00016eff32'), <redacted>ment_id: ObjectId('5dd5bc5d0305cb0001ca5de2'), acc .......... ", so<redacted>} ], h: 0, v: 2, op: "i", ns: "<redacted>", ui: UUID("e11ee941-efd8-4880-a6d5-f92e472ecb86"), wall: new Date(1574981509985) }, took 542ms

Sure, I can and will bump up the slow operation threshold on the secondaries... but I only want to know about slow operations at the namespace/operation level; I'd never want the full inserted data to be logged. At most, log the `_id` of each document; that is enough to find the data if you really need it.
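As a minimal sketch of that threshold workaround (assuming pymongo, a direct connection to the secondary, and a placeholder host name and 2000 ms value), the profile command can be issued with level 0, leaving the profiler off and only raising the slowms threshold that drives the slow-operation log lines:

{code:python}
# Sketch of the workaround: raise the slow operation threshold (slowms) on a
# secondary so routine oplog application stops being logged as "slow".
# Host, port, and the 2000 ms value are placeholders for illustration.
from pymongo import MongoClient

# Connect directly to the secondary rather than through the replica set
# (the directConnection option requires a reasonably recent pymongo).
client = MongoClient("secondary.example.net", 27017, directConnection=True)

# Level 0 leaves the profiler off; slowms still controls which operations are
# written to the mongod log, and it applies to the whole mongod instance.
result = client.admin.command({"profile": 0, "slowms": 2000})
print("previous level:", result.get("was"), "previous slowms:", result.get("slowms"))
{code}

The same threshold can also be set at startup via `operationProfiling.slowOpThresholdMs` in the configuration file, or interactively with `db.setProfilingLevel()` in the mongo shell.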

Note: The replica sets where I'm seeing this are still using featureCompatibilityVersion 4.0, if that makes a difference.



 Comments   
Comment by Asya Kamsky [ 19/Jan/20 ]

Note that SERVER-32146 was backported to 3.6 and 4.0, so this issue/complaint may come up in older versions as well.

Comment by Michael Smith [ 04/Dec/19 ]

I initially opened this because I saw slow replication performance on secondaries running 4.2.1 with a 4.0.4 primary, and I suspected the extra logging was the cause. I've worked around it by increasing the slow operation threshold, and I don't think the logging has really affected performance either way.
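(For reference, the threshold a member is currently using can be read back without changing anything by issuing the profile command at level -1; a small pymongo sketch with a placeholder host name:)

{code:python}
# Sketch: read back a member's current slow-operation settings without
# modifying them. Level -1 makes the "profile" command report-only.
from pymongo import MongoClient

client = MongoClient("secondary.example.net", 27017, directConnection=True)

status = client.admin.command({"profile": -1})
print("profiler level:", status["was"])
print("slow operation threshold (ms):", status["slowms"])
{code}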

Comment by Carl Champain (Inactive) [ 03/Dec/19 ]

Hi mzs,

Thanks for the report.

We understand that this behavior may cause you trouble; I will pass this ticket along to the Replication team so they are aware of it.
We'll take a look at this behavior. Note, however, that secondaries apply replicated operations in batches, and we currently have no way to tell which individual operation in a batch was slow; that is why every operation in the batch is printed in order to identify the slow ones.
 
Kind regards,
Carl
