[SERVER-30947] checkOplogs function should dump more oplog entries on failure Created: 05/Sep/17  Updated: 30/Oct/23  Resolved: 11/Sep/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.4.16, 3.5.13

Type: Improvement
Priority: Major - P3
Reporter: William Schultz (Inactive)
Assignee: Katherine Walker (Inactive)
Resolution: Fixed
Votes: 0
Labels: neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Backwards Compatibility: Fully Compatible
Backport Requested: v3.4
Sprint: Repl 2017-09-11, Repl 2017-10-02
Participants:

 Description   

In replsettest.js, the checkOplogs function verifies that the oplogs of all nodes in a replica set match. If two oplogs differ, it currently prints only the last 10 oplog entries of each node to the logs. A test may execute hundreds or thousands of operations, so a limit of 10 entries is somewhat arbitrary and not always helpful when trying to debug a failure. We should consider increasing the limit significantly, to 100 or 1000 entries, or possibly just dumping the entire oplog of each node. This check (hopefully) does not fail often, so when it does, it would be nice to have as much debugging information as possible.
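To make the effect of the dump limit concrete, here is a minimal sketch in plain JavaScript. It is not the code from replsettest.js: the names tailOplog and checkOplogsSketch, the in-memory arrays standing in for oplogs, and the JSON.stringify comparison are all assumptions made for illustration.

```javascript
// Sketch only: tailOplog and checkOplogsSketch are hypothetical names,
// not functions from replsettest.js.

// Return the newest `limit` entries of an oplog stored oldest-first.
// Passing Infinity returns a copy of the whole oplog.
function tailOplog(entries, limit) {
    const start = Math.max(0, entries.length - limit);
    return entries.slice(start);
}

// Compare two oplogs entry by entry. On the first mismatch, return a report
// that carries the last `limit` entries of each node as debugging context.
function checkOplogsSketch(oplogA, oplogB, limit = 100) {
    const length = Math.max(oplogA.length, oplogB.length);
    for (let i = 0; i < length; i++) {
        // JSON.stringify is a simplification: it is sensitive to key order,
        // and it yields undefined for a missing entry, which counts as a mismatch.
        if (JSON.stringify(oplogA[i]) !== JSON.stringify(oplogB[i])) {
            return {
                ok: false,
                firstMismatch: i,
                dumpA: tailOplog(oplogA, limit),
                dumpB: tailOplog(oplogB, limit),
            };
        }
    }
    return {ok: true};
}

// Two made-up oplogs of 250 entries that differ only at entry 200.
const nodeA = Array.from({length: 250}, (_, i) => ({ts: i, op: "i"}));
const nodeB = nodeA.map((e) => (e.ts === 200 ? {ts: e.ts, op: "d"} : e));

const small = checkOplogsSketch(nodeA, nodeB, 10);
const large = checkOplogsSketch(nodeA, nodeB, 100);
console.log(small.firstMismatch);                    // 200
console.log(small.dumpB.some((e) => e.op === "d"));  // false: dump covers entries 240..249
console.log(large.dumpB.some((e) => e.op === "d"));  // true: dump covers entries 150..249
```

In this made-up case a limit of 10 leaves the diverging entry out of the dump, while a limit of 100 includes it. Passing Infinity as the limit dumps the whole oplog.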



 Comments   
Comment by Githook User [ 17/May/18 ]

Author: kvwalker (username: kvwalker, email: katherine.walker@mongodb.com)

Message: SERVER-30947 Increase dumpOplog size limit to 100 in checkOplogs

This reverts commit 820abe30691f09011183b63ab63cb1e9c43f3d9e.

(cherry picked from commit 52bbaa007cd84631d6da811d9a05b59f2dfad4f3)
Branch: v3.4
https://github.com/mongodb/mongo/commit/629fefcff3276c5665a9237d50032d1bd012393d

Comment by Ramon Fernandez Marina [ 11/Sep/17 ]

Author: kvwalker (username: kvwalker, email: katherine.walker@mongodb.com)

Message: SERVER-30947 Increase dumpOplog size limit to 100 in checkOplogs

This reverts commit 820abe30691f09011183b63ab63cb1e9c43f3d9e.
Branch: master
https://github.com/mongodb/mongo/commit/52bbaa007cd84631d6da811d9a05b59f2dfad4f3

Comment by Nathan Myers [ 11/Sep/17 ]

I cannot list the BFG tickets for the failures because there is no
easy way to go from a waterfall failure to the ticket that was created
for it. (EVG-1400 tracks progress toward implementing such a feature,
to which I encourage all of you to add your votes.) I suppose a
sufficiently clever JIRA search query would identify them.

In lieu of such a list, try
https://evergreen.mongodb.com/waterfall/mongodb-mongo-master?skip=8,
and note the failures that occurred in "! Enterprise RHEL 6.2",
"Enterprise Debian 8.1", "Enterprise SLES 12 s390x", "Enterprise
Ubuntu 14.04", "Enterprise Ubuntu 16.04 arm64", "SSL Amazon Linux",
"SSL OS X 10.10", "SSL Ubuntu 14.04", and "~ ASAN Enterprise SSL
Ubuntu 16.04 DEBUG", not matched by failures in the nine previous
builds.

Build failures are an exceptionally noisy signal, so there is a chance
that these failures would have happened anyway. But might the extra
logging be creating delays that push tests that are sensitive to
timing over the edge? Of course such tests should be fixed, but I'm
not holding my breath.

After backing the patch out, the new failures went away, and other new
ones appeared, although they did not appear in subsequent builds. I
suppose the only way to know whether the patch "caused" the failures
would be to apply it again. I leave that to your judgment.

Comment by William Schultz (Inactive) [ 11/Sep/17 ]

nathan.myers, I am also confused by this revert. This commit was a one-line change in our JavaScript test framework. What are the failures you are referring to?

Comment by Max Hirschhorn [ 10/Sep/17 ]

Re-opening this ticket since the changes were reverted.

It appears to break too many SSL builds.

nathan.myers, given that ReplSetTest#checkOplogs() is a function to help ensure consistency of the oplog across a replica set, and Katherine's change simply increased the number of oplog entries dumped as context upon failure, I find it unlikely that the changes from 1baf806 are responsible for the failures. Could you provide a link to the Evergreen failures you observed, so we can figure out whether another recent commit to mongodb/mongo could be responsible?

Comment by Ramon Fernandez Marina [ 10/Sep/17 ]

Author: Nathan Myers (username: nathan-myers-mongo, email: ncm@cantrip.org)

Message: Revert "SERVER-30947 Increase dumpOplog size limit to 100 in checkOplogs"

This reverts commit 1baf806e71f2d4d2710b9c818b3f954557c4ad16.
It appears to break too many SSL builds.
Branch: master
https://github.com/mongodb/mongo/commit/820abe30691f09011183b63ab63cb1e9c43f3d9e

Comment by Ramon Fernandez Marina [ 08/Sep/17 ]

Author: kvwalker (username: kvwalker, email: katherine.walker@10gen.com)

Message: SERVER-30947 Increase dumpOplog size limit to 100 in checkOplogs
Branch: master
https://github.com/mongodb/mongo/commit/1baf806e71f2d4d2710b9c818b3f954557c4ad16

Comment by Spencer Brody (Inactive) [ 05/Sep/17 ]

Yeah, we should probably just remove the limit from the dumpOplog function and always print the whole thing.

Generated at Thu Feb 08 04:25:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.