Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 4.8.0, 4.4.2, 4.2.15
Affects Version/s: None
Component/s: Testing Infrastructure
Labels:
None

Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Backport Requested:

v4.4, v4.2
Sprint:
Execution Team 2020-10-05
Linked BF Score:
14
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

A build failure occurred in this test when oplog truncation ran concurrently with the insertion of a record that should have caused the oplog to roll over.

Looking at the insertions the test performed we have the following oplog entries (some fields not shown due to the test using a projection):

[ 
 { "op" : "i", "ns" : "test.foo", "ts" : Timestamp(1598009744, 3), "t" : NumberLong(1) },
 { "op" : "i", "ns" : "test.foo", "ts" : Timestamp(1598009745, 1), "t" : NumberLong(1) }, 
 { "op" : "i", "ns" : "test.foo", "ts" : Timestamp(1598009747, 1), "t" : NumberLong(1) } 
]

The oplog truncation thread was truncating the oplog between RecordIds 0 and 6863399589169332000.
The three oplog entries above have RecordIds 6863399589169332227, 6863399593464299521 and 6863399602054234113 respectively.
All of these RecordIds are higher than the upper bound of the truncated range, so none of the entries were truncated, even though the first entry was expected to be.
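The comparison can be reproduced with a little arithmetic. The sketch below assumes that an oplog RecordId packs the entry's Timestamp as (seconds << 32) | increment; the ticket does not state this encoding, but the three RecordIds quoted here are consistent with it. The function and variable names are illustrative only.

```python
# Assumption: an oplog RecordId packs the Timestamp as (seconds << 32) | increment.
def record_id_from_timestamp(seconds: int, increment: int) -> int:
    return (seconds << 32) | increment


# Timestamps of the three inserts, copied from the oplog entries in this ticket.
timestamps = [(1598009744, 3), (1598009745, 1), (1598009747, 1)]
record_ids = [record_id_from_timestamp(s, i) for s, i in timestamps]

# Upper bound of the truncation range as reported in this ticket.
truncate_upper_bound = 6863399589169332000

# Only entries whose RecordId falls inside [0, upper bound] would be truncated.
truncated = [rid for rid in record_ids if 0 <= rid <= truncate_upper_bound]

print(record_ids)
print(truncated)
```

Under that assumption, the computed values match the RecordIds listed in the ticket, and the truncated list comes out empty because even the oldest entry lies above the upper bound.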

The third inserted record, which was supposed to roll over the oplog, failed to create a stone, so the test hung because the OplogCapMaintainerThread saw nothing to reclaim.

The test waits until there are two oplog entries remaining, but there were always three oplog entries in this run.

From my observation, based on when the oplog truncation thread was running and when the third record was inserted, I think we tried to create a new oplog stone while oplog truncation was running. The oplog truncation thread can hold a mutex for a short amount of time when calling either peekOldestStoneIfNeeded() or popOldestStone() in the reclaimOplog() function.

During this time, the third record insertion tried to create a new oplog stone, but because the mutex may have been held by the oplog truncation thread, we returned early without creating it.

This is a transient issue, since any subsequent insertion would try again to create the oplog stone. This test, however, performs no further insertions and expects the oplog stone to be created no matter what.
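The suspected interleaving can be illustrated with a minimal sketch. It assumes the stone-creation path takes the mutex with a non-blocking try-lock and gives up when the lock is contended, which is what the early return described here suggests. The class and method names are hypothetical stand-ins, not the server's actual code.

```python
import threading


class OplogStonesSketch:
    """Hypothetical stand-in for the stone bookkeeping; not MongoDB's real class."""

    def __init__(self):
        self.mutex = threading.Lock()
        self.stones = []

    def create_stone_if_possible(self, last_record_id: int) -> bool:
        # Assumed behavior: try-lock, and return early if the mutex is held.
        if not self.mutex.acquire(blocking=False):
            return False
        try:
            self.stones.append(last_record_id)
            return True
        finally:
            self.mutex.release()


stones = OplogStonesSketch()

# Holding the mutex here simulates the truncation thread being inside
# peekOldestStoneIfNeeded() or popOldestStone() during the third insert.
with stones.mutex:
    created_during_truncation = stones.create_stone_if_possible(6863399602054234113)

# A later insert would succeed, but the test never performs one.
created_on_later_insert = stones.create_stone_if_possible(6863399602054234113)

print(created_during_truncation, created_on_later_insert, stones.stones)
```

In this sketch the first attempt is skipped while the lock is held and only a later attempt creates the stone, which mirrors why a test with no further insertions would wait indefinitely.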

Assignee: Gregory Wlodarek
Reporter: Gregory Wlodarek
Participants: Githook User, Gregory Wlodarek
Votes: 0
Watchers: 2

Created: Sep 15 2020 06:17:25 PM UTC
Updated: Oct 29 2023 10:03:09 PM UTC
Resolved: Sep 25 2020 02:59:11 AM UTC
Confidence Status Last Update: 18/Sep/20 6:13 PM
