[CDRIVER-2985] Investigate flaky test: /Client/last_write_date/pooled Created: 28/Feb/19  Updated: 28/Oct/23  Resolved: 29/Apr/20

Status: Closed
Project: C Driver
Component/s: libmongoc, tests
Affects Version/s: None
Fix Version/s: 1.17.0-beta2, 1.17.0

Type: Bug Priority: Major - P3
Reporter: Kevin Albertson Assignee: Kevin Albertson
Resolution: Fixed Votes: 0
Labels: flaky-tests
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by CDRIVER-3664 "Parse lastWriteDate" test requires a... Closed
Related
Epic Link: CDRIVER-2733

 Description   

It failed on Windows VS 2015:

 [2019/02/27 18:24:21.029] + ./src/libmongoc/Debug/test-libmongoc.exe -d -F test-results.json
 [2019/02/27 18:24:21.029] FAIL
 [2019/02/27 18:24:21.029] Assert Failure: 0 > 0
 [2019/02/27 18:24:21.029] C:\data\mci\4709715f64dc8f5af580425573d98743\mongoc\src\libmongoc\tests\test-mongoc-max-staleness.c:216  _test_last_write_date()

On this patch build.

The test does the following:

  • sets heartbeatFrequencyMS to 500. That means the topology scanner should send ismaster's to each server if they were last updated >= 500ms ago.
  • does server selection
  • sleeps for 2 seconds
  • does server selection again
  • checks that the last_write_date has increased since the first server selection

But the error indicates the server description was not updated in the second server selection attempt.

It's unclear to me why this could have failed, unless the sleep really wasn't sleeping for 2 seconds, or if there's some other time dependant bug.



 Comments   
Comment by Githook User [ 31/May/20 ]

Author:

{'name': 'Kevin Albertson', 'email': 'kevin.albertson@mongodb.com', 'username': 'kevinAlbs'}

Message: CDRIVER-2985 fix /Client/last_write_date/pooled
Branch: r1.16
https://github.com/mongodb/mongo-c-driver/commit/feb3bab7ed668db027c19e28b0b30f83702feca1

Comment by Githook User [ 25/May/20 ]

Author:

{'name': 'Kevin Albertson', 'email': 'kevin.albertson@mongodb.com', 'username': 'kevinAlbs'}

Message: CDRIVER-2985 fix /Client/last_write_date/pooled
Branch: r1.17
https://github.com/mongodb/mongo-c-driver/commit/ef9ac101184e4cfac3644c2cec77e85df33edec3

Comment by Githook User [ 29/Apr/20 ]

Author:

{'name': 'Kevin Albertson', 'email': 'kevin.albertson@mongodb.com', 'username': 'kevinAlbs'}

Message: CDRIVER-2985 fix /Client/last_write_date/pooled
Branch: master
https://github.com/mongodb/mongo-c-driver/commit/932540ede356483f61af9811ba1b89014deafb2f

Comment by Kevin Albertson [ 28/Apr/20 ]

Confirmed the hypothesis by downloading the mongodb directories attached to the patch build and inspecting the oplog just before the crash.

This appears to be due to a noop write in the oplog shortly before second insert in the test:

local> db.oplog.rs.find().sort({ts:-1})
{ "ts" : Timestamp(1587642727, 1), "t" : NumberLong(1), "h" : NumberLong("-3145035210528304586"), "v" : 2, "op" : "n", "ns" : "", "wall" : ISODate("2020-04-23T11:52:07.635Z"), "o" : { "msg" : "periodic noop" } }
{ "ts" : Timestamp(1587642725, 2), "t" : NumberLong(1), "h" : NumberLong("-2339260708452859779"), "v" : 2, "op" : "i", "ns" : "test.test_last_write_date_157177752_2840", "ui" : UUID("66705f00-c15c-48a8-8a4a-c920a57f541b"), "wall" : ISODate("2020-04-23T11:52:05.987Z"), "o" : { "_id" : ObjectId("5ea18165d0cf7c3936089e94") } }
{ "ts" : Timestamp(1587642725, 1), "t" : NumberLong(1), "h" : NumberLong("6206579937116025356"), "v" : 2, "op" : "n", "ns" : "", "wall" : ISODate("2020-04-23T11:52:05.634Z"), "o" : { "msg" : "periodic noop" } }
{ "ts" : Timestamp(1587642723, 4), "t" : NumberLong(1), "h" : NumberLong("-5631330227419792400"), "v" : 2, "op" : "i", "ns" : "test.test_last_write_date_157177752_2840", "ui" : UUID("66705f00-c15c-48a8-8a4a-c920a57f541b"), "wall" : ISODate("2020-04-23T11:52:03.984Z"), "o" : { "_id" : ObjectId("5ea18163d0cf7c3936089e93") } }

This appears to be due to a recent change in mongo orchestration to start the periodic noop writer with an interval of 1s instead of 10s: https://github.com/10gen/mongo-orchestration/commit/066ec58c817947ca136c25690a705fd4e4db2a7b

Though, this does beg the question whether there was a different reason this test was failing before.

Comment by Kevin Albertson [ 28/Apr/20 ]

This test comes from the max stalenes spec, the "Parse lastWriteDate" test: https://github.com/mongodb/specifications/blob/master/source/max-staleness/max-staleness-tests.rst#parse-lastwritedate

Observations

  • lastWriteDate has second granularity. If two inserts occurred within the same second, the same lastWriteDate would be reported.
  • I've only seen this happened on the "pooled" test.
  • The test selects a secondary, but this still persists even if the test targets the primary.
  • With lots of debug statements, I was able to capture the ismasters received from all three nodes during the test. Reference.

A timeline:
11:52:03.0975 test starts
11:52:03.0980 primary recieves lastWriteDate=1587642723000
11:52:03.0981 secondary receives lastWriteDate=1587642723000
11:52:05.0987 secondary receives lastWriteDate=1587642725000
11:52:06.0487 primary receives lastWriteDate=1587642725000
No server receives an ismaster reply containing a larger lastWriteDate for the remainder of the test.

Both server descriptions s0 and s1 have lastWriteDate of 1587642725000 in the end. This is first received two seconds after the test starts.

Hypothesis: If there was a write from something external (e.g. from the oplog noop writer, a session being deleted, etc.) to the oplog just before the first server selection, then the insert may be happening within the same lastWriteDate second. I can confirm this by dumping the oplog.

Generated at Wed Feb 07 21:16:52 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.