[SERVER-10344] Race condition when starting up new master/slave cluster. Was: repl4.js failing on Linux 64-bit Weekly Slow Tests
Created: 26/Jul/13 Updated: 14/Apr/16 Resolved: 26/Aug/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Matt Kangas | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | tertiary |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Operating System: | ALL |
| Steps To Reproduce: | time scons --dd --sharedclient all |
| Participants: |
| Description |
|
jstests/repl/repl4.js was failing when started under auth mode. I believe this is actually a timing bug in master/slave replication, not directly an auth issue. It seems to have something to do with setting up a slave that only syncs a single DB from the master. If you put a "sleep(5000)" after line 13 in repl4.js (the line that starts the primary), then the test passes. Also, if you switch the order of lines 19 and 20 (the lines that do writes into 2 dbs, one that's synced and one that isn't) then the test passes. When the test fails, this shows up in the logs of the slave:
ORIGINAL DESCRIPTION:
Linux 64-bit Weekly Slow Tests Build #256 July 14 rev f204f7f
Linux 64-bit Weekly Slow Tests Build #261 July 21 rev 9bf7075
Linux 64-bit Weekly Slow Tests Build #262 July 23 rev 37f7f30 (#263 was interrupted)
Linux 64-bit Weekly Slow Tests Build #264 July 25 rev 25395ab
All of these failed with a final error similar to
But prior to this failure, a bunch of these happen.
As far as I can tell, all that's happening and failing here is authentication attempts. |
| Comments |
| Comment by Spencer Brody (Inactive) [ 20/Oct/14 ] | ||||
|
Replication refactor didn't fix the problem. Assigning to Eric to re-triage. | ||||
| Comment by Eric Milkie [ 28/Aug/14 ] | ||||
|
Before the 2.8 release, we should try re-enabling these tests and see whether our refactor has solved the race condition. | ||||
| Comment by Andrew Morrow (Inactive) [ 22/Aug/13 ] | ||||
|
This issue was also causing repl_sync_only_db_with_special_chars.js to fail (see | ||||
| Comment by auto [ 22/Aug/13 ] | ||||
|
Author: Andrew Morrow (acmorrow, acm@10gen.com) Message: | ||||
| Comment by auto [ 31/Jul/13 ] | ||||
|
Author: Spencer T Brody (stbrody, spencer@10gen.com) Message: | ||||
| Comment by auto [ 31/Jul/13 ] | ||||
|
Author: Spencer T Brody (stbrody, spencer@10gen.com) Message: | ||||
| Comment by Matt Kangas [ 30/Jul/13 ] | ||||
|
After discussing with Spencer, we've identified (a) a probable server bug, which may or may not be worth fixing, and (b) a deficiency in "repl" tests that prevents us from working around it.

The probable server bug: start up a master/slave pair of mongods, then write to the primary before the secondary is "ready" (for some definition of ready). The secondary gives up with "data too stale, halting replication".

The test deficiency: "repl" jstests need a function analogous to the "awaitReplication" function available to "replsets" tests.

Assuming we don't solve the server problem, here's how we might write "awaitReplication" for master/slave tests. Secondaries in M/S replication do not create oplogs by default, but they do have a marker of how far they have synced which could be used: the `db.sources` collection. It might be sufficient to test that `db.sources.find()` returns nonzero results. Or we may need to check the `syncedTo` field of the latest record on the secondary, and ensure it matches the primary's latest oplog entry.

I am now also convinced that all failures in | ||||
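The "awaitReplication" idea described above could be sketched roughly as follows. This is only an illustration, not code from the server repository: the function name `awaitMasterSlaveSync`, its two callbacks, and the option names are all hypothetical. The callbacks are assumed to return a comparable number (for example, the seconds part of a timestamp) or `null` when nothing is available yet.

```javascript
// Hypothetical helper: poll until the slave's sync marker reaches the
// master's latest op time. Both callbacks are supplied by the test.
function awaitMasterSlaveSync(getMasterOpTime, getSlaveSyncedTo, options) {
    var opts = options || {};
    var maxAttempts = opts.maxAttempts || 60;
    var intervalMs = opts.intervalMs || 500;
    // The mongo shell provides a blocking sleep(ms); outside the shell,
    // fall back to an injected function or a no-op.
    var pause = opts.sleep ||
        (typeof sleep === "function" ? sleep : function () {});

    for (var attempt = 1; attempt <= maxAttempts; attempt++) {
        var target = getMasterOpTime();     // latest op time on the master
        var syncedTo = getSlaveSyncedTo();  // how far the slave has synced
        if (target !== null && syncedTo !== null && syncedTo >= target) {
            return attempt;                 // slave has caught up
        }
        pause(intervalMs);
    }
    throw new Error("slave did not catch up after " + maxAttempts + " attempts");
}
```

In a real test, `getSlaveSyncedTo` might read the `syncedTo` field of the `sources` record on the secondary, and `getMasterOpTime` might read the primary's latest oplog entry, as suggested in the comment. Returning `null` while the `sources` record does not exist yet also covers the simpler "nonzero results" check.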
| Comment by auto [ 29/Jul/13 ] | ||||
|
Author: {u'username': u'stbrody', u'name': u'Spencer T Brody', u'email': u'spencer@10gen.com'}Message: | ||||
| Comment by Matt Kangas [ 26/Jul/13 ] | ||||
|
Confirmed that I can repro the failure on Linux 64 debug only with --auth enabled. As you say, I think it's really a timing issue. The test was already disabled on Windows by Eric with the comment "skip racy test on Windows". Shall we skip this also when --auth is enabled? | ||||
| Comment by Spencer Brody (Inactive) [ 26/Jul/13 ] | ||||
|
The commit that broke this is https://github.com/mongodb/mongo/commit/3f4169d7144fe1300129f5b0ccda1b75b8f76cef. I believe this is actually a timing bug in master/slave replication, not directly an auth issue.
Which shows up in the log output shortly after the secondary is started. |