[SERVER-20773] core/fsync.js can leave test server fsyncLocked Created: 06/Oct/15  Updated: 06/Dec/22  Resolved: 19/Nov/21

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kevin Pulo Assignee: Backlog - Server Tooling and Methods (STM) (Inactive)
Resolution: Won't Fix Votes: 0
Labels: stm
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-27227 Disable fail points on test failures ... Closed
Assigned Teams:
Server Tooling & Methods
Operating System: ALL
Participants:

 Description   

If jstests/core/fsync.js throws an assertion after calling fsyncLock but before calling fsyncUnlock, the server is left in an fsyncLocked state. This is a problem if multiple tests are running in parallel, or if --continueOnFailure is being used (resmoke will get stuck waiting for the fsyncLocked mongod).

The exception is asserting that fsyncLock itself succeeded. That is safe, as long as the server wasn't already fsyncLocked when fsyncLock was called (i.e. when testing recursive fsyncLock).

Many of these tests have the form you would expect:

  • fsyncLock
  • do stuff/check something (i.e. call assert())
  • fsyncUnlock
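
In jstest terms, that fragile shape looks roughly like the sketch below. This is not the actual core/fsync.js code, and `db` is stubbed here so the sketch is self-contained; in a real jstest it is the shell's connected database and these are real server commands:

```javascript
// Sketch of the fragile pattern (hypothetical; `db` stubbed for illustration).
const db = {
  fsyncLocked: false,
  fsyncLock() { this.fsyncLocked = true; return { ok: 1 }; },
  fsyncUnlock() { this.fsyncLocked = false; return { ok: 1 }; },
};

function fragileTest(checkPasses) {
  db.fsyncLock();
  // If this assertion throws, fsyncUnlock() below is never reached and the
  // server stays fsyncLocked for every test that runs after this one.
  if (!checkPasses) throw new Error("assert() failed while fsyncLocked");
  db.fsyncUnlock();
}

try { fragileTest(false); } catch (e) { /* the test "fails" here */ }
console.log(db.fsyncLocked); // true: the server was left locked
```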

Better would be to rearrange things so that assert() is never called between fsyncLock/fsyncUnlock, e.g.:

  • fsyncLock
  • save values to be tested
  • fsyncUnlock
  • check saved values (i.e. call assert())
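
The rearranged shape, sketched the same way (stubbed `db` and a hypothetical `docCount` stand in for real server state):

```javascript
// Sketch of the safer arrangement: no assert() between lock and unlock.
// `db` is stubbed for a self-contained illustration; in a real jstest it is
// the shell's connected database.
const db = {
  fsyncLocked: false,
  docCount: 42, // hypothetical stand-in for some server state to verify
  fsyncLock() { this.fsyncLocked = true; return { ok: 1 }; },
  fsyncUnlock() { this.fsyncLocked = false; return { ok: 1 }; },
};

db.fsyncLock();
const observed = db.docCount; // save values while locked; don't assert yet
db.fsyncUnlock();

// Only now, with the server unlocked, is it safe to throw.
if (observed !== 42) throw new Error("saved value did not match");
console.log(db.fsyncLocked); // false: unlock ran before any assertion
```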

The main alternative to this approach would be to pass a "finally" function to assert(), to be called just before doassert(). But that would be more invasive (and probably uglier) than just fixing this one jstest.
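
A rough sketch of what that "finally" interface might look like. This is hypothetical: the shell's assert() takes no such parameter, and the sketch only illustrates why the ticket considers the approach more invasive than fixing the one jstest (every assert between fsyncLock/fsyncUnlock would need the extra callback):

```javascript
// Hypothetical assert() variant that runs a cleanup callback before failing.
function assertWithFinally(condition, msg, finallyFn) {
  if (!condition) {
    if (finallyFn) finallyFn(); // e.g. () => db.fsyncUnlock()
    throw new Error(msg);       // stands in for doassert()
  }
}

// Usage sketch: cleanup runs even though the assertion fails.
let unlocked = false;
try {
  assertWithFinally(false, "check failed", () => { unlocked = true; });
} catch (e) {
  // the cleanup ran before the assertion error propagated
}
console.log(unlocked); // true
```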



 Comments   
Comment by Brooke Miller [ 19/Nov/21 ]

We'd like this to be addressed in individual tests, rather than by Resmoke. Closing as Won't Fix as a result.

Comment by Kevin Pulo [ 07/Oct/15 ]

Grepping reveals some other tests which could suffer from the same problem:

$ egrep -ilr 'fsync\s*:\s*1\s*,\s*lock\s*:\s*1|fsyncLock' .
./repl/snapshot1.js
./auth/pseudo_commands.js
./core/fsync.js
./gle/gle_explicit_optime.js
./noPassthrough/backup_restore.js
./replsets/election_not_blocked.js
./replsets/stepdown3.js
./replsets/maxSyncSourceLagSecs.js
./replsets/stepdown.js
./replsets/fsync_lock_read_secondaries.js
./replsets/maintenance_non-blocking.js
./noPassthroughWithMongod/sharding_rs2.js
./noPassthroughWithMongod/fsync2.js
./noPassthroughWithMongod/backup_cursors_wt.js

After inspection, there are three categories:

  1. Those along the same lines as core/fsync.js:
    • noPassthroughWithMongod/fsync2.js
    • noPassthroughWithMongod/backup_cursors_wt.js
  2. Those that lock nodes inside a ReplSetTest. The impact here is much lower — it should only cause the test to run longer by about a minute, since the secondary won't come down cleanly and so will end up being forcibly killed:
    • replsets/stepdown.js
    • replsets/stepdown3.js
    • replsets/maxSyncSourceLagSecs.js
    • noPassthroughWithMongod/sharding_rs2.js
    • replsets/maintenance_non-blocking.js
    • replsets/fsync_lock_read_secondaries.js
    • replsets/election_not_blocked.js
    • noPassthrough/backup_restore.js
    • gle/gle_explicit_optime.js
  3. Those that (probably) don't have this problem:
    • repl/snapshot1.js
    • auth/pseudo_commands.js
Generated at Thu Feb 08 03:55:14 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.