[SERVER-39154] Create a fail point to make it so if a LockManager lock cannot be acquired immediately, then the operation fails
Created: 23/Jan/19  Updated: 29/Oct/23  Resolved: 05/Mar/19

Status: Closed
Project: Core Server
Component/s: Concurrency, Testing Infrastructure
Affects Version/s: None
Fix Version/s: 4.1.9

Type: Task
Priority: Major - P3
Reporter: Judah Schvimer
Assignee: Vesselina Ratcheva (Inactive)
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Depends
Related
related to SERVER-40038 failNonIntentLocksIfWaitNeeded failpo... Closed
related to SERVER-40420 Using failNonIntentLocksIfWaitNeeded ... Closed
related to SERVER-42452 failNonIntentLocksIfWaitNeeded failpo... Closed
Backwards Compatibility: Fully Compatible
Sprint: Repl 2019-02-25, Repl 2019-03-11

Description

This is necessary for prepared-transaction testing, so that DDL ops that conflict with a prepared transaction's locks fail instead of blocking forever.



Comments
Comment by Githook User [ 05/Mar/19 ]

Author: Vesselina Ratcheva (vessy-mongodb) <vesselina.ratcheva@10gen.com>

Message: SERVER-39154 Create a fail point to make it so if a LockManager lock cannot be acquired immediately, then the operation fails
Branch: master
https://github.com/mongodb/mongo/commit/6f3c3df4fc0abda76fd97e970ced4a01f0c48007

Comment by Eric Milkie [ 25/Jan/19 ]

If you limit the failpoint to MODE_S and MODE_X acquisitions on all resource types, that reduces the number of unintended failures you might see. I say, try it and see if it works.
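The gating suggested in this comment can be sketched as a small decision function. This is a minimal C++ sketch under stated assumptions, not the committed change: `LockMode`, `Outcome`, `isNonIntent`, and `decide` are hypothetical names, and the `failPointEnabled` parameter stands in for the fail point (the related tickets refer to it as failNonIntentLocksIfWaitNeeded; this does not reproduce its implementation). Only a request that would otherwise have to wait, and only in MODE_S or MODE_X, is turned into a failure.

```cpp
#include <cassert>

// Lock modes named after MongoDB's MODE_IS/MODE_IX/MODE_S/MODE_X
// (simplified; not the server's actual enum).
enum class LockMode { IS, IX, S, X };

// What a single lock request should do.
enum class Outcome { Grant, Wait, Fail };

// True for the shared and exclusive modes, false for intent modes.
bool isNonIntent(LockMode mode) {
    return mode == LockMode::S || mode == LockMode::X;
}

// Hypothetical decision rule. conflictsWithGranted: the request is
// incompatible with a lock already held by another operation.
// failPointEnabled: models the fail point being switched on.
Outcome decide(LockMode mode, bool conflictsWithGranted, bool failPointEnabled) {
    if (!conflictsWithGranted) {
        return Outcome::Grant;  // no contention: behave normally
    }
    if (failPointEnabled && isNonIntent(mode)) {
        return Outcome::Fail;  // would have to wait: fail immediately instead
    }
    return Outcome::Wait;  // default behavior: queue and block
}

// Usage checks: the fail point only changes S/X requests that would wait.
void exampleUsage() {
    assert(decide(LockMode::X, true, false) == Outcome::Wait);
    assert(decide(LockMode::X, true, true) == Outcome::Fail);
    assert(decide(LockMode::IX, true, true) == Outcome::Wait);
    assert(decide(LockMode::S, false, true) == Outcome::Grant);
}
```

Intent-mode requests keep their normal blocking behavior, which is what keeps the set of unintended failures small.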

Comment by Max Hirschhorn [ 25/Jan/19 ]

"but if an index build hangs for 5 seconds waiting on a lock, we'd want that to fail"

To clarify one point that may be getting missed here: the index build wouldn't "hang for 5 seconds" in the kind of test the initial sync fuzzer would generate. It would hang forever, because there's a single thread of execution on the client and it would have run:

  1. insert with txnNumber=5 on the test.mycoll collection
  2. prepareTransaction with txnNumber=5
  3. drop on the test.mycoll2 collection
  4. (unreachable) commitTransaction with txnNumber=5

This failpoint only needs to apply to MODE_S and MODE_X lock acquisitions because MODE_IS and MODE_IX would be compatible with the locks held by a transaction.
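The compatibility argument can be checked against the standard multi-granularity lock compatibility matrix for IS/IX/S/X modes. This C++ sketch is illustrative: the enum, table, and helper names are hypothetical, and it assumes, as the comment implies, that a prepared transaction's CRUD operations hold only intent locks, modeled here as MODE_IX on a shared resource such as the database.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Mode names borrowed from MongoDB; values double as table indices.
enum LockMode : std::size_t { MODE_IS = 0, MODE_IX = 1, MODE_S = 2, MODE_X = 3 };

// kCompatible[requested][held] is true when both can be granted at once.
constexpr std::array<std::array<bool, 4>, 4> kCompatible = {{
    //        held: IS     IX     S      X
    /* IS */ {{true,  true,  true,  false}},
    /* IX */ {{true,  true,  false, false}},
    /* S  */ {{true,  false, true,  false}},
    /* X  */ {{false, false, false, false}},
}};

constexpr bool isCompatible(LockMode requested, LockMode held) {
    return kCompatible[requested][held];
}

// Compatibility is symmetric; verify the table at compile time.
constexpr bool isSymmetric() {
    for (std::size_t a = 0; a < 4; ++a)
        for (std::size_t b = 0; b < 4; ++b)
            if (kCompatible[a][b] != kCompatible[b][a]) return false;
    return true;
}
static_assert(isSymmetric(), "lock compatibility must be symmetric");

// Against a prepared transaction holding MODE_IX, only S and X must wait.
void exampleUsage() {
    const LockMode heldByPreparedTxn = MODE_IX;
    assert(isCompatible(MODE_IS, heldByPreparedTxn));
    assert(isCompatible(MODE_IX, heldByPreparedTxn));
    assert(!isCompatible(MODE_S, heldByPreparedTxn));
    assert(!isCompatible(MODE_X, heldByPreparedTxn));
}
```

Assuming the drop in the sequence above needs a MODE_X lock on a resource the transaction holds in MODE_IX, it lands in the incompatible row, and the only thread that could commit the prepared transaction is the one blocked on the drop.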

Comment by Judah Schvimer [ 25/Jan/19 ]

max.hirschhorn suggested this instead of using maxTimeMS. I believe that was to allow the operation to take as long as it wants for everything but the lock acquisition. If an index build with no lock contention would take 45 seconds, we want that to succeed, but if an index build hangs for 5 seconds waiting on a lock, we'd want that to fail.
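The trade-off between a whole-operation deadline (maxTimeMS) and a fail-on-lock-wait rule can be made concrete with a toy model. This C++ sketch is hypothetical throughout: `Op`, the two policy functions, and the 30-second limit are illustrative choices, not server code or values from this ticket. A lock wait of `std::nullopt` stands for a wait that never ends, as with a request stuck behind a prepared transaction, and the model ignores the S/X-only restriction discussed in the later comments.

```cpp
#include <cassert>
#include <optional>

// Toy description of one operation.
struct Op {
    double workSeconds;                     // time spent doing actual work
    std::optional<double> lockWaitSeconds;  // 0 = uncontended; nullopt = waits forever
};

// Policy A (maxTimeMS-style): the whole operation must finish within the limit.
bool succeedsWithDeadline(const Op& op, double limitSeconds) {
    if (!op.lockWaitSeconds) {
        return false;  // the deadline fires instead of hanging forever
    }
    return op.workSeconds + *op.lockWaitSeconds <= limitSeconds;
}

// Policy B (fail point-style): no overall limit, but any lock wait fails at once.
bool succeedsWithFailOnWait(const Op& op) {
    return op.lockWaitSeconds.has_value() && *op.lockWaitSeconds == 0.0;
}

void exampleUsage() {
    const double limit = 30.0;                       // hypothetical deadline
    const Op slowIndexBuild{45.0, 0.0};              // long but uncontended
    const Op briefContention{1.0, 5.0};              // waits 5 s on a lock
    const Op stuckBehindPrepare{1.0, std::nullopt};  // never gets the lock

    assert(!succeedsWithDeadline(slowIndexBuild, limit));  // killed despite no contention
    assert(succeedsWithFailOnWait(slowIndexBuild));
    assert(succeedsWithDeadline(briefContention, limit));
    assert(!succeedsWithFailOnWait(briefContention));      // the case this comment wants to fail
    assert(!succeedsWithDeadline(stuckBehindPrepare, limit));
    assert(!succeedsWithFailOnWait(stuckBehindPrepare));
}
```

Under the deadline policy the limit has to exceed the slowest legitimate operation, while the lock-wait rule needs no such tuning but rejects even short waits.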

Comment by Eric Milkie [ 25/Jan/19 ]

Wouldn't it be easier to add a timeout to all DDL operations (either by modifying the fuzzer, or by editing the timeout on the server when a failpoint was activated)?  I am concerned that you might find lots of things would break if you make all lock acquisitions fail immediately if the resource is contended.

Comment by Judah Schvimer [ 25/Jan/19 ]

Once a transaction is in prepare it no longer has a time limit. Since the fuzzer is single-threaded, if it does a DDL op before committing or aborting the prepared transaction, it will block on the DDL op and never try to commit or abort the transaction.

Comment by Eric Milkie [ 25/Jan/19 ]

Why would DDL ops ever block forever?  I thought all transactions had a time limit to eventually commit or abort.

Generated at Thu Feb 08 04:51:12 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.