[SERVER-46510] Concurrent drop-pending notifications can race with subsequent transactions Created: 28/Feb/20  Updated: 12/Mar/20  Resolved: 12/Mar/20

Status: Closed
Project: Core Server
Component/s: Catalog, Replication, Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Judah Schvimer Assignee: Eric Milkie
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File bf-10827.diff    
Issue Links:
Depends
Related
is related to SERVER-39082 Retry on TransientTransaction errors ... Closed
Operating System: ALL
Steps To Reproduce:

Repro attached, applied on commit 7f9424413d86afeae33e3563c5a33fcb95317e2e on the 4.0 branch.

Participants:
Linked BF Score: 0

 Description   

Replication two-phase drop (4.0 and eMRC=F) finishes dropping a collection when the drop oplog entry becomes majority committed. A drop command with w:majority waits until its collection finishes dropping before returning to the user. Tests expect that after executing a drop with w:majority they can safely move on to the next operation, including transactions, which have a 5ms lock timeout.

If the majority commit point is advanced multiple times concurrently, multiple notifications can each schedule a task to complete the same two-phase drop. The first scheduled task succeeds and lets the w:majority drop return to the user. At this point the user may start a transaction. The other drop-pending notification tasks that are still scheduled eventually run and acquire a database lock to complete the collection drop. This can cause the transaction to get a lock timeout. The resulting error is a TransientTransactionError that could easily be retried, but our tests do not expect to get a TransientTransactionError everywhere this is possible.
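To illustrate what "easily retried" could look like, here is a minimal Python sketch of a retry wrapper. It is not the server's or any driver's code: the `TransientTransactionError` class, `run_with_retry`, and `flaky_txn` are hypothetical stand-ins invented for this example, and a real test would check the error label reported by its driver instead.

```python
class TransientTransactionError(Exception):
    """Hypothetical stand-in for an error labeled TransientTransactionError."""


def run_with_retry(txn_body, max_attempts=3):
    """Run txn_body(), retrying the whole transaction on a transient error."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_body()
        except TransientTransactionError:
            if attempt == max_attempts:
                raise  # out of attempts, surface the error to the caller


# Usage: the first attempt hits a simulated lock timeout, the second succeeds.
attempts = []


def flaky_txn():
    attempts.append(1)
    if len(attempts) == 1:
        raise TransientTransactionError("lock timeout")
    return "committed"


result = run_with_retry(flaky_txn)
```

The catch is the one the description names: every test that runs a transaction after a w:majority drop would need such a wrapper.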

Sequence of events

  1. Thread A gets a drop command and waits on write concern majority
  2. Thread B gets notified to drop the collection
  3. Thread B drops the collection
  4. Thread C gets notified to drop the collection
  5. Thread A returns to the user
  6. Thread C acquires an X lock on the 'test' database
  7. Thread A starts a transaction, conflicts with thread C's lock, and hits the lock timeout
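The sequence above can be modeled with a small Python simulation. This is a sketch under stated assumptions, not the server's implementation: a `threading.Lock` stands in for the X lock on the 'test' database, and `complete_drop`, `dropped`, `c_holds_lock`, and `release_c` are hypothetical names. Events force the interleaving so that the redundant task (thread C) holds the lock when the transaction starts.

```python
import threading

LOCK_TIMEOUT_SECS = 0.005  # the 5ms transaction lock timeout from the description

db_lock = threading.Lock()        # stand-in for the X lock on the 'test' database
dropped = threading.Event()       # set once the collection has been dropped
c_holds_lock = threading.Event()  # signals that thread C has taken the lock
release_c = threading.Event()     # lets thread C release the lock
events = []


def complete_drop(name, hold=False):
    """Task scheduled by one drop-pending notification (hypothetical model)."""
    with db_lock:
        if not dropped.is_set():
            dropped.set()
            events.append(f"{name} drops the collection")
        else:
            events.append(f"{name} finds nothing left to drop")
        if hold:
            c_holds_lock.set()
            release_c.wait()  # keep holding the lock while the transaction starts


# Steps 2-3: thread B is notified and completes the drop.
b = threading.Thread(target=complete_drop, args=("B",))
b.start()
b.join()

# Step 5: the drop is complete, so thread A's w:majority drop returns here.

# Steps 4 and 6: the redundant notification (thread C) runs late and takes the lock.
c = threading.Thread(target=complete_drop, args=("C", True))
c.start()
c_holds_lock.wait()

# Step 7: thread A's transaction tries to take the lock with the 5ms timeout.
got_lock = db_lock.acquire(timeout=LOCK_TIMEOUT_SECS)
if got_lock:
    db_lock.release()
events.append("A's transaction " + ("proceeds" if got_lock else "times out on the lock"))

release_c.set()
c.join()
```

In this model thread C does no useful work, since the collection is already gone, yet taking the lock is enough to make the transaction time out.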


 Comments   
Comment by Eric Milkie [ 12/Mar/20 ]

Thanks for the analysis Judah – after some deliberation we've decided not to do any further work on this, as it only affects our testing.
