[SERVER-79123] Tests enabling skipWriteConflictRetries failpoint incompatible with index build abort Created: 19/Jul/23  Updated: 29/Oct/23  Resolved: 29/Aug/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.2.0-rc0

Type: Bug
Priority: Major - P3
Reporter: Yujin Kang Park
Assignee: Benety Goh
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
is related to SERVER-49396 Only activate skipWriteConflictRetrie... Closed
is related to SERVER-40062 Add failpoint to skip doing retries o... Closed
is related to SERVER-73292 Add internal voteAbortIndexBuild command Closed
is related to SERVER-40103 Improve retry behavior of non-txn wri... Closed
is related to SERVER-40105 Improve diagnostic information in cur... Closed
Assigned Teams:
Storage Execution NAMER
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Execution NAMR Team 2023-09-04
Participants:
Linked BF Score: 5

 Description   

Index builds handle WriteConflicts while aborting, and the abort operation is not expected to fail: a failure means the index build ended up partially torn down, which triggers this fassert.

The initial sync fuzzer, and potentially other tests, enable the skipWriteConflictRetries failpoint and can therefore trigger the fassert.
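The interaction described above can be illustrated with a toy model. This is only a sketch and not MongoDB server code: the names `write_conflict_retry`, `abort_index_build`, `make_flaky_teardown`, and `FatalAssertion` are hypothetical, and the bounded `max_attempts` is an assumption made so the example terminates.

```python
class WriteConflict(Exception):
    """Stand-in for a storage-engine write conflict (hypothetical)."""


class FatalAssertion(Exception):
    """Stand-in for fassert; a real server would terminate the process."""


def write_conflict_retry(op, skip_retries=False, max_attempts=10):
    """Run op, retrying on WriteConflict unless skip_retries is set."""
    for _ in range(max_attempts):
        try:
            return op()
        except WriteConflict:
            if skip_retries:
                # Models the failpoint: surface the conflict instead of retrying.
                raise
    raise WriteConflict("still conflicting after retries")


def make_flaky_teardown(conflicts):
    """Return a teardown step that conflicts `conflicts` times, then succeeds."""
    remaining = [conflicts]

    def teardown():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise WriteConflict("conflict during teardown")
        return "torn down"

    return teardown


def abort_index_build(teardown, skip_retries):
    """Abort path: a failure here leaves the build half torn down, so it is fatal."""
    try:
        return write_conflict_retry(teardown, skip_retries=skip_retries)
    except WriteConflict as exc:
        raise FatalAssertion("index build abort failed") from exc
```

In this model, `abort_index_build(make_flaky_teardown(2), skip_retries=False)` retries past the two conflicts and completes, while the same call with `skip_retries=True` raises `FatalAssertion` on the first conflict.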



 Comments   
Comment by Githook User [ 29/Aug/23 ]

Author: Benety Goh <benety@mongodb.com> (username: benety)

Message: SERVER-79123 add warning message during index build abort for write conflict in test environment
Branch: master
https://github.com/mongodb/mongo/commit/3aeaec970711c7f9483d881bb867d5c66e01db92

Comment by Max Hirschhorn [ 25/Aug/23 ]

benety.goh@mongodb.com, I'm not sure I understand. Won't the change you're proposing mean BF-28966 cannot be resolved because the fassert() will still occur?

Comment by Benety Goh [ 25/Aug/23 ]

max.hirschhorn@mongodb.com, in this case the index build abort, initiated through the voteAbortIndexBuild command, is considered a user operation in the context of SERVER-49396. It would have been fine to relay the WriteConflict to the caller if not for this fassert in the index abort callstack.

Since the skipWriteConflictRetries mechanism is critical to our rollback and initial sync fuzzer framework, and since the issue in this ticket does not affect production deployments, yujin.kang@mongodb.com and I are considering adding a warning message before the fatal assertion to help developers diagnose build failures with this exact failure.

Comment by Max Hirschhorn [ 22/Aug/23 ]

benety.goh@mongodb.com, it isn't important that the exception propagates as a WriteConflict back to the client. What is important is for the initial sync fuzzer and rollback fuzzer to receive EWOULDBLOCK-like errors when running their operations and avoid ending up in a test-induced deadlock.

I didn't fully understand the explanation for how index builds are impacted by the enablement of the skipWriteConflictRetries failpoint. Would it be possible to extend the work done under SERVER-49396 to restrict the activation of the skipWriteConflictRetries failpoint to only user operations and skip the index build voting-related commands? If changing the server behavior this way is undesirable then perhaps the skipWriteConflictRetries failpoint can be made specific to a particular Client similar to the checkForInterruptFail failpoint? The rollback fuzzer will re-establish new connections and so maybe conditioning on appName like the failCommand failpoint would be a more suitable alternative if we're not considering changing the index build voting-related commands.
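As a rough illustration of the two alternatives raised in the comment above (exempting the index build voting commands, or conditioning the failpoint on appName), the gating logic might look like the sketch below. It is not server code: `FailPoint`, `should_skip_retries`, and `EXEMPT_COMMANDS` are hypothetical names, and the exempt set contains only the one command named in this ticket.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical exemption list; only voteAbortIndexBuild is named in this ticket.
EXEMPT_COMMANDS = frozenset({"voteAbortIndexBuild"})


@dataclass
class FailPoint:
    """Toy failpoint state; app_name=None means it applies to every client."""
    enabled: bool = False
    app_name: Optional[str] = None


def should_skip_retries(fp, command_name, client_app_name):
    """Decide whether write-conflict retries are skipped for this operation."""
    if not fp.enabled:
        return False
    if command_name in EXEMPT_COMMANDS:
        # Index build voting commands always keep their retry behavior.
        return False
    # Optionally restrict the failpoint to clients with a matching appName.
    return fp.app_name is None or fp.app_name == client_app_name
```

With `FailPoint(enabled=True, app_name="fuzzer")`, an `insert` from a client whose appName is `fuzzer` skips retries, while `voteAbortIndexBuild` and operations from other clients retry as usual.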

Comment by Benety Goh [ 22/Aug/23 ]

max.hirschhorn@mongodb.com, do you think this fail point still makes sense to keep around today? It seems to trigger behavior that is out of line with how the server deals with WriteConflictExceptions internally, in that we generally don't expect WriteConflictExceptions to bubble up to the user. CC: geert.bosch@mongodb.com, josef.ahmad@mongodb.com

Comment by Benety Goh [ 22/Aug/23 ]

This fail point was added in SERVER-40062. From this comment on SERVER-40062, it looks like the fail point was introduced to work around SERVER-40103 and SERVER-40105, which have since been resolved.

Generated at Thu Feb 08 06:40:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.