[SERVER-55669] [SBE][replica_sets] Hang because rollback id won't increment Created: 31/Mar/21  Updated: 15/Apr/21  Resolved: 15/Apr/21

Status: Closed
Project: Core Server
Component/s: Query Execution, Replication
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Kyle Suarez Assignee: Justin Seyster
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File recover_prepared_transaction_state.txt     Text File rollback_prepare_transaction.txt    
Issue Links:
Duplicate
duplicates SERVER-55676 [SBE][replica_sets] Tests trigger fat... Closed
Related
related to SERVER-55009 Enable replica_sets suites against SB... Closed
Sprint: Query Execution 2021-04-19
Participants:

 Description   

For both rollback_prepare_transaction.js and recover_prepared_transaction_state.js, the RollbackTest is stuck in a state where the rollback id doesn't get past 1:

[js_test:recover_prepared_transaction_state] assert.soonNoExcept caught exception, exception: Error: assert failed : Expected RBID to increment past 1 on ip-10-122-74-79.ec2.internal:21020

The attached logs include hang analyzer output.



 Comments   
Comment by Justin Seyster [ 15/Apr/21 ]

I verified locally that both rollback_prepare_transaction.js and recover_prepared_transaction_state.js pass now that SERVER-55676 has been pushed.

Doing some more digging, I see that the exception that TransactionHistoryIterator was throwing does eventually get caught and logged here:

https://github.com/mongodb/mongo/blob/513af98cb33e736d885a2a1b159685c335492cf0/src/mongo/db/repl/bgsync.cpp#L218-L223

It would be nicer to log it in the same place that other rollback errors are logged, but I don't have an immediate plan as to how to go about that.

Comment by Kyle Suarez [ 12/Apr/21 ]

If switching from throwing to returning a Status would improve the error message, I think that's definitely worth doing, as the failure is rather obscure from the logs. But if there isn't an easy way to improve this failure mode's readability, then I'd be fine with closing this as a duplicate of SERVER-55676.

Comment by Justin Seyster [ 12/Apr/21 ]

This failure is caused by the NotPrimaryOrSecondary exception described in SERVER-55676. Rollback fails because RollbackImpl::_findCommonPoint fails with this exception.

https://github.com/mongodb/mongo/blob/4741ef2d5ca6a0809c2569cc55142bb1f7ed7547/src/mongo/db/repl/rollback_impl.cpp#L210

 

It looks like that line of code is expecting _findCommonPoint() to return an error Status rather than throw an exception. To help diagnose problems of this nature in the future, perhaps we should investigate how to make _findCommonPoint() report all its errors the same way (either with a Status or an Exception).
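
As a rough illustration of the "report all errors the same way" idea, the sketch below wraps a call that may either return an error status or throw, and funnels both outcomes into a single status value the caller can check and log. Everything here is hypothetical: SketchStatus, toStatusNoThrow, and findCommonPointSketch are made-up stand-ins for illustration, not MongoDB's Status class or the real RollbackImpl::_findCommonPoint(), and the actual fix may look quite different.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a status type; NOT MongoDB's actual Status class.
struct SketchStatus {
    bool ok;
    std::string reason;

    static SketchStatus OK() { return {true, ""}; }
    static SketchStatus Error(std::string why) { return {false, std::move(why)}; }
};

// Runs `fn` (which must return a SketchStatus) and converts any exception it
// throws into an error SketchStatus, so callers see one error channel only.
template <typename Fn>
SketchStatus toStatusNoThrow(Fn&& fn) {
    try {
        return std::forward<Fn>(fn)();
    } catch (const std::exception& ex) {
        return SketchStatus::Error(ex.what());
    } catch (...) {
        return SketchStatus::Error("unknown exception");
    }
}

// Hypothetical step that mixes both reporting styles, as an assumption about
// the kind of problem described above: mode 1 returns an error status, mode 2
// throws, anything else succeeds.
SketchStatus findCommonPointSketch(int mode) {
    if (mode == 1) {
        return SketchStatus::Error("no common point");
    }
    if (mode == 2) {
        throw std::runtime_error("NotPrimaryOrSecondary");
    }
    return SketchStatus::OK();
}
```

With a wrapper like this, a caller that only inspects the returned status would also see failures that were originally thrown, so they could be logged alongside the other rollback errors instead of surfacing further up the call stack.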

Generated at Thu Feb 08 05:37:07 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.