[GODRIVER-2950] Possible hang in call to session.AbortTransaction Created: 21/Aug/23 Updated: 27/Oct/23 Resolved: 27/Sep/23 |
|
| Status: | Closed |
| Project: | Go Driver |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Dave Rolsky | Assignee: | Preston Vasquez |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||
| Description |
SummaryIn on our tests for mongosync, we saw an odd hang. From my investigation it seems like a call to session.WithTransaction ran the provided callback func. That func ran and returned an error. But then execution of our code never continued after that point. Looking at the driver code, the only thing I can see that might hang is the call to session.AbortTransaction that is done when the callback func returns an error. The Mongosync ticket for the failure is https://jira.mongodb.org/browse/REP-3058. Please provide the version of the driver. If applicable, please provide the MongoDB server version and topology (standalone, replica set, or sharded cluster).We are using v1.11.9 of the driver. Here's the log line from the Server startup:
How to ReproduceUnfortunately, we don't have an easy recipe to reproduce this. |
| Comments |
| Comment by PM Bot [ 27/Sep/23 ] | |||||||||||||||||
|
There hasn't been any recent activity on this ticket, so we're resolving it. Thanks for reaching out! Please feel free to reopen this ticket if you're still experiencing the issue, and add a comment if you're able to provide more information. | |||||||||||||||||
| Comment by PM Bot [ 19/Sep/23 ] | |||||||||||||||||
|
Hi dave.rolsky@mongodb.com! If this is still an issue for you, please open Jira to review the latest status and provide your feedback. Thanks! | |||||||||||||||||
| Comment by Preston Vasquez [ 11/Sep/23 ] | |||||||||||||||||
|
Waiting to receive logs for recurrence of issues after mongosync PR1934 is merged | |||||||||||||||||
| Comment by Dave Rolsky [ 01/Sep/23 ] | |||||||||||||||||
|
Hi Preston, I added you as a PR reviewer for a PR that adds logging to Mongosync relevant to this case. Can you take a look at the PR and see if it logs everything you'd want to see? | |||||||||||||||||
| Comment by Preston Vasquez [ 25/Aug/23 ] | |||||||||||||||||
|
rohan.sharan@mongodb.com The goals are to determine if (1) Go Driver sent an abort transaction wire message to the server, The following would be a good start:
I don't expect sessions to abort often, so those events shouldn't cause log saturation. I'm not sure about ending sessions. | |||||||||||||||||
| Comment by Rohan Sharan [ 25/Aug/23 ] | |||||||||||||||||
|
Potentially, what filters would we add? If you tell us what you're looking for, and how to add the filters, we should be able to. I am hesitant to log all CommandSuccceededEvent but if you tell us what we need to add, that'll be fine. You can see here: https://github.com/10gen/mongosync/blob/main/internal/db/db.go#L123 one instance of me avoiding logging failed commands in one expected case. | |||||||||||||||||
| Comment by Preston Vasquez [ 25/Aug/23 ] | |||||||||||||||||
|
rohan.sharan@mongodb.com Can you add filters for this specific use case? | |||||||||||||||||
| Comment by Rohan Sharan [ 25/Aug/23 ] | |||||||||||||||||
|
preston.vasquez@mongodb.com I thought we mentioned that adding that message will 100x our logs? Isn't that still the case? | |||||||||||||||||
| Comment by Dave Rolsky [ 25/Aug/23 ] | |||||||||||||||||
|
Yes, that I can do. | |||||||||||||||||
| Comment by Preston Vasquez [ 25/Aug/23 ] | |||||||||||||||||
|
dave.rolsky@mongodb.com , since the server receives the abort transaction for the session, it is unclear to us where the Go Driver could be blocking. If we can't roll back the version, then can the mongosync team extend command monitor logging here to include CommandSuccceededEvent. This will definitively tell us that the Go Driver sent the abort transaction wire message and narrow the scope of where we could be blocking. | |||||||||||||||||
| Comment by Rohan Sharan [ 25/Aug/23 ] | |||||||||||||||||
|
We can't really downgrade, agreed. | |||||||||||||||||
| Comment by Dave Rolsky [ 25/Aug/23 ] | |||||||||||||||||
|
preston.vasquez@mongodb.com We upgraded to 1.19 recently because it fixed a bug that caused customer-reported issues. If we downgrade I'm concerned that would re-introduce old failures for us in CI. rohan.sharan@mongodb.com You worked on this in REP-3012. What do you think? | |||||||||||||||||
| Comment by Preston Vasquez [ 24/Aug/23 ] | |||||||||||||||||
|
rohan.sharan@mongodb.com dave.rolsky@mongodb.com Can the mongosync team roll back the Go Driver version to 1.11.7 to see if this problem persists? | |||||||||||||||||
| Comment by Preston Vasquez [ 23/Aug/23 ] | |||||||||||||||||
|
Notes from discussing offline with rohan.sharan@mongodb.com : The lsid for the transaction is associated with an abort transaction wire message, here: https://parsley.mongodb.com/resmoke/5431f0e05b713e21e0c30d370e9a3029/all?bookmarks=0,64514,66919,66952,66955,67002,67963,67964,67966,68480,70008,70016,78861&filterLogic=or&filters=000transient&shareLine=68479
It also doesn't appear that the transaction error is transient / that the operation should retry. Only two things happen in this session: 1. A command WM is sent to the server, lasting 300,000 ms (5 minutes) Other things to note: 1. mongosync uses a transactionLifetimeLimitSeconds of 30 seconds
| |||||||||||||||||
| Comment by Dave Rolsky [ 23/Aug/23 ] | |||||||||||||||||
|
Yes, I'm talking about WithTransaction. I know it's returning an error because it logs the "Giving up on batch CRUD event application because of an error from a transaction." message. If we see that we know it's about to return an error. But maybe I'm not understanding your question? If it'd help to do this over Zoom with screen sharing I'd be happy to do that. | |||||||||||||||||
| Comment by Preston Vasquez [ 23/Aug/23 ] | |||||||||||||||||
|
dave.rolsky@mongodb.com rohan.sharan@mongodb.com For this:
Are you referring to WithTransaction? How do you know it is returning an error? IIUC the error check never get's executed: https://github.com/10gen/mongosync/blob/main/internal/mongosync/cea/crud_event_applier.go#L292 , i.e. "Immediately after the shared line, we should see the following log". | |||||||||||||||||
| Comment by Rohan Sharan [ 23/Aug/23 ] | |||||||||||||||||
|
Here is an example patch where it hangs: https://parsley.mongodb.com/resmoke/5431f0e05b713e21e0c30d370e9a3029/all?bookmarks=0,64514,66919,66952,66955,67963,67964,67966,70008,78861&lower=67966&shareLine=70008
Immediately after the shared line, we should see the following log: https://github.com/10gen/mongosync/blob/main/internal/mongosync/cea/crud_event_applier.go#L296 but we don't. This is happening pretty frequently, so we'd appreciate this getting picked up soon |