[SERVER-48396] SIGUSR2 stack collection closes connections Created: 25/May/20 Updated: 29/Oct/23 Resolved: 30/May/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Diagnostics, Networking |
| Affects Version/s: | 4.4.0-rc6 |
| Fix Version/s: | 4.4.0-rc11, 4.7.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Billy Donahue |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||
| Backport Requested: |
v4.4
|
||||||||||||||||||||
| Participants: | |||||||||||||||||||||
| Description |
|
I ran the repro script in
accompanied by messages from the mongo shell like
The mongod logs tell a similar story:
The blue markers on the timeline show the points at which SIGUSR2 was received. These are accompanied by some number of connections ended (red curve) and a smaller number of connections accepted (blue curve), resulting in a net decrease of connections each time (green curve) I wonder if we might not be re-trying network operations when they return the EINTR that would result from SIGUSR2. |
| Comments |
| Comment by Githook User [ 22/Jun/20 ] | |||||||
|
Author: {'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}Message: Add SA_RESTART to stacktrace_thread handler This is a partial cherry-pick, we DO NOT modify (cherry picked from commit 8ea7299541bdb315926d7fe17373c0387645fae6) | |||||||
| Comment by Billy Donahue [ 01/Jun/20 ] | |||||||
update Oh maybe I get it. Maybe there IS no asio::read operation to cancel anymore. As a caller, you had your chance. The asio::read operation you canceled is over, and it returned interrupted. If you want to call asio::read again (as a retry of the interrupted call, that's on you, and you'll have to cancel the new one if you want to cancel something. Maybe that's the problem. | |||||||
| Comment by Benjamin Caimano (Inactive) [ 01/Jun/20 ] | |||||||
|
It isn't about a single call to asio::read() but the space in between the calls. Imagine you experience an interruption and return asio::error::interrupted on the networking thread but go no further. Then another thread attempts to call asio::basic_stream_socket::cancel(). Since you haven't called asio::read() again yet, and there isn't any enforcement of sequence, there is nothing to cancel so we just go back into blocking until another event happens. It can be a long long time until that next event. | |||||||
| Comment by Billy Donahue [ 01/Jun/20 ] | |||||||
|
Can you explain the problem you're anticipating? I wrote this fix thinking that cancellation was accounted for in asio's low-level recv/send loops, so the async asio::read/write would return boost::asio::error::operation_aborted instead of boost::asio::error::interrupted.
| |||||||
| Comment by Benjamin Caimano (Inactive) [ 01/Jun/20 ] | |||||||
|
I believe this adds an extra opportunity to miss asio cancelation. We already had this problem from batons, so I'll make sure I fix this spot as well. | |||||||
| Comment by Githook User [ 29/May/20 ] | |||||||
|
Author: {'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}Message: Add SA_RESTART to stacktrace_thread handler | |||||||
| Comment by Billy Donahue [ 26/May/20 ] | |||||||
| Comment by Billy Donahue [ 26/May/20 ] | |||||||
|
These connection threads are all dying from syscall recvmsg returning an EINTR. We can see that in the attached log. Now, according to the man page signal(7), whether the SA_RESTART flag will help or not depends on whether the sockets doing these recvmsg calls have a timeout. If the socket has a receive timeout then SA_RESTART, doesn't matter, and they'll still get the EINTR. We might look into the code and prove to ourselves that SA_RESTART will help because perhaps they all DO have a timeout set, but that's not a durable fix. We might need to get into our Asio calls and actually deal with the EINTR in an upfront way. From man signal(7):
But then...
So I'm going to look into src/mongo/transport/session_asio.h and see if I can add some EINTR resilience. I see a socket_ops::send also in the stacktrace. Writes might need the same treatment. If they aren't fixed, these calls will also fail if the process is stopped by anything, like a ^Z or a debugger. | |||||||
| Comment by Billy Donahue [ 26/May/20 ] | |||||||
|
Bruce the SA_RESTART idea sounds right. EVERY thread in the process will get a SIGUSR2 during multithread stacktrace collection. It is a little troubling to me that the EINTR caused by our SIGUSR2 isn't handled gracefully though. But this SIGUSR2 isn't intended to cancel or interrupt any syscalls, and it's purely passive instrumentation, so the SA_RESTART is appropriate here. | |||||||
| Comment by Bruce Lucas (Inactive) [ 26/May/20 ] | |||||||
|
I wonder if adding SA_RESTART to sa_flags in the sigaction call that establishes the SIGUSR2 handler would address this. |