[SERVER-48396] SIGUSR2 stack collection closes connections Created: 25/May/20  Updated: 29/Oct/23  Resolved: 30/May/20

Status: Closed
Project: Core Server
Component/s: Diagnostics, Networking
Affects Version/s: 4.4.0-rc6
Fix Version/s: 4.4.0-rc11, 4.7.0

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Billy Donahue
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File db.log     PNG File sigusr2.png    
Issue Links:
Backports
Related
related to SERVER-48395 Extended stalls during heavy insert w... Closed
related to SERVER-33445 Add signal handler to generate stack ... Closed
is related to SERVER-47229 Make TransportSessionASIO cancelation... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Participants:

 Description   

I ran the repro script in SERVER-48395 while sending mongod a SIGUSR2 every 10 seconds to collect stack traces. This resulted in network errors and failed operations at the application:

Error: error doing query: failed: network error while attempting to run command 'insert' on host '127.0.0.1:27017'  :

accompanied by messages from the mongo shell like

{"t":{"$date":"2020-05-25T15:07:29.723Z"},"s":"I",  "c":"NETWORK",  "id":20120,   "ctx":"js","msg":"Trying to reconnnect","attr":{"connString":"127.0.0.1:27017 failed"}}
{"t":{"$date":"2020-05-25T15:07:29.724Z"},"s":"I",  "c":"NETWORK",  "id":20125,   "ctx":"js","msg":"DBClientConnection failed to receive message","attr":{"connString":"127.0.0.1:27017","error":"HostUnreachable: Connection closed by peer"}}

The mongod logs tell a similar story:

The blue markers on the timeline show the points at which SIGUSR2 was received. These are accompanied by some number of connections ended (red curve) and a smaller number of connections accepted (blue curve), resulting in a net decrease of connections each time (green curve).

I wonder whether we are failing to retry network operations when they return the EINTR that results from SIGUSR2.



 Comments   
Comment by Githook User [ 22/Jun/20 ]

Author:

{'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}

Message: SERVER-48396 stop SIGUSR2 from killing connections

Add SA_RESTART to stacktrace_thread handler

This is a partial cherry-pick, we DO NOT modify
asio::read or asio::write calls as the
original did

(cherry picked from commit 8ea7299541bdb315926d7fe17373c0387645fae6)
Branch: v4.4
https://github.com/mongodb/mongo/commit/be486f8f49efcc706a53a072cb5527e1200a49cf

Comment by Billy Donahue [ 01/Jun/20 ]

It isn't about a single call to asio::read() but the space in between the calls. Imagine you experience an interruption and return asio::error::interrupted on the networking thread but go no further. Then another thread attempts to call asio::basic_stream_socket::cancel(). Since you haven't called asio::read() again yet, and there isn't any enforcement of sequence, there is nothing to cancel so we just go back into blocking until another event happens. It can be a long long time until that next event.

 
Who's the "we" that goes back into blocking? I'm thinking the thread that returned interrupted is going to immediately do-loop back into asio::read, at which point asio would see that the "cancellation token" for the read operation has expired, so the read would return immediately without blocking at all, but this time return the operation_aborted code. I mean I haven't looked at the Asio source for this but I would hope that's how it works. Maybe something about expiring the cancellation token doesn't propagate to the read operation in time to stop it from blocking? When you say "there isn't any enforcement of sequence" I have to wonder whether Asio cancellation works at all if that's the case. Is this an Asio synchronization bug or what? What do you mean about lack of sequencing exactly? Interesting issue.

update Oh, maybe I get it. Maybe there IS no asio::read operation to cancel anymore. As a caller, you had your chance. The asio::read operation you canceled is over, and it returned interrupted. If you want to call asio::read again (as a retry of the interrupted call), that's on you, and you'll have to cancel the new one if you want to cancel something. Maybe that's the problem.

Comment by Benjamin Caimano (Inactive) [ 01/Jun/20 ]

It isn't about a single call to asio::read() but the space in between the calls. Imagine you experience an interruption and return asio::error::interrupted on the networking thread but go no further. Then another thread attempts to call asio::basic_stream_socket::cancel(). Since you haven't called asio::read() again yet, and there isn't any enforcement of sequence, there is nothing to cancel so we just go back into blocking until another event happens. It can be a long long time until that next event.

Comment by Billy Donahue [ 01/Jun/20 ]

Can you explain the problem you're anticipating? I wrote this fix thinking that cancellation was accounted for in asio's low-level recv/send loops, so the async asio::read/write would return boost::asio::error::operation_aborted instead of boost::asio::error::interrupted.

 

Comment by Benjamin Caimano (Inactive) [ 01/Jun/20 ]

I believe this adds an extra opportunity to miss asio cancelation. We already had this problem from batons, so I'll make sure I fix this spot as well.

Comment by Githook User [ 29/May/20 ]

Author:

{'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}

Message: SERVER-48396 stop SIGUSR2 from killing connections

Add SA_RESTART to stacktrace_thread handler
Restart asio::read and asio::write on EINTR
Branch: master
https://github.com/mongodb/mongo/commit/8ea7299541bdb315926d7fe17373c0387645fae6

Comment by Billy Donahue [ 26/May/20 ]

CR http://mongodbcr.appspot.com/614080001

Comment by Billy Donahue [ 26/May/20 ]

These connection threads are all dying from syscall recvmsg returning an EINTR. We can see that in the attached log.

Now, according to the man page signal(7), whether the SA_RESTART flag will help or not depends on whether the sockets doing these recvmsg calls have a timeout. If the socket has a receive timeout then SA_RESTART doesn't matter, and they'll still get the EINTR. We might look into the code and prove to ourselves that SA_RESTART will help because perhaps they all DO have a timeout set, but that's not a durable fix. We might need to get into our Asio calls and actually deal with the EINTR in an upfront way.

From man signal(7):

If a blocked call to one of the following interfaces is interrupted by a signal handler, then the call is automatically restarted after the signal handler returns if the SA_RESTART flag was used; otherwise the call fails with the error EINTR:
...
       * Socket interfaces: accept(2), connect(2), recv(2), recvfrom(2), recvmmsg(2), recvmsg(2), send(2), sendto(2), and sendmsg(2), unless a timeout has been set on the socket (see below).

But then...

The following interfaces are never restarted after being interrupted by a signal handler, regardless of the use of SA_RESTART; they always fail with the error EINTR when interrupted by a signal handler:
...
       * "Input" socket interfaces, when a timeout (SO_RCVTIMEO) has been set on the socket using setsockopt(2): accept(2), recv(2), recvfrom(2), recvmmsg(2) (also with a non-NULL timeout argument), and recvmsg(2).

So I'm going to look into src/mongo/transport/session_asio.h and see if I can add some EINTR resilience.

I see a socket_ops::send also in the stacktrace. Writes might need the same treatment.

If they aren't fixed, these calls will also fail if the process is stopped by anything, like a ^Z or a debugger.

Comment by Billy Donahue [ 26/May/20 ]

Bruce, the SA_RESTART idea sounds right.

EVERY thread in the process will get a SIGUSR2 during multithread stacktrace collection.

It is a little troubling to me that the EINTR caused by our SIGUSR2 isn't handled gracefully though. But this SIGUSR2 isn't intended to cancel or interrupt any syscalls, and it's purely passive instrumentation, so the SA_RESTART is appropriate here.

Comment by Bruce Lucas (Inactive) [ 26/May/20 ]

I wonder if adding SA_RESTART to sa_flags in the sigaction call that establishes the SIGUSR2 handler would address this.

Generated at Thu Feb 08 05:17:00 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.