[SERVER-17043] Replication code does not retry interrupted system call
Created: 25/Jan/15  Updated: 18/Sep/15  Resolved: 17/Feb/15

Status: Closed
Project: Core Server
Component/s: Networking, Replication
Affects Version/s: 3.0.0-rc6
Fix Version/s: 3.0.0-rc9, 3.1.0

Type: Bug
Priority: Major - P3
Reporter: Bruce Lucas (Inactive)
Assignee: Andrew Morrow (Inactive)
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: mongod.log (text file)
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Participants:

 Description   

While using gdb to debug an issue involving a replica set, I got the following error:

2015-01-25T16:10:05.936-0500 W NETWORK  [ReplExecNetThread-0] Failed to connect to 127.0.0.1:27117, reason: errno:4 Interrupted system call
2015-01-25T16:10:05.936-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to localhost:27117; Location18915 Failed attempt to connect to localhost:27117; couldn't connect to server localhost:27117 (127.0.0.1), connection attempt failed
2015-01-25T16:10:05.936-0500 I REPL     [ReplicationExecutor] can't see a majority of the set, relinquishing primary
2015-01-25T16:10:05.936-0500 I REPL     [ReplicationExecutor] Stepping down from primary in response to heartbeat
2015-01-25T16:10:05.936-0500 I REPL     [replCallbackWithGlobalLock-0] transition to SECONDARY



 Comments   
Comment by Githook User [ 19/Feb/15 ]

Author: Andrew Morrow (acmorrow) <acm@mongodb.com>

Message: SERVER-17043 Reattempt failed socket connect when errno is EINTR

(cherry picked from commit 490b0b2b14fa14b463ef612b79386d20d95b4057)
Branch: v3.0
https://github.com/mongodb/mongo/commit/77c54935ee0890fb2c00ef84727b0b5ca5a1e9a1

Comment by Githook User [ 17/Feb/15 ]

Author: Andrew Morrow (acmorrow) <acm@mongodb.com>

Message: SERVER-17043 Reattempt failed socket connect when errno is EINTR
Branch: master
https://github.com/mongodb/mongo/commit/490b0b2b14fa14b463ef612b79386d20d95b4057

Comment by Bruce Lucas (Inactive) [ 10/Feb/15 ]

The problem is in ConnectBG::run in sock.cpp: the call to ::connect() isn't retried when it fails with EINTR.
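
For illustration only, here is a minimal sketch of one way to handle EINTR around ::connect(). It is not the code from sock.cpp or from the eventual fix; the function name connectHandlingEintr and its timeoutMs parameter are hypothetical. It assumes a blocking POSIX socket. On POSIX systems an interrupted connect() keeps going in the background, so the sketch waits for the socket to become writable and reads SO_ERROR instead of calling connect() a second time.

```cpp
#include <cerrno>
#include <poll.h>
#include <sys/socket.h>

// Hypothetical helper (not part of the MongoDB source).
// Returns 0 once the connection is established, or -1 with errno set.
int connectHandlingEintr(int fd, const sockaddr* addr, socklen_t len, int timeoutMs) {
    if (::connect(fd, addr, len) == 0) {
        return 0;  // connected on the first attempt
    }
    if (errno != EINTR) {
        return -1;  // a real failure; errno already describes it
    }

    // A signal interrupted connect(). The attempt is still in progress,
    // so wait until the socket reports writable (or an error).
    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLOUT;
    int rc;
    do {
        // Note: the timeout restarts if poll() itself is interrupted.
        rc = ::poll(&pfd, 1, timeoutMs);
    } while (rc == -1 && errno == EINTR);

    if (rc == 0) {
        errno = ETIMEDOUT;  // nothing happened within timeoutMs
        return -1;
    }
    if (rc == -1) {
        return -1;  // poll() failed for a reason other than EINTR
    }

    // Fetch the final status of the asynchronous connection attempt.
    int soErr = 0;
    socklen_t soLen = sizeof(soErr);
    if (::getsockopt(fd, SOL_SOCKET, SO_ERROR, &soErr, &soLen) == -1) {
        return -1;
    }
    if (soErr != 0) {
        errno = soErr;
        return -1;
    }
    return 0;
}
```

With something along these lines, a signal delivered while a debugger such as gdb is attached would no longer surface as a failed connection attempt to the caller.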

Comment by Bruce Lucas (Inactive) [ 10/Feb/15 ]

Log attached. There is nothing in the log before the complaint about the interrupted system call, which is what causes the primary to step down. A couple of other system calls report being interrupted and retried.

Comment by Scott Hernandez (Inactive) [ 25/Jan/15 ]

Please include the logs from before this point. If this happened after the timeout period, then it will not be retried, which is the designed and intended behavior.

Generated at Thu Feb 08 03:43:07 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.