[SERVER-26215] Initial sync fails due to false time out if index creation takes longer than X hours Created: 21/Sep/16  Updated: 08/Jan/24  Resolved: 05/May/17

Status: Closed
Project: Core Server
Component/s: Index Maintenance, Networking, Replication
Affects Version/s: 3.2.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Matthieu Rigal Assignee: Mira Carey
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-28710 vectorized send() should handle EWOUL... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Create a collection that is big enough (several hundred million entries), with enough indexes to keep the system busy for more than 3-4 hours, then run an initial sync; it will fail.

Participants:

 Description   

Hi guys,

We recently had to redo some syncs of our replica sets, and we found a reproducible way to make the initial sync fail.

We have several collections to be synced. The sync permanently failed right after the index build of the largest collection: when it tries to get the indexes for the next collection, it fails every time. We had to fall back to rsync.

Between normal-sized collections it was working, but not for this big one, and a later retry gave the same result. As you will see from the very narrow timestamps, the error must lie in the network component. The previous call was probably older than the timeout threshold set for this operation, and instead of resetting the timeout when the process actually makes the request, it takes the previous, now very old timestamp and thus times out.

Here is a part of the log

## rs0
2016-08-27T18:49:09.004+0000 I -        [rsSync]   Index: (2/3) BTree Bottom Up Progress: 619110400/631007639 98%
2016-08-27T18:49:16.218+0000 I INDEX    [rsSync]         done building bottom layer, going to commit
2016-08-27T18:49:17.341+0000 I INDEX    [rsSync] build index done.  scanned 631007639 total records. 16912 secs
2016-08-27T18:49:17.342+0000 I NETWORK  [rsSync] Socket say send() errno:110 Connection timed out 10.0.0.24:27017
2016-08-27T18:49:17.349+0000 E REPL     [rsSync] 9001 socket exception [SEND_ERROR] server [10.0.0.24:27017]
2016-08-27T18:49:17.349+0000 E REPL     [rsSync] initial sync attempt failed, 9 attempts remaining
2016-08-27T18:49:22.349+0000 I REPL     [rsSync] initial sync pending
2016-08-27T18:49:22.350+0000 I REPL     [ReplicationExecutor] syncing from: MONGO-RS1-1:27017
2016-08-27T18:49:22.421+0000 I REPL     [rsSync] initial sync drop all databases
 
## rs1, same (shorter excerpt)
2016-08-27T18:31:09.869+0000 I INDEX    [rsSync]         done building bottom layer, going to commit
2016-08-27T18:31:10.819+0000 I INDEX    [rsSync] build index done.  scanned 612324609 total records. 16342 secs
2016-08-27T18:31:10.819+0000 I NETWORK  [rsSync] Socket say send() errno:110 Connection timed out 10.0.0.43:27017
2016-08-27T18:31:10.830+0000 E REPL     [rsSync] 9001 socket exception [SEND_ERROR] server [10.0.0.43:27017]
2016-08-27T18:31:10.830+0000 E REPL     [rsSync] initial sync attempt failed, 9 attempts remaining
 
## rs2, same (shorter excerpt)
2016-08-27T17:18:19.030+0000 I INDEX    [rsSync]         done building bottom layer, going to commit
2016-08-27T17:18:20.245+0000 I INDEX    [rsSync] build index done.  scanned 523205629 total records. 40686 secs
2016-08-27T17:18:20.246+0000 I STORAGE  [rsSync] copying indexes for: { name: "reports", options: {} }
2016-08-27T17:18:20.246+0000 I NETWORK  [rsSync] Socket say send() errno:110 Connection timed out 10.0.0.68:27017
2016-08-27T17:18:20.260+0000 E REPL     [rsSync] 9001 socket exception [SEND_ERROR] server [10.0.0.68:27017]
2016-08-27T17:18:20.260+0000 E REPL     [rsSync] initial sync attempt failed, 9 attempts remaining



 Comments   
Comment by Matthieu Rigal [ 05/May/17 ]

Hi @jason.carey

Thanks for pursuing your investigation! We were a bit astonished that the issue was not tackled at all before, but better late than never.

Regards,
Matthieu

Comment by Mira Carey [ 04/May/17 ]

MRigal,

I believe that the behavior you saw was fixed in SERVER-28710. It performs a fix somewhat in line with my previous analysis, although the more pressing problem was probably a failure to handle EINTR correctly, rather than EWOULDBLOCK.

It is slated for release with 3.2.14. Or, if an upgrade is an option, SERVER-28710 doesn't affect the current stable release.

Regards,
Jason

Comment by Mira Carey [ 23/Mar/17 ]

After some staring, I think this may actually be a bug in the networking code (specific at this point to 3.2 and earlier).

While I haven't dug into the actual repl side of this, the log lines in question don't make a lot of sense. All of our actual log lines for timeouts should say:

Socket say send() remote timeout errno:110 Connection timed out 10.0.0.43:27017

instead of

Socket say send() errno:110 Connection timed out 10.0.0.43:27017

And those functions should throw SocketException::SEND_TIMEOUT, instead of SEND_ERROR.

From what I can tell, the cause is likely the sendmsg path in our Socket type. Its error handling for timeouts is different in the vectorized case than in the non-vectorized case. Specifically:

// vectorized path
if (errno != EAGAIN || _timeout == 0)

instead of

// single buffer path
if ((mongo_errno == EAGAIN || mongo_errno == EWOULDBLOCK) && _timeout != 0)

Note that the second case specifically checks for EWOULDBLOCK, which is a valid return from a timeout generated by the SO_SNDTIMEO socket option.

This doesn't matter in 3.4 and later, as that code path is no longer exercised (messages always use the single buffer version).

So in this case, it's possible that most users see EAGAIN during initial sync timeouts for vectorized messages, or that most users aren't generating vectorized messages during initial syncs (I haven't dug into which is more likely). In either case, we have confirmation that there is a reasonable code path that returns timeouts as socket errors up into the initial sync code, which I assume then fails outright where a different exception would have been retried.

Comment by Kelsey Schubert [ 23/Sep/16 ]

Hi MRigal,

Thank you for reporting this issue. We're investigating and will update this ticket when we know more.

Kind regards,
Thomas

Comment by Matthieu Rigal [ 21/Sep/16 ]

Additional comment: we also retried after modifying the TCP keepalive parameters, in both directions, without any more success: https://docs.mongodb.com/manual/faq/diagnostics/#does-tcp-keepalive-time-affect-mongodb-deployments

Comment by Matthieu Rigal [ 21/Sep/16 ]

This could (should ?) be done prior to https://jira.mongodb.org/browse/SERVER-8076

Generated at Thu Feb 08 04:11:27 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.