Details
-
Bug
-
Status: Closed
-
Major - P3
-
Resolution: Duplicate
-
3.2.9
-
None
-
None
-
Fully Compatible
-
ALL
-
Description
Hi guys,
We recently had to redo some sync of our replica sets. We could reproducibly find a way to make the initial sync fail.
We have several collections to be synced. For our largest, the sync permanently failed after the index build of the largest collection. When it tries to get the indexes for the next collection, it fails permanently. We had to use rsync.
Between normal size collections, it was working, but not for this big one. Also a later retry gave the same effect. As you will see from the very narrow timestamps, the error must lay in the network component. The previous call was probably older than the threshold set for this operation, and then instead of resetting the timeout when the process is really making the request, it takes the previous now very old timestamp and thus times out.
Here is a part of the log
## rs0
|
2016-08-27T18:49:09.004+0000 I - [rsSync] Index: (2/3) BTree Bottom Up Progress: 619110400/631007639 98%
|
2016-08-27T18:49:16.218+0000 I INDEX [rsSync] done building bottom layer, going to commit
|
2016-08-27T18:49:17.341+0000 I INDEX [rsSync] build index done. scanned 631007639 total records. 16912 secs
|
2016-08-27T18:49:17.342+0000 I NETWORK [rsSync] Socket say send() errno:110 Connection timed out 10.0.0.24:27017
|
2016-08-27T18:49:17.349+0000 E REPL [rsSync] 9001 socket exception [SEND_ERROR] server [10.0.0.24:27017]
|
2016-08-27T18:49:17.349+0000 E REPL [rsSync] initial sync attempt failed, 9 attempts remaining
|
2016-08-27T18:49:22.349+0000 I REPL [rsSync] initial sync pending
|
2016-08-27T18:49:22.350+0000 I REPL [ReplicationExecutor] syncing from: MONGO-RS1-1:27017
|
2016-08-27T18:49:22.421+0000 I REPL [rsSync] initial sync drop all databases
|
|
## rs1, same (shorter excerpt
|
2016-08-27T18:31:09.869+0000 I INDEX [rsSync] done building bottom layer, going to commit
|
2016-08-27T18:31:10.819+0000 I INDEX [rsSync] build index done. scanned 612324609 total records. 16342 secs
|
2016-08-27T18:31:10.819+0000 I NETWORK [rsSync] Socket say send() errno:110 Connection timed out 10.0.0.43:27017
|
2016-08-27T18:31:10.830+0000 E REPL [rsSync] 9001 socket exception [SEND_ERROR] server [10.0.0.43:27017]
|
2016-08-27T18:31:10.830+0000 E REPL [rsSync] initial sync attempt failed, 9 attempts remaining
|
|
## rs2, same (shorter excerpt
|
2016-08-27T17:18:19.030+0000 I INDEX [rsSync] done building bottom layer, going to commit
|
2016-08-27T17:18:20.245+0000 I INDEX [rsSync] build index done. scanned 523205629 total records. 40686 secs
|
2016-08-27T17:18:20.246+0000 I STORAGE [rsSync] copying indexes for: { name: "reports", options: {} }
|
2016-08-27T17:18:20.246+0000 I NETWORK [rsSync] Socket say send() errno:110 Connection timed out 10.0.0.68:27017
|
2016-08-27T17:18:20.260+0000 E REPL [rsSync] 9001 socket exception [SEND_ERROR] server [10.0.0.68:27017]
|
2016-08-27T17:18:20.260+0000 E REPL [rsSync] initial sync attempt failed, 9 attempts remaining
|
Attachments
Issue Links
- duplicates
-
SERVER-28710 vectorized send() should handle EWOULDBLOCK
-
- Closed
-