-
Type: Bug
-
Resolution: Duplicate
-
Priority: Major - P3
-
None
-
Affects Version/s: 3.2.9
-
Component/s: Index Maintenance, Networking, Replication
-
None
-
Fully Compatible
-
ALL
-
Hi guys,
We recently had to resync some of our replica sets, and we found a reproducible way to make the initial sync fail.
We have several collections to sync. The sync fails right after the index build of our largest collection finishes: when it tries to get the indexes for the next collection, the attempt fails, and it does so every time. We had to fall back to rsync.
Between normal-sized collections the sync worked; only this big one triggers the failure, and a later retry gave the same result. As the closely spaced timestamps show, the error most likely lies in the network component. Our guess is that the previous call was older than the timeout threshold set for this operation, and that instead of resetting the timeout when the process actually makes the request, the code takes the previous, now very old, timestamp and therefore times out.
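The guess above can be illustrated with a toy model. This is only a sketch of the suspected behaviour, not MongoDB's networking code: the function `request_times_out`, its parameters, and the 30-second timeout are all hypothetical; only the 16912-second index build duration comes from the rs0 log.

```python
def request_times_out(timer_start, request_start, request_duration, timeout):
    """Return True if the request is judged to exceed `timeout` seconds.

    The elapsed time is measured from `timer_start` to the moment the
    request completes (hypothetical model, not MongoDB's implementation).
    """
    finished_at = request_start + request_duration
    return finished_at - timer_start > timeout


INDEX_BUILD_SECS = 16912  # duration of the rs0 index build, from the log
TIMEOUT_SECS = 30         # assumed threshold, chosen only for illustration

previous_call = 0.0
next_call = previous_call + INDEX_BUILD_SECS  # first request after the build

# Suspected behaviour: the timer still starts at the previous call.
stale = request_times_out(previous_call, next_call, 0.001, TIMEOUT_SECS)

# Expected behaviour: the timer starts when the new request is issued.
fresh = request_times_out(next_call, next_call, 0.001, TIMEOUT_SECS)

print(stale, fresh)
```

With the stale timer start, a request that itself takes one millisecond is still judged to have timed out, which would match the pattern in the logs if the guess is right.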
Here is part of the log:
## rs0
2016-08-27T18:49:09.004+0000 I - [rsSync] Index: (2/3) BTree Bottom Up Progress: 619110400/631007639 98%
2016-08-27T18:49:16.218+0000 I INDEX [rsSync] done building bottom layer, going to commit
2016-08-27T18:49:17.341+0000 I INDEX [rsSync] build index done. scanned 631007639 total records. 16912 secs
2016-08-27T18:49:17.342+0000 I NETWORK [rsSync] Socket say send() errno:110 Connection timed out 10.0.0.24:27017
2016-08-27T18:49:17.349+0000 E REPL [rsSync] 9001 socket exception [SEND_ERROR] server [10.0.0.24:27017]
2016-08-27T18:49:17.349+0000 E REPL [rsSync] initial sync attempt failed, 9 attempts remaining
2016-08-27T18:49:22.349+0000 I REPL [rsSync] initial sync pending
2016-08-27T18:49:22.350+0000 I REPL [ReplicationExecutor] syncing from: MONGO-RS1-1:27017
2016-08-27T18:49:22.421+0000 I REPL [rsSync] initial sync drop all databases
## rs1, same (shorter excerpt)
2016-08-27T18:31:09.869+0000 I INDEX [rsSync] done building bottom layer, going to commit
2016-08-27T18:31:10.819+0000 I INDEX [rsSync] build index done. scanned 612324609 total records. 16342 secs
2016-08-27T18:31:10.819+0000 I NETWORK [rsSync] Socket say send() errno:110 Connection timed out 10.0.0.43:27017
2016-08-27T18:31:10.830+0000 E REPL [rsSync] 9001 socket exception [SEND_ERROR] server [10.0.0.43:27017]
2016-08-27T18:31:10.830+0000 E REPL [rsSync] initial sync attempt failed, 9 attempts remaining
## rs2, same (shorter excerpt)
2016-08-27T17:18:19.030+0000 I INDEX [rsSync] done building bottom layer, going to commit
2016-08-27T17:18:20.245+0000 I INDEX [rsSync] build index done. scanned 523205629 total records. 40686 secs
2016-08-27T17:18:20.246+0000 I STORAGE [rsSync] copying indexes for: { name: "reports", options: {} }
2016-08-27T17:18:20.246+0000 I NETWORK [rsSync] Socket say send() errno:110 Connection timed out 10.0.0.68:27017
2016-08-27T17:18:20.260+0000 E REPL [rsSync] 9001 socket exception [SEND_ERROR] server [10.0.0.68:27017]
2016-08-27T17:18:20.260+0000 E REPL [rsSync] initial sync attempt failed, 9 attempts remaining
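To make the timing in the excerpts concrete, the gap between the "build index done" line and the following send() error can be computed from the timestamps. This is a minimal sketch using two lines copied from the rs0 excerpt; the helper names `parse_ts` and `gap_ms` are made up for this example.

```python
from datetime import datetime, timedelta

# Two lines copied verbatim from the rs0 excerpt.
LOG = """\
2016-08-27T18:49:17.341+0000 I INDEX [rsSync] build index done. scanned 631007639 total records. 16912 secs
2016-08-27T18:49:17.342+0000 I NETWORK [rsSync] Socket say send() errno:110 Connection timed out 10.0.0.24:27017
"""


def parse_ts(line):
    """Parse the leading timestamp of a log line."""
    stamp = line.split(" ", 1)[0]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f%z")


def gap_ms(log):
    """Milliseconds between 'build index done' and the next errno:110 line."""
    done = None
    for line in log.splitlines():
        if "build index done" in line:
            done = parse_ts(line)
        elif "errno:110" in line and done is not None:
            return (parse_ts(line) - done) // timedelta(milliseconds=1)
    return None  # no matching pair of lines found


print(gap_ms(LOG))
```

For the rs0 lines the gap is 1 ms, after an index build that ran for 16912 seconds, so the send() fails almost immediately rather than after waiting out a timeout of its own.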
- duplicates
-
SERVER-28710 vectorized send() should handle EWOULDBLOCK
- Closed