Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-26215

Initial sync fails due to false time out if index creation takes longer than X hours

    • Fully Compatible
    • ALL
    • Hide

      Create a collection big enough (several hundred millions of entries), enough indexes to keep the system busy for more than 3-4 hours and just run an initial sync, it will fail.

      Show
      Create a collection big enough (several hundred millions of entries), enough indexes to keep the system busy for more than 3-4 hours and just run an initial sync, it will fail.

      Hi guys,

      We recently had to redo some sync of our replica sets. We could reproducibly find a way to make the initial sync fail.

      We have several collections to be synced. For our largest, the sync permanently failed after the index build of the largest collection. When it tries to get the indexes for the next collection, it fails permanently. We had to use rsync.

      Between normal size collections, it was working, but not for this big one. Also a later retry gave the same effect. As you will see from the very narrow timestamps, the error must lay in the network component. The previous call was probably older than the threshold set for this operation, and then instead of resetting the timeout when the process is really making the request, it takes the previous now very old timestamp and thus times out.

      Here is a part of the log

      ## rs0
      2016-08-27T18:49:09.004+0000 I -        [rsSync]   Index: (2/3) BTree Bottom Up Progress: 619110400/631007639 98%
      2016-08-27T18:49:16.218+0000 I INDEX    [rsSync]         done building bottom layer, going to commit
      2016-08-27T18:49:17.341+0000 I INDEX    [rsSync] build index done.  scanned 631007639 total records. 16912 secs
      2016-08-27T18:49:17.342+0000 I NETWORK  [rsSync] Socket say send() errno:110 Connection timed out 10.0.0.24:27017
      2016-08-27T18:49:17.349+0000 E REPL     [rsSync] 9001 socket exception [SEND_ERROR] server [10.0.0.24:27017]
      2016-08-27T18:49:17.349+0000 E REPL     [rsSync] initial sync attempt failed, 9 attempts remaining
      2016-08-27T18:49:22.349+0000 I REPL     [rsSync] initial sync pending
      2016-08-27T18:49:22.350+0000 I REPL     [ReplicationExecutor] syncing from: MONGO-RS1-1:27017
      2016-08-27T18:49:22.421+0000 I REPL     [rsSync] initial sync drop all databases
      
      ## rs1, same (shorter excerpt
      2016-08-27T18:31:09.869+0000 I INDEX    [rsSync]         done building bottom layer, going to commit
      2016-08-27T18:31:10.819+0000 I INDEX    [rsSync] build index done.  scanned 612324609 total records. 16342 secs
      2016-08-27T18:31:10.819+0000 I NETWORK  [rsSync] Socket say send() errno:110 Connection timed out 10.0.0.43:27017
      2016-08-27T18:31:10.830+0000 E REPL     [rsSync] 9001 socket exception [SEND_ERROR] server [10.0.0.43:27017]
      2016-08-27T18:31:10.830+0000 E REPL     [rsSync] initial sync attempt failed, 9 attempts remaining
      
      ## rs2, same (shorter excerpt
      2016-08-27T17:18:19.030+0000 I INDEX    [rsSync]         done building bottom layer, going to commit
      2016-08-27T17:18:20.245+0000 I INDEX    [rsSync] build index done.  scanned 523205629 total records. 40686 secs
      2016-08-27T17:18:20.246+0000 I STORAGE  [rsSync] copying indexes for: { name: "reports", options: {} }
      2016-08-27T17:18:20.246+0000 I NETWORK  [rsSync] Socket say send() errno:110 Connection timed out 10.0.0.68:27017
      2016-08-27T17:18:20.260+0000 E REPL     [rsSync] 9001 socket exception [SEND_ERROR] server [10.0.0.68:27017]
      2016-08-27T17:18:20.260+0000 E REPL     [rsSync] initial sync attempt failed, 9 attempts remaining
      

            Assignee:
            mira.carey@mongodb.com Mira Carey
            Reporter:
            MRigal Matthieu Rigal
            Votes:
            0 Vote for this issue
            Watchers:
            10 Start watching this issue

              Created:
              Updated:
              Resolved: