Core Server / SERVER-24948

initial sync failed because listDatabases exceeded 30s socket (read) timeout

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • None
    • Affects Version/s: 3.2.3
    • Component/s: Replication
    • Labels:
      None
    • Replication
    • ALL
    • Create many collections and indexes so that 'listDatabases' takes more than 30s, then run an initial sync.

      We encountered a user case with 60,000+ collections (300,000+ files including indexes) on the WiredTiger engine. listDatabases takes more than 30s in this case because it needs to traverse all the WT files to collect size statistics.
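
      To put the numbers above in perspective, here is a rough back-of-the-envelope sketch. It assumes exactly 300,000 files and a 30s limit (the report only gives "300,000+" and "more than 30s"), and the variable names are purely illustrative.

```javascript
// Figures taken from the report; the real file count is at least this large.
const fileCount = 300000;      // WiredTiger files (collections plus indexes)
const timeoutMs = 30 * 1000;   // socket read timeout on the syncing secondary

// Average time the sync source may spend per file before the timeout fires.
const budgetPerFileMs = timeoutMs / fileCount;
console.log(`per-file budget: ${budgetPerFileMs} ms`);
```

      Under these assumptions the sync source has on average only about 0.1 ms per file to gather size statistics, so even a modest per-file cost is enough to exceed the timeout.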

      The secondary sets its socket timeout to 30s during the sync process (OplogReader::kSocketTimeout(30)), so it fails to run the listDatabases command in this case.

      2016-06-27T20:59:46.494+0800 I REPL     [rsSync] ******
      2016-06-27T20:59:46.495+0800 I REPL     [rsSync] initial sync pending
      2016-06-27T20:59:46.499+0800 I REPL     [rsSync] no valid sync sources found in current replset to do an initial sync
      2016-06-27T20:59:47.499+0800 I REPL     [rsSync] initial sync pending
      2016-06-27T20:59:47.517+0800 I REPL     [rsSync] initial sync drop all databases
      2016-06-27T20:59:47.517+0800 I STORAGE  [rsSync] dropAllDatabasesExceptLocal 1
      2016-06-27T20:59:47.517+0800 I REPL     [rsSync] initial sync clone all databases
      2016-06-27T21:00:17.517+0800 I NETWORK  [rsSync] Socket recv() timeout  10.182.4.106:27017
      2016-06-27T21:00:17.517+0800 I NETWORK  [rsSync] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [10.182.4.106:27017] 
      2016-06-27T21:00:17.519+0800 E REPL     [rsSync] 6 network error while attempting to run command 'listDatabases' on host '10.182.4.106:27017' 
      2016-06-27T21:00:17.519+0800 E REPL     [rsSync] initial sync attempt failed, 9 attempts remaining
      2016-06-27T21:00:22.519+0800 I REPL     [rsSync] initial sync pending
      2016-06-27T21:00:40.435+0800 I REPL     [rsSync] initial sync drop all databases
      2016-06-27T21:00:40.435+0800 I STORAGE  [rsSync] dropAllDatabasesExceptLocal 1
      2016-06-27T21:00:40.435+0800 I REPL     [rsSync] initial sync clone all databases
      2016-06-27T21:01:10.436+0800 I NETWORK  [rsSync] Socket recv() timeout  10.182.4.106:27017
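
      The gap between "initial sync clone all databases" and the recv() timeout in the log above can be checked with a small sketch. The two lines are copied from the first attempt in the log; the helper name logTimeMs is made up for this illustration.

```javascript
// Two lines copied from the first sync attempt in the log.
const started =
  "2016-06-27T20:59:47.517+0800 I REPL     [rsSync] initial sync clone all databases";
const timedOut =
  "2016-06-27T21:00:17.517+0800 I NETWORK  [rsSync] Socket recv() timeout  10.182.4.106:27017";

// The log timestamp ends in "+0800"; Date.parse expects "+08:00",
// so insert the colon before parsing.
function logTimeMs(line) {
  const stamp = line.split(" ")[0];
  const iso = stamp.replace(/([+-]\d{2})(\d{2})$/, "$1:$2");
  return Date.parse(iso);
}

const elapsedMs = logTimeMs(timedOut) - logTimeMs(started);
console.log(`elapsed: ${elapsedMs} ms`);
```

      The elapsed time comes out at 30000 ms, which matches the 30s socket timeout and suggests the listDatabases call is what the secondary was waiting on.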
      

      During initial sync, the secondary only needs the database names and does not use the size information. We could add an option to listDatabases that tells the server only database names are needed, which would greatly reduce the cost of the command:

      db.runCommand({listDatabases: 1, nameOnly: true})
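
      The idea behind the proposed option can be illustrated with a toy model. This is only a sketch under stated assumptions, not the server's implementation: the in-memory catalog, the fileSize stand-in, the simplified response shape, and the handler itself are all hypothetical.

```javascript
// Hypothetical in-memory catalog: database name -> list of data files.
const catalog = new Map([
  ["app", ["collection-0.wt", "index-1.wt"]],
  ["logs", ["collection-2.wt", "index-3.wt", "index-4.wt"]],
]);

let statCalls = 0;
// Stand-in for an expensive per-file size lookup; counts how often it runs.
function fileSize(file) {
  statCalls += 1;
  return 4096;
}

// Hypothetical handler: skips every size lookup when nameOnly is set.
function listDatabases(cat, { nameOnly = false } = {}) {
  const databases = [];
  for (const [name, files] of cat) {
    if (nameOnly) {
      databases.push({ name });
    } else {
      const sizeOnDisk = files.reduce((sum, f) => sum + fileSize(f), 0);
      databases.push({ name, sizeOnDisk });
    }
  }
  return { databases, ok: 1 };
}

const full = listDatabases(catalog);
const fullStatCalls = statCalls; // one lookup per file

statCalls = 0;
const namesOnly = listDatabases(catalog, { nameOnly: true });
const nameOnlyStatCalls = statCalls; // no lookups at all

console.log(fullStatCalls, nameOnlyStatCalls);
```

      In this model the full listing performs one size lookup per file, while the nameOnly listing performs none, so its cost no longer grows with the number of files.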
      

            Assignee:
            backlog-server-repl [DO NOT USE] Backlog - Replication Team
            Reporter:
            zyd_com@126.com Zhang Youdong
            Votes:
            0
            Watchers:
            9
