Core Server / SERVER-18721

Initial-sync can exhaust available ports with rapid short-lived connections


    Details

    • Type: Bug
    • Status: Needs Scheduling
    • Priority: Major - P3
    • Resolution: Unresolved
    • Affects Version/s: 3.0.3
    • Fix Version/s: None
    • Component/s: Replication
    • Labels:
    • Environment:
      Linux 3.13.0-53-generic #89~precise1-Ubuntu SMP Wed May 20 17:42:16 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

      32-core / 128GB RAM, all SSD, monster servers.
    • Operating System:
      Linux
    • Steps To Reproduce:

      1. Have a large collection that will take about an hour to sync.
      2. Have a script create, update, and delete objects in another collection at a high rate (10/second, I'd guess).
      3. Add a replica to the set.
      4. Once the indices are built, watch for connection spam on the primary and for TIME_WAIT sockets on the replica; if the churn runs fast enough, the replica exhausts its available ports.

      Our workaround:

      sudo vi /etc/sysctl.conf
      net.ipv4.tcp_tw_reuse = 1
      net.ipv4.ip_local_port_range = 10000 61000
      sudo sysctl -p

    • Case:

      Description

      Similar to SERVER-17140 but much more insidious; this issue prevented a downed replica from completing an initial sync twice in a row, and we had to resort to tcp_tw_reuse to get the replica back online.

      Scenario: a replica set containing, among other things, a large collection of long-term documents (10 million objects / 190GB of data) and a small collection of highly transient documents (1,000 objects / 5MB of data).

      During initial sync it takes about an hour for the bulk of the data to be replicated and the indices to be built for the big collection – during this time a large number of the documents in the small collection are updated and deleted. Once the sync completed, the log exploded into a fire-hose of:

      I REPL     [repl writer worker 12] replication failed to apply update: { ts: Timestamp 1432762108000|244, h: -8874966953786363575, v: 2, op: "u", ns: "lotus_tem
      I REPL     [repl writer worker 4] replication failed to apply update: { ts: Timestamp 1432762108000|232, h: -2423412449985218813, v: 2, op: "u", ns: "lotus_temp
      I REPL     [repl writer worker 10] replication failed to apply update: { ts: Timestamp 1432762108000|259, h: 6507151402806398582, v: 2, op: "u", ns: "lotus_temp
      I REPL     [repl writer worker 13] replication failed to apply update: { ts: Timestamp 1432762108000|251, h: 8927018345590464930, v: 2, op: "u", ns: "lotus_temp
      I REPL     [repl writer worker 8] replication failed to apply update: { ts: Timestamp 1432762108000|242, h: 7518410875297535456, v: 2, op: "u", ns: "lotus_temp.
      I REPL     [repl writer worker 12] replication info adding missing object
      I REPL     [repl writer worker 12] replication missing object not found on source. presumably deleted later in oplog
      I REPL     [repl writer worker 12] replication o2: { _id: ObjectId('5566368006c56f080dfc5033') }
      I REPL     [repl writer worker 12] replication o firstfield: $set
      I REPL     [repl writer worker 4] replication info adding missing object
      I REPL     [repl writer worker 4] replication missing object not found on source. presumably deleted later in oplog
      I REPL     [repl writer worker 4] replication o2: { _id: ObjectId('556636ec06c56f330efc4fc1') }
      I REPL     [repl writer worker 4] replication o firstfield: $set
      I REPL     [repl writer worker 10] replication info adding missing object
      I REPL     [repl writer worker 10] replication missing object not found on source. presumably deleted later in oplog
      I REPL     [repl writer worker 10] replication o2: { _id: ObjectId('556636ca07c56f642fa8f0a5') }
      I REPL     [repl writer worker 10] replication o firstfield: $set
      I REPL     [repl writer worker 13] replication info adding missing object
      I REPL     [repl writer worker 13] replication missing object not found on source. presumably deleted later in oplog
      I REPL     [repl writer worker 13] replication o2: { _id: ObjectId('5566366c06c56f340dfc4fdc') }
      I REPL     [repl writer worker 13] replication o firstfield: $set
      I REPL     [repl writer worker 8] replication info adding missing object
      I REPL     [repl writer worker 8] replication missing object not found on source. presumably deleted later in oplog
      I REPL     [repl writer worker 8] replication o2: { _id: ObjectId('556634a508c56ffa267b23cd') }
      I REPL     [repl writer worker 8] replication o firstfield: $set
      

      This goes on for about 140,000 lines of log in about a minute – during which time it seems to be making a whole lot of short-lived connections – we can see this happening in the log for the server it’s syncing from:

      2015-05-27T19:18:43.379-0400 I NETWORK  [conn100064] end connection x.x.x.x:35686 (1114 connections now open)
      2015-05-27T19:18:43.379-0400 I NETWORK  [initandlisten] connection accepted from x.x.x.x:35687 #100065 (1115 connections now open)
      2015-05-27T19:18:43.380-0400 I NETWORK  [conn100065] end connection x.x.x.x:35687 (1114 connections now open)
      2015-05-27T19:18:43.380-0400 I NETWORK  [initandlisten] connection accepted from x.x.x.x:35688 #100066 (1115 connections now open)
      2015-05-27T19:18:43.380-0400 I NETWORK  [conn100066] end connection x.x.x.x:35688 (1114 connections now open)
      2015-05-27T19:18:43.381-0400 I NETWORK  [initandlisten] connection accepted from x.x.x.x:35689 #100067 (1115 connections now open)
      2015-05-27T19:18:43.381-0400 I NETWORK  [conn100067] end connection x.x.x.x:35689 (1114 connections now open)
      2015-05-27T19:18:43.382-0400 I NETWORK  [initandlisten] connection accepted from x.x.x.x:35690 #100068 (1115 connections now open)
      2015-05-27T19:18:43.382-0400 I NETWORK  [conn100068] end connection x.x.x.x:35690 (1114 connections now open)
      2015-05-27T19:18:43.383-0400 I NETWORK  [initandlisten] connection accepted from x.x.x.x:35691 #100069 (1115 connections now open)
      2015-05-27T19:18:43.383-0400 I NETWORK  [conn100069] end connection x.x.x.x:35691 (1114 connections now open)
      2015-05-27T19:18:43.384-0400 I NETWORK  [initandlisten] connection accepted from x.x.x.x:35692 #100070 (1115 connections now open)
      2015-05-27T19:18:43.384-0400 I NETWORK  [conn100070] end connection x.x.x.x:35692 (1114 connections now open)
      

      Note the freakishly short lifetimes! This of course isn’t healthy for the primary but it gets worse than that; on the new replica, while this firehose was blasting the primary, I ran a quick check of the TIME_WAIT reservations:

      # netstat -an | grep TIME_WAIT | wc -l
      25246
      
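      Some back-of-envelope arithmetic puts that count in context (this assumes the stock Linux ephemeral range of 32768-61000, which is an assumption about this host's pre-workaround configuration, not something stated above):

      ```shell
      # ~28k usable ephemeral ports in the default Linux range
      echo $(( 61000 - 32768 ))
      # Headroom left at the measured TIME_WAIT count of 25,246 sockets,
      # each of which pins a port for ~60s before the kernel frees it
      echo $(( 61000 - 32768 - 25246 ))
      ```

      At the connection rates shown in the log above, a few thousand remaining ports disappear in seconds.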

      This means we’re exhausting network ports at a furious rate and of course we end up choking:

      I REPL     [repl writer worker 2] replication info adding missing object
      W NETWORK  [repl writer worker 2] Failed to connect to x.x.x.x:27017, reason: errno:99 Cannot assign requested address
      I REPL     [repl writer worker 2] repl: couldn't connect to server war-mongo2:27017 (x.x.x.x), connection attempt failed
      W REPL     [repl writer worker 2] network problem detected while connecting to the sync source, attempt 1 of 3
      W NETWORK  [repl writer worker 2] Failed to connect to x.x.x.x:27017, reason: errno:99 Cannot assign requested address
      I REPL     [repl writer worker 2] repl: couldn't connect to server war-mongo2:27017 (x.x.x.x), connection attempt failed
      W REPL     [repl writer worker 2] network problem detected while connecting to the sync source, attempt 2 of 3
      W NETWORK  [repl writer worker 2] Failed to connect to x.x.x.x:27017, reason: errno:99 Cannot assign requested address
      I REPL     [repl writer worker 2] repl: couldn't connect to server war-mongo2:27017 (x.x.x.x), connection attempt failed
      W REPL     [repl writer worker 2] network problem detected while connecting to the sync source, attempt 3 of 3
      I -        [repl writer worker 2] Assertion: 15916:Can no longer connect to initial sync source: war-mongo2:27017
      I CONTROL  [repl writer worker 2]
      0xf6d639 0xf0d371 0xef25cf 0xef267c 0xcaced2 0xcac176 0xcb2b5c 0xf0418b 0xfbab94 0x7f0eb110ce9a 0x7f0eafbe338d
      ----- BEGIN BACKTRACE -----
      {"backtrace":[{"b":"400000","o":"B6D639"},{"b":"400000","o":"B0D371"},{"b":"400000","o":"AF25CF"},{"b":"400000","o":"AF267C"},{"b":"400000","o":"8ACED2"},{"b":"400000","o":"8AC176"},{"b":"4
       mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf6d639]
       mongod(_ZN5mongo10logContextEPKc+0xE1) [0xf0d371]
       mongod(_ZN5mongo11msgassertedEiPKc+0xAF) [0xef25cf]
       mongod(+0xAF267C) [0xef267c]
       mongod(_ZN5mongo4repl4Sync13getMissingDocEPNS_16OperationContextEPNS_8DatabaseERKNS_7BSONObjE+0x622) [0xcaced2]
       mongod(_ZN5mongo4repl4Sync11shouldRetryEPNS_16OperationContextERKNS_7BSONObjE+0x196) [0xcac176]
       mongod(_ZN5mongo4repl21multiInitialSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x8C) [0xcb2b5c]
       mongod(_ZN5mongo10threadpool6Worker4loopERKSs+0x2FB) [0xf0418b]
       mongod(+0xBBAB94) [0xfbab94]
       libpthread.so.0(+0x7E9A) [0x7f0eb110ce9a]
       libc.so.6(clone+0x6D) [0x7f0eafbe338d]
      -----  END BACKTRACE  -----
      2015-05-27T19:18:56.663-0400 E REPL     [repl writer worker 2] writer worker caught exception:  :: caused by :: 15916 Can no longer connect to initial sync source: war-mongo2:27017 on: { ts
      2015-05-27T19:18:56.663-0400 I -        [repl writer worker 2] Fatal Assertion 16361
      2015-05-27T19:18:56.663-0400 I -        [repl writer worker 2]
       
      ***aborting after fassert() failure
      

      We had this happen twice before we started messing with kernel parameters, which ultimately got us up and running.
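      A quick sketch for checking whether the sysctl workaround from the steps above took effect (paths are standard Linux procfs; the TIME_WAIT count mirrors the netstat check earlier):

      ```shell
      # Confirm the overrides are live
      cat /proc/sys/net/ipv4/tcp_tw_reuse
      cat /proc/sys/net/ipv4/ip_local_port_range
      # Watch the TIME_WAIT backlog drain once reuse kicks in
      netstat -an | grep TIME_WAIT | wc -l
      ```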

      Since we also rebuilt larger "mostly write, no delete" replica sets the same day and never saw this, it seems likely that our highly transient collection is what causes this sort of connection abuse.
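      For anyone trying to reproduce step 2 of the repro, a minimal churn driver might look like the following. This is a hypothetical sketch: the `churn_test.transient` namespace and the `MONGO` variable are my inventions, and it assumes a legacy `mongo` shell on the PATH pointed at the primary.

      ```shell
      # Churn a small collection at roughly 10 ops/sec to generate
      # update/delete oplog entries during the initial sync.
      MONGO=${MONGO:-mongo}

      churn_once() {
        "$MONGO" --quiet --eval '
          var c = db.getSiblingDB("churn_test").transient;
          var id = ObjectId();
          c.insert({_id: id, n: 1});
          c.update({_id: id}, {$set: {n: 2}});
          c.remove({_id: id});'
      }

      # Run forever at ~10/s while the new replica syncs:
      #   while true; do churn_once; sleep 0.1; done
      ```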

        People

        • Votes: 3
        • Watchers: 25