Core Server / SERVER-27700

WT secondary performance drops to near-zero with cache full

    Details

    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v3.4, v3.2
    • Steps To Reproduce:

      No easy steps, of course. In our case we have a secondary with replication load and some mongo dumps running every night; the problem appears progressively.
    • Sprint:
      Storage 2017-03-06

      Description

      Hello,

      I'm sorry, but I'll have to reopen the difficult topic of WT dying when the cache is full. This has been reported multiple times already, but the tickets were all closed hoping 3.2.10 would fix the issue: SERVER-25070 SERVER-25974 SERVER-26001 SERVER-24580 SERVER-26055

      We upgraded to 3.2.10 and later 3.2.11 as suggested, but unfortunately it didn't solve this issue (though it improved other performance problems). We're currently running 3.2.11 in a replica set on 2 capable servers (dual Xeon, 256GB RAM, RAID 10 SSD), running only mongo dumps on the secondary every night. The servers were running fine for weeks. We recently added another mongo dump with the --oplog option every night, and since then (about 5 days ago) our secondary started getting slower and slower, up to the point that today we couldn't do anything on it; even running db.getReplicationInfo() takes 30 to 60 seconds.

      This is the exact same problem I explained in SERVER-26055.
      In the end the server was in the following state:

      • the mongod process was using 100~150% CPU and 0 IO (normal operation is 20-30% CPU and 1-2% IO)
      • the cache was full: 153G
      • most commands would take minutes to complete
      • the mongodump processes had been running for more than 10 hours and were not doing much: a few records per minute (they usually take just 1h~1.5h)

      I've uploaded the diagnostic data dir to your secure portal (diagnostic.tar.bz2,
      100 MiB) covering the last 6 days. The last file, starting at 13:39 on the 17th, is from when we restarted the server to save it, so the problem is in the previous files.

      I hope we'll be able to fix this once and for all and that our primary server won't suffer from the same issue.

      Here is the server status before the restart (at its worst):

      company:SECONDARY> db.serverStatus()
      {
        "host" : "mongo2.company",
        "advisoryHostFQDNs" : [
          "mongo2.company"
        ],
        "version" : "3.2.11",
        "process" : "mongod",
        "pid" : NumberLong(11157),
        "uptime" : 3012644,
        "uptimeMillis" : NumberLong("3012644493"),
        "uptimeEstimate" : 2987132,
        "localTime" : ISODate("2017-01-17T13:22:10.765Z"),
        "asserts" : {
          "regular" : 0,
          "warning" : 0,
          "msg" : 0,
          "user" : 356,
          "rollovers" : 0
        },
        "connections" : {
          "current" : 893,
          "available" : 199107,
          "totalCreated" : NumberLong(3089363)
        },
        "extra_info" : {
          "note" : "fields vary by platform",
          "heap_usage_bytes" : -2054714728,
          "page_faults" : 493
        },
        "globalLock" : {
          "totalTime" : NumberLong("3012644491000"),
          "currentQueue" : {
            "total" : 0,
            "readers" : 0,
            "writers" : 0
          },
          "activeClients" : {
            "total" : 926,
            "readers" : 0,
            "writers" : 15
          }
        },
        "locks" : {
          "Global" : {
            "acquireCount" : {
              "r" : NumberLong("16384331421"),
              "w" : NumberLong(1308770204),
              "R" : NumberLong(297276217),
              "W" : NumberLong(297288581)
            },
            "acquireWaitCount" : {
              "r" : NumberLong(22010128),
              "w" : NumberLong(301),
              "R" : NumberLong(255),
              "W" : NumberLong(11397391)
            },
            "timeAcquiringMicros" : {
              "r" : NumberLong("1566808918815"),
              "w" : NumberLong(8786271),
              "R" : NumberLong(44650942),
              "W" : NumberLong("169000143619")
            }
          },
          "Database" : {
            "acquireCount" : {
              "r" : NumberLong("7597006401"),
              "w" : NumberLong(416908607),
              "R" : NumberLong(596275),
              "W" : NumberLong(891861639)
            },
            "acquireWaitCount" : {
              "w" : NumberLong(2),
              "R" : NumberLong(1),
              "W" : NumberLong(540)
            },
            "timeAcquiringMicros" : {
              "w" : NumberLong(288),
              "R" : NumberLong(1371613),
              "W" : NumberLong(51766)
            }
          },
          "Collection" : {
            "acquireCount" : {
              "r" : NumberLong("8108302166"),
              "w" : NumberLong(416907972)
            }
          },
          "Metadata" : {
            "acquireCount" : {
              "w" : NumberLong(1),
              "W" : NumberLong(71)
            },
            "acquireWaitCount" : {
              "W" : NumberLong(3)
            },
            "timeAcquiringMicros" : {
              "W" : NumberLong(2270)
            }
          },
          "oplog" : {
            "acquireCount" : {
              "r" : NumberLong(10153326),
              "w" : NumberLong(681)
            }
          }
        },
        "network" : {
          "bytesIn" : NumberLong("11967403731"),
          "bytesOut" : NumberLong("8446426050037"),
          "numRequests" : NumberLong(146466986)
        },
        "opcounters" : {
          "insert" : 0,
          "query" : 13161552,
          "update" : 0,
          "delete" : 0,
          "getmore" : 3895271,
          "command" : 129410157
        },
        "opcountersRepl" : {
          "insert" : 35012985,
          "query" : 0,
          "update" : 378560144,
          "delete" : 3334842,
          "getmore" : 0,
          "command" : 0
        },
        "repl" : {
          "hosts" : [
            "10.10.11.18:27017",
            "10.10.11.19:27017"
          ],
          "arbiters" : [
            "10.10.11.9:27017"
          ],
          "setName" : "company",
          "setVersion" : 48,
          "ismaster" : false,
          "secondary" : true,
          "primary" : "10.10.11.18:27017",
          "me" : "10.10.11.19:27017",
          "rbid" : 31628773
        },
        "storageEngine" : {
          "name" : "wiredTiger",
          "supportsCommittedReads" : true,
          "persistent" : true
        },
        "tcmalloc" : {
          "generic" : {
            "current_allocated_bytes" : NumberLong("161154044008"),
            "heap_size" : NumberLong("182151512064")
          },
          "tcmalloc" : {
            "pageheap_free_bytes" : NumberLong("14107099136"),
            "pageheap_unmapped_bytes" : NumberLong(1802354688),
            "max_total_thread_cache_bytes" : NumberLong(1073741824),
            "current_total_thread_cache_bytes" : 479349592,
            "total_free_bytes" : NumberLong("5088014232"),
            "central_cache_free_bytes" : NumberLong("4578532256"),
            "transfer_cache_free_bytes" : 30132384,
            "thread_cache_free_bytes" : 479349592,
            "aggressive_memory_decommit" : 0,
            "formattedString" : "------------------------------------------------\nMALLOC:   161154044008 (153688.5 MiB) Bytes in use by application\nMALLOC: +  14107099136 (13453.6 MiB) Bytes in page heap freelist\nMALLOC: +   4578532256 ( 4366.4 MiB) Bytes in central cache freelist\nMALLOC: +     30132384 (   28.7 MiB) Bytes in transfer cache freelist\nMALLOC: +    479349592 (  457.1 MiB) Bytes in thread cache freelists\nMALLOC: +    499527840 (  476.4 MiB) Bytes in malloc metadata\nMALLOC:   ------------\nMALLOC: = 180848685216 (172470.7 MiB) Actual memory used (physical + swap)\nMALLOC: +   1802354688 ( 1718.9 MiB) Bytes released to OS (aka unmapped)\nMALLOC:   ------------\nMALLOC: = 182651039904 (174189.6 MiB) Virtual address space used\nMALLOC:\nMALLOC:        5011342              Spans in use\nMALLOC:            956              Thread heaps in use\nMALLOC:           8192              Tcmalloc page size\n------------------------------------------------\nCall ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).\nBytes released to the OS take up virtual address space but no physical memory.\n"
          }
        },
        "wiredTiger" : {
          "uri" : "statistics:",
          "LSM" : {
            "application work units currently queued" : 0,
            "merge work units currently queued" : 0,
            "rows merged in an LSM tree" : 0,
            "sleep for LSM checkpoint throttle" : 0,
            "sleep for LSM merge throttle" : 0,
            "switch work units currently queued" : 0,
            "tree maintenance operations discarded" : 0,
            "tree maintenance operations executed" : 0,
            "tree maintenance operations scheduled" : 0,
            "tree queue hit maximum" : 0
          },
          "async" : {
            "current work queue length" : 0,
            "maximum work queue length" : 0,
            "number of allocation state races" : 0,
            "number of flush calls" : 0,
            "number of operation slots viewed for allocation" : 0,
            "number of times operation allocation failed" : 0,
            "number of times worker found no work" : 0,
            "total allocations" : 0,
            "total compact calls" : 0,
            "total insert calls" : 0,
            "total remove calls" : 0,
            "total search calls" : 0,
            "total update calls" : 0
          },
          "block-manager" : {
            "blocks pre-loaded" : 563909,
            "blocks read" : 447716114,
            "blocks written" : 466221343,
            "bytes read" : NumberLong("5105232363520"),
            "bytes written" : NumberLong("5242967597056"),
            "bytes written for checkpoint" : NumberLong("5209527078912"),
            "mapped blocks read" : 0,
            "mapped bytes read" : 0
          },
          "cache" : {
            "application threads page read from disk to cache count" : 401675260,
            "application threads page read from disk to cache time (usecs)" : 21385899599,
            "application threads page write from cache to disk count" : 6026,
            "application threads page write from cache to disk time (usecs)" : 515035,
            "bytes belonging to page images in the cache" : 145719393101,
            "bytes currently in the cache" : 153008266330,
            "bytes not belonging to page images in the cache" : 7288873228,
            "bytes read into cache" : NumberLong("8588183106176"),
            "bytes written from cache" : NumberLong("4678177271668"),
            "checkpoint blocked page eviction" : 310,
            "eviction calls to get a page" : 1020803718,
            "eviction calls to get a page found queue empty" : 594912745,
            "eviction calls to get a page found queue empty after locking" : 10716036,
            "eviction currently operating in aggressive mode" : 0,
            "eviction empty score" : 0,
            "eviction server candidate queue empty when topping up" : 10557087,
            "eviction server candidate queue not empty when topping up" : 123218,
            "eviction server evicting pages" : 7185464,
            "eviction server slept, because we did not make progress with eviction" : 895321,
            "eviction server unable to reach eviction goal" : 0,
            "eviction state" : 3,
            "eviction walks abandoned" : 8639822,
            "eviction worker thread evicting pages" : 397247395,
            "failed eviction of pages that exceeded the in-memory maximum" : 40327,
            "files with active eviction walks" : 0,
            "files with new eviction walks started" : 11070568265,
            "hazard pointer blocked page eviction" : 42833,
            "hazard pointer check calls" : 406723085,
            "hazard pointer check entries walked" : 17485244113,
            "hazard pointer maximum array length" : 50,
            "in-memory page passed criteria to be split" : 68631,
            "in-memory page splits" : 34447,
            "internal pages evicted" : 9705753,
            "internal pages split during eviction" : 502,
            "leaf pages split during eviction" : 487134,
            "lookaside table insert calls" : 0,
            "lookaside table remove calls" : 0,
            "maximum bytes configured" : 161061273600,
            "maximum page size at eviction" : 0,
            "modified pages evicted" : 146340076,
            "modified pages evicted by application threads" : 4099,
            "overflow pages read into cache" : 0,
            "overflow values cached in memory" : 0,
            "page split during eviction deepened the tree" : 0,
            "page written requiring lookaside records" : 0,
            "pages currently held in the cache" : 4157514,
            "pages evicted because they exceeded the in-memory maximum" : 36799,
            "pages evicted because they had chains of deleted items" : 5847,
            "pages evicted by application threads" : 2201563,
            "pages queued for eviction" : 416886847,
            "pages queued for urgent eviction" : 97332,
            "pages queued for urgent eviction during walk" : 97332,
            "pages read into cache" : 401701308,
            "pages read into cache requiring lookaside entries" : 0,
            "pages requested from the cache" : 44781043115,
            "pages seen by eviction walk" : 15671920146,
            "pages selected for eviction unable to be evicted" : 7187074,
            "pages walked for eviction" : 1005956628381,
            "pages written from cache" : 374093813,
            "pages written requiring in-memory restoration" : 135689,
            "percentage overhead" : 8,
            "tracked bytes belonging to internal pages in the cache" : 658910666,
            "tracked bytes belonging to leaf pages in the cache" : 152349355664,
            "tracked dirty bytes in the cache" : 1194195084,
            "tracked dirty pages in the cache" : 17960,
            "unmodified pages evicted" : 251160847
          },
          "connection" : {
            "auto adjusting condition resets" : 47670726,
            "auto adjusting condition wait calls" : 61671191,
            "files currently open" : 19480,
            "memory allocations" : 124965158903,
            "memory frees" : 124512558703,
            "memory re-allocations" : 204658183,
            "pthread mutex condition wait calls" : 654232361,
            "pthread mutex shared lock read-lock calls" : 4968504666,
            "pthread mutex shared lock write-lock calls" : 3332317538,
            "total fsync I/Os" : 346443922,
            "total read I/Os" : 447943133,
            "total write I/Os" : 771846256
          },
          "cursor" : {
            "cursor create calls" : 205986549,
            "cursor insert calls" : 2330700111,
            "cursor next calls" : 17706432089,
            "cursor prev calls" : 14969190,
            "cursor remove calls" : 346172449,
            "cursor reset calls" : 21665455518,
            "cursor restarted searches" : 461443,
            "cursor search calls" : 22666117834,
            "cursor search near calls" : 940275132,
            "cursor update calls" : 1,
            "truncate calls" : 340
          },
          "data-handle" : {
            "connection data handles currently active" : 51620,
            "connection sweep candidate became referenced" : 0,
            "connection sweep dhandles closed" : 12864,
            "connection sweep dhandles removed from hash list" : 731162956,
            "connection sweep time-of-death sets" : 730653128,
            "connection sweeps" : 300763,
            "session dhandles swept" : 1661,
            "session sweep attempts" : 123182
          },
          "lock" : {
            "checkpoint lock acquisitions" : 36329,
            "checkpoint lock application thread wait time (usecs)" : 131065550,
            "checkpoint lock internal thread wait time (usecs)" : 124,
            "handle-list lock acquisitions" : 731959926,
            "handle-list lock application thread wait time (usecs)" : 867780820,
            "handle-list lock internal thread wait time (usecs)" : 748445731,
            "metadata lock acquisitions" : 36281,
            "metadata lock application thread wait time (usecs)" : 0,
            "metadata lock internal thread wait time (usecs)" : 196,
            "schema lock acquisitions" : 97621,
            "schema lock application thread wait time (usecs)" : 1728012,
            "schema lock internal thread wait time (usecs)" : 67,
            "table lock acquisitions" : 1647870,
            "table lock application thread time waiting for the table lock (usecs)" : 615679214,
            "table lock internal thread time waiting for the table lock (usecs)" : 2160519
          },
          "log" : {
            "busy returns attempting to switch slots" : 66054,
            "consolidated slot closures" : 303650650,
            "consolidated slot join races" : 379772512,
            "consolidated slot join transitions" : 303650650,
            "consolidated slot joins" : 1308972809,
            "consolidated slot unbuffered writes" : 2539282,
            "log bytes of payload data" : NumberLong("2017530171671"),
            "log bytes written" : NumberLong("2092694487296"),
            "log files manually zero-filled" : 0,
            "log flush operations" : 327040636,
            "log force write operations" : 337333167,
            "log force write operations skipped" : 36763021,
            "log records compressed" : 740009745,
            "log records not compressed" : 565440535,
            "log records too small to compress" : 3025714,
            "log release advances write LSN" : 65308,
            "log scan operations" : 5,
            "log scan records requiring two reads" : 2,
            "log server thread advances write LSN" : 303584401,
            "log server thread write LSN walk skipped" : 31096956,
            "log sync operations" : 300227454,
            "log sync time duration (usecs)" : 48779299297,
            "log sync_dir operations" : 20000,
            "log sync_dir time duration (usecs)" : 3706692,
            "log write operations" : 1308684346,
            "logging bytes consolidated" : NumberLong("2093413492096"),
            "maximum log file size" : 104857600,
            "number of pre-allocated log files to create" : 10,
            "pre-allocated log files not ready and missed" : 9,
            "pre-allocated log files prepared" : 20001,
            "pre-allocated log files used" : 19991,
            "records processed by log scan" : 11,
            "total in-memory size of compressed records" : NumberLong("2862605082266"),
            "total log buffer size" : 33554432,
            "total size of compressed records" : NumberLong("1903704122861"),
            "written slots coalesced" : 941,
            "yields waiting for previous log file close" : 0
          },
          "reconciliation" : {
            "fast-path pages deleted" : 4200903,
            "page reconciliation calls" : 361302879,
            "page reconciliation calls for eviction" : 1905672,
            "pages deleted" : 112805,
            "split bytes currently awaiting free" : 13252,
            "split objects currently awaiting free" : 10
          },
          "session" : {
            "open cursor count" : 49239,
            "open session count" : 69,
            "table compact failed calls" : 0,
            "table compact successful calls" : 0,
            "table create failed calls" : 0,
            "table create successful calls" : 179,
            "table drop failed calls" : 2,
            "table drop successful calls" : 28847,
            "table rebalance failed calls" : 0,
            "table rebalance successful calls" : 0,
            "table rename failed calls" : 0,
            "table rename successful calls" : 0,
            "table salvage failed calls" : 0,
            "table salvage successful calls" : 0,
            "table truncate failed calls" : 0,
            "table truncate successful calls" : 340,
            "table verify failed calls" : 0,
            "table verify successful calls" : 0
          },
          "thread-state" : {
            "active filesystem fsync calls" : 0,
            "active filesystem read calls" : 0,
            "active filesystem write calls" : 0
          },
          "thread-yield" : {
            "application thread time evicting (usecs)" : 65175199,
            "application thread time waiting for cache (usecs)" : NumberLong("1501020187418"),
            "page acquire busy blocked" : 66,
            "page acquire eviction blocked" : 162,
            "page acquire locked blocked" : 30037,
            "page acquire read blocked" : 1830625,
            "page acquire time sleeping (usecs)" : 1970680000
          },
          "transaction" : {
            "number of named snapshots created" : 0,
            "number of named snapshots dropped" : 0,
            "transaction begins" : 2440676065,
            "transaction checkpoint currently running" : 1,
            "transaction checkpoint generation" : 36282,
            "transaction checkpoint max time (msecs)" : 103814,
            "transaction checkpoint min time (msecs)" : 31,
            "transaction checkpoint most recent time (msecs)" : 61561,
            "transaction checkpoint scrub dirty target" : 0,
            "transaction checkpoint scrub time (msecs)" : 0,
            "transaction checkpoint total time (msecs)" : 831969937,
            "transaction checkpoints" : 36282,
            "transaction checkpoints skipped because database was clean" : 0,
            "transaction failures due to cache overflow" : 0,
            "transaction fsync calls for checkpoint after allocating the transaction ID" : 36281,
            "transaction fsync duration for checkpoint after allocating the transaction ID (usecs)" : 404122,
            "transaction range of IDs currently pinned" : 375,
            "transaction range of IDs currently pinned by a checkpoint" : 197,
            "transaction range of IDs currently pinned by named snapshots" : 0,
            "transaction sync calls" : 0,
            "transactions committed" : 1687828511,
            "transactions rolled back" : 752348725
          },
          "concurrentTransactions" : {
            "write" : {
              "out" : 15,
              "available" : 113,
              "totalTickets" : 128
            },
            "read" : {
              "out" : 0,
              "available" : 128,
              "totalTickets" : 128
            }
          }
        },
        "writeBacksQueued" : false,
        "mem" : {
          "bits" : 64,
          "resident" : 168549,
          "virtual" : 175704,
          "supported" : true,
          "mapped" : 0,
          "mappedWithJournal" : 0
        },
        "metrics" : {
          "commands" : {
            "_isSelf" : {
              "failed" : NumberLong(0),
              "total" : NumberLong(22)
            },
            "aggregate" : {
              "failed" : NumberLong(0),
              "total" : NumberLong(88070)
            },
            "buildInfo" : {
              "failed" : NumberLong(0),
              "total" : NumberLong(72)
            },
            "collStats" : {
              "failed" : NumberLong(0),
              "total" : NumberLong(16)
            },
            "count" : {
              "failed" : NumberLong(0),
              "total" : NumberLong(1099981)
            },
            "currentOp" : {
              "failed" : NumberLong(0),
              "total" : NumberLong(5)
            },
            "dbStats" : {
              "failed" : NumberLong(0),
              "total" : NumberLong(596120)
            },
            "find" : {
              "failed" : NumberLong(0),
              "total" : NumberLong(13121028)
            },
            "getLog" : {
              "failed" : NumberLong(0),
              "total" : NumberLong(14)
            },
            "getMore" : {
              "failed" : NumberLong(1),
              "total" : NumberLong(3790358)
            },
            "getnonce" : {
              "failed" : NumberLong(0),
              "total" : NumberLong(546)
            },
            "isMaster" : {
              "failed" : NumberLong(0),
              "total" : NumberLong(118296139)
            },
            "listCollections" : {
              "failed" : NumberLong(0),
              "total" : NumberLong(149)
            },
            "listDatabases" : {
              "failed" : NumberLong(0),
              "total" : NumberLong(99360)
            },
            "listIndexes" : {
              "failed" : NumberLong(0),
              "total" : NumberLong(412853)
            },
            "ping" : {
              "failed" : NumberLong(0),
              "total" : NumberLong(2274345)
            },
            "replSetGetConfig" : {
              "failed" : NumberLong(0),
              "total" : NumberLong(4)
            },
            "replSetGetStatus" : {
              "failed" : NumberLong(1),
              "total" : NumberLong(115004)
            },
            "replSetHeartbeat" : {
              "failed" : NumberLong(3),
              "total" : NumberLong(4592438)
            },
            "replSetRequestVotes" : {
              "failed" : NumberLong(0),
              "total" : NumberLong(4)
            },
            "replSetSyncFrom" : {
              "failed" : NumberLong(0),
              "total" : NumberLong(1)
            },
            "replSetUpdatePosition" : {
              "failed" : NumberLong(0),
              "total" : NumberLong(1682869)
            },
            "serverStatus" : {
              "failed" : NumberLong(0),
              "total" : NumberLong(117814)
            },
            "top" : {
              "failed" : NumberLong(0),
              "total" : NumberLong(316)
            },
            "whatsmyuri" : {
              "failed" : NumberLong(0),
              "total" : NumberLong(34015)
            }
          },
          "cursor" : {
            "timedOut" : NumberLong(8),
            "open" : {
              "noTimeout" : NumberLong(4),
              "pinned" : NumberLong(1),
              "total" : NumberLong(7)
            }
          },
          "document" : {
            "deleted" : NumberLong(0),
            "inserted" : NumberLong(0),
            "returned" : NumberLong("12382612005"),
            "updated" : NumberLong(0)
          },
          "getLastError" : {
            "wtime" : {
              "num" : 0,
              "totalMillis" : 0
            },
            "wtimeouts" : NumberLong(0)
          },
          "operation" : {
            "fastmod" : NumberLong(0),
            "idhack" : NumberLong(8027470),
            "scanAndOrder" : NumberLong(248943),
            "writeConflicts" : NumberLong(7)
          },
          "queryExecutor" : {
            "scanned" : NumberLong(600529645),
            "scannedObjects" : NumberLong("12859908515")
          },
          "record" : {
            "moves" : NumberLong(0)
          },
          "repl" : {
            "executor" : {
              "counters" : {
                "eventCreated" : 2,
                "eventWait" : 2,
                "cancels" : 53261409,
                "waits" : 941744516,
                "scheduledNetCmd" : 5495818,
                "scheduledDBWork" : 7,
                "scheduledXclWork" : 0,
                "scheduledWorkAt" : 59134340,
                "scheduledWork" : 943427733,
                "schedulingFailures" : 0
              },
              "queues" : {
                "networkInProgress" : 0,
                "dbWorkInProgress" : 0,
                "exclusiveInProgress" : 0,
                "sleepers" : 4,
                "ready" : 0,
                "free" : 14
              },
              "unsignaledEvents" : 4,
              "eventWaiters" : 0,
              "shuttingDown" : false,
              "networkInterface" : "NetworkInterfaceASIO inShutdown: 0"
            },
            "apply" : {
              "attemptsToBecomeSecondary" : NumberLong(1),
              "batches" : {
                "num" : 297276216,
                "totalMillis" : 26415
              },
              "ops" : NumberLong(416920316)
            },
            "buffer" : {
              "count" : NumberLong(3034),
              "maxSizeBytes" : 268435456,
              "sizeBytes" : NumberLong(4729764)
            },
            "network" : {
              "bytes" : NumberLong("231348439513"),
              "getmores" : {
                "num" : 402313044,
                "totalMillis" : 2224956627
              },
              "ops" : NumberLong(416925030),
              "readersCreated" : NumberLong(29)
            },
            "preload" : {
              "docs" : {
                "num" : 0,
                "totalMillis" : 0
              },
              "indexes" : {
                "num" : 0,
                "totalMillis" : 0
              }
            }
          },
          "storage" : {
            "freelist" : {
              "search" : {
                "bucketExhausted" : NumberLong(0),
                "requests" : NumberLong(0),
                "scanned" : NumberLong(0)
              }
            }
          },
          "ttl" : {
            "deletedDocuments" : NumberLong(0),
            "passes" : NumberLong(49086)
          }
        },
        "ok" : 1
      }
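      To read the cache section above: a quick sketch (not part of the ticket's tooling) using figures copied verbatim from this serverStatus output, showing how full the cache was and how long application threads were stalled on it in aggregate:

```python
# Figures copied from the serverStatus output above.
cache_bytes = 153_008_266_330    # "bytes currently in the cache"
cache_max   = 161_061_273_600    # "maximum bytes configured"
dirty_bytes = 1_194_195_084      # "tracked dirty bytes in the cache"
wait_usecs  = 1_501_020_187_418  # "application thread time waiting for cache (usecs)"

fill_pct = 100 * cache_bytes / cache_max
dirty_pct = 100 * dirty_bytes / cache_max
wait_days = wait_usecs / 1e6 / 86_400  # aggregate app-thread stall, in days

print(f"cache fill: {fill_pct:.1f}%")   # 95.0% -> eviction running at capacity
print(f"dirty:      {dirty_pct:.2f}%")  # 0.74% -> almost all clean pages
print(f"app threads stalled on cache for ~{wait_days:.0f} days (aggregate)")
```

      The cache being ~95% full with almost no dirty data, combined with roughly 17 cumulative days of application-thread time spent waiting for cache, is consistent with the reported symptom of read-heavy dumps stalling on eviction.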
      

      1. image-2017-03-20-15-24-39-540.png
        21 kB
      2. image-2017-03-20-15-32-14-192.png
        31 kB
      3. server-27700-after.png
        115 kB
      4. server-27700-before.png
        120 kB

          Activity

          michael.cahill Michael Cahill added a comment -

          Adrien Jarthon, the synthetic workload I referenced above can be tweaked to show eviction being much more efficient with this set of changes (e.g., 50x less work in eviction to support the same / higher workload).

          We have now heard from several MongoDB users including you that these changes also help in real world workloads, so I'm confident that the synthetic workload is reproducing the condition you reported.

          bigbourin@gmail.com Adrien Jarthon added a comment -

          Ok great

          bigbourin@gmail.com Adrien Jarthon added a comment -

          Hi,
          We just had a short but huge slowdown on our primary this morning, lasting about 5 minutes, during some slow requests doing a full scan on a big collection. So I'm wondering if it wasn't the same issue (cache eviction), because the problem started when the cache got full and caused a totally inexplicable global slowdown (on all collections). The machine itself was totally fine of course; it's running on a RAID 10 of SSDs, and the IO, RAM and CPU usage were all OK. So for an engine that's supposed to be concurrent, this is a bit of an issue as you can guess.

          These are the 3 slowest queries during the incident. I believe the first and last are the culprits, as they were exceptional (started from a console), whereas the second one happens all the time, so it was most probably a consequence here:

          2017-03-20T10:52:08.068+0100 I COMMAND  [conn9763332] command db.contents command: count { count: "contents", query: { language: "" } } planSummary: COLLSCAN keyUpdates:0 writeConflicts:0 numYields:251546 reslen:62 locks:{ Global: { acquireCount: { r: 503094 } }, Database: { acquireCount: { r: 251547 } }, Collection: { acquireCount: { r: 251547 } } } protocol:op_query 151378ms
          2017-03-20T10:53:57.375+0100 I COMMAND  [conn9755330] command db.content_threads command: find { find: "content_threads", filter: { deleted_at: null, source_id: { $in: [ ObjectId('569fbd0e7765628edf000072'), ObjectId('56ec32e8776562644b0004cb') ] }, $or: [ { texts: { $all: [ "kane", "hassan" ] } }, { texts: { $all: [ "kan", "hassan" ] } } ] }, limit: 30, skip: 0, sort: { last_content_at: -1 } } planSummary: IXSCAN { last_content_at: -1 } keysExamined:1240070 docsExamined:1240070 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:9734 nreturned:3 reslen:10926 locks:{ Global: { acquireCount: { r: 19470 } }, Database: { acquireCount: { r: 9735 } }, Collection: { acquireCount: { r: 9735 } } } protocol:op_query 104210ms
          2017-03-20T10:55:38.911+0100 I COMMAND  [conn9763332] command db.contents command: count { count: "contents", query: { language: null } } planSummary: COLLSCAN keyUpdates:0 writeConflicts:0 numYields:252012 reslen:62 locks:{ Global: { acquireCount: { r: 504026 } }, Database: { acquireCount: { r: 252013 } }, Collection: { acquireCount: { r: 252013 } } } protocol:op_query 199269ms
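          As a sanity check on these log lines, the lock counters are internally consistent: the Database `acquireCount { r }` equals `numYields + 1` (one initial acquisition plus one reacquisition per yield), and the Global count is exactly twice the Database count in all three entries. A quick check, as a sketch in Python using values copied from the first COLLSCAN line above:

```python
# Values copied from the first COLLSCAN log line above.
num_yields = 251546
database_r = 251547  # Database: acquireCount { r }
global_r = 503094    # Global: acquireCount { r }

# One initial acquisition, plus one reacquisition after each yield:
assert database_r == num_yields + 1

# In these log lines the Global intent lock is counted twice
# per Database acquisition:
assert global_r == 2 * database_r

print("lock counts consistent with numYields")
```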

          Here is our web request processing during the incident; we can see it's not totally blocked, but almost:

          Here is the MongoDB primary memory usage (careful, the timeframe is not the same). We can see the cache rising quickly as the full scan runs, then starving and slowing down at full capacity. Max cache size is 152G for us (60% of RAM - 1G); we're getting above that here, but there are probably other things counted besides the WT cache, I guess?
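          The 152G figure follows from the default WiredTiger cache sizing described above (60% of RAM minus 1 GB on this MongoDB version). A minimal sketch of that arithmetic, where `default_wt_cache_gb` is just an illustrative helper name:

```python
def default_wt_cache_gb(ram_gb):
    """Default WiredTiger cache size as described above:
    60% of RAM minus 1 GB, with a 1 GB floor."""
    return max(0.6 * ram_gb - 1, 1)

# The servers in this ticket have 256 GB of RAM:
print(default_wt_cache_gb(256))  # ~152.6 GB, matching the 152G cache limit quoted
```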

          I uploaded the diagnostic.data from our primary to the same portal: diagnostic.data.tgz (108M)
          Could you please let me know whether it's the same problem or not? Thanks.

          aqueen Aaron Queen added a comment -

          Can the "affects" version be updated to include 3.2.12, and can this be backported to 3.2 and hopefully land in 3.2.13? Thanks!

          thomas.schubert Thomas Schubert added a comment -

          Hi Aaron Queen,

          The fixVersion "3.2 required" indicates that the backport to MongoDB 3.2 has been requested and approved. Please continue to watch this ticket for updates.

          Thank you,
          Thomas

