[SERVER-27700] WT secondary performance drops to near-zero with cache full Created: 17/Jan/17  Updated: 06/Feb/19  Resolved: 02/Mar/17

Status: Closed
Project: Core Server
Component/s: Performance, WiredTiger
Affects Version/s: 3.2.9, 3.2.10, 3.2.11
Fix Version/s: 3.2.13, 3.4.3, 3.5.4

Type: Bug Priority: Major - P3
Reporter: Adrien Jarthon Assignee: Michael Cahill (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File 3.2.10 → 3.4.3 WT cache.png     PNG File downtime-2.png     PNG File downtime.png     PNG File image-2017-03-20-15-24-39-540.png     PNG File image-2017-03-20-15-32-14-192.png     PNG File server-27700-after.png     PNG File server-27700-before.png     PNG File server-27700-march20.png    
Issue Links:
Backports
Depends
depends on WT-3115 Change the dhandle lock to a read/wri... Closed
depends on WT-3139 Enhance wtperf to support periodic ta... Closed
depends on WT-3148 Improve eviction efficiency with many... Closed
depends on WT-3150 Reduce impact of checkpoints on evict... Closed
Duplicate
is duplicated by SERVER-27753 wiredTiger cache 95% performance issues Closed
Related
related to SERVER-39355 Collection drops can block the server... Closed
is related to SERVER-28316 Related to SERVER-27753 - Custom bui... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v3.4, v3.2
Steps To Reproduce:

No easy steps of course; in our case we have a secondary with replication load and some mongo dumps running every night, and the problem appears progressively.
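
For reference, the nightly dump is an ordinary mongodump run against the secondary; a minimal sketch of the kind of invocation involved (the host and output path here are illustrative, not the exact values from this report):

mongodump --host mongo2.company --port 27017 --oplog --out /backup/mongo/$(date +%F)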

Sprint: Storage 2017-03-06
Participants:
Case:

 Description   

Hello,

I'm sorry but I'll have to reopen the difficult topic of WT dying when the cache is full, which has been reported multiple times already but all closed hoping 3.2.10 would fix the issue: SERVER-25070 SERVER-25974 SERVER-26001 SERVER-24580 SERVER-26055

We upgraded to 3.2.10 and later 3.2.11 as suggested but unfortunately it didn't solve this issue (though it improved other performance problems). We're currently running 3.2.11 in a replica set on 2 capable servers (dual Xeon, 256GB RAM, RAID 10 SSD) and running only mongo dumps on the secondary every night. The servers were running fine for weeks; we recently added another mongo dump with the --oplog option every night, and since then (about 5 days ago) our secondary started getting slower and slower, up to the point that today we couldn't do anything on it: even running db.getReplicationInfo() takes 30 to 60 seconds.

This is the exact same problem I explained in SERVER-26055.
In the end the server was in the following state:

  • the mongod process was using 100~150% CPU and 0 IO (normal operation is 20-30% CPU and 1-2% IO)
  • the cache was full: 153G (see the serverStatus sketch below)
  • most commands would take minutes to complete
  • the mongodump processes had been running for more than 10 hours and were not doing much: a few records per minute (they usually take just 1h~1.5h)
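
For reference, the cache fill level can be read straight from serverStatus in the mongo shell (a minimal sketch, using the statistic names this 3.2.x build reports):

// sample the WiredTiger cache statistics once
var c = db.serverStatus().wiredTiger.cache;
var usedGiB = c["bytes currently in the cache"] / Math.pow(1024, 3);
var maxGiB  = c["maximum bytes configured"] / Math.pow(1024, 3);
print("WT cache: " + usedGiB.toFixed(1) + " GiB used of " + maxGiB.toFixed(1) +
      " GiB configured (" + (100 * usedGiB / maxGiB).toFixed(1) + "%)");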

I've uploaded the diagnostic data dir to your secure portal (diagnostic.tar.bz2,
100 MiB) covering the last 6 days. The last file, starting at 13:39 on the 17th, is when we restarted the server to save it, so the problem is in the previous files.

I hope we'll be able to fix this once and for all and that our primary server won't suffer from the same issue.

Here is the server status before the restart (at its worst):

company:SECONDARY> db.serverStatus()
{
  "host" : "mongo2.company",
  "advisoryHostFQDNs" : [
    "mongo2.company"
  ],
  "version" : "3.2.11",
  "process" : "mongod",
  "pid" : NumberLong(11157),
  "uptime" : 3012644,
  "uptimeMillis" : NumberLong("3012644493"),
  "uptimeEstimate" : 2987132,
  "localTime" : ISODate("2017-01-17T13:22:10.765Z"),
  "asserts" : {
    "regular" : 0,
    "warning" : 0,
    "msg" : 0,
    "user" : 356,
    "rollovers" : 0
  },
  "connections" : {
    "current" : 893,
    "available" : 199107,
    "totalCreated" : NumberLong(3089363)
  },
  "extra_info" : {
    "note" : "fields vary by platform",
    "heap_usage_bytes" : -2054714728,
    "page_faults" : 493
  },
  "globalLock" : {
    "totalTime" : NumberLong("3012644491000"),
    "currentQueue" : {
      "total" : 0,
      "readers" : 0,
      "writers" : 0
    },
    "activeClients" : {
      "total" : 926,
      "readers" : 0,
      "writers" : 15
    }
  },
  "locks" : {
    "Global" : {
      "acquireCount" : {
        "r" : NumberLong("16384331421"),
        "w" : NumberLong(1308770204),
        "R" : NumberLong(297276217),
        "W" : NumberLong(297288581)
      },
      "acquireWaitCount" : {
        "r" : NumberLong(22010128),
        "w" : NumberLong(301),
        "R" : NumberLong(255),
        "W" : NumberLong(11397391)
      },
      "timeAcquiringMicros" : {
        "r" : NumberLong("1566808918815"),
        "w" : NumberLong(8786271),
        "R" : NumberLong(44650942),
        "W" : NumberLong("169000143619")
      }
    },
    "Database" : {
      "acquireCount" : {
        "r" : NumberLong("7597006401"),
        "w" : NumberLong(416908607),
        "R" : NumberLong(596275),
        "W" : NumberLong(891861639)
      },
      "acquireWaitCount" : {
        "w" : NumberLong(2),
        "R" : NumberLong(1),
        "W" : NumberLong(540)
      },
      "timeAcquiringMicros" : {
        "w" : NumberLong(288),
        "R" : NumberLong(1371613),
        "W" : NumberLong(51766)
      }
    },
    "Collection" : {
      "acquireCount" : {
        "r" : NumberLong("8108302166"),
        "w" : NumberLong(416907972)
      }
    },
    "Metadata" : {
      "acquireCount" : {
        "w" : NumberLong(1),
        "W" : NumberLong(71)
      },
      "acquireWaitCount" : {
        "W" : NumberLong(3)
      },
      "timeAcquiringMicros" : {
        "W" : NumberLong(2270)
      }
    },
    "oplog" : {
      "acquireCount" : {
        "r" : NumberLong(10153326),
        "w" : NumberLong(681)
      }
    }
  },
  "network" : {
    "bytesIn" : NumberLong("11967403731"),
    "bytesOut" : NumberLong("8446426050037"),
    "numRequests" : NumberLong(146466986)
  },
  "opcounters" : {
    "insert" : 0,
    "query" : 13161552,
    "update" : 0,
    "delete" : 0,
    "getmore" : 3895271,
    "command" : 129410157
  },
  "opcountersRepl" : {
    "insert" : 35012985,
    "query" : 0,
    "update" : 378560144,
    "delete" : 3334842,
    "getmore" : 0,
    "command" : 0
  },
  "repl" : {
    "hosts" : [
      "10.10.11.18:27017",
      "10.10.11.19:27017"
    ],
    "arbiters" : [
      "10.10.11.9:27017"
    ],
    "setName" : "company",
    "setVersion" : 48,
    "ismaster" : false,
    "secondary" : true,
    "primary" : "10.10.11.18:27017",
    "me" : "10.10.11.19:27017",
    "rbid" : 31628773
  },
  "storageEngine" : {
    "name" : "wiredTiger",
    "supportsCommittedReads" : true,
    "persistent" : true
  },
  "tcmalloc" : {
    "generic" : {
      "current_allocated_bytes" : NumberLong("161154044008"),
      "heap_size" : NumberLong("182151512064")
    },
    "tcmalloc" : {
      "pageheap_free_bytes" : NumberLong("14107099136"),
      "pageheap_unmapped_bytes" : NumberLong(1802354688),
      "max_total_thread_cache_bytes" : NumberLong(1073741824),
      "current_total_thread_cache_bytes" : 479349592,
      "total_free_bytes" : NumberLong("5088014232"),
      "central_cache_free_bytes" : NumberLong("4578532256"),
      "transfer_cache_free_bytes" : 30132384,
      "thread_cache_free_bytes" : 479349592,
      "aggressive_memory_decommit" : 0,
      "formattedString" : "------------------------------------------------\nMALLOC:   161154044008 (153688.5 MiB) Bytes in use by application\nMALLOC: +  14107099136 (13453.6 MiB) Bytes in page heap freelist\nMALLOC: +   4578532256 ( 4366.4 MiB) Bytes in central cache freelist\nMALLOC: +     30132384 (   28.7 MiB) Bytes in transfer cache freelist\nMALLOC: +    479349592 (  457.1 MiB) Bytes in thread cache freelists\nMALLOC: +    499527840 (  476.4 MiB) Bytes in malloc metadata\nMALLOC:   ------------\nMALLOC: = 180848685216 (172470.7 MiB) Actual memory used (physical + swap)\nMALLOC: +   1802354688 ( 1718.9 MiB) Bytes released to OS (aka unmapped)\nMALLOC:   ------------\nMALLOC: = 182651039904 (174189.6 MiB) Virtual address space used\nMALLOC:\nMALLOC:        5011342              Spans in use\nMALLOC:            956              Thread heaps in use\nMALLOC:           8192              Tcmalloc page size\n------------------------------------------------\nCall ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).\nBytes released to the OS take up virtual address space but no physical memory.\n"
    }
  },
  "wiredTiger" : {
    "uri" : "statistics:",
    "LSM" : {
      "application work units currently queued" : 0,
      "merge work units currently queued" : 0,
      "rows merged in an LSM tree" : 0,
      "sleep for LSM checkpoint throttle" : 0,
      "sleep for LSM merge throttle" : 0,
      "switch work units currently queued" : 0,
      "tree maintenance operations discarded" : 0,
      "tree maintenance operations executed" : 0,
      "tree maintenance operations scheduled" : 0,
      "tree queue hit maximum" : 0
    },
    "async" : {
      "current work queue length" : 0,
      "maximum work queue length" : 0,
      "number of allocation state races" : 0,
      "number of flush calls" : 0,
      "number of operation slots viewed for allocation" : 0,
      "number of times operation allocation failed" : 0,
      "number of times worker found no work" : 0,
      "total allocations" : 0,
      "total compact calls" : 0,
      "total insert calls" : 0,
      "total remove calls" : 0,
      "total search calls" : 0,
      "total update calls" : 0
    },
    "block-manager" : {
      "blocks pre-loaded" : 563909,
      "blocks read" : 447716114,
      "blocks written" : 466221343,
      "bytes read" : NumberLong("5105232363520"),
      "bytes written" : NumberLong("5242967597056"),
      "bytes written for checkpoint" : NumberLong("5209527078912"),
      "mapped blocks read" : 0,
      "mapped bytes read" : 0
    },
    "cache" : {
      "application threads page read from disk to cache count" : 401675260,
      "application threads page read from disk to cache time (usecs)" : 21385899599,
      "application threads page write from cache to disk count" : 6026,
      "application threads page write from cache to disk time (usecs)" : 515035,
      "bytes belonging to page images in the cache" : 145719393101,
      "bytes currently in the cache" : 153008266330,
      "bytes not belonging to page images in the cache" : 7288873228,
      "bytes read into cache" : NumberLong("8588183106176"),
      "bytes written from cache" : NumberLong("4678177271668"),
      "checkpoint blocked page eviction" : 310,
      "eviction calls to get a page" : 1020803718,
      "eviction calls to get a page found queue empty" : 594912745,
      "eviction calls to get a page found queue empty after locking" : 10716036,
      "eviction currently operating in aggressive mode" : 0,
      "eviction empty score" : 0,
      "eviction server candidate queue empty when topping up" : 10557087,
      "eviction server candidate queue not empty when topping up" : 123218,
      "eviction server evicting pages" : 7185464,
      "eviction server slept, because we did not make progress with eviction" : 895321,
      "eviction server unable to reach eviction goal" : 0,
      "eviction state" : 3,
      "eviction walks abandoned" : 8639822,
      "eviction worker thread evicting pages" : 397247395,
      "failed eviction of pages that exceeded the in-memory maximum" : 40327,
      "files with active eviction walks" : 0,
      "files with new eviction walks started" : 11070568265,
      "hazard pointer blocked page eviction" : 42833,
      "hazard pointer check calls" : 406723085,
      "hazard pointer check entries walked" : 17485244113,
      "hazard pointer maximum array length" : 50,
      "in-memory page passed criteria to be split" : 68631,
      "in-memory page splits" : 34447,
      "internal pages evicted" : 9705753,
      "internal pages split during eviction" : 502,
      "leaf pages split during eviction" : 487134,
      "lookaside table insert calls" : 0,
      "lookaside table remove calls" : 0,
      "maximum bytes configured" : 161061273600,
      "maximum page size at eviction" : 0,
      "modified pages evicted" : 146340076,
      "modified pages evicted by application threads" : 4099,
      "overflow pages read into cache" : 0,
      "overflow values cached in memory" : 0,
      "page split during eviction deepened the tree" : 0,
      "page written requiring lookaside records" : 0,
      "pages currently held in the cache" : 4157514,
      "pages evicted because they exceeded the in-memory maximum" : 36799,
      "pages evicted because they had chains of deleted items" : 5847,
      "pages evicted by application threads" : 2201563,
      "pages queued for eviction" : 416886847,
      "pages queued for urgent eviction" : 97332,
      "pages queued for urgent eviction during walk" : 97332,
      "pages read into cache" : 401701308,
      "pages read into cache requiring lookaside entries" : 0,
      "pages requested from the cache" : 44781043115,
      "pages seen by eviction walk" : 15671920146,
      "pages selected for eviction unable to be evicted" : 7187074,
      "pages walked for eviction" : 1005956628381,
      "pages written from cache" : 374093813,
      "pages written requiring in-memory restoration" : 135689,
      "percentage overhead" : 8,
      "tracked bytes belonging to internal pages in the cache" : 658910666,
      "tracked bytes belonging to leaf pages in the cache" : 152349355664,
      "tracked dirty bytes in the cache" : 1194195084,
      "tracked dirty pages in the cache" : 17960,
      "unmodified pages evicted" : 251160847
    },
    "connection" : {
      "auto adjusting condition resets" : 47670726,
      "auto adjusting condition wait calls" : 61671191,
      "files currently open" : 19480,
      "memory allocations" : 124965158903,
      "memory frees" : 124512558703,
      "memory re-allocations" : 204658183,
      "pthread mutex condition wait calls" : 654232361,
      "pthread mutex shared lock read-lock calls" : 4968504666,
      "pthread mutex shared lock write-lock calls" : 3332317538,
      "total fsync I/Os" : 346443922,
      "total read I/Os" : 447943133,
      "total write I/Os" : 771846256
    },
    "cursor" : {
      "cursor create calls" : 205986549,
      "cursor insert calls" : 2330700111,
      "cursor next calls" : 17706432089,
      "cursor prev calls" : 14969190,
      "cursor remove calls" : 346172449,
      "cursor reset calls" : 21665455518,
      "cursor restarted searches" : 461443,
      "cursor search calls" : 22666117834,
      "cursor search near calls" : 940275132,
      "cursor update calls" : 1,
      "truncate calls" : 340
    },
    "data-handle" : {
      "connection data handles currently active" : 51620,
      "connection sweep candidate became referenced" : 0,
      "connection sweep dhandles closed" : 12864,
      "connection sweep dhandles removed from hash list" : 731162956,
      "connection sweep time-of-death sets" : 730653128,
      "connection sweeps" : 300763,
      "session dhandles swept" : 1661,
      "session sweep attempts" : 123182
    },
    "lock" : {
      "checkpoint lock acquisitions" : 36329,
      "checkpoint lock application thread wait time (usecs)" : 131065550,
      "checkpoint lock internal thread wait time (usecs)" : 124,
      "handle-list lock acquisitions" : 731959926,
      "handle-list lock application thread wait time (usecs)" : 867780820,
      "handle-list lock internal thread wait time (usecs)" : 748445731,
      "metadata lock acquisitions" : 36281,
      "metadata lock application thread wait time (usecs)" : 0,
      "metadata lock internal thread wait time (usecs)" : 196,
      "schema lock acquisitions" : 97621,
      "schema lock application thread wait time (usecs)" : 1728012,
      "schema lock internal thread wait time (usecs)" : 67,
      "table lock acquisitions" : 1647870,
      "table lock application thread time waiting for the table lock (usecs)" : 615679214,
      "table lock internal thread time waiting for the table lock (usecs)" : 2160519
    },
    "log" : {
      "busy returns attempting to switch slots" : 66054,
      "consolidated slot closures" : 303650650,
      "consolidated slot join races" : 379772512,
      "consolidated slot join transitions" : 303650650,
      "consolidated slot joins" : 1308972809,
      "consolidated slot unbuffered writes" : 2539282,
      "log bytes of payload data" : NumberLong("2017530171671"),
      "log bytes written" : NumberLong("2092694487296"),
      "log files manually zero-filled" : 0,
      "log flush operations" : 327040636,
      "log force write operations" : 337333167,
      "log force write operations skipped" : 36763021,
      "log records compressed" : 740009745,
      "log records not compressed" : 565440535,
      "log records too small to compress" : 3025714,
      "log release advances write LSN" : 65308,
      "log scan operations" : 5,
      "log scan records requiring two reads" : 2,
      "log server thread advances write LSN" : 303584401,
      "log server thread write LSN walk skipped" : 31096956,
      "log sync operations" : 300227454,
      "log sync time duration (usecs)" : 48779299297,
      "log sync_dir operations" : 20000,
      "log sync_dir time duration (usecs)" : 3706692,
      "log write operations" : 1308684346,
      "logging bytes consolidated" : NumberLong("2093413492096"),
      "maximum log file size" : 104857600,
      "number of pre-allocated log files to create" : 10,
      "pre-allocated log files not ready and missed" : 9,
      "pre-allocated log files prepared" : 20001,
      "pre-allocated log files used" : 19991,
      "records processed by log scan" : 11,
      "total in-memory size of compressed records" : NumberLong("2862605082266"),
      "total log buffer size" : 33554432,
      "total size of compressed records" : NumberLong("1903704122861"),
      "written slots coalesced" : 941,
      "yields waiting for previous log file close" : 0
    },
    "reconciliation" : {
      "fast-path pages deleted" : 4200903,
      "page reconciliation calls" : 361302879,
      "page reconciliation calls for eviction" : 1905672,
      "pages deleted" : 112805,
      "split bytes currently awaiting free" : 13252,
      "split objects currently awaiting free" : 10
    },
    "session" : {
      "open cursor count" : 49239,
      "open session count" : 69,
      "table compact failed calls" : 0,
      "table compact successful calls" : 0,
      "table create failed calls" : 0,
      "table create successful calls" : 179,
      "table drop failed calls" : 2,
      "table drop successful calls" : 28847,
      "table rebalance failed calls" : 0,
      "table rebalance successful calls" : 0,
      "table rename failed calls" : 0,
      "table rename successful calls" : 0,
      "table salvage failed calls" : 0,
      "table salvage successful calls" : 0,
      "table truncate failed calls" : 0,
      "table truncate successful calls" : 340,
      "table verify failed calls" : 0,
      "table verify successful calls" : 0
    },
    "thread-state" : {
      "active filesystem fsync calls" : 0,
      "active filesystem read calls" : 0,
      "active filesystem write calls" : 0
    },
    "thread-yield" : {
      "application thread time evicting (usecs)" : 65175199,
      "application thread time waiting for cache (usecs)" : NumberLong("1501020187418"),
      "page acquire busy blocked" : 66,
      "page acquire eviction blocked" : 162,
      "page acquire locked blocked" : 30037,
      "page acquire read blocked" : 1830625,
      "page acquire time sleeping (usecs)" : 1970680000
    },
    "transaction" : {
      "number of named snapshots created" : 0,
      "number of named snapshots dropped" : 0,
      "transaction begins" : 2440676065,
      "transaction checkpoint currently running" : 1,
      "transaction checkpoint generation" : 36282,
      "transaction checkpoint max time (msecs)" : 103814,
      "transaction checkpoint min time (msecs)" : 31,
      "transaction checkpoint most recent time (msecs)" : 61561,
      "transaction checkpoint scrub dirty target" : 0,
      "transaction checkpoint scrub time (msecs)" : 0,
      "transaction checkpoint total time (msecs)" : 831969937,
      "transaction checkpoints" : 36282,
      "transaction checkpoints skipped because database was clean" : 0,
      "transaction failures due to cache overflow" : 0,
      "transaction fsync calls for checkpoint after allocating the transaction ID" : 36281,
      "transaction fsync duration for checkpoint after allocating the transaction ID (usecs)" : 404122,
      "transaction range of IDs currently pinned" : 375,
      "transaction range of IDs currently pinned by a checkpoint" : 197,
      "transaction range of IDs currently pinned by named snapshots" : 0,
      "transaction sync calls" : 0,
      "transactions committed" : 1687828511,
      "transactions rolled back" : 752348725
    },
    "concurrentTransactions" : {
      "write" : {
        "out" : 15,
        "available" : 113,
        "totalTickets" : 128
      },
      "read" : {
        "out" : 0,
        "available" : 128,
        "totalTickets" : 128
      }
    }
  },
  "writeBacksQueued" : false,
  "mem" : {
    "bits" : 64,
    "resident" : 168549,
    "virtual" : 175704,
    "supported" : true,
    "mapped" : 0,
    "mappedWithJournal" : 0
  },
  "metrics" : {
    "commands" : {
      "_isSelf" : {
        "failed" : NumberLong(0),
        "total" : NumberLong(22)
      },
      "aggregate" : {
        "failed" : NumberLong(0),
        "total" : NumberLong(88070)
      },
      "buildInfo" : {
        "failed" : NumberLong(0),
        "total" : NumberLong(72)
      },
      "collStats" : {
        "failed" : NumberLong(0),
        "total" : NumberLong(16)
      },
      "count" : {
        "failed" : NumberLong(0),
        "total" : NumberLong(1099981)
      },
      "currentOp" : {
        "failed" : NumberLong(0),
        "total" : NumberLong(5)
      },
      "dbStats" : {
        "failed" : NumberLong(0),
        "total" : NumberLong(596120)
      },
      "find" : {
        "failed" : NumberLong(0),
        "total" : NumberLong(13121028)
      },
      "getLog" : {
        "failed" : NumberLong(0),
        "total" : NumberLong(14)
      },
      "getMore" : {
        "failed" : NumberLong(1),
        "total" : NumberLong(3790358)
      },
      "getnonce" : {
        "failed" : NumberLong(0),
        "total" : NumberLong(546)
      },
      "isMaster" : {
        "failed" : NumberLong(0),
        "total" : NumberLong(118296139)
      },
      "listCollections" : {
        "failed" : NumberLong(0),
        "total" : NumberLong(149)
      },
      "listDatabases" : {
        "failed" : NumberLong(0),
        "total" : NumberLong(99360)
      },
      "listIndexes" : {
        "failed" : NumberLong(0),
        "total" : NumberLong(412853)
      },
      "ping" : {
        "failed" : NumberLong(0),
        "total" : NumberLong(2274345)
      },
      "replSetGetConfig" : {
        "failed" : NumberLong(0),
        "total" : NumberLong(4)
      },
      "replSetGetStatus" : {
        "failed" : NumberLong(1),
        "total" : NumberLong(115004)
      },
      "replSetHeartbeat" : {
        "failed" : NumberLong(3),
        "total" : NumberLong(4592438)
      },
      "replSetRequestVotes" : {
        "failed" : NumberLong(0),
        "total" : NumberLong(4)
      },
      "replSetSyncFrom" : {
        "failed" : NumberLong(0),
        "total" : NumberLong(1)
      },
      "replSetUpdatePosition" : {
        "failed" : NumberLong(0),
        "total" : NumberLong(1682869)
      },
      "serverStatus" : {
        "failed" : NumberLong(0),
        "total" : NumberLong(117814)
      },
      "top" : {
        "failed" : NumberLong(0),
        "total" : NumberLong(316)
      },
      "whatsmyuri" : {
        "failed" : NumberLong(0),
        "total" : NumberLong(34015)
      }
    },
    "cursor" : {
      "timedOut" : NumberLong(8),
      "open" : {
        "noTimeout" : NumberLong(4),
        "pinned" : NumberLong(1),
        "total" : NumberLong(7)
      }
    },
    "document" : {
      "deleted" : NumberLong(0),
      "inserted" : NumberLong(0),
      "returned" : NumberLong("12382612005"),
      "updated" : NumberLong(0)
    },
    "getLastError" : {
      "wtime" : {
        "num" : 0,
        "totalMillis" : 0
      },
      "wtimeouts" : NumberLong(0)
    },
    "operation" : {
      "fastmod" : NumberLong(0),
      "idhack" : NumberLong(8027470),
      "scanAndOrder" : NumberLong(248943),
      "writeConflicts" : NumberLong(7)
    },
    "queryExecutor" : {
      "scanned" : NumberLong(600529645),
      "scannedObjects" : NumberLong("12859908515")
    },
    "record" : {
      "moves" : NumberLong(0)
    },
    "repl" : {
      "executor" : {
        "counters" : {
          "eventCreated" : 2,
          "eventWait" : 2,
          "cancels" : 53261409,
          "waits" : 941744516,
          "scheduledNetCmd" : 5495818,
          "scheduledDBWork" : 7,
          "scheduledXclWork" : 0,
          "scheduledWorkAt" : 59134340,
          "scheduledWork" : 943427733,
          "schedulingFailures" : 0
        },
        "queues" : {
          "networkInProgress" : 0,
          "dbWorkInProgress" : 0,
          "exclusiveInProgress" : 0,
          "sleepers" : 4,
          "ready" : 0,
          "free" : 14
        },
        "unsignaledEvents" : 4,
        "eventWaiters" : 0,
        "shuttingDown" : false,
        "networkInterface" : "NetworkInterfaceASIO inShutdown: 0"
      },
      "apply" : {
        "attemptsToBecomeSecondary" : NumberLong(1),
        "batches" : {
          "num" : 297276216,
          "totalMillis" : 26415
        },
        "ops" : NumberLong(416920316)
      },
      "buffer" : {
        "count" : NumberLong(3034),
        "maxSizeBytes" : 268435456,
        "sizeBytes" : NumberLong(4729764)
      },
      "network" : {
        "bytes" : NumberLong("231348439513"),
        "getmores" : {
          "num" : 402313044,
          "totalMillis" : 2224956627
        },
        "ops" : NumberLong(416925030),
        "readersCreated" : NumberLong(29)
      },
      "preload" : {
        "docs" : {
          "num" : 0,
          "totalMillis" : 0
        },
        "indexes" : {
          "num" : 0,
          "totalMillis" : 0
        }
      }
    },
    "storage" : {
      "freelist" : {
        "search" : {
          "bucketExhausted" : NumberLong(0),
          "requests" : NumberLong(0),
          "scanned" : NumberLong(0)
        }
      }
    },
    "ttl" : {
      "deletedDocuments" : NumberLong(0),
      "passes" : NumberLong(49086)
    }
  },
  "ok" : 1
}



 Comments   
Comment by Scott Glajch [ 04/May/17 ]

Updating from 3.2.12 to 3.2.13 has significantly improved our system's performance (and specifically, the variability in performance), and we are pretty sure it was due to this bug fix.

Comment by Ramon Fernandez Marina [ 28/Apr/17 ]

Thanks for the update bigbourin@gmail.com, and very glad to hear that MongoDB 3.4.3 has taken care of the issue and is working well for you.

Regards,
Ramón.

Comment by Adrien Jarthon [ 28/Apr/17 ]

Thanks, and for the record we have been running 3.4.3 in production since last week with no performance issue so far; the WT cache usage level is MUCH lower (~105G now, whereas it was ~150G before):

Comment by Ramon Fernandez Marina [ 28/Apr/17 ]

This is to let watchers know that a 3.2.13-rc0 release candidate, which includes a fix for this ticket, is available for testing. We're planning to release 3.2.13 for production use early next week.

Regards,
Ramón.

Comment by Adrien Jarthon [ 22/Apr/17 ]

Yeah, the "Request time in queue" and "Mean requests duration" are web-server-level charts, provided by our ELK stack.
And the WT active cursors and memory usage charts are provided by our hosting partner using Telegraf + Grafana.

Comment by Jérémy Carlier [ 22/Apr/17 ]

Yeah, those graphs are nice. They're really helpful. We don't manage this ourselves; it's handled entirely by our web host. The system providing those charts is Grafana. It can give you a lot of information about Mongo, MySQL, your CPU, etc.

Comment by Chad Kreimendahl [ 22/Apr/17 ]

Jeremy,

Where do you get those pretty charts of your mongo performance?

Comment by Jérémy Carlier [ 12/Apr/17 ]

I work with @adrien jarthon; we are from the same company and we are investigating the same MongoDB servers. Could you please have a look and confirm whether you think it's the same issue, without asking me to create another ticket which would just end up being a duplicate?

Thank you

Comment by Ramon Fernandez Marina [ 10/Apr/17 ]

keysen, it's always preferable to open separate tickets to avoid confusion and allow for parallel assignments.

Thanks,
Ramón.

Comment by Jérémy Carlier [ 10/Apr/17 ]

Hello,

I come back with some news on this ticket because we had another downtime a few days ago in my company, on our product using MongoDB. For your information, we are still using 3.2.11 in production; we will move to 3.4.3 soon.

Here our graphs:

We think it's the same issue, but we would like your confirmation.
I uploaded the diagnostic.data directory to your secure portal. Let me know if you've found the archive (diagnostic.data.tgz — 108MB).

The issue occurred on 4 April. Thank you.

Comment by Adrien Jarthon [ 03/Apr/17 ]

Thank you for the analysis Michael. Ok so hopefully with 3.4.3 we should not have this problem again.
We started upgrading our staging, dev, and secondaries to 3.4.3 and will follow with the primaries after a few weeks of burn-in.

Comment by Michael Cahill (Inactive) [ 03/Apr/17 ]

bigbourin@gmail.com, I realized your most recent report had not gotten a response so I took a look. I note that this data is from an unpatched 3.2.11 server.

This is essentially the same issue as before: there is a phase change in the workload causing many pages to be read into cache. When the cache becomes full, MongoDB's yielding behavior causes many more cursors to be opened (yielding causes churn which here means we don't keep the required cursors cached). Opening all of those cursors (prior to WT-3152) causes a lot of contention for WiredTiger's table lock. That delays queuing of pages for eviction, leading to stalls caused by a full cache and empty eviction queues.

The improvements in concurrency in WT-3152 and WT-3115 are the primary fixes for this incident. The improvements in WT-3148 will further make eviction much more efficient for this workload (so the cache should not become full in the first place). This type of slowdown should not occur in builds with those tickets fixed.
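
For anyone trying to spot the same signature on their own deployment, the cursor churn and table lock contention described above show up in two serverStatus counters; a minimal mongo shell sketch (statistic names as reported by 3.2-era builds, and the one-minute sampling interval is only illustrative):

// take two samples one minute apart and print the deltas
var a = db.serverStatus().wiredTiger;
sleep(60 * 1000);
var b = db.serverStatus().wiredTiger;
print("cursors created per minute: " +
      (b.cursor["cursor create calls"] - a.cursor["cursor create calls"]));
print("table lock wait (usecs) per minute: " +
      (b.lock["table lock application thread time waiting for the table lock (usecs)"] -
       a.lock["table lock application thread time waiting for the table lock (usecs)"]));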

Comment by Kelsey Schubert [ 21/Mar/17 ]

Hi aqueen,

The fixVersion, "3.2 required", indicates that the backport to MongoDB 3.2 has been requested and approved. Please continue to watch this ticket for updates.

Thank you,
Thomas

Comment by Aaron Queen [ 21/Mar/17 ]

Can this be updated to affects 3.2.12 and can this be backported to 3.2 and hopefully land in 3.2.13? Thanks!

Comment by Adrien Jarthon [ 20/Mar/17 ]

Hi,
We just had a short but huge slowdown on our primary this morning, lasting 5 minutes, during some slow requests doing a full scan on a big collection. So I'm wondering if it wasn't the same issue (cache eviction), because the problem started when the cache got full and caused a totally inexplicable global slowdown (on all collections). The machine itself was totally fine of course: it's running on a RAID 10 of SSDs and the IO, RAM and CPU usage were all OK. So for an engine that is supposed to be concurrent this is a bit of an issue, as you can guess.

These are the 3 slowest queries during the issue, and I believe the first and last to be the culprits as they were exceptional (started from a console), whereas the second one happens all the time so it was most probably a consequence here:

2017-03-20T10:52:08.068+0100 I COMMAND  [conn9763332] command db.contents command: count { count: "contents", query: { language: "" } } planSummary: COLLSCAN keyUpdates:0 writeConflicts:0 numYields:251546 reslen:62 locks:{ Global: { acquireCount: { r: 503094 } }, Database: { acquireCount: { r: 251547 } }, Collection: { acquireCount: { r: 251547 } } } protocol:op_query 151378ms
2017-03-20T10:53:57.375+0100 I COMMAND  [conn9755330] command db.content_threads command: find { find: "content_threads", filter: { deleted_at: null, source_id: { $in: [ ObjectId('569fbd0e7765628edf000072'), ObjectId('56ec32e8776562644b0004cb') ] }, $or: [ { texts: { $all: [ "kane", "hassan" ] } }, { texts: { $all: [ "kan", "hassan" ] } } ] }, limit: 30, skip: 0, sort: { last_content_at: -1 } } planSummary: IXSCAN { last_content_at: -1 } keysExamined:1240070 docsExamined:1240070 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:9734 nreturned:3 reslen:10926 locks:{ Global: { acquireCount: { r: 19470 } }, Database: { acquireCount: { r: 9735 } }, Collection: { acquireCount: { r: 9735 } } } protocol:op_query 104210ms
2017-03-20T10:55:38.911+0100 I COMMAND  [conn9763332] command db.contents command: count { count: "contents", query: { language: null } } planSummary: COLLSCAN keyUpdates:0 writeConflicts:0 numYields:252012 reslen:62 locks:{ Global: { acquireCount: { r: 504026 } }, Database: { acquireCount: { r: 252013 } }, Collection: { acquireCount: { r: 252013 } } } protocol:op_query 199269ms

Here is our web request processing during the incident; we can see it's not totally blocked, but almost:

Here is the MongoDB primary memory usage (careful, the timeframe is not the same): we can see the cache rising quickly as the full scan runs, before starving and slowing down at top capacity. Max cache size is 152G for us (60% of RAM - 1G); we're getting above that here, but there are probably other things besides the WT cache, I guess?

I uploaded the diagnostic.data from our primary to the same portal: diagnostic.data.tgz (108M)
Could you please let me know objectively whether it's the same problem or not? Thanks.

Comment by Adrien Jarthon [ 01/Mar/17 ]

Ok great

Comment by Michael Cahill (Inactive) [ 01/Mar/17 ]

bigbourin@gmail.com, the synthetic workload I referenced above can be tweaked to show eviction being much more efficient with this set of changes (e.g., 50x less work in eviction to support the same / higher workload).

We have now heard from several MongoDB users including you that these changes also help in real world workloads, so I'm confident that the synthetic workload is reproducing the condition you reported.

Comment by Adrien Jarthon [ 24/Feb/17 ]

Well, we can see the eviction attempts and "queue empty after locking" lasting for hours after all load stops on the first version, which is why I observed that the server didn't recover in a timely manner the first time we had the issue. In my test here the performance came back quickly, but when we had the issue in production it was much worse: 12-14 hours at 100% CPU stuck doing eviction is a hang to me.

Did this WT test workload show any improvement with this fix?

Comment by Michael Cahill (Inactive) [ 24/Feb/17 ]

bigbourin@gmail.com, there is no evidence of a hang in this data: the eviction state did recover in the unpatched version, it just did so inefficiently.

The WiredTiger workload used for testing this is here. It is part of our regular performance regression test suite.

Comment by Adrien Jarthon [ 23/Feb/17 ]

Indeed, the numbers look great. I also like the fact that we can see, in the previous version, WT trying to walk through pages for eviction long after the server load stopped without managing to do anything, whereas in the patched build it never gets stuck trying to evict pages indefinitely.

I'm glad you figured this out; it's a really scary bug when you think your main database can hang under a regular load and keep dying and getting slower and slower even when the load decreases, without crashing (which would be much better).

Did you manage to reproduce the issue in a synthetic way to add to your performance regression test suite?

Comment by Michael Cahill (Inactive) [ 23/Feb/17 ]

bigbourin@gmail.com, here's a summary of what I can see from the diagnostic data.

In the original workload, there were frequent stalls caused by eviction (on average, out of 728 requests for eviction per second, 523 (or 72%) failed because the eviction queue was not kept full):

In contrast, with the patch build, only 79 / 659 requests (12%) find the queue empty:

Another way of looking at the same data is comparing "pages queued" vs "pages seen", which gives an estimate of how efficiently eviction is working:

  • originally, that was 202 / 10,560 (2%)
  • with the patch build it is 715 / 1167 (61%)

The result is that MongoDB is supporting higher throughput without stalls and with less CPU load from eviction. As I said, I'll let you know when these changes are in a release. Thanks again.
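
The same ratio can be approximated on a running server from the cumulative counters in serverStatus (a minimal mongo shell sketch; note these are totals since startup, so the per-second figures above come from deltas between samples in the diagnostic data):

var c = db.serverStatus().wiredTiger.cache;
var queued = c["pages queued for eviction"];
var seen   = c["pages seen by eviction walk"];
print("eviction efficiency (pages queued / pages seen): " +
      (100 * queued / seen).toFixed(1) + "%");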

Comment by Adrien Jarthon [ 23/Feb/17 ]

Sure, I just uploaded it! (diagnostic.tar.bz2, 96.7 MB)

Comment by Michael Cahill (Inactive) [ 22/Feb/17 ]

bigbourin@gmail.com, that's great news! Thanks for running this test and letting us know how it went.

Would you mind uploading the diagnostic.data from that test run to the same portal as before?

I will let you know when we have these changes scheduled into a release.

Comment by Adrien Jarthon [ 22/Feb/17 ]

Good news: I wasn't able to reproduce the issue with this custom build. The secondary never went over 129G cache usage (whereas it was going up to 153G before) and always performed just as fast (~1200 req/s on my benchmark), and mongodump consistently took about 1.2h. Looks like a win to me, good job guys \o/

Please let me know in which production release these fixes will land and if you need any more info

Comment by Adrien Jarthon [ 21/Feb/17 ]

Thanks, we'll give that a try!

Comment by Michael Cahill (Inactive) [ 20/Feb/17 ]

bigbourin@gmail.com, I have created a patch build here with the current tip of MongoDB 3.2 combined with the current WiredTiger develop. The link is for Debian 7.1 binaries which matches your previous data: if you need something different, please let me know.

This build includes the WiredTiger changes listed above to make eviction more efficient in workloads with many active collections. Please let me know what you see when you have a chance to try it out.

Comment by Adrien Jarthon [ 10/Feb/17 ]

Mm the only thing running frequently on this mongo is the monitoring (telegraf), which is issuing two commands: "listDatabases" and "dbStats" every 30 seconds:

2017-02-10T10:15:00.426+0100 I COMMAND  [conn1413] command admin.$cmd command: listDatabases { listDatabases: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:490 locks:{ Global: { acquireCount: { r: 12 }, acquireWaitCount: { r: 5 }, timeAcquiringMicros: { r: 5380 } }, Database: { acquireCount: { r: 6 } } } protocol:op_query 412ms
2017-02-10T10:15:00.738+0100 I COMMAND  [conn1413] command engagement_console_production command: dbStats { dbStats: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:212 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 1158 } }, Database: { acquireCount: { R: 1 } } } protocol:op_query 263ms

Is this keeping all the handles open to compute stats?

Ok let me know when the test version is ready

Comment by Michael Cahill (Inactive) [ 10/Feb/17 ]

bigbourin@gmail.com, thank you for the new data and apologies for the delay.

The setting I recommended doesn't seem to be very effective for your workload: in the statistics I still see the "data handles currently active" statistic sitting at around 12,000. Could there be that many collections and indexes accessed every 30 seconds?

What is happening during the slower periods is as explained before: most threads spend most of their time waiting for shared access to MongoDB's global lock. The global lock is held mostly by threads applying oplog operations on the secondary. When the oplog reads start, the amount of oplog data in cache grows by ~20GB in 20 minutes. Eviction is inefficient during this phase because it spends a lot of time working through the 12,000 trees in cache to find the ones containing most of the data. That slows down oplog application, which in turn slows down reads.

As explained earlier, we have made a number of changes that improve performance in this kind of workload in our testing. There is one further change I would like to make (WT-3152); then it would be great if you could try once more with those changes in place.

Comment by Adrien Jarthon [ 03/Feb/17 ]

Ok that's done

Comment by Kelsey Schubert [ 03/Feb/17 ]

Hi bigbourin@gmail.com,

Sorry for the confusion, portals (and their contents) are routinely deleted to protect user information. Going forward on this ticket, would you please upload to this new portal?

Thank you,
Thomas

Comment by Adrien Jarthon [ 03/Feb/17 ]

Hi,

I've just uploaded diagnostic.tar.bz2 (110M). I'm not sure of the exact time of my tests, but you should find them around the 29th-31st of January, I think.
Basically there will be an almost uninterrupted flow of 1200 req/s which slows down at some point (down to 150-300 req/s) and then gets back on track; the slowdown in my test occurs during the end of the dump and usually lasts from 5 to 30 minutes.

I am OK with running a custom build of MongoDB, but not a too edgy one and not too often, because it's still our production secondary and we need to ask our hosting provider to install it ☺

Edit: actually I tried 4 times and the upload, on your portal or by curl, fails with:

<?xml version="1.0" encoding="UTF-8"?> <Error><Code>AccessDenied</Code><Message>Invalid according to Policy: Policy expired.</Message><RequestId>37FFB0D13207AD83</RequestId><HostId>4pxhSFsQDgAum4cs0qnse+9kRlbnf9+Z0vnuf9Q7TKTzeufum6zP2d7WywjH/8GwlaWDLchJBoE=</HostId></Error>

How shall I upload it ?

Comment by Michael Cahill (Inactive) [ 02/Feb/17 ]

bigbourin@gmail.com, I'm sorry to hear that the experiment was unsuccessful. I think it's telling that it made a difference, in other words I believe we're on the right track.

Would you be able to attach the diagnostic data from that run?

You are correct about the parsing of config strings: the last setting wins, so the experiment was run with the suggested settings.

In terms of what is going on here, having many open handles is causing the existing eviction algorithm to be inefficient. The setting should reduce the number of open handles, which probably explains why performance was better, but the diagnostic data includes statistics that allow us to track the open handles directly.

Note that, as I said earlier, we are working on various changes to WiredTiger that should help. Would you be prepared to try a custom build of MongoDB with those changes to see if it helps in your case?

Comment by Adrien Jarthon [ 31/Jan/17 ]

Ok, I tried with the suggested configuration:

close_handle_minimum=100,close_idle_time=30,close_scan_interval=30

Unfortunately it didn't fix the problem, though it's a bit better: I tried 4 times and couldn't get below 200 req/s, so the performance drop was smaller, and the time it takes to dump the oplog didn't go above 20 minutes, whereas it took 1h in my last comment. But the problem is still here, as the normal performance is 1200 req/s and the cache was still full (153G).

This is somewhat expected to me, as the problem is not really about how long we keep the handles open when idle, but more about how we evict them when very active, IMO.
How was this setting supposed to help? By evicting the handles cached during the dump before the end of the dump?

Is there a way to double-check that the settings are currently active for the WT engine? I saw this in the log:

2017-01-27T14:16:02.138+0100 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=150G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),file_manager=(close_handle_minimum=100,close_idle_time=30,close_scan_interval=30)

So the file_manager options are duplicated (new ones at the end); I guess this is on purpose and WT keeps the last ones?

Comment by Eric Milkie [ 26/Jan/17 ]

mongodump does not actually record the oplog during the dump. Instead, it remembers the last entry in the oplog when it started dumping. When it then logs "writing captured oplog to xxxx", it at that moment issues a query to the oplog to read all the oplog entries since that remembered entry. This can be rather expensive.
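
Conceptually, that final phase boils down to a single range read of the oplog; an illustrative mongo shell sketch of the equivalent query (the timestamp value is hypothetical, and mongodump issues this read internally rather than via the shell):

// 'startTs' stands for the last oplog entry remembered when the dump began (hypothetical value)
var startTs = Timestamp(1485445876, 1);
db.getSiblingDB("local").oplog.rs.find({ ts: { $gt: startTs } }).sort({ $natural: 1 });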

Comment by Adrien Jarthon [ 26/Jan/17 ]

Ok, I think I managed to reproduce the issue. As I said previously, I took our existing secondary (so with an active replication stream) and I spammed it with read requests AND ran a mongodump --oplog at the same time. This all goes well until the mongodump reaches the "writing captured oplog to" line.

The dump took 1 hour up to this line and has now been stuck at this line for about 1h. My script, which sends read requests in a loop, was at 1200 req/sec on the idle mongo and during the whole dump up to this line, when it dropped to about 100-150 req/sec (on average; basically requests are blocked for dozens of seconds and then processed in batches all at once). db.getReplicationInfo() now takes 5-20 sec to respond and shows up to 1 minute of synchronization lag. It seems to slowly be getting slower and slower, and if I stop the dump now, the server will slowly regain its performance and free memory. I got this slowdown 3 times in 3 attempts, including one after a fresh restart of mongod.

Here is the log of the end of the dump, where we can see the 55-minute gap and the oplog counter going well past 100%. As soon as the progress bar started (17:46:50), the performance got better and the server freed 20G of cache at once (going from 153G to 133G):

2017-01-26T16:51:16.512+0100	writing captured oplog to 
2017-01-26T17:46:50.683+0100	[##......................]  .oplog  73441/767617  (9.6%)
2017-01-26T17:46:53.684+0100	[####....................]  .oplog  153089/767617  (19.9%)
2017-01-26T17:46:56.683+0100	[#######.................]  .oplog  240893/767617  (31.4%)
2017-01-26T17:46:59.683+0100	[#########...............]  .oplog  317299/767617  (41.3%)
2017-01-26T17:47:02.683+0100	[###########.............]  .oplog  382176/767617  (49.8%)
2017-01-26T17:47:05.683+0100	[##############..........]  .oplog  473312/767617  (61.7%)
2017-01-26T17:47:08.683+0100	[#################.......]  .oplog  554735/767617  (72.3%)
2017-01-26T17:47:11.684+0100	[###################.....]  .oplog  618572/767617  (80.6%)
2017-01-26T17:47:14.683+0100	[#######################.]  .oplog  739041/767617  (96.3%)
2017-01-26T17:47:17.684+0100	[########################]  .oplog  825514/767617  (107.5%)
2017-01-26T17:47:20.683+0100	[########################]  .oplog  915626/767617  (119.3%)
2017-01-26T17:47:23.684+0100	[########################]  .oplog  1001950/767617  (130.5%)
2017-01-26T17:47:26.683+0100	[########################]  .oplog  1089987/767617  (142.0%)
2017-01-26T17:47:29.684+0100	[########################]  .oplog  1167074/767617  (152.0%)
2017-01-26T17:47:32.683+0100	[########################]  .oplog  1239307/767617  (161.4%)
2017-01-26T17:47:35.684+0100	[########################]  .oplog  1330287/767617  (173.3%)
2017-01-26T17:47:37.465+0100	[########################]  .oplog  1392659/767617  (181.4%)
2017-01-26T17:47:37.465+0100		dumped 1392659 oplog entries

Why is this step so intensive? We don't seem to have the problem with mongodump without "--oplog". I know it records the oplog during the dump, but where is it stored? Maybe this is the part causing trouble with cache eviction?

I'll now try with your suggested configuration to see if I can reproduce the problem.

Comment by Adrien Jarthon [ 25/Jan/17 ]

Ok thanks for the clarifications.

> To answer your question about why handles are not closed automatically: MongoDB configures WiredTiger to not do that – the setting change I sent earlier does exactly this. One motivation for MongoDB's default setting is that some users were surprised when handles were closed in an idle server (causing additional I/O when the server became busy again).

I totally understand why the handles are kept open when the server is idle, that's what the cache is for, but in my case it was not idle: it was struggling to handle the replication and dump load. In this case, when the cache is full, isn't it supposed to evict the least recently used items from the cache to free it? Was it failing to do that? Was it considering everything impossible to evict?

I'm still trying to reproduce the issue but haven't managed so far.
Thank you for all the work on this.

BTW, if this helps, I had a very similar issue (probably for the same reason, but I can't be sure) on a much smaller DB with a dozen collections (for https://updown.io). I have a primary and a secondary; replication is about 120 req/sec and the secondary does nothing more than replication and dumps. All was going well for months, and one day I tried issuing a few queries to the secondary (just some tests from the console; I couldn't go further because I found a bug in the client ☺). A few hours after that, I got an MMS alert telling me my secondary was lagging behind the primary, so I ssh'ed in to check and the mongod was basically in the state you know: 100% CPU, doing nothing, and replication totally stuck or processing a few ops per minute, I don't recall exactly. The cache was full of course, but it's always full on this secondary because I configured it with only 1GB of cache. I restarted the mongod process and everything has been fine since then (it was 4-5 months ago).
So this problem seems to only happen when we start issuing read queries to the secondary (even if they are small ones and not many), which is probably why not many people hit this problem before.

Is there anything special about them that could explain this? Something different from dump queries? Maybe because they use indexes? I checked whether I still have the diagnostic.data for this event, but it's way too old: I only keep 4 days of diagnostics.

Comment by Michael Cahill (Inactive) [ 25/Jan/17 ]

bigbourin@gmail.com, thanks for your reply.

This issue is certainly more pronounced on secondaries because any delays while holding the Global lock cause everything to stall. It's certainly possible that some of the same underlying inefficiency is in play on primaries but it is less likely to disrupt other operations.

To answer your question about why handles are not closed automatically: MongoDB configures WiredTiger to not do that – the setting change I sent earlier does exactly this. One motivation for MongoDB's default setting is that some users were surprised when handles were closed in an idle server (causing additional I/O when the server became busy again).

Please do let us know if you are able to reproduce the issue and try with the configuration change. I still believe that may be sufficient to resolve the issue.

The storage team has been working to improve WiredTiger's handling of workloads with many active collections and indexes. In particular, we have changes in code review for:

  • WT-3115 improve concurrency for accessing the list of trees in cache
  • WT-3138 improves the statistics we can report for each tree in cache
  • WT-3139 enhances our workload generator to stress individual tables
  • WT-3148 improves eviction efficiency with many idle trees in cache
  • WT-3150 reduces the impact of checkpoints on eviction efficiency

Based on what we've seen in our testing so far, I'm hopeful that the combination of these changes will significantly improve WiredTiger's handling of workloads like yours, even with MongoDB's default settings.

Comment by Adrien Jarthon [ 20/Jan/17 ]

Thanks for investigating this issue, your diagnosis seems to match what I saw.

You said this is linked to the secondary using a global exclusive lock when syncing the oplog, so I guess the same problem can't happen on a primary? Because we haven't seen this issue on the primary up to now (fortunately).

The WT cache eviction problem is probably the same, I guess, but without the global lock it's probably much less visible?

We do have a lot of collections (7k) and indexes (29k), which are all opened (and probably stay "active" forever, if you say the default idle time is > 1 day) due to the daily dump. But I don't really understand how the issue can get bigger across several days as the activity stays the same.

In the beginning we did execute some queries on the secondaries, but we stopped doing that as we noticed their performance was getting 10 to 100 times slower while the dump was running. This was 4-5 days before the secondary really hit the ground. So maybe the replication + dumps + queries traffic started to put the db in a bad state? And even once we stopped the queries, our secondary kept getting slower and slower for 4-5 days until the complete stop.

So when WT reaches a point where the cache is full of active handles, how can it not close some of them to keep going? It's not like we were requesting all of them at the same time; mongodump only queries 4 collections at a time. There's the replication load of course, but that is write-only; does it use the cache at all (except for updating it, of course)? I know WT is a complex engine, but I have trouble understanding how it can fail so randomly under such regular traffic.

I would love to test this configuration, but I'll try to reproduce the issue first, otherwise we have no way to tell whether it helped. I'm not sure if running the secondary without queries (just replication & dumps) can reproduce it, though, and I can't put the queries back at the risk of impacting all our clients. I developed a script to replay mongo queries from the Ruby logs (yeah, just before you released mongoreplay ☹), so I'll try to hammer the secondary with the typical queries we send there in the hope of finding a way to reproduce the issue.

Comment by Michael Cahill (Inactive) [ 20/Jan/17 ]

bigbourin@gmail.com, thanks for the diagnostic data: I have analyzed the performance issue you are seeing here.

We are currently trying to reproduce the issue in testing (so far without success). What we see is a combination of two issues causing slow performance: 1. MongoDB holds the Global lock exclusive while applying operations to secondaries; and 2. Applying operations is slow because of inefficiency in WiredTiger's cache eviction implementation.

The main factor causing slowness in eviction is the large number of open collections and indexes, even though it appears that only a small fraction of those collections are active. If you can give us any more information about the access patterns in your application, that may help us reproduce this behavior more quickly.
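
One quick way to summarize that access surface is to count collections and indexes across all databases (a minimal mongo shell sketch; it can itself be slow on a server with tens of thousands of namespaces):

var totals = { databases: 0, collections: 0, indexes: 0 };
db.getMongo().getDBNames().forEach(function (name) {
  var d = db.getSiblingDB(name);
  totals.databases += 1;
  d.getCollectionNames().forEach(function (coll) {
    totals.collections += 1;
    totals.indexes += d.getCollection(coll).getIndexes().length;
  });
});
printjson(totals);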

One configuration change we have suggested in this situation is to tune WiredTiger to close inactive handles more aggressively (by default with MongoDB, inactive handles are kept open for a little over a day). The configuration change is as follows:

storage:
    engine: wiredTiger
    wiredTiger:
        engineConfig:
            configString: "file_manager=(close_handle_minimum=100,close_idle_time=30,close_scan_interval=30)"

If you are able to test this configuration, you may find it resolves the performance issues (the main downside of this change is that if the server becomes idle, handles will be closed so that next time the data is accessed it will have to be read back into cache).

Regardless, we will continue trying to reproduce this behavior locally so that we can be confident it is fixed.
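
If you do test it, the effect should be visible directly in the data-handle statistics that serverStatus already reports (a minimal mongo shell sketch using the statistic names shown in the output above):

var s = db.serverStatus().wiredTiger;
print("data handles currently active: " +
      s["data-handle"]["connection data handles currently active"]);
print("files currently open: " + s.connection["files currently open"]);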

Comment by Kelsey Schubert [ 18/Jan/17 ]

Hi bigbourin@gmail.com,

I'm sorry you're still encountering issues. Thank you for letting us know and providing the diagnostic.data – we're looking into it.

Kind regards,
Thomas
