Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-20008

Stress test deadlock in WiredTiger

    • Type: Icon: Bug Bug
    • Resolution: Done
    • Priority: Icon: Major - P3 Major - P3
    • 3.0.7, 3.1.8
    • Affects Version/s: None
    • Component/s: WiredTiger
    • Fully Compatible
    • ALL
    • Hide

      Insert only workload (hammer.mongo)
      --replSet EitanRs3a --dbpath f:\data\db1 --logpath h:\data\rs3primary.txt --storageEngine wiredTiger --port 5002

      Show
      Insert only workload (hammer.mongo) --replSet EitanRs3a --dbpath f:\data\db1 --logpath h:\data\rs3primary.txt --storageEngine wiredTiger --port 5002

      Issue Status as of Sep 10, 2015

      ISSUE SUMMARY
      MongoDB running with the WiredTiger storage engine, under high load with append-only workloads and no reads, may fail to find pages to evict from cache and hang.

      USER IMPACT
      mongod keeps running but becomes unresponsive.

      WORKAROUNDS
      Once the process becomes stuck, mongod must be restarted.

      AFFECTED VERSIONS
      MongoDB 3.0.0 through 3.0.6

      FIX VERSION
      The fix is included in the 3.0.7 production release.

      Configuration:

      3 members replica set
      db version v3.1.7-pre-
      git version: 4cf56d86a386039839dc10bb761bd28c829be426

      Two problems:

      1) Primary node is up and running but not able to perform any CRUD operations (mongostat and other db. . insert({}) hang), however failover didn't occur.

      2) WiredTiger execute endless loop in !__wt_tree_walk and holding CRUD operations w/o timeout/watchdog for robustness (See debugger output for the lock owner)

      0:460> !cs -l
      -----------------------------------------
      DebugInfo          = 0x000000de00a25740
      Critical section   = 0x000000de7fc780c0 (+0xDE7FC780C0)
      LOCKED
      LockCount          = 0x0
      WaiterWoken        = No
      OwningThread       = 0x000000000000097c
      RecursionCount     = 0x1
      LockSemaphore      = 0xD8C
      SpinCount          = 0x0000000000000fa0
      
        2  Id: 11d4.97c Suspend: 1 Teb: 00007ff7`4fe68000 Unfrozen
      Child-SP          RetAddr           Call Site
      000000de`01dafc30 00007ff7`51567749 mongod!__wt_tree_walk+0x1a8 [c:\data\mci\src\src\third_party\wiredtiger\src\btree\bt_walk.c @ 243]
      000000de`01dafcc0 00007ff7`515672e7 mongod!__evict_walk_file+0x329 [c:\data\mci\src\src\third_party\wiredtiger\src\evict\evict_lru.c @ 1154]
      000000de`01dafd60 00007ff7`51566764 mongod!__evict_walk+0x2b7 [c:\data\mci\src\src\third_party\wiredtiger\src\evict\evict_lru.c @ 1032]
      000000de`01dafdf0 00007ff7`51566d5b mongod!__evict_lru_walk+0x24 [c:\data\mci\src\src\third_party\wiredtiger\src\evict\evict_lru.c @ 789]
      000000de`01dafe20 00007ff7`51566f58 mongod!__evict_pass+0x25b [c:\data\mci\src\src\third_party\wiredtiger\src\evict\evict_lru.c @ 502]
      000000de`01dafe80 00007ffb`5e534f7f mongod!__evict_server+0x38 [c:\data\mci\src\src\third_party\wiredtiger\src\evict\evict_lru.c @ 169]
      000000de`01dafeb0 00007ffb`5e535126 MSVCR120!beginthreadex+0x107
      000000de`01dafee0 00007ffb`6d3f15dd MSVCR120!endthreadex+0x192
      000000de`01daff10 00007ffb`6d7343d1 KERNEL32!BaseThreadInitThunk+0xd
      000000de`01daff40 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
      

      RS.Status

      EitanRs3a:PRIMARY> rs.status()
      {
              "set" : "EitanRs3a",
              "date" : ISODate("2015-08-18T14:45:29.611Z"),
              "myState" : 1,
              "term" : NumberLong(0),
              "heartbeatIntervalMillis" : NumberLong(2000),
              "members" : [
                      {
                              "_id" : 0,
                              "name" : "eitan5:5002",
                              "health" : 1,
                              "state" : 1,
                              "stateStr" : "PRIMARY",
                              "uptime" : 66715,
                              "optime" : Timestamp(1439894846, 12455),
                              "optimeDate" : ISODate("2015-08-18T10:47:26Z"),
                              "electionTime" : Timestamp(1439842421, 2),
                              "electionDate" : ISODate("2015-08-17T20:13:41Z"),
                              "configVersion" : 3,
                              "self" : true
                      },
                      {
                              "_id" : 1,
                              "name" : "Eitan1:5002",
                              "health" : 1,
                              "state" : 1,
                              "stateStr" : "PRIMARY",
                              "uptime" : 66704,
                              "optime" : Timestamp(1439894723, 4673),
                              "optimeDate" : ISODate("2015-08-18T10:45:23Z"),
                              "lastHeartbeat" : ISODate("2015-08-18T14:45:29.030Z"),
                              "lastHeartbeatRecv" : ISODate("2015-08-18T14:45:28.841Z"),
                              "pingMs" : 2,
                              "electionTime" : Timestamp(1439906145, 1),
                              "electionDate" : ISODate("2015-08-18T13:55:45Z"),
                              "configVersion" : 3
                      },
                      {
                              "_id" : 2,
                              "name" : "Eitan6:5002",
                              "health" : 1,
                              "state" : 2,
                              "stateStr" : "SECONDARY",
                              "uptime" : 66663,
                              "optime" : Timestamp(1439893521, 7147),
                              "optimeDate" : ISODate("2015-08-18T10:25:21Z"),
                              "lastHeartbeat" : ISODate("2015-08-18T14:45:29.041Z"),
                              "lastHeartbeatRecv" : ISODate("2015-08-18T14:45:28.844Z"),
                              "pingMs" : 1,
                              "syncingTo" : "eitan5:5002",
                              "configVersion" : 3
                      }
              ],
              "ok" : 1
      }
      

        1. 317.html
          4.52 MB
        2. deadlockduringstress.txt
          1.51 MB
        3. Opspersec-duringstress.png
          Opspersec-duringstress.png
          28 kB
        4. rs3primary.txt
          89.39 MB
        5. timeseries-20008.png
          timeseries-20008.png
          126 kB

            Assignee:
            michael.cahill@mongodb.com Michael Cahill (Inactive)
            Reporter:
            eitan.klein Eitan Klein
            Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

              Created:
              Updated:
              Resolved: