SERVER-9905 (Core Server)

mapreduce can lock the server (both read and write) for long periods of time

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • None
    • Affects Version/s: 2.4.4
    • Component/s: MapReduce
    • Labels: None
    • ALL

      This bug has been present in mongod since at least v2.2.
      MR obtains a cursor to iterate over the input docs, then checks that each doc really matches using cursor->currentMatches() (line 1208 of mr.cpp).
      For each matching doc, MR runs the map/reduce logic and checks whether to yield the lock every 100 docs.

      If the cursor returns many documents but most of them don't actually match, the loop can iterate over many thousands of documents without yielding, because the yield check lives only in the MR logic, which non-matching documents never reach.
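The loop behaviour described above can be illustrated with a small simulation. This is only a sketch of the reported pattern, not the code in mr.cpp: the function name, the `count_all_scanned` switch, and the idea of counting every scanned doc as a fix are assumptions made for illustration.

```python
from typing import Iterable

YIELD_EVERY = 100  # per the report, MR checks whether to yield every 100 docs


def longest_stretch_without_yield(matches: Iterable[bool],
                                  count_all_scanned: bool) -> int:
    """Return the largest number of docs scanned between two yield checks.

    Each element of `matches` is True when that doc passes the matcher.
    With count_all_scanned=False the yield counter advances only for
    matching docs (the behaviour described in the report). With True it
    advances for every scanned doc (a hypothetical fix).
    """
    counter = 0       # docs counted towards the next yield check
    since_yield = 0   # docs scanned since the last yield check
    longest = 0
    for is_match in matches:
        since_yield += 1
        if is_match or count_all_scanned:
            counter += 1
            if counter % YIELD_EVERY == 0:
                longest = max(longest, since_yield)
                since_yield = 0
    return max(longest, since_yield)


# 100 matching docs, then 10,000 non-matching docs, then 100 matching docs.
docs = [True] * 100 + [False] * 10_000 + [True] * 100
reported = longest_stretch_without_yield(docs, count_all_scanned=False)
with_fix = longest_stretch_without_yield(docs, count_all_scanned=True)
print(reported, with_fix)
```

In this toy input the reported behaviour scans 10,100 docs between two yield checks, while counting every scanned doc caps the stretch at 100.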

      I have observed the issue in the following case:

      • the query combines a condition on an indexed field with a complex $where clause
      • 1 million docs match the index and are iterated over
      • 950k docs get filtered out in matcher by the $where
      • due to the distribution of documents, the loop ends up not yielding for about 30s
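The figures in this list give a rough sense of how far apart yield checks can end up. The helper below is a back-of-the-envelope sketch; its name and return keys are hypothetical, and it assumes the yield check fires once per 100 matching docs as described earlier.

```python
def yield_gap_estimate(scanned: int, filtered: int, yield_every: int = 100) -> dict:
    """Estimate how many docs are scanned between yield checks."""
    matching = scanned - filtered
    yield_checks = matching // yield_every
    # Average gap if matching docs were spread evenly through the index scan.
    average_gap = scanned / yield_checks if yield_checks else float(scanned)
    # Worst case: every filtered doc falls between two consecutive checks.
    worst_gap = filtered + min(yield_every, matching)
    return {
        "matching": matching,
        "yield_checks": yield_checks,
        "average_gap": average_gap,
        "worst_gap": worst_gap,
    }


estimate = yield_gap_estimate(scanned=1_000_000, filtered=950_000)
print(estimate)
```

With 1 million docs scanned and 950k filtered out, only 50,000 docs reach the MR logic, so there are about 500 yield checks and an average of 2,000 scanned docs between them. If the non-matching docs are clustered, a single stretch can be far longer, which is consistent with the 30s stall reported here.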

      Even though no explicit writes are being issued, the absence of yielding appears to block both reads and writes.
      This obviously delays other operations, but what's worse is that, when authentication is used, it triggers a replica set re-election because the secondaries cannot get a heartbeat from the primary.
      This causes many further errors and makes sharded MR unstable.

      The following entries show up in currentOp() when trying to issue a read:

                      {
                              "opid" : 386828,
                              "active" : false,
                              "op" : "query",
                              "ns" : "",
                              "query" : {
      
                              },
                              "client" : "127.0.0.1:49427",
                              "desc" : "conn8680",
                              "threadId" : "0x7f4b6991d700",
                              "connectionId" : 8680,
                              "locks" : {
                                      "^" : "r",
                                      "^testdb" : "R"
                              },
                              "waitingForLock" : true,
                              "numYields" : 0,
                              "lockStats" : {
                                      "timeLockedMicros" : {
      
                                      },
                                      "timeAcquiringMicros" : {
      
                                      }
                              }
                      },
      
                     {
                              "opid" : 386762,
                              "active" : true,
                              "secs_running" : 36,
                              "op" : "query",
                              "ns" : "testdb.testcol",
                              "query" : {
                                      "$msg" : "query not recording (too large)"
                              },
                              "client" : "127.0.0.1:40824",
                              "desc" : "conn79",
                              "threadId" : "0x7f4e22b9d700",
                              "connectionId" : 79,
                              "locks" : {
                                      "^" : "r",
                                      "^testdb" : "R"
                              },
                              "waitingForLock" : false,
                              "msg" : "m/r: (1/3) emit phase M/R: (1/3) Emit Progress: 77293/77293 100%",
                              "progress" : {
                                      "done" : 77293,
                                      "total" : 77293
                              },
                              "numYields" : 800,
                              "lockStats" : {
                                      "timeLockedMicros" : {
                                              "r" : NumberLong(67071671),
                                              "w" : NumberLong(1084)
                                      },
                                      "timeAcquiringMicros" : {
                                              "r" : NumberLong(33551125),
                                              "w" : NumberLong(28)
                                      }
                              }
                      }
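When a server looks stuck, entries like the two above can be sorted into "blocked" and "long-running" automatically. The sketch below works on plain dictionaries shaped like the currentOp() entries shown here; the function name, the 30-second threshold, and the abbreviated sample data are assumptions for illustration, and fetching the entries from a live server is left out.

```python
from typing import Any, Dict, List, Tuple


def summarize_ops(ops: List[Dict[str, Any]],
                  min_secs: int = 30) -> Tuple[List[int], List[int]]:
    """Return (opids waiting for a lock, opids active for at least min_secs)."""
    waiting = [op["opid"] for op in ops if op.get("waitingForLock")]
    long_running = [
        op["opid"]
        for op in ops
        if op.get("active") and op.get("secs_running", 0) >= min_secs
    ]
    return waiting, long_running


# Abbreviated copies of the two entries from the report.
SAMPLE_OPS = [
    {"opid": 386828, "active": False, "op": "query", "ns": "",
     "waitingForLock": True, "numYields": 0},
    {"opid": 386762, "active": True, "secs_running": 36, "op": "query",
     "ns": "testdb.testcol", "waitingForLock": False, "numYields": 800},
]

waiting, long_running = summarize_ops(SAMPLE_OPS)
print("waiting for lock:", waiting)
print("long running:", long_running)
```

For the sample data, the read on conn8680 is reported as waiting and the map/reduce on conn79 as long-running.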
      

            Assignee: Unassigned
            Reporter: Antoine Girbal (antoine)
            Votes: 0
            Watchers: 7
