  1. Core Server
  2. SERVER-3157

Replica set becomes inaccessible and unstable after mapreduce job


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Cannot Reproduce
    • Affects Version/s: 1.8.1
    • Fix Version/s: None
    • Component/s: None
    • Environment:
      Replica set with 2 nodes (24 GB RAM) + 2 arbiters, all on openSUSE 11.3, kernel 2.6.34, 64-bit
    • Operating System:
      Linux

      Description

      After we ran a mapreduce job that updates thousands of records, the primary mongodb server became inaccessible. We could not connect via the PHP webnodes or a local mongo shell. Within a short time the server reached its connection limit (in normal operation we have around 10/s; after the mapreduce job they stepped up to > 13000; the PHP webnodes use non-persistent connections; see lx03_mongostat_cutted.txt). The 13000 connections were fully established but idle (see attachment overview_mongod_after_midnight.html).

      Our first action was to shut down the PHP webserver nodes. Connections dropped back to 10 and the system became accessible again.

      Our second action was to shut down the secondary and start the mapreduce job again. Everything ran smoothly, seemingly without problems. During the mapreduce job, used RAM increased steadily (see munin graphs). When the job was finished we started the secondary again. From there everything worked as expected, running the operations from the oplog. After a short sleep we saw in the morning that the connections had jumped again, to 1000. So I decided to stop and start the current primary and let the secondary take over to get a clean state again.

      The attachments contain mongostat output, munin graphs, mongodb logs and the home view from the mongodb internal webserver. The munin graphs contain some gaps where the primary was inaccessible and no data could be gathered. (Server lx03 is the primary, lx04 is the secondary.)

      In the past we had this situation once or twice per month, probably from a cron job starting another mapreduce operation. But until yesterday we couldn't track it down.
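
      Because the problem was hard to track down, one option would be to log the connection count regularly and flag jumps such as the ones described above. The following is only a minimal sketch, not something taken from our setup: the function names, the baseline of 10, the factor of 50 and the sample labels are all hypothetical, and `current_connections` assumes a current pymongo `MongoClient` and the `connections.current` field of `serverStatus`.

```python
from typing import Iterable, List, Tuple


def find_connection_spikes(
    samples: Iterable[Tuple[str, int]],
    baseline: int = 10,
    factor: int = 50,
) -> List[Tuple[str, int]]:
    """Return the (label, connections) samples that exceed baseline * factor.

    baseline and factor are illustrative defaults, not tuned values.
    """
    threshold = baseline * factor
    return [(label, conns) for label, conns in samples if conns > threshold]


def current_connections(client) -> int:
    """Read the number of open connections from serverStatus.

    `client` is assumed to be a pymongo MongoClient connected to the primary.
    """
    status = client.admin.command("serverStatus")
    return int(status["connections"]["current"])


if __name__ == "__main__":
    # Hypothetical samples, shaped like the counts mentioned in this report.
    observed = [
        ("normal operation", 10),
        ("after mapreduce job", 13000),
        ("after webnode shutdown", 10),
        ("next morning", 1000),
    ]
    for label, conns in find_connection_spikes(observed):
        print(f"{label}: {conns} connections")
```

      Run from cron against the primary, a check like this could help correlate a connection jump with the start of a mapreduce job.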

        Attachments

        1. logs_and_mongostat_complete.zip
          499 kB
        2. lx03_mongostat_cutted.txt
          50 kB
        3. munin_graphs.png
          1.62 MB
        4. overview_mongod_after_midnight.html
          127 kB
        5. overview_mongod_normal.html
          15 kB
        6. overview_mongod_today_morning.html
          166 kB

            People

            • Votes: 1
            • Watchers: 4
