Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-18752

Huge Data Loss after Mongo Restart

    • Type: Icon: Bug Bug
    • Resolution: Done
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: 3.0.2
    • Component/s: Stability, Storage
    • None
    • ALL

      Hi,

      We're running a single dev server mongodb (almost every parameter is default, except directoryPerDB), no replication, just for testing (we'll definitely put replication, even on dev, after those problems encountered...).

      We regularly run very long process in order to generate on database cicr_results about 200 GB of "aggregated" datas with "custom" process (PHP / Javascript / Shell).

      On May the 21st, we had to restart mongo.

      A command launched using mongodb.admin was running for too long, and we we're unable to locate which process to kill in order to stop all the update this process was doing.

      Just after Mongo restarted itself, data stored we're a lot different than before the restart.

      In fact, we jumped for cicr_results database back in time at about 5 months before. Datas we're the same as on January, all inserts, updates done during those 5 months we're gone.

      Here is the log surrounding the server restart :

      2015-05-21T11:22:05.505+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38527 #416368 (140 connecti
      ons now open)
      2015-05-21T11:22:07.998+0200 I COMMAND  [conn416368] going to kill op: op: 1934153368
      2015-05-21T11:22:07.998+0200 I NETWORK  [conn416368] end connection 127.0.0.1:38527 (139 connections now open)
      2015-05-21T11:22:14.678+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38531 #416369 (140 connecti
      ons now open)
      2015-05-21T11:22:17.703+0200 I NETWORK  [conn416369] end connection 127.0.0.1:38531 (139 connections now open)
      2015-05-21T11:22:32.930+0200 I NETWORK  [conn416365] end connection 127.0.0.1:38485 (138 connections now open)
      2015-05-21T11:22:33.431+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38543 #416370 (139 connecti
      ons now open)
      2015-05-21T11:22:48.560+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38558 #416371 (140 connecti
      ons now open)
      2015-05-21T11:23:01.565+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38568 #416372 (141 connecti
      ons now open)
      2015-05-21T11:23:03.567+0200 I CONTROL  [signalProcessingThread] got signal 15 (Terminated), will terminate after curr
      ent cmd ends
      2015-05-21T11:23:03.567+0200 E QUERY    [conn416202] JavaScript execution terminated
      2015-05-21T11:23:03.578+0200 I CONTROL  [signalProcessingThread] now exiting
      2015-05-21T11:23:03.578+0200 I NETWORK  [signalProcessingThread] shutdown: going to close listening sockets...
      2015-05-21T11:23:03.578+0200 I COMMAND  [conn416202] command cicr_results.$cmd command: eval { $eval: CodeWScope( db.c
      r.find({}).forEach(function (cur) {
                  db.cr.update({_id: cur._id}, {$set: {"s.ra": Math.floor(Math.random() * (1000 - 0)) + 0}})
              }), {}), args: [] } keyUpdates:0 writeConflicts:0 numYields:0 reslen:102 locks:{ Global: { acquireCount: { r:
      2161, w: 9335668, W: 1 } }, MMAPV1Journal: { acquireCount: { w: 9326394 }, acquireWaitCount: { w: 33532 }, timeAcquiri
      ngMicros: { w: 40647604 } }, Database: { acquireCount: { r: 2161, w: 9335668 } }, Collection: { acquireCount: { R: 216
      1, W: 9335668 } } } 3496045ms
      2015-05-21T11:23:03.578+0200 I NETWORK  [signalProcessingThread] closing listening socket: 6
      2015-05-21T11:23:03.578+0200 I NETWORK  [signalProcessingThread] closing listening socket: 7
      2015-05-21T11:23:03.578+0200 I NETWORK  [conn416202] SocketException handling request, closing client connection: 9001
       socket exception [SEND_ERROR] server [127.0.0.1:34736]
      2015-05-21T11:23:03.579+0200 I NETWORK  [signalProcessingThread] removing socket file: /tmp/mongodb-27017.sock
      2015-05-21T11:23:03.579+0200 I NETWORK  [signalProcessingThread] shutdown: going to flush diaglog...
      2015-05-21T11:23:03.579+0200 I NETWORK  [signalProcessingThread] shutdown: going to close sockets...
      2015-05-21T11:23:03.579+0200 I STORAGE  [signalProcessingThread] shutdown: waiting for fs preallocator...
      2015-05-21T11:23:03.579+0200 I STORAGE  [signalProcessingThread] shutdown: final commit...
      2015-05-21T11:23:03.579+0200 I NETWORK  [conn416254] end connection 127.0.0.1:35853 (139 connections now open)
      2015-05-21T11:23:03.580+0200 I NETWORK  [conn333980] end connection 127.0.0.1:50058 (139 connections now open)
      2015-05-21T11:23:03.580+0200 I NETWORK  [conn416371] end connection 127.0.0.1:38558 (139 connections now open)
      2015-05-21T11:23:03.614+0200 I JOURNAL  [signalProcessingThread] journalCleanup...
      2015-05-21T11:23:03.615+0200 I JOURNAL  [signalProcessingThread] removeJournalFiles
      2015-05-21T11:23:03.639+0200 I JOURNAL  [signalProcessingThread] Terminating durability thread ...
      2015-05-21T11:23:03.682+0200 I JOURNAL  [journal writer] Journal writer thread stopped
      2015-05-21T11:23:03.711+0200 I JOURNAL  [durability] Durability thread stopped
      2015-05-21T11:23:03.727+0200 I STORAGE  [signalProcessingThread] shutdown: closing all files...
      2015-05-21T11:23:05.086+0200 I -        [signalProcessingThread]   File Closing Progress: 36/248 14% (files)
      2015-05-21T11:23:07.017+0200 I -        [signalProcessingThread]   File Closing Progress: 176/248 70% (files)
      2015-05-21T11:23:09.094+0200 I -        [signalProcessingThread]   File Closing Progress: 215/248 86% (files)
      2015-05-21T11:23:10.516+0200 I STORAGE  [signalProcessingThread] closeAllFiles() finished
      2015-05-21T11:23:10.517+0200 I STORAGE  [signalProcessingThread] shutdown: removing fs lock...
      2015-05-21T11:23:10.517+0200 I CONTROL  [signalProcessingThread] dbexit:  rc: 0
      2015-05-21T11:23:13.519+0200 I CONTROL  ***** SERVER RESTARTED *****
      2015-05-21T11:23:13.573+0200 I JOURNAL  [initandlisten] journal dir=/home/mongodb/journal
      2015-05-21T11:23:13.574+0200 I JOURNAL  [initandlisten] recover : no journal files present, no recovery needed
      2015-05-21T11:23:13.600+0200 I JOURNAL  [durability] Durability thread started
      2015-05-21T11:23:13.600+0200 I JOURNAL  [journal writer] Journal writer thread started
      2015-05-21T11:23:13.602+0200 I CONTROL  [initandlisten] MongoDB starting : pid=10206 port=27017 dbpath=/home/mongodb 64-bit host=dev-01-mngr.pcomperf.com
      2015-05-21T11:23:13.603+0200 I CONTROL  [initandlisten] db version v3.0.2
      2015-05-21T11:23:13.603+0200 I CONTROL  [initandlisten] git version: 6201872043ecbbc0a4cc169b5482dcf385fc464f
      2015-05-21T11:23:13.603+0200 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.1e 11 Feb 2013
      2015-05-21T11:23:13.603+0200 I CONTROL  [initandlisten] build info: Linux ip-10-171-120-232 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1 x86_64 BOOST_LIB_VERSION=1_49
      2015-05-21T11:23:13.603+0200 I CONTROL  [initandlisten] allocator: tcmalloc
      2015-05-21T11:23:13.603+0200 I CONTROL  [initandlisten] options: { config: "/etc/mongod.conf", net: { bindIp: "127.0.0.1" }, storage: { dbPath: "/home/mongodb", directoryPerDB: true }, systemLog: { destination: "file", logAppend: true, path: "/home/mongodb/mongod.log" } }
      2015-05-21T11:23:14.921+0200 I NETWORK  [initandlisten] waiting for connections on port 27017
      2015-05-21T11:23:15.096+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38630 #1 (1 connection now open)
      2015-05-21T11:23:17.441+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38631 #2 (2 connections now open)
      2015-05-21T11:23:18.485+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38633 #3 (3 connections now open)
      2015-05-21T11:23:24.054+0200 I NETWORK  [conn3] end connection 127.0.0.1:38633 (2 connections now open)
      2015-05-21T11:23:38.070+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38640 #4 (3 connections now open)
      2015-05-21T11:24:01.209+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38663 #5 (4 connections now open)
      2015-05-21T11:24:01.289+0200 I NETWORK  [conn5] end connection 127.0.0.1:38663 (3 connections now open)
      2015-05-21T11:24:01.364+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38664 #6 (4 connections now open)
      2015-05-21T11:24:01.449+0200 I NETWORK  [conn6] end connection 127.0.0.1:38664 (3 connections now open)
      2015-05-21T11:24:01.535+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38665 #7 (4 connections now open)
      2015-05-21T11:24:01.616+0200 I NETWORK  [conn7] end connection 127.0.0.1:38665 (3 connections now open)
      2015-05-21T11:24:15.646+0200 I NETWORK  [conn4] end connection 127.0.0.1:38640 (2 connections now open)
      2015-05-21T11:24:17.651+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38679 #8 (3 connections now open)
      2015-05-21T11:24:23.115+0200 I NETWORK  [conn8] end connection 127.0.0.1:38679 (2 connections now open)
      2015-05-21T11:24:46.586+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38700 #9 (3 connections now open)
      2015-05-21T11:25:01.823+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38714 #10 (4 connections now open)
      2015-05-21T11:25:01.901+0200 I NETWORK  [conn10] end connection 127.0.0.1:38714 (3 connections now open)
      2015-05-21T11:25:01.997+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38715 #11 (4 connections now open)
      2015-05-21T11:25:02.084+0200 I NETWORK  [conn11] end connection 127.0.0.1:38715 (3 connections now open)
      2015-05-21T11:25:02.172+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38716 #12 (4 connections now open)
      2015-05-21T11:25:02.251+0200 I NETWORK  [conn12] end connection 127.0.0.1:38716 (3 connections now open)
      2015-05-21T11:25:24.216+0200 I COMMAND  [conn9] CMD: drop segments_results.sr_555a15a04a620d867e52897e
      2015-05-21T11:25:30.807+0200 I COMMAND  [conn9] CMD: drop segments_results.sr_555c85f84a620d1279e70b06
      2015-05-21T11:25:32.923+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38745 #13 (4 connections now open)
      

      As complementary information, cicr_results database, which is here the database in problem, is located on a 300 GB SSD drive, while all the other databases are actually stored on a 3TB Hard Disk.

      At the moment where Mongo was restarted, Free Space on SSD was at about 4GB.

      cicr_results is stored on SSD using directory symlinks :

      lrwxrwxrwx  1 mongodb mongodb         31 mai   24 21:49 cicr_results -> /home/dbs/mongodb/cicr_results/
      drwxr-xr-x  2 mongodb mongodb       4096 mai   30 09:59 journal
      drwxr-xr-x  3 mongodb mongodb       4096 janv. 15 14:13 journal_buffer
      -rwxr-xr-x  1 mongodb mongodb          6 mai   26 10:14 mongod.lock
      -rwxr-xr-x  1 mongodb mongodb 1307059367 mai   30 10:04 mongod.log
      

      We check the files written on disk, they seem to be up to date (some of the files were updated just before the crash happened).

      We tried restarting the database once more, launching repairDatabase and validate, everything ran smooth but data were still the same as January's one.

      Of course, replication may have prevented that, but it seems rather alarming that, even with journaling enabled, we've lost this much data.

      Has anyone encountered something similar ?

            Assignee:
            ramon.fernandez@mongodb.com Ramon Fernandez Marina
            Reporter:
            shanicot@prismamedia.com Sébastien Hanicotte
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

              Created:
              Updated:
              Resolved: