Hi,
We're running a single dev server mongodb (almost every parameter is default, except directoryPerDB), no replication, just for testing (we'll definitely put replication, even on dev, after those problems encountered...).
We regularly run very long process in order to generate on database cicr_results about 200 GB of "aggregated" datas with "custom" process (PHP / Javascript / Shell).
On May the 21st, we had to restart mongo.
A command launched using mongodb.admin was running for too long, and we we're unable to locate which process to kill in order to stop all the update this process was doing.
Just after Mongo restarted itself, data stored we're a lot different than before the restart.
In fact, we jumped for cicr_results database back in time at about 5 months before. Datas we're the same as on January, all inserts, updates done during those 5 months we're gone.
Here is the log surrounding the server restart :
2015-05-21T11:22:05.505+0200 I NETWORK [initandlisten] connection accepted from 127.0.0.1:38527 #416368 (140 connecti ons now open) 2015-05-21T11:22:07.998+0200 I COMMAND [conn416368] going to kill op: op: 1934153368 2015-05-21T11:22:07.998+0200 I NETWORK [conn416368] end connection 127.0.0.1:38527 (139 connections now open) 2015-05-21T11:22:14.678+0200 I NETWORK [initandlisten] connection accepted from 127.0.0.1:38531 #416369 (140 connecti ons now open) 2015-05-21T11:22:17.703+0200 I NETWORK [conn416369] end connection 127.0.0.1:38531 (139 connections now open) 2015-05-21T11:22:32.930+0200 I NETWORK [conn416365] end connection 127.0.0.1:38485 (138 connections now open) 2015-05-21T11:22:33.431+0200 I NETWORK [initandlisten] connection accepted from 127.0.0.1:38543 #416370 (139 connecti ons now open) 2015-05-21T11:22:48.560+0200 I NETWORK [initandlisten] connection accepted from 127.0.0.1:38558 #416371 (140 connecti ons now open) 2015-05-21T11:23:01.565+0200 I NETWORK [initandlisten] connection accepted from 127.0.0.1:38568 #416372 (141 connecti ons now open) 2015-05-21T11:23:03.567+0200 I CONTROL [signalProcessingThread] got signal 15 (Terminated), will terminate after curr ent cmd ends 2015-05-21T11:23:03.567+0200 E QUERY [conn416202] JavaScript execution terminated 2015-05-21T11:23:03.578+0200 I CONTROL [signalProcessingThread] now exiting 2015-05-21T11:23:03.578+0200 I NETWORK [signalProcessingThread] shutdown: going to close listening sockets... 2015-05-21T11:23:03.578+0200 I COMMAND [conn416202] command cicr_results.$cmd command: eval { $eval: CodeWScope( db.c r.find({}).forEach(function (cur) { db.cr.update({_id: cur._id}, {$set: {"s.ra": Math.floor(Math.random() * (1000 - 0)) + 0}}) }), {}), args: [] } keyUpdates:0 writeConflicts:0 numYields:0 reslen:102 locks:{ Global: { acquireCount: { r: 2161, w: 9335668, W: 1 } }, MMAPV1Journal: { acquireCount: { w: 9326394 }, acquireWaitCount: { w: 33532 }, timeAcquiri ngMicros: { w: 40647604 } }, Database: { acquireCount: { r: 2161, w: 9335668 } }, Collection: { acquireCount: { R: 216 1, W: 9335668 } } } 3496045ms 2015-05-21T11:23:03.578+0200 I NETWORK [signalProcessingThread] closing listening socket: 6 2015-05-21T11:23:03.578+0200 I NETWORK [signalProcessingThread] closing listening socket: 7 2015-05-21T11:23:03.578+0200 I NETWORK [conn416202] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [127.0.0.1:34736] 2015-05-21T11:23:03.579+0200 I NETWORK [signalProcessingThread] removing socket file: /tmp/mongodb-27017.sock 2015-05-21T11:23:03.579+0200 I NETWORK [signalProcessingThread] shutdown: going to flush diaglog... 2015-05-21T11:23:03.579+0200 I NETWORK [signalProcessingThread] shutdown: going to close sockets... 2015-05-21T11:23:03.579+0200 I STORAGE [signalProcessingThread] shutdown: waiting for fs preallocator... 2015-05-21T11:23:03.579+0200 I STORAGE [signalProcessingThread] shutdown: final commit... 2015-05-21T11:23:03.579+0200 I NETWORK [conn416254] end connection 127.0.0.1:35853 (139 connections now open) 2015-05-21T11:23:03.580+0200 I NETWORK [conn333980] end connection 127.0.0.1:50058 (139 connections now open) 2015-05-21T11:23:03.580+0200 I NETWORK [conn416371] end connection 127.0.0.1:38558 (139 connections now open) 2015-05-21T11:23:03.614+0200 I JOURNAL [signalProcessingThread] journalCleanup... 2015-05-21T11:23:03.615+0200 I JOURNAL [signalProcessingThread] removeJournalFiles 2015-05-21T11:23:03.639+0200 I JOURNAL [signalProcessingThread] Terminating durability thread ... 2015-05-21T11:23:03.682+0200 I JOURNAL [journal writer] Journal writer thread stopped 2015-05-21T11:23:03.711+0200 I JOURNAL [durability] Durability thread stopped 2015-05-21T11:23:03.727+0200 I STORAGE [signalProcessingThread] shutdown: closing all files... 2015-05-21T11:23:05.086+0200 I - [signalProcessingThread] File Closing Progress: 36/248 14% (files) 2015-05-21T11:23:07.017+0200 I - [signalProcessingThread] File Closing Progress: 176/248 70% (files) 2015-05-21T11:23:09.094+0200 I - [signalProcessingThread] File Closing Progress: 215/248 86% (files) 2015-05-21T11:23:10.516+0200 I STORAGE [signalProcessingThread] closeAllFiles() finished 2015-05-21T11:23:10.517+0200 I STORAGE [signalProcessingThread] shutdown: removing fs lock... 2015-05-21T11:23:10.517+0200 I CONTROL [signalProcessingThread] dbexit: rc: 0 2015-05-21T11:23:13.519+0200 I CONTROL ***** SERVER RESTARTED ***** 2015-05-21T11:23:13.573+0200 I JOURNAL [initandlisten] journal dir=/home/mongodb/journal 2015-05-21T11:23:13.574+0200 I JOURNAL [initandlisten] recover : no journal files present, no recovery needed 2015-05-21T11:23:13.600+0200 I JOURNAL [durability] Durability thread started 2015-05-21T11:23:13.600+0200 I JOURNAL [journal writer] Journal writer thread started 2015-05-21T11:23:13.602+0200 I CONTROL [initandlisten] MongoDB starting : pid=10206 port=27017 dbpath=/home/mongodb 64-bit host=dev-01-mngr.pcomperf.com 2015-05-21T11:23:13.603+0200 I CONTROL [initandlisten] db version v3.0.2 2015-05-21T11:23:13.603+0200 I CONTROL [initandlisten] git version: 6201872043ecbbc0a4cc169b5482dcf385fc464f 2015-05-21T11:23:13.603+0200 I CONTROL [initandlisten] OpenSSL version: OpenSSL 1.0.1e 11 Feb 2013 2015-05-21T11:23:13.603+0200 I CONTROL [initandlisten] build info: Linux ip-10-171-120-232 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1 x86_64 BOOST_LIB_VERSION=1_49 2015-05-21T11:23:13.603+0200 I CONTROL [initandlisten] allocator: tcmalloc 2015-05-21T11:23:13.603+0200 I CONTROL [initandlisten] options: { config: "/etc/mongod.conf", net: { bindIp: "127.0.0.1" }, storage: { dbPath: "/home/mongodb", directoryPerDB: true }, systemLog: { destination: "file", logAppend: true, path: "/home/mongodb/mongod.log" } } 2015-05-21T11:23:14.921+0200 I NETWORK [initandlisten] waiting for connections on port 27017 2015-05-21T11:23:15.096+0200 I NETWORK [initandlisten] connection accepted from 127.0.0.1:38630 #1 (1 connection now open) 2015-05-21T11:23:17.441+0200 I NETWORK [initandlisten] connection accepted from 127.0.0.1:38631 #2 (2 connections now open) 2015-05-21T11:23:18.485+0200 I NETWORK [initandlisten] connection accepted from 127.0.0.1:38633 #3 (3 connections now open) 2015-05-21T11:23:24.054+0200 I NETWORK [conn3] end connection 127.0.0.1:38633 (2 connections now open) 2015-05-21T11:23:38.070+0200 I NETWORK [initandlisten] connection accepted from 127.0.0.1:38640 #4 (3 connections now open) 2015-05-21T11:24:01.209+0200 I NETWORK [initandlisten] connection accepted from 127.0.0.1:38663 #5 (4 connections now open) 2015-05-21T11:24:01.289+0200 I NETWORK [conn5] end connection 127.0.0.1:38663 (3 connections now open) 2015-05-21T11:24:01.364+0200 I NETWORK [initandlisten] connection accepted from 127.0.0.1:38664 #6 (4 connections now open) 2015-05-21T11:24:01.449+0200 I NETWORK [conn6] end connection 127.0.0.1:38664 (3 connections now open) 2015-05-21T11:24:01.535+0200 I NETWORK [initandlisten] connection accepted from 127.0.0.1:38665 #7 (4 connections now open) 2015-05-21T11:24:01.616+0200 I NETWORK [conn7] end connection 127.0.0.1:38665 (3 connections now open) 2015-05-21T11:24:15.646+0200 I NETWORK [conn4] end connection 127.0.0.1:38640 (2 connections now open) 2015-05-21T11:24:17.651+0200 I NETWORK [initandlisten] connection accepted from 127.0.0.1:38679 #8 (3 connections now open) 2015-05-21T11:24:23.115+0200 I NETWORK [conn8] end connection 127.0.0.1:38679 (2 connections now open) 2015-05-21T11:24:46.586+0200 I NETWORK [initandlisten] connection accepted from 127.0.0.1:38700 #9 (3 connections now open) 2015-05-21T11:25:01.823+0200 I NETWORK [initandlisten] connection accepted from 127.0.0.1:38714 #10 (4 connections now open) 2015-05-21T11:25:01.901+0200 I NETWORK [conn10] end connection 127.0.0.1:38714 (3 connections now open) 2015-05-21T11:25:01.997+0200 I NETWORK [initandlisten] connection accepted from 127.0.0.1:38715 #11 (4 connections now open) 2015-05-21T11:25:02.084+0200 I NETWORK [conn11] end connection 127.0.0.1:38715 (3 connections now open) 2015-05-21T11:25:02.172+0200 I NETWORK [initandlisten] connection accepted from 127.0.0.1:38716 #12 (4 connections now open) 2015-05-21T11:25:02.251+0200 I NETWORK [conn12] end connection 127.0.0.1:38716 (3 connections now open) 2015-05-21T11:25:24.216+0200 I COMMAND [conn9] CMD: drop segments_results.sr_555a15a04a620d867e52897e 2015-05-21T11:25:30.807+0200 I COMMAND [conn9] CMD: drop segments_results.sr_555c85f84a620d1279e70b06 2015-05-21T11:25:32.923+0200 I NETWORK [initandlisten] connection accepted from 127.0.0.1:38745 #13 (4 connections now open)
As complementary information, cicr_results database, which is here the database in problem, is located on a 300 GB SSD drive, while all the other databases are actually stored on a 3TB Hard Disk.
At the moment where Mongo was restarted, Free Space on SSD was at about 4GB.
cicr_results is stored on SSD using directory symlinks :
lrwxrwxrwx 1 mongodb mongodb 31 mai 24 21:49 cicr_results -> /home/dbs/mongodb/cicr_results/ drwxr-xr-x 2 mongodb mongodb 4096 mai 30 09:59 journal drwxr-xr-x 3 mongodb mongodb 4096 janv. 15 14:13 journal_buffer -rwxr-xr-x 1 mongodb mongodb 6 mai 26 10:14 mongod.lock -rwxr-xr-x 1 mongodb mongodb 1307059367 mai 30 10:04 mongod.log
We check the files written on disk, they seem to be up to date (some of the files were updated just before the crash happened).
We tried restarting the database once more, launching repairDatabase and validate, everything ran smooth but data were still the same as January's one.
Of course, replication may have prevented that, but it seems rather alarming that, even with journaling enabled, we've lost this much data.
Has anyone encountered something similar ?