Loading...

Type: Bug
Resolution: Done
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 3.0.2
Component/s: Stability, Storage
Labels:
None

Operating System:
ALL
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Hi,

We're running a single dev server mongodb (almost every parameter is default, except directoryPerDB), no replication, just for testing (we'll definitely put replication, even on dev, after those problems encountered...).

We regularly run very long process in order to generate on database cicr_results about 200 GB of "aggregated" datas with "custom" process (PHP / Javascript / Shell).

On May the 21st, we had to restart mongo.

A command launched using mongodb.admin was running for too long, and we we're unable to locate which process to kill in order to stop all the update this process was doing.

Just after Mongo restarted itself, data stored we're a lot different than before the restart.

In fact, we jumped for cicr_results database back in time at about 5 months before. Datas we're the same as on January, all inserts, updates done during those 5 months we're gone.

Here is the log surrounding the server restart :

2015-05-21T11:22:05.505+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38527 #416368 (140 connecti
ons now open)
2015-05-21T11:22:07.998+0200 I COMMAND  [conn416368] going to kill op: op: 1934153368
2015-05-21T11:22:07.998+0200 I NETWORK  [conn416368] end connection 127.0.0.1:38527 (139 connections now open)
2015-05-21T11:22:14.678+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38531 #416369 (140 connecti
ons now open)
2015-05-21T11:22:17.703+0200 I NETWORK  [conn416369] end connection 127.0.0.1:38531 (139 connections now open)
2015-05-21T11:22:32.930+0200 I NETWORK  [conn416365] end connection 127.0.0.1:38485 (138 connections now open)
2015-05-21T11:22:33.431+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38543 #416370 (139 connecti
ons now open)
2015-05-21T11:22:48.560+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38558 #416371 (140 connecti
ons now open)
2015-05-21T11:23:01.565+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38568 #416372 (141 connecti
ons now open)
2015-05-21T11:23:03.567+0200 I CONTROL  [signalProcessingThread] got signal 15 (Terminated), will terminate after curr
ent cmd ends
2015-05-21T11:23:03.567+0200 E QUERY    [conn416202] JavaScript execution terminated
2015-05-21T11:23:03.578+0200 I CONTROL  [signalProcessingThread] now exiting
2015-05-21T11:23:03.578+0200 I NETWORK  [signalProcessingThread] shutdown: going to close listening sockets...
2015-05-21T11:23:03.578+0200 I COMMAND  [conn416202] command cicr_results.$cmd command: eval { $eval: CodeWScope( db.c
r.find({}).forEach(function (cur) {
            db.cr.update({_id: cur._id}, {$set: {"s.ra": Math.floor(Math.random() * (1000 - 0)) + 0}})
        }), {}), args: [] } keyUpdates:0 writeConflicts:0 numYields:0 reslen:102 locks:{ Global: { acquireCount: { r:
2161, w: 9335668, W: 1 } }, MMAPV1Journal: { acquireCount: { w: 9326394 }, acquireWaitCount: { w: 33532 }, timeAcquiri
ngMicros: { w: 40647604 } }, Database: { acquireCount: { r: 2161, w: 9335668 } }, Collection: { acquireCount: { R: 216
1, W: 9335668 } } } 3496045ms
2015-05-21T11:23:03.578+0200 I NETWORK  [signalProcessingThread] closing listening socket: 6
2015-05-21T11:23:03.578+0200 I NETWORK  [signalProcessingThread] closing listening socket: 7
2015-05-21T11:23:03.578+0200 I NETWORK  [conn416202] SocketException handling request, closing client connection: 9001
 socket exception [SEND_ERROR] server [127.0.0.1:34736]
2015-05-21T11:23:03.579+0200 I NETWORK  [signalProcessingThread] removing socket file: /tmp/mongodb-27017.sock
2015-05-21T11:23:03.579+0200 I NETWORK  [signalProcessingThread] shutdown: going to flush diaglog...
2015-05-21T11:23:03.579+0200 I NETWORK  [signalProcessingThread] shutdown: going to close sockets...
2015-05-21T11:23:03.579+0200 I STORAGE  [signalProcessingThread] shutdown: waiting for fs preallocator...
2015-05-21T11:23:03.579+0200 I STORAGE  [signalProcessingThread] shutdown: final commit...
2015-05-21T11:23:03.579+0200 I NETWORK  [conn416254] end connection 127.0.0.1:35853 (139 connections now open)
2015-05-21T11:23:03.580+0200 I NETWORK  [conn333980] end connection 127.0.0.1:50058 (139 connections now open)
2015-05-21T11:23:03.580+0200 I NETWORK  [conn416371] end connection 127.0.0.1:38558 (139 connections now open)
2015-05-21T11:23:03.614+0200 I JOURNAL  [signalProcessingThread] journalCleanup...
2015-05-21T11:23:03.615+0200 I JOURNAL  [signalProcessingThread] removeJournalFiles
2015-05-21T11:23:03.639+0200 I JOURNAL  [signalProcessingThread] Terminating durability thread ...
2015-05-21T11:23:03.682+0200 I JOURNAL  [journal writer] Journal writer thread stopped
2015-05-21T11:23:03.711+0200 I JOURNAL  [durability] Durability thread stopped
2015-05-21T11:23:03.727+0200 I STORAGE  [signalProcessingThread] shutdown: closing all files...
2015-05-21T11:23:05.086+0200 I -        [signalProcessingThread]   File Closing Progress: 36/248 14% (files)
2015-05-21T11:23:07.017+0200 I -        [signalProcessingThread]   File Closing Progress: 176/248 70% (files)
2015-05-21T11:23:09.094+0200 I -        [signalProcessingThread]   File Closing Progress: 215/248 86% (files)
2015-05-21T11:23:10.516+0200 I STORAGE  [signalProcessingThread] closeAllFiles() finished
2015-05-21T11:23:10.517+0200 I STORAGE  [signalProcessingThread] shutdown: removing fs lock...
2015-05-21T11:23:10.517+0200 I CONTROL  [signalProcessingThread] dbexit:  rc: 0
2015-05-21T11:23:13.519+0200 I CONTROL  ***** SERVER RESTARTED *****
2015-05-21T11:23:13.573+0200 I JOURNAL  [initandlisten] journal dir=/home/mongodb/journal
2015-05-21T11:23:13.574+0200 I JOURNAL  [initandlisten] recover : no journal files present, no recovery needed
2015-05-21T11:23:13.600+0200 I JOURNAL  [durability] Durability thread started
2015-05-21T11:23:13.600+0200 I JOURNAL  [journal writer] Journal writer thread started
2015-05-21T11:23:13.602+0200 I CONTROL  [initandlisten] MongoDB starting : pid=10206 port=27017 dbpath=/home/mongodb 64-bit host=dev-01-mngr.pcomperf.com
2015-05-21T11:23:13.603+0200 I CONTROL  [initandlisten] db version v3.0.2
2015-05-21T11:23:13.603+0200 I CONTROL  [initandlisten] git version: 6201872043ecbbc0a4cc169b5482dcf385fc464f
2015-05-21T11:23:13.603+0200 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.1e 11 Feb 2013
2015-05-21T11:23:13.603+0200 I CONTROL  [initandlisten] build info: Linux ip-10-171-120-232 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1 x86_64 BOOST_LIB_VERSION=1_49
2015-05-21T11:23:13.603+0200 I CONTROL  [initandlisten] allocator: tcmalloc
2015-05-21T11:23:13.603+0200 I CONTROL  [initandlisten] options: { config: "/etc/mongod.conf", net: { bindIp: "127.0.0.1" }, storage: { dbPath: "/home/mongodb", directoryPerDB: true }, systemLog: { destination: "file", logAppend: true, path: "/home/mongodb/mongod.log" } }
2015-05-21T11:23:14.921+0200 I NETWORK  [initandlisten] waiting for connections on port 27017
2015-05-21T11:23:15.096+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38630 #1 (1 connection now open)
2015-05-21T11:23:17.441+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38631 #2 (2 connections now open)
2015-05-21T11:23:18.485+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38633 #3 (3 connections now open)
2015-05-21T11:23:24.054+0200 I NETWORK  [conn3] end connection 127.0.0.1:38633 (2 connections now open)
2015-05-21T11:23:38.070+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38640 #4 (3 connections now open)
2015-05-21T11:24:01.209+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38663 #5 (4 connections now open)
2015-05-21T11:24:01.289+0200 I NETWORK  [conn5] end connection 127.0.0.1:38663 (3 connections now open)
2015-05-21T11:24:01.364+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38664 #6 (4 connections now open)
2015-05-21T11:24:01.449+0200 I NETWORK  [conn6] end connection 127.0.0.1:38664 (3 connections now open)
2015-05-21T11:24:01.535+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38665 #7 (4 connections now open)
2015-05-21T11:24:01.616+0200 I NETWORK  [conn7] end connection 127.0.0.1:38665 (3 connections now open)
2015-05-21T11:24:15.646+0200 I NETWORK  [conn4] end connection 127.0.0.1:38640 (2 connections now open)
2015-05-21T11:24:17.651+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38679 #8 (3 connections now open)
2015-05-21T11:24:23.115+0200 I NETWORK  [conn8] end connection 127.0.0.1:38679 (2 connections now open)
2015-05-21T11:24:46.586+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38700 #9 (3 connections now open)
2015-05-21T11:25:01.823+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38714 #10 (4 connections now open)
2015-05-21T11:25:01.901+0200 I NETWORK  [conn10] end connection 127.0.0.1:38714 (3 connections now open)
2015-05-21T11:25:01.997+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38715 #11 (4 connections now open)
2015-05-21T11:25:02.084+0200 I NETWORK  [conn11] end connection 127.0.0.1:38715 (3 connections now open)
2015-05-21T11:25:02.172+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38716 #12 (4 connections now open)
2015-05-21T11:25:02.251+0200 I NETWORK  [conn12] end connection 127.0.0.1:38716 (3 connections now open)
2015-05-21T11:25:24.216+0200 I COMMAND  [conn9] CMD: drop segments_results.sr_555a15a04a620d867e52897e
2015-05-21T11:25:30.807+0200 I COMMAND  [conn9] CMD: drop segments_results.sr_555c85f84a620d1279e70b06
2015-05-21T11:25:32.923+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38745 #13 (4 connections now open)

As complementary information, cicr_results database, which is here the database in problem, is located on a 300 GB SSD drive, while all the other databases are actually stored on a 3TB Hard Disk.

At the moment where Mongo was restarted, Free Space on SSD was at about 4GB.

cicr_results is stored on SSD using directory symlinks :

lrwxrwxrwx  1 mongodb mongodb         31 mai   24 21:49 cicr_results -> /home/dbs/mongodb/cicr_results/
drwxr-xr-x  2 mongodb mongodb       4096 mai   30 09:59 journal
drwxr-xr-x  3 mongodb mongodb       4096 janv. 15 14:13 journal_buffer
-rwxr-xr-x  1 mongodb mongodb          6 mai   26 10:14 mongod.lock
-rwxr-xr-x  1 mongodb mongodb 1307059367 mai   30 10:04 mongod.log

We check the files written on disk, they seem to be up to date (some of the files were updated just before the crash happened).

We tried restarting the database once more, launching repairDatabase and validate, everything ran smooth but data were still the same as January's one.

Of course, replication may have prevented that, but it seems rather alarming that, even with journaling enabled, we've lost this much data.

Has anyone encountered something similar ?

Details

Description

Attachments

Forms

Activity

People

Dates