[SERVER-18752] Huge Data Loss after Mongo Restart Created: 30/May/15  Updated: 18/Jun/15  Resolved: 18/Jun/15

Status: Closed
Project: Core Server
Component/s: Stability, Storage
Affects Version/s: 3.0.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Sébastien Hanicotte Assignee: Ramon Fernandez Marina
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

Hi,

We're running a single dev MongoDB server (almost every parameter is at its default, except directoryPerDB), with no replication, just for testing (after the problems described here, we will definitely set up replication, even on dev...).

We regularly run a very long process that generates about 200 GB of "aggregated" data in the cicr_results database through a custom pipeline (PHP / JavaScript / Shell).

On May the 21st, we had to restart mongo.

A command launched via the admin database had been running for too long, and we were unable to locate which operation to kill in order to stop the updates it was performing.

Just after Mongo restarted, the stored data was very different from before the restart.

In fact, the cicr_results database jumped back in time by about 5 months. The data was the same as in January; all inserts and updates done during those 5 months were gone.

Here is the log surrounding the server restart:

2015-05-21T11:22:05.505+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38527 #416368 (140 connections now open)
2015-05-21T11:22:07.998+0200 I COMMAND  [conn416368] going to kill op: op: 1934153368
2015-05-21T11:22:07.998+0200 I NETWORK  [conn416368] end connection 127.0.0.1:38527 (139 connections now open)
2015-05-21T11:22:14.678+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38531 #416369 (140 connections now open)
2015-05-21T11:22:17.703+0200 I NETWORK  [conn416369] end connection 127.0.0.1:38531 (139 connections now open)
2015-05-21T11:22:32.930+0200 I NETWORK  [conn416365] end connection 127.0.0.1:38485 (138 connections now open)
2015-05-21T11:22:33.431+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38543 #416370 (139 connections now open)
2015-05-21T11:22:48.560+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38558 #416371 (140 connections now open)
2015-05-21T11:23:01.565+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38568 #416372 (141 connections now open)
2015-05-21T11:23:03.567+0200 I CONTROL  [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
2015-05-21T11:23:03.567+0200 E QUERY    [conn416202] JavaScript execution terminated
2015-05-21T11:23:03.578+0200 I CONTROL  [signalProcessingThread] now exiting
2015-05-21T11:23:03.578+0200 I NETWORK  [signalProcessingThread] shutdown: going to close listening sockets...
2015-05-21T11:23:03.578+0200 I COMMAND  [conn416202] command cicr_results.$cmd command: eval { $eval: CodeWScope( db.cr.find({}).forEach(function (cur) {
            db.cr.update({_id: cur._id}, {$set: {"s.ra": Math.floor(Math.random() * (1000 - 0)) + 0}})
        }), {}), args: [] } keyUpdates:0 writeConflicts:0 numYields:0 reslen:102 locks:{ Global: { acquireCount: { r: 2161, w: 9335668, W: 1 } }, MMAPV1Journal: { acquireCount: { w: 9326394 }, acquireWaitCount: { w: 33532 }, timeAcquiringMicros: { w: 40647604 } }, Database: { acquireCount: { r: 2161, w: 9335668 } }, Collection: { acquireCount: { R: 2161, W: 9335668 } } } 3496045ms
2015-05-21T11:23:03.578+0200 I NETWORK  [signalProcessingThread] closing listening socket: 6
2015-05-21T11:23:03.578+0200 I NETWORK  [signalProcessingThread] closing listening socket: 7
2015-05-21T11:23:03.578+0200 I NETWORK  [conn416202] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [127.0.0.1:34736]
2015-05-21T11:23:03.579+0200 I NETWORK  [signalProcessingThread] removing socket file: /tmp/mongodb-27017.sock
2015-05-21T11:23:03.579+0200 I NETWORK  [signalProcessingThread] shutdown: going to flush diaglog...
2015-05-21T11:23:03.579+0200 I NETWORK  [signalProcessingThread] shutdown: going to close sockets...
2015-05-21T11:23:03.579+0200 I STORAGE  [signalProcessingThread] shutdown: waiting for fs preallocator...
2015-05-21T11:23:03.579+0200 I STORAGE  [signalProcessingThread] shutdown: final commit...
2015-05-21T11:23:03.579+0200 I NETWORK  [conn416254] end connection 127.0.0.1:35853 (139 connections now open)
2015-05-21T11:23:03.580+0200 I NETWORK  [conn333980] end connection 127.0.0.1:50058 (139 connections now open)
2015-05-21T11:23:03.580+0200 I NETWORK  [conn416371] end connection 127.0.0.1:38558 (139 connections now open)
2015-05-21T11:23:03.614+0200 I JOURNAL  [signalProcessingThread] journalCleanup...
2015-05-21T11:23:03.615+0200 I JOURNAL  [signalProcessingThread] removeJournalFiles
2015-05-21T11:23:03.639+0200 I JOURNAL  [signalProcessingThread] Terminating durability thread ...
2015-05-21T11:23:03.682+0200 I JOURNAL  [journal writer] Journal writer thread stopped
2015-05-21T11:23:03.711+0200 I JOURNAL  [durability] Durability thread stopped
2015-05-21T11:23:03.727+0200 I STORAGE  [signalProcessingThread] shutdown: closing all files...
2015-05-21T11:23:05.086+0200 I -        [signalProcessingThread]   File Closing Progress: 36/248 14% (files)
2015-05-21T11:23:07.017+0200 I -        [signalProcessingThread]   File Closing Progress: 176/248 70% (files)
2015-05-21T11:23:09.094+0200 I -        [signalProcessingThread]   File Closing Progress: 215/248 86% (files)
2015-05-21T11:23:10.516+0200 I STORAGE  [signalProcessingThread] closeAllFiles() finished
2015-05-21T11:23:10.517+0200 I STORAGE  [signalProcessingThread] shutdown: removing fs lock...
2015-05-21T11:23:10.517+0200 I CONTROL  [signalProcessingThread] dbexit:  rc: 0
2015-05-21T11:23:13.519+0200 I CONTROL  ***** SERVER RESTARTED *****
2015-05-21T11:23:13.573+0200 I JOURNAL  [initandlisten] journal dir=/home/mongodb/journal
2015-05-21T11:23:13.574+0200 I JOURNAL  [initandlisten] recover : no journal files present, no recovery needed
2015-05-21T11:23:13.600+0200 I JOURNAL  [durability] Durability thread started
2015-05-21T11:23:13.600+0200 I JOURNAL  [journal writer] Journal writer thread started
2015-05-21T11:23:13.602+0200 I CONTROL  [initandlisten] MongoDB starting : pid=10206 port=27017 dbpath=/home/mongodb 64-bit host=dev-01-mngr.pcomperf.com
2015-05-21T11:23:13.603+0200 I CONTROL  [initandlisten] db version v3.0.2
2015-05-21T11:23:13.603+0200 I CONTROL  [initandlisten] git version: 6201872043ecbbc0a4cc169b5482dcf385fc464f
2015-05-21T11:23:13.603+0200 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.1e 11 Feb 2013
2015-05-21T11:23:13.603+0200 I CONTROL  [initandlisten] build info: Linux ip-10-171-120-232 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1 x86_64 BOOST_LIB_VERSION=1_49
2015-05-21T11:23:13.603+0200 I CONTROL  [initandlisten] allocator: tcmalloc
2015-05-21T11:23:13.603+0200 I CONTROL  [initandlisten] options: { config: "/etc/mongod.conf", net: { bindIp: "127.0.0.1" }, storage: { dbPath: "/home/mongodb", directoryPerDB: true }, systemLog: { destination: "file", logAppend: true, path: "/home/mongodb/mongod.log" } }
2015-05-21T11:23:14.921+0200 I NETWORK  [initandlisten] waiting for connections on port 27017
2015-05-21T11:23:15.096+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38630 #1 (1 connection now open)
2015-05-21T11:23:17.441+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38631 #2 (2 connections now open)
2015-05-21T11:23:18.485+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38633 #3 (3 connections now open)
2015-05-21T11:23:24.054+0200 I NETWORK  [conn3] end connection 127.0.0.1:38633 (2 connections now open)
2015-05-21T11:23:38.070+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38640 #4 (3 connections now open)
2015-05-21T11:24:01.209+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38663 #5 (4 connections now open)
2015-05-21T11:24:01.289+0200 I NETWORK  [conn5] end connection 127.0.0.1:38663 (3 connections now open)
2015-05-21T11:24:01.364+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38664 #6 (4 connections now open)
2015-05-21T11:24:01.449+0200 I NETWORK  [conn6] end connection 127.0.0.1:38664 (3 connections now open)
2015-05-21T11:24:01.535+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38665 #7 (4 connections now open)
2015-05-21T11:24:01.616+0200 I NETWORK  [conn7] end connection 127.0.0.1:38665 (3 connections now open)
2015-05-21T11:24:15.646+0200 I NETWORK  [conn4] end connection 127.0.0.1:38640 (2 connections now open)
2015-05-21T11:24:17.651+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38679 #8 (3 connections now open)
2015-05-21T11:24:23.115+0200 I NETWORK  [conn8] end connection 127.0.0.1:38679 (2 connections now open)
2015-05-21T11:24:46.586+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38700 #9 (3 connections now open)
2015-05-21T11:25:01.823+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38714 #10 (4 connections now open)
2015-05-21T11:25:01.901+0200 I NETWORK  [conn10] end connection 127.0.0.1:38714 (3 connections now open)
2015-05-21T11:25:01.997+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38715 #11 (4 connections now open)
2015-05-21T11:25:02.084+0200 I NETWORK  [conn11] end connection 127.0.0.1:38715 (3 connections now open)
2015-05-21T11:25:02.172+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38716 #12 (4 connections now open)
2015-05-21T11:25:02.251+0200 I NETWORK  [conn12] end connection 127.0.0.1:38716 (3 connections now open)
2015-05-21T11:25:24.216+0200 I COMMAND  [conn9] CMD: drop segments_results.sr_555a15a04a620d867e52897e
2015-05-21T11:25:30.807+0200 I COMMAND  [conn9] CMD: drop segments_results.sr_555c85f84a620d1279e70b06
2015-05-21T11:25:32.923+0200 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:38745 #13 (4 connections now open)

As complementary information, the cicr_results database, which is the affected one here, is located on a 300 GB SSD drive, while all the other databases are stored on a 3 TB hard disk.

At the moment Mongo was restarted, free space on the SSD was about 4 GB.

cicr_results is stored on the SSD via a directory symlink:

lrwxrwxrwx  1 mongodb mongodb         31 mai   24 21:49 cicr_results -> /home/dbs/mongodb/cicr_results/
drwxr-xr-x  2 mongodb mongodb       4096 mai   30 09:59 journal
drwxr-xr-x  3 mongodb mongodb       4096 janv. 15 14:13 journal_buffer
-rwxr-xr-x  1 mongodb mongodb          6 mai   26 10:14 mongod.lock
-rwxr-xr-x  1 mongodb mongodb 1307059367 mai   30 10:04 mongod.log
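Given a layout like the one above, one way to rule out a silently repointed link is to record what the symlink resolves to, and which device backs it, before and after a restart. The sketch below is illustrative only and uses throwaway stand-in paths; on the real server the link and target would be /home/mongodb/cicr_results and /home/dbs/mongodb/cicr_results:

```shell
#!/bin/sh
# Illustrative check with stand-in paths (mktemp); on the real server
# you would run readlink/stat directly against the dbpath entries.
set -e
tmp=$(mktemp -d)
mkdir -p "$tmp/dbs/cicr_results"   # stand-in for the SSD directory
mkdir -p "$tmp/dbpath"             # stand-in for the dbpath
ln -s "$tmp/dbs/cicr_results" "$tmp/dbpath/cicr_results"

# 1) Where does the link actually resolve to right now?
readlink -f "$tmp/dbpath/cicr_results"

# 2) Which device backs it? (GNU stat.) Comparing this before and after
#    a restart would reveal a volume swap even if the path looks identical.
stat -c 'device=%d inode=%i' "$tmp/dbpath/cicr_results/"
```

Capturing both outputs in the startup script would give a cheap audit trail for the volume-swap hypothesis discussed in the comments.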

We checked the files written on disk; they seem to be up to date (some of the files were updated just before the crash happened).

We tried restarting the database once more and ran repairDatabase and validate; everything ran smoothly, but the data was still the same as January's.

Of course, replication might have prevented this, but it seems rather alarming that, even with journaling enabled, we've lost this much data.

Has anyone encountered something similar?



 Comments   
Comment by Ramon Fernandez Marina [ 18/Jun/15 ]

After further investigation we're convinced there are two possible explanations for this behavior:

  1. Files in the dbpath were altered while the instance was running
  2. The volume where the affected database sits has changed

Since you mention (1) is not the cause, it must be (2). Here's a sample scenario:

  1. A filesystem snapshot is created for a given volume
  2. The link in dbpath for a database is made to point to the live version of this volume, and mongod is started
  3. Data is inserted into this database
  4. Upon restart, the link in dbpath for this database points to the older version/snapshot of this volume

When journaling is enabled, as in your case, data is committed to disk at least once every 300ms, so even if your storage layer is aggressively caching data it's impossible for 5 months of data to vanish. I'd suggest you check your startup scripts and the configuration of your storage layer.
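The snapshot scenario sketched above can be reproduced in miniature with plain directories standing in for the live volume and its snapshot (all paths are hypothetical stand-ins, not the reporter's actual layout):

```shell
#!/bin/sh
# Minimal sketch of scenario (2): a dbpath symlink silently repointed to
# an older snapshot of a volume. Plain directories stand in for the live
# volume and its snapshot.
set -e
tmp=$(mktemp -d)
mkdir -p "$tmp/volume_live" "$tmp/volume_snapshot" "$tmp/dbpath"

# January state exists on both the live volume and the snapshot.
echo "january data" > "$tmp/volume_live/cicr.0"
cp "$tmp/volume_live/cicr.0" "$tmp/volume_snapshot/cicr.0"

# The dbpath link points at the live volume; later writes land there.
ln -s "$tmp/volume_live" "$tmp/dbpath/cicr_results"
echo "may data" > "$tmp/dbpath/cicr_results/cicr.0"

# After a restart the link points at the snapshot instead: the May writes
# are intact on the live volume, but mongod now sees January's files.
ln -sfn "$tmp/volume_snapshot" "$tmp/dbpath/cicr_results"
cat "$tmp/dbpath/cicr_results/cicr.0"   # prints "january data"
cat "$tmp/volume_live/cicr.0"           # prints "may data" (not lost)
```

In this scenario nothing is corrupted and journaling has nothing to replay; the "lost" data still exists on the volume the link no longer points to.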

We haven't been able to find evidence of a bug in the MongoDB server, so I'm closing this ticket, as the SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion, please post on the mongodb-user group or on Stack Overflow with the mongodb tag, where your question will reach a larger audience. See also our Technical Support page for additional support resources.

Regards,
Ramón.

Comment by Sébastien Hanicotte [ 01/Jun/15 ]

Hi Ramon,

For this Mongo "bug", no database files were altered in any way.

The symlink was put in place before Mongo was started and was never updated.

No restore was done, nor were any files moved by any external process (logrotate, etc.).

The only thing done was restarting Mongo via the /etc/init.d/ script.

Comment by Ramon Fernandez Marina [ 01/Jun/15 ]

shanicot@prismamedia.com, were the database files altered in any way by external processes while mongo was running? This would include changing symbolic links, moving files around, restoring old files from backups, etc.

Generated at Thu Feb 08 03:48:38 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.