[SERVER-16702] Mongod fails during journal replay with mmapv1 after power cycle Created: 31/Dec/14  Updated: 09/Dec/15  Resolved: 18/Nov/15

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: 2.8.0-rc4, 3.1.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Jonathan Abrahams Assignee: Jonathan Abrahams
Resolution: Cannot Reproduce Votes: 1
Labels: 28qa, 32powercycle, 32qa
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mmapv1.tar.gz     File mongod.log.gz     File powertest.sh    
Issue Links:
Tested
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Repetitive power cycle test (over 10 minutes), where the mongod host is power cycled (non-clean shutdown) at random intervals (every 30 to 120 seconds) while clients are sending various CRUD operations.

The following operations produce mongod crashes with
Invariant failure false src/mongo/db/storage/mmap_v1/mmap_v1_extent_manager.cpp 179
and
Invariant failure false src/mongo/db/storage/mmap_v1/btree/btree_logic.cpp 395

db.foo.validate()
bulk.insert()
db.foo.remove()
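A minimal Python sketch of the kind of client workload described above. The operation mix, batch sizes, and the pymongo-style collection interface are assumptions for illustration only; the actual workload is generated by the attached powertest.sh via FSM & CRUD clients.

```python
# Hedged sketch: randomly interleave the three operation types named above
# (bulk insert, remove, validate) against a pymongo-style collection.
import random

def run_workload(coll, ops=100, seed=0):
    """Issue a random mix of bulk inserts, removes, and validate commands.

    `coll` is any object exposing pymongo-like insert_many/delete_many
    methods and a `database.command` handle. Returns per-op counts.
    """
    rng = random.Random(seed)
    counts = {"insert": 0, "remove": 0, "validate": 0}
    for i in range(ops):
        op = rng.choice(list(counts))
        if op == "insert":
            # bulk.insert() analogue: insert a small batch of documents
            coll.insert_many([{"x": i, "n": j} for j in range(10)])
        elif op == "remove":
            # db.foo.remove() analogue: delete a slice of documents
            coll.delete_many({"x": {"$lt": i}})
        else:
            # db.foo.validate() analogue: run the validate command
            coll.database.command("validate", coll.name)
        counts[op] += 1
    return counts
```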

Use the attached powertest.sh, which relies on FSM & CRUD clients to generate various DB operations, to reproduce the issue (in 3.1.6).

./powertest.sh -s <ssh server>  -e mmapv1 -F $HOME/mongo

<ssh server> is the ssh server connection, e.g., "-i ident.pem user@host"

Sprint: Quint Iteration 3.1.1, QuInt A (10/12/15), QuInt B (11/02/15), TIG C (11/20/15)
Participants:

 Description   

Mongod mmapv1 storage files are corrupted by power cycle testing; mongod will then crash on certain commands with an invariant failure.



 Comments   
Comment by Charlie Swanson [ 13/Nov/15 ]

Assigning back to Jonathan. I was unable to reproduce after 2 days. I think in order to reliably reproduce any crash like this, we need to figure out what workload was running at the time of the crash. So please include the -L option, and upload the client logs if you are able to reproduce. Thanks!

Comment by Jonathan Abrahams [ 20/Jul/15 ]

This was reproduced in 3.1.6-pre:

2015-07-20T09:37:14.626-0400 I CONTROL  [main] ***** SERVER RESTARTED *****
2015-07-20T09:37:14.672-0400 W -        [initandlisten] Detected unclean shutdown - /home/jonathan/pt-mmapv1/data/mmapv1/mongod.lock is not empty.
2015-07-20T09:37:14.677-0400 I JOURNAL  [initandlisten] journal dir=/home/jonathan/pt-mmapv1/data/mmapv1/journal
2015-07-20T09:37:14.677-0400 I JOURNAL  [initandlisten] recover begin
2015-07-20T09:37:14.677-0400 I JOURNAL  [initandlisten] recover lsn: 0
2015-07-20T09:37:14.677-0400 I JOURNAL  [initandlisten] recover /home/jonathan/pt-mmapv1/data/mmapv1/journal/j._0
2015-07-20T09:37:14.681-0400 I JOURNAL  [initandlisten] recover cleaning up
2015-07-20T09:37:14.681-0400 I JOURNAL  [initandlisten] removeJournalFiles
2015-07-20T09:37:14.682-0400 I JOURNAL  [initandlisten] recover done
2015-07-20T09:37:14.772-0400 I JOURNAL  [durability] Durability thread started
2015-07-20T09:37:14.772-0400 I JOURNAL  [journal writer] Journal writer thread started
2015-07-20T09:37:14.773-0400 I CONTROL  [initandlisten] MongoDB starting : pid=9196 port=37017 dbpath=/home/jonathan/pt-mmapv1/data/mmapv1 64-bit host=capja2
2015-07-20T09:37:14.773-0400 I CONTROL  [initandlisten]
2015-07-20T09:37:14.773-0400 I CONTROL  [initandlisten] ** NOTE: This is a development version (3.1.6-pre-) of MongoDB.
2015-07-20T09:37:14.773-0400 I CONTROL  [initandlisten] **       Not recommended for production.
2015-07-20T09:37:14.773-0400 I CONTROL  [initandlisten]
2015-07-20T09:37:14.773-0400 I CONTROL  [initandlisten]
2015-07-20T09:37:14.773-0400 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2015-07-20T09:37:14.773-0400 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2015-07-20T09:37:14.773-0400 I CONTROL  [initandlisten]
2015-07-20T09:37:14.773-0400 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
2015-07-20T09:37:14.773-0400 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2015-07-20T09:37:14.773-0400 I CONTROL  [initandlisten]
2015-07-20T09:37:14.773-0400 I CONTROL  [initandlisten] db version v3.1.6-pre-
2015-07-20T09:37:14.773-0400 I CONTROL  [initandlisten] git version: 48f79b30ff75c6310869ea8d0e34925d5f63252f
2015-07-20T09:37:14.773-0400 I CONTROL  [initandlisten] allocator: tcmalloc
2015-07-20T09:37:14.773-0400 I CONTROL  [initandlisten] options: { net: { port: 37017 }, processManagement: { fork: true }, replication: { replSet: "pc" }, storage: { dbPath: "pt-mmapv1/data/mmapv1", engine: "mmapv1" }, systemLog: { destination: "file", logAppend: true, path: "pt-mmapv1/logs/mongod-mmapv1-recovery.log" } }
2015-07-20T09:37:14.786-0400 I INDEX    [initandlisten] found 1 index(es) that wasn't finished before shutdown
2015-07-20T09:37:14.794-0400 I INDEX    [initandlisten] found 1 interrupted index build(s) on power.cycle
2015-07-20T09:37:14.794-0400 I INDEX    [initandlisten] note: restart the server with --noIndexBuildRetry to skip index rebuilds
2015-07-20T09:37:14.794-0400 I INDEX    [initandlisten] build index on: power.cycle properties: { v: 1, key: { x: 1.0 }, name: "x_1", ns: "power.cycle" }
2015-07-20T09:37:14.794-0400 I INDEX    [initandlisten] 	 building index using bulk method
2015-07-20T09:37:14.963-0400 I -        [initandlisten] Fatal Assertion 17441
2015-07-20T09:37:14.969-0400 I CONTROL  [initandlisten]
 0x10d8ce2 0x10872c4 0x1072b25 0xe80656 0xe8067b 0xe8f04d 0xe8f12f 0xa6d53d 0xc693dd 0x98923a 0xb275a2 0x8377d7 0x83b014 0x7faa0e407af5 0x8345c9
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"CD8CE2"},{"b":"400000","o":"C872C4"},{"b":"400000","o":"C72B25"},{"b":"400000","o":"A80656"},{"b":"400000","o":"A8067B"},{"b":"400000","o":"A8F04D"},{"b":"400000","o":"A8F12F"},{"b":"400000","o":"66D53D"},{"b":"400000","o":"8693DD"},{"b":"400000","o":"58923A"},{"b":"400000","o":"7275A2"},{"b":"400000","o":"4377D7"},{"b":"400000","o":"43B014"},{"b":"7FAA0E3E6000","o":"21AF5"},{"b":"400000","o":"4345C9"}],"processInfo":{ "mongodbVersion" : "3.1.6-pre-", "gitVersion" : "48f79b30ff75c6310869ea8d0e34925d5f63252f", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.10.0-229.4.2.el7.x86_64", "version" : "#1 SMP Wed May 13 10:06:09 UTC 2015", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "7FFFEAAFE000", "elfType" : 3 }, { "b" : "7FAA0F3E6000", "path" : "/lib64/librt.so.1", "elfType" : 3 }, { "b" : "7FAA0F1E2000", "path" : "/lib64/libdl.so.2", "elfType" : 3 }, { "b" : "7FAA0EEDB000", "path" : "/lib64/libstdc++.so.6", "elfType" : 3 }, { "b" : "7FAA0EBD9000", "path" : "/lib64/libm.so.6", "elfType" : 3 }, { "b" : "7FAA0E9C3000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3 }, { "b" : "7FAA0E7A7000", "path" : "/lib64/libpthread.so.0", "elfType" : 3 }, { "b" : "7FAA0E3E6000", "path" : "/lib64/libc.so.6", "elfType" : 3 }, { "b" : "7FAA0F5EE000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3 } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x10d8ce2]
 mongod(_ZN5mongo10logContextEPKc+0x134) [0x10872c4]
 mongod(_ZN5mongo13fassertFailedEi+0x55) [0x1072b25]
 mongod(+0xA80656) [0xe80656]
 mongod(_ZNK5mongo17RecordStoreV1Base13getNextRecordEPNS_16OperationContextERKNS_7DiskLocE+0x1B) [0xe8067b]
 mongod(_ZN5mongo27SimpleRecordStoreV1Iterator7advanceEv+0x3D) [0xe8f04d]
 mongod(_ZN5mongo27SimpleRecordStoreV1Iterator4nextEv+0x4F) [0xe8f12f]
 mongod(_ZN5mongo14CollectionScan4workEPm+0x91D) [0xa6d53d]
 mongod(_ZN5mongo12PlanExecutor18getNextSnapshottedEPNS_11SnapshottedINS_7BSONObjEEEPNS_8RecordIdE+0x26D) [0xc693dd]
 mongod(_ZN5mongo15MultiIndexBlock30insertAllDocumentsInCollectionEPSt3setINS_8RecordIdESt4lessIS2_ESaIS2_EE+0x22A) [0x98923a]
 mongod(_ZN5mongo40restartInProgressIndexesFromLastShutdownEPNS_16OperationContextE+0x8D2) [0xb275a2]
 mongod(_ZN5mongo13initAndListenEi+0x15C7) [0x8377d7]
 mongod(main+0x154) [0x83b014]
 libc.so.6(__libc_start_main+0xF5) [0x7faa0e407af5]
 mongod(+0x4345C9) [0x8345c9]
-----  END BACKTRACE  -----
2015-07-20T09:37:14.969-0400 I -        [initandlisten]
 
***aborting after fassert() failure

Comment by Daniel Pasette (Inactive) [ 31/Dec/14 ]

This is running with journaling. I'm not sure what the "skipping application of section" message is, but it's worth looking into. validate() fasserts right after it, and the message hadn't shown up before.

2014-12-31T10:53:38.562-0500 I JOURNAL  [initandlisten] journal dir=/data/db/journal
2014-12-31T10:53:38.562-0500 I JOURNAL  [initandlisten] recover begin
2014-12-31T10:53:38.570-0500 I JOURNAL  [initandlisten] recover lsn: 42038
2014-12-31T10:53:38.570-0500 I JOURNAL  [initandlisten] recover /data/db/journal/j._0
2014-12-31T10:53:38.571-0500 I JOURNAL  [initandlisten] recover skipping application of section seq:0 < lsn:42038
2014-12-31T10:53:49.256-0500 I JOURNAL  [initandlisten] recover cleaning up
2014-12-31T10:53:49.256-0500 I JOURNAL  [initandlisten] removeJournalFiles
2014-12-31T10:53:49.262-0500 I JOURNAL  [initandlisten] recover done
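A hedged sketch of what the "recover skipping application of section seq:0 < lsn:42038" line above appears to report: journal sections whose sequence number is below the recorded recovery lsn were already flushed to the data files before the crash, so replay skips them. The function name and structure here are illustrative, not mmapv1's actual code.

```python
# Hedged sketch, assuming mmapv1 replay semantics: a journal section is
# applied only if its sequence number is at or past the recovery lsn.
def should_apply_section(section_seq, recover_lsn):
    """Return True if the section still needs to be replayed.

    A section with seq < lsn was already made durable in the data files
    before the crash, so recovery logs the skip and moves on.
    """
    return section_seq >= recover_lsn

# Mirrors the log line: seq 0 is older than lsn 42038, so it is skipped.
assert should_apply_section(0, 42038) is False
assert should_apply_section(42038, 42038) is True
```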

Comment by Scott Hernandez (Inactive) [ 31/Dec/14 ]

Is journalling on?

If the files are corrupted underneath MongoDB, on the filesystem, then this is expected behavior (which is to say that anything can happen, more or less).

Generated at Thu Feb 08 03:41:59 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.