[SERVER-32544] Corrupt DB after Hard Crash (repeated 3 times) Created: 04/Jan/18  Updated: 29/Jan/18  Resolved: 05/Jan/18

Status: Closed
Project: Core Server
Component/s: Stability, Storage
Affects Version/s: 3.6.1
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: John Page Assignee: Daniel Gottlieb (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

This won't be easy.

Centos 7 VM - VirtualBox 5.1.30 on OSX, 2 cores, 8GB RAM
Use mlaunch to start a 2-shard CSRS MDB 3.6.1 cluster with 0.5GB RAM per WT cache
mongorestore 5GB of data (from POCDriver)
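
Roughly the commands involved, as a sketch - the exact mlaunch flags, ports, and dump path here are assumptions, not a record of what was actually run:

mlaunch init --sharded 2 --replicaset --csrs --dir /root/data --wiredTigerCacheSizeGB 0.5   # assumed flags; 0.5GB WT cache per node
mongorestore --port 27017 /root/dump                                                        # ~5GB POCDriver dump into an unsharded collection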

The VM crashes - sometimes leaving the DB corrupted.

Database files are available.

Participants:

 Description   

MongoDB 3.6 - Centos 7

Running in a VM (VirtualBox on OSX) - several times VirtualBox has crashed with a memory error. Note: this is NOT the bug.

After this crash, the WT databases that were being written to at the time are corrupted and cannot start.

2018-01-04T16:45:57.564+0000 I CONTROL  [initandlisten] MongoDB starting : pid=1339 port=27024 dbpath=/root/data/configRepl/rs1/db 64-bit host=centos7.local
2018-01-04T16:45:57.564+0000 I CONTROL  [initandlisten] db version v3.6.1
2018-01-04T16:45:57.564+0000 I CONTROL  [initandlisten] git version: 025d4f4fe61efd1fb6f0005be20cb45a004093d1
2018-01-04T16:45:57.564+0000 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
2018-01-04T16:45:57.564+0000 I CONTROL  [initandlisten] allocator: tcmalloc
2018-01-04T16:45:57.564+0000 I CONTROL  [initandlisten] modules: none
2018-01-04T16:45:57.564+0000 I CONTROL  [initandlisten] build environment:
2018-01-04T16:45:57.564+0000 I CONTROL  [initandlisten]     distmod: rhel70
2018-01-04T16:45:57.564+0000 I CONTROL  [initandlisten]     distarch: x86_64
2018-01-04T16:45:57.564+0000 I CONTROL  [initandlisten]     target_arch: x86_64
2018-01-04T16:45:57.564+0000 I CONTROL  [initandlisten] options: { net: { port: 27024 }, processManagement: { fork: true }, replication: { replSet: "configRepl" }, security: { keyFile: "/root/data/keyfile" }, sharding: { clusterRole: "configsvr" }, storage: { dbPath: "/root/data/configRepl/rs1/db", wiredTiger: { engineConfig: { cacheSizeGB: 0.5 } } }, systemLog: { destination: "file", path: "/root/data/configRepl/rs1/mongod.log" } }
2018-01-04T16:45:57.565+0000 W -        [initandlisten] Detected unclean shutdown - /root/data/configRepl/rs1/db/mongod.lock is not empty.
2018-01-04T16:45:57.566+0000 I -        [initandlisten] Detected data files in /root/data/configRepl/rs1/db created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
2018-01-04T16:45:57.566+0000 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2018-01-04T16:45:57.566+0000 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=512M,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),statistics_log=(wait=0),verbose=(recovery_progress),
2018-01-04T16:45:57.950+0000 I STORAGE  [initandlisten] WiredTiger message [1515084357:950005][1339:0x7f560203bb00], txn-recover: Main recovery loop: starting at 2/42880
2018-01-04T16:45:57.950+0000 I STORAGE  [initandlisten] WiredTiger message [1515084357:950414][1339:0x7f560203bb00], txn-recover: Recovering log 2 through 3
2018-01-04T16:45:58.029+0000 I STORAGE  [initandlisten] WiredTiger message [1515084358:29048][1339:0x7f560203bb00], file:index-49--1899953198262174362.wt, txn-recover: Recovering log 3 through 3
2018-01-04T16:45:58.186+0000 I STORAGE  [initandlisten] Starting WiredTigerRecordStoreThread local.oplog.rs
2018-01-04T16:45:58.186+0000 I STORAGE  [initandlisten] The size storer reports that the oplog contains 113 records totaling to 24685 bytes
2018-01-04T16:45:58.186+0000 I STORAGE  [initandlisten] Scanning the oplog to determine where to place markers for truncation
2018-01-04T16:45:58.199+0000 I CONTROL  [initandlisten] ** WARNING: You are running this process as the root user, which is not recommended.
2018-01-04T16:45:58.199+0000 I CONTROL  [initandlisten] 
2018-01-04T16:45:58.199+0000 I CONTROL  [initandlisten] ** WARNING: This server is bound to localhost.
2018-01-04T16:45:58.199+0000 I CONTROL  [initandlisten] **          Remote systems will be unable to connect to this server. 
2018-01-04T16:45:58.199+0000 I CONTROL  [initandlisten] **          Start the server with --bind_ip <address> to specify which IP 
2018-01-04T16:45:58.199+0000 I CONTROL  [initandlisten] **          addresses it should serve responses from, or with --bind_ip_all to
2018-01-04T16:45:58.199+0000 I CONTROL  [initandlisten] **          bind to all interfaces. If this behavior is desired, start the
2018-01-04T16:45:58.199+0000 I CONTROL  [initandlisten] **          server with --bind_ip 127.0.0.1 to disable this warning.
2018-01-04T16:45:58.199+0000 I CONTROL  [initandlisten] 
2018-01-04T16:45:58.200+0000 I CONTROL  [initandlisten] 
2018-01-04T16:45:58.200+0000 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2018-01-04T16:45:58.200+0000 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2018-01-04T16:45:58.200+0000 I CONTROL  [initandlisten] 
2018-01-04T16:45:58.200+0000 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
2018-01-04T16:45:58.200+0000 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2018-01-04T16:45:58.200+0000 I CONTROL  [initandlisten] 
2018-01-04T16:45:58.245+0000 I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory '/root/data/configRepl/rs1/db/diagnostic.data'
2018-01-04T16:45:58.246+0000 I SHARDING [initandlisten] first cluster operation detected, adding sharding hook to enable versioning and authentication to remote servers
2018-01-04T16:45:58.246+0000 I SHARDING [thread1] creating distributed lock ping thread for process ConfigServer (sleeping for 30000ms)
2018-01-04T16:45:58.248+0000 I REPL     [initandlisten] Starting recovery oplog application at the appliedThrough: { ts: Timestamp(1515080359, 2), t: 2 }
2018-01-04T16:45:58.248+0000 I REPL     [initandlisten] Replaying stored operations from { : Timestamp(1515080359, 2) } (exclusive) to { : Timestamp(1515082949, 2) } (inclusive).
2018-01-04T16:45:58.249+0000 F REPL     [initandlisten] Oplog entry at { : Timestamp(1515080359, 2) } is missing; actual entry found is { : Timestamp(1515082746, 1) }
2018-01-04T16:45:58.249+0000 F -        [initandlisten] Fatal Assertion 40292 at src/mongo/db/repl/replication_recovery.cpp 217
2018-01-04T16:45:58.249+0000 F -        [initandlisten]



 Comments   
Comment by Daniel Gottlieb (Inactive) [ 05/Jan/18 ]

Thanks John. Feel free to let us know if some new data pops up.

Comment by John Page [ 05/Jan/18 ]

I cannot reproduce this short of having a physical box to pull the plug on; it is not reproducible with a kill -9 of mongod on AWS. I'm happy to go with bruce.lucas's view that it's a (virtual) hardware fault in the disk system - all data was on the same drive.
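
(Roughly the shape of the kill -9 test - a sketch only; the loop, timings, and --dir path are assumptions:)

# repeatedly hard-kill every mongod while a mongorestore runs, then restart and check recovery
while true; do
  sleep $((RANDOM % 60 + 30))      # let the restore make some progress
  pkill -9 mongod                  # SIGKILL all mongod processes in the cluster
  mlaunch start --dir /root/data   # restart the cluster and let WT/replication recover
done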

Comment by John Page [ 04/Jan/18 ]

To clarify, when it broke I was running mongorestore on the included dump into an unsharded collection.

Comment by Susan LoVerso [ 04/Jan/18 ]

The content of configRepl/rs3/db/WiredTiger.wt shows clear VM/OS-level corruption. Trying to list the metadata gives this error:

[1515091605:537623][6225:0x7fffb162e3c0], file:WiredTiger.wt, WT_CURSOR.next: read checksum error for 32768B block at offset 192512: block header checksum of 3738547056 doesn't match expected checksum of 1129453242
[1515091605:537708][6225:0x7fffb162e3c0], file:WiredTiger.wt, WT_CURSOR.next: WiredTiger.wt: encountered an illegal file format or internal value: (__wt_block_read_off, 227)
[1515091605:537716][6225:0x7fffb162e3c0], file:WiredTiger.wt, WT_CURSOR.next: the process must exit and restart: WT_PANIC: WiredTiger library panic

Offset 192512 is 0x2f000. The content at that address is garbage (mostly 0's) and the following block contains data from what appears to be the mongodb informational log file. Again I include a bit of the previous, uncorrupted block with actual WiredTiger metadata content:

0002ef00  72 00 80 88 61 70 70 5f  6d 65 74 61 64 61 74 61  |r...app_metadata|
0002ef10  3d 2c 63 6f 6c 67 72 6f  75 70 73 3d 2c 63 6f 6c  |=,colgroups=,col|
0002ef20  6c 61 74 6f 72 3d 2c 63  6f 6c 75 6d 6e 73 3d 2c  |lator=,columns=,|
0002ef30  6b 65 79 5f 66 6f 72 6d  61 74 3d 75 2c 76 61 6c  |key_format=u,val|
0002ef40  75 65 5f 66 6f 72 6d 61  74 3d 75 00 00 00 00 00  |ue_format=u.....|
0002ef50  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
0002f000  00 00 00 00 00 00 00 00  69 00 00 00 00 00 00 00  |........i.......|
0002f010  33 00 00 00 02 00 00 00  06 00 00 00 00 10 00 00  |3...............|
0002f020  70 b3 d5 de 01 00 00 00  05 00 30 87 80 81 e4 a1  |p.........0.....|
0002f030  a8 a6 94 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
0002f040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00030000  5b 63 6f 6e 6e 32 39 5d  20 63 6f 6d 6d 61 6e 64  |[conn29] command|
00030010  20 50 4f 43 44 42 2e 50  4f 43 43 4f 4c 4c 20 63  | POCDB.POCCOLL c|
00030020  6f 6d 6d 61 6e 64 3a 20  69 6e 73 65 72 74 20 7b  |ommand: insert {|
00030030  20 69 6e 73 65 72 74 3a  20 22 50 4f 43 43 4f 4c  | insert: "POCCOL|
00030040  4c 22 2c 20 62 79 70 61  73 73 44 6f 63 75 6d 65  |L", bypassDocume|
00030050  6e 74 56 61 6c 69 64 61  74 69 6f 6e 3a 20 66 61  |ntValidation: fa|
00030060  6c 73 65 2c 20 6f 72 64  65 72 65 64 3a 20 66 61  |lse, ordered: fa|
00030070  6c 73 65 2c 20 64 6f 63  75 6d 65 6e 74 73 3a 20  |lse, documents: |
00030080  31 30 30 30 2c 20 73 68  61 72 64 56 65 72 73 69  |1000, shardVersi|
00030090  6f 6e 3a 20 5b 20 54 69  6d 65 73 74 61 6d 70 28  |on: [ Timestamp(|
000300a0  30 2c 20 30 29 2c 20 4f  62 6a 65 63 74 49 64 28  |0, 0), ObjectId(|
000300b0  27 30 30 30 30 30 30 30  30 30 30 30 30 30 30 30  |'000000000000000|
000300c0  30 30 30 30 30 30 30 30  30 27 29 20 5d 2c 20 77  |000000000') ], w|
000300d0  72 69 74 65 43 6f 6e 63  65 72 6e 3a 20 7b 20 67  |riteConcern: { g|
000300e0  65 74 4c 61 73 74 45 72  72 6f 72 3a 20 31 2c 20  |etLastError: 1, |
000300f0  77 3a 20 22 6d 61 6a 6f  72 69 74 79 22 20 7d 2c  |w: "majority" },|
00030100  20 24 63 6c 75 73 74 65  72 54 69 6d 65 3a 20 7b  | $clusterTime: {|
00030110  20 63 6c 75 73 74 65 72  54 69 6d 65 3a 20 54 69  | clusterTime: Ti|
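
(A region like this can be pulled straight from the file with hexdump; the offsets match the dump above, the path assumes the uploaded files - a sketch:)

hexdump -C -s 0x2ef00 -n 0x1300 configRepl/rs3/db/WiredTiger.wt   # uncorrupted metadata tail plus the garbage block at 0x2f000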

Comment by Daniel Gottlieb (Inactive) [ 04/Jan/18 ]

Same for configRepl/rs3. All members of the shards started up fine. john.page, can you clarify what you mean when you say the crash can happen when "restoring BSON"?

The FTDC data for the shards isn't as complete as I would expect, but from what I can see, would it be correct that no application was running against the system, and that the only operations being performed were internal sharded cluster procedures?

Comment by Susan LoVerso [ 04/Jan/18 ]

That file definitely shows on-disk corruption. Offset 196608 is 0x30000. Here's what is in the file (from hexdump -C); I include some of the correct and typical content right before it too:

0002fcb0  72 6d 61 74 3d 75 00 45  74 61 62 6c 65 3a 73 69  |rmat=u.Etable:si|
0002fcc0  7a 65 53 74 6f 72 65 72  00 80 88 61 70 70 5f 6d  |zeStorer...app_m|
0002fcd0  65 74 61 64 61 74 61 3d  2c 63 6f 6c 67 72 6f 75  |etadata=,colgrou|
0002fce0  70 73 3d 2c 63 6f 6c 6c  61 74 6f 72 3d 2c 63 6f  |ps=,collator=,co|
0002fcf0  6c 75 6d 6e 73 3d 2c 6b  65 79 5f 66 6f 72 6d 61  |lumns=,key_forma|
0002fd00  74 3d 75 2c 76 61 6c 75  65 5f 66 6f 72 6d 61 74  |t=u,value_format|
0002fd10  3d 75 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |=u..............|
0002fd20  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00030000  69 00 2c 3d b8 00 04 0f  29 7f c1 0e 0b ba 00 04  |i.,=....).......|
00030010  0f 29 81 c9 0e 0b bc 00  04 0f 29 83 41 0e 0b be  |.)........).A...|
00030020  00 04 0f 29 81 d1 0e 0b  c0 00 04 0f 29 81 11 0e  |...)........)...|
00030030  0b c2 00 04 0f 29 7e 91  0e 0b c4 00 04 0f 29 80  |.....)~.......).|
00030040  59 0e 0b c6 00 04 0f 29  7e f1 0e 0b c8 00 04 0f  |Y......)~.......|
00030050  29 84 39 0e 0b ca 00 04  0f 29 82 b9 0e 0b cc 00  |).9......)......|
00030060  04 0f 29 82 49 0e 0b ce  00 04 0f 29 7f 61 0e 0b  |..).I......).a..|
00030070  d0 00 04 0f 29 83 19 0e  0b d2 00 04 0f 29 81 d9  |....)........)..|
00030080  0e 0b d4 00 04 0f 29 82  c1 0e 0b d6 00 04 0f 29  |......)........)|
00030090  7e 99 0e 0b d8 00 04 0f  29 7e f9 0e 0b da 00 04  |~.......)~......|
000300a0  0f 29 7f 01 0e 0b dc 00  04 0f 29 7f 09 0e 0b de  |.)........).....|
000300b0  00 04 0f 29 7f 69 0e 0b  e0 00 04 0f 29 80 61 0e  |...).i......).a.|
...
00034fa0  00 04 0f 29 c3 69 0e 0b  f2 00 04 0f 29 c1 f1 0e  |...).i......)...|
00034fb0  0b f4 00 04 0f 29 c1 f9  0e 0b f6 00 04 0f 29 c4  |.....)........).|
00034fc0  69 0e 0b f8 00 04 0f 29  c2 a9 0e 0b fa 00 04 0f  |i......)........|
00034fd0  29 c1 b1 0e 0b fc 00 04  0f 29 c4 d1 0e 0b fe 00  |)........)......|
00034fe0  04 0f 29 c2 e1 12 0a 08  00 00 04 0f 29 c3 71 0e  |..).........).q.|
00034ff0  0b 02 00 04 0f 29 c3 19  0e 0b 04 00 04 0f 29 c4  |.....)........).|
00035000

Comment by Daniel Gottlieb (Inactive) [ 04/Jan/18 ]

Trying to start configRepl/rs2 (with the intention of comparing the oplog to rs1), WT gets checksum failures reading WiredTiger.wt:

dgottlieb@chimichurri ~/xgen/mongo/bf[master]$ ../mongod --dbpath data/configRepl/rs2/db/
2018-01-04T13:24:05.421-0500 I CONTROL  [initandlisten] MongoDB starting : pid=4259 port=27017 dbpath=data/configRepl/rs2/db/ 64-bit host=chimichurri
2018-01-04T13:24:05.421-0500 I CONTROL  [initandlisten] db version v0.0.0
2018-01-04T13:24:05.421-0500 I CONTROL  [initandlisten] git version: unknown
2018-01-04T13:24:05.421-0500 I CONTROL  [initandlisten] allocator: tcmalloc
2018-01-04T13:24:05.421-0500 I CONTROL  [initandlisten] modules: ninja 
2018-01-04T13:24:05.421-0500 I CONTROL  [initandlisten] build environment:
2018-01-04T13:24:05.421-0500 I CONTROL  [initandlisten]     distarch: x86_64
2018-01-04T13:24:05.421-0500 I CONTROL  [initandlisten]     target_arch: x86_64
2018-01-04T13:24:05.421-0500 I CONTROL  [initandlisten] options: { storage: { dbPath: "data/configRepl/rs2/db/" } }
2018-01-04T13:24:05.421-0500 W STORAGE  [initandlisten] Detected unclean shutdown - data/configRepl/rs2/db/mongod.lock is not empty.
2018-01-04T13:24:05.421-0500 I STORAGE  [initandlisten] Detected data files in data/configRepl/rs2/db/ created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
2018-01-04T13:24:05.421-0500 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2018-01-04T13:24:05.421-0500 I STORAGE  [initandlisten] 
2018-01-04T13:24:05.421-0500 I STORAGE  [initandlisten] ** WARNING: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine
2018-01-04T13:24:05.421-0500 I STORAGE  [initandlisten] **          See http://dochub.mongodb.org/core/prodnotes-filesystem
2018-01-04T13:24:05.421-0500 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=31635M,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),statistics_log=(wait=0),verbose=(recovery_progress),
2018-01-04T13:24:05.474-0500 E STORAGE  [initandlisten] WiredTiger error (0) [1515090245:474346][4259:0x7f2270ab29c0], file:WiredTiger.wt, WT_CURSOR.next: read checksum error for 20480B block at offset 196608: block header checksum of 688849920 doesn't match expected checksum of 66192900
2018-01-04T13:24:05.474-0500 E STORAGE  [initandlisten] WiredTiger error (0) [1515090245:474382][4259:0x7f2270ab29c0], file:WiredTiger.wt, WT_CURSOR.next: WiredTiger.wt: encountered an illegal file format or internal value: (__wt_block_read_off, 227)
2018-01-04T13:24:05.474-0500 E STORAGE  [initandlisten] WiredTiger error (-31804) [1515090245:474390][4259:0x7f2270ab29c0], file:WiredTiger.wt, WT_CURSOR.next: the process must exit and restart: WT_PANIC: WiredTiger library panic
2018-01-04T13:24:05.474-0500 F -        [initandlisten] Fatal Assertion 28558 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 361
2018-01-04T13:24:05.474-0500 F -        [initandlisten] 

Comment by Bruce Lucas (Inactive) [ 04/Jan/18 ]

The kernel backtrace above comes from OSX, and has been reported (based on the "Kernel Extensions in backtrace" section) by other users. The listed kernel extensions in the backtrace include "com.apple.iokit.IOStorageFamily". This makes me suspect that the root cause could be an OSX bug that can cause storage corruption, but that's purely based on thin circumstantial evidence.

I think a repro on AWS would be useful.

Comment by Susan LoVerso [ 04/Jan/18 ]

Particularly if you think there is corruption, saving a copy of the pre-restart files would be helpful, as well as the post-corruption files.
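
(For example, something along these lines before restarting any node - the path is assumed from the setup above:)

tar czf SERVER32544-pre-restart-$(date +%Y%m%dT%H%M%S).tgz /root/data   # snapshot the whole dbpath tree before any recovery attempt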

Comment by John Page [ 04/Jan/18 ]

Would you like me to fuzz test it on AWS tomorrow and see if I can cause any issues once I take Virtualbox out of the picture?

Comment by John Page [ 04/Jan/18 ]

pasette - it's not consistent. I can try to reproduce, but I don't like doing so as it involves a kernel panic on my laptop caused by the VM.

In broad terms it's: small sharded cluster on a VM; the VM causes a host OS kernel panic; after that a WT file is corrupted. The kernel panic happens when WT is (a) restoring BSON or (b) balancing a bunch of chunks.

Not sure how I can give you a reproducible case, but I have given you the files post-corruption.

Comment by Daniel Pasette (Inactive) [ 04/Jan/18 ]

John, can you give more detailed repro steps?

Comment by John Page [ 04/Jan/18 ]

VirtualBox caused a kernel panic on my Mac

Anonymous UUID:       93442E27-CB9B-A529-B50F-3190B67B42C0
 
Thu Jan  4 16:23:25 2018
 
*** Panic Report ***
panic(cpu 3 caller 0xffffff800a29e69d): "a freed zone element has been modified in zone kalloc.16: expected 0xe4988eb1070c832a but found 0xffffff804de0bfc0, bits changed 0x1b6771314aec3cea, at offset 0 of 16 in element 0xffffff803f04fec0, cookies 0x3f00119a0e270bea 0x53521c4d786362f"@/BuildRoot/Library/Caches/com.apple.xbs/Sources/xnu/xnu-4570.31.3/osfmk/kern/zalloc.c:1120
Backtrace (CPU 3), Frame : Return Address
0xffffff91fddeb170 : 0xffffff800a2505f6 
0xffffff91fddeb1c0 : 0xffffff800a37d604 
0xffffff91fddeb200 : 0xffffff800a36f0f9 
0xffffff91fddeb280 : 0xffffff800a202120 
0xffffff91fddeb2a0 : 0xffffff800a25002c 
0xffffff91fddeb3d0 : 0xffffff800a24fdac 
0xffffff91fddeb430 : 0xffffff800a29e69d 
0xffffff91fddeb490 : 0xffffff800a29df70 
0xffffff91fddeb4e0 : 0xffffff800a29c9b3 
0xffffff91fddeb610 : 0xffffff800a25c204 
0xffffff91fddeb650 : 0xffffff800a4ba1ab 
0xffffff91fddeb680 : 0xffffff7f8b1bb3d4 
0xffffff91fddeb910 : 0xffffff7f8b1b9826 
0xffffff91fddeba00 : 0xffffff7f8b1afb5b 
0xffffff91fddeba50 : 0xffffff7f8b1b3152 
0xffffff91fddebbc0 : 0xffffff800a900fcd 
0xffffff91fddebc00 : 0xffffff800a4af0c5 
0xffffff91fddebe00 : 0xffffff800a6f29eb 
0xffffff91fddebe30 : 0xffffff800a7444ec 
0xffffff91fddebf40 : 0xffffff800a7fb3e8 
0xffffff91fddebfa0 : 0xffffff800a202906 
      Kernel Extensions in backtrace:
         com.apple.security.sandbox(300.0)[0E42B540-D3DB-3904-BD91-F604DB2F5712]@0xffffff7f8b1ac000->0xffffff7f8b1cbfff
            dependency: com.apple.driver.AppleMobileFileIntegrity(1.0.5)[58669FC2-CC90-3594-AD69-DB89B923FD20]@0xffffff7f8b192000
            dependency: com.apple.kext.AppleMatch(1.0.0d1)[2D2996CA-A15E-39E6-9D6B-2BE6DF131972]@0xffffff7f8b1a7000
            dependency: com.apple.iokit.IOStorageFamily(2.1)[B0FCE898-1542-34C0-B845-46FAE81DD9E6]@0xffffff7f8ac46000
 
BSD process name corresponding to current thread: VirtualBoxVM

So yes, a virtual hardware error causing a physical OS panic.

The first time I saw this I think it was WT corruption, but given the VM crash I ignored it; this time it looks like a replication issue.

Comment by Bruce Lucas (Inactive) [ 04/Jan/18 ]

Can you clarify what you mean by "memory error"? What were the specific messages and where and when did you see them? If there are (virtual) hardware errors, i.e. the VM is not operating properly, then data integrity can't be guaranteed.

Having said that, this doesn't appear to be data corruption per se, i.e. the data is well-formed at the storage engine level. It appears rather (as you mention) to be a replication issue - the oplog appears to be in an unexpected state. This could be a replication bug related to a crash and unrelated to the reason for the crash ("memory errors").
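
(One way to see the mismatch directly: start the node standalone on the same dbpath - no --replSet/--configsvr - and compare the recovery marker in local.replset.minvalid with what is actually present in local.oplog.rs. A rough sketch; the port and log path are made up:)

mongod --dbpath /root/data/configRepl/rs1/db --port 28000 --fork --logpath /tmp/rs1-standalone.log
mongo --port 28000 local --quiet --eval 'printjson(db.replset.minvalid.findOne())'                        # holds the appliedThrough marker
mongo --port 28000 local --quiet --eval 'printjson(db.oplog.rs.find().sort({ts: 1}).limit(3).toArray())'  # earliest oplog entries actually on disk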

Comment by John Page [ 04/Jan/18 ]

Files available at https://s3-eu-west-1.amazonaws.com/johnpage2016/SERVER32544.tgz

Comment by John Page [ 04/Jan/18 ]

This message looks subtly different from the last one, which I didn't keep - perhaps replication related.
