[SERVER-20724] Collection Corruption that won't fix with repairdatabase or mongodump Created: 01/Oct/15  Updated: 28/Oct/15  Resolved: 28/Oct/15

Status: Closed
Project: Core Server
Component/s: MMAPv1
Affects Version/s: 3.0.6
Fix Version/s: None

Type: Bug
Priority: Critical - P2
Reporter: Allan Edwards
Assignee: Ramon Fernandez Marina
Resolution: Incomplete
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

Run repairDatabase or mongodump on the given collection.

Participants:

 Description   

I have a collection of about 300 GB that contains about 5.5 million documents. When I run repairDatabase or mongodump, the operation reaches the 15th data file for the collection and fails in both cases.



 Comments   
Comment by Ramon Fernandez Marina [ 02/Oct/15 ]

siliconplains44, can you please upload the following files?

  • filestorage.15
  • filestorage.16

That may be sufficient for us to investigate further. Also, can you post the output of:

ls -laR <dbpath>

where <dbpath> is the database path for this mongod instance?

Thanks,
Ramón.

Comment by Ramon Fernandez Marina [ 02/Oct/15 ]

siliconplains44, we may only need the filestorage.15 file – I've asked to see if that's the case, which would make the upload the simplest option. If we need more than that we'll evaluate your proposal.

Thanks for your patience,
Ramón.

Comment by Allan Edwards [ 01/Oct/15 ]

The tarred file is in the 400 GB range. There is no way I can upload this much data to you guys. Would you by chance be open to me giving you access to a server made from a snapshot of this data, so that you can remotely log in to Google Cloud and work on the data?

Comment by Ramon Fernandez Marina [ 01/Oct/15 ]

Thanks for the log siliconplains44; the error is being triggered at:

0x0000000000d4ad40: mongo::RecordStoreV1Base::getNextRecordInExtent(mongo::OperationContext*, mongo::DiskLoc const&) const at /data/mci/src/src/mongo/db/storage/mmap_v1/record_store_v1_base.cpp:274

which could be caused by the data being corrupted on disk. This could have happened due to an error in the storage layer, so I'd recommend you look for storage errors in the system logs. I reckon that this may not yield any useful data: if the error happened a while back but was only detected now, the logs may no longer be available.
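
One rough way to start looking for storage errors: the uname output in the backtrace posted in this ticket indicates Ubuntu 14.04, where kernel messages usually land in /var/log/kern.log and /var/log/syslog. The function name and the pattern list below are assumptions chosen for illustration, not an exhaustive or official list.

```shell
# Hypothetical helper: print lines that look like kernel-level storage errors.
# The pattern list is an assumption; extend it for your own storage stack.
scan_storage_errors() {
    grep -i -E 'I/O error|EXT4-fs error|XFS.*corrupt|medium error|bad sector' "$@"
}

# Example usage (the paths are assumptions for an Ubuntu-style system):
#   scan_storage_errors /var/log/kern.log /var/log/syslog
#   dmesg | scan_storage_errors
```

With no file arguments the helper reads standard input, so it can also be fed from dmesg or from rotated logs piped through zcat.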

In order to rule out a bug in mongod as the cause of this problem, I'd like to ask you to share this database with us so we can inspect the nature of the corruption. I've created an upload portal so you can send us data privately and securely. You'll need to split the data into chunks to be able to upload it. Here's how:

cd <your_mongodb_dbpath>
tar czf upload.tgz local.* filestorage.*
split -d -b 5300000000 upload.tgz part. 

This will create a set of part.NN files that you can upload; you'll need to perform the tar and split operations on a disk with enough free space. Alternatively you can upload the local.* and filestorage.* files directly – the tar+split method is to reduce the number of files to upload.
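
A minimal sketch of how the chunks could be checked before upload and rejoined afterwards, assuming GNU coreutils (sha256sum, cat) are available. The function names and the upload.sha256 manifest name are hypothetical and not part of the instructions above.

```shell
# Hypothetical helpers; the names and the manifest file are illustrative only.

# Record a checksum for every chunk produced by split.
write_manifest() {
    sha256sum part.* > upload.sha256
}

# Verify each chunk against the manifest, then concatenate the chunks
# in suffix order to rebuild the archive under the name given as $1.
verify_and_join() {
    sha256sum -c upload.sha256 && cat part.* > "$1"
}
```

Running write_manifest next to the part.NN files and sending upload.sha256 along with them would let the receiving side call verify_and_join upload.tgz and detect a chunk damaged in transit before unpacking.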

Thanks,
Ramón.

Comment by Allan Edwards [ 01/Oct/15 ]

Here is the dump...

rDatabase_3/filestorage.15, size: 2047MB,  took 0.014 secs
2015-10-01T19:24:45.019+0000 F -        [initandlisten] Invalid access at address: 0x7ef99907c0b8
2015-10-01T19:24:45.073+0000 F -        [initandlisten] Got signal: 7 (Bus error).
 0xf5bfc9 0xf5b892 0xf5bbee 0x7effddcf9340 0xd4ad40 0xd4ad96 0xd578a5 0xd5a3ac 0xbf7f74 0x808dfc 0x7d6b19 0x7effdc6dfec5 0x8065d7
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B5BFC9"},{"b":"400000","o":"B5B892"},{"b":"400000","o":"B5BBEE"},{"b":"7EFFDDCE
9000","o":"10340"},{"b":"400000","o":"94AD40"},{"b":"400000","o":"94AD96"},{"b":"400000","o":"9578A5"},{"b":"400
000","o":"95A3AC"},{"b":"400000","o":"7F7F74"},{"b":"400000","o":"408DFC"},{"b":"400000","o":"3D6B19"},{"b":"7EF
FDC6BE000","o":"21EC5"},{"b":"400000","o":"4065D7"}],"processInfo":{ "mongodbVersion" : "3.0.6", "gitVersion" : 
"1ef45a23a4c5e3480ac919b28afcba3c615488f2", "uname" : { "sysname" : "Linux", "release" : "3.16.0-50-generic", "v
ersion" : "#66~14.04.1-Ubuntu SMP Thu Sep 10 17:05:00 UTC 2015", "machine" : "x86_64" }, "somap" : [ { "elfType"
 : 2, "b" : "400000", "buildId" : "BF5AC37B50D416FD8D6D427E561426ED60291032" }, { "b" : "7FFCFBDDC000", "elfType
" : 3, "buildId" : "46CB04C65D24DA120A0E9175373F76CA879E1B3A" }, { "b" : "7EFFDDCE9000", "path" : "/lib/x86_64-l
inux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "9318E8AF0BFBE444731BB0461202EF57F7C39542" }, { "b" : "7EF
FDDA8A000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "A20EFFEC993A8441FA17F2
079F923CBD04079E19" }, { "b" : "7EFFDD6AF000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 
3, "buildId" : "F000D29917E9B6E94A35A8F02E5C62846E5916BC" }, { "b" : "7EFFDD4A7000", "path" : "/lib/x86_64-linux
-gnu/librt.so.1", "elfType" : 3, "buildId" : "92FCF41EFE012D6186E31A59AD05BDBB487769AB" }, { "b" : "7EFFDD2A3000
", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "C1AE4CB7195D337A77A3C689051DABAA3980
CA0C" }, { "b" : "7EFFDCF9F000", "path" : "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "elfType" : 3, "buildId" :
 "4BF6F7ADD8244AD86008E6BF40D90F8873892197" }, { "b" : "7EFFDCC99000", "path" : "/lib/x86_64-linux-gnu/libm.so.6
", "elfType" : 3, "buildId" : "1D76B71E905CB867B27CEF230FCB20F01A3178F5" }, { "b" : "7EFFDCA83000", "path" : "/l
ib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "8D0AA71411580EE6C08809695C3984769F25725B" }, { "
b" : "7EFFDC6BE000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "30C94DC66A1FE95180C
3D68D2B89E576D5AE213C" }, { "b" : "7EFFDDF07000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildI
d" : "9F00581AB3C73E3AEA35995A0C50D24D59A01D47" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf5bfc9]
 mongod(+0xB5B892) [0xf5b892]
 mongod(+0xB5BBEE) [0xf5bbee]
 libpthread.so.0(+0x10340) [0x7effddcf9340]
 mongod(_ZNK5mongo17RecordStoreV1Base21getNextRecordInExtentEPNS_16OperationContextERKNS_7DiskLocE+0x10) [0xd4ad40]
 mongod(_ZNK5mongo17RecordStoreV1Base13getNextRecordEPNS_16OperationContextERKNS_7DiskLocE+0x16) [0xd4ad96]
 mongod(_ZN5mongo27SimpleRecordStoreV1Iterator7getNextEv+0x85) [0xd578a5]
 mongod(_ZN5mongo12MMAPV1Engine14repairDatabaseEPNS_16OperationContextERKSsbb+0xEFC) [0xd5a3ac]
 mongod(_ZN5mongo14repairDatabaseEPNS_16OperationContextEPNS_13StorageEngineERKSsbb+0xEB4) [0xbf7f74]
 mongod(_ZN5mongo13initAndListenEi+0xA5C) [0x808dfc]
 mongod(main+0x139) [0x7d6b19]
 libc.so.6(__libc_start_main+0xF5) [0x7effdc6dfec5]
 mongod(+0x4065D7) [0x8065d7]
-----  END BACKTRACE  -----
wallanedwards@mongodb-flagship-2cpu:~$ 
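
For reading the backtrace above: each JSON frame carries a load base "b" and an offset "o" (both hexadecimal), and their sum is the runtime address shown in brackets on the symbolized lines. A small sketch of that arithmetic, using a hypothetical function name:

```shell
# Hypothetical helper: add a frame's base and offset (hex strings without
# the 0x prefix) and print the runtime address in lowercase hex.
frame_address() {
    printf '0x%x\n' "$(( 0x$1 + 0x$2 ))"
}

# Frame {"b":"400000","o":"94AD40"} from the dump:
frame_address 400000 94AD40   # prints 0xd4ad40
```

The result matches the address of the getNextRecordInExtent frame in the dump.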

Comment by Allan Edwards [ 01/Oct/15 ]

I will run the process again and send you the full dump.

Comment by Ramon Fernandez Marina [ 01/Oct/15 ]

siliconplains44, can you please post the logs you get when you run the repair operation? What is the exact error message?
