|
Hi clarkx,
Thank you for answering my questions. Please be aware that using kill -9 to stop the mongod process may compromise the validity of the data files, and likely explains the issues you have encountered. I would recommend executing one of the commands listed in our documentation to stop the instance instead.
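For reference, the documented alternatives all amount to asking mongod for a clean shutdown rather than force-killing it. A minimal sketch (the dbpath below is an assumption; substitute the one from your configuration):

```shell
# Clean-shutdown alternatives to `killall -9 mongod`.
# The dbpath /var/lib/mongodb is an assumption -- use your own.

# 1. Send SIGTERM (15), which lets mongod flush and close its data files:
kill -15 "$(pidof mongod)"

# 2. Or ask mongod itself to shut down the instance owning this dbpath:
mongod --shutdown --dbpath /var/lib/mongodb

# 3. Or from the mongo shell:
mongo admin --eval "db.shutdownServer()"
```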
Unfortunately, the steps to restore the WiredTiger.wt file are not yet ready to be shared. For MongoDB-related support discussion, please post on the mongodb-users group or Stack Overflow with the mongodb tag; a question like this one, which requires extended discussion, is best suited to the mongodb-users group.
Kind regards,
Thomas
|
|
Hi,
Thanks, I will upgrade the server hardware and enable the journal option.
Answer 1: I deleted all files in the db folder and started mongod; it copied the data from the replica set automatically.
Answer 2: I don't know exactly; not SSD. I guess it should be the network, because the server is in the cloud: www.aliyun.com.
Answer 3: No. Before copying or moving db files, I always kill mongod: killall -9 mongod.
Answer 4: Yes. In the second case, because there was no replica set, I had to restore the data from a backup, and some data was missing.
Answer 5: A shell script that runs mongodump every day.
By the way, can you give me a tool that can edit a .wt file? If so, I can remove the missing collections from _mdb_catalog.wt and repair most of the data in the future.
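The daily-mongodump approach mentioned above can be sketched as a small script. Host, port, directory layout, and the 14-day retention are all assumptions; the dump step is guarded so the sketch is a no-op on a machine without mongodump installed:

```shell
#!/bin/sh
# Daily mongodump backup sketch. Host, port, BACKUP_ROOT, and the
# 14-day retention are assumptions; adjust to the actual deployment.
BACKUP_ROOT=./mongo-backups
STAMP=$(date +%Y-%m-%d)

# One dated directory per day.
mkdir -p "$BACKUP_ROOT/$STAMP"

# Dump every database as BSON into today's directory.
if command -v mongodump >/dev/null 2>&1; then
    mongodump --host 127.0.0.1 --port 27017 --out "$BACKUP_ROOT/$STAMP"
else
    echo "mongodump not installed; skipping dump (sketch only)" >&2
fi

# Drop dumps older than 14 days so the backup disk does not fill up.
find "$BACKUP_ROOT" -maxdepth 1 -mindepth 1 -type d -mtime +14 -exec rm -rf {} +
```

Run from cron (e.g. `0 2 * * * /path/to/backup.sh`) this produces one restorable dump per day.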
|
|
Hi clarkx,
From the logs you have uploaded, the original crash was caused by an OOM condition. Going forward, I would recommend allocating additional resources to your system.
I also observed that you are running without journaling enabled. Please note that if MongoDB exits unexpectedly between checkpoints, journaling is required to recover writes that occurred after the last checkpoint.
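If journaling was switched off explicitly, re-enabling it is a one-line change. A sketch; the dbpath and config-file path are assumptions:

```shell
# Re-enable journaling at startup (dbpath is an assumption):
mongod --journal --dbpath /var/lib/mongodb

# Or, equivalently, in the YAML config file (e.g. /etc/mongod.conf):
#   storage:
#     journal:
#       enabled: true
```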
Unfortunately, repairDatabase cannot continue with the repair attempt with this level of corruption. There is an open ticket, SERVER-19815, that would make repairDatabase more robust when executed against WiredTiger data files. Please feel free to vote for SERVER-19815 and watch it for updates. In this situation, my recommendation would be to restore your database from a back up.
To help us better understand what's going on here, I have a few questions about data storage and the configuration of your environment. But, please note that in these sorts of situations it can be difficult to understand the cause of the corruption without a straightforward reproduction.
- How did you restore the database after the unclean shutdown on 2016-07-11?
- What kind of underlying storage mechanism are you using? Are the storage devices attached locally or over the network? Are the disks SSDs or HDDs? What kind of RAID and/or volume management system are you using?
- Have you manipulated (copied or moved) the underlying database files? If so, was mongod running?
- Have you ever restored this instance from backups?
- What method do you use to create backups?
Thank you,
Thomas
|
|
Cannot repair the db:
2016-08-15T18:47:09.576+0800 I STORAGE [initandlisten] Repairing collection kpiEngineV3.projectInfoView
|
2016-08-15T18:47:09.576+0800 I STORAGE [initandlisten] Verify failed on uri table:collection-5252--4455765299055713693. Running a salvage operation.
|
2016-08-15T18:47:09.587+0800 I - [initandlisten] Invariant failure rs.get() src/mongo/db/catalog/database.cpp 190
|
2016-08-15T18:47:09.587+0800 I - [initandlisten]
|
|
***aborting after invariant() failure
|
|
|
2016-08-15T18:47:09.595+0800 F - [initandlisten] Got signal: 6 (Aborted).
|
|
0x131ce72 0x131bfc9 0x131c7d2 0x7f50043a0340 0x7f5004000bb9 0x7f5004003fc8 0x12a66db 0xafb4de 0xafcfd4 0xaffe5d 0xe5beb3 0xe5c8ad 0x9b214a 0x9b4e50 0x96e04d 0x7f5003febec5 0x9b1037
|
----- BEGIN BACKTRACE -----
|
{"backtrace":[{"b":"400000","o":"F1CE72","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"F1BFC9"},{"b":"400000","o":"F1C7D2"},{"b":"7F5004390000","o":"10340"},{"b":"7F5003FCA000","o":"36BB9","s":"gsignal"},{"b":"7F5003FCA000","o":"39FC8","s":"abort"},{"b":"400000","o":"EA66DB","s":"_ZN5mongo15invariantFailedEPKcS1_j"},{"b":"400000","o":"6FB4DE","s":"_ZN5mongo8Database30_getOrCreateCollectionInstanceEPNS_16OperationContextENS_10StringDataE"},{"b":"400000","o":"6FCFD4","s":"_ZN5mongo8DatabaseC1EPNS_16OperationContextENS_10StringDataEPNS_20DatabaseCatalogEntryE"},{"b":"400000","o":"6FFE5D","s":"_ZN5mongo14DatabaseHolder6openDbEPNS_16OperationContextENS_10StringDataEPb"},{"b":"400000","o":"A5BEB3"},{"b":"400000","o":"A5C8AD","s":"_ZN5mongo14repairDatabaseEPNS_16OperationContextEPNS_13StorageEngineERKSsbb"},{"b":"400000","o":"5B214A"},{"b":"400000","o":"5B4E50","s":"_ZN5mongo13initAndListenEi"},{"b":"400000","o":"56E04D","s":"main"},{"b":"7F5003FCA000","o":"21EC5","s":"__libc_start_main"},{"b":"400000","o":"5B1037"}],"processInfo":{ "mongodbVersion" : "3.2.8", "gitVersion" : "ed70e33130c977bda0024c125b56d159573dbaf0", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.13.0-32-generic", "version" : "#57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "A53FF676E1D627BD1D9B1BF524DEFA13B667EE83" }, { "b" : "7FFFB78FE000", "elfType" : 3, "buildId" : "E464DBB7341B7B9E7874DC0619C5F429416E6AC6" }, { "b" : "7F50052B1000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "A20EFFEC993A8441FA17F2079F923CBD04079E19" }, { "b" : "7F5004ED6000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "F000D29917E9B6E94A35A8F02E5C62846E5916BC" }, { "b" : "7F5004CCE000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "92FCF41EFE012D6186E31A59AD05BDBB487769AB" }, { "b" : "7F5004ACA000", "path" : 
"/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "C1AE4CB7195D337A77A3C689051DABAA3980CA0C" }, { "b" : "7F50047C4000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "574C6350381DA194C00FF555E0C1784618C05569" }, { "b" : "7F50045AE000", "path" : "/usr/local/lib64/libgcc_s.so.1", "elfType" : 3 }, { "b" : "7F5004390000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "FE662C4D7B14EE804E0C1902FB55218A106BC5CB" }, { "b" : "7F5003FCA000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "7603ABF78951CC138A4105F4516B075D859DFC9A" }, { "b" : "7F5005510000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "9F00581AB3C73E3AEA35995A0C50D24D59A01D47" } ] }}
|
mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x131ce72]
|
mongod(+0xF1BFC9) [0x131bfc9]
|
mongod(+0xF1C7D2) [0x131c7d2]
|
libpthread.so.0(+0x10340) [0x7f50043a0340]
|
libc.so.6(gsignal+0x39) [0x7f5004000bb9]
|
libc.so.6(abort+0x148) [0x7f5004003fc8]
|
mongod(_ZN5mongo15invariantFailedEPKcS1_j+0xCB) [0x12a66db]
|
mongod(_ZN5mongo8Database30_getOrCreateCollectionInstanceEPNS_16OperationContextENS_10StringDataE+0xFE) [0xafb4de]
|
mongod(_ZN5mongo8DatabaseC1EPNS_16OperationContextENS_10StringDataEPNS_20DatabaseCatalogEntryE+0x284) [0xafcfd4]
|
mongod(_ZN5mongo14DatabaseHolder6openDbEPNS_16OperationContextENS_10StringDataEPb+0x18D) [0xaffe5d]
|
mongod(+0xA5BEB3) [0xe5beb3]
|
mongod(_ZN5mongo14repairDatabaseEPNS_16OperationContextEPNS_13StorageEngineERKSsbb+0x81D) [0xe5c8ad]
|
mongod(+0x5B214A) [0x9b214a]
|
mongod(_ZN5mongo13initAndListenEi+0x930) [0x9b4e50]
|
mongod(main+0x15D) [0x96e04d]
|
libc.so.6(__libc_start_main+0xF5) [0x7f5003febec5]
|
mongod(+0x5B1037) [0x9b1037]
|
----- END BACKTRACE -----
|
Aborted (core dumped)
|
Can I ignore this collection (kpiEngineV3.projectInfoView) and repair the others?
|
|
Crashed again, on MongoDB v3.2.8, single server, no replica set. I have uploaded all the related files.
|
|
Hi clarkx,
Unfortunately, there are liability concerns which prevent us from connecting to external servers.
However, there is a workaround that should allow you to use the upload portal. Would you please use split as follows:
split -d -b 5300000000 dbpath.tgz part.
|
and upload all the part.* files to the portal?
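For reference, here is a small runnable sketch of the split-and-reassemble round trip. The real command above uses 5300000000-byte parts; a small dummy file and chunk size are used here so the sketch runs in a moment:

```shell
# Demonstrate that `split` parts reassemble losslessly.
# Create a 1 MiB dummy archive standing in for dbpath.tgz:
dd if=/dev/urandom of=dbpath.tgz bs=1024 count=1024 2>/dev/null

# Produce numbered chunks part.00, part.01, ... (-d = numeric suffixes):
split -d -b 300000 dbpath.tgz part.

# The receiving side concatenates the parts in order to recover the archive:
cat part.* > reassembled.tgz

# Verify the result is byte-for-byte identical to the original:
cmp dbpath.tgz reassembled.tgz && echo "parts reassemble cleanly"
```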
Thank you for your help,
Thomas
|
|
I put the mongod info file on my FTP server and uploaded a txt file, mongo-file-get.txt, via the secure upload portal; it contains the FTP path, file name, and account info. You can download the mongod info file from there.
|
|
I have collected these files, but they are too big to upload:
<Error><Code>RequestTimeout</Code><Message>Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.</Message>
|
|
Hi clarkx,
Thank you for opening this ticket. I'm sorry you've run into these issues. To help our investigation of these failures, would you please provide additional details?
To better understand what caused the invariant failure you first observed, please upload the following information:
- The complete logs for each node in the replica set
- An archive (tar or zip) of the $dbpath/diagnostic.data directory for each node in the replica set
- The output of rs.config()
- The output of rs.status()
To help us examine the fatal assertion you encountered when the node was restarted, please upload the following:
- A tarball of the WiredTiger files (_mdb_catalog.wt, sizeStorer.wt, WiredTiger* files)
- The output of ls -l on the database directory
- The WiredTigerLog.<long number> file(s)
Finally, please consider creating a backup of the current $dbpath of the affected node.
For your convenience, I have created a secure upload portal for you to use. Files uploaded to this portal will only be visible to MongoDB employees investigating this issue and are routinely deleted after some time.
To resolve this issue, I would recommend a clean resync from another member in the replica set.
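The resync amounts to emptying the affected member's dbpath and letting initial sync recopy everything from a healthy member. A sketch, with assumed paths:

```shell
# Resync sketch for the affected member (paths are assumptions).
# Another member must hold a complete, healthy copy of the data.
kill -15 "$(pidof mongod)"                 # stop the member cleanly
mv /var/lib/mongodb /var/lib/mongodb.bak   # set the corrupt files aside
mkdir -p /var/lib/mongodb                  # empty dbpath triggers initial sync
mongod --config /etc/mongod.conf           # restart; data is recopied over the wire
```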
Kind regards,
Thomas
|
|
Last logs before the server went down:
2016-07-11T18:23:02.339+0800 I COMMAND [conn11200] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "main", configVersion: 3, from: "10.45.18.106:9003", fromId: 2, term: 3 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:348 locks:{} protocol:op_command 47597ms
|
2016-07-11T18:23:02.339+0800 I COMMAND [conn11205] command admin.$cmd command: isMaster { isMaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:368 locks:{} protocol:op_query 23444ms
|
2016-07-11T18:23:02.373+0800 I NETWORK [conn11200] end connection 10.45.18.106:36956 (9 connections now open)
|
2016-07-11T18:23:03.887+0800 I COMMAND [conn11201] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "main", configVersion: 3, from: "10.45.18.106:9002", fromId: 1, term: 3 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:348 locks:{} protocol:op_command 40158ms
|
2016-07-11T18:23:04.086+0800 I - [replExecDBWorker-0] Invariant failure _rsConfig.getNumMembers() == 1 && _selfIndex == 0 && _rsConfig.getMemberAt(0).isElectable() src/mongo/db/repl/replication_coordinator_impl.cpp 2232
|
2016-07-11T18:23:04.186+0800 I - [replExecDBWorker-0]
|
|
***aborting after invariant() failure
|
|
|
2016-07-11T18:23:04.859+0800 I COMMAND [conn253] command admin.$cmd command: ping { ping: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:37 locks:{} protocol:op_query 562ms
|
2016-07-11T18:23:04.848+0800 I NETWORK [conn11201] end connection 10.45.18.106:36957 (6 connections now open)
|
2016-07-11T18:23:05.180+0800 I NETWORK [initandlisten] connection accepted from 10.45.18.106:36962 #11207 (8 connections now open)
|
2016-07-11T18:23:05.526+0800 I COMMAND [ftdc] serverStatus was very slow: { after basic: 180, after asserts: 260, after connections: 290, after extra_info: 350, after globalLock: 370, after locks: 410, after network: 410, after opcounters: 430, after opcountersRepl: 430, after storageEngine: 500, after tcmalloc: 640, after wiredTiger: 1260, at end: 1630 }
|
2016-07-11T18:23:08.097+0800 I COMMAND [conn11122] command kpiEngineV3.projectView_578365f75ffe052d18000034 command: mapReduce { mapreduce: "projectView_578365f75ffe052d18000034", map: "function() {
|
var nameMap = {};
|
var keys = [];
|
|
var getNum = function(key, value) {
|
var name, num;
|
if (keys.length == 0) {
|
if (key.le...", reduce: "function(key, numArr) {
|
var max = 0;
|
for (var i = 0; i < numArr.length; i++) {
|
if (numArr[i] > max) {
|
max = numArr[i];
|
}
|
}
|
|
return max;
|
}", out: { inline: 1 }, query: { type: "data" } } keyUpdates:0 writeConflicts:0 exception: uncaught exception: out of memory code:139 numYields:116 reslen:93 locks:{ Global: { acquireCount: { r: 342 } }, Database: { acquireCount: { r: 2, R: 169 }, acquireWaitCount: { R: 1 }, timeAcquiringMicros: { R: 6284 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_query 230949ms
|
2016-07-11T18:23:09.752+0800 I NETWORK [conn11122] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.45.18.119:50779]
|
2016-07-11T18:23:10.120+0800 I COMMAND [ftdc] serverStatus was very slow: { after basic: 190, after asserts: 320, after connections: 510, after extra_info: 610, after globalLock: 670, after locks: 700, after network: 700, after opcounters: 710, after opcountersRepl: 710, after storageEngine: 790, after tcmalloc: 1060, after wiredTiger: 1920, at end: 2270 }
|
2016-07-11T18:23:11.636+0800 F - [SyncSourceFeedback] std::exception::what(): Resource temporarily unavailable
|
Actual exception type: std::system_error
|
|
0x12f14b2 0x12f1002 0x7fcf55d43836 0x7fcf55d43863 0x7fcf55d96c85 0x7fcf555b3182 0x7fcf552e000d
|
----- BEGIN BACKTRACE -----
|
{"backtrace":[{"b":"400000","o":"EF14B2","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"EF1002"},{"b":"7FCF55CE5000","o":"5E836"},{"b":"7FCF55CE5000","o":"5E863"},{"b":"7FCF55CE5000","o":"B1C85"},{"b":"7FCF555AB000","o":"8182"},{"b":"7FCF551E5000","o":"FB00D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.3", "gitVersion" : "b326ba837cf6f49d65c2f85e1b70f6f31ece7937", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.13.0-32-generic", "version" : "#57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "C1CD0F405485844DA016C6B5275C8BEF3D68DB7A" }, { "b" : "7FFF6E4E6000", "elfType" : 3, "buildId" : "E464DBB7341B7B9E7874DC0619C5F429416E6AC6" }, { "b" : "7FCF567D0000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "D08DD65F97859C71BB2CBBF1043BD968EFE18AAD" }, { "b" : "7FCF563F5000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "F86FA9FB4ECEB4E06B40DBDF761A4172B70A4229" }, { "b" : "7FCF561ED000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "92FCF41EFE012D6186E31A59AD05BDBB487769AB" }, { "b" : "7FCF55FE9000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "C1AE4CB7195D337A77A3C689051DABAA3980CA0C" }, { "b" : "7FCF55CE5000", "path" : "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "elfType" : 3, "buildId" : "19EFDDAB11B3BF5C71570078C59F91CF6592CE9E" }, { "b" : "7FCF559DF000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "1D76B71E905CB867B27CEF230FCB20F01A3178F5" }, { "b" : "7FCF557C9000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "CC0D578C2E0D86237CA7B0CE8913261C506A629A" }, { "b" : "7FCF555AB000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "9318E8AF0BFBE444731BB0461202EF57F7C39542" }, { "b" : "7FCF551E5000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", 
"elfType" : 3, "buildId" : "882AD7AAD54790E2FA6EF64CA2E6188F06BF9207" }, { "b" : "7FCF56A2F000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "9F00581AB3C73E3AEA35995A0C50D24D59A01D47" } ] }}
|
mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x12f14b2]
|
mongod(+0xEF1002) [0x12f1002]
|
libstdc++.so.6(+0x5E836) [0x7fcf55d43836]
|
libstdc++.so.6(+0x5E863) [0x7fcf55d43863]
|
libstdc++.so.6(+0xB1C85) [0x7fcf55d96c85]
|
libpthread.so.0(+0x8182) [0x7fcf555b3182]
|
libc.so.6(clone+0x6D) [0x7fcf552e000d]
|
----- END BACKTRACE -----
|
2016-07-11T18:23:13.166+0800 I COMMAND [conn251] query kpisms.smsinfo query: { deadline: { $gte: 1468232489 } } planSummary: IXSCAN { deadline: -1 } ntoreturn:0 ntoskip:0 keysExamined:0 docsExamined:0 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:0 reslen:20 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } 332ms
|
2016-07-12T08:00:01.479+0800 I COMMAND [conn678] command sms.info command: mapReduce { mapreduce: "info", map: "function() {
|
var ret = this.resp.ret;
|
if(0 == ret){
|
emit("success", 1);
|
}else{
|
emit("fail", 1);
|
}
|
}", reduce: "function(key, value) { return Array.sum(value) }", out: { inline: 1 } } planSummary: COUNT keyUpdates:0 writeConflicts:0 numYields:3 reslen:165 locks:{ Global: { acquireCount: { r: 22 } }, Database: { acquireCount: { r: 3, R: 8 } }, Collection: { acquireCount: { r: 3 } } } protocol:op_query 1306ms
|
|
Generated at Thu Feb 08 04:08:04 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.