[SERVER-19134] Primary can't restart after being killed
Created: 25/Jun/15  Updated: 03/Aug/15  Resolved: 03/Aug/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.0.3
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: DaixiShi Assignee: Ramon Fernandez Marina
Resolution: Incomplete Votes: 0
Labels: None

Issue Links:
Related
related to SERVER-19293 Unable to repair corrupted data after... Closed
related to WT-1919 Segfault in reconciliation during che... Closed

 Description   

I use MongoDB 3.0.3 on CentOS 6.4 with the WiredTiger storage engine.
I have a sharded cluster with 4 shards, each of which is a replica set with one primary and one arbiter.
I killed one of the primaries.
When I restarted it, it failed with the following log:

2015-06-26T00:02:22.361+0800 D SHARDING isInRangeTest passed
2015-06-26T00:02:22.361+0800 D NETWORK  [initandlisten] fd limit hard:65535 soft:65535 max conn: 52428
2015-06-26T00:02:22.383+0800 W -        [initandlisten] Detected unclean shutdown - /data2/mongodb3.0/shard02_4/data/mongod.lock is not empty.
2015-06-26T00:02:22.383+0800 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2015-06-26T00:02:22.383+0800 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=31G,session_max=20000,eviction=(threads_max=4),statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=zlib),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2015-06-26T00:05:33.153+0800 F -        [initandlisten] Invalid access at address: 0
2015-06-26T00:05:33.216+0800 F -        [initandlisten] Got signal: 11 (Segmentation fault).
 
 0xf6a889 0xf69f02 0xf6a25e 0x7f42870f0710 0x13812c9 0x1383215 0x1354424 0x1351e2d 0x130d333 0x13204ce 0x1320c8c 0x1319e82 0x13a4aa2 0x1398876 0x13a813c 0x1335ee1 0x132fe83 0xd72b8b 0xd70a28 0xa8104d 0x7f3e62 0x7f93c4 0x7f4285b70d5d 0x7f1bbd
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B6A889"},{"b":"400000","o":"B69F02"},{"b":"400000","o":"B6A25E"},{"b":"7F42870E1000","o":"F710"},{"b":"400000","o":"F812C9"},{"b":"400000","o":"F83215"},{"b":"400000","o":"F54424"},{"b":"400000","o":"F51E2D"},{"b":"400000","o":"F0D333"},{"b":"400000","o":"F204CE"},{"b":"400000","o":"F20C8C"},{"b":"400000","o":"F19E82"},{"b":"400000","o":"FA4AA2"},{"b":"400000","o":"F98876"},{"b":"400000","o":"FA813C"},{"b":"400000","o":"F35EE1"},{"b":"400000","o":"F2FE83"},{"b":"400000","o":"972B8B"},{"b":"400000","o":"970A28"},{"b":"400000","o":"68104D"},{"b":"400000","o":"3F3E62"},{"b":"400000","o":"3F93C4"},{"b":"7F4285B52000","o":"1ED5D"},{"b":"400000","o":"3F1BBD"}],"processInfo":{ "mongodbVersion" : "3.0.3", "gitVersion" : "b40106b36eecd1b4407eb1ad1af6bc60593c6105", "uname" : { "sysname" : "Linux", "release" : "2.6.32-504.el6.x86_64", "version" : "#1 SMP Wed Oct 15 04:27:16 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "108A63CA14A4BD5E599BAC10885DBD3A85DA5439" }, { "b" : "7FFFF4645000", "elfType" : 3, "buildId" : "08E42C6C3D2CD1E5D68A43B717C9EB3D310F2DF0" }, { "b" : "7F42870E1000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "B8DFF8E53D9F2B80C3C382E83EC17C828B536A39" }, { "b" : "7F4286E75000", "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "461F2BAE0A837A77E768EE843E127853B266E78E" }, { "b" : "7F4286A92000", "path" : "/usr/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "0EC1D06F20ED0D15B5380D27566A0847699480BB" }, { "b" : "7F428688A000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "583411D8786F86A1D6B8741C502831E6122445A7" }, { "b" : "7F4286686000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "454F8FC6CC6502C6401E5F9E221564D80665D277" }, { "b" : "7F4286380000", "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3, "buildId" : "F07F2E7CF4BFB393CC9BBE8CDC6463652E14DB07" }, { "b" : "7F42860FC000", "path" : "/lib64/libm.so.6", "elfType" : 
3, "buildId" : "7D8E9374F4A4EA38A7C1E763F32240EA113E4208" }, { "b" : "7F4285EE6000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "246C3BAB0AB093AFD59D34C8CBF29E786DE4BE97" }, { "b" : "7F4285B52000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "494ED3AE74E066D8F23DD697D3D95D4F8317ADBE" }, { "b" : "7F42872FE000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "29EA72F5B09ADC9E955977F5AEFDA978C9AE4C03" }, { "b" : "7F428590E000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "B7F7FF323B3A4A12310A6285412F01ACE8C74E47" }, { "b" : "7F4285628000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "7920917F74AFAD0B8CB197CABBE472AF39D94C34" }, { "b" : "7F4285424000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "8CE28F280150E62296240E70ECAC64E4A57AB826" }, { "b" : "7F42851F8000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "05733977F4E41652B86070B27A0CFC2C1EA7719D" }, { "b" : "7F4284FE2000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "5FA8E5038EC04A774AF72A9BB62DC86E1049C4D6" }, { "b" : "7F4284DD7000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "C8D01C2839F6950988CE32B4266A8F89C521ACB0" }, { "b" : "7F4284BD4000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "AF374BAFB7F5B139A0B431D3F06D82014AFF3251" }, { "b" : "7F42849BA000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "BF14F341632720944391ED5088196F8836AF4115" }, { "b" : "7F428479B000", "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "E6798A06BEE17CF102BBA44FD512FF8B805CEAF1" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf6a889]
 mongod(+0xB69F02) [0xf69f02]
 mongod(+0xB6A25E) [0xf6a25e]
 libpthread.so.0(+0xF710) [0x7f42870f0710]
 mongod(+0xF812C9) [0x13812c9]
 mongod(__wt_reconcile+0x1B5) [0x1383215]
 mongod(__wt_evict+0x104) [0x1354424]
 mongod(__wt_evict_page+0x2D) [0x1351e2d]
 mongod(__wt_page_in_func+0x553) [0x130d333]
 mongod(+0xF204CE) [0x13204ce]
 mongod(__wt_tree_walk+0x2CC) [0x1320c8c]
 mongod(__wt_cache_op+0x112) [0x1319e82]
 mongod(__wt_txn_checkpoint+0x402) [0x13a4aa2]
 mongod(+0xF98876) [0x1398876]
 mongod(__wt_txn_recover+0x4BC) [0x13a813c]
 mongod(__wt_connection_workers+0x61) [0x1335ee1]
 mongod(wiredtiger_open+0x11B3) [0x132fe83]
 mongod(_ZN5mongo18WiredTigerKVEngineC1ERKSsS2_bb+0x2EB) [0xd72b8b]
 mongod(+0x970A28) [0xd70a28]
 mongod(_ZN5mongo23GlobalEnvironmentMongoD22setGlobalStorageEngineERKSs+0x30D) [0xa8104d]
 mongod(_ZN5mongo13initAndListenEi+0x422) [0x7f3e62]
 mongod(main+0x134) [0x7f93c4]
 libc.so.6(__libc_start_main+0xFD) [0x7f4285b70d5d]
 mongod(+0x3F1BBD) [0x7f1bbd]
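The hex address list printed just before "BEGIN BACKTRACE" appears to be, frame by frame, the base ("b") plus offset ("o") from the JSON. Below is a minimal Python sketch, assuming that interpretation, that recomputes the absolute addresses from an abridged copy of the JSON above. The names RAW and absolute_addresses are hypothetical helpers for illustration only. For frames inside mongod (loaded at base 0x400000), such addresses are what one would typically hand to a symbolizer like `addr2line -e mongod <address>` run against the exact same 3.0.3 binary.

```python
import json

# Abridged copy of the "backtrace" JSON from the log above: the first four
# frames and the last frame only (assumption: the rest follow the same shape).
RAW = '''{"backtrace":[{"b":"400000","o":"B6A889"},{"b":"400000","o":"B69F02"},
{"b":"400000","o":"B6A25E"},{"b":"7F42870E1000","o":"F710"},
{"b":"400000","o":"3F1BBD"}]}'''


def absolute_addresses(raw: str) -> list:
    """Return base + offset for each frame; both fields are hex strings."""
    frames = json.loads(raw)["backtrace"]
    return [int(f["b"], 16) + int(f["o"], 16) for f in frames]


if __name__ == "__main__":
    # Print in the same lowercase 0x... form mongod uses in its address list.
    for addr in absolute_addresses(RAW):
        print(hex(addr))
```

Run against the abridged JSON, this reproduces the first three addresses of the printed list (0xf6a889, 0xf69f02, 0xf6a25e), the libpthread frame (0x7f42870f0710), and the final frame (0x7f1bbd).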

I tried repairing it, but that also failed with the following log:

2015-06-25T16:53:23.198+0800 I INDEX    [initandlisten] build index on: MediaDissector.HtmlRawData properties: { v: 1, key: { CrawlTime: 1.0 }, name: "CrawlTime_1", ns: "MediaDissector.HtmlRawData" }
2015-06-25T16:53:23.198+0800 I INDEX    [initandlisten]          building index using bulk method
2015-06-25T19:26:21.968+0800 I INDEX    [initandlisten]          done building bottom layer, going to commit
2015-06-25T19:26:30.903+0800 I -        [initandlisten] Invariant failure rs.get() src/mongo/db/catalog/database.cpp 186
2015-06-25T19:26:30.942+0800 I CONTROL  [initandlisten] 
 0xf6a889 0xf08321 0xeec092 0x911291 0x913073 0x915d50 0xc011c4 0xc021ec 0x7f4484 0x7f93c4 0x7f204c518d5d 0x7f1bbd
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B6A889"},{"b":"400000","o":"B08321"},{"b":"400000","o":"AEC092"},{"b":"400000","o":"511291"},{"b":"400000","o":"513073"},{"b":"400000","o":"515D50"},{"b":"400000","o":"8011C4"},{"b":"400000","o":"8021EC"},{"b":"400000","o":"3F4484"},{"b":"400000","o":"3F93C4"},{"b":"7F204C4FA000","o":"1ED5D"},{"b":"400000","o":"3F1BBD"}],"processInfo":{ "mongodbVersion" : "3.0.3", "gitVersion" : "b40106b36eecd1b4407eb1ad1af6bc60593c6105", "uname" : { "sysname" : "Linux", "release" : "2.6.32-504.el6.x86_64", "version" : "#1 SMP Wed Oct 15 04:27:16 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "108A63CA14A4BD5E599BAC10885DBD3A85DA5439" }, { "b" : "7FFFE22DD000", "elfType" : 3, "buildId" : "08E42C6C3D2CD1E5D68A43B717C9EB3D310F2DF0" }, { "b" : "7F204DA89000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "B8DFF8E53D9F2B80C3C382E83EC17C828B536A39" }, { "b" : "7F204D81D000", "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "461F2BAE0A837A77E768EE843E127853B266E78E" }, { "b" : "7F204D43A000", "path" : "/usr/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "0EC1D06F20ED0D15B5380D27566A0847699480BB" }, { "b" : "7F204D232000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "583411D8786F86A1D6B8741C502831E6122445A7" }, { "b" : "7F204D02E000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "454F8FC6CC6502C6401E5F9E221564D80665D277" }, { "b" : "7F204CD28000", "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3, "buildId" : "F07F2E7CF4BFB393CC9BBE8CDC6463652E14DB07" }, { "b" : "7F204CAA4000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "7D8E9374F4A4EA38A7C1E763F32240EA113E4208" }, { "b" : "7F204C88E000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "246C3BAB0AB093AFD59D34C8CBF29E786DE4BE97" }, { "b" : "7F204C4FA000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "494ED3AE74E066D8F23DD697D3D95D4F8317ADBE" }, { "b" : "7F204DCA6000", 
"path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "29EA72F5B09ADC9E955977F5AEFDA978C9AE4C03" }, { "b" : "7F204C2B6000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "B7F7FF323B3A4A12310A6285412F01ACE8C74E47" }, { "b" : "7F204BFD0000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "7920917F74AFAD0B8CB197CABBE472AF39D94C34" }, { "b" : "7F204BDCC000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "8CE28F280150E62296240E70ECAC64E4A57AB826" }, { "b" : "7F204BBA0000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "05733977F4E41652B86070B27A0CFC2C1EA7719D" }, { "b" : "7F204B98A000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "5FA8E5038EC04A774AF72A9BB62DC86E1049C4D6" }, { "b" : "7F204B77F000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "C8D01C2839F6950988CE32B4266A8F89C521ACB0" }, { "b" : "7F204B57C000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "AF374BAFB7F5B139A0B431D3F06D82014AFF3251" }, { "b" : "7F204B362000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "BF14F341632720944391ED5088196F8836AF4115" }, { "b" : "7F204B143000", "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "E6798A06BEE17CF102BBA44FD512FF8B805CEAF1" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf6a889]
 mongod(_ZN5mongo10logContextEPKc+0xE1) [0xf08321]
 mongod(_ZN5mongo15invariantFailedEPKcS1_j+0xB2) [0xeec092]
 mongod(_ZN5mongo8Database30_getOrCreateCollectionInstanceEPNS_16OperationContextERKNS_10StringDataE+0xE1) [0x911291]
 mongod(_ZN5mongo8DatabaseC1EPNS_16OperationContextERKNS_10StringDataEPNS_20DatabaseCatalogEntryE+0x1E3) [0x913073]
 mongod(_ZN5mongo14DatabaseHolder6openDbEPNS_16OperationContextERKNS_10StringDataEPb+0x150) [0x915d50]
 mongod(+0x8011C4) [0xc011c4]
 mongod(_ZN5mongo14repairDatabaseEPNS_16OperationContextEPNS_13StorageEngineERKSsbb+0x101C) [0xc021ec]
 mongod(_ZN5mongo13initAndListenEi+0xA44) [0x7f4484]
 mongod(main+0x134) [0x7f93c4]
 libc.so.6(__libc_start_main+0xFD) [0x7f204c518d5d]
 mongod(+0x3F1BBD) [0x7f1bbd]
-----  END BACKTRACE  -----
2015-06-25T19:26:30.942+0800 I -        [initandlisten] 
 
***aborting after invariant() failure

Does anyone know how to restart this primary node?



 Comments   
Comment by Ramon Fernandez Marina [ 03/Aug/15 ]

tottishi05, we haven't heard back from you for a while, so I'm going to close this ticket. We believe this is an instance of WT-1919, which was fixed for 3.0.4. At the time of this writing the latest stable MongoDB release is 3.0.5, so if this is still an issue for you I'd recommend you upgrade to 3.0.5. If you're still running into the same issue please let us know so we can reopen this ticket; if you run into some other issue feel free to open a new ticket.

Regards,
Ramón.

Comment by Ramon Fernandez Marina [ 30/Jun/15 ]

tottishi05, a replica set with only one data-bearing node is not a recommended configuration, because if that node fails there is no redundancy.

That being said, can you please upload the full logs for this server? Ideally they should start at the "SERVER RESTARTED" line that precedes the segmentation fault above and run until now (this should include the repair attempts, the second log snippet above, and everything afterwards).

One thing you can try is starting the server with a smaller cache size, for example by using the following command-line switch: --wiredTigerCacheSizeGB=1. If you try that, please upload the logs afterwards so we can see the effects of this attempt.

Thanks,
Ramón.

Comment by DaixiShi [ 26/Jun/15 ]

I upgraded from 3.0.3 to 3.0.4, but unfortunately it didn't work either!
As it is the only primary for that shard, how can I recover it?
Without this shard, the whole cluster can't work!

Comment by DaixiShi [ 26/Jun/15 ]

@Fernandez
Thanks, Ramon Fernandez. I will try upgrading to 3.0.4.

Comment by Ramon Fernandez Marina [ 25/Jun/15 ]

tottishi05, you may be running into a variant of SERVER-18316. The preferred way to recover this node is to upgrade it to 3.0.4 and resync it from the current primary. Can you please try this procedure and report back?

I'd also recommend you upgrade the rest of your nodes to 3.0.4 at your earliest convenience.

Thanks,
Ramón.

Generated at Thu Feb 08 03:49:57 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.