[SERVER-17996] Zlib decompression error on WiredTiger Created: 10/Apr/15  Updated: 04/Jun/15  Resolved: 15/May/15

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Kam Lall Assignee: Ramon Fernandez Marina
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-17713 WiredTiger using zlib compression can... Closed
Operating System: ALL
Steps To Reproduce:

Do a full backup of the cluster, perhaps? (The procedure involves stopping the balancer, locking the shards, and taking a file system backup.)

Description

We ran into the following problem on db version v3.0.2-rc0.

This may or may not be related, but what I was doing at the time was a backup, which involves stopping the balancer, locking the shards, and taking a file system backup.

https://jira.mongodb.org/browse/SERVER-17930
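
For context, the balancer-stop / shard-lock backup procedure described above typically looks something like the mongo shell sketch below. This is a minimal sketch under stated assumptions, not the reporter's exact commands: the ordering is illustrative, and whether the fsyncLock step is usable depends on the storage engine and server version.

// Sketch of a balancer-stop + lock backup; steps 2-4 run on each shard.

// 1. On a mongos: stop the balancer so no chunk migrations run mid-backup.
sh.stopBalancer()
sh.getBalancerState()    // should now report false

// 2. On each shard's backup node: flush pending writes and block new ones
//    before copying files.
db.fsyncLock()

// 3. Outside the shell, take the file system backup of the dbPath
//    (e.g. a volume snapshot or a plain copy of /data, per the logs below).

// 4. Once the copy finishes, unlock each node and restart the balancer.
db.fsyncUnlock()
sh.startBalancer()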



Comments
Comment by Sam Kleinman (Inactive) [ 20/Apr/15 ]

I think you're correct that there are no more invalid zlib streams in your system after you've re-synced all members of the set with 3.0.2 (or later), and the procedure that you describe would have accomplished this goal. Sorry that you hit this issue.

I'm going to go ahead and close this ticket now, but if you experience any other related issues, feel free to open another ticket in the future.

Regards,
sam

Comment by Kam Lall [ 13/Apr/15 ]

Given our two shards, this is our situation.

In each shard, the secondary's data has been wiped out and synced with a healthy primary.

In each shard, after syncing was done, I swapped the primary and secondary. Then I wiped the new secondary's data and let it sync with the new primary.
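
For reference, the primary/secondary swap described here is typically done with rs.stepDown(); a minimal sketch of the swap-and-resync, assuming the /data dbPath from the logs below and an illustrative 60-second step-down window:

// On the current primary: step down so the freshly synced node can take over.
rs.stepDown(60)    // renounce primary status for 60 seconds while an election runs
rs.status()        // confirm the roles have flipped

// On the demoted node (the new secondary), from the OS shell rather than mongo:
//   1. stop mongod
//   2. wipe its data directory (the dbPath, /data in this deployment)
//   3. restart mongod; it then performs an initial sync from the new primary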

My assumption at this point is that our data should have been “created” only by Mongo 3.0.2, which means we should expect no more corruption.

Is that a valid assumption?

Comment by Kam Lall [ 12/Apr/15 ]

Actually, tonight the primary crashed with the same problem and became the secondary. I wiped this new secondary's data clean and am letting it sync with the new primary.

Hopefully we'll see no more crashes of this type, but I'm still wondering whether the original corruption is still there.

Comment by Kam Lall [ 10/Apr/15 ]

Hmm, for this particular replica set we didn't have a crash on the secondary on pre-3.0.1, so I didn't wipe it and resync it from a healthy primary.

But the secondary on the other shard did have the problem pre-3.0.1, so I did wipe it, because it had the corruption.

About 30 minutes ago I did a sync from the healthy primary, and the secondary is running fine.

However, a question on my mind is whether the primary also has the corrupted data, and the secondary is now simply replicating that corruption.

Is there any tool to check if any data on the primary is corrupted? Why wouldn't the primary have the corruption first? Is there something specific about the secondary with this problem?

What happened in our case pre-3.0.1 is that the secondary crashed and then, several hours later, the primary also crashed with the same problem, so I'm worried that the primary has this problem too.
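
Regarding the question above about tooling: one built-in spot-check is the validate command, which walks a collection's data and index structures and reports whether they are consistent. A minimal sketch, with hypothetical database and collection names; note that a full validation reads every document and can be expensive on a live primary:

// Validate a single collection; "mydb" and "mycoll" are hypothetical names.
use mydb
db.mycoll.validate(true)    // full validation: scans all documents and indexes

// Equivalent runCommand form; the result includes { valid: <bool>, errors: [ ... ] }
db.runCommand({ validate: "mycoll", full: true })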

Comment by Ramon Fernandez Marina [ 10/Apr/15 ]

klall, did you restore your data from backups or re-sync from a healthy primary before starting this node with 3.0.2-rc0? If I understand SERVER-17713 correctly, if a pre-3.0.2 mongod created the invalid zlib stream, I don't think 3.0.2 can somehow "fix" that.

Comment by Ramon Fernandez Marina [ 10/Apr/15 ]

Thanks for your prompt response klall. I re-formatted your post so the logs are easier to read in JIRA, and I'm looking at the logs now.

Comment by Kam Lall [ 10/Apr/15 ]

Sure, here is what we have in the logs:

[root@ip-10-0-8-202 log]# cat mongod.log 
2015-04-10T20:52:47.702+0000 W -        [initandlisten] Detected unclean shutdown - /data/mongod.lock is not empty.
2015-04-10T20:52:47.702+0000 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2015-04-10T20:52:47.702+0000 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=8G,session_max=20000,eviction=(threads_max=4),statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=zlib),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2015-04-10T20:52:48.724+0000 I STORAGE  [initandlisten] Starting WiredTigerRecordStoreThread local.oplog.rs
2015-04-10T20:52:48.731+0000 I CONTROL  [initandlisten] MongoDB starting : pid=14896 port=27017 dbpath=/data 64-bit host=ip-10-0-8-202
2015-04-10T20:52:48.731+0000 I CONTROL  [initandlisten] db version v3.0.2-rc0
2015-04-10T20:52:48.731+0000 I CONTROL  [initandlisten] git version: 7ecaf439780d60e9e6b316ddf90904bd9a61726d
2015-04-10T20:52:48.731+0000 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.0-fips 29 Mar 2010
2015-04-10T20:52:48.731+0000 I CONTROL  [initandlisten] build info: Linux ip-10-63-154-59 3.4.43-43.43.amzn1.x86_64 #1 SMP Mon May 6 18:04:41 UTC 2013 x86_64 BOOST_LIB_VERSION=1_49
2015-04-10T20:52:48.731+0000 I CONTROL  [initandlisten] allocator: tcmalloc
2015-04-10T20:52:48.731+0000 I CONTROL  [initandlisten] options: { config: "/etc/mongod.conf", net: { bindIp: "0.0.0.0" }, processManagement: { fork: true, pidFilePath: "/var/run/mongodb/mongod.pid" }, replication: { replSet: "ib3_prod_rs1" }, security: { authorization: "enabled", keyFile: "/etc/mongodb-keyfile" }, storage: { dbPath: "/data", engine: "wiredTiger", wiredTiger: { collectionConfig: { blockCompressor: "zlib" }, engineConfig: { journalCompressor: "zlib" } } }, systemLog: { destination: "file", logAppend: true, path: "/log/mongod.log" } }
2015-04-10T20:52:48.736+0000 I NETWORK  [initandlisten] waiting for connections on port 27017
2015-04-10T20:52:48.751+0000 I NETWORK  [initandlisten] connection accepted from 10.0.21.26:6171 #1 (1 connection now open)
2015-04-10T20:52:48.775+0000 I REPL     [ReplicationExecutor] New replica set config in use: { _id: "ib3_prod_rs1", version: 7, members: [ { _id: 0, host: "ip-10-0-6-138:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "ip-10-0-8-202.ec2.internal:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "ip-10-0-6-5.ec2.internal:27018", arbiterOnly: true, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatTimeoutSecs: 10, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 } } }
2015-04-10T20:52:48.775+0000 I REPL     [ReplicationExecutor] This node is ip-10-0-8-202.ec2.internal:27017 in the config
2015-04-10T20:52:48.775+0000 I REPL     [ReplicationExecutor] transition to STARTUP2
2015-04-10T20:52:48.775+0000 I REPL     [ReplicationExecutor] Starting replication applier threads
2015-04-10T20:52:48.776+0000 I REPL     [ReplicationExecutor] transition to RECOVERING
2015-04-10T20:52:48.778+0000 I REPL     [ReplicationExecutor] transition to SECONDARY
2015-04-10T20:52:48.794+0000 I REPL     [ReplicationExecutor] Member ip-10-0-6-138:27017 is now in state PRIMARY
2015-04-10T20:52:48.798+0000 I REPL     [ReplicationExecutor] Member ip-10-0-6-5.ec2.internal:27018 is now in state ARBITER
2015-04-10T20:52:49.260+0000 I NETWORK  [initandlisten] connection accepted from 10.0.21.26:6172 #2 (2 connections now open)
2015-04-10T20:52:49.276+0000 I ACCESS   [conn2] Successfully authenticated as principal __system on local
2015-04-10T20:52:49.788+0000 I NETWORK  [initandlisten] connection accepted from 10.0.6.138:39897 #3 (3 connections now open)
2015-04-10T20:52:49.805+0000 I ACCESS   [conn3] Successfully authenticated as principal __system on local
2015-04-10T20:52:50.179+0000 I NETWORK  [initandlisten] connection accepted from 10.0.6.5:22140 #4 (4 connections now open)
2015-04-10T20:52:50.196+0000 I ACCESS   [conn4] Successfully authenticated as principal __system on local
2015-04-10T20:52:51.252+0000 I NETWORK  [initandlisten] connection accepted from 10.0.6.5:22141 #5 (5 connections now open)
2015-04-10T20:52:51.268+0000 I ACCESS   [conn5] Successfully authenticated as principal __system on local
2015-04-10T20:52:51.418+0000 I NETWORK  [initandlisten] connection accepted from 10.0.13.119:35659 #6 (6 connections now open)
2015-04-10T20:52:51.434+0000 I ACCESS   [conn6] Successfully authenticated as principal __system on local
2015-04-10T20:52:51.777+0000 I REPL     [ReplicationExecutor] syncing from: ip-10-0-6-138:27017
2015-04-10T20:52:51.794+0000 I REPL     [SyncSourceFeedback] replset setting syncSourceFeedback to ip-10-0-6-138:27017
2015-04-10T20:52:52.256+0000 I NETWORK  [initandlisten] connection accepted from 10.0.19.83:62012 #7 (7 connections now open)
2015-04-10T20:52:52.281+0000 I ACCESS   [conn7] Successfully authenticated as principal __system on local
2015-04-10T20:52:52.286+0000 I NETWORK  [initandlisten] connection accepted from 10.0.8.239:26015 #8 (8 connections now open)
2015-04-10T20:52:52.317+0000 I ACCESS   [conn8] Successfully authenticated as principal __system on local
2015-04-10T20:52:53.222+0000 E STORAGE  [WiredTigerRecordStoreThread for local.oplog.rs] WiredTiger (0) [1428699173:222755][14896:0x7f0f96bc7700], file:collection-6--2298561192753711689.wt, cursor.next: zlib error: inflate: data error: -3
2015-04-10T20:52:53.222+0000 E STORAGE  [WiredTigerRecordStoreThread for local.oplog.rs] WiredTiger (0) [1428699173:222834][14896:0x7f0f96bc7700], file:collection-6--2298561192753711689.wt, cursor.next: file:collection-6--2298561192753711689.wt: encountered an illegal file format or internal value
2015-04-10T20:52:53.222+0000 E STORAGE  [WiredTigerRecordStoreThread for local.oplog.rs] WiredTiger (-31804) [1428699173:222854][14896:0x7f0f96bc7700], file:collection-6--2298561192753711689.wt, cursor.next: the process must exit and restart: WT_PANIC: WiredTiger library panic
2015-04-10T20:52:53.222+0000 I -        [WiredTigerRecordStoreThread for local.oplog.rs] Fatal Assertion 28558
2015-04-10T20:52:53.234+0000 I CONTROL  [WiredTigerRecordStoreThread for local.oplog.rs] 
 0xf6b4a9 0xf0bca1 0xef0821 0xd9584a 0x1399979 0x1399b35 0x1399fd4 0x1309f27 0x130e575 0x130b873 0x131eb1e 0x131f2ec 0x12f56c6 0x133bd89 0xd8395f 0xd894b9 0xef3460 0xfb8a04 0x7f0fa32f2df3 0x7f0fa1da11ad
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B6B4A9"},{"b":"400000","o":"B0BCA1"},{"b":"400000","o":"AF0821"},{"b":"400000","o":"99584A"},{"b":"400000","o":"F99979"},{"b":"400000","o":"F99B35"},{"b":"400000","o":"F99FD4"},{"b":"400000","o":"F09F27"},{"b":"400000","o":"F0E575"},{"b":"400000","o":"F0B873"},{"b":"400000","o":"F1EB1E"},{"b":"400000","o":"F1F2EC"},{"b":"400000","o":"EF56C6"},{"b":"400000","o":"F3BD89"},{"b":"400000","o":"98395F"},{"b":"400000","o":"9894B9"},{"b":"400000","o":"AF3460"},{"b":"400000","o":"BB8A04"},{"b":"7F0FA32EB000","o":"7DF3"},{"b":"7F0FA1CAB000","o":"F61AD"}],"processInfo":{ "mongodbVersion" : "3.0.2-rc0", "gitVersion" : "7ecaf439780d60e9e6b316ddf90904bd9a61726d", "uname" : { "sysname" : "Linux", "release" : "3.14.35-28.38.amzn1.x86_64", "version" : "#1 SMP Wed Mar 11 22:50:37 UTC 2015", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "8CBFAAEBE7925E866FF84582F73E7BF89BF23A57" }, { "b" : "7FFC63EAC000", "elfType" : 3, "buildId" : "8DD7A7F645E37D599573A937DBAA3E931F55DE40" }, { "b" : "7F0FA32EB000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "D48D3E6672A77B603B402F661BABF75E90AD570B" }, { "b" : "7F0FA307E000", "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "F711D67FF0C1FE2222FB003A30AB74DA26A5EF41" }, { "b" : "7F0FA2C99000", "path" : "/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "777069F5EECC26CD66C5C8390FA2BF4E444979D1" }, { "b" : "7F0FA2A91000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "E81013CBFA409053D58A65A0653271AB665A4619" }, { "b" : "7F0FA288D000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "62A8842157C62F95C3069CBF779AFCC26577A99A" }, { "b" : "7F0FA2584000", "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3, "buildId" : "66F1CF311C61879639BD3DC0034DEE0D6D042261" }, { "b" : "7F0FA2282000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "5F97F8F8E5024E29717CF35998681F84D4A22D45" }, { "b" : "7F0FA206C000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "C52958E393BDF8E8D090F36DE0F4E620D8736FBF" }, { "b" : "7F0FA1CAB000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "77E48C0DBAE0843560558F1E51A5FC61602ABC5F" }, { "b" : "7F0FA3507000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "6F90843B9087FE91955FEB0355EB0858EF9E97B2" }, { "b" : "7F0FA1A68000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "72C1DB5E2447A90D1BF34065BCC031B7263FFBAC" }, { "b" : "7F0FA1783000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "2B8787E8E0C317CF820E5D830D923BC744E497F4" }, { "b" : "7F0FA1580000", "path" : "/usr/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "622F315EB5CB2F791E9B64020692EBA98195D06D" }, { "b" : "7F0FA1355000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "B10FBFEC246C4EAD1719D16090D0BE54904BBFC9" }, { "b" : "7F0FA113F000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "89C6AF118B6B4FB6A73AE1813E2C8BDD722956D1" }, { "b" : "7F0FA0F34000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "7292C0673D7C116E3389D3FFA67087A6B9287A71" }, { "b" : "7F0FA0D31000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "37A58210FA50C91E09387765408A92909468D25B" }, { "b" : "7F0FA0B17000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "17295EFA7A45ADE184C2A7A3294460BDC8B43277" }, { "b" : "7F0FA08F6000", "path" : "/usr/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "F5054DC94443326819FBF3065CFDF5E4726F57EE" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf6b4a9]
 mongod(_ZN5mongo10logContextEPKc+0xE1) [0xf0bca1]
 mongod(_ZN5mongo13fassertFailedEi+0x61) [0xef0821]
 mongod(+0x99584A) [0xd9584a]
 mongod(__wt_eventv+0x489) [0x1399979]
 mongod(__wt_err+0x95) [0x1399b35]
 mongod(__wt_panic+0x24) [0x1399fd4]
 mongod(__wt_bt_read+0x437) [0x1309f27]
 mongod(__wt_cache_read+0x1C5) [0x130e575]
 mongod(__wt_page_in_func+0x403) [0x130b873]
 mongod(+0xF1EB1E) [0x131eb1e]
 mongod(__wt_tree_walk+0x2CC) [0x131f2ec]
 mongod(__wt_btcur_next+0x2C6) [0x12f56c6]
 mongod(+0xF3BD89) [0x133bd89]
 mongod(_ZN5mongo21WiredTigerRecordStore27cappedDeleteAsNeeded_inlockEPNS_16OperationContextERKNS_8RecordIdE+0x18F) [0xd8395f]
 mongod(+0x9894B9) [0xd894b9]
 mongod(_ZN5mongo13BackgroundJob7jobBodyEv+0x120) [0xef3460]
 mongod(+0xBB8A04) [0xfb8a04]
 libpthread.so.0(+0x7DF3) [0x7f0fa32f2df3]
 libc.so.6(clone+0x6D) [0x7f0fa1da11ad]
-----  END BACKTRACE  -----
2015-04-10T20:52:53.234+0000 I -        [WiredTigerRecordStoreThread for local.oplog.rs] 
 
***aborting after fassert() failure

Comment by Ramon Fernandez Marina [ 10/Apr/15 ]

klall, SERVER-17930 is a dup of SERVER-17713, which should be fixed in 3.0.2 (and 3.0.2-rc0).

Can you please send us the logs of the mongod instance(s) where you observed this behavior?

Thanks,
Ramón.
