[SERVER-21616] WiredTiger hangs when mongorestoring 2.8TB data Created: 21/Nov/15  Updated: 13/Feb/16  Resolved: 13/Feb/16

Status: Closed
Project: Core Server
Component/s: Admin, WiredTiger
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: guruditta golani Assignee: Ramon Fernandez Marina
Resolution: Duplicate Votes: 0
Labels: WTmem
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-22062 Foreground index build may hang 3.0.x... Closed
Operating System: ALL
Participants:

 Description   

When mongorestoring about 2.8 TB of data split across 4 MongoDB databases of almost equal size, mongorestore gets stuck once two of the 4 database restores move into the index restore phase. Around the same time RAM was fully utilized, but there was no OOM.
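
For illustration only, restoring four databases in parallel can be driven along the lines of this sketch; the database names, dump paths and host are placeholders, not the exact invocation used here:

for db in db1 db2 db3 db4; do
  # One mongorestore process per database dump directory, started concurrently.
  mongorestore --host localhost:27017 --db "$db" "/dumps/$db" &
done
wait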

Configuration:
MongoDB version: 3.0.7 with the WiredTiger storage engine
Server configuration: 4 cores, 32 GB RAM, Linux xxxhost 3.13.0-44-generic #73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Stack trace from the mongod process that seems to be spinning at 100% CPU:

#0  longest_match (s=s@entry=0xb3441000, cur_match=49810)
    at src/third_party/zlib-1.2.8/deflate.c:1159
#1  0x00000000013f7043 in deflate_slow (s=0xb3441000, flush=2)
    at src/third_party/zlib-1.2.8/deflate.c:1771
#2  0x00000000013f8f92 in deflate (strm=strm@entry=0x7fb4fc0b4240, flush=flush@entry=2)
    at src/third_party/zlib-1.2.8/deflate.c:903
#3  0x00000000012d9b84 in zlib_compress_raw (compressor=0x29d5100, session=0x2d1f8c0,
    page_max=<optimized out>, split_pct=<optimized out>, extra=<optimized out>,
    src=<optimized out>, offsets=0x1d5c2000, slots=59,
    dst=0x1b546040 "x\234\354[[o\033E\024\236\070m\323\004h\242B\313\v\bh\201\212JF\276\304I\fH\220u\n\255\224ʦNh\373\200\242\265=I\226ػ\226\275\216H\244B\021\022\002\t(\027\tqS\371\a\224\313\003\022\277", dst_len=324707, final=0, result_lenp=0x7fb4fc0b43c0, result_slotsp=0x7fb4fc0b43a0)
    at src/third_party/wiredtiger/ext/compressors/zlib/zlib_compress.c:284
#4  0x0000000001372580 in __rec_split_raw_worker (session=session@entry=0x2d1f8c0,
    r=r@entry=0x231cbe00, next_len=3864, no_more_rows=no_more_rows@entry=false)
    at src/third_party/wiredtiger/src/reconcile/rec_write.c:2353
#5  0x0000000001374695 in __rec_split_raw (next_len=<optimized out>, r=0x231cbe00,
    session=0x2d1f8c0) at src/third_party/wiredtiger/src/reconcile/rec_write.c:2617
#6  __rec_row_leaf_insert (session=session@entry=0x2d1f8c0, r=r@entry=0x231cbe00,
    ins=<optimized out>) at src/third_party/wiredtiger/src/reconcile/rec_write.c:4744
#7  0x00000000013769b4 in __rec_row_leaf (session=session@entry=0x2d1f8c0, r=r@entry=0x231cbe00,
    page=page@entry=0x455187f20, salvage=salvage@entry=0x0)
    at src/third_party/wiredtiger/src/reconcile/rec_write.c:4366
#8  0x000000000137860d in __wt_reconcile (session=session@entry=0x2d1f8c0, ref=0x3c9fa2f60,
    salvage=salvage@entry=0x0, flags=flags@entry=0)
    at src/third_party/wiredtiger/src/reconcile/rec_write.c:441
#9  0x000000000130cfb5 in __sync_file (syncop=16, session=0x2d1f8c0)
    at src/third_party/wiredtiger/src/btree/bt_sync.c:77
#10 __wt_cache_op (session=session@entry=0x2d1f8c0, ckptbase=ckptbase@entry=0x0, op=op@entry=16)
    at src/third_party/wiredtiger/src/btree/bt_sync.c:269
#11 0x0000000001399d4a in __checkpoint_write_leaves (cfg=0x7fb4fc0b4a30, session=0x2d1f8c0)
    at src/third_party/wiredtiger/src/txn/txn_ckpt.c:277
#12 __checkpoint_apply (op=0x13984a0 <__checkpoint_write_leaves>, cfg=0x7fb4fc0b4a30,
    session=0x2d1f8c0) at src/third_party/wiredtiger/src/txn/txn_ckpt.c:184
#13 __wt_txn_checkpoint (session=session@entry=0x2d1f8c0, cfg=cfg@entry=0x7fb4fc0b4a30)
    at src/third_party/wiredtiger/src/txn/txn_ckpt.c:407
#14 0x000000000138d2f6 in __session_checkpoint (wt_session=0x2d1f8c0, config=<optimized out>)
    at src/third_party/wiredtiger/src/session/session_api.c:955
#15 0x0000000001325a5a in __ckpt_server (arg=0x2d1f8c0)
    at src/third_party/wiredtiger/src/conn/conn_ckpt.c:95
#16 0x00007fb5006ef182 in start_thread (arg=0x7fb4fc0b5700) at pthread_create.c:312
#17 0x00007fb4ff1b647d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
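
(For reference, a backtrace like the one above can be captured by attaching gdb to the running mongod; this is a general sketch, not necessarily how the trace above was obtained. Note that attaching gdb briefly pauses the process.)

# Identify the hot thread first, then dump backtraces for all threads.
top -H -p "$(pidof mongod)"
gdb -batch -ex "thread apply all bt" -p "$(pidof mongod)" > mongod-backtraces.txt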

From the mongod.log file, the following error line seems pertinent:

Caught WriteConflictException doing insert on dbx.collectiony, attempt: 1 retrying

Marking this as P2, since the observed workaround is to mongorestore the databases serially, each approximately 700 GB in size.
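
A minimal sketch of the workaround, with placeholder database names and dump paths:

for db in db1 db2 db3 db4; do
  # Restore each ~700 GB database to completion before starting the next.
  mongorestore --db "$db" "/dumps/$db"
done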



 Comments   
Comment by Ramon Fernandez Marina [ 13/Feb/16 ]

guruditta, I'm closing this ticket as a duplicate of SERVER-22062, which is being worked on for 3.0.10. If you think you'll run into this issue again in this or other deployments, you may want to consider upgrading to 3.2, where the problem is fixed.

Thanks,
Ramón.

Comment by Ramon Fernandez Marina [ 30/Jan/16 ]

guruditta, we think we've identified the underlying cause of this issue in SERVER-22062; the stack trace looks different from the one above, so I'm not 100% sure yet, but the symptoms are the same.

Interestingly, I was not able to replicate this behavior with MongoDB 3.2.1: I had 4 index builds in parallel, and at the same time 10 threads inserting data, and I was getting about 100k inserts/second during the index builds, so you may want to consider upgrading to MongoDB 3.2, as it shows more performant and stable behavior.

Regards,
Ramón.

Comment by Ramon Fernandez Marina [ 21/Jan/16 ]

Apologies for the radio silence on this ticket, guruditta. The behavior you're seeing could be caused by a long-running transaction, but without a way to reproduce this locally it's hard to tell what the underlying cause is.

I'm going to build a dataset locally that matches the numbers you provided and attempt to reproduce this bug. Note, however, that issues like this one tend to be triggered by specific data distributions, so size alone may not be sufficient to reproduce it (see SERVER-20159 for a good example).
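
For example, documents comparable to the ~6.6 KB average object size reported in the stats below could be generated along these lines (database/collection names, field layout and document count are illustrative placeholders, scaled down):

mongo xx --eval '
  var payload = new Array(6600).join("x");
  var bulk = db.yy.initializeUnorderedBulkOp();
  for (var i = 0; i < 100000; i++) {
    // Field names mirror the indexed fields from the reported stats.
    bulk.insert({ capturetime: new Date(), data: { source: i % 100, timestamp: i }, pad: payload });
  }
  bulk.execute();
'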

Since you have a workaround, I assume you're not blocked and your system is back in operation. But since you're not able to upload the data, perhaps you can check whether MongoDB 3.2.1 exhibits the problem? The 3.2 series includes a significant number of improvements in general, and in WiredTiger in particular, so it would be useful for us to know whether this problem is still present in the latest version.

Thanks,
Ramón.

Comment by guruditta golani [ 23/Nov/15 ]

It was observed that WiredTiger hung in that state when two of the four databases were still restoring collections and the other two were recreating indexes from the dump metadata.

Mongorestoring the whole dataset with --noIndexRestore worked just fine.
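
(A minimal sketch of that approach; the dump path is a placeholder, and the index shown afterwards is illustrative only, based on the index names in the stats below.)

# Restore the data without recreating indexes.
mongorestore --noIndexRestore /dumps

# Indexes can then be built separately, e.g. from the mongo shell.
mongo xx --eval 'db.yy.createIndex({ capturetime: 1 }, { background: true })'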

Comment by guruditta golani [ 23/Nov/15 ]

Startup log

2015-11-22T19:33:46.467+0000 I CONTROL  [initandlisten] MongoDB starting : pid=5418 port=27017 dbpath=/var/lib/mongodb 64-bit host=xyz
2015-11-22T19:33:46.467+0000 I CONTROL  [initandlisten]
2015-11-22T19:33:46.467+0000 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2015-11-22T19:33:46.467+0000 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2015-11-22T19:33:46.467+0000 I CONTROL  [initandlisten]
2015-11-22T19:33:46.467+0000 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
2015-11-22T19:33:46.467+0000 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2015-11-22T19:33:46.467+0000 I CONTROL  [initandlisten]
2015-11-22T19:33:46.467+0000 I CONTROL  [initandlisten] db version v3.0.7
2015-11-22T19:33:46.467+0000 I CONTROL  [initandlisten] git version: 6ce7cbe8c6b899552dadd907604559806aa2e9bd
2015-11-22T19:33:46.467+0000 I CONTROL  [initandlisten] build info: Linux ip-a-b-c-d 3.13.0-24-generic #46-Ubuntu SMP Thu Apr 10 19:11:08 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49
2015-11-22T19:33:46.467+0000 I CONTROL  [initandlisten] allocator: tcmalloc
2015-11-22T19:33:46.467+0000 I CONTROL  [initandlisten] options: { config: "/etc/mongod.conf", net:{ bindIp: "0.0.0.0", port: 27017 }, processManagement: { pidFilePath: "/var/lib/mongodb/mongodb.pid" }, storage: { dbPath: "/var/lib/mongodb", engine: "wiredTiger", journal: { enabled: true }, wiredTiger: { collectionConfig: { blockCompressor: "zlib" }, engineConfig: { journalCompressor: "zlib" } }}, systemLog: { destination: "file", logAppend: true, path: "/var/log/mongodb/mongod.log" } }
2015-11-22T19:33:46.484+0000 I NETWORK  [initandlisten] waiting for connections on port 27017
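
(As an aside, the transparent huge pages warnings in the startup log above can be addressed before starting mongod, for example:)

# Run as root; this setting is not persistent across reboots.
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag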

I can't upload the data, but the characteristics of the databases are the same. Each has more than 10 collections, but one collection dominates, accounting for roughly 99% of the data size, indexes, and usage.

Here are the stats for one sample database:

> db.stats()
{
	"db" : "xx",
	"collections" : 58,
	"objects" : 190414927,
	"avgObjSize" : 6645.909801372873,
	"dataSize" : 1265480429677,
	"storageSize" : 107139239936,
	"numExtents" : 0,
	"indexes" : 255,
	"indexSize" : 17665167360,
	"ok" : 1
}
 
> db.yy.stats()
{
	"ns" : "xx.yy",
	"count" : 189316977,
	"size" : NumberLong("1263073512602"),
	"avgObjSize" : 6671,
	"storageSize" : 106810978304,
	"capped" : false,
	"wiredTiger" : {
		"metadata" : {
			"formatVersion" : 1
		},
		"creationString" : "allocation_size=4KB,app_metadata=(formatVersion=1),block_allocation=best,block_compressor=zlib,cache_resident=0,checkpoint=(WiredTigerCheckpoint.28718=(addr=\"01e4018dc69f81e490364e65e4018dc6a081e4e169faffe4018dc6a181e49a88bfbf808080e518de6e0fc0e518e5701fc0\",order=28718,time=1448030875,size=106928553984,write_gen=15578545)),checkpoint_lsn=(9866,42158208),checksum=on,collator=,columns=,dictionary=0,format=btree,huffman_key=,huffman_value=,id=3213,internal_item_max=0,internal_key_max=0,internal_key_truncate=,internal_page_max=4KB,key_format=q,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,leaf_value_max=64MB,memory_page_max=10m,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=0,prefix_compression_min=4,split_deepen_min_child=0,split_deepen_per_child=0,split_pct=90,value_format=u,version=(major=1,minor=1)",
		"type" : "file",
		"uri" : "statistics:table:collection-3210-124522488557442337",
		"LSM" : {
			"bloom filters in the LSM tree" : 0,
			"bloom filter false positives" : 0,
			"bloom filter hits" : 0,
			"bloom filter misses" : 0,
			"bloom filter pages evicted from cache" : 0,
			"bloom filter pages read into cache" : 0,
			"total size of bloom filters" : 0,
			"sleep for LSM checkpoint throttle" : 0,
			"chunks in the LSM tree" : 0,
			"highest merge generation in the LSM tree" : 0,
			"queries that could have benefited from a Bloom filter that did not exist" : 0,
			"sleep for LSM merge throttle" : 0
		},
		"block-manager" : {
			"file allocation unit size" : 4096,
			"blocks allocated" : 0,
			"checkpoint size" : 106928553984,
			"allocations requiring file extension" : 0,
			"blocks freed" : 0,
			"file magic number" : 120897,
			"file major version number" : 1,
			"minor version number" : 0,
			"file bytes available for reuse" : 36864,
			"file size in bytes" : 106810978304
		},
		"btree" : {
			"btree checkpoint generation" : 0,
			"column-store variable-size deleted values" : 0,
			"column-store fixed-size leaf pages" : 0,
			"column-store internal pages" : 0,
			"column-store variable-size leaf pages" : 0,
			"pages rewritten by compaction" : 0,
			"number of key/value pairs" : 0,
			"fixed-record size" : 0,
			"maximum tree depth" : 0,
			"maximum internal page key size" : 368,
			"maximum internal page size" : 4096,
			"maximum leaf page key size" : 3276,
			"maximum leaf page size" : 32768,
			"maximum leaf page value size" : 67108864,
			"overflow pages" : 0,
			"row-store internal pages" : 0,
			"row-store leaf pages" : 0
		},
		"cache" : {
			"bytes read into cache" : 3208,
			"bytes written from cache" : 0,
			"checkpoint blocked page eviction" : 0,
			"unmodified pages evicted" : 0,
			"page split during eviction deepened the tree" : 0,
			"modified pages evicted" : 0,
			"data source pages selected for eviction unable to be evicted" : 0,
			"hazard pointer blocked page eviction" : 0,
			"internal pages evicted" : 0,
			"pages split during eviction" : 0,
			"in-memory page splits" : 0,
			"overflow values cached in memory" : 0,
			"pages read into cache" : 1,
			"overflow pages read into cache" : 0,
			"pages written from cache" : 0
		},
		"compression" : {
			"raw compression call failed, no additional data available" : 0,
			"raw compression call failed, additional data available" : 0,
			"raw compression call succeeded" : 0,
			"compressed pages read" : 0,
			"compressed pages written" : 0,
			"page written failed to compress" : 0,
			"page written was too small to compress" : 0
		},
		"cursor" : {
			"create calls" : 0,
			"insert calls" : 0,
			"bulk-loaded cursor-insert calls" : 0,
			"cursor-insert key and value bytes inserted" : 0,
			"next calls" : 0,
			"prev calls" : 0,
			"remove calls" : 0,
			"cursor-remove key bytes removed" : 0,
			"reset calls" : 0,
			"search calls" : 0,
			"search near calls" : 0,
			"update calls" : 0,
			"cursor-update value bytes updated" : 0
		},
		"reconciliation" : {
			"dictionary matches" : 0,
			"internal page multi-block writes" : 0,
			"leaf page multi-block writes" : 0,
			"maximum blocks required for a page" : 0,
			"internal-page overflow keys" : 0,
			"leaf-page overflow keys" : 0,
			"overflow values written" : 0,
			"pages deleted" : 0,
			"page checksum matches" : 0,
			"page reconciliation calls" : 0,
			"page reconciliation calls for eviction" : 0,
			"leaf page key bytes discarded using prefix compression" : 0,
			"internal page key bytes discarded using suffix compression" : 0
		},
		"session" : {
			"object compaction" : 0,
			"open cursor count" : 0
		},
		"transaction" : {
			"update conflicts" : 0
		}
	},
	"nindexes" : 6,
	"totalIndexSize" : 10181746688,
	"indexSizes" : {
		"_id_" : 1921007616,
		"capturetime_1" : 1824616448,
		"data.source_1" : 753684480,
		"data.timestamp_1" : 1933316096,
		"data.source_1_capturetime_1" : 1781706752,
		"data.source_1_data.timestamp_1" : 1967415296,
	},
	"ok" : 1
}

Comment by Ramon Fernandez Marina [ 22/Nov/15 ]

Can you please upload server logs from startup (so we can see the options you're using) until the server gets stuck?

Also, can you please provide more information about the databases you're restoring? If you managed to successfully restore them individually, then the output of db.stats() and db.collection.getIndexes() for all collections would help us attempt to reproduce this issue locally.
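
For example, something along these lines would capture that output for one database (the database name and output file are placeholders):

# Dump db.stats() plus the index definitions of every collection.
mongo xx --quiet --eval '
  printjson(db.stats());
  db.getCollectionNames().forEach(function(c) {
    printjson(db.getCollection(c).getIndexes());
  });
' > stats-and-indexes.txt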

Alternatively, if there's a chance you can share your dataset with us, you can upload it privately and securely here. You'll need to split the files into 5 GiB chunks:

$ tar czf - your_dbpath | split -d -b 5G - part.

and then upload all the part.* files.

Thanks,
Ramón.
