[SERVER-49919] sizes of indexes on replSet members are very different Created: 26/Jul/20  Updated: 05/Aug/20  Resolved: 05/Aug/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 4.0.12
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kay Agahd Assignee: Dmitry Agranat
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File collectionStats.txt    
Operating System: ALL
Participants:

 Description   

We are running a MongoDB v4.0 replica set on bare-metal servers. Recently one replSet member crashed and all its data was lost. To bring it back into the replSet, all databases had to be cloned. A few days later, the replSet member was up and running and in sync.
However, we observed that the size of the _id index of a (capped) collection on the resynced member was more than twice as large as on the primary (423 GB vs. 180 GB). The index sizes on the other secondary are quite similar to the primary's, though.
That's a problem because such a huge _id index no longer fits in RAM, since the server has "only" 376 GB of RAM.

Please see attached the collection stats of the primary and both secondaries.
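One way to spot such discrepancies is to collect `db.collection.stats().indexSizes` from each member and compare them. Below is a minimal sketch in plain JavaScript that operates on already-collected stats documents; the member names, sizes, and the 1.5x threshold are illustrative, not taken from the attached collectionStats.txt:

```javascript
// Compare indexSizes (in GB, as returned by db.collection.stats(1024*1024*1024))
// collected on each replica set member, and flag indexes whose sizes diverge.
function findIndexSizeOutliers(statsByMember, ratioThreshold) {
  const members = Object.keys(statsByMember);
  const outliers = [];
  // Collect every index name seen on any member.
  const indexNames = new Set();
  for (const m of members) {
    for (const idx of Object.keys(statsByMember[m].indexSizes)) indexNames.add(idx);
  }
  for (const idx of indexNames) {
    const sizes = members.map((m) => statsByMember[m].indexSizes[idx] || 0);
    const min = Math.min(...sizes);
    const max = Math.max(...sizes);
    if (min > 0 && max / min > ratioThreshold) {
      outliers.push({ index: idx, min, max, ratio: max / min });
    }
  }
  return outliers;
}

// Illustrative data mirroring the reported symptom: the resynced
// secondary's _id index is more than twice the size of the primary's.
const stats = {
  primary:    { indexSizes: { _id_: 180, offerId_1: 178 } },
  secondary1: { indexSizes: { _id_: 182, offerId_1: 180 } },
  secondary2: { indexSizes: { _id_: 423, offerId_1: 181 } }, // resynced member
};
console.log(findIndexSizeOutliers(stats, 1.5));
```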



 Comments   
Comment by Kay Agahd [ 02/Aug/20 ]

Hi Dima,
just to keep you up to date: I reduced the size of the capped collection to only 50%. The size of the _id index now looks correct:

changelog:SECONDARY> db.offerMapping.stats(1024*1024*1024)
{
	"ns" : "changelog.offerMapping",
	"size" : 11035,
	"count" : 5894490282,
	"avgObjSize" : 2010,
	"storageSize" : 1597,
	"capped" : true,
	"max" : -1,
	"maxSize" : 13252,
	"sleepCount" : 0,
	"sleepMS" : 0,
	"wiredTiger" : {
		"metadata" : {
			"formatVersion" : 1
		},
		"creationString" : "access_pattern_hint=none,allocation_size=4KB,app_metadata=(formatVersion=1),assert=(commit_timestamp=none,read_timestamp=none),block_allocation=best,block_compressor=zlib,cache_resident=false,checksum=on,colgroups=,collator=,columns=,dictionary=0,encryption=(keyid=,name=),exclusive=false,extractor=,format=btree,huffman_key=,huffman_value=,ignore_in_memory_cache_size=false,immutable=false,internal_item_max=0,internal_key_max=0,internal_key_truncate=true,internal_page_max=4KB,key_format=q,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,leaf_value_max=64MB,log=(enabled=true),lsm=(auto_throttle=true,bloom=true,bloom_bit_count=16,bloom_config=,bloom_hash_count=8,bloom_oldest=false,chunk_count_limit=0,chunk_max=5GB,chunk_size=10MB,merge_custom=(prefix=,start_generation=0,suffix=),merge_max=15,merge_min=0),memory_page_image_max=0,memory_page_max=10m,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=false,prefix_compression_min=4,source=,split_deepen_min_child=0,split_deepen_per_child=0,split_pct=90,type=file,value_format=u",
		"type" : "file",
		"uri" : "statistics:table:collection-3--5171798477577783663",
		"LSM" : {
			"bloom filter false positives" : 0,
			"bloom filter hits" : 0,
			"bloom filter misses" : 0,
			"bloom filter pages evicted from cache" : 0,
			"bloom filter pages read into cache" : 0,
			"bloom filters in the LSM tree" : 0,
			"chunks in the LSM tree" : 0,
			"highest merge generation in the LSM tree" : 0,
			"queries that could have benefited from a Bloom filter that did not exist" : 0,
			"sleep for LSM checkpoint throttle" : 0,
			"sleep for LSM merge throttle" : 0,
			"total size of bloom filters" : 0
		},
		"block-manager" : {
			"allocations requiring file extension" : 344313,
			"blocks allocated" : 346648,
			"blocks freed" : 2127,
			"checkpoint size" : NumberLong("1715160920064"),
			"file allocation unit size" : 4096,
			"file bytes available for reuse" : 65536,
			"file magic number" : 120897,
			"file major version number" : 1,
			"file size in bytes" : NumberLong("1715164864512"),
			"minor version number" : 0
		},
		"btree" : {
			"btree checkpoint generation" : 269,
			"column-store fixed-size leaf pages" : 0,
			"column-store internal pages" : 0,
			"column-store variable-size RLE encoded values" : 0,
			"column-store variable-size deleted values" : 0,
			"column-store variable-size leaf pages" : 0,
			"fixed-record size" : 0,
			"maximum internal page key size" : 368,
			"maximum internal page size" : 4096,
			"maximum leaf page key size" : 2867,
			"maximum leaf page size" : 32768,
			"maximum leaf page value size" : 67108864,
			"maximum tree depth" : 6,
			"number of key/value pairs" : 0,
			"overflow pages" : 0,
			"pages rewritten by compaction" : 0,
			"row-store internal pages" : 0,
			"row-store leaf pages" : 0
		},
		"cache" : {
			"bytes currently in the cache" : 106519129232,
			"bytes dirty in the cache cumulative" : 13711766090,
			"bytes read into cache" : 486465857124,
			"bytes written from cache" : 40355990103,
			"checkpoint blocked page eviction" : 0,
			"data source pages selected for eviction unable to be evicted" : 667,
			"eviction walk passes of a file" : 223637,
			"eviction walk target pages histogram - 0-9" : 23847,
			"eviction walk target pages histogram - 10-31" : 47354,
			"eviction walk target pages histogram - 128 and higher" : 0,
			"eviction walk target pages histogram - 32-63" : 43607,
			"eviction walk target pages histogram - 64-128" : 108829,
			"eviction walks abandoned" : 6241,
			"eviction walks gave up because they restarted their walk twice" : 0,
			"eviction walks gave up because they saw too many pages and found no candidates" : 16920,
			"eviction walks gave up because they saw too many pages and found too few candidates" : 5743,
			"eviction walks reached end of tree" : 28534,
			"eviction walks started from root of tree" : 28904,
			"eviction walks started from saved location in tree" : 194733,
			"hazard pointer blocked page eviction" : 73,
			"in-memory page passed criteria to be split" : 113369,
			"in-memory page splits" : 56504,
			"internal pages evicted" : 23969,
			"internal pages split during eviction" : 21,
			"leaf pages split during eviction" : 5015,
			"modified pages evicted" : 1036817,
			"overflow pages read into cache" : 0,
			"page split during eviction deepened the tree" : 0,
			"page written requiring cache overflow records" : 0,
			"pages read into cache" : 4523347,
			"pages read into cache after truncate" : 0,
			"pages read into cache after truncate in prepare state" : 0,
			"pages read into cache requiring cache overflow entries" : 0,
			"pages requested from the cache" : 1903568053,
			"pages seen by eviction walk" : 58249091,
			"pages written from cache" : 346116,
			"pages written requiring in-memory restoration" : 978599,
			"tracked dirty bytes in the cache" : 46571428,
			"unmodified pages evicted" : 3522688
		},
		"cache_walk" : {
			"Average difference between current eviction generation when the page was last considered" : 0,
			"Average on-disk page image size seen" : 0,
			"Average time in cache for pages that have been visited by the eviction server" : 0,
			"Average time in cache for pages that have not been visited by the eviction server" : 0,
			"Clean pages currently in cache" : 0,
			"Current eviction generation" : 0,
			"Dirty pages currently in cache" : 0,
			"Entries in the root page" : 0,
			"Internal pages currently in cache" : 0,
			"Leaf pages currently in cache" : 0,
			"Maximum difference between current eviction generation when the page was last considered" : 0,
			"Maximum page size seen" : 0,
			"Minimum on-disk page image size seen" : 0,
			"Number of pages never visited by eviction server" : 0,
			"On-disk page image sizes smaller than a single allocation unit" : 0,
			"Pages created in memory and never written" : 0,
			"Pages currently queued for eviction" : 0,
			"Pages that could not be queued for eviction" : 0,
			"Refs skipped during cache traversal" : 0,
			"Size of the root page" : 0,
			"Total number of pages currently in cache" : 0
		},
		"compression" : {
			"compressed pages read" : 4460150,
			"compressed pages written" : 341151,
			"page written failed to compress" : 0,
			"page written was too small to compress" : 4961
		},
		"cursor" : {
			"bulk-loaded cursor-insert calls" : 0,
			"close calls that result in cache" : 0,
			"create calls" : 106,
			"cursor operation restarted" : 0,
			"cursor-insert key and value bytes inserted" : 456994425766,
			"cursor-remove key bytes removed" : 0,
			"cursor-update value bytes updated" : 0,
			"cursors reused from cache" : 2340004,
			"insert calls" : 231462408,
			"modify calls" : 0,
			"next calls" : 0,
			"open cursor count" : 1,
			"prev calls" : 5,
			"remove calls" : 0,
			"reserve calls" : 0,
			"reset calls" : 656513468,
			"search calls" : 244373223,
			"search near calls" : 1,
			"truncate calls" : 0,
			"update calls" : 0
		},
		"reconciliation" : {
			"dictionary matches" : 0,
			"fast-path pages deleted" : 0,
			"internal page key bytes discarded using suffix compression" : 688781,
			"internal page multi-block writes" : 489,
			"internal-page overflow keys" : 0,
			"leaf page key bytes discarded using prefix compression" : 0,
			"leaf page multi-block writes" : 5178,
			"leaf-page overflow keys" : 0,
			"maximum blocks required for a page" : 1,
			"overflow values written" : 0,
			"page checksum matches" : 15013,
			"page reconciliation calls" : 1043081,
			"page reconciliation calls for eviction" : 983718,
			"pages deleted" : 53920
		},
		"session" : {
			"object compaction" : 0
		},
		"transaction" : {
			"update conflicts" : 0
		}
	},
	"nindexes" : 2,
	"totalIndexSize" : 181,
	"indexSizes" : {
		"_id_" : 91,
		"offerId_1" : 89
	},
	"ok" : 1
}


As you can see, it is now only 91 GB, which corresponds to the collection size being reduced to 50% (remember: it was 180 GB for the unreduced collection, and 423 GB for the "defective" unreduced collection, which was the reason for creating this ticket).
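As a back-of-the-envelope check (my own arithmetic, not from the ticket): with 5,894,490,282 documents, a 91 GiB _id index works out to roughly 16-17 bytes per index entry on disk, which is plausible for a compressed ObjectId index, whereas 423 GB over twice that document count would have implied a much larger per-entry footprint:

```javascript
// Rough sanity check of bytes per _id index entry, using the numbers
// from the stats output above (sizes reported in GiB via stats(1024*1024*1024)).
const docCount = 5894490282;           // "count" after halving the capped collection
const idIndexBytes = 91 * 1024 ** 3;   // "_id_" : 91 (GiB)
const bytesPerEntry = idIndexBytes / docCount;
console.log(bytesPerEntry.toFixed(1)); // roughly 16-17 bytes per entry
```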

It would be nice to have an explanation of how this oversized _id index could occur, but since it looks correct to me again, you may close this ticket as "Cannot Reproduce".

Comment by Kay Agahd [ 29/Jul/20 ]

Hi Dima, no, small(er) collections don't have this problem, as far as I can tell.

Comment by Dmitry Agranat [ 29/Jul/20 ]

Hi kay.agahd@idealo.de, I am not sure how a long resync process could impact just one _id index. Are you able to reproduce this outside your replica set with a smaller data set, say, a 1 GB index?

Comment by Kay Agahd [ 29/Jul/20 ]

Hi Dima,

your steps seem correct to me. Did I mention that the capped collection is 4.5 TB (27 TB uncompressed) in size? The cloning process took more than four days. Maybe the huge index size is due to this?

Comment by Dmitry Agranat [ 29/Jul/20 ]

Hi kay.agahd@idealo.de, I tried to reproduce this issue by using the following steps:

  • MongoDB 4.0.12, default settings, Replica set PSS
  • Create and populate a capped collection having 2 indexes, _id and an additional one
  • Check db.collection.stats() among all members
  • Kill one secondary, remove the contents of the dbPath directory and let it resync
  • Check db.collection.stats() among all members

I did not notice the issue you've reported. Could you review my steps and adjust them if needed? Alternatively, provide a repro of this issue outside of your replica set.

Thanks,
Dima

Comment by Kay Agahd [ 28/Jul/20 ]

Hi Dima,

By "had to be cloned" I meant the standard MongoDB process when a member without any data is brought back into the replSet: MongoDB automatically begins to clone all databases. You can observe the status of the cloning process in MongoDB's log file.
We have not yet noticed this difference on any non-capped collection, but I must admit that we never checked, so perhaps it also happened on non-capped collections without us noticing.

Comment by Dmitry Agranat [ 28/Jul/20 ]

Hi kay.agahd@idealo.de,

Could you please clarify what "had to be cloned" means? We are interested in understanding how exactly the data was restored.
Also, did you notice this difference on any non-capped collections?

Thanks,
Dima

Generated at Thu Feb 08 05:21:13 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.