[SERVER-35431] rollback does not correct sizeStorer data sizes Created: 05/Jun/18  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Judah Schvimer Assignee: Backlog - Storage Execution Team
Resolution: Unresolved Votes: 0
Labels: pm-1820
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-11113 Shard maxSize should be more accurate... Closed
Related
related to SERVER-31020 Sharding database creation is slow be... Open
related to SERVER-34977 subtract capped deletes from fastcoun... Closed
related to DOCS-11792 Document that collection/database dat... Closed
is related to SERVER-35565 Change capped collection age-out to b... Closed
Assigned Teams:
Storage Execution
Participants:
Linked BF Score: 18

 Description   

We currently keep the data size the same when we recover to a stable timestamp, instead of correcting it as we do with counts: https://github.com/mongodb/mongo/blob/f757bc52b926943bc748f0dc33173ab16e980f61/src/mongo/db/repl/storage_interface_impl.cpp#L1025-L1028

This means that the size reported in collStats will be wrong. It can also have the side effect of slowly decreasing the effective size of a capped collection, since the system will think it is more full than it actually is. Running validate will fix the size.
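One way to observe the discrepancy is to compare the cached size reported by collStats before and after validate. A minimal sketch with pymongo, assuming a local mongod and an illustrative collection test.coll (the connection string and collection name are placeholders):

```python
# Sketch: compare the cached data size from collStats with the value after
# validate recomputes it. Assumes a local mongod and a collection "test.coll".
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["test"]

stats = db.command("collStats", "coll")
print("cached dataSize (size):", stats["size"])       # the cached/tracked data size
print("storageSize on disk:   ", stats["storageSize"])

# Per the description above, validate recalculates and repairs the cached
# count/size that a rollback can leave stale.
result = db.command("validate", "coll")
print("validate ok:", result["valid"])

stats_after = db.command("collStats", "coll")
print("dataSize after validate:", stats_after["size"])
```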



 Comments   
Comment by Geert Bosch [ 05/Feb/20 ]

This ticket really is two issues:

  • Fix dataSize to be correct in the presence of crashes/rollbacks. That's work that falls on the Storage Execution team. It's a significant chunk of work, but something I think we'll need to do.
  • Decide whether storageSize or dataSize is the better metric to use for decisions on where to place data, etc. I think that dataSize is generally better as storageSize can differ significantly between nodes based on their history.

A newly added node may have significantly less fragmentation and better compression than a long-lived node that has processed lots of remove and update operations. Deciding chunk migration based on storageSize could lead to unstable behavior where chunks move back and forth depending on which node of a replica set is used to find the storageSize of a collection. Additionally, dataSize is important because it determines memory pressure for data access. If we balanced two shards to both have a storageSize of 100 GB, but one uncompressed to 300 GB and the other to 600 GB, the latter node would likely perform much worse because it can cache a much smaller fraction of its data. The expectation is that over time storage sizes will balance out.
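A quick back-of-the-envelope illustration of the cache argument above (the 100/300/600 GB figures are from the comment; the cache size is an assumed example value, not anything prescribed by the server):

```python
# Illustration: two nodes with equal storageSize but different dataSize cache
# very different fractions of their data under the same cache budget.
GiB = 1024 ** 3
cache_size = 50 * GiB  # assumed example cache size

nodes = {
    "node A": {"storageSize": 100 * GiB, "dataSize": 300 * GiB},
    "node B": {"storageSize": 100 * GiB, "dataSize": 600 * GiB},
}

for name, sizes in nodes.items():
    cacheable = min(1.0, cache_size / sizes["dataSize"])
    print(f"{name}: storageSize={sizes['storageSize'] // GiB} GiB, "
          f"dataSize={sizes['dataSize'] // GiB} GiB, "
          f"cacheable fraction ~= {cacheable:.0%}")
```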

Comment by Kaloian Manassiev [ 29/Aug/18 ]

The enableSharding command uses the totalSize field from listDatabases. Looking at listDatabases, this value is derived from DatabaseCatalogEntry::sizeOnDisk, which eventually calls into RecordStore::storageSize.

So I guess the answer to your question is that the primary shard selection uses storageSize and not dataSize.
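For reference, the fields involved can be inspected directly with listDatabases (a sketch against a local deployment; sizeOnDisk and totalSize are the fields returned by the command):

```python
# Sketch: inspect the listDatabases output that primary-shard selection is
# based on (per the comment above, this reflects storageSize, not dataSize).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
out = client.admin.command("listDatabases")

for d in out["databases"]:
    print(d["name"], "sizeOnDisk:", d["sizeOnDisk"])
print("totalSize:", out["totalSize"])
```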

Comment by Alyson Cabral (Inactive) [ 28/Aug/18 ]

spencer or kaloian.manassiev, do we know whether we choose the primary shard for a database based on dataSize or storageSize?

https://docs.mongodb.com/manual/core/sharded-cluster-shards/

I believe we spoke about this in person, but just so it's captured here: in addition to capped collections ending up with an incorrect size, these numbers are also used in balancing.

Comment by Michael Cahill (Inactive) [ 13/Jun/18 ]

We can address this issue for capped collections without dramatic changes such as SERVER-35565. In particular, for capped collections we can correct the data size during rollback: since the only permitted operations are inserts, we have the sizes of the inserted documents being rolled back. A sketch of this bookkeeping follows below.
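A minimal sketch of the capped-collection correction, in illustrative Python (the rolled-back-entry shape and the notion of a cached size here are placeholders, not the server's actual interfaces):

```python
# Sketch: exact data-size correction for a capped collection during rollback.
# Only inserts are being rolled back, so subtracting their byte sizes from the
# cached data size yields the correct value. Entry shape is illustrative.
def corrected_capped_size(current_data_size: int, rolled_back_inserts: list) -> int:
    rolled_back_bytes = sum(len(entry["raw_document_bytes"]) for entry in rolled_back_inserts)
    return max(0, current_data_size - rolled_back_bytes)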

For general collections, we could reduce the drift by (a) accounting for inserts that are rolled back and (b) estimating the effect of deletes on the data size (e.g., by assuming each deleted document has the average document size). We don't have enough information (either in the oplog or efficiently available in WiredTiger) to handle all size-changing updates, but we should be able to avoid systematic drift.
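And a sketch of the drift-reducing estimate for general collections (again illustrative, not server code): rolled-back inserts are subtracted exactly, while rolled-back deletes add back an estimate based on the average document size, since the deleted documents' real sizes are not available.

```python
# Sketch: estimated data-size adjustment for a general collection during rollback.
# Rolling back an insert removes its bytes; rolling back a delete restores a
# document whose size we can only estimate (here, the average document size).
def estimated_size_after_rollback(current_data_size: int,
                                  current_count: int,
                                  rolled_back_insert_sizes: list,
                                  num_rolled_back_deletes: int) -> int:
    avg_doc_size = current_data_size // current_count if current_count else 0
    adjustment = -sum(rolled_back_insert_sizes) + num_rolled_back_deletes * avg_doc_size
    return max(0, current_data_size + adjustment)
```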

Comment by Gregory McKeon (Inactive) [ 07/Jun/18 ]

spencer to follow up with milkie to see if there's a possible fix for this in the storage layer.
