[SERVER-5514] Data not balanced across all the shards Created: 05/Apr/12  Updated: 15/Aug/12  Resolved: 09/Apr/12

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 2.0.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Preetham Derangula Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux


Attachments: Zip Archive config.zip     Text File mongos.log_20120405_953am.txt     Text File printShardingStatus.txt    
Issue Links:
Related
related to SERVER-1136 Online defragCollection command Closed
related to SERVER-2640 Automatically remove old moveChunk bs... Closed
Operating System: Linux
Participants:

 Description   

I have 7 shards, 1 router, and 1 config server, all running on VMs. One of the shards is being allocated excessive data. This is the same machine that was acting up and exhausted all its disk space in a previous issue I raised; we deleted all the data and config and started from a clean slate.
Could this machine have some problem that gives it an affinity for taking more data? If so, what could be the reason?

I have raised a related issue that details the history of this problem:
https://jira.mongodb.org/browse/SERVER-5433
Thanks
**************************************
OUTPUTS for STATS
***********************************
mongostat for the machine that's acting up (lrchb00363)

insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
     4      0      7      0       0      15       0  5.95g  12.5g  4.21g      0     0.2          0       0|0     0|0     4k     4k   127   09:31:56
    13      0      9      0       0      23       0  5.95g  12.5g   4.2g      0     0.4          0       0|0     0|0    10k     3k   127   09:31:57

Observation: note the mapped and vsize values, 5.95g and 12.5g.
##################################
mongostat for the machine that's normal (lrchb00363)

insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
     9      0      6      0       0      16       0  1.95g  4.42g  1.48g      1     1.3          0       0|0     0|0     6k     3k   129   09:33:24
    16      0     14      0       0      31       0  1.95g  4.42g  1.48g      0     0.6          0       0|0     0|0    13k     4k   129   09:33:25

Observation: note the mapped and vsize values, 1.95g and 4.42g.
All the other shards have similar mapped and vsize values, except lrchb00319.
I am concerned that this odd shard will take up all the data and then stop functioning once it runs out of disk space. Please comment.
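As a sanity check on whether the imbalance is in chunk placement at all, here is a minimal mongo shell sketch (run against the mongos) that tallies chunks per shard from the config metadata; "audit.events" is only a placeholder namespace, substitute whichever collection is actually sharded:

// Tally chunks per shard from the config metadata (run via the mongos).
// "audit.events" is a placeholder -- substitute the real sharded namespace.
var counts = {};
db.getSiblingDB("config").chunks.find({ ns: "audit.events" }).forEach(function (c) {
    counts[c.shard] = (counts[c.shard] || 0) + 1;
});
printjson(counts);   // roughly equal counts per shard => the balancer is keeping chunks even

If the chunk counts come out even while mapped/vsize do not, the difference is in on-disk file allocation rather than in where the documents live.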

*****************************
Output of db.stats()
*****************************

{
        "raw" : {
                "LRCHB00319:40001" : {
                        "db" : "audit",
                        "collections" : 9,
                        "objects" : 1094648,
                        "avgObjSize" : 1018.3885486476017,
                        "dataSize" : 1114776988,
                        "storageSize" : 1346035712,
                        "numExtents" : 80,
                        "indexes" : 20,
                        "indexSize" : 233449328,
                        "fileSize" : 4226809856,
                        "nsSizeMB" : 16,
                        "ok" : 1
                },
                "LRCHB00362:40004" : {
                        "db" : "audit",
                        "collections" : 5,
                        "objects" : 904531,
                        "avgObjSize" : 1135.2065810900897,
                        "dataSize" : 1026829544,
                        "storageSize" : 1104818176,
                        "numExtents" : 33,
                        "indexes" : 12,
                        "indexSize" : 213589824,
                        "fileSize" : 4226809856,
                        "nsSizeMB" : 16,
                        "ok" : 1
                },
                "LRCHB00363:40005" : {
                        "db" : "audit",
                        "collections" : 5,
                        "objects" : 1239329,
                        "avgObjSize" : 1268.8326473438449,
                        "dataSize" : 1572501096,
                        "storageSize" : 4183908352,
                        "numExtents" : 56,
                        "indexes" : 12,
                        "indexSize" : 290534160,
                        "fileSize" : 8519680000,
                        "nsSizeMB" : 16,
                        "ok" : 1
                },
                "LRCHB00364:40006" : {
                        "db" : "audit",
                        "collections" : 5,
                        "objects" : 893827,
                        "avgObjSize" : 1169.2595547013013,
                        "dataSize" : 1045115760,
                        "storageSize" : 1076518912,
                        "numExtents" : 45,
                        "indexes" : 12,
                        "indexSize" : 210908096,
                        "fileSize" : 4226809856,
                        "nsSizeMB" : 16,
                        "ok" : 1
                },
                "LRCHB00365:40002" : {
                        "db" : "audit",
                        "collections" : 5,
                        "objects" : 848153,
                        "avgObjSize" : 1184.0515048582035,
                        "dataSize" : 1004256836,
                        "storageSize" : 1167663104,
                        "numExtents" : 50,
                        "indexes" : 12,
                        "indexSize" : 201129600,
                        "fileSize" : 4226809856,
                        "nsSizeMB" : 16,
                        "ok" : 1
                },
                "LRCHB00366:40003" : {
                        "db" : "audit",
                        "collections" : 5,
                        "objects" : 891586,
                        "avgObjSize" : 1092.6191169444114,
                        "dataSize" : 974163908,
                        "storageSize" : 1135775744,
                        "numExtents" : 37,
                        "indexes" : 12,
                        "indexSize" : 211300544,
                        "fileSize" : 4226809856,
                        "nsSizeMB" : 16,
                        "ok" : 1
                },
                "LRCHB00374:40007" : {
                        "db" : "audit",
                        "collections" : 5,
                        "objects" : 1013103,
                        "avgObjSize" : 1202.7945075673451,
                        "dataSize" : 1218554724,
                        "storageSize" : 1347837952,
                        "numExtents" : 38,
                        "indexes" : 12,
                        "indexSize" : 240954896,
                        "fileSize" : 4226809856,
                        "nsSizeMB" : 16,
                        "ok" : 1
                }
        },
        "objects" : 6885177,
        "avgObjSize" : 1155.5547309822246,
        "dataSize" : 7956198856,
        "storageSize" : 11362557952,
        "numExtents" : 339,
        "indexes" : 92,
        "indexSize" : 1601866448,
        "fileSize" : 33880539136,
        "ok" : 1



 Comments   
Comment by Preetham Derangula [ 09/Apr/12 ]

Thanks a lot!

Comment by Scott Hernandez (Inactive) [ 09/Apr/12 ]

Also, SERVER-1136.

Comment by Preetham Derangula [ 09/Apr/12 ]

Thanks. SERVER-2640 is concerned with cleaning up the moveChunk folder. How about running repairDatabase() on each shard periodically, so that each shard takes up an optimal amount of space? Is that also in the plans? If so, is there an issue tracking it?

Comment by Scott Hernandez (Inactive) [ 09/Apr/12 ]

Yes: SERVER-2640

Comment by Preetham Derangula [ 09/Apr/12 ]

One other question: I can write a maintenance job to repair the database and clear the moveChunk files periodically, but do you have any plans to solve this internally within the Mongo subsystems in future releases? Just curious and greedy.

Comment by Scott Hernandez (Inactive) [ 09/Apr/12 ]

Yes, you can remove those files periodically. Probably worth keeping a few days (or longer).
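A hedged sketch only: the migration backup files live in a moveChunk directory under each shard's dbpath (the exact layout can vary by version), and the shell's native listFiles()/removeFile() helpers act on the local filesystem, so this would run in a mongo shell started on the shard host itself. Filtering by file age, to keep the last few days as suggested above, is left out here:

// Remove moveChunk backup files on this shard host.
// "/data/db" is an assumed dbpath -- adjust to the actual deployment.
var backupDir = "/data/db/moveChunk";
listFiles(backupDir).forEach(function (entry) {
    // entries may be per-namespace subdirectories holding the .bson backups
    var files = entry.isDirectory ? listFiles(entry.name) : [entry];
    files.forEach(function (f) {
        if (!f.isDirectory) {
            print("removing " + f.name);
            removeFile(f.name);
        }
    });
});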

Comment by Preetham Derangula [ 09/Apr/12 ]

Thanks. Repairing the individual shard was a good idea and it has shrunk the storage size.
One more thing: the moveChunk folder is holding data. Does it need to be cleaned up regularly?

Comment by Scott Hernandez (Inactive) [ 07/Apr/12 ]

Just on the one shard to start with.

The total for the database files allocated (on all shards) is around 33GB, but only 11GB of data (storage) is used. Because this is spread across many shards, and each might have 2-4 GB of extra space in pre-allocated files, you can see how these numbers aren't going to accurately reflect the real data size when added up for the sharded cluster.

Comment by Preetham Derangula [ 06/Apr/12 ]

Repair just shard LRCHB00363, or the whole shard network? If you look at LRCHB00363, it has more than 11GB.
Thanks.
On LRCHB00363
***************************
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
    10      0     15      0       0      25       0  13.9g  28.5g  6.55g      0     0.5          0       0|0     0|0    17k     4k   139   10:37:08
     3      0      5      0       0       9       0  13.9g  28.5g  6.55g      0
On other shards
******************

insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
     0      0      0      0       0       3       0  3.95g  8.35g  2.82g      0       0          0       0|0     0|0   182b     2k   129   10:37:25
     0      0      0      0       0       1       0  3.95g  8.35g  2.82g      0

Comment by Scott Hernandez (Inactive) [ 06/Apr/12 ]

It looks like the number of chunks is even across the shards, but shard3 (LRCHB00363) has more files allocated and more data stored. Can you run a database repair to see if you have fragmentation there?

Also, with so little data (less than 11GB for the sharded DB) you will not see even file storage across the shards, because of pre-allocation: http://www.mongodb.org/display/DOCS/Excessive+Disk+Space
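A minimal sketch of the suggested repair, connecting directly to the affected shard's mongod (the LRCHB00363:40005 address is taken from the db.stats() output above); note that repairDatabase blocks the node while it runs and needs free disk space roughly on the order of the current data size:

// Run the repair directly against the shard's mongod, not through the mongos.
var shard = new Mongo("LRCHB00363:40005");            // host:port from db.stats() above
var audit = shard.getDB("audit");
printjson(audit.runCommand({ repairDatabase: 1 }));   // expect { "ok" : 1 } on success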

Comment by Preetham Derangula [ 05/Apr/12 ]

Attached

Comment by Scott Hernandez (Inactive) [ 05/Apr/12 ]

Please attach the results of db.printShardingStatus(), the logs from mongos, and a mongodump of the config database.
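For completeness, a rough sketch of gathering those diagnostics (the config server address and the dump path are placeholders):

// In a mongo shell connected to the mongos:
db.printShardingStatus();                 // sharding metadata and chunk distribution
// The mongos log is wherever --logpath points on the mongos host.
// From a command line, dump the config database (placeholder host and path):
//   mongodump --host <config-server-host>:<config-port> --db config --out ./config_dump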
