[SERVER-10630] Speed of cleanupOldData while chunk balancing Created: 27/Aug/13  Updated: 11/Jul/16  Resolved: 07/Nov/13

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.4.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Steffen Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 12.04.1 LTS
3.2.0-32-generic #51-Ubuntu SMP Wed Sep 26 21:33:09 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
MongoDB 2.4.3


Operating System: ALL
Participants:

 Description   

We started to shard one more of the big collections in our database. The database has 26 collections, some of which are already sharded.
Now every night (UTC) we let the balancer run:

{ "_id" : "balancer", "activeWindow" : { "start" : "18:00", "stop" : "7:00" }, "stopped" : false }

The collection we just added has around 140 million documents.
"avgObjSize" : 378.40800250149164,
"size" : 52424250472,

What we now see is that the home shard is doing its cleanup rounds outside the balancer window.

As a result we see a lot of writes and reads on this collection via mongotop.
We profiled the access patterns and think that >80% of the writes come from the cleanup job.

Some dbtop output (web interface) for this collection:

total        Reads        Writes      Queries     GetMores   Inserts  Updates   Removes
2259 84.9%   1987 49.9%   272 34.9%   682 37.9%   5  2.7%    0 0%     40 8.3%   0 0%
2320 84.1%   1479 47.9%   841 36.3%   530 28.9%   3 11.3%    0 0%      6 0.2%   0 0%

In the logfile of the server process (primary) we find the following entry:

Tue Aug 27 15:08:25.610 [cleanupOldData-5219670bedeed3fdea9d337b] moveChunk starting delete for: database.CollectionToshard from { targetUid: -5232965359423252304 } -> { targetUid: -5219148617130848963 }
....
Tue Aug 27 15:32:58.264 [cleanupOldData-5219670bedeed3fdea9d337b] Helpers::removeRangeUnlocked time spent waiting for replication: 526999ms
Tue Aug 27 15:32:58.264 [cleanupOldData-5219670bedeed3fdea9d337b] moveChunk deleted 92419 documents for database.CollectionToshard from { targetUid: -5232965359423252304 } -> { targetUid: -5219148617130848963 }

Every cleanup pass deletes around 90k documents in ~24 minutes; of that time, the log above shows roughly 527 seconds (almost 9 minutes) spent just waiting for replication. This is very slow, and we suffer from periodic bursts of high I/O writes. During these bursts the mongod service is slow and reads and some writes queue up (monitored via mongostat).

Is the cleanup job really this aggressive on I/O?
Why is this cleanup not done while the balancer runs?
Is there a way to check the status of this cleanup job? (A rough shell check is sketched below.)
Is there a way to throttle the cleanup job?
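
The closest we could come to a status check from the shell is to list in-progress operations on the namespace; a rough sketch only, since it is unclear to us whether the cleanupOldData thread itself shows up here (the log lines above seem to be the most reliable indicator):

// List current operations touching the sharded collection.
db.currentOp(true).inprog.forEach(function (op) {
    if (op.ns === "database.CollectionToshard") {
        printjson(op);
    }
});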

Thanks in advance,
Steffen



 Comments   
Comment by ning [ 14/Nov/13 ]

Can we set the speed at which cleanupOldData removes records?
If it is too fast, we get heavy load.

Comment by David Storch [ 07/Nov/13 ]

Hi Steffen, it looks like the issues outlined in this ticket have been addressed, so I'm resolving as fixed. Please feel free to re-open if you experience related issues again.

Comment by Steffen [ 19/Sep/13 ]

We found an issue with our replica set setup and an arbiter. This leads to the following problem.

void ReplSetConfig::setMajority() {
        int total = members.size();
        int nonArbiters = total;
        int strictMajority = total/2+1;
 
        for (vector<MemberCfg>::iterator it = members.begin(); it < members.end(); it++) {
            if ((*it).arbiterOnly) {
                nonArbiters--;
            }
        }
 
        // majority should be all "normal" members if we have something like 4
        // arbiters & 3 normal members
        _majority = (strictMajority > nonArbiters) ? nonArbiters : strictMajority;
    }

In our setup (three data-bearing members plus one arbiter) this function computes total = 4, strictMajority = 4/2 + 1 = 3 and nonArbiters = 3, so _majority = 3. This means every physical machine in our replica set has to be caught up on replication.
From https://github.com/mongodb/mongo/blob/r2.4.6/src/mongo/s/d_migrate.cpp

            {
                // 4. do bulk of mods
                state = CATCHUP;
 
.
.
.
            {
                // pause to wait for replication
                // this will prevent us from going into critical section until we're ready
                Timer t;
                while ( t.minutes() < 600 ) {
                    log() << "Waiting for replication to catch up before entering critical section"
                          << endl;
                    if ( flushPendingWrites( lastOpApplied ) )
                        break;
                    sleepsecs(1);
                }
            }

We hit this for every moveChunk operation.
We therefore removed all arbiters from the replica set. Now the majority calculation comes out to 2, so our hidden (backup) node no longer counts towards the replication-lag check.
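
For anyone hitting the same thing, the removal itself is just the standard shell helpers; a minimal sketch, where "arbiterhost:30000" is a placeholder for the arbiter's real host:port from rs.conf():

// Run on the primary: drop the arbiter from the replica set config.
rs.remove("arbiterhost:30000")

// Afterwards, check that only the three data-bearing members remain and
// see how far each secondary lags behind the primary.
rs.conf()
rs.printSlaveReplicationInfo()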

Comment by Daniel Pasette (Inactive) [ 02/Sep/13 ]

Hi Steffen,
Rebalancing data across shards can be costly, especially because you will be moving data that is not in your working set and must necessarily be accessed from disk. You can try to narrow the balancing window further to reduce the spillover of deletes, but there is not much I can suggest to make your disk i/o perform better. On the bright side, once the data is redistributed you should be out of the woods. Looking back at MMS, it appears page faults and queues have been easing on repset2 primary.

Regarding the 4-member replica set: it doesn't hurt so much as it doesn't really help. If you lose 2 members from either a 4-member or a 3-member replica set, the result is the same: your set will not be able to accept writes.

See: http://docs.mongodb.org/manual/core/replica-set-architecture-three-members/

Comment by Steffen [ 30/Aug/13 ]

FYI, we are using tag-aware sharding to control where the sharded collections remain.
We have 10 shards. Shards 5-10 have the tag "Refind".
The collection has a tag range

tag: Refind  { "targetUid" : { "$minKey" : 1 } } -->> { "targetUid" : { "$maxKey" : 1 } }

So all our sharded databases/collections live on shards 5-10 only.
Shards 1-4 hold the primary (home) databases and the non-sharded collections.
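
Roughly, the tag setup was created with the standard shell helpers; a sketch, where the shard names ("shard0005", ...) are placeholders for our real shard names and the namespace is the one from the log above:

// Tag the shards that should hold the sharded collections.
sh.addShardTag("shard0005", "Refind")
// ... repeated for the remaining shards 6-10 ...

// Pin the whole key range of the collection to the "Refind" shards.
sh.addTagRange(
    "database.CollectionToshard",
    { targetUid: MinKey },
    { targetUid: MaxKey },
    "Refind"
)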

Comment by Steffen [ 30/Aug/13 ]
  • We started sharding this collection on the 20th of August.
    We are worried about the performance of the primary node; it queues up reads and writes.
    This extra bit of stress keeps our application (Java) struggling every time we hit the I/O hard. And this happens even when the balancer is not running, only the delayed cleanup thread.
  • Does this have a negative impact on the replica set?
Comment by Daniel Pasette (Inactive) [ 30/Aug/13 ]
  • Thanks for the note on the documentation error. I'll make sure to get that fixed ASAP.
  • There is no current way to change the writeConcern for _secondaryThrottle. If you're worried about throttling the number of writes to your replica set, you don't want to lower the write concern, as this will put more pressure on your system. It appears your sharded collections are quite well balanced and I don't see much evidence of heavy migrations. Looking at the repset2 graphs for the last month it appears that starting on around 8/20, something in your application changed dramatically which is adding quite a bit more stress to your cluster (or at least this replica set). Locking, iowait and page faults have all shot up. This usually indicates your disk is having trouble keeping up.
  • Regarding your replica set configuration, having a slow hidden secondary node is usually fine. But why do you have the extra arbiter? You never want an even number of nodes in your replica set.
Comment by Steffen [ 29/Aug/13 ]

Is there a way to specify the write concern for _secondaryThrottle?
I would like to set it to w:true or w:1, since our replica set looks like the following:

2 fast machines (1 primary, 1 secondary)
1 slow machine (hidden secondary)
1 arbiter on a different host

Comment by Steffen [ 29/Aug/13 ]

BTW, there is a syntax error in the documentation at http://docs.mongodb.org/manual/tutorial/configure-sharded-cluster-balancer/#require-replication-before-chunk-migration-secondary-throttle

The closing bracket after the $set document is misplaced:

use config
db.settings.update( { "_id" : "balancer" } , { $set : { "_secondaryThrottle" : true } , { upsert : true } } )

Working:

use config
db.settings.update( { "_id" : "balancer" } , { $set : { "_secondaryThrottle" : true } } , { upsert : true } )

Comment by Steffen [ 29/Aug/13 ]

It's the shard "repset2". This is the home shard of the database named refind.

Comment by Daniel Pasette (Inactive) [ 29/Aug/13 ]

Hi Steffen,

There are a couple settings which control the aggressiveness of chunk migrations and cleanup.

Starting in v2.4 the _secondaryThrottle option was turned on by default to send each insert and delete with a write concern of w:2 to try and mitigate shard migration and cleanup costs. See: http://docs.mongodb.org/manual/tutorial/configure-sharded-cluster-balancer/#require-replication-before-chunk-migration-secondary-throttle

Also, starting in 2.2.1, migration cleanups are performed asynchronously so that migrations can continue while the old data is still being deleted from the donor shard. This can result in deletions "leaking" past the balancer window.
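
For completeness, two shell helpers tell you whether the balancer is enabled and whether a balancing round is in progress at a given moment; because of the asynchronous cleanup described above, deletes can still be running even when both report the balancer as idle:

sh.getBalancerState()    // is the balancer enabled at all?
sh.isBalancerRunning()   // is a balancing round in progress right now?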

I see you have MMS monitoring enabled. Can you identify a particular shard where this issue is occurring?
