[SERVER-1780] "doing delete inline" blocks the whole cluster Created: 12/Sep/10  Updated: 16/Nov/21  Resolved: 30/Sep/10

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 1.6.2, 1.7.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Sergei Tulentsev Assignee: Eliot Horowitz (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: Linux
Participants:

 Description   

In the very best case, the "doing delete inline" phase after a chunk migration takes at least several seconds.

When the amount of data is substantial, it can block for more than an hour, pretty much rendering the whole cluster useless and making queries pile up in the queue.

It does some very intensive I/O. Does it do data compaction of some sort?

Since our chunks of data are roughly the same size, could we just mark the space free and then rewrite it later?
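
While the delete is running, the operations piling up behind it can be observed from a second connection. A minimal sketch, assuming the mongo shell and an example shard host/port (not taken from this report):

mongo shard-host:27018/admin --eval 'printjson(db.currentOp())'   # lists in-progress and queued operations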



 Comments   
Comment by Guanhai Wang [ 09/Jul/12 ]

Hi, Scott, I am very sorry that I couldn't give you more useful details. The sharded cluster is a production one with high traffic, so I didn't capture those stats and logs when the problem occurred this afternoon, and I am not going to try to reproduce it. I have switched the balancer activeWindow to low-traffic times. If it appears again, I will let you know. Thank you very much!
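
For reference, a balancing window like this is normally set by updating the balancer document in the config database from a mongos. A rough sketch, with placeholder host and window times; check the documentation for your release before relying on it:

mongo mongos-host:27017/config --eval '
db.settings.update(
    { _id: "balancer" },
    { $set: { activeWindow: { start: "23:00", stop: "06:00" } } },
    true /* upsert: create the balancer document if it does not exist yet */
)'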

Comment by Scott Hernandez (Inactive) [ 09/Jul/12 ]

Guanhai, please open a new issue with stats and logs. Please include iostat -xmt 2, mongostat, and vmstat numbers during the period for all members involved, as well as the logs from those members and the mongos instances.

In addition, please include a mongodump of the config database after the event. A timeline would be very useful, so please call out what happened and when, as you experienced it from the user's/application's perspective.
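
A rough sketch of how those numbers could be captured during the slow period; the output file names and the mongos host below are placeholders:

# on every shard member and mongos host involved, while the problem is occurring
iostat -xmt 2 > iostat.out &
mongostat > mongostat.out &        # add --host/--port if mongod is not on localhost:27017
vmstat 2 > vmstat.out &

# after the event, dump the config database through a mongos
mongodump --host mongos-host --db config --out ./config-dump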

Comment by Guanhai Wang [ 09/Jul/12 ]

When the balancer was migrating a chunk from one shard to another, I got the same issue with MongoDB 1.8.5, git version 403c8dadcd56f68dcbe06013ecbfac67b32a22ac. While it was "doing delete inline", all operations on the sharded cluster were blocked for more than twenty minutes. My cluster was receiving about 10,000 commands per second at the time.

Comment by Eliot Horowitz (Inactive) [ 03/Nov/10 ]

That's expected. The goal is that it shouldn't cause detrimental performance to the system.
There are already a lot of other improvements in 1.7 to make it less impactful.

Comment by Chris Chandler [ 03/Nov/10 ]

The total cluster blocking appears to be resolved in 1.6.4; everything just slows down proportionally. However, the source shard that is having a chunk moved away still experiences ~15 minutes of 100% I/O utilization for a 200MB chunk. Is this expected, or should I file/comment on a bug elsewhere?

Comment by Eliot Horowitz (Inactive) [ 03/Nov/10 ]

Can you try 1.6.4?

Comment by Chris Chandler [ 03/Nov/10 ]

I'm still seeing this issue on 1.6.3.

Wed Nov 3 12:53:34 MongoDB starting : pid=16617 port=27018 dbpath=/db/var/mongodb/ 64-bit
Wed Nov 3 12:53:34 db version v1.6.3, pdfile version 4.5
Wed Nov 3 12:53:34 git version: 278bd2ac2f2efbee556f32c13c1b6803224d1c01
Wed Nov 3 12:53:34 sys info: Linux domU-12-31-39-06-79-A1 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_41
Wed Nov 3 12:53:34 [initandlisten] waiting for connections on port 27018
Wed Nov 3 12:53:34 [websvr] web admin interface listening on port 28018

I see the "doing delete inline" message and then iostat -x 2 jumps to 100% for approximately 10-11 minutes. Any attempt to write to the cluster in this window appears to block activity.

avg-cpu: %user %nice %system %iowait %steal %idle
2.12 0.00 9.09 38.61 0.00 50.19

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 98.51 10049.25 2057.71 121796.02 12.21 114.86 11.30 0.10 100.50
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 96.52 10044.28 2041.79 121903.48 12.22 114.85 11.31 0.10 100.50
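
For what it's worth, the 10-11 minute window can be matched against the mongod log by pulling out the migration messages; the log path here is only an example:

grep -nE 'doing delete inline|moveChunk' /var/log/mongodb/mongod.log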

Comment by Eliot Horowitz (Inactive) [ 30/Sep/10 ]

Ok - going to close for now.
Please comment if it starts acting up again.

Comment by Sergei Tulentsev [ 30/Sep/10 ]

It's performing very well, though write load is significantly lower.

Comment by Eliot Horowitz (Inactive) [ 30/Sep/10 ]

Is this performing better or are you still having problems?

Comment by Sergei Tulentsev [ 30/Sep/10 ]

I am running 1.6.3 now.

Comment by Eliot Horowitz (Inactive) [ 30/Sep/10 ]

1.6.3 has a number of the changes.
You may want to try that.

Comment by Sergei Tulentsev [ 29/Sep/10 ]

Sorry, must have missed your previous comment. What's a stall?

I must say that I don't encounter this behaviour anymore, probably because I am not inserting data at that rate. Though, I still have some data to import.

But I would rather wait for the next stable build. Are you going to merge these changes into it?

Comment by Eliot Horowitz (Inactive) [ 29/Sep/10 ]

Any updates?
Can you try 1.7.1?

Comment by Eliot Horowitz (Inactive) [ 12/Sep/10 ]

Yes, that.
And you upgraded everything?
Can you send the logs from a stall after that?
It shouldn't be happening any more.

Comment by Sergei Tulentsev [ 12/Sep/10 ]

You mean this?

sergio@cs2592:~$ mongod --version
db version v1.7.1-pre-, pdfile version 4.5
Sun Sep 12 21:03:28 git version: 6766569f9acdd80e27d957906f61dd7a14425d0d

Comment by Eliot Horowitz (Inactive) [ 12/Sep/10 ]

Can you send the startup banner with the git hash?

Comment by Sergei Tulentsev [ 12/Sep/10 ]

Yes, this is still happening in the latest nightly (2010-09-10)

Comment by Eliot Horowitz (Inactive) [ 12/Sep/10 ]

Is this still happening in 1.7?
SERVER-1521 should have fixed it.
