[SERVER-21230] Insert performance degrades drastically or hangs on large capped collection with WiredTiger Created: 30/Oct/15  Updated: 07/Apr/23  Resolved: 24/Nov/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.0.1, 3.0.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Denys Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: needs-repo
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2015-10-30 at 16.54.39.png     File WiredTigerStat.30.13     File WiredTigerStat.30.14     File WiredTigerStat.30.15     Text File collection_stats.txt     Text File mongod.log     File mongodb.stack     HTML File sar30     Text File stats.txt     Text File status.log    
Issue Links:
Related
Operating System: Linux
Participants:

 Description   

We are using MongoDB to store historical data, and occasionally insert performance degrades drastically (up to an hour for a single record) or inserts hang entirely.
The issue was reproduced a few times on MongoDB 3.0.1 and 3.0.6.
Symptoms are the following:

  • inserts are slow or hang
  • mongostat hangs
  • mongotop works but shows all zeros
  • remote Robomongo hangs when attempting to connect
  • a local mongo shell connects and can run some commands, but hangs when retrieving db/collection stats and when running queries

pstack output, WiredTiger stats, parts of the mongod log, stats for the largest DB, and a few of the longest-running operations are attached.
Some data was removed because it may be sensitive.



 Comments   
Comment by Denys [ 01/Dec/15 ]

Hi Ramon

The issue has reproduced again on 3.0.7 Enterprise. I will ask our DB team to open a commercial support ticket.
Another drawback of this issue: in a replica set, if the primary gets stuck this way, the whole replica set becomes unavailable. Heartbeats still work, but data can't be written to or read from the primary.

Is there any additional data I can collect from the instance while it is hanging? I can leave it in this state for a few days.

Comment by Ramon Fernandez Marina [ 24/Nov/15 ]

Thanks for the update, s.dixenon@gmail.com. I'm going to close this ticket for now, but if you see this behavior again, please post here so we can reopen it, or just open a new ticket.

Regards,
Ramón.

Comment by Denys [ 24/Nov/15 ]

Unfortunately, I was only able to upgrade a few days ago, so it's too short a period to have any results.
And as I said, it's rarely reproducible and hasn't occurred this month, even on 3.0.6.
I will update the issue if it happens again.

Comment by Ramon Fernandez Marina [ 23/Nov/15 ]

s.dixenon@gmail.com, is this still an issue for you? Have you observed the same behavior in 3.0.7?

Thanks,
Ramón.

Comment by Ramon Fernandez Marina [ 31/Oct/15 ]

s.dixenon@gmail.com, I believe you're experiencing performance issues with deletions on capped collections in earlier 3.0 versions that have been addressed in 3.0.7: SERVER-19522 fixed a declining insert rate, and a performance drop-off reported in SERVER-19995 was also fixed before 3.0.7.

I am trying to reproduce this locally in case it uncovers a new bug related to capped collections, but I'd recommend upgrading to 3.0.7 to get better performance out of capped collections.
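For context on why deletion performance affects inserts: a capped collection has a fixed maximum size, and once it is full, each new insert causes the oldest documents (in insertion order) to be removed. The toy model below is a hypothetical sketch of those semantics only. It is not how WiredTiger implements capped collections, it ignores the optional maximum document count, and the class name `CappedCollectionModel` is invented for illustration.

```python
from collections import deque


class CappedCollectionModel:
    """Hypothetical toy model of capped-collection eviction (size limit only)."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.docs = deque()  # (doc_id, size_in_bytes) tuples, oldest on the left
        self.used_bytes = 0
        self.evictions = 0

    def insert(self, doc_id: str, size: int) -> int:
        """Insert one document; return how many old documents were evicted."""
        if size > self.max_bytes:
            raise ValueError("document is larger than the whole collection")
        evicted = 0
        # Once the collection is full, every insert must first delete from the
        # oldest end, so insert throughput is bounded by how fast deletions run.
        while self.used_bytes + size > self.max_bytes:
            _, old_size = self.docs.popleft()
            self.used_bytes -= old_size
            evicted += 1
        self.docs.append((doc_id, size))
        self.used_bytes += size
        self.evictions += evicted
        return evicted

    def ids(self):
        """Return the IDs of the documents currently kept, oldest first."""
        return [doc_id for doc_id, _ in self.docs]


if __name__ == "__main__":
    capped = CappedCollectionModel(max_bytes=100)
    for i in range(5):
        capped.insert(f"doc{i}", 30)
    print(capped.ids())      # ['doc2', 'doc3', 'doc4']
    print(capped.evictions)  # 2
```

In this model, once the collection is full, the fourth and later inserts each trigger a deletion first, which is the coupling between delete cost and insert rate that the linked tickets describe.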

Comment by Denys [ 30/Oct/15 ]

Thanks for the quick answer.
These are VMware virtual hosts with 4 x Intel(R) Xeon(R) CPU X5650 @ 2.67GHz, 8 GB RAM, 6 GB swap, and 100 GB of SAN storage (21 GB free).
mongod is the only application running on the host; at the moment it is using 16% CPU and 79% of memory, and other apps use about 2% of memory.
sar stats for the same day are attached.

I forgot to mention that the database contains a single capped collection; its current stats are attached.
While the issue was occurring, stats couldn't be retrieved.

Typical usage: about 10M inserts and a few hundred indexed queries over a few hours during the day.
The issue is rarely reproducible: it happens about once per month per host.
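To put the described workload in rough perspective, the average insert rate is the total number of inserts divided by the length of the busy window. The report only says "a few hours", so the 3-hour window below is an assumption, and the helper name `average_insert_rate` is hypothetical. Peak rates within the window could be considerably higher than this average.

```python
def average_insert_rate(total_inserts: int, window_hours: float) -> float:
    """Average inserts per second over a busy window (illustrative helper)."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    return total_inserts / (window_hours * 3600)


if __name__ == "__main__":
    # About 10M inserts per day (from the report) over an ASSUMED 3-hour window.
    rate = average_insert_rate(10_000_000, 3)
    print(f"{rate:.0f} inserts/s")  # 926 inserts/s
```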

Comment by Ramon Fernandez Marina [ 30/Oct/15 ]

Thanks for the detailed report. The only thing that seems odd in the WT stats is the reconciliation numbers; I've uploaded a visual representation of them. Looking at the logs, I see the slow inserts on the testHarnessSIT_N database; given its size, it may take a while to reproduce this behavior on-site.

Can you please provide some more information about this deployment? What's the machine size, available memory, etc.?

Thanks,
Ramón.

Generated at Thu Feb 08 03:56:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.