[SERVER-21616] WiredTiger hangs when mongorestoring 2.8TB data
Created: 21/Nov/15 | Updated: 13/Feb/16 | Resolved: 13/Feb/16
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Admin, WiredTiger |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | guruditta golani | Assignee: | Ramon Fernandez Marina |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | WTmem |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
|
When mongorestoring about 2.8 TB of data split across 4 mongo dbs of almost equal size, mongorestore gets stuck once two of the 4 db restores move into the index restore phase. Around the same time RAM was fully utilized, but there was no OOM kill.

Configuration:

Trace of the mongod process that appears to be spinning at 100% CPU:

From the mongod.log file, the following error line seems pertinent:

Marking it as P2 as the observed workaround is to serially mongorestore the databases, each sized at approximately 700GB. |
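The serial workaround described above, restoring the four databases one at a time rather than in a single parallel run, could be sketched as follows; the database names and dump paths are illustrative placeholders, not values from this ticket:

```bash
# Sketch of the serial-restore workaround; db names and paths are hypothetical.
for db in db1 db2 db3 db4; do
    mongorestore --db "$db" "dump/$db"
done
```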
| Comments |
| Comment by Ramon Fernandez Marina [ 13/Feb/16 ] |
|
guruditta, I'm closing this ticket as a duplicate of the related ticket identified below.

Thanks, |
| Comment by Ramon Fernandez Marina [ 30/Jan/16 ] |
|
guruditta, we think we've identified the underlying cause of this issue in a related ticket.

Interestingly, I was not able to replicate this behavior with MongoDB 3.2.1: I ran 4 index builds in parallel while 10 threads inserted data at the same time, and I was getting about 100k inserts/second during the index builds. You may want to consider upgrading to MongoDB 3.2, as it is showing better performance and stability here.

Regards, |
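A load test along the lines described above (parallel index builds plus concurrent insert workers) might be sketched roughly as below; the database name, collection name, index keys, and document shape are all assumptions rather than details from the ticket:

```bash
# Hypothetical load-test sketch: background index builds running while insert
# workers write concurrently. All names and field choices are illustrative.

# Kick off 4 index builds in parallel (background builds so writes are not blocked).
for field in a b c d; do
    mongo testdb --eval "db.bigcoll.createIndex({$field: 1}, {background: true})" &
done

# Start 10 concurrent insert workers.
for i in $(seq 1 10); do
    mongo testdb --eval '
        for (var n = 0; n < 1000000; n++) {
            db.bigcoll.insert({a: n, b: n % 97, c: new Date(), d: "payload-" + n});
        }' &
done

wait    # wait for the index builds and insert workers to finish
```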
| Comment by Ramon Fernandez Marina [ 21/Jan/16 ] |
|
Apologies for the radio silence on this ticket guruditta. The behavior you're seeing could be caused by a long-running transaction, but without a way to reproduce this locally it is hard to tell what the underlying cause is. I'm going to build a dataset locally that matches the numbers you provided and attempt to reproduce this bug. Note however that it tends to be specific data distributions that cause issues like this one, so size alone may not be sufficient to reproduce it.

Since you have a workaround I assume that you're not blocked and your system is back in operation. But since you're not able to upload the data, perhaps you can check whether MongoDB 3.2.1 exhibits the problem? The 3.2 series includes a significant number of improvements in general and in WiredTiger in particular, so it would be useful for us to know if this problem is still present in the latest version.

Thanks, |
| Comment by guruditta golani [ 23/Nov/15 ] |
|
It was observed that WiredTiger hung in that state when two of the four dbs were restoring collection data and the other two were recreating indexes from the dump metadata. Mongorestoring the whole dataset with --noIndexRestore worked just fine. |
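A restore that skips index creation and then rebuilds the indexes once the data load is complete could be sketched like this; the dump path, database name, collection name, and index key below are placeholders rather than values from this ticket:

```bash
# Restore the dump without recreating indexes (data load only); the path is hypothetical.
mongorestore --noIndexRestore dump/

# Once the data is loaded, build the indexes explicitly, one at a time.
# Database, collection, and key names are placeholders.
mongo mydb --eval 'db.mycoll.createIndex({someField: 1})'
```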
| Comment by guruditta golani [ 23/Nov/15 ] |
|
Startup log:

I can't upload the data, but the characteristics of the dbs are the same. Each has >10 collections, but one collection dominates in size, indexes, and usage by 99%. Here are the stats for one sample db: |
| Comment by Ramon Fernandez Marina [ 22/Nov/15 ] |
|
Can you please upload server logs from startup (so we can see the options you're using) until the server gets stuck? Also, can you please provide more information about the databases you're restoring? If you managed to successfully restore them individually, then the output of db.stats() and db.collection.getIndexes() for all collections would help us attempt to reproduce this issue locally.

Alternatively, if there's a chance you can share your dataset with us, you can upload it privately and securely here. You'll need to split the files into 5GiB chunks before uploading.
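One common way to produce 5GiB pieces with a part. prefix is to archive the dump and pipe it through GNU split; the dump/ directory name here is an assumption, and only the part. prefix is taken from the upload instructions that follow:

```bash
# Archive the dump directory and split it into 5GiB pieces named part.aa, part.ab, ...
# The dump/ path is a placeholder; substitute the actual data directory.
tar czf - dump/ | split -b 5G - part.

# Sanity-check the pieces before uploading.
ls -lh part.*
```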
Then upload all the part.* files.

Thanks, |