[SERVER-18224] jsCore_small_oplog test times out on Ubuntu with RocksDB Created: 08/Apr/15 Updated: 29/Apr/15 Resolved: 29/Apr/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | 3.1.2 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Samantha Ritter (Inactive) | Assignee: | Ramon Fernandez Marina |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Participants: |
| Description |
|
| Comments |
| Comment by Spencer Brody (Inactive) [ 29/Apr/15 ] |
|
Does indeed seem to be fixed according to MCI, thanks! |
| Comment by Igor Canadi [ 29/Apr/15 ] |
|
I think I fixed this: https://mci.10gen.com/task/mongodb_mongo_master_ubuntu1404_rocksdb_aa54b581e9afaf7444846a35bbd1adc8262d1330_15_04_29_14_58_08_jsCore_small_oplog_ubuntu1404_rocksdb The issue was that c3.xlarge has only 8GB of memory. I had configured RocksDB with 1024 shards in the block cache (we've seen some block cache mutex contention), which left each shard with only 4MB (by default the block cache size is half of physical RAM). This led to a lot of block cache misses because one shard was very hot on the slave. Reducing the number of block cache shards to 64 fixed the issue: https://github.com/mongodb-partners/mongo-rocks/commit/cd4ee670dde7eb4df8ec64bd7b2913503610053e. I'll need to check whether block cache mutex contention is still an issue. Feel free to close this! |
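[Editorial note] For context, the sketch below is not the actual mongo-rocks code from the linked commit, just an illustration of the RocksDB API it relies on. The shard count of the block cache is controlled by the num_shard_bits argument of rocksdb::NewLRUCache: with ~8GB of RAM and a cache sized at half of physical memory (~4GB), 2^10 = 1024 shards leave only ~4MB per shard, while 2^6 = 64 shards leave ~64MB. The function MakeOptions and the physicalRamBytes parameter are hypothetical names used only for this example.

```cpp
// Minimal sketch (not mongo-rocks code) of how the RocksDB block cache
// shard count is chosen. Fewer shards means more memory per shard, at the
// cost of more contention on each shard's mutex.
#include <rocksdb/cache.h>
#include <rocksdb/options.h>
#include <rocksdb/table.h>

rocksdb::Options MakeOptions(size_t physicalRamBytes) {
    rocksdb::BlockBasedTableOptions tableOptions;

    // Default policy described in the comment above: cache = half of RAM.
    const size_t cacheSizeBytes = physicalRamBytes / 2;

    // Second argument is num_shard_bits: 2^6 = 64 shards instead of
    // 2^10 = 1024, so each shard gets ~64MB rather than ~4MB on an 8GB box.
    tableOptions.block_cache = rocksdb::NewLRUCache(cacheSizeBytes, 6);

    rocksdb::Options options;
    options.table_factory.reset(
        rocksdb::NewBlockBasedTableFactory(tableOptions));
    return options;
}
```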
| Comment by Igor Canadi [ 29/Apr/15 ] |
|
Running on m3.xlarge, I encounter this issue: |
| Comment by Ernie Hershey [ 28/Apr/15 ] |
|
Igor - for the Ubuntu 14.04 tests in MCI we're currently using c3.xlarge. |
| Comment by Igor Canadi [ 28/Apr/15 ] |
|
Yeah, as soon as I can reproduce it, it will be easy to identify the issue; without a reproduction it'll be very hard to fix. As a short-term solution, do you think it would make sense to increase the timeout to 10 minutes? |
| Comment by Spencer Brody (Inactive) [ 28/Apr/15 ] |
|
Hi Igor, Thanks for looking into this! |
| Comment by Igor Canadi [ 28/Apr/15 ] |
|
BTW, it looks like it's currently failing because of pymongo issues? https://mci.10gen.com/task_log_raw/mongodb_mongo_master_ubuntu1404_rocksdb_c8e2c0546b30621f78cd436d96714cc064bbb8a7_15_04_28_15_36_04_jsCore_small_oplog_ubuntu1404_rocksdb/0?type=T |
| Comment by Igor Canadi [ 28/Apr/15 ] |
|
I could actually use some help on this one. From the logs I see that insertion on the primary is much faster than replication on the secondary, but unfortunately I can't reproduce it on my system. Could this possibly be related to https://jira.mongodb.org/browse/SERVER-18200? |
| Comment by Spencer Brody (Inactive) [ 28/Apr/15 ] |
|
Great, thanks for the update Igor! |
| Comment by Igor Canadi [ 28/Apr/15 ] |
|
Thanks Spencer! This has been on my radar for a while now, but I've been focusing on stabilizing the v3.0 branch. What's happening here is a tombstone issue: RocksDB just inserts a tombstone when a deletion happens. jscore_small_oplog fails on the remove* tests because they issue a bunch of deletions. With so many deletions, every query we want to answer has to iterate through a lot of tombstones, and replication can't keep up within the 5-minute limit. An easy fix would be to increase the limit, but I also have some ideas for how to fix it properly. I'll work on that soon. |
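[Editorial note] To illustrate the tombstone behavior described above, here is a minimal, self-contained RocksDB sketch. It is illustrative only, not mongo-rocks code; the database path and key names are placeholders. The point is that Delete() merely writes a tombstone, so until compaction removes tombstones, an iterator over a heavily deleted range has to step over every one of them before it finds a live key or reaches the end.

```cpp
// Minimal sketch: why heavy deletion makes scans slow in RocksDB.
#include <cassert>
#include <string>
#include <rocksdb/db.h>

int main() {
    rocksdb::DB* db = nullptr;
    rocksdb::Options options;
    options.create_if_missing = true;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/tombstone-demo", &db);
    assert(s.ok());

    // Insert and then delete a large range of keys. Each Delete() is just
    // another write (a tombstone); nothing is physically removed yet.
    for (int i = 0; i < 100000; ++i) {
        const std::string key = "key" + std::to_string(i);
        db->Put(rocksdb::WriteOptions(), key, "value");
        db->Delete(rocksdb::WriteOptions(), key);
    }

    // Seeking into the deleted range now has to skip ~100k tombstones
    // before the iterator can report that nothing live remains.
    rocksdb::Iterator* it = db->NewIterator(rocksdb::ReadOptions());
    it->Seek("key");
    // it->Valid() is false here, but only after scanning the tombstones;
    // compaction eventually drops them and restores scan performance.
    delete it;
    delete db;
    return 0;
}
```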
| Comment by Spencer Brody (Inactive) [ 27/Apr/15 ] |
|
Igor, the RocksDB build variant of our regression tests has been failing on the jscore_small_oplog suite for a while now. Would you mind taking a look? |