[SERVER-22231] Add additional test suites to run resmoke.py validation hook Created: 19/Jan/16  Updated: 16/Sep/20  Resolved: 24/Feb/16

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 3.2.4, 3.3.3

Type: Improvement Priority: Major - P3
Reporter: Jonathan Abrahams Assignee: Robert Guo (Inactive)
Resolution: Done Votes: 0
Labels: test-only, tig-resmoke
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-22234 Collection with a long key fails vali... Closed
Related
is related to SERVER-22121 Add resmoke.py validation testing hoo... Closed
Backwards Compatibility: Fully Compatible
Backport Completed:
Sprint: TIG 10 (02/19/16), TIG 11 (03/11/16)
Participants:
Linked BF Score: 0

 Description   

SERVER-22121 enabled the validation hook in only 2 suite YAML files:

  • core.yml
  • sharding_jscore_passthrough.yml

We should investigate adding the following suites:

  • concurrency
  • concurrency_replication
  • concurrency_sharded
  • jstestfuzz
  • jstestfuzz_replication
  • jstestfuzz_sharded
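For reference, enabling the hook in one of these suites amounts to listing it under the executor's hooks in the suite's YAML file. A minimal sketch (the ValidateCollections class name and the exact layout are assumptions about the resmoke.py suite configuration of that era, not copied from the actual files):

```yaml
# buildscripts/resmokeconfig/suites/concurrency.yml (sketch; layout assumed)
executor:
  js_test:
    hooks:
      - class: ValidateCollections
```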


 Comments   
Comment by Robert Guo (Inactive) [ 08/Mar/16 ]

Excellent! Thanks for the fix and the explanation, Igor. It makes sense that these two suites were failing; they were the only ones that ran more than one mongod on the same VM (two mongods in a master-slave configuration in this case).

Comment by Igor Canadi [ 08/Mar/16 ]

Looks like it helped! All the tests are passing for RocksDB now!

Comment by Igor Canadi [ 08/Mar/16 ]

I just pushed https://github.com/mongodb-partners/mongo-rocks/commit/982c182382fdb0de1ebd1c9770bc9fa79372f893, let's see if this fixes it.

Comment by Igor Canadi [ 08/Mar/16 ]

Thanks for the investigation Robert!

I'm running the tests on my machine, and even though they're not failing, I see that calling validate() fills up the block cache pretty quickly. With validate() my block cache grows to 5GB pretty quickly; without it, it stays around 1GB.

By default we set block cache to be 1/3 of the total RAM available: https://github.com/mongodb-partners/mongo-rocks/blob/master/src/rocks_engine.cpp#L206

What might be happening here is that a mongod process using 1/3 of the machine's RAM is too much. How much memory do those machines have? Let me try defaulting the RocksDB memory size to a smaller amount, with a calculation similar to what WiredTiger does: https://github.com/mongodb/mongo/blob/master/src/mongo/db/storage/wiredtiger/wiredtiger_init.cpp#L72

Alternatively, we could pass in --rocksdbCacheSizeGB=1 (https://github.com/mongodb-partners/mongo-rocks/blob/master/src/rocks_global_options.cpp#L46)

Comment by Robert Guo (Inactive) [ 07/Mar/16 ]

Hi Igor,

I did some investigation and it looks like the OOM issue is caused by the validate command running after every test. I used malloc_history -highWaterMark to check the memory usage on OS X and here is the stacktrace.

2992 calls for 1359137235 bytes: thread_70000105e000 |
thread_start |
 _pthread_body |
 _pthread_body |
 void* std::__1::__thread_proxy<std::__1::tuple<std::__1::__bind<void* (*)(void*), mongo::(anonymous namespace)::MessagingPortWithHandler*> > >(void*) |
 mongo::PortMessageServer::handleIncomingMsg(void*) |
 mongo::MyMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*) |
 mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&) |
 mongo::receivedCommand(mongo::OperationContext*, mongo::NamespaceString const&, mongo::Client&, mongo::DbResponse&, mongo::Message&) |
 mongo::runCommands(mongo::OperationContext*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) |
 mongo::Command::execCommand(mongo::OperationContext*, mongo::Command*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) |
 mongo::Command::run(mongo::OperationContext*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) |
 mongo::ValidateCmd::run(mongo::OperationContext*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, mongo::BSONObj&, int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >&, mongo::BSONObjBuilder&) |
 mongo::Collection::validate(mongo::OperationContext*, bool, bool, mongo::ValidateResults*, mongo::BSONObjBuilder*) |
 mongo::RocksRecordStore::validate(mongo::OperationContext*, bool, bool, mongo::ValidateAdaptor*, mongo::ValidateResults*, mongo::BSONObjBuilder*) |
 mongo::RocksRecordStore::Cursor::next() |
 mongo::(anonymous namespace)::PrefixStrippingIterator::Next() |
 rocksdb::BaseDeltaIterator::Next() |
 rocksdb::BaseDeltaIterator::Advance() |
 rocksdb::DBIter::Next() |
 rocksdb::MergingIterator::Next() |
 rocksdb::(anonymous namespace)::TwoLevelIterator::Next() |
 rocksdb::(anonymous namespace)::TwoLevelIterator::SkipEmptyDataBlocksForward() |
 rocksdb::(anonymous namespace)::TwoLevelIterator::InitDataBlock() |
 rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::BlockIter*) |
 rocksdb::(anonymous namespace)::ReadBlockFromFile(rocksdb::RandomAccessFileReader*, rocksdb::Footer const&, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, std::__1::unique_ptr<rocksdb::Block, std::__1::default_delete<rocksdb::Block> >*, rocksdb::Env*, bool) |
 rocksdb::ReadBlockContents(rocksdb::RandomAccessFileReader*, rocksdb::Footer const&, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::BlockContents*, rocksdb::Env*, bool) |
 operator new(unsigned long) |
 malloc |
 malloc_zone_malloc 

mongo::RocksRecordStore::validate calls into RocksDB here in rocks_record_store.cpp.

There doesn't seem to be a glaring bug in the validation code, which is very similar to WiredTiger's; the issue may simply be that RocksDB uses more memory when doing a collection scan. But if I remember your talk correctly, RocksDB should be expected to use more memory than a B-tree based implementation when doing reads?

If you think there's room for improvement in RocksDB's memory usage, I'd be happy to play around with it after a fix. In the meantime, I'll ask our AWS team for a larger instance for the RocksDB builds. If you have other suggestions for fixing this build failure, please feel free to let me know as well.

Thanks,
Robert

Comment by Igor Canadi [ 04/Mar/16 ]

Thanks Ramon and Robert!

FYI, I wasn't able to reproduce the failure on my machine, but that was probably expected: it has a lot of memory and this is an OOM issue. I should try reproducing on a machine with less memory.

Comment by Robert Guo (Inactive) [ 04/Mar/16 ]

ramon.fernandez Yep, I'll take a look.

Comment by Ramon Fernandez Marina [ 04/Mar/16 ]

robert.guo, after this change we started seeing some failures in the RocksDB testing – see Igor's comment on GitHub. Can you please take a look?

Comment by Githook User [ 24/Feb/16 ]

Author:

{u'username': u'guoyr', u'name': u'Robert Guo', u'email': u'robert.guo@10gen.com'}

Message: SERVER-22231 enable the validation hook on additional suites

(cherry picked from commit 5e2b94dca62ab39a4fddf8896aae6d66d7922256)
Branch: v3.2
https://github.com/mongodb/mongo/commit/9b021af0f19d0525b05e5a612eb76e1f4511962b

Comment by Githook User [ 24/Feb/16 ]

Author:

{u'username': u'guoyr', u'name': u'Robert Guo', u'email': u'robert.guo@10gen.com'}

Message: SERVER-22231 enable the validation hook on additional suites
Branch: master
https://github.com/mongodb/mongo/commit/5e2b94dca62ab39a4fddf8896aae6d66d7922256

Comment by Robert Guo (Inactive) [ 11/Feb/16 ]

Removing the concurrency_* suites, since they start their own fixtures rather than having resmoke.py start them.

Generated at Thu Feb 08 03:59:47 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.