[SERVER-22328] bench_test_crud_commands.js fails due to resource contention from other resmoke jobs and low timeout values Created: 03/Jan/16  Updated: 22/Feb/16  Resolved: 18/Feb/16

Status: Closed
Project: Core Server
Component/s: Performance
Affects Version/s: None
Fix Version/s: 3.0.10, 3.3.2

Type: Bug Priority: Minor - P4
Reporter: Ian Whalen (Inactive) Assignee: J Rassi
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Sprint: Query F (02/01/16), Query 10 (02/22/16)
Participants:
Linked BF Score: 0

 Description   

noPassthroughWithMongod_WT failed on suse12

bench_test_crud_commands.js - Logs | History

BF Ticket Generated by ian.whalen



 Comments   
Comment by Githook User [ 18/Feb/16 ]

Author:

{u'username': u'jrassi', u'name': u'Jason Rassi', u'email': u'rassi@10gen.com'}

Message: SERVER-22328 Remove benchRun() retval assertion in bench jstest

(cherry picked from commit cace2c61d3bddf3dd9f82ead6b0bb6167d635d11)
Branch: v3.2
https://github.com/mongodb/mongo/commit/c44e2fef508212f688abd4f3f37a850145f62f43

Comment by Githook User [ 18/Feb/16 ]

Author:

{u'username': u'jrassi', u'name': u'Jason Rassi', u'email': u'rassi@10gen.com'}

Message: SERVER-22328 Remove benchRun() retval assertion in bench jstest
Branch: master
https://github.com/mongodb/mongo/commit/cace2c61d3bddf3dd9f82ead6b0bb6167d635d11

Comment by Samantha Ritter (Inactive) [ 11/Feb/16 ]

Again: https://evergreen.mongodb.com/task/mongodb_mongo_v3.2_linux_64_debug_noPassthroughWithMongod_WT_aaa36d2ce713e4bde75f6b9e73fe91bcf20865f8_16_02_11_02_51_00

Comment by J Rassi [ 27/Jan/16 ]

In the above failure, an insert command of 100 documents with write concern {w: "majority"} took 4.4 seconds. Again, I suspect the cause for this slowness is disk contention from other resmoke jobs. For example, job #6 was taking ~300ms for each ~2MB single-document insert (external_sort_text_agg.js) at 2016-01-27T19:48:09, the same time that the 4-second insert batch occurred in this job.

Comment by Jonathan Reams [ 27/Jan/16 ]

Looks like this happened again https://evergreen.mongodb.com/task/mongodb_mongo_v3.2_rhel70_noPassthroughWithMongod_WT_43482c1e7ad1fd8c307460682a76368c80bb355d_16_01_27_18_26_22

Comment by J Rassi [ 08/Jan/16 ]

In this build failure, an insert command of 100 documents with the write concern {w: 1, j: false} took 1.8 seconds, though the test expects the operation to complete in <1 second.

[MongoDFixture:job2] 2015-12-31T22:16:33.766+0000 I COMMAND  [conn58] command test.bench_test_crud_commands command: insert { insert: "bench_test_crud_commands", documents: 100, writeConcern: { writeConcern: { w: 1.0, j: false } } } ninserted:100 keyUpdates:0 writeConflicts:0 numYields:0 reslen:25 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 2, W: 1 } }, Collection: { acquireCount: { w: 2, W: 1 } } } protocol:op_command 1807ms

This particular test case was added in SERVER-21426, though it appears unlikely the commit for SERVER-21426 introduced this regression (a test case containing the same write command without a write concern was previously in this file, which should take the same amount of time to complete).

I believe that this test is also suffering from the issue that benchRun has a short "warm-up" period (see BenchRunWorker::shouldCollectStats()) where it doesn't count operations that are run in the final stats report. This can be evidenced from the fact that the server reports the insert operations as completed (see the assert.gt(coll.count(), 0) assertion in the test), but benchRun reports these operations as not completed.

I audited the logs for the other three resmoke.py jobs running on the machine at the same time, to look for evidence of resource contention on the machine at this time. I found that no_balance_collection.js (running in job #3) made two 40MB writes to disk at around the same time (one of which took 569ms, the other took 717ms), which points to disk contention as a possible cause for the inserts taking too long:

[js_test:no_balance_collection] 2015-12-31T22:16:32.856+0000 c20764| 2015-12-31T22:16:32.855+0000 I REPL     [rsSync] ******
[js_test:no_balance_collection] 2015-12-31T22:16:32.856+0000 c20764| 2015-12-31T22:16:32.855+0000 I REPL     [rsSync] creating replication oplog of size: 40MB...
[js_test:no_balance_collection] 2015-12-31T22:16:32.856+0000 c20764| 2015-12-31T22:16:32.856+0000 I REPL     [ReplicationExecutor] Member ip-10-43-157-38:20762 is now in state SECONDARY
[js_test:no_balance_collection] 2015-12-31T22:16:32.857+0000 c20764| 2015-12-31T22:16:32.856+0000 I REPL     [ReplicationExecutor] Member ip-10-43-157-38:20763 is now in state STARTUP2
[js_test:no_balance_collection] 2015-12-31T22:16:33.346+0000 c20763| 2015-12-31T22:16:33.346+0000 I REPL     [rsSync] ******
[js_test:no_balance_collection] 2015-12-31T22:16:33.347+0000 c20763| 2015-12-31T22:16:33.346+0000 I REPL     [rsSync] creating replication oplog of size: 40MB...
[js_test:no_balance_collection] 2015-12-31T22:16:33.572+0000 c20764| 2015-12-31T22:16:33.572+0000 I STORAGE  [rsSync] Starting WiredTigerRecordStoreThread local.oplog.rs
[js_test:no_balance_collection] 2015-12-31T22:16:33.573+0000 c20764| 2015-12-31T22:16:33.572+0000 I STORAGE  [rsSync] The size storer reports that the oplog contains 0 records totaling to 0 bytes
[js_test:no_balance_collection] 2015-12-31T22:16:33.573+0000 c20764| 2015-12-31T22:16:33.572+0000 I STORAGE  [rsSync] Scanning the oplog to determine where to place markers for truncation
[js_test:no_balance_collection] 2015-12-31T22:16:33.915+0000 c20763| 2015-12-31T22:16:33.915+0000 I STORAGE  [rsSync] Starting WiredTigerRecordStoreThread local.oplog.rs

I also looked at the last 1000 executions of this test, and found no other failures.

I recommend taking no action until this test fails again, in case the failure is due to a transient hardware issue. After analysis of the next failure, I would recommend one of the following:

  1. Increasing the 1 second timeout.
  2. Reducing the number of documents inserted per batch down from 100.
  3. Improve the predictability of benchRun's "warm-up" period, or provide a user mechanism to disable it.
  4. Removing the assert.gt(res.insert, 0) assertion from this test (it's redundant with the coll.count() assertion).

#4 seems like it would have the highest likelihood of fixing the problem with the least amount of work.

Comment by Ian Whalen (Inactive) [ 03/Jan/16 ]

[js_test:bench_test_crud_commands] 2015-12-31T22:16:33.768+0000 assert: 0 is not greater than 0 : {
[js_test:bench_test_crud_commands] 2015-12-31T22:16:33.769+0000 	"note" : "values per second",
[js_test:bench_test_crud_commands] 2015-12-31T22:16:33.769+0000 	"errCount" : NumberLong(0),
[js_test:bench_test_crud_commands] 2015-12-31T22:16:33.769+0000 	"trapped" : "error: not implemented",
[js_test:bench_test_crud_commands] 2015-12-31T22:16:33.769+0000 	"totalOps" : NumberLong(0),
[js_test:bench_test_crud_commands] 2015-12-31T22:16:33.769+0000 	"totalOps/s" : 0,
[js_test:bench_test_crud_commands] 2015-12-31T22:16:33.769+0000 	"findOne" : 0,
[js_test:bench_test_crud_commands] 2015-12-31T22:16:33.769+0000 	"insert" : 0,
[js_test:bench_test_crud_commands] 2015-12-31T22:16:33.769+0000 	"delete" : 0,
[js_test:bench_test_crud_commands] 2015-12-31T22:16:33.769+0000 	"update" : 0,
[js_test:bench_test_crud_commands] 2015-12-31T22:16:33.770+0000 	"query" : 0,
[js_test:bench_test_crud_commands] 2015-12-31T22:16:33.770+0000 	"command" : 0
[js_test:bench_test_crud_commands] 2015-12-31T22:16:33.770+0000 }
[js_test:bench_test_crud_commands] 2015-12-31T22:16:33.770+0000 doassert@src/mongo/shell/assert.js:15:14
[js_test:bench_test_crud_commands] 2015-12-31T22:16:33.770+0000 assert.gt@src/mongo/shell/assert.js:284:5
[js_test:bench_test_crud_commands] 2015-12-31T22:16:33.770+0000 testInsert@jstests/noPassthroughWithMongod/bench_test_crud_commands.js:34:9
[js_test:bench_test_crud_commands] 2015-12-31T22:16:33.770+0000 @jstests/noPassthroughWithMongod/bench_test_crud_commands.js:66:1
[js_test:bench_test_crud_commands] 2015-12-31T22:16:33.770+0000 @jstests/noPassthroughWithMongod/bench_test_crud_commands.js:2:2

Generated at Thu Feb 08 04:00:04 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.