[SERVER-11177] stale_clustered.js failed on RHEL 5.7 64-bit Created: 14/Oct/13  Updated: 11/Jul/16  Resolved: 17/Oct/13

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 2.5.3

Type: Bug Priority: Major - P3
Reporter: Matt Kangas Assignee: Greg Studer
Resolution: Done Votes: 0
Labels: buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

Assigning to Greg for triage.

a428f58a6fd4 RHEL 5.7 64-bit slow_nightly_tests (logfile)

2013-10-14T13:03:10.355-0400 error: {
	"$err" : "not master or secondary; cannot currently read from this replSet member",
	"code" : 17070
} at src/mongo/shell/query.js:131
failed to load: /data/mci/git@github.commongodb/mongo.git/master/jstests/slowNightly/stale_clustered.js

Last visible green was 6a70c219c629

/cc alerner



 Comments   
Comment by Ernie Hershey [ 16/Oct/13 ]

Should I convert /data on that build host to XFS? It's ext4 now, which I think could be contributing to preallocation slowdown. Is the basic build host issue that the filesystem is slow? Or do you think it could be something else?

Comment by Matt Kangas [ 16/Oct/13 ]

Ok, sounds like the above (rev 94ce941611a) failures were build host issues.

Keeping ticket open until we are clear.

Comment by Greg Studer [ 16/Oct/13 ]

The second failure above:

> rev 94ce941611a
> Buildbot Nightly Linux RHEL 64-bit Build #712 (logfile)

This is a failure to even start up the basic replica set for the test:

{
 "ok" : 0,
 "errmsg" : "couldn't initiate : need all members up to initiate, not ok : bs-rhel-57-64-2:31201"
}
2013-10-16T00:31:19.964-0400 assert.soon failed: function () {
        var result = master.runCommand(cmd);
        printjson(result);
        return result['ok'] == 1;
    }, msg:Initiate replica set

and doesn't die...

2013-10-16T00:36:19.089-0400 process on port 31201, with pid 14157 not terminated, sending sigkill

That particular node, :31201, never actually seems to start accepting connections because it's preallocating.

 m31201| 2013-10-16T00:14:19.325-0400 [initandlisten] MongoDB starting : pid=14157 port=31201 dbpath=/data/db/clusteredstale-rs1-1 64-bit host=bs-rhel-57-64-2
 m31201| 2013-10-16T00:14:19.326-0400 [initandlisten] 
 m31201| 2013-10-16T00:14:19.326-0400 [initandlisten] ** NOTE: This is a development version (2.5.3-pre-) of MongoDB.
 m31201| 2013-10-16T00:14:19.326-0400 [initandlisten] **       Not recommended for production.
 m31201| 2013-10-16T00:14:19.326-0400 [initandlisten] 
 m31201| 2013-10-16T00:14:19.326-0400 [initandlisten] db version v2.5.3-pre-
 m31201| 2013-10-16T00:14:19.326-0400 [initandlisten] git version: 94ce941611abf2fd3826936a90d93b09e483d99c
 m31201| 2013-10-16T00:14:19.326-0400 [initandlisten] build info: Linux bs-rhel-57-64-2 2.6.18-274.el5xen #1 SMP Fri Jul 8 17:45:44 EDT 2011 x86_64 BOOST_LIB_VERSION=1_49
 m31201| 2013-10-16T00:14:19.326-0400 [initandlisten] allocator: tcmalloc
 m31201| 2013-10-16T00:14:19.326-0400 [initandlisten] options: { dbpath: "/data/db/clusteredstale-rs1-1", noprealloc: true, oplogSize: 40, port: 31201, replSet: "clusteredstale-rs1", rest: true, setParameter: [ "enableTestCommands=1" ], smallfiles: true }
 m31201| 2013-10-16T00:14:19.333-0400 [initandlisten] journal dir=/data/db/clusteredstale-rs1-1/journal
 m31201| 2013-10-16T00:14:19.333-0400 [initandlisten] recover : no journal files present, no recovery needed
 m31201| 2013-10-16T00:14:19.532-0400 [initandlisten] preallocateIsFaster=true 2.04
 m31201| 2013-10-16T00:14:19.732-0400 [initandlisten] preallocateIsFaster=true 2.12
 m31201| 2013-10-16T00:14:20.932-0400 [initandlisten] preallocateIsFaster=true 2.04
 m31201| 2013-10-16T00:14:20.932-0400 [initandlisten] preallocating a journal file /data/db/clusteredstale-rs1-1/journal/prealloc.0
 m31201| 2013-10-16T00:14:23.329-0400 [initandlisten] preallocating a journal file /data/db/clusteredstale-rs1-1/journal/prealloc.1
 m31201| 2013-10-16T00:14:25.806-0400 [initandlisten] preallocating a journal file /data/db/clusteredstale-rs1-1/journal/prealloc.2
(last log entry)

Things are pretty confusing since the log entries for this node are way out of order, but it looks like it took at least 5 minutes to allocate journal files here, and either we were still allocating or the process itself hung completely.

Comment by Matt Kangas [ 16/Oct/13 ]

Two more failures which occurred after commit 2e972e1

rev 94ce941611a
Buildbot Nightly Linux 64-bit Build #1618 (logfile)

 *******************************************
         Test : stale_clustered.js ...
         Date : Wed Oct 16 00:09:42 2013
                (output suppressed; see http://buildlogs.mongodb.org/build/525e1185d2a60f2feb000349/test/525e118dd2a60f2f45000821/)
 
command timed out: 10800 seconds without output, attempting to kill

rev 94ce941611a
Buildbot Nightly Linux RHEL 64-bit Build #712 (logfile)

 *******************************************
         Test : stale_clustered.js ...
         Date : Wed Oct 16 00:13:08 2013
                (output suppressed; see http://buildlogs.mongodb.org/build/525e1253d2a60f2f940011a8/test/525e125ad2a60f2f6e0011f8/)
                   36.9248 minutes
test /data/buildslaves/Linux_RHEL_64bit_Nightly/mongo/jstests/slowNightly/stale_clustered.js exited with status 253

Comment by auto [ 15/Oct/13 ]

Author:

{u'username': u'gregstuder', u'name': u'Greg Studer', u'email': u'greg@10gen.com'}

Message: SERVER-11177 buildbot stale_clustered.js make repl error codes constants
Branch: master
https://github.com/mongodb/mongo/commit/2e972e1e7de2966dd5e39ebd79f9fad8dcf5a78c

Comment by Daniel Pasette (Inactive) [ 15/Oct/13 ]

also appearing on bb osx and linux-64
http://buildbot.mongodb.org:8081/builders/Nightly%20OS%20X%2010.5%2064-bit/builds/1480/steps/smokeJsSlowNightly/logs/stdio

Comment by Matt Kangas [ 15/Oct/13 ]

We had a green build at 486efd8f13fc RHEL 5.7 64-bit slow_nightly_tests, followed by another failure.

So it's not completely deterministic.

9fcdfeb195c4 C++11 Ubuntu 1204 64-bit slow_nightly_tests also went green but I'm guessing we haven't actually fixed the issue.

Comment by Matt Kangas [ 14/Oct/13 ]

Also visible on the C++11 Ubuntu builder. And rather oddly, it seems to

486efd8f13fc C++11 Ubuntu 1204 64-bit slow_nightly_tests
a428f58a6fd4 C++11 Ubuntu 1204 64-bit slow_nightly_tests
3f2c4c4afd01 C++11 Ubuntu 1204 64-bit slow_nightly_tests

Commit before that (ae9897079d21) was green.

So the first failed commit was Benety's fix for the Solaris build?. I'm puzzled.

/cc benety.goh

Generated at Thu Feb 08 03:25:06 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.