[SERVER-4544] "unable to connect to mongo program on port 31000" in auth2.js test Created: 22/Dec/11  Updated: 29/May/12  Resolved: 23/Jan/12

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ian Whalen (Inactive) Assignee: Eric Milkie
Resolution: Done Votes: 0
Labels: buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-4586 TypeError in in replica set bridging ... Closed
is related to SERVER-4548 Print more info on segfault Closed
Operating System: Linux
Participants:

 Description   

http://buildbot.mongodb.org/builders/Linux%2064-bit/builds/3968/steps/test_1/logs/stdio

...
@/mnt/home/buildbot/slave/Linux_64bit/mongo/jstests/replsets/auth2.js:101
Wed Dec 21 19:59:21 uncaught exception: assert.soon failed: function () {
    try {
        m = new Mongo("127.0.0.1:" + port);
        return true;
    } catch (e) {
    }
    return false;
}, msg:unable to connect to mongo program on port 31000
failed to load: /mnt/home/buildbot/slave/Linux_64bit/mongo/jstests/replsets/auth2.js
...
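
The failure message above comes from a polling assertion: the test retries a connection attempt until it succeeds or a deadline passes. A minimal sketch of that pattern in plain JavaScript follows. The helper name `soon`, its default timeout and interval, and the injected `sleep` are hypothetical and chosen for illustration. This is not the shell's actual `assert.soon` implementation.

```javascript
// Hypothetical helper illustrating the retry-until-deadline pattern.
// It is NOT the mongo shell's assert.soon; names and defaults are assumptions.
function soon(predicate, msg, timeoutMs = 30000, intervalMs = 200, sleep = busySleep) {
    const deadline = Date.now() + timeoutMs;
    while (true) {
        // Succeed as soon as the predicate reports true.
        if (predicate()) {
            return;
        }
        // Give up once the deadline has passed, reporting the caller's message.
        if (Date.now() >= deadline) {
            throw new Error("soon failed: " + predicate + ", msg:" + msg);
        }
        sleep(intervalMs);
    }
}

// Crude blocking sleep so the sketch runs without extra dependencies.
function busySleep(ms) {
    const end = Date.now() + ms;
    while (Date.now() < end) {
        // spin
    }
}
```

With a loop like this, a process that has died behind the port never makes the predicate return true, so the test reports only a connection failure once the timeout expires, not the underlying crash.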



 Comments   
Comment by Eric Milkie [ 23/Jan/12 ]

The failure went away again. I could not get the Windows build to crash.

Comment by Eric Milkie [ 31/Dec/11 ]

I'm hoping to get this test (and others) running under Windows so that we'll have multiple angles to attack this. Perhaps Windows will detect something or the stack trace will be different.

Comment by Spencer Brody (Inactive) [ 30/Dec/11 ]

Submitted d9d0b431f225c4e3369a6e320b2ee45ff6d8c9be to print the thread name on segfault, so hopefully we'll have a little more information the next time this happens.

Comment by Spencer Brody (Inactive) [ 30/Dec/11 ]

This has now happened in a different test, replsets/majority.js. It is tracked in SERVER-4586.

Comment by Eric Milkie [ 27/Dec/11 ]

The test failed again on Linux 64-bit:

http://buildbot.mongodb.org:8081/builders/Linux%2064-bit/builds/3994/steps/test_1/logs/stdio

...
 m31000| Tue Dec 27 11:51:57 [rsHealthPoll] replSet member ip-10-110-9-236:31001 is up
 m31000| Tue Dec 27 11:51:57 [rsHealthPoll] replSet member ip-10-110-9-236:31001 is now in state RECOVERING
 m31000| Tue Dec 27 11:51:57 Invalid access at address: 0
 m31000| 
 m31000| Tue Dec 27 11:51:57 Got signal: 11 (Segmentation fault).
 m31000| 
 m31000| Tue Dec 27 11:51:57 Backtrace:
 m31000| 0xb4adb4 0xb4fadc 0x2aaaaacd5540 
 m31000|  /mnt/home/buildbot/slave/Linux_64bit/mongo/mongod(_ZN5mongo10abruptQuitEi+0x3d4) [0xb4adb4]
 m31000|  /mnt/home/buildbot/slave/Linux_64bit/mongo/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x22c) [0xb4fadc]
 m31000|  /lib64/libpthread.so.0 [0x2aaaaacd5540]
 m31000| 
 m31000| Tue Dec 27 11:51:57 Invalid access at address: 0
 m31000| 
 m31000| Tue Dec 27 11:51:57 Tue Dec 27 11:51:57 Got signal: 11 (Segmentation fault).

Interestingly, the last line in the snippet above is the last thing we hear from 31000. I guess that if we hit a recursive signal 11, we just abort immediately after the first one?

Comment by Spencer Brody (Inactive) [ 22/Dec/11 ]

The problem is that a node segfaulted on startup:

m31000| Wed Dec 21 19:49:21 Invalid access at address: 0x2aaaab9c6ad0
 m31000| 
 m31000| Wed Dec 21 19:49:21 Got signal: 11 (Segmentation fault).
 m31000| 
 m31002| Wed Dec 21 19:49:21 [conn3] end connection 10.110.9.236:50654 (3 connections now open)
 m31000| Wed Dec 21 19:49:21 Backtrace:
 m31000| 0xb4c164 0xb50b2c 0x2aaaaacd5540 0x2aaaab9c6ad0 
 m31000|  /mnt/home/buildbot/slave/Linux_64bit/mongo/mongod(_ZN5mongo10abruptQuitEi+0x3d4) [0xb4c164]
 m31000|  /mnt/home/buildbot/slave/Linux_64bit/mongo/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x22c) [0xb50b2c]
 m31000|  /lib64/libpthread.so.0 [0x2aaaaacd5540]
 m31000|  /lib64/libc.so.6 [0x2aaaab9c6ad0]

Unfortunately that stack trace tells us very little about what went wrong.
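
To make frames like these easier to read, here is a rough sketch that pulls them out of the shell-prefixed log lines and decodes the simplest form of mangled C++ name. The functions `parseFrames` and `demangleNestedName` are hypothetical helpers written for this ticket's log format. The decoder handles only plain nested names of the form `_ZN<len><name>...E` and drops parameter types. A tool such as `c++filt` gives the full signature.

```javascript
// Hypothetical helper: decode a simple Itanium-ABI nested name such as
// "_ZN5mongo10abruptQuitEi" into "mongo::abruptQuit". Anything else
// (templates, substitutions, non-nested names) is returned unchanged.
function demangleNestedName(symbol) {
    if (!symbol.startsWith("_ZN")) {
        return symbol;
    }
    let pos = 3;
    const parts = [];
    while (pos < symbol.length && symbol[pos] !== "E") {
        const m = /^\d+/.exec(symbol.slice(pos));
        if (!m) {
            return symbol; // unsupported construct, give up
        }
        const len = parseInt(m[0], 10);
        pos += m[0].length;
        parts.push(symbol.slice(pos, pos + len));
        pos += len;
    }
    return parts.join("::");
}

// Hypothetical helper: extract "module(symbol+offset) [address]" frames
// from lines prefixed with " m<port>| ", as they appear in this ticket.
function parseFrames(logText) {
    const frameRe =
        /^\s*m(\d+)\|\s+(\S+?)(?:\(([A-Za-z0-9_]+)\+(0x[0-9a-f]+)\))?\s+\[(0x[0-9a-f]+)\]\s*$/;
    const frames = [];
    for (const line of logText.split("\n")) {
        const m = frameRe.exec(line);
        if (!m) {
            continue; // not a frame line (timestamps, raw address list, ...)
        }
        frames.push({
            port: Number(m[1]),
            module: m[2],
            symbol: m[3] ? demangleNestedName(m[3]) : null,
            address: m[5],
        });
    }
    return frames;
}
```

Applied to the backtrace above, the only named frames decode to `mongo::abruptQuit` and `mongo::abruptQuitWithAddrSignal`, which by their names appear to belong to the crash handler itself. The remaining frames are unnamed addresses in libpthread and libc, which fits the observation that the trace says little about the original fault.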
