[SERVER-712] Slave locks up Created: 07/Mar/10  Updated: 17/Mar/11  Resolved: 16/Jan/11

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 1.3.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ianiv Schweber Assignee: Eliot Horowitz (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

AMD 2352
62G RAM
CentOS 5.4


Attachments: HTML File webconsole.html    
Participants:

 Description   

Slave database locks up. Pasting email thread so far.

On Sun, Mar 7, 2010 at 8:45 PM, Ianiv Schweber <ischweber@nowpublic.com> wrote:
iostat shows nothing much is happening, mostly 0s all the time.

top now shows
13084 mongodb 18 0 8656m 7.1g 6.8g T 0.0 11.3 460:31.78 mongod

I'm pretty sure status was S before running gdb. That scrolled out of screen's buffer so I can't double check anymore.

Would you like me to open a new ticket and paste the conversation so far?

Ianiv Schweber
Software Developer
NowPublic.com

Public Key: http://www.blogaholics.ca/ianivpubkey.asc

On 2010-03-07, at 5:26 PM, Eliot Horowitz wrote:

Just added you to commercial support on jira.
is anything happening in top? what about "iostat -x 2"

On Sun, Mar 7, 2010 at 8:16 PM, Ianiv Schweber <ischweber@nowpublic.com> wrote:
I don't have access to the commercial support section yet. I do have an account at jira.mongodb.org with username ianiv.

I can't even start the JS console. It prints this and then just sits there.

[root@mongo2 ~]# mongo scan_stats
MongoDB shell version: 1.3.3
url: scan_stats
connecting to: scan_stats

Likewise, the web console doesn't repsond.

ctrl-c on gdb does nothing. The output of the gdb session is

[root@mongo2 mongodb]# gdb bin/mongod --pid 13084
<snip gdb version info>
(no debugging symbols found)
Attaching to program: /usr/local/mongodb/bin/mongod, process 13084
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
[New Thread 0x2b3bb0eddc40 (LWP 13084)]
[New Thread 0x4425b940 (LWP 13091)]
[New Thread 0x4385a940 (LWP 13090)]
[New Thread 0x42e59940 (LWP 13089)]

Ianiv Schweber
Software Developer
NowPublic.com

Public Key: http://www.blogaholics.ca/ianivpubkey.asc

On 2010-03-07, at 4:49 PM, Eliot Horowitz wrote:

Can you connect with a shell and do db.currentOp()

Can you view the web console?

in gdb, what happens if you hit ctrl-c?

do you have access to the commercial support section of jira yet?
would be great if we could move discussion there so its easier to track.

On Sun, Mar 7, 2010 at 7:26 PM, Ianiv Schweber <ischweber@nowpublic.com> wrote:
I attached gdb to the running mongod but I cannot interrupt it to run the
commands.
[root@mongo2 ~]# mount
/dev/sda5 on / type ext3 (rw,noatime)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda3 on /tmp type ext3 (rw)
/dev/sda1 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
nfsd on /proc/fs/nfsd type nfsd (rw)
[root@mongo2 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda5 807G 406G 359G 54% /
/dev/sda3 2.0G 36M 1.9G 2% /tmp
/dev/sda1 251M 22M 216M 10% /boot
tmpfs 32G 0 32G 0% /dev/shm
Ianiv Schweber
Software Developer
NowPublic.com

Public Key: http://www.blogaholics.ca/ianivpubkey.asc
On 2010-03-07, at 11:16 AM, Dwight Merriman wrote:

hi,

can you do the following with gdb:

$ gdb mongod --pid=<slave process mongod pid>
(gdb) bt
(gdb) info threads

then also can you run

mount
df -h

and email the above?

thanks

On Sun, Mar 7, 2010 at 1:28 PM, Ianiv Schweber <ischweber@nowpublic.com>
wrote:

Hi Dwight,

It looks like the slave locked up again, 1 hour 35 minutes ago. Last lines
in log:

Sun Mar 7 17:05:36 repl: applied 269 operations
Sun Mar 7 17:05:36 repl: end sync_pullOpLog syncedTo: Sun Mar 7 16:52:18
2010 4b93d9c2:26
Sun Mar 7 17:05:38 allocating new datafile
/usr/local/mongodb/data/scan_stats.8, filling with zeroes...

Just like before.

I'll keep it in this state for now in case it helps you figure out what is
going on. If not, let me know and I'll restart it.

Thanks,

Ianiv Schweber
Software Developer
NowPublic.com

Public Key: http://www.blogaholics.ca/ianivpubkey.asc

On 2010-03-04, at 1:26 PM, Narayan Newton wrote:

Ianiv will chime in here soon, but:

AMD 2352
62G RAM
CentOS 5.4
Mongo 1.3.3-64bit builds from mongodb.org

On Thu, Mar 4, 2010 at 1:18 PM, Dwight Merriman <dwight@10gen.com>
wrote:
ok.
what is your OS platform and rough versions of mongo?

On Thu, Mar 4, 2010 at 4:16 PM, Narayan Newton
<nnewton@tag1consulting.com>
wrote:

Nice to meet you Dwight,

We have had two issues recently, I'm currently investigating one (a
slave hung on filling a data file with 0s and appeared to deadlock)
and Ianiv has been working on the other, a corruption issue we have
had on multiple versions causing exceptions in the log.

We have "started over" with a new DB and the exceptions have not
recurred so far, however, we have kept the old data files around to
figure out what went wrong there. I've added Ianiv to the thread and
he can explain this issue in more detail.

Thanks!

-N



 Comments   
Comment by Ianiv Schweber [ 16/Mar/10 ]

I'm watching the issue but for some reason I didn't get an email (and not in spam folder) so I just saw this.

Tomorrow I'll recompile as you suggest. I'll see about getting you access to the server once it locks up again.

Comment by Eliot Horowitz (Inactive) [ 12/Mar/10 ]

is there anyway we can get access to the server when this is happening?

Comment by Eliot Horowitz (Inactive) [ 12/Mar/10 ]

This is very weird...

Could you try building from source with --d (will get us debugging symbols when attach with gdb)

Comment by Ianiv Schweber [ 11/Mar/10 ]

> db.currentOp()

{ "inprog" : [ ] }

Then I tried
> show dbs;
which hung. now:

> db.currentOp()
{
"inprog" : [
{
"opid" : 332129,
"active" : false,
"lockType" : "write",
"waitingForLock" : true,
"op" : "query",
"ns" : "?",
"query" :

{ "listDatabases" : 1 }

,
"client" : "127.0.0.1:30723",
"desc" : "conn"
}
]
}

output from mongostat:

insert/s query/s update/s delete/s getmore/s command/s mapped vsize res % locked % idx miss
0 0 0 0 0 2 36959 35362 15839 0 0
same line keeps repeating

Comment by Ianiv Schweber [ 11/Mar/10 ]

output of slave's web console

Comment by Eliot Horowitz (Inactive) [ 11/Mar/10 ]

Can you connect with the shell and db.currentOp()
Or try the web console?
Also - can you try hitting it with mongostat (the one included in the distro)

Comment by Ianiv Schweber [ 11/Mar/10 ]

Slave has locked up again. The last few log entries are:

Thu Mar 11 17:20:24 repl: applied 376 operations
Thu Mar 11 17:20:24 repl: end sync_pullOpLog syncedTo: Thu Mar 11 17:07:01 2010 4b992335:42
Thu Mar 11 17:20:24 repl: sync_pullOpLog local.oplog.$main syncedTo:Thu Mar 11 17:07:01 2010 4b992335:42
Thu Mar 11 17:20:24 repl: tailing=true
Thu Mar 11 17:20:24 repl: first op time received: 4b992335:43
Thu Mar 11 17:20:24 updated _lastSavedLocalTs to: 0:0
Thu Mar 11 17:20:24 Saving repl source:

{ host: "192.168.1.130", source: "main", syncedTo: Timestamp 1268327221000|67, localLogTs: Timestamp 0|0 }

Thu Mar 11 17:20:24 repl: applied 1 operations
Thu Mar 11 17:20:24 repl: end sync_pullOpLog syncedTo: Thu Mar 11 17:07:01 2010 4b992335:43
Thu Mar 11 17:20:25 repl: sync_pullOpLog local.oplog.$main syncedTo:Thu Mar 11 17:07:01 2010 4b992335:43
Thu Mar 11 17:20:25 repl: tailing=true
Thu Mar 11 17:20:25 repl: first op time received: 4b992335:44
Thu Mar 11 17:20:26 btree _insert: reusing unused key
Thu Mar 11 17:20:26 btree _insert: reusing unused key
Thu Mar 11 17:20:26 allocating new extent for scan_stats.url padding:1.7 lenWHdr: 334
Thu Mar 11 17:20:26 allocating new datafile /usr/local/mongodb/data/scan_stats.21, filling with zeroes...
Thu Mar 11 17:20:39 flushing mmmap
Thu Mar 11 17:21:39 flushing mmmap
Thu Mar 11 17:22:39 flushing mmmap
Thu Mar 11 17:23:39 flushing mmmap
Thu Mar 11 17:24:39 flushing mmmap
Thu Mar 11 17:25:39 flushing mmmap
Thu Mar 11 17:26:39 flushing mmmap

And the last entry repeats every minute. I'll keep it running so we can attempt to debug it again.

Generated at Thu Feb 08 02:54:56 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.