[SERVER-2383] Mongod crashes when killing pid running on kernel 2.6.32-5-xen-amd64 Created: 20/Jan/11  Updated: 08/Mar/13  Resolved: 01/Oct/12

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 1.6.5
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Wouter D'Haeseleer Assignee: Eliot Horowitz (Inactive)
Resolution: Cannot Reproduce Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Xen DomU Guest running Debian Squeeze (kernel 2.6.32-5-xen-amd64)
MongoDB 1.6.5 downloaded from the website as binary
Configured mongo with replSet


Attachments: Text File messages.txt    
Operating System: Linux
Participants:

 Description   

Mongo is running fine, but when I run
kill $(cat /mongo/db/mongod.lock)

it sometimes (about 1 out of 4 times) seems to cause a kernel panic.
From some testing, it only seems to occur when the mongod has been added to a replica set cluster with the replSet option.
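For reference, the shutdown sequence described above can be sketched as follows. This is a safe stand-in, not the actual repro: a dummy `sleep` process and a lock file under /tmp take the place of mongod and /mongo/db/mongod.lock, so the script can be run anywhere without touching a database.

```shell
#!/bin/sh
# Sketch of the reported shutdown sequence, using a dummy process in place
# of mongod. /tmp/mongod.lock stands in for /mongo/db/mongod.lock, which
# mongod uses to record its pid.
sleep 60 &
echo $! > /tmp/mongod.lock

pid=$(cat /tmp/mongod.lock)
kill "$pid"               # no signal given, so SIGTERM is sent, as in the report
wait "$pid" 2>/dev/null   # reap the child; mongod would flush and exit here
rm -f /tmp/mongod.lock
```

With a real mongod the `kill` line is exactly the reporter's command; the panic occurred inside the kernel's network stack while mongod's exiting threads closed their replica-set sockets, not in mongod itself.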

This is the stack trace:

[58625.873310] alignment check: 0000 1 SMP
[58625.873317] last sysfs file: /sys/devices/virtual/net/lo/operstate
[58625.873320] CPU 0
[58625.873323] Modules linked in: snd_pcm snd_timer snd soundcore snd_page_alloc pcspkr evdev xfs exportfs xen_netfront xen_blkfront
[58625.873336] Pid: 8539, comm: mongod Not tainted 2.6.32-5-xen-amd64 #1
[58625.873339] RIP: e030:[<ffffffff81270c0b>] [<ffffffff81270c0b>] eth_type_trans+0x3d/0xae
[58625.873347] RSP: e02b:ffff880001c93988 EFLAGS: 00050246
[58625.873350] RAX: ffff88002efd20fc RBX: ffff88002e3b12e8 RCX: ffff88002efd20ee
[58625.873354] RDX: 0000000000000042 RSI: 000000000000000e RDI: ffff88002e3b12e8
[58625.873357] RBP: ffff88002fc3e800 R08: 0000000000000000 R09: 0000000000000000
[58625.873361] R10: 000000000000000e R11: ffffffff8125fbaf R12: ffff88002e3a2080
[58625.873364] R13: ffff88002fc3e800 R14: ffff88002fdea980 R15: ffffffff81350270
[58625.873371] FS: 00007ff239953710(0000) GS:ffff8800031ac000(0000) knlGS:0000000000000000
[58625.873375] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[58625.873378] CR2: 000000000080a45c CR3: 0000000001001000 CR4: 0000000000002660
[58625.873382] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[58625.873385] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[58625.873389] Process mongod (pid: 8539, threadinfo ffff880001c92000, task ffff88002eab2350)
[58625.873392] Stack:
[58625.873394] 0000000000000000 ffff88002fc3e800 ffff88002e3b12e8 ffffffff812398d0
[58625.873399] <0> 0000000000000000 ffff88002e3b12e8 ffff88002e3a2080 ffffffff8125f9e4
[58625.873407] <0> ffffffff8100ecdf 0000000000000000 ffff88002fdea980 ffff88002e3a2080
[58625.873414] Call Trace:
[58625.873418] [<ffffffff812398d0>] ? loopback_xmit+0x36/0x7a
[58625.873422] [<ffffffff8125f9e4>] ? dev_hard_start_xmit+0x211/0x2db
[58625.873428] [<ffffffff8100ecdf>] ? xen_restore_fl_direct_end+0x0/0x1
[58625.873432] [<ffffffff8125fe8c>] ? dev_queue_xmit+0x2dd/0x38d
[58625.873437] [<ffffffff81287483>] ? ip_queue_xmit+0x311/0x386
[58625.873487] [<ffffffffa004744d>] ? xfs_log_release_iclog+0x10/0x38 [xfs]
[58625.873498] [<ffffffffa00515f5>] ? _xfs_trans_commit+0x25f/0x2d1 [xfs]
[58625.873502] [<ffffffff8100e63d>] ? xen_force_evtchn_callback+0x9/0xa
[58625.873507] [<ffffffff81297e33>] ? tcp_transmit_skb+0x648/0x687
[58625.873511] [<ffffffff8100ecf2>] ? check_events+0x12/0x20
[58625.873515] [<ffffffff8129a2b5>] ? tcp_write_xmit+0x874/0x96c
[58625.873518] [<ffffffff8129a3fa>] ? __tcp_push_pending_frames+0x22/0x53
[58625.873523] [<ffffffff8128d7fd>] ? tcp_close+0x176/0x3d0
[58625.873528] [<ffffffff812aa2f8>] ? inet_release+0x4e/0x54
[58625.873533] [<ffffffff81251121>] ? sock_release+0x19/0x66
[58625.873536] [<ffffffff81251190>] ? sock_close+0x22/0x26
[58625.873541] [<ffffffff810f09c9>] ? __fput+0x100/0x1af
[58625.873545] [<ffffffff810ede06>] ? filp_close+0x5b/0x62
[58625.873549] [<ffffffff810508a0>] ? put_files_struct+0x64/0xc1
[58625.873553] [<ffffffff8105215d>] ? do_exit+0x22e/0x6c6
[58625.873557] [<ffffffff81052165>] ? do_exit+0x236/0x6c6
[58625.873560] [<ffffffff8105266b>] ? do_group_exit+0x76/0x9d
[58625.873565] [<ffffffff8105eef7>] ? get_signal_to_deliver+0x310/0x339
[58625.873570] [<ffffffff8101104f>] ? do_notify_resume+0x87/0x73f
[58625.873573] [<ffffffff8100b444>] ? xen_write_msr_safe+0x76/0xb1
[58625.873577] [<ffffffff810106c4>] ? __switch_to+0x1ad/0x297
[58625.873582] [<ffffffff81049045>] ? finish_task_switch+0x44/0xaf
[58625.873586] [<ffffffff81011e0e>] ? int_signal+0x12/0x17
[58625.873588] Code: 87 d8 00 00 00 2b 87 d0 00 00 00 be 0e 00 00 00 89 87 c4 00 00 00 e8 68 48 fe ff 8b 8b c4 00 00 00 48 03 8b d0 00 00 00 f6 01 01 <48> 8b 11 74 20 48 33 95 40 02 00 00 8a 43 7d 48 c1 e2 10 75 08
[58625.873630] RIP [<ffffffff81270c0b>] eth_type_trans+0x3d/0xae
[58625.873634] RSP <ffff880001c93988>
[58625.873639] --[ end trace f73fe61a27c51fab ]--
[58625.873641] Kernel panic - not syncing: Fatal exception in interrupt
[58625.873645] Pid: 8539, comm: mongod Tainted: G D 2.6.32-5-xen-amd64 #1
[58625.873648] Call Trace:
[58625.873652] [<ffffffff8130ac81>] ? panic+0x86/0x143
[58625.873657] [<ffffffff8130cb3a>] ? _spin_unlock_irqrestore+0xd/0xe
[58625.873661] [<ffffffff8100ecdf>] ? xen_restore_fl_direct_end+0x0/0x1
[58625.873664] [<ffffffff8130cb3a>] ? _spin_unlock_irqrestore+0xd/0xe
[58625.873668] [<ffffffff8104f3af>] ? release_console_sem+0x17e/0x1af
[58625.873672] [<ffffffff8130d9d5>] ? oops_end+0xa7/0xb4
[58625.873676] [<ffffffff81013416>] ? do_alignment_check+0x88/0x92
[58625.873680] [<ffffffff8125fbaf>] ? dev_queue_xmit+0x0/0x38d
[58625.873685] [<ffffffff811f1976>] ? HYPERVISOR_event_channel_op+0x11/0x50
[58625.873695] [<ffffffffa004d6f9>] ? xfs_icsb_modify_counters+0x7b/0x1a0 [xfs]
[58625.873699] [<ffffffff81012a75>] ? alignment_check+0x25/0x30
[58625.873703] [<ffffffff8125fbaf>] ? dev_queue_xmit+0x0/0x38d
[58625.873706] [<ffffffff81270c0b>] ? eth_type_trans+0x3d/0xae
[58625.873710] [<ffffffff81270bfb>] ? eth_type_trans+0x2d/0xae
[58625.873713] [<ffffffff812398d0>] ? loopback_xmit+0x36/0x7a
[58625.873717] [<ffffffff8125f9e4>] ? dev_hard_start_xmit+0x211/0x2db
[58625.873721] [<ffffffff8100ecdf>] ? xen_restore_fl_direct_end+0x0/0x1
[58625.873724] [<ffffffff8125fe8c>] ? dev_queue_xmit+0x2dd/0x38d
[58625.873728] [<ffffffff81287483>] ? ip_queue_xmit+0x311/0x386
[58625.873738] [<ffffffffa004744d>] ? xfs_log_release_iclog+0x10/0x38 [xfs]
[58625.873747] [<ffffffffa00515f5>] ? _xfs_trans_commit+0x25f/0x2d1 [xfs]
[58625.873752] [<ffffffff8100e63d>] ? xen_force_evtchn_callback+0x9/0xa
[58625.873755] [<ffffffff81297e33>] ? tcp_transmit_skb+0x648/0x687
[58625.873759] [<ffffffff8100ecf2>] ? check_events+0x12/0x20
[58625.873762] [<ffffffff8129a2b5>] ? tcp_write_xmit+0x874/0x96c
[58625.873766] [<ffffffff8129a3fa>] ? __tcp_push_pending_frames+0x22/0x53
[58625.873770] [<ffffffff8128d7fd>] ? tcp_close+0x176/0x3d0
[58625.873773] [<ffffffff812aa2f8>] ? inet_release+0x4e/0x54
[58625.873777] [<ffffffff81251121>] ? sock_release+0x19/0x66
[58625.873780] [<ffffffff81251190>] ? sock_close+0x22/0x26
[58625.873784] [<ffffffff810f09c9>] ? __fput+0x100/0x1af
[58625.873787] [<ffffffff810ede06>] ? filp_close+0x5b/0x62
[58625.873791] [<ffffffff810508a0>] ? put_files_struct+0x64/0xc1
[58625.873794] [<ffffffff8105215d>] ? do_exit+0x22e/0x6c6
[58625.873797] [<ffffffff81052165>] ? do_exit+0x236/0x6c6
[58625.873801] [<ffffffff8105266b>] ? do_group_exit+0x76/0x9d
[58625.873804] [<ffffffff8105eef7>] ? get_signal_to_deliver+0x310/0x339
[58625.873808] [<ffffffff8101104f>] ? do_notify_resume+0x87/0x73f
[58625.873812] [<ffffffff8100b444>] ? xen_write_msr_safe+0x76/0xb1
[58625.873815] [<ffffffff810106c4>] ? __switch_to+0x1ad/0x297
[58625.873819] [<ffffffff81049045>] ? finish_task_switch+0x44/0xaf
[58625.873822] [<ffffffff81011e0e>] ? int_signal+0x12/0x17



 Comments   
Comment by Eliot Horowitz (Inactive) [ 01/Oct/12 ]

If someone has another instance, please let us know.

Comment by Ian Whalen (Inactive) [ 20/Jul/12 ]

Roger, can you let us know which version of mongo you're running?

Comment by Roger Rohrbach [ 20/Jul/12 ]

Excerpt from /var/log/messages

Comment by Roger Rohrbach [ 20/Jul/12 ]

Saw this today on an Amazon EC2 instance (Linux version 3.2.22-35.60.amzn1.x86_64) after issuing `kill -TERM`.

Comment by Raymond Lu [ 02/Sep/11 ]

We've since worked around the issue by moving our production mongod processes to real hardware and `kill -9`-ing our test instances.

I may get around to trying that eventually, but it's not very easy to reproduce.

Comment by Eliot Horowitz (Inactive) [ 02/Sep/11 ]

This is likely a bug we fixed in 2.0 with replica set concurrency.
Can you try 2.0.0-rc1?

Comment by Raymond Lu [ 23/Jun/11 ]

Edited my comment above; 1.8.2.

Comment by Eliot Horowitz (Inactive) [ 23/Jun/11 ]

@raymond - which version of mongo?

Comment by Raymond Lu [ 23/Jun/11 ]

I get similar behavior with both "pkill -2 mongo" and "pkill mongo". 2.6.35 kernel, 1.8.2 mongo.

[ 3716.645935] Call Trace:
[ 3716.645941] [<ffffffff8123a515>] ? panic+0x8a/0x105
[ 3716.645948] [<ffffffff8100c216>] ? oops_end+0xa8/0xb5
[ 3716.645956] [<ffffffff8100a12e>] ? do_alignment_check+0x88/0x92
[ 3716.645963] [<ffffffff81009675>] ? alignment_check+0x25/0x30
[ 3716.645971] [<ffffffff811d792c>] ? eth_type_trans+0x46/0xb3
[ 3716.645978] [<ffffffff811a62f3>] ? loopback_xmit+0x36/0x75
[ 3716.645985] [<ffffffff811c7151>] ? dev_hard_start_xmit+0x25a/0x31f
[ 3716.645993] [<ffffffff811c76c3>] ? dev_queue_xmit+0x3a1/0x458
[ 3716.645999] [<ffffffff811ef28b>] ? ip_queue_xmit+0x2c4/0x30f
[ 3716.646006] [<ffffffff81110e1c>] ? do_get_write_access+0x385/0x3cc
[ 3716.646013] [<ffffffff810c4249>] ? __find_get_block+0x16b/0x17b
[ 3716.646021] [<ffffffff811ff4f3>] ? tcp_transmit_skb+0x6ce/0x70c
[ 3716.646028] [<ffffffff81201a4d>] ? tcp_write_xmit+0x80c/0x8fc
[ 3716.646035] [<ffffffff81005f7f>] ? xen_restore_fl_direct_end+0x0/0x1
[ 3716.646042] [<ffffffff8109e6a4>] ? __kmalloc+0x97/0xba
[ 3716.646049] [<ffffffff81005f92>] ? check_events+0x12/0x20
[ 3716.646055] [<ffffffff81201b86>] ? __tcp_push_pending_frames+0x18/0x44
[ 3716.646063] [<ffffffff811f4e36>] ? tcp_close+0x15a/0x37a
[ 3716.646069] [<ffffffff8121109c>] ? inet_release+0x6d/0x73
[ 3716.646076] [<ffffffff811b8549>] ? sock_release+0x19/0x6b
[ 3716.646082] [<ffffffff811b85bd>] ? sock_close+0x22/0x27
[ 3716.646089] [<ffffffff810a6db7>] ? fput+0xff/0x1a4
[ 3716.646095] [<ffffffff810a463f>] ? filp_close+0x5f/0x6a
[ 3716.646102] [<ffffffff81005f7f>] ? xen_restore_fl_direct_end+0x0/0x1
[ 3716.646109] [<ffffffff8103479a>] ? put_files_struct+0x67/0xc1
[ 3716.646116] [<ffffffff8123bf6e>] ? _raw_spin_lock_irq+0x7/0x1c
[ 3716.646123] [<ffffffff81035edb>] ? do_exit+0x22b/0x691
[ 3716.646129] [<ffffffff810363ba>] ? do_group_exit+0x79/0xa3
[ 3716.646136] [<ffffffff8103ee05>] ? get_signal_to_deliver+0x302/0x323
[ 3716.646143] [<ffffffff81007e35>] ? do_signal+0x6a/0x675
[ 3716.646150] [<ffffffff81005f7f>] ? xen_restore_fl_direct_end+0x0/0x1
[ 3716.646156] [<ffffffff81003edd>] ? xen_mc_flush+0x158/0x183
[ 3716.646163] [<ffffffff81005f7f>] ? xen_restore_fl_direct_end+0x0/0x1
[ 3716.646170] [<ffffffff81003357>] ? xen_end_context_switch+0xe/0x1c
[ 3716.646177] [<ffffffff81003138>] ? xen_write_msr_safe+0x5d/0x75
[ 3716.646183] [<ffffffff8100759d>] ? __switch_to+0x12f/0x21b
[ 3716.646190] [<ffffffff8100844d>] ? do_notify_resume+0xd/0x41
[ 3716.646196] [<ffffffff81008b90>] ? int_signal+0x12/0x17

Comment by Adrien Mogenet [ 03/Feb/11 ]

I got exactly the same bug (with 1.7.5 and 1.6.5), with the same Xen kernel.

Comment by Eliot Horowitz (Inactive) [ 21/Jan/11 ]

Can you attach a log file, if any?

Generated at Thu Feb 08 02:59:48 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.