[SERVER-26157] deadlock during LVM snapshot Created: 19/Sep/16  Updated: 01/Dec/16  Resolved: 27/Sep/16

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.2.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Tomaz Beltram Assignee: Kelsey Schubert
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File 17.log     File diagnostic.data.tgz    
Operating System: ALL
Steps To Reproduce:

Difficult, happened just once.

Participants:

 Description   

I am testing backup of a standalone mongodb using LVM snapshot. I have journaling enabled so mongod is running and processing requests.

It happened just once that I run into a deadlock situation. There is one jbd2 and three mongod threads reported blocked in the syslog (attached with back traces) and another jdb2 and two more mongod threads after exactly two minutes.

Sep 15 17:03:01 dss2 kernel: [719046.057796] EXT4-fs (dm-5): mounted filesystem with ordered data mode. Opts: (null)
Sep 15 17:06:53 dss2 kernel: [719277.567904] INFO: task jbd2/dm-2-8:9048 blocked for more than 120 seconds.
Sep 15 17:06:53 dss2 kernel: [719277.568130] INFO: task mongod:23239 blocked for more than 120 seconds.
Sep 15 17:06:53 dss2 kernel: [719277.568267] INFO: task mongod:23242 blocked for more than 120 seconds.
Sep 15 17:06:53 dss2 kernel: [719277.568350] INFO: task mongod:23243 blocked for more than 120 seconds.
Sep 15 17:06:53 dss2 kernel: [719277.568397] INFO: task dmeventd:12427 blocked for more than 120 seconds.
Sep 15 17:06:53 dss2 kernel: [719277.568523] INFO: task kworker/u16:2:31890 blocked for more than 120 seconds.
Sep 15 17:06:53 dss2 kernel: [719277.568713] INFO: task tar:12446 blocked for more than 120 seconds.
Sep 15 17:08:53 dss2 kernel: [719397.567614] INFO: task jbd2/dm-2-8:9048 blocked for more than 120 seconds.
Sep 15 17:08:53 dss2 kernel: [719397.567731] INFO: task mongod:23239 blocked for more than 120 seconds.
Sep 15 17:08:53 dss2 kernel: [719397.567870] INFO: task mongod:23240 blocked for more than 120 seconds.

The last processing as reported in mongod.log was at 2016-09-15T17:08:27.808+0200, after that just the number of open connections accumulates.

I was not abble to terminate the tar process and unmount the snapshot. Also mongod was not stoppable and whole system had to be rebooted.

I am running Ubuntu 16.04.1 with mongod 3.2.9 on a 64bit system.



 Comments   
Comment by Kelsey Schubert [ 01/Dec/16 ]

Thanks for following up, tomazb! I'm glad you were able to identify the root cause.

Comment by Tomaz Beltram [ 29/Nov/16 ]

FYI: Its a kernel race indeed, reported to linux-lvm and launchpad.

Comment by Kelsey Schubert [ 27/Sep/16 ]

Hi tomazb,

Thank you for the additional information. I have taken a look at the diagnostic.data and do not see anything that indicates a bug in MongoDB. Since we are unable to reproduce this issue and the behavior may be explained a kernel bug, I am going to close this ticket. If you encounter this issue again, please let us know and we will continue to investigate.

Thank you,
Thomas

Comment by Tomaz Beltram [ 20/Sep/16 ]

Hi,

Thanks for your quick reply and for looking into this issue. As I mentioned it only happened once and I was not able to reproduce it even with more frequent backups and higher load on mongodb. The command for creating the snapshot is:

lvcreate --size 1G --snapshot --name mdb-backupsnap /dev/vgDB/db

The backup commands have some additional error checking but resolve to:

mount /dev/vgDB/mdb-backupsnap /mnt/snapshot
cd /mnt/snapshot
tar czf $backup-file *
cd -
umount /mnt/snapshot
lvremove -f /dev/vgDB/mdb-backupsnap

The diagnostic.data archive is attached.

Thanks again
Tomaz

Comment by Kelsey Schubert [ 19/Sep/16 ]

Hi tomazb,

Thank you for reporting this issue. So we can get a better understanding of what is happening here, would you please provide the following information?

  • An explicit walk-through of the steps and commands you executed to create the LVM snapshot when this issue occurred
  • An archive of the diagnostic.data directory

Thank you,
Thomas

Generated at Thu Feb 08 04:11:18 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.