[SERVER-28054] WiredTiger data corruption in 3.4.x Created: 18/Feb/17  Updated: 27/Oct/23  Resolved: 29/Mar/17

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.4.0, 3.4.1, 3.4.2
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Rostyslav Mykhajliw Assignee: David Hows
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: fstab, mongo.tar.gz, mongod.conf, mongod.log, mongodb.service, mongodb.tar.gz, syslog
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

    tmpfile = "/tmp/disable-transparent-hugepages"
    put IO.read("config/deploy/plugin/mongo/init.d/disable-transparent-hugepages"), tmpfile
    run "sudo mv #{tmpfile} /etc/init.d/disable-transparent-hugepages"
    run "sudo chmod 755 /etc/init.d/disable-transparent-hugepages"
    run "sudo update-rc.d disable-transparent-hugepages defaults"
    run "sudo /etc/init.d/disable-transparent-hugepages start"
 
    run "sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 0C49F3730359A14518585931BC711F9BA15703C6"
    run "echo \"deb http://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/3.4 multiverse\" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.4.list"
 
    run "sudo apt-get update"
    run "sudo apt-get install -y --allow-unauthenticated  mongodb-org"
 
    run "sudo service mongod stop || true"
 
    tmpfile = "/tmp/mongodb.conf"
    put IO.read("config/deploy/plugin/mongo/mongod.conf"), tmpfile
    run "sudo mv #{tmpfile} /etc/mongod.conf"
 
    tmpfile = "/tmp/mongodb.service"
    put IO.read("config/deploy/plugin/mongo/systemd/mongodb.service"), tmpfile
    run "sudo mv #{tmpfile} /etc/systemd/system/mongodb.service"
 
    run "sudo systemctl daemon-reload"
    run "sudo systemctl enable mongodb.service"
 
    run "sudo systemctl start mongodb || true"
 
    

then restart instance.

Participants:

 Description   

MongoDB with WiredTiger corrupts its storage when the instance is restarted.
Attached: config, log, stored data, fstab, and mongodb.service. Both bfs and ext4 are affected (I've tested both).

I'm running MongoDB on m4.large instances on AWS, region us-west-2 (Oregon).



 Comments   
Comment by David Hows [ 29/Mar/17 ]

Thanks Rostyslav,

I'm marking this issue as resolved.

Comment by Rostyslav Mykhajliw [ 28/Mar/17 ]

The issue is resolved. The problem was on my side, in the order of commands in the deploy script: in some cases the filesystem mount was executed after MongoDB had been installed and started. That led to the files of a running MongoDB being copied, and so to data corruption.
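
One way to guard a deploy script against this ordering problem is to check that the dbpath is already a mount point before installing or starting MongoDB. The following is only a minimal sketch under stated assumptions: the function names `dbpath_is_mounted` and `check_dbpath_before_start` are hypothetical, and the mounts table is assumed to be in the Linux `/proc/self/mounts` format, where the second field is the mount point.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original deploy script).
# Succeeds only if the given path is listed as a mount point in the
# mounts table. The table defaults to /proc/self/mounts (Linux); a
# different file can be passed as the second argument for testing.
dbpath_is_mounted() {
    dbpath=$1
    mounts_file=${2:-/proc/self/mounts}
    # Field 2 of each line in the mounts table is the mount point.
    awk -v target="$dbpath" \
        '$2 == target { found = 1 } END { exit !found }' "$mounts_file"
}

# Hypothetical deploy step: report whether it is safe to continue.
# Returns non-zero when the dbpath is not mounted yet.
check_dbpath_before_start() {
    if dbpath_is_mounted "$1" "${2:-/proc/self/mounts}"; then
        echo "dbpath $1 is mounted"
    else
        echo "dbpath $1 is not mounted; aborting" >&2
        return 1
    fi
}
```

In a deploy script this could be chained as `check_dbpath_before_start /var/lib/mongodb && sudo systemctl start mongodb`, so that the service is never started against an unmounted directory.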

Comment by David Hows [ 27/Mar/17 ]

necromant2005@gmail.com, can you elaborate on the problem you have seen/solved? If there is still an underlying issue with MongoDB or WiredTiger, it is definitely something I would want to get to the bottom of.

Thanks

Comment by Rostyslav Mykhajliw [ 27/Mar/17 ]

The problem was in the setup recipe; it is solved with an update.

Comment by David Hows [ 27/Mar/17 ]

necromant2005@gmail.com, can you review my process and confirm if any of the steps I have taken differ from yours?

At this stage I have been unable to reproduce the issue as you have described.

Comment by David Hows [ 27/Mar/17 ]

I've attempted to reproduce this and was not able to make MongoDB fail in the same manner.

Process I used:

  1. Started a new m4.xlarge instance running the vanilla Ubuntu 16.10 AMI and added 1x 40GB EBS device
  2. Followed the steps here to disable THP
  3. Ran the blob below to install MongoDB and set up devices
  4. Restarted the instance from the AWS console

Blob

sudo mkfs.xfs /dev/xvdb
sudo mkdir /var/lib/mongodb
sudo mount /dev/xvdb /var/lib/mongodb
sudo chmod 777 /var/lib/mongodb
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 0C49F3730359A14518585931BC711F9BA15703C6
echo "deb http://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/3.4 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.4.list
sudo apt-get update
sudo apt-get install -y --allow-unauthenticated  mongodb-org
sudo systemctl link /lib/systemd/system/mongod.service
sudo systemctl enable mongod
sudo systemctl daemon-reload
sudo systemctl restart mongod

I don't have the config used by Rostyslav Mykhajliw for his MongoDB instance; I have just used the default found in the MongoDB d

Comment by Alexander Gorrod [ 23/Mar/17 ]

david.hows Please try reproducing this failure using exactly the steps outlined by the user.

Comment by Rostyslav Mykhajliw [ 28/Feb/17 ]

There are no issues on Ubuntu 16.04. I suppose the issue is somehow related to systemd, because that's the native replacement for init.d.

Comment by Rostyslav Mykhajliw [ 28/Feb/17 ]

1. ami-a49b1bc4 (ubuntu 16.10)
2. ebs general ssd
3.
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
udev on /dev type devtmpfs (rw,nosuid,relatime,size=4076980k,nr_inodes=1019245,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,noexec,relatime,size=817320k,mode=755)
/dev/xvda1 on / type ext4 (rw,relatime,data=ordered)
securityfs on /sys/kernel/security type securityfs (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k)
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=10977)
mqueue on /dev/mqueue type mqueue (rw,relatime)
debugfs on /sys/kernel/debug type debugfs (rw,relatime)
hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime)
configfs on /sys/kernel/config type configfs (rw,relatime)
fusectl on /sys/fs/fuse/connections type fusectl (rw,relatime)
/dev/xvdc on /var/lib/ssdb type ext4 (rw,relatime,data=ordered)
/dev/xvdb on /var/lib/mongodb type xfs (rw,relatime,attr2,inode64,noquota)
lxcfs on /var/lib/lxcfs type fuse.lxcfs (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
tmpfs on /run/user/1000 type tmpfs (rw,nosuid,nodev,relatime,size=817316k,mode=700,uid=1000,gid=1000)

4. Go to AWS console and click on "restart instance" link
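
To read the device and filesystem type for the dbpath out of a mount listing like the one in item 3, a small filter can help. This is a sketch only: the function name `mount_info` is hypothetical, and it assumes the `<device> on <mountpoint> type <fstype> (<options>)` line format shown above.

```shell
#!/bin/sh
# Hypothetical helper: read `mount`-style output on stdin and print
# "<device> <fstype>" for the mount point given as the first argument.
# Expected line format:
#   <device> on <mountpoint> type <fstype> (<options>)
mount_info() {
    awk -v target="$1" \
        '$2 == "on" && $3 == target && $4 == "type" { print $1, $5 }'
}
```

For example, `mount | mount_info /var/lib/mongodb` run against the listing above would print `/dev/xvdb xfs`.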

Comment by Alexander Gorrod [ 22/Feb/17 ]

necromant2005@gmail.com We are having difficulty reproducing the behavior you report, and it is not obvious how the symptoms could occur from inspecting and reasoning about the code. Could you please let us know:

  • The AMI you are using (if it's public), or any interesting details of the AMI apart from the OS.
  • What device is mounted on /dev/xvdb (which is mapped as /var/lib/mongodb). e.g: is it an EBS volume?
  • Please add the output of mount to the ticket.
  • Details about how you are restarting the instance.

Comment by Rostyslav Mykhajliw [ 21/Feb/17 ]

Hi Michael,

It may be because I'm using Ubuntu 16.10.
Journalling doesn't help.
Attached mongo.tar.gz, taken before the restart. The scariest part is that every instance restart leads to data corruption.

Cheers,

Comment by Michael Cahill (Inactive) [ 21/Feb/17 ]

necromant2005@gmail.com, thanks for this report, the WiredTiger metadata file you uploaded has been truncated.

Can you create a tarball of the dbpath before restarting the instance (i.e., while it is running)? It would be good to compare that to the files after restart to narrow down when the truncation happens.

One further question: if you run with journal enabled (the default setting), do you still see this issue?
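
The before/after comparison requested above could be sketched roughly as follows. This is an illustration under assumptions, not a prescribed procedure: the function name `dbpath_manifest` is hypothetical, and it relies on the POSIX `cksum` utility, which prints a checksum, a byte count, and the file name for each file.

```shell
#!/bin/sh
# Hypothetical helper: print one "checksum size path" line for every
# regular file under the given directory, sorted by path, so that two
# manifests taken at different times can be compared with diff.
dbpath_manifest() {
    ( cd "$1" && find . -type f -exec cksum {} + | sort -k 3 )
}
```

A possible usage, assuming the dbpath is `/var/lib/mongodb`: run `dbpath_manifest /var/lib/mongodb > before.txt` while mongod is running, run it again into `after.txt` once the instance has restarted, and then `diff before.txt after.txt`. Checksums of live files are expected to differ; a file whose size drops to 0 or shrinks sharply would point at the truncation. A tarball of the same directory can be taken with `tar -C /var/lib/mongodb -czf mongo.tar.gz .`.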

Comment by Rostyslav Mykhajliw [ 18/Feb/17 ]

syslog

Generated at Thu Feb 08 04:17:00 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.