Major - P3
AWS EC2 m2.2xlarge instances. Each instance has four Provisioned IOPS EBS volumes striped together.
I'm upgrading our production cluster from MongoDB 2.0.5 to 2.2.3. I set up a new slave (2.2.3) in the existing MongoDB 2.0.5 replica set and let it bootstrap itself over the network. After this I snapshotted the MongoDB storage volumes and created a new slave instance from the snapshots (to test recovery from backup).
After the new instance booted, it started to recover from the journal. Immediately after recovery completed, the slave started to hit assertions about "Invalid BSONObj size", which eventually killed the slave.
I've done the entire job twice, only to get exactly the same results. The slave's mongod.log is attached.
The snapshots were done with RightScale block_device cookbook scripts. The actual steps are:
1) Lock the underlying XFS filesystem
2) Create LVM snapshot
3) Unlock the underlying XFS filesystem
4) After this, each EBS stripe under LVM is ordered to make an EBS snapshot.
This procedure is well tested by RightScale and should ensure that the snapshot is atomic and physically intact after the stripes are rejoined. The LVM snapshot is used to restore the volume.
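For reference, the four steps can be sketched roughly as below. This is only an illustration of the sequence and not the actual RightScale block_device cookbook code. The mount point, LV path, snapshot name, snapshot size and volume IDs are hypothetical placeholders, and using the `aws` CLI for the EBS snapshots is an assumption made for the sketch.

```python
import subprocess


def snapshot_commands(mount_point, lv_path, snap_name, snap_size, ebs_volume_ids):
    """Build the command list for: freeze XFS, LVM snapshot, thaw XFS, EBS snapshots."""
    commands = [
        # 1) Lock (freeze) the XFS filesystem so no writes reach the volume.
        ["xfs_freeze", "-f", mount_point],
        # 2) Create the LVM snapshot of the logical volume.
        ["lvcreate", "--snapshot", "--name", snap_name, "--size", snap_size, lv_path],
        # 3) Unlock (thaw) the filesystem again.
        ["xfs_freeze", "-u", mount_point],
    ]
    # 4) Ask for an EBS snapshot of every stripe under LVM (assumed: aws CLI).
    for volume_id in ebs_volume_ids:
        commands.append(
            ["aws", "ec2", "create-snapshot", "--volume-id", volume_id,
             "--description", "stripe of " + snap_name]
        )
    return commands


def run_snapshot(mount_point, lv_path, snap_name, snap_size, ebs_volume_ids,
                 runner=subprocess.check_call):
    """Run the steps in order; the filesystem is thawed even if lvcreate fails."""
    freeze, lv_snapshot, thaw, *ebs_snapshots = snapshot_commands(
        mount_point, lv_path, snap_name, snap_size, ebs_volume_ids
    )
    runner(freeze)
    try:
        runner(lv_snapshot)
    finally:
        runner(thaw)
    for command in ebs_snapshots:
        runner(command)
```

Passing a different `runner` (for example `print`) gives a dry run that only shows the commands in the order they would be executed.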
My plan is to do a rolling upgrade:
1) First add a second slave running 2.2.3
2) Replace the old slave with 2.2.3 by bootstrapping it from a snapshot of the already created slave and letting it catch up after recovering from the journal
3) Step the old primary down and do the same for it.
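Between the steps of this plan, one way to decide whether a restored slave has caught up is to compare its `optimeDate` with the primary's in the `replSetGetStatus` output. A minimal sketch, assuming the status document has already been fetched (for example with `rs.status()` or a driver) and that `optimeDate` values are datetimes; the member names and the 10 second threshold are hypothetical:

```python
from datetime import datetime


def member_lag_seconds(status, member_name):
    """Seconds by which member_name trails the primary in a replSetGetStatus document."""
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    member = next(m for m in members if m["name"] == member_name)
    return (primary["optimeDate"] - member["optimeDate"]).total_seconds()


def caught_up(status, member_name, max_lag_seconds=10):
    """True if the member is a SECONDARY and within max_lag_seconds of the primary."""
    member = next(m for m in status["members"] if m["name"] == member_name)
    if member["stateStr"] != "SECONDARY":
        return False  # still RECOVERING, STARTUP2, etc.
    return member_lag_seconds(status, member_name) <= max_lag_seconds


# Hypothetical status document with only the fields used above.
example_status = {
    "members": [
        {"name": "db1:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2013, 3, 1, 12, 0, 30)},
        {"name": "db3:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2013, 3, 1, 12, 0, 25)},
    ]
}
print(caught_up(example_status, "db3:27017"))
```

Only once the new slave reports as caught up would the next member be replaced or the primary stepped down.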