Core Server / SERVER-8518

Recovering slave with journal causes Invalid BSONObj size assertions


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Works as Designed
    • Affects Version/s: 2.2.3
    • Fix Version/s: None
    • Labels:
      None
    • Environment:
      AWS EC2 m2.2xlarge instances. Instance has four IOPS volumes striped together.
    • Operating System:
      Linux
    • Steps To Reproduce:

      1) Create a 2.2.3 slave, bootstrap it from an existing 2.0.5 replica set
      2) Snapshot the slave live disks
      3) Create new slave from these snapshots
      4) Let the slave recover with journal


      Description

      I'm upgrading our production cluster from MongoDB 2.0.5 to 2.2.3. I set up a new slave (2.2.3) in the existing MongoDB 2.0.5 replica set and let it bootstrap itself over the network. After this I snapshotted the MongoDB storage volumes and created a new slave instance from these snapshots (to test recovery from backup).

      After the new instance booted, it started to recover itself from the journal. Immediately after recovery completed, the slave started to get assertions about Invalid BSONObj size, which eventually killed the slave.

      I've done the entire job twice, only to get exactly the same results. The slave's mongod.log is attached.
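For context on the assertion itself: every BSON document begins with its own total length as a little-endian int32, and mongod raises "Invalid BSONObj size" when that declared length is implausible for the buffer it sits in, which is typical of torn or inconsistent on-disk data. A minimal Python sketch of the byte layout (the helper name is mine, not MongoDB's):

```python
import struct

def bson_declared_size(buf: bytes) -> int:
    """First 4 bytes of a BSON document: its total size, little-endian int32."""
    return struct.unpack_from("<i", buf, 0)[0]

# Minimal BSON document {"a": 1}:
#   int32 size | 0x10 (int32 type) | "a\x00" | int32 value | 0x00 terminator
doc = b"\x0c\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x00"
assert bson_declared_size(doc) == len(doc) == 12

# A torn write can leave a length field that no longer matches the data;
# a mismatch like this is what the "Invalid BSONObj size" assertion catches.
corrupt = b"\xff\xff\xff\x7f" + doc[4:]
print(bson_declared_size(corrupt))  # 2147483647, far beyond the real length
```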

      The snapshots were done with RightScale block_device cookbook scripts. The actual steps are:
      1) Lock the underlying XFS filesystem
      2) Create LVM snapshot
      3) Unlock the underlying XFS filesystem
      4) After this, each EBS stripe under the LVM is ordered to make an EBS snapshot.
      This procedure is well tested by RightScale and should ensure that the snapshot is atomic and physically intact after the stripes are rejoined. The LVM snapshot is used to restore the volume.
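The four steps above can be sketched as a shell script. The mount point, LVM names, and EBS volume IDs below are placeholders (the actual runs used the RightScale block_device cookbook, which automates the equivalent sequence):

```shell
#!/usr/bin/env bash
set -euo pipefail

run() {  # echo instead of execute when DRY_RUN=1
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

snapshot_mongo_volume() {
  local mount=/mnt/mongodb            # assumed XFS mount backing dbpath
  local vg=vg_data lv=lv_mongo        # assumed LVM volume group / logical volume
  local volumes="vol-aaaa vol-bbbb"   # assumed EBS volumes in the stripe

  run xfs_freeze -f "$mount"          # 1) lock the XFS filesystem
  run lvcreate --snapshot -L 1G -n "${lv}_snap" "/dev/${vg}/${lv}"  # 2) LVM snapshot
  run xfs_freeze -u "$mount"          # 3) unlock the filesystem
  for vol in $volumes; do             # 4) EBS-snapshot each stripe member
    run aws ec2 create-snapshot --volume-id "$vol" \
        --description "mongodb stripe $vol"
  done
}
```

Setting DRY_RUN=1 prints the command sequence instead of executing it, which is a cheap way to review the plan before touching production volumes.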

      My plan is to do a rolling upgrade:
      1) First add a second slave running 2.2.3
      2) Replace the old slave with 2.2.3 by bootstrapping it from a snapshot of the already created slave and letting it catch up after recovering from the journal
      3) Step the old primary down and repeat the same procedure for it.
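In mongo shell terms, the rolling upgrade above uses the standard replica-set helpers; hostnames here are placeholders:

```javascript
// Run in the mongo shell against the current PRIMARY.

// 1) Add the new 2.2.3 slave to the replica set:
rs.add("new-slave-a.example.com:27017")

// 2) Once the snapshot-restored 2.2.3 replacement shows SECONDARY and its
//    optime has caught up (check rs.status()), remove the old 2.0.5 slave:
rs.remove("old-slave.example.com:27017")

// 3) Step the old primary down so a 2.2.3 member is elected, then
//    upgrade/replace it the same way:
rs.stepDown(60)   // seconds the node refuses re-election
```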

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              james.wahlin James Wahlin
              Reporter:
              garo Juho Mäkinen
              Participants:
              Votes:
              1
              Watchers:
              5

                Dates

                Created:
                Updated:
                Resolved: