Secondary node startup failed with data copied from an up-to-date node


    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None

       

      Because secondary1 in our MongoDB replica set lagged far behind the primary (by more than one hour) while secondary2 was up to date, we decided to back up secondary2's data files and restore them onto secondary1.

      The official manual describes backing up with cp or rsync as follows:

      If your storage system does not support snapshots, you can copy the files directly using cp, rsync, or a similar tool. Since copying multiple files is not an atomic operation, you must stop all writes to the mongod before copying the files. Otherwise, you will copy the files in an invalid state.

      Backups produced by copying the underlying data do not support point in time recovery for replica sets and are difficult to manage for larger sharded clusters. Additionally, these backups are larger because they include the indexes and duplicate underlying storage padding and fragmentation. mongodump, by contrast, creates smaller backups.

      https://docs.mongodb.com/manual/core/backups/#back-up-with-cp-or-rsync
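A toy model may help show why the manual insists on stopping writes before a file-level copy. The sketch assumes a hypothetical two-file "database" in which meta.txt stores the SHA-256 of data.bin; it is not WiredTiger's on-disk format, and write_state, is_consistent, and copy_file_by_file are illustrative names only. Copying the files one at a time while a write lands between the two copies yields a pair that no longer agrees, even though each file was copied intact.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def write_state(db: Path, payload: bytes) -> None:
    """Hypothetical toy layout: data.bin plus meta.txt holding data.bin's SHA-256."""
    (db / "data.bin").write_bytes(payload)
    (db / "meta.txt").write_text(hashlib.sha256(payload).hexdigest())


def is_consistent(db: Path) -> bool:
    """True if meta.txt matches the data.bin it describes."""
    digest = hashlib.sha256((db / "data.bin").read_bytes()).hexdigest()
    return (db / "meta.txt").read_text() == digest


def copy_file_by_file(src: Path, dst: Path, write_between=None) -> None:
    """Copy the two files one at a time; optionally run a 'writer' between the copies."""
    dst.mkdir(parents=True, exist_ok=True)
    for i, name in enumerate(["data.bin", "meta.txt"]):
        shutil.copy2(src / name, dst / name)
        if i == 0 and write_between is not None:
            write_between()  # simulates a write landing mid-copy


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src, quiet, racy = Path(tmp, "src"), Path(tmp, "quiet"), Path(tmp, "racy")
        src.mkdir()
        write_state(src, b"v1")
        copy_file_by_file(src, quiet)  # no writes during the copy
        copy_file_by_file(src, racy, write_between=lambda: write_state(src, b"v2"))
        print(is_consistent(quiet), is_consistent(racy))  # prints: True False
```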

      Following the official MongoDB manual, we performed the backup with these steps:

      1. Run db.fsyncLock() on secondary2.
      2. Remove all data files from secondary1's dbPath.
      3. Copy all data files in secondary2's dbPath to secondary1 with cp.
      4. Start secondary1.
      5. secondary1 failed to start.
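
The reported procedure can be sketched in Python roughly as below, with a byte-level check that the source did not change while it was being copied. This mirrors the reporter's steps and is not a verified-safe backup method; the issue itself is that this procedure led to a failed startup. Assumptions: secondary1 is already stopped, the lock and unlock actions are passed in as callables (the commented PyMongo lines are one hypothetical way to supply them), and copy_locked, fingerprint, and sha256_of are illustrative names. The original steps never release the lock; db.fsyncUnlock() is the counterpart of db.fsyncLock(), so the sketch always unlocks in a finally block.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Callable, Dict


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large data files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fingerprint(root: Path) -> Dict[str, str]:
    """Map every file under root (as a relative path) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def copy_locked(
    src: Path,
    dst: Path,
    lock: Callable[[], object],
    unlock: Callable[[], object],
) -> None:
    """Lock writes on the source node, copy its dbPath to dst, then unlock.

    Raises RuntimeError if dst already holds files, if any source file changed
    while the copy ran, or if the copy does not match the source.
    """
    if dst.exists() and any(dst.iterdir()):
        raise RuntimeError(f"{dst} is not empty; remove the old data files first")
    lock()
    try:
        before = fingerprint(src)
        shutil.copytree(src, dst, dirs_exist_ok=True)
        after = fingerprint(src)
    finally:
        unlock()  # release the lock even if the copy raised
    if before != after:
        changed = sorted(n for n in before.keys() | after.keys() if before.get(n) != after.get(n))
        raise RuntimeError(f"source files changed during copy: {changed}")
    if fingerprint(dst) != before:
        raise RuntimeError("copied files do not match the source snapshot")


# Hypothetical usage with PyMongo; host name and paths are placeholders:
# from pymongo import MongoClient
# client = MongoClient("secondary2.example.net", 27017, directConnection=True)
# copy_locked(
#     Path("/data/secondary2"), Path("/data/secondary1"),
#     lock=lambda: client.admin.command("fsync", lock=True),
#     unlock=lambda: client.admin.command("fsyncUnlock"),
# )
```

Hashing the whole dbPath twice lengthens the time the lock is held, and a passing check only shows that the bytes did not change during the copy window; it does not prove that WiredTiger considers the copied files consistent.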

       

      It seems that startup failed because the WiredTiger metadata in WiredTiger.wt was corrupted.

      While monitoring secondary2's WiredTiger.wt file, we noticed that it kept changing after we ran db.fsyncLock() on secondary2.
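
One way to reproduce this observation is to record the size and modification time of every file in secondary2's dbPath right after db.fsyncLock() returns, wait, record them again, and list the files that differ. The sketch below is a minimal, hypothetical helper (snapshot, changed_files, and watch are illustrative names, and the path in the comment is a placeholder). It relies on file metadata only, so a rewrite that keeps the same size within the filesystem's timestamp granularity could go unnoticed; hashing file contents is stronger but slower on large data files.

```python
import time
from pathlib import Path
from typing import Dict, List, Tuple

Stat = Tuple[int, int]  # (size in bytes, modification time in ns)


def snapshot(db_path: Path) -> Dict[str, Stat]:
    """Record size and mtime of every regular file under db_path."""
    snap = {}
    for path in db_path.rglob("*"):
        if path.is_file():
            st = path.stat()
            snap[str(path.relative_to(db_path))] = (st.st_size, st.st_mtime_ns)
    return snap


def changed_files(before: Dict[str, Stat], after: Dict[str, Stat]) -> List[str]:
    """Files added, removed, or modified between two snapshots."""
    return sorted(n for n in before.keys() | after.keys() if before.get(n) != after.get(n))


def watch(db_path: Path, seconds: float) -> List[str]:
    """Snapshot db_path, wait, snapshot again, and report what changed."""
    before = snapshot(db_path)
    time.sleep(seconds)
    return changed_files(before, snapshot(db_path))


# Run while the fsync lock is held; the path is a placeholder:
# print(watch(Path("/data/secondary2"), 30))
```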

      My questions are:

      • Is WiredTiger.wt expected to change after running db.fsyncLock()? If so, can users still take a backup by copying WiredTiger.wt and the other database files while holding the fsync lock?
      • If copying is potentially problematic, would taking a filesystem snapshot be safe instead?

      The corruption stack trace is as follows:

              Assignee: Unassigned
              Reporter: Mike sun
              Votes: 0
              Watchers: 1

                Created:
                Updated:
                Resolved: