  1. Core Server
  2. SERVER-3316

Syncing a new replica in a replica set crashes the primary and leaves secondary in strange state

Details

    • Bug
    • Status: Closed
    • Critical - P2
    • Resolution: Fixed
    • 1.8.1
    • None
    • Replication, Storage
    • None
    • Ubuntu Natty on EC2
    • ALL

    Description

      Our setup is as follows:

      2 shards consisting of 3 machines each (1 primary, 1 secondary, 1 arbiter). Each shard has about 35GB of data, running on 1.8.1.
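
      For reference, here is a minimal sketch of the replica set configuration document that one such shard would use: two data-bearing members plus an arbiter. The set name and hostnames are hypothetical placeholders (the report does not give the real ones), and the helper `shard_replset_config` is illustrative only. The document follows the shape accepted by the `replSetInitiate` command, where `arbiterOnly` marks the arbiter.

```python
def shard_replset_config(set_name, data_hosts, arbiter_host):
    """Build a replica set config for one shard: data-bearing members plus one arbiter.

    All names passed in are placeholders; which data-bearing member becomes
    primary is decided by election, not by this document.
    """
    members = [
        {"_id": i, "host": host} for i, host in enumerate(data_hosts)
    ]
    # The arbiter votes in elections but holds no data.
    members.append(
        {"_id": len(data_hosts), "host": arbiter_host, "arbiterOnly": True}
    )
    return {"_id": set_name, "members": members}


# Hypothetical hosts for one of the two shards described above.
config = shard_replset_config(
    "shard1",
    ["db1.example.com:27017", "db2.example.com:27017"],
    "arb1.example.com:27017",
)
```

      Replacing a lost secondary amounts to swapping one of the data-bearing hosts for a fresh, empty machine and letting it perform an initial sync from the primary.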

      We lost a secondary today, so we are trying to resync a new secondary from scratch. Two things have each happened at least twice in this process:

      1. The primary segfaults. This happened while the secondary was mid-sync, but it also happened when the secondary was shut down and not communicating with the primary at all (three times).
      2. The secondary, once it's finished syncing and building its indexes, complains over and over about "DR102 too much data written uncommitted" (same error as SERVER-2737 but different situation as far as I can tell).

      We've tried stopping all the mongods, removing the local files on the primary, starting it up and re-initializing its replica set, and then syncing again, but this led to the same results (we cleared all data off the secondary first, too).

      I've attached the logs for both segfaults (one was running with verbose=false, the other with verbose=true) and a sample of the DR102 errors on the secondary.

      Attachments

        1. segfault.txt
          0.7 kB
        2. uncommitted.txt
          7 kB

          People

            scotthernandez Scott Hernandez
            mikeyk Mike K