  1. Core Server
  2. SERVER-23688

Duplicate key index crashing nodes.


      Change the priority of replica set member to force the election to primary.


      I have a problem with duplicate keys on a replica set.
      My replica set consists of 1 primary (A), 1 secondary (B), and 1 arbiter.

      All insert and update operations on those machines use "WriteConcern.REPLICA_ACKNOWLEDGED" to keep both nodes synchronized, and most operations are performed in bulk.

      There are around 500 collections in the database. 250 of them have a unique index on "pid, channel" and the other 250 have a unique index on "uid, channel".
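To make the uniqueness constraint concrete, here is a minimal sketch of a duplicate check over the compound key. It runs on plain dictionaries standing in for documents; the function name `find_duplicate_keys` and the sample values are hypothetical, not taken from the affected collections. Against a live server, an aggregation that groups on the same fields and keeps groups with a count above 1 would serve the same purpose.

```python
from collections import Counter


def find_duplicate_keys(docs, fields):
    """Return {key_tuple: count} for every key that occurs more than once.

    `docs` is any iterable of dict-like documents and `fields` names the
    fields covered by the unique index, e.g. ("pid", "channel").
    """
    counts = Counter(tuple(doc.get(f) for f in fields) for doc in docs)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample documents; real data would come from the collection.
sample = [
    {"pid": "100", "channel": "Facebook"},
    {"pid": "100", "channel": "Facebook"},  # violates unique (pid, channel)
    {"pid": "100", "channel": "Twitter"},   # same pid, different channel: fine
]

duplicates = find_duplicate_keys(sample, ("pid", "channel"))
```

A non-empty result means the collection holds documents that a unique index on those fields should never have accepted.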

      Today I needed to change the machines' priorities to make the "most idle" one the primary. I simply raised the priority of machine B so that it would become primary in the next election.
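The priority change described above can be sketched as an edit to the replica set configuration document. This is only an illustration under assumptions: the helper `with_priority`, the set name, the host names, and the priority values are all hypothetical, and the exact procedure used on these machines is not stated in the report. With a driver, the current document would typically come from the `replSetGetConfig` command and the edited one would be submitted with `replSetReconfig`.

```python
import copy


def with_priority(config, host, priority):
    """Return a copy of a replica set config with one member's priority changed.

    The config version is incremented, which a raw replSetReconfig command
    expects. The input config is left unmodified.
    """
    new_config = copy.deepcopy(config)
    for member in new_config["members"]:
        if member["host"] == host:
            member["priority"] = priority
            break
    else:
        raise ValueError(f"no member with host {host!r}")
    new_config["version"] += 1
    return new_config


# Hypothetical configuration: primary A, secondary B, and an arbiter.
current = {
    "_id": "rs0",
    "version": 3,
    "members": [
        {"_id": 0, "host": "a.example.net:27017", "priority": 2},
        {"_id": 1, "host": "b.example.net:27017", "priority": 1},
        {"_id": 2, "host": "arb.example.net:27017", "arbiterOnly": True, "priority": 0},
    ],
}

# Give B the highest priority so that it wins the next election.
updated = with_priority(current, "b.example.net:27017", 3)
```

Working on a copy keeps the original document available for comparison or rollback if the reconfiguration has to be undone.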

      A short while after the election, machine B, which was now primary, hit the following problem:

      2016-04-13T09:40:21.966-0500 F STORAGE [conn372] Unique index cursor seeing multiple records for key { : "478713872142575", : "Facebook" }
      2016-04-13T09:40:21.999-0500 I - [conn372] Fatal Assertion 28608
      2016-04-13T09:40:21.999-0500 I - [conn372]

      ***aborting after fassert() failure

      2016-04-13T09:40:22.702-0500 F - [conn372] Got signal: 6 (Aborted).

      As soon as this happened, machine A became primary again. A short while later, several subsequent restarts of machine B failed with the same problem on a different key. 40 minutes later, the duplicate key error occurred on machine A too.

      This bug has already happened once in a similar way, and it was solved by reindexing all unique indexes in the collections. Now it has happened again, and I'm concerned about using MongoDB in production. What can cause this type of error? Why can't MongoDB stay up and try to resolve it instead of simply aborting? Could using the REPLICA_ACKNOWLEDGED write concern be the cause of this problem? Could the priority change combined with the election process be the cause?

            Votes: 13
            Watchers: 46
