Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker - P1
    • Resolution: Incomplete
    • Affects Version/s: 2.2.3
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Backport:
      No
    • Operating System:
      ALL
    • Bug Type:
      Unknown
    • # Replies:
      11
    • Last comment by Customer:
      true

      Description

      We are having some problems with sharding; while diagnosing them I stumbled upon this:

      $ mongo cancer:27022/a
      MongoDB shell version: 2.2.3
      connecting to: cancer:27022/a
      driveFS-2:SECONDARY> rs.slaveOk()
      driveFS-2:SECONDARY> db.fs.chunks.find({files_id:NumberLong("5444002714197426176")},{data:0})
      { "_id" : ObjectId("51252590f2ac490dc4e32ec8"), "files_id" : NumberLong("5444002714197426176"), "n" : 0 }
      { "_id" : ObjectId("51252597de30b908b0d23b69"), "files_id" : NumberLong("5444002714197426176"), "n" : 0 }
      driveFS-2:SECONDARY> db.system.indexes.find()
      { "v" : 1, "key" : { "_id" : 1 }, "ns" : "a.fs.chunks", "name" : "_id_" }
      { "v" : 1, "key" : { "files_id" : 1, "n" : 1 }, "unique" : true, "ns" : "a.fs.chunks", "name" : "files_id_1_n_1" }
      

      How is this possible?
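      The unique index on { files_id: 1, n: 1 } should make the two documents above impossible, so a first check is to enumerate every duplicated key on the member. A minimal sketch, assuming the same a database and fs.chunks collection as in the session above (in the 2.2 shell, aggregate() returns a document whose result field holds the output):

      // List (files_id, n) pairs that occur more than once, i.e. keys that
      // violate the files_id_1_n_1 unique index, together with their _ids.
      db.fs.chunks.aggregate([
          { $group: {
              _id:   { files_id: "$files_id", n: "$n" },  // group by the unique-index key
              ids:   { $push: "$_id" },                   // collect the clashing _ids
              count: { $sum: 1 }
          } },
          { $match: { count: { $gt: 1 } } }               // keep only duplicated keys
      ])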

      1. d1.s2.log.gz
        471 kB
        Aristarkh Zagorodnikov
      2. d2.s2.log.gz
        693 kB
        Aristarkh Zagorodnikov
      3. d3.s2.log.gz
        702 kB
        Aristarkh Zagorodnikov

        Activity

        Aristarkh Zagorodnikov
        added a comment -

        Attached logs. d3 is usually the primary (it has priority 10); d1 and d2 are secondaries. d3 had a non-unique index, while d1 and d2 had the unique one. The insert went into d3, and d1 and d2 at first refused to replicate the change (and that's fine). But after a series of restarts (Puppet periodically ensures the mongods are up and running), without any user intervention (no one altered any data on d1 and d2, and since the replica set was degraded with only one member up, d3 was also a secondary, so it wasn't accepting any writes), the insert got through.

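        The mismatch described above can be checked directly on each member. A minimal sketch, assuming placeholder host names for d1/d2/d3 (the real addresses are not in this ticket), is to open a direct connection to every member and compare its index definitions:

        // Compare the a.fs.chunks index definitions across the members; any
        // member whose files_id_1_n_1 entry lacks "unique" : true is the odd one out.
        ["d1.example:27022", "d2.example:27022", "d3.example:27022"].forEach(function (host) {
            var conn = new Mongo(host);   // connect to the member itself, not through mongos
            conn.setSlaveOk();            // allow reads while it is a secondary
            printjson({ host: host, indexes: conn.getDB("a").getCollection("fs.chunks").getIndexes() });
        });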
        Aristarkh Zagorodnikov
        added a comment - edited

        The offending insert happened at about Feb 20 23:35 log time.

        Ian Daniel
        added a comment -

        Hi Aristarkh,

        It has been a couple of months since you raised this issue. What is the status of the problem now? Have you been able to resolve the problem, or are you still looking for assistance?

        Kind regards,
        Ian

        Aristarkh Zagorodnikov
        added a comment - edited

        Hi!
        We dropped the indexes and recreated them manually, dropping duplicates, and this fixed the problem. I was more interested in how this happened in the first place and thought you would also be interested. So yes, we solved the problem, but without knowing what caused the issue itself we're a bit nervous about such problems arising in the future, and as the dataset grows (7 TiB now), reindexing may become more and more problematic as a resolution.

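        For reference, one possible shape of the repair described in the comment above, assuming the index name and the duplicate _id pair from the description (which copy of a duplicated chunk to keep is an application-level decision); run against the primary so the drop and the unique rebuild replicate to all members:

        // Remove one copy of the duplicated (files_id, n) key, then rebuild the
        // index as unique so every member ends up with the same definition.
        db.fs.chunks.remove({ _id: ObjectId("51252597de30b908b0d23b69") })
        db.fs.chunks.dropIndex("files_id_1_n_1")
        db.fs.chunks.ensureIndex({ files_id: 1, n: 1 }, { unique: true })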
        Joanna Cheng
        added a comment -

        Hi Aristarkh,

        I reviewed your logs and have some comments:

        1. I can see the problems you first hit at ~23:35, which cause the secondaries to crash. Also, as the replica set members come back up, they try to catch up but hit the same problematic operation again.

        2. I don't see the write actually going through in your logs. I wouldn't expect the write to go through on the secondaries while the unique index exists.

        3. I can see the index being rebuilt around 01:40 on d3 (which is the primary at this stage?). I'm not sure what this was for. I see more errors (for the same duplicate key) as late as 08:44 on member d2.

        4. Since there are only logs from the incident (but not earlier), I can't ascertain how the index inconsistency among replica set members happened in the first place.
        4a. Certainly there is no requirement for indexes to be the same across replica set members (e.g. you may want an index on one member for a certain batch reporting job, but not on the other members); a sketch of how such a member-local index is typically built follows this comment.

        As you have moved past this issue and probably no longer have any of the data from the incident, there is not much else we can advise without resorting to pure guesswork. I will resolve this ticket; however, if you have any further information related to this issue, please feel free to re-open with additional details.

        Thanks and kind regards,
        Joanna

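        As a footnote to point 4a above, a member-local index is normally built by taking the member out of the replica set temporarily, since operations performed on a node started without --replSet are not written to the oplog and therefore do not replicate. A minimal sketch, with placeholder port, dbpath and index spec (only the driveFS-2 replica set name comes from this ticket):

        # 1. Restart the member standalone (no --replSet) on a temporary port.
        mongod --dbpath /var/lib/mongodb --port 37022
        # 2. Build the member-only index while the node is standalone.
        mongo localhost:37022/a --eval 'db.fs.files.ensureIndex({ uploadDate: 1 })'
        # 3. Restart the member with its usual options so it rejoins the set.
        mongod --dbpath /var/lib/mongodb --port 27022 --replSet driveFS-2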

          People

          • Votes:
            0
          • Watchers:
            9

            Dates

            • Days since reply:
              37 weeks, 6 days ago