Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-18822

Sharded clusters with WiredTiger primaries may lose writes during chunk migration

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.4
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Steps To Reproduce:
      Hide

      The program I wrote to expose the bug is at: https://github.com/jeffhj11/MongoBug.git

      1. Set up a cluster with three shards, each with WiredTiger as the storage engine.
      2. Run the program to insert records into an unsharded collection. Wait for it to compete.
      3. Shard the collection on the uuid field
      4. Insert one record and wait for the load balancer to start.
      5. Run the program again while the balancer is running. On most occasions, we lost a few records.

      Show
      The program I wrote to expose the bug is at: https://github.com/jeffhj11/MongoBug.git 1. Set up a cluster with three shards, each with WiredTiger as the storage engine. 2. Run the program to insert records into an unsharded collection. Wait for it to compete. 3. Shard the collection on the uuid field 4. Insert one record and wait for the load balancer to start. 5. Run the program again while the balancer is running. On most occasions, we lost a few records.
    • Sprint:
      Sharding 5 06/26/16

      Description

      Issue Status as of Jun 11, 2015

      ISSUE SUMMARY
      The system that allows shards to track changes to a chunk during migration did not always capture every write operation performed on the source shard. Because the shard receiving the chunk was notified about a new write operation before the write operation committed to the source shard, if the source shard is experiencing high-volume writes to the chunk being migrated, the receiving shard may miss some write operations to the chunk being migrated.

      USER IMPACT
      Users running MongoDB versions 3.0.0 through 3.0.3 may be affected by this issue under the following conditions:

      1. user is running a sharded cluster, and
      2. the sharded cluster has the balancer enabled or the user is manually moving chunks, and
      3. during a migration, the chunk which is in the process of being migrated is receiving writes, and
      4. the primary node in the source shard is running the WiredTiger storage engine

      Under all of these conditions the target shard may lose writes to the chunk being migrated.

      It is not possible to determine if a cluster missed any writes because of this bug.

      None of the following deployments are affected by this issue:

      • a stand-alone MongoDB instance or a replica set that is not part of a sharded cluster
      • a sharded cluster that has the balancer disabled and no manual chunk migrations
      • a sharded cluster where all the members are running the MMAPv1 storage engine
      • a sharded cluster where any member running with the WiredTiger storage engine cannot become primary (i.e.: primary nodes run MMAPv1 and WiredTiger nodes are hidden)

      WORKAROUND
      Users running mixed storage engine sharded clusters may prevent nodes running WiredTiger from becoming primary by setting their priority to 0, so all primary nodes run MMAPv1.

      Users running WiredTiger-only sharded clusters may need to either add MMAPv1 nodes or replace WiredTiger nodes with MMAPv1 nodes, and make sure all remaining WiredTiger nodes are prevented from becoming primary as indicated above.

      Alternatively users can disable the balancer and cease manual chunk migrations, but note that this may lead to an unbalanced cluster and may have a negative impact on performance. The balancer can be disabled cluster-wide or on a per-collection basis. See the Documentation section below for more information.

      Users may upgrade WiredTiger nodes to 3.0.4-rc0 until 3.0.4 is released.

      AFFECTED VERSIONS
      The 3.0 release series up until (and including) 3.0.3.

      FIX VERSION
      The fix is included in the 3.0.4 stable release.

      DOCUMENTATION

      Original description

      There appears to be a bug where data can go missing in MongoDB. The problem seems to occur when doing a large number of concurrent inserts into a sharded cluster while the balancer is running. In all instances where we lost data, wiredTiger was the storage engine and the shard key was effectively a random UUID, so that inserts were going to all shards.

      The test description and program shows the error with a Java application using a MongoDB database, but we also had the problem (less frequently) with a Python script.

      The bug seems to be some sort of concurrency/race condition problem. It is not guaranteed to happen on any one run, but we were able to replicate it fairly consistently. The number of documents missing range from 1-1000 during 400,000-800,000 inserts.

      Also note that we counted documents in two ways to determine that we had lost documents. The first was by running an aggregate to count the documents in an effort to avoid the way count() works when the balancer is running. We also waited for the balancer to finish and ran both a count() and an aggregate to ensure that documents are missing.

      System Configuration & Setup

      We tested with the following versions
      MongoDB Versions: 3.0.1, 3.0.2, 3.0.3. For any given test, the instances were all running the same version of mongo.
      OS Version: Centos 6.6 for 3.0.1 and 3.0.2, MacOSX for 3.0.3.
      Write Concern: Acknowledged, Majority. On a few tests we had journaling enabled as well.
      Java version: 8u25
      Java Driver: 3.0.0, 2.12.2

      Cluster Setup

      We had the error occur in two different configurations, one with many servers and one mongod on each, and one with all mongo applications on one server.
      Please note that the servers we were running on were virtual and did not have particularly high iops.

      Single Server:
      3 shards, each shard was a replicate set with only one mongod instance, run with options:

      ./bin/mongod –replSet <REPL_SET_ID> --dbpath <DB_PATH> --port <PORT> --logpath <LOG_PATH> --fork –storageEngine wiredTiger
      1 config server, run with options:
      ./bin/mongod –configsvr –dbpath <DB_PATH> --port <PORT> --logpath <LOG_PATH> --fork –storageEngine wiredTiger
      

      1 mongos, run with options:

      ./bin/mongos –configdb <CONFIG_SERVER_PATHS> --logpah <LOG_PATH> --port <PORT> --fork
      

      Many Servers:
      3 shards, each shard was a replicate set with two mongod instances and one arbiter, run with options:

      ./bin/mongod –replSet <REPL_SET_ID> --dbpath <DB_PATH> --port <PORT> --logpath <LOG_PATH> --fork –storageEngine wiredTiger
      3 config server, run with options:
      ./bin/mongod –configsvr –dbpath <DB_PATH> --port <PORT> --logpath <LOG_PATH> --fork –storageEngine wiredTiger
      1 mongos, run with options:
      ./bin/mongos –configdb <CONFIG_SERVER_PATHS> --logpah <LOG_PATH> --port <PORT> --fork
      

      The replica sets were configured and primaries were elected prior to running the tests.

      Our application
      The program can be pulled from github at https://github.com/jeffhj11/MongoBug.git

      The Java application simply starts up a number of threads and each thread inserts a number of documents. There are options in the application to specify a message to more easily determine what data went missing.

      The document structure is simarly to:

      { _id: “”, // Either a copy of the uuid field or the mongo generated ObjectId
        uuid: “”, // A UUID created by java.util.UUID.getRandomUUID().toString()
        threadName: “”, // The name of the thread that inserted the document
        message: “”, // A message that can be specified on a per-run basis
        count: “” // A one-up number generated on a per-thread basis
      }
      

      The Tests
      To run a test, we would run the Java application to insert 400,000 documents into an unsharded database collection. After that completed, we would log into the mongos and shard the collection with the commands:

      db.ourcollection.createIndex({<SHARD_KEY>})
      sh.shardCollection(‘database.collection’, {<SHARD_KEY>})
      

      When that was completed, we would manually insert one document to start the balancer. We would watch the logs to ensure that the balancer was working. Once the balancer started, we would run our application again, so that multiple threads in the application were inserting documents into Mongo while the balancer was running.

      Common Themes and what we noticed

      The following seemed to be the common pieces that would cause MongoDB to lose data. Please keep in mind that this does not mean these are the actual causes, simply our observations and speculation.

      • We only lost data when WiredTiger was the storage engine, never with MMAP
      • We only lost data when the shard key was effectively or random UUID, whether it was sharded on the ‘uuid’ field or it was sharded on the ‘_id’ field when the ‘_id’ field was set to our UUID.
      • We only lost data when the balancer was running and we were inserting data at that time. The only data that was lost was the data being inserted while the balancer was running, the data that was there prior to the balancer running always existed at the end of the test.
      • On tests where we did something to slow down the inserts the likelihood of there being problems decreased.
      • One one test we did a find() immediately after doing an insert and match each field to ensure the exact document was found. That test did lose data, so one record was lost after our application and run a find() to retrieve it and it was successful.
      • We noticied the problem with both Python and Java, but it occurred much more often with our Java program. We are also better Java programmers, and our Java program runs much faster than our python program.

      1. MongoDBBug-FullWriteup.docx
        110 kB
        Jeffrey Johnson

        Issue Links

          Activity

          Hide
          ramon.fernandez Ramon Fernandez added a comment -

          Thanks for the detailed report Jeffrey Johnson, we're investigating this issue and will let you know what we find.

          Show
          ramon.fernandez Ramon Fernandez added a comment - Thanks for the detailed report Jeffrey Johnson , we're investigating this issue and will let you know what we find.
          Hide
          ramon.fernandez Ramon Fernandez added a comment -

          Jeffrey Johnson, this is to let you know that we've identified the root cause of this issue and are working on a fix. We'll post updates on this ticket as they become available.

          Thanks,
          Ramón.

          Show
          ramon.fernandez Ramon Fernandez added a comment - Jeffrey Johnson , this is to let you know that we've identified the root cause of this issue and are working on a fix. We'll post updates on this ticket as they become available. Thanks, Ramón.
          Hide
          jeffhj11 Jeffrey Johnson added a comment -

          Thank you so much, I look forward to learning the details of the problem and when the fix will be available.
          Thanks!

          Show
          jeffhj11 Jeffrey Johnson added a comment - Thank you so much, I look forward to learning the details of the problem and when the fix will be available. Thanks!
          Hide
          schwerin Andy Schwerin added a comment - - edited

          The root cause of this error was the same as the cause of SERVER-15192. Specifically, the logOp listeners, which are the hooks that replication, sharding and other systems use to monitor database writes within a single mongod node, were executing before rather than after those database writes committed. For most of the listeners, the only harm this introduced was to make it illegal to roll back after the writer completed the logOp() function call. However, the logOpForSharding observer (invoked by logOp) makes certain state used by chunk migration visible to other nodes. As a result, during a chunk migration, the recipient of a chunk might be instructed to observe some change performed on the donor of the chunk before the change commits. The recipient would then perform a regular query to ask the donor about the change, and not observe the change. In this particular case, the change was the insertion of a document into the migrating chunk.

          The fix is to back port SERVER-15192 to the 3.0 branch, which is accomplished in https://github.com/mongodb/mongo/commit/9e1657ebd156ed70f44a4585d2330510d75e2ae5.

          Show
          schwerin Andy Schwerin added a comment - - edited The root cause of this error was the same as the cause of SERVER-15192 . Specifically, the logOp listeners, which are the hooks that replication, sharding and other systems use to monitor database writes within a single mongod node, were executing before rather than after those database writes committed. For most of the listeners, the only harm this introduced was to make it illegal to roll back after the writer completed the logOp() function call. However, the logOpForSharding observer (invoked by logOp) makes certain state used by chunk migration visible to other nodes. As a result, during a chunk migration, the recipient of a chunk might be instructed to observe some change performed on the donor of the chunk before the change commits . The recipient would then perform a regular query to ask the donor about the change, and not observe the change. In this particular case, the change was the insertion of a document into the migrating chunk. The fix is to back port SERVER-15192 to the 3.0 branch, which is accomplished in https://github.com/mongodb/mongo/commit/9e1657ebd156ed70f44a4585d2330510d75e2ae5 .

            People

            • Votes:
              0 Vote for this issue
              Watchers:
              35 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                  Agile