  WiredTiger / WT-3998

Fix a bug where stable timestamp was ignored on shutdown

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 3.6.5, 4.0.0-rc0, WT3.1.0
    • Affects Version/s: 3.7.4
    • Component/s: None
    • Labels: None
    • Sprint: Storage Non-NYC 2018-04-09, Storage Non-NYC 2018-04-23

      The result of this bug was that checkpoints could be written containing data newer than the stable timestamp, i.e., data visibility rules were ignored.
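
      For reference, the invariant at stake can be sketched against the public WiredTiger C API roughly as follows. This is an illustration only, not code from the ticket: error handling is omitted, and the table name, key, value, and timestamps are made up. An update committed after the stable timestamp should not be included in the checkpoint taken when the connection closes.

      #include <wiredtiger.h>

      void
      sketch(void)
      {
          WT_CONNECTION *conn;
          WT_SESSION *session;
          WT_CURSOR *cursor;

          wiredtiger_open("WT_HOME", NULL, "create", &conn);
          conn->open_session(conn, NULL, NULL, &session);
          session->create(session, "table:example", "key_format=S,value_format=S");
          session->open_cursor(session, "table:example", NULL, NULL, &cursor);

          /* Commit an update at timestamp 0x20. */
          session->begin_transaction(session, NULL);
          cursor->set_key(cursor, "k");
          cursor->set_value(cursor, "v-at-20");
          cursor->insert(cursor);
          session->commit_transaction(session, "commit_timestamp=20");

          /* Declare an earlier time, 0x10, stable: the update above is newer than stable. */
          conn->set_timestamp(conn, "stable_timestamp=10");

          /*
           * Closing the connection takes a final checkpoint. With a stable
           * timestamp set, that checkpoint should contain only data at or
           * before stable, so "v-at-20" should not survive a restart. The
           * bug was that this shutdown checkpoint ignored the stable
           * timestamp.
           */
          conn->close(conn, NULL);
      }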

      Attached is a starter set of data. It contains a fresh replica set secondary and an oplog with data generated by find9.js followed by update_multi3.js.

      The `find9.js` file inserts a bunch of large documents (which creates cache pressure), and `update_multi3.js` generates a sequence of oplog entries such that, when replayed, they result in an insert into a drop-pending namespace and cause a crash.
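
      Purely as a WiredTiger-level illustration of the cache-pressure ingredient (the attached scripts operate at the MongoDB layer and are not reproduced here; the table name, value size, and loop count below are made up), the workload amounts to committing many large timestamped values:

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <wiredtiger.h>

      /* Fill the cache with large timestamped values, loosely analogous to find9.js. */
      void
      create_cache_pressure(WT_SESSION *session)
      {
          WT_CURSOR *cursor;
          char *big;
          char cfg[64], key[32];
          int i;

          /* One megabyte of filler per value. */
          big = malloc(1024 * 1024);
          memset(big, 'x', 1024 * 1024 - 1);
          big[1024 * 1024 - 1] = '\0';

          session->create(session, "table:pressure", "key_format=S,value_format=S");
          session->open_cursor(session, "table:pressure", NULL, NULL, &cursor);
          for (i = 1; i <= 500; i++) {
              session->begin_transaction(session, NULL);
              snprintf(key, sizeof(key), "doc-%d", i);
              cursor->set_key(cursor, key);
              cursor->set_value(cursor, big);
              cursor->insert(cursor);
              snprintf(cfg, sizeof(cfg), "commit_timestamp=%x", i);
              session->commit_transaction(session, cfg);
          }
          cursor->close(cursor);
          free(big);
      }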

      To reproduce, first unarchive the attached file:

      tar -xzf repro_data.tgz
      

      Then repeatedly run mongod and interrupt it with control-C[1].

      ./mongod --dbpath data/db1/ --replSet rs0 --wiredTigerCacheSizeGB 1
      

      Note both the --replSet rs0 flag and the --wiredTigerCacheSizeGB 1 setting. The first is required to replay the oplog entries; the second, I believe, is required to reproduce the bug.

      Keep the process up long enough to complete startup oplog recovery. The way the repro works, one run causes corruption on shutdown and the follow-up run crashes while starting up.

      More details on what the process does and what goes wrong:

      Upon opening the data files, mongod queries the "recovery" timestamp of the WiredTiger checkpoint. That value is then set as the stable timestamp.

      This timestamp is also passed back to replication, which replays all oplog entries after the recovery timestamp up to the top of the oplog. The next repro step is to shut down the node.
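
      For concreteness, that startup step looks roughly like the following against the WiredTiger C API (an illustration, not MongoDB's actual code; buffer sizes are arbitrary and error handling is omitted):

      #include <stdio.h>
      #include <wiredtiger.h>

      /* Read the checkpoint's recovery timestamp and reinstall it as stable. */
      void
      set_stable_from_recovery(WT_CONNECTION *conn)
      {
          char cfg[64], ts[64];

          /* The recovery timestamp of the checkpoint this connection started from. */
          conn->query_timestamp(conn, ts, "get=recovery");

          /* Hand the same value straight back as the stable timestamp. */
          snprintf(cfg, sizeof(cfg), "stable_timestamp=%s", ts);
          conn->set_timestamp(conn, cfg);
      }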

      None of the data written during that replay should persist on disk as part of the stable checkpoint taken on close; the next startup will replay from the beginning. However, in a run that reproduces the bug, the node ends up in a state where the collection metadata (stored in table:_mdb_catalog) contains the collections created (and dropped) during the previous run.
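
      Continuing the earlier sketch, the expected behavior is easy to state: after reopening the same directory, a key whose only value was committed after the stable timestamp should be gone (again an illustration, with names carried over from the sketch above and error handling omitted):

      #include <wiredtiger.h>

      int
      check_after_restart(void)
      {
          WT_CONNECTION *conn;
          WT_SESSION *session;
          WT_CURSOR *cursor;
          int ret;

          wiredtiger_open("WT_HOME", NULL, NULL, &conn);
          conn->open_session(conn, NULL, NULL, &session);
          session->open_cursor(session, "table:example", NULL, NULL, &cursor);

          cursor->set_key(cursor, "k");
          ret = cursor->search(cursor);   /* Expect WT_NOTFOUND: "v-at-20" was newer than stable. */

          conn->close(conn, NULL);
          return (ret);
      }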

      In particular, what causes the problem is that the dataset was created by performing only half of the two-phase drop on the update_multi3 collection. When recovery plays over the inserts again, replication recovery crashes trying to insert into this "drop pending" collection.

      [1] Replication recovery has completed when mongod starts spamming the following log lines every half second:

      2018-03-26T22:22:27.560-0400 I REPL     [replexec-2] Member localhost:27018 is now in state RS_DOWN
      2018-03-26T22:22:28.060-0400 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to localhost:27018
      2018-03-26T22:22:28.060-0400 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to localhost:27018 - HostUnreachable: Connection refused
      2018-03-26T22:22:28.060-0400 I ASIO     [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to localhost:27018 due to failed operation on a connection
      2018-03-26T22:22:28.060-0400 I REPL_HB  [replexec-0] Error in heartbeat (requestId: 7) to localhost:27018, response status: HostUnreachable: Connection refused
      

        Attachments:
          1. repro_data.tgz (362 kB, Daniel Gottlieb)
          2. startupRecovery.js (4 kB, Daniel Gottlieb)

            Assignee:
            sue.loverso@mongodb.com Susan LoVerso
            Reporter:
            daniel.gottlieb@mongodb.com Daniel Gottlieb (Inactive)
            Votes:
            0
            Watchers:
            14
