  1. WiredTiger
  2. WT-3515

Better parent/child process coordination in timestamp-abort

    • Storage Engines

      In WT-3514, the timestamp-abort process hung after the checkpoint call in the child process returned an error. There are really two bugs here:

      1. The hang happened because the parent does not start its "kill the child after N seconds" timer until the first checkpoint completes. The child's checkpoint thread got the error on the first checkpoint and never created the file that the parent is looking for, so the parent hung. One fix is to have the parent wait up to MAX_TIME for that file to appear and report an error if it has not appeared by then.

      2. Since this is a kill test, any error or problem that happens in the child process is lost. The parent does a kill -9 on the child, expects it to die, and does not look at its exit status. If the problem in WT-3514 had happened on any checkpoint other than the first one, we would never have noticed it. Consider adding a mode to the test in which, instead of the parent doing a kill -9 on the child, the child runs for N seconds and then shuts down cleanly. Add several "clean" mode tests to smoke.sh to get adequate run coverage of clean shutdown argument combinations. Testing correctness after a clean shutdown is also additional coverage that we don't have at the moment.
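The bounded wait proposed in item 1 could look roughly like the sketch below. It is written in Python purely for illustration and is not the test's actual code; the function name `wait_for_file`, the polling interval, and the value given to `MAX_TIME` are all assumptions, since the issue only names the constant.

```python
import os
import time

# Hypothetical value; the issue names MAX_TIME but does not say how long it is.
MAX_TIME = 40.0


def wait_for_file(path, max_time=MAX_TIME, poll_interval=0.1):
    """Return True once `path` exists, or False if `max_time` seconds pass first."""
    deadline = time.monotonic() + max_time
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            # The child never signalled its first checkpoint; report failure
            # to the caller instead of waiting forever.
            return False
        time.sleep(poll_interval)
```

A parent using this would treat a `False` return as a test failure rather than continuing to wait, which is the behavior that would have turned the WT-3514 hang into a visible error.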

      Although errors in a kill situation would still be lost, the idea, similar to test/format, is that a bug like WT-3514 would eventually show itself in the clean shutdown case: the parent would detect a non-zero exit status from the child where it expects a zero exit status.
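As a rough illustration of the clean shutdown mode, the sketch below forks a child, lets it run to completion, and has the parent inspect the exit status. It is a Python stand-in, not the test's implementation; `child_exits_cleanly` and the `work` callback are hypothetical names, and `work` stands for whatever the child does for N seconds before shutting down.

```python
import os


def child_exits_cleanly(work, run_seconds):
    """Fork a child that runs work(run_seconds); return True only on exit status 0."""
    pid = os.fork()
    if pid == 0:
        # Child: any failure (for example a checkpoint error) becomes a
        # non-zero exit status instead of being lost.
        try:
            work(run_seconds)
        except BaseException:
            os._exit(1)
        os._exit(0)

    # Parent: no kill -9 in this mode, just wait and check how the child ended.
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```

In kill mode the parent cannot make this check, because a child terminated by kill -9 never reports its own errors. That is why the clean mode is the one expected to surface bugs of this kind over repeated runs.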

            Assignee:
            sue.loverso@mongodb.com Susan LoVerso (Inactive)
            Reporter:
            sue.loverso@mongodb.com Susan LoVerso (Inactive)
