The workload runner currently just runs the operations, ignoring WT_NOTFOUND and failing the execution on any other error. Discarding return codes (not just WT_NOTFOUND) is unfortunate, as it can hide behavioral differences between WiredTiger and the model.
The workload runner should thus record the return codes and compare them across WiredTiger and model executions, in addition to comparing the database state.
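
As a rough illustration of the idea, the sketch below records one return code per operation and then reports the first position at which two recordings disagree. It is not the workload runner's actual code: `operation`, `run_and_record`, `first_mismatch`, and `kNotFound` are hypothetical names, and `kNotFound` is only a placeholder for WT_NOTFOUND, which real code would take from `wiredtiger.h`.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

// Placeholder for WT_NOTFOUND (assumption: real code uses the wiredtiger.h value).
constexpr int kNotFound = -1;

// Hypothetical: an operation is any callable returning a WiredTiger-style code.
using operation = std::function<int()>;

// Run every operation and keep its return code instead of discarding it.
std::vector<int>
run_and_record(const std::vector<operation> &ops)
{
    std::vector<int> codes;
    codes.reserve(ops.size());
    for (const auto &op : ops)
        codes.push_back(op());
    return codes;
}

// Return the index of the first operation whose return codes differ between
// the two executions, or std::nullopt if the recordings are identical.
std::optional<std::size_t>
first_mismatch(const std::vector<int> &wt, const std::vector<int> &model)
{
    const std::size_t n = std::min(wt.size(), model.size());
    for (std::size_t i = 0; i < n; i++)
        if (wt[i] != model[i])
            return i;
    // One execution recorded more operations than the other.
    if (wt.size() != model.size())
        return n;
    return std::nullopt;
}
```

Under this approach, a mismatch would be reported alongside any difference in the final database state, so a divergence such as WiredTiger returning WT_NOTFOUND where the model returns success is caught even when the resulting data happens to match.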