Michael Cahill, I just noticed
WT-1698 is related to WT-3155 (for which I pushed a pull request a couple of days ago), but raises some different issues. I'll pull WT-1698 into my sprint, hopefully we can fix both of them.
- The original issue was concern that eviction could re-configure and attempt to allocate additional sessions while the application is closing the connection. Because eviction does the allocation asynchronously to the re-configuration, it's possible to hit this with a well-behaved application, that is, one that re-configures and then calls close. We have a `DIAGNOSTIC` assert in `__open_session `, that we're not in connection close when opening a session, so it's possible for a well-behaved application to trigger an assertion.
- You said you wanted to keep the assertion in `__open_session` ("This has been the source of a number of bugs in the past where internal threads can race with connection close if the connection is only open for a short period.")
- I said there are at least two other places this problem can occur, the LSM manager and async operations.
- You commented about where in the code we set the `WT_CONN_CLOSING` flag: it's after we shouldn't open more data handles, and `__wt_conn_btree_open` asserts we're not in the close path.
The PR I submitted (#3344), makes related changes:
- I moved the setting of the `WT_CONN_CLOSING` flag up in the close process and removed the assertion in `__wt_conn_btree_open` that we weren't in the process of closing.
- I changed the `DIAGNOSTIC` assertion in `__open_session` to be a return of `EBUSY`, without any associated error message.
My thinking was that it's OK for asynchronous threads to be in the open-file and open-session paths during close – they're running asynchronously and they can be in any normal code path until we shut them down. That said, I understand the diagnostic value, and expected that returning `EBUSY` out of `open_session` would trigger an error message we would see during testing.
Reviewing this all today, it occurred to me I'd introduced a problem, because `EBUSY` from session-allocation was likely to make its caller panic, which could race with the close and prevent normal shutdown.
However, I also figured out that LSM doesn't give its worker threads sessions, and both eviction and the async code allocate additional sessions in the reconfiguration path, so neither of them will open new sessions during close, at least in a well-behaved application. (I think async has always been that way, it's eviction that changed in August of 2016.)
In summary, at this point we could put the assertion back into `__open_session`, because we don't open new internal sessions in the close path. Given the possibility of an application calling close and other WiredTiger APIs concurrently, we could even go further, and panic in that case.