[SERVER-42711] PyMongo's retryable reads spec tests cause server 4.0 to segfault with MMAPv1 Created: 08/Aug/19  Updated: 20/Aug/19  Resolved: 20/Aug/19

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.0.11
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Prashant Mital (Inactive) Assignee: Daniel Gottlieb (Inactive)
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongod.log    
Issue Links:
Depends
is depended on by PYTHON-1948 Test retryable reads on MMAPv1 Backlog
Duplicate
duplicates SERVER-42922 failCommand + closeConnection can der... Closed
Operating System: ALL
Steps To Reproduce:
  • Spin up a replica set with 2 replicas and 1 arbiter running MongoDB 4.0.11 on the MMAPv1 storage engine with test commands enabled. The mlaunch command used was:

$ mlaunch init --replicaset --name repl0 --nodes 2 --arbiter --binarypath $MONGODB_4011_BIN --port 27017 --hostname localhost --setParameter enableTestCommands=1 --storageEngine mmapv1

  • Clone the following PyMongo branch:

$ git clone --branch PYTHON-1934/retryWrites-with-MMAPv1-raises-actionable-error https://github.com/prashantmital/mongo-python-driver.git

  • Run the retryable reads tests:

$ python setup.py test -s test.test_retryable_reads

Sprint: Execution Team 2019-08-26
Participants:

 Description   

The segfault can be observed when running PyMongo's test suite against server version 4.0.11 with 2 replicas and 1 arbiter using the MMAPv1 storage engine. I have been able to reproduce this on OSX.



 Comments   
Comment by Daniel Gottlieb (Inactive) [ 20/Aug/19 ]

I'm closing this in favor of SERVER-42922 (though I arguably could have rewritten the contents of this ticket). For your convenience, I've linked PYTHON-1948 as depending on this new ticket, though there might be workarounds available.

What's being observed is a general server bug regarding the use of failCommand with closeConnection. The failpoint attempts to close the client connection, but the server can also run command code internally (DBDirectClient), which doesn't have an actual network interface to close. What you're seeing is the resulting null dereference.

I suspect this only shows up for you on the mmapv1 tests because 4.0 mmapv1 has bugs which fail the retryable reads tests. My guess is the python suite is leaving the mongod in a state where the failpoint is still engaged. When a background thread that uses DBDirectClient runs an iteration, it encounters the failpoint (which was not intended for it), crashing the mongod. My hypothesis is that the crash is only observed after a python test has already failed.

One thing worth trying is having the tests disable any outstanding failpoints on teardown. Technically there's still a race where the background thread can observe this failpoint state, even in a passing test. But it seems the window to hit the race is sufficiently small given you haven't observed this with WT (which I assume always passes the suite).
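
A minimal sketch of the teardown idea above, assuming the suite enabled the failCommand failpoint through configureFailPoint and that `client` behaves like a pymongo MongoClient connected directly to the node where the failpoint was set. The helper names (fail_point_off_command, disable_fail_points) are hypothetical and are not part of PyMongo or its test suite.

```python
from collections import OrderedDict


def fail_point_off_command(name="failCommand"):
    """Build the command document that switches a failpoint off."""
    # The command name has to be the first key, so use an ordered mapping
    # (this also keeps the sketch valid on Python 2.7).
    return OrderedDict([("configureFailPoint", name), ("mode", "off")])


def disable_fail_points(client, names=("failCommand",)):
    """Turn off each named failpoint; intended to be called on teardown.

    Assumption: `client` exposes `client.admin.command(doc)` the way a
    pymongo MongoClient does.
    """
    for name in names:
        client.admin.command(fail_point_off_command(name))
```

In a unittest-style suite this could be registered with something like self.addCleanup(disable_fail_points, client), so that it also runs when the test body fails. As noted above, this only narrows the race window; it does not remove it.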

Comment by Prashant Mital (Inactive) [ 08/Aug/19 ]

I also encountered the same behavior on an evergreen spawn host [rhel62-small, Python 2.7 (/opt/python/2.7/bin/python), MMAPv1] running a 3-member replica set with MMAPv1 (no arbiter).

Comment by Danny Hatcher (Inactive) [ 08/Aug/19 ]

As MMAPv1 is deprecated as a storage engine in 4.0, is this something we want to fix?
