[SERVER-10372] ReplicaSetMonitor creates a thread that references memory it does not own Created: 29/Jul/13  Updated: 10/Dec/14  Resolved: 06/Aug/13

Status: Closed
Project: Core Server
Component/s: Internal Client
Affects Version/s: 2.2.4, 2.4.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Gerry F Assignee: Tad Marshall
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Windows, Linux confirmed


Issue Links:
Related
is related to SERVER-8707 dbclient_rs_test threading issue Closed
is related to SERVER-8891 Simple client fail with segmentation ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

The simplest approach is to call mongo::ReplicaSetMonitor::remove() to destroy the referenced memory in a running process. The new thread will reference the freed memory within 10 seconds.

Using Microsoft's ODBCtest, start and end a connection to a replica set. When the last connection to that driver .dll is closed, the program will drop the .dll and crash the next time the new thread wakes up.

Participants:

 Description   

The ReplicaSetMonitor constructor creates a new thread that references memory belonging to the constructor thread. If the constructor thread exits or frees the ReplicaSetMonitor, the new thread references freed memory.

This is likely related to SERVER-8707. The problem was previously obscured by SERVER-8891.

We need this fixed in 2.4. The 2.4.5 C++ API is working well with older Mongo servers so we don't need this backported into 2.2.



 Comments   
Comment by Tad Marshall [ 06/Aug/13 ]

This was fixed by the commits made for linked tickets SERVER-8707 and SERVER-8891.

Comment by Tad Marshall [ 06/Aug/13 ]

Hi Gerry,

Thanks! I'll resolve this ticket.

Can you file tickets against the C++ driver documentation and a suggestion to make ReplicaSetMonitorWatcher more manageable? I think that Randolph is also hoping to address ReplicaSetMonitorWatcher ... it's the thread that crashes in the remaining problem in SERVER-8707.

Tad

Comment by Gerry F [ 06/Aug/13 ]

Hi Tad,

I think this ticket can be closed.

There are two issues remaining, but maybe they should get their own tickets:

  • The C++ driver documentation needs some love, with additional detail on when and when not to call functions. Also, the connect/disconnect workflow is not obvious.
  • IF the watcher thread is necessary, and IF BackgroundJob is the right way to start that thread, then I think we need a way to tell that thread it should die the next time it wakes.

Looking forward to 2.4.6,
Gerry

Comment by Tad Marshall [ 06/Aug/13 ]

Hi Gerry,

The remaining issues in SERVER-8707 are looking like a bug in the "mock" replica sets used by that test, so if there are remaining issues in this ticket I think that they are distinct from the remaining issues in SERVER-8707.

All of the fixes found so far have been backported to the 2.4 branch and will be in version 2.4.6.

If you don't call ReplicaSetMonitor::remove() (which I don't think is something that users of the C++ driver should do), are there still issues to be fixed for this ticket?

Tad

Comment by Gerry F [ 31/Jul/13 ]

I was using ReplicaSetMonitor::remove() based on some older sample code in the hopes that it would prevent our ReplicaSetMonitor related crashes.

In testing today, it has become clear that the ReplicaSetMonitor was not the problem. Removing the ReplicaSetMonitor caused the Watcher thread to do nothing every ten seconds, instead of referencing damaged memory every ten seconds. Basically, it hid the symptoms of an unrelated bug.

Sometimes when I started a second connection to the same replica server, I would get memory trashing. This doesn't show up in Dr Memory or valgrind, because it is valid memory being changed.

If I copy the dbclient_rs.cpp and dbclient_rs.h fixes for SERVER-8891 into my 2.4.5 directory and remake, the memory corruption doesn't happen. I cannot see that there any changes in these two files that could possibly affect this, but....

In summary, the SERVER-8891 fix is good and also mysteriously fixes a memory trashing problem in my application. The SERVER-8707 bug is real and annoying. The second way of reproducing this bug is simply a way to destroy the memory used by the watcher thread while the process continues. I think this bug can be considered a duplicate of 8707.

Comment by Tad Marshall [ 31/Jul/13 ]

Hi Gerry,

Thanks for the sample program!

I discussed the correct usage of ReplicaSetMonitor with another developer here, and he shares my feeling that this API (ReplicaSetMonitor::remove) was never intended to be called by C++ driver users. I also talked to people in our documentation group and confirmed my impression that the C++ driver was more-or-less undocumented, and so users are probably reading MongoDB source code to try to figure out how to use the C++ driver.

I apologize for the poor state of information on proper use of callable functions in the C++ driver.

I'll try your sample program and see if I can make MongoDB more defensive against calls like this.

In the meantime, can you describe what you are trying to accomplish in removing the replica set that is being used by the DBClientReplicaSet that you created? There may be a better way to do it that will not require additional defensive code in the C++ driver.

Thanks!

Tad

Comment by Gerry F [ 31/Jul/13 ]

This simple test case works correctly, except for a rare crash on exit (1 crash in over 30 runs). Running in valgrind seems to get luckier, at over 60 runs without an exit crash or invalid memory access.

Clearly, the added complexity of the real application matters.

// SERVER-10372
#include <mongo/client/dbclient.h>
using namespace mongo;
 
int main() {
 
    std::vector<HostAndPort> hosts;
       hosts.push_back(HostAndPort("192.168.202.249:27017"));
       hosts.push_back(HostAndPort("192.168.202.249:27018"));
       hosts.push_back(HostAndPort("192.168.202.249:27020"));
 
       DBClientReplicaSet connection("rs0", hosts);
       connection.connect();
 
       mongo::sleepsecs(12);
 
       mongo::ReplicaSetMonitor::remove("rs0", true);
 
       mongo::sleepsecs(30);
 
}

Comment by Gerry F [ 31/Jul/13 ]

My current test case is too embedded in a tool environment to share nicely.

While I am trying to cut my test case down, you might try this:
Make a connection to a replica set.
Disconnect from the replica set.
Call ReplicatSetMonitor::remove().
Call sleep for 20 seconds.

If I am right about what the problem is, this simple test will read freed memory and be detectable in valgrind.

It will take me a while to convert my test case to something you can run, but it is priority one.

Comment by Tad Marshall [ 31/Jul/13 ]

Hi Gerry,

Thanks for this bug report.

As you know from SERVER-8891, we've had a long-standing problem with static objects being destroyed in an "incorrect" order in which later-destroyed objects reference objects that were destroyed earlier. Our "fix" for this in the MongoDB server has been to always exit by calling _exit(), which prevents any static destructors from running. Imposing this requirement on users of the C++ driver has seemed onerous, so we're trying to eliminate the crashes by making cleanup more predictable in the cases that we can identify.

There is a remaining problem with the ReplicaSetMonitorWatcher thread, because the thread doesn't terminate until after the object containing the data it uses has been destroyed. This only causes problems intermittently, because of the 10 second sleep that this thread does on each pass; if the process termination happens while the thread is sleeping, nothing bad happens. When the termination happens when the thread is not in the sleep, we get the problems seen in SERVER-8707, including things like "src/third_party/boost/boost/thread/pthread/mutex.hpp:154: void boost::timed_mutex::lock(): Assertion `!pthread_mutex_lock(&m)' failed.".

I think that you are describing a different problem. You mentioned calling remove() on a ReplicaSetMonitor and then waiting 10 seconds (the duration of the sleep in ReplicaSetMonitorWatcher's loop) so it sounds like this is not the process termination case described in SERVER-8707.

Can you describe the steps to reproduce the problem you are seeing, or post sample code that we can use to reproduce the problem?

Thanks!

Tad

Generated at Thu Feb 08 03:22:59 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.