[SERVER-7163] Replica set crash with segfault Created: 26/Sep/12 Updated: 15/Feb/13 Resolved: 22/Oct/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.2.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker - P1 |
| Reporter: | Roman Janusz | Assignee: | Eric Milkie |
| Resolution: | Duplicate | Votes: | 2 |
| Labels: | crash, replicaset | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Linux 3.0.0-12-server #20-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux |
||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||
| Operating System: | Linux | ||||||||||||
| Participants: | |||||||||||||
| Description |
|
Our 4-node replica set crashed, with all nodes going down at the same time. Here are some log fragments showing the segfault: Node 1:
Node 2:
Node 3:
Node 4:
|
| Comments |
| Comment by Eric Milkie [ 25/Oct/12 ] | |||||||||||||||||||||||
|
| |||||||||||||||||||||||
| Comment by Nico Schottelius [ 25/Oct/12 ] | |||||||||||||||||||||||
|
I was wondering: shouldn't there be a bugfix in the server as well? A malformed request can always come in from the network, and the server should probably close the connection to that particular client and ignore it instead of segfaulting. | |||||||||||||||||||||||
| Comment by Daniel Pasette (Inactive) [ 22/Oct/12 ] | |||||||||||||||||||||||
|
This crash was caused by | |||||||||||||||||||||||
| Comment by Eric Milkie [ 02/Oct/12 ] | |||||||||||||||||||||||
|
It absolutely should but it doesn't yet. The message parsing off the wire does minimal protocol conformance checks. This is something we're looking to add in a future release. | |||||||||||||||||||||||
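As an illustration of the kind of conformance check discussed here, the following is a minimal sketch of validating a wire protocol message header before parsing the body. It is not the server's code. It relies only on the documented 16-byte header layout (four little-endian int32 fields: messageLength, requestID, responseTo, opCode); the `check_header` function, the opcode set, and the `MAX_MESSAGE_BYTES` limit are assumptions made for this example.

```python
import struct
from typing import Optional

# Standard message header: messageLength, requestID, responseTo, opCode,
# each a little-endian signed 32-bit integer (16 bytes in total).
HEADER = struct.Struct("<iiii")

# Opcodes of the legacy wire protocol; treat this set as illustrative.
KNOWN_OPCODES = {1, 2001, 2002, 2004, 2005, 2006, 2007}

# Assumed upper bound for this sketch; a real server would apply its own limit.
MAX_MESSAGE_BYTES = 48_000_000


def check_header(buf: bytes) -> Optional[str]:
    """Return a reason to reject the message, or None if the header looks sane."""
    if len(buf) < HEADER.size:
        return "short header"
    length, _request_id, _response_to, opcode = HEADER.unpack_from(buf)
    if length < HEADER.size:
        return "length smaller than header"
    if length > MAX_MESSAGE_BYTES:
        return "length exceeds limit"
    if opcode not in KNOWN_OPCODES:
        return "unknown opcode"
    return None
```

A listener using a check like this would presumably close the offending connection as soon as a reason is returned, rather than continuing to parse the body.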
| Comment by Roman Janusz [ 02/Oct/12 ] | |||||||||||||||||||||||
|
I created a separate issue for the Java driver, but shouldn't the server be able to protect itself against such situations? | |||||||||||||||||||||||
| Comment by Jeffrey Yemin [ 02/Oct/12 ] | |||||||||||||||||||||||
|
Yes, please do. | |||||||||||||||||||||||
| Comment by Roman Janusz [ 02/Oct/12 ] | |||||||||||||||||||||||
|
Mongo has now been up for almost 90 hours without any issues, so it seems quite likely that the driver was the source of the crashes. Should I create another ticket for that? | |||||||||||||||||||||||
| Comment by Roman Janusz [ 29/Sep/12 ] | |||||||||||||||||||||||
|
The replica set, running version 2.0.7, has now been up for almost 30 hours without any problems. We will also try upgrading mongo to 2.2.0 again to see whether anything bad happens with that version, but at the moment the main suspicion is that the Java Mongo driver was broken in some version above 2.8.0. By the way, another interesting symptom of our problems with mongo is that some corrupt data appeared in the database, supposedly during the time that driver 2.9.1 was in use. For example, badly named databases and collections appeared in our database. Their names are often malformed versions of correct names, e.g. with some part of the name missing. If this helps with debugging, our collection names are often fully qualified Java class names, so they're long and contain a few dots. | |||||||||||||||||||||||
| Comment by Eric Milkie [ 28/Sep/12 ] | |||||||||||||||||||||||
|
In the meantime I'd check to make sure everything is configured as you expect, and that the port numbers are correct for all servers and drivers. | |||||||||||||||||||||||
| Comment by Roman Janusz [ 28/Sep/12 ] | |||||||||||||||||||||||
|
There is no way that anything other than our application connects to the database. These IPs are OK; our application is running on these hosts. Is it possible that MongoDB Java Driver 2.9.1 is corrupt? We upgraded from 2.8.0 to 2.9.1 at about the time mongo started crashing regularly. We have downgraded the driver to 2.8.0 and are monitoring the situation. We'll let you know if mongo crashes again. | |||||||||||||||||||||||
| Comment by Eric Milkie [ 28/Sep/12 ] | |||||||||||||||||||||||
|
Hello Roman. Can you identify server 10.220.40.25 or 10.220.40.26? They have processes connecting to the mongod port but the processes aren't valid mongo drivers. | |||||||||||||||||||||||
| Comment by Roman Janusz [ 28/Sep/12 ] | |||||||||||||||||||||||
|
The system crashed again. This time, however, it was MongoDB version 2.0.7, and 2 out of 4 nodes crashed (acs3, acs4). Logs are attached. | |||||||||||||||||||||||
| Comment by Roman Janusz [ 27/Sep/12 ] | |||||||||||||||||||||||
|
The crash has happened again. Last time we suspected corrupted data to be the reason, as mongo kept crashing, but this time we started with an empty database and it ran for about 20 hours before finally crashing. Here is more detailed info about the environment: Mongo was running as a replica set on 4 nodes: acs1 was most probably the primary node - it got all the writes and a small share of the reads (most reads were going to secondaries). Here is the current result from rs.config() - after the restart:
The application was running on three nodes - acs1, acs3, acs4. It is a Java application using Java MongoDB driver 2.9.1. Logs from all nodes are attached. | |||||||||||||||||||||||
| Comment by Eric Milkie [ 26/Sep/12 ] | |||||||||||||||||||||||
|
Can you post more of the logs from each of the four nodes? Also post the configuration of your replica set by issuing rs.config() in the Mongo shell. At first glance, it appears that one or more rogue servers connected to the mongod instances but did not speak the correct wire protocol. Can you identify the servers by the IP addresses in the log fragments? For example, who is 10.220.40.25?
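To help with identifying clients by IP address, here is a small sketch that tallies the remote addresses found in mongod log lines. It assumes the log contains lines of the form `connection accepted from <ip>:<port>`; the `count_clients` helper is hypothetical, and the sample lines in the usage note are made up for illustration, not taken from the attached logs.

```python
import re
from collections import Counter

# Assumed log line shape: "... connection accepted from 10.0.0.1:51234 #12 ..."
ACCEPTED = re.compile(r"connection accepted from (\d{1,3}(?:\.\d{1,3}){3}):\d+")


def count_clients(lines):
    """Count accepted connections per remote IP address."""
    counts = Counter()
    for line in lines:
        match = ACCEPTED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

For example, feeding it two made-up lines such as `"[initandlisten] connection accepted from 10.220.40.25:51234 #12"` and `"[initandlisten] connection accepted from 10.220.40.26:40001 #13"` yields one connection for each address. The resulting addresses can then be compared against the hosts the application is known to run on.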