[SERVER-7841] Segmentation faults and lost replicaset seed on rs.remove() Created: 04/Dec/12 Updated: 15/Feb/13 Resolved: 06/Dec/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Sharding, Stability |
| Affects Version/s: | 2.0.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Brett Kiefer | Assignee: | Randolph Tan |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | crash, replicaset | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Ubuntu 12.04.1 LTS on a hi1.4xlarge AWS instance |
| Issue Links: |
|
| Operating System: | Linux | ||||||||
| Steps To Reproduce: | The description field is in narrative format with stack traces. The cluster is 2 shards, each a 3-machine replica set with one node acting as a 1-hour delayed slave, and a total of about 300 GB of actual data. |
| Description |
|
We were taking an old MongoDB server out of a replica set (one of two shards), to be replaced with a newer build of the system. The machine to be taken out was called mongo2-1, and it was the primary. I first stepped it down from primary; mongo2-4 was elected the new primary. That went very smoothly, and mongo2-4 was serving requests with little or no effect on the site from the transition. I let that run for a minute, and since everything looked fine, I ran rs.remove() to take mongo2-1 out of the set.

The new primary (mongo2-4) logged some very strange replica set behavior. Note that the seed there is the replica set member I had just removed. Then MongoDB on mongo2-4 seg faulted:

Mon Dec 3 21:57:00 Invalid access at address: 0xffffffffffffffe0
Mon Dec 3 21:57:00 Got signal: 11 (Segmentation fault).
Mon Dec 3 21:57:00 Backtrace:

After it restarted, we were consistently getting socket exceptions in Mongo clients connecting through mongos, and the mongos logs were showing:

Mon Dec 3 21:57:36 [conn18303] SyncClusterConnection connecting to [mongo-config-1.prod.trello.local:27019]

Finally, we added mongo2-1 back into the replica set, restarted all mongos and client processes, and things went back to normal. An hour later, we were able to remove mongo2-1 from the replica set and nothing exploded. All told, about a 15-minute site outage. |
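
For reference, the maintenance sequence described above corresponds to the following mongo shell commands (a sketch only; the host:port string is illustrative, not taken from this ticket):

```
// On the current primary (mongo2-1): step down and force an election.
rs.stepDown()

// After confirming the new primary (mongo2-4) is serving traffic,
// connect to the new primary and remove the old member.
rs.remove("mongo2-1:27017")   // host:port assumed for illustration
```

This is the point at which the segfault on the new primary occurred, per the title of this ticket.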
| Comments |
| Comment by Brett Kiefer [ 06/Dec/12 ] |
|
Yes, we are scheduled to upgrade to 2.0.8, and I agree that SERVER-5166 and SERVER-5833 (linked from SERVER-5110) look like the same issue, so hopefully we won't see this post-upgrade. Thank you. |
| Comment by Randolph Tan [ 05/Dec/12 ] |
|
Hi, would it be possible to upgrade to 2.0.8? We made a couple of fixes in mongos replica set handling, most notably, Thanks! |