[SERVER-17147] sharded cluster replSet mongod crash Created: 02/Feb/15 Updated: 20/Mar/15 Resolved: 20/Mar/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Index Maintenance |
| Affects Version/s: | 2.6.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | SRR | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL |
| Participants: |
| Description |
|
Sharded cluster with replica sets (2 members, 1 arbiter). After upgrading from 2.4.5 to 2.6.6 and/or 2.6.7, the mongod processes crash at random times. Sometimes both members of a replica set crash one after the other.
|
| Comments |
| Comment by Sam Kleinman (Inactive) [ 20/Mar/15 ] | ||
|
Sorry for not getting back to you sooner, and sorry that this is frustrating. While the 2.4->2.6 upgrade process can be frustrating, the additional restrictions in 2.6 do make deployments more stable and reliable. It looks like the upgradeCheckAllDBs method will detect cases where you have collections that don't have indexes on the _id field. If these collections exist on secondaries, you can restart these instances as standalone (i.e. non-replica set) with the older version, and create the required indexes or drop these collections. At this point you should be able to upgrade the instances, starting them as members of the replica set. See the [documentation|http://docs.mongodb.org/manual/release-notes/2.6-upgrade/] for complete upgrade instructions. Sorry again for the frustration, and hope you can successfully upgrade. Regards, | ||
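The check described in this comment (finding collections that have no index on the _id field) can be sketched as a small pure helper. This is an illustration only, not the logic of upgradeCheckAllDBs: the function names, the sample namespaces, and the assumed shape of the index documents (modelled on 2.4-era system.indexes entries) are all hypothetical.

```javascript
// Sketch only: pure helper, no database connection required.
// Assumed input shape, modelled on 2.4-era system.indexes documents:
//   { v: 1, key: { _id: 1 }, ns: "mydb.users", name: "_id_" }

// True when an index spec is the single-field ascending index on _id.
function isIdIndex(spec) {
  var fields = Object.keys(spec.key || {});
  return fields.length === 1 && fields[0] === "_id" && spec.key._id === 1;
}

// Return the namespaces that have no _id index among the given index specs.
function namespacesMissingIdIndex(namespaces, indexSpecs) {
  var covered = {};
  indexSpecs.forEach(function (spec) {
    if (isIdIndex(spec)) {
      covered[spec.ns] = true;
    }
  });
  return namespaces.filter(function (ns) {
    return !covered[ns];
  });
}

// Hypothetical sample data for illustration.
var sampleNamespaces = ["mydb.users", "mydb.events"];
var sampleSpecs = [
  { v: 1, key: { _id: 1 }, ns: "mydb.users", name: "_id_" },
  { v: 1, key: { ts: 1 }, ns: "mydb.events", name: "ts_1" }
];

// Only mydb.events lacks an _id index in the sample data.
console.log(namespacesMissingIdIndex(sampleNamespaces, sampleSpecs));
```

On a node restarted as standalone with the older binary, the inputs could plausibly be gathered in the mongo shell with db.getCollectionNames() and db.system.indexes.find().toArray(), and a missing index created with ensureIndex({ _id: 1 }) on the affected collection. Treat that workflow as an assumption to verify against the upgrade documentation; system.* namespaces would also need to be excluded before acting on the result.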
| Comment by SRR [ 09/Feb/15 ] | ||
|
So if anyone else runs into this problem: you must kill the mongod process that has the stale collection (if it's not dead already) and then restart it with the same settings/conf, but using the mongod executable from MongoDB 2.4.10. The issue I mentioned in the previous comment happened again, and I had to run rs.stepDown(10) on all the replica set primaries, so that a secondary became primary, before it started working again... ridiculous. | ||
| Comment by SRR [ 03/Feb/15 ] | ||
|
So... I tried to drop that tmp collection, but it complained that it had to be the master. So I tried to step down the master, but of course the secondary died as shown in the initial log. Frustratingly, the database then stopped working for our app with "failed: MR post processing failed: { errmsg: "exception: could not initialize cursor across all shards because : socket exception [SEND_ERROR]" even though only that secondary was dead. So I began restarting the app/mongod/that secondary/mongos, but nothing worked; I kept getting the same error. After 15 minutes it magically started working again, so maybe some database cache was the problem... sigh. Any ideas? | ||
| Comment by SRR [ 03/Feb/15 ] | ||
|
Okay, actually I got a warning when I connected that I haven't seen before, so this is probably my issue: Server has startup warnings: | ||
| Comment by Ramon Fernandez Marina [ 03/Feb/15 ] | ||
|
aerospace, apologies if I wasn't clear – I meant to ask you to run
on all the data-bearing nodes where you see the assertion (not against mongos). Can you please connect to each mongod and run the above command? Note that on secondaries you may need to run:
first. If any of the mongod servers produces a non-empty output then we'll know we're dealing with Thanks, | ||
| Comment by SRR [ 02/Feb/15 ] | ||
|
mongos> db.system.indexes.getIndexSpecs() | ||
| Comment by Andy Schwerin [ 02/Feb/15 ] | ||
|
Unfortunately, the stack trace is so deep that it's hard to be 100% certain, but the log suggests that the crash happens while deleting a temporary collection created by a mapreduce that was running on the former primary. The node whose log snippet is presented is just becoming primary at the top, and goes to drop a collection with the telltale .tmp.mr. in its name. The stack trace, though truncated, indicates that the drop-collection call was initiated by mongod itself, not the client, which also indicates temp-collection clean-up. | ||
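To make the naming pattern concrete, here is a minimal sketch that picks out namespaces containing the .tmp.mr. marker mentioned above. The function name and the sample namespaces are hypothetical, and matching on a plain substring is an assumption made for illustration rather than the server's own clean-up logic.

```javascript
// Sketch only: flags namespaces that look like map-reduce temp collections.
// Matching on the ".tmp.mr." substring follows the comment above; the sample
// namespaces below are hypothetical.
function findMapReduceTempNamespaces(namespaces) {
  return namespaces.filter(function (ns) {
    return ns.indexOf(".tmp.mr.") !== -1;
  });
}

var exampleNamespaces = [
  "mydb.orders",
  "mydb.tmp.mr.orders_0",
  "mydb.tmp.mr.orders_0_inc"
];

// Keeps the two temp namespaces and drops mydb.orders.
console.log(findMapReduceTempNamespaces(exampleNamespaces));
```

On deployments of this era the list of namespaces could probably be read from db.system.namespaces in the mongo shell; whether that applies to a given node is an assumption to check before relying on it.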
| Comment by Ramon Fernandez Marina [ 02/Feb/15 ] | ||
|
aerospace, this might be a duplicate of
? |