[SERVER-9995] corruption on primaries after upgrade to 2.4.4 Created: 23/Jun/13 Updated: 10/Dec/14 Resolved: 26/Jun/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 2.4.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | charity majors | Assignee: | Unassigned |
| Resolution: | Duplicate | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
I'm currently running a handful of 2.2.4 clusters. On June 11th, I upgraded one primary and one secondary apiece on two clusters to 2.4.4. A couple days later, I started getting reports of data corruption. Looking through my logs, I saw a ton of these on one of the 2.4.4 primaries:
I failed over to another primary (also 2.4.4), and within 4 days it started generating the same assertions.
I just checked on my other 2.4.4 primary on a totally different replica set, and sure enough, it has a shitload of "corrupt db" errors too. Thousands. asya suggested this may be due to index corruption, not data corruption, so I'm going to try rebuilding the indexes on these nodes once I can take them offline. |
| Comments |
| Comment by Daniel Pasette (Inactive) [ 26/Jun/13 ] |
|
marking as duplicate of |
| Comment by charity majors [ 25/Jun/13 ] |
|
Ok, thanks. I look forward to re-attempting an upgrade when 2.4.5 is out. |
| Comment by David Hows [ 25/Jun/13 ] |
|
Hi Charity, I've gone over the logs you posted and found that these issues are
From your logs I can see (for example) that you create two indexes on collection appdata45.app_cf6a4e7c-9303-4928-972d-8c82ca4bf973:TestTable
This works fine until you drop the indexes on that collection. The drop removes one of the index entries and the index itself, but the second entry for the { inedx: 1 } index remains. This leaves you with an index entry that points at empty space.
From that point on, all queries to this collection will fail when they attempt to use that index.
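One way to look for the duplicate-entry condition described above is to group the index specs for a collection by key pattern and flag any pattern listed more than once. This is only a minimal sketch: the helper name findDuplicateIndexKeys is hypothetical, and it assumes the specs have the shape returned by getIndexes(), i.e. objects with at least a key and a name field. It has not been run against an affected 2.4.4 node.

```javascript
// Hypothetical helper (not part of the mongo shell): report key patterns
// that appear more than once in a list of index specs.
// Assumed input shape: [{ key: { field: 1 }, name: "field_1" }, ...]
function findDuplicateIndexKeys(indexSpecs) {
  var namesByPattern = {};
  indexSpecs.forEach(function (spec) {
    // JSON.stringify keeps field order, which matters for compound keys.
    var pattern = JSON.stringify(spec.key);
    if (!namesByPattern.hasOwnProperty(pattern)) {
      namesByPattern[pattern] = [];
    }
    namesByPattern[pattern].push(spec.name);
  });
  // Keep only the key patterns that have more than one index entry.
  return Object.keys(namesByPattern)
    .filter(function (pattern) {
      return namesByPattern[pattern].length > 1;
    })
    .map(function (pattern) {
      return { key: JSON.parse(pattern), names: namesByPattern[pattern] };
    });
}
```

In the shell this could be called as findDuplicateIndexKeys(db["<collection name>"].getIndexes()); an empty array would mean no key pattern is listed twice.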
Given this, the steps to recover are:
If you are unable to change the behaviour in your application, then the recommended action is to roll back to 2.2 before stepping down the primary and re-syncing.
Regards, |
| Comment by charity majors [ 24/Jun/13 ] |
|
Yes, thanks, I was afk. Uploaded. |
| Comment by Daniel Pasette (Inactive) [ 24/Jun/13 ] |
|
Hi Charity, not sure if you received notification, but I created a private ticket for you to upload logfiles here: SUPPORT-615. |
| Comment by charity majors [ 24/Jun/13 ] |
|
Yes, I can compress and upload, but we would like to keep them private. |
| Comment by Asya Kamsky [ 24/Jun/13 ] |
|
I'm about 80% certain that you are hitting |
| Comment by Daniel Pasette (Inactive) [ 24/Jun/13 ] |
|
Do you have the complete log file from one of the nodes starting from the time you upgraded until you started seeing the issues? If so, can you compress and upload? If you prefer to keep this log information private, I can start a SUPPORT ticket for this issue. |
| Comment by charity majors [ 24/Jun/13 ] |
|
Yes, I was able to run getIndexes() on at least one of the impacted collections.
parse2:PRIMARY> db["app_df7688a2-419b-470e-8adb-f1071e960753:Answer"].getIndexes()
, , , |
| Comment by Daniel Pasette (Inactive) [ 24/Jun/13 ] |
|
Can you run getIndexes() on the impacted collections? Trying to see if this is fallout from |
| Comment by charity majors [ 24/Jun/13 ] |
|
I was unable to drop the indexes.
parse2:PRIMARY> db["app_df7688a2-419b-470e-8adb-f1071e960753:Answer"].dropIndex("_p_child_1")
I'm rolling back to 2.2 and going to repair these nodes. |