[SERVER-4305] Deadlock of secondary trying to sync the oplog if index versions are mixed on master
Created: 17/Nov/11  Updated: 14/May/12  Resolved: 14/May/12

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.0.1
Fix Version/s: None
Type: Bug
Priority: Blocker - P1
Reporter: Steffen
Assignee: Kristina Chodorow (Inactive)
Resolution: Cannot Reproduce
Votes: 0
Labels: index, replication, stale
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Linux 2.6.32-35-server, Ubuntu 10.04, MongoDB 2.0.1, Replicaset with 3 Nodes, NUMA, 2x XEON E5620, 24 GB RAM
Attachments:
Operating System: ALL
Participants:

Description

What we want to do:
How we do this:
What is the problem:
What we suspect:

We downgraded the indexes again with an older mongod binary (1.8.4). After this finished, we connected the secondary to the replica set again; it replayed the oplog without a problem and is now back in sync. All hosts run the mongod binary version 2.0.1. I've attached iostat and mongostat output. The host with the problem is mn01.

Regards,
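
For context on the index versions discussed in this ticket, an index's version can be checked from the mongo shell. This is a minimal sketch, not part of the original report; "mycoll" is a placeholder collection name:

    // Print the version of each index on a collection.
    // Indexes built by 1.8.x are v:0 (the "v" field may be absent);
    // indexes built or rebuilt by 2.0.x are v:1.
    db.mycoll.getIndexes().forEach(function (idx) {
        print(idx.name + " -> v:" + (idx.v === undefined ? 0 : idx.v));
    });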

Comments

Comment by Ian Whalen (Inactive) [ 14/May/12 ]
Closing as Cannot Reproduce - please reopen if you continue to run into the same problem.

Comment by Steffen [ 18/Jan/12 ]
Hello, we tried another run to upgrade the indexes. Maybe the --repair startup option doesn't work, but compact does.
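
For reference, a minimal sketch of the compact approach from the mongo shell; "mydb" and "mycoll" are placeholder names, not taken from the ticket:

    // compact defragments the collection and rebuilds its indexes
    var d = db.getSiblingDB("mydb");
    d.runCommand({ compact: "mycoll" });
    // each index document's "v" field shows the version of the rebuilt index
    d.mycoll.getIndexes();

Note that compact blocks other operations on its database while it runs, so it is typically run on one secondary at a time.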

Comment by Steffen [ 27/Dec/11 ]
We will try another rebuild of the indexes next year and will then report the results.

Comment by Kristina Chodorow (Inactive) [ 13/Dec/11 ]
Sorry about the delay, I was consulting with coworkers. Can you attach gdb and get a stack trace for all threads at a few different times, so that we can see what code is running when the CPU is pegged? To get a backtrace of all threads in gdb, run:
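
For example, with 12345 standing in for the mongod process id:

    gdb -p 12345                 # attach gdb to the running mongod
    (gdb) thread apply all bt    # print a backtrace of every thread
    (gdb) detach                 # let mongod keep running
    (gdb) quit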

Comment by Steffen [ 01/Dec/11 ]
Yes, already posted:

Comment by Kristina Chodorow (Inactive) [ 30/Nov/11 ]
Unfortunately, MMS doesn't give us that kind of insight (it's just statistics about your data, not the data or queries themselves). That does look like something weird is happening during the repair; do you happen to have the log from when you did the repair?

Comment by Steffen [ 23/Nov/11 ]
Can you read from MMS the operations that are happening? The obvious difference between the two replica sets we are using is just the index versions of the databases.

Comment by Kristina Chodorow (Inactive) [ 22/Nov/11 ]
Okay, thanks. I have been unable to reproduce so far. Can you describe what kinds of operations are happening on the primary?
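
One way to capture that on the primary is the mongo shell's currentOp helper and the database profiler; this is a minimal sketch, with the 100 ms threshold chosen only as an example:

    db.currentOp()                  // operations currently in progress
    db.setProfilingLevel(1, 100)    // log operations slower than 100 ms to system.profile
    db.system.profile.find().sort({ ts: -1 }).limit(5)   // the most recently profiled operations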

Comment by Steffen [ 22/Nov/11 ]
Should be the 15th.

Comment by Kristina Chodorow (Inactive) [ 21/Nov/11 ]
Was the mongostat taken on the 15th? 16th?

Comment by Steffen [ 18/Nov/11 ]
Log of the repair process and the first start after the repair.

Comment by Steffen [ 18/Nov/11 ]
Log of the secondary from Nov 15.

Comment by Kristina Chodorow (Inactive) [ 18/Nov/11 ]
Thanks for all the information. I'm not sure what day the mongostat output was from, but from MMS it looks like the period up to November 15th is what would be of interest, log-wise. Do you have the secondary's log from then? Or was the lockup after that spike? (Also, please don't mark comments as viewable by Users only; it just messes us up, and they're still visible to everyone.)

Comment by Steffen [ 17/Nov/11 ]
Log of the secondary that is behind the master and trying to catch up.

Comment by Steffen [ 17/Nov/11 ]
All data on this replica set uses 125 GB. We tried this with another node, which did a full resync, but we ran into this bug: https://jira.mongodb.org/browse/SERVER-4294 We have another replica set with all indexes at version 1.8 (v:0). We also added a clean node there for backup purposes; it has the new indexes from the initial sync and works without problems.

Comment by Eliot Horowitz (Inactive) [ 17/Nov/11 ]
Can you attach the log from the secondary when it can't keep up? How big is the data set? Have you tried not doing a repair, but just letting a secondary do a full resync?