[SERVER-4174] Replicaset resync skips building of _id index in certain circumstances Created: 29/Oct/11 Updated: 15/Aug/12 Resolved: 09/Mar/12

| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.0.1 |
| Fix Version/s: | None |
| Type: | Bug |
| Priority: | Major - P3 |
| Reporter: | Y. Wayne Huang |
| Assignee: | Kristina Chodorow (Inactive) |
| Resolution: | Cannot Reproduce |
| Votes: | 0 |
| Labels: | indexing, replicaset, replication |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | ubuntu 10.04 x86_64; 4 x replicaset shards, each replicaset with two participating members and one arbiter |
| Issue Links: | |
| Operating System: | Linux |
| Participants: | |

Description
We are in the process of resyncing to address fragmentation and to upgrade all indexes to the newer v1 format. This process worked fine for 3 of our 4 shards. The 4th shard is the primary for the namespace and, because we have not yet fully tested MR output to a sharded collection, it also stores the output of all MR tasks. None of the indexes on this shard have been upgraded to v1.

In 4 resync attempts (removing all data on the secondary), we have noticed that the { _id: 1 } index on some collections is not built during the resync process. We hypothesize that there is some corruption or incompatibility in the index that leaves it functional for queries but somehow prevents it from being rebuilt during resync attempts. We believe this impacts replay of the oplog, causing it to seemingly hang: optimeDate does not advance (perhaps it is working but extremely slowly). Shutting down the node shows the in-progress operation being interrupted:

Fri Oct 28 16:26:33 [rsSync] replSet syncTail: 11600 interrupted at shutdown, syncing: { ts: Timestamp 1319826732000|30, h: 1028901388370155163, op: "u", ns: "mydb.mycoll", o2: { _id: { ... } }, o: { _id: { ... }, value: { ...data... } } }

The timestamp matches the optimeDate. We have seen this affect both inserts and updates. It appears to only affect collections which are the target for MR output. Additionally, it may only affect collections in which MR results are re-reduced/re-finalized, but we are less confident about this assessment.
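One way to spot collections that came out of a resync without their default index is to compare each collection's index metadata against the expected { _id: 1 } spec. A minimal sketch in plain JavaScript, assuming input shaped like the mongo shell's getIndexes() output; the collection names and specs below are entirely hypothetical:

```javascript
// Flag collections whose index metadata lacks the default { _id: 1 } index.
// Input mimics the shape of getIndexes() results (an array of index spec
// documents per collection); the sample data is invented for illustration.
function collectionsMissingIdIndex(indexesByCollection) {
  const missing = [];
  for (const [name, specs] of Object.entries(indexesByCollection)) {
    const hasIdIndex = specs.some(
      (spec) =>
        spec.key &&
        spec.key._id === 1 &&
        Object.keys(spec.key).length === 1
    );
    if (!hasIdIndex) missing.push(name);
  }
  return missing;
}

// Hypothetical index metadata for two collections.
const sample = {
  "mydb.ok_coll": [
    { v: 1, key: { _id: 1 }, name: "_id_" },
    { v: 1, key: { value: 1 }, name: "value_1" },
  ],
  "mydb.mr_out": [
    // No _id index present after the resync.
    { v: 0, key: { value: 1 }, name: "value_1" },
  ],
};

console.log(collectionsMissingIdIndex(sample)); // [ 'mydb.mr_out' ]
```

In a live deployment the per-collection specs would come from running getIndexes() on each collection on the secondary and comparing against the primary.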
Comments

Comment by Kristina Chodorow (Inactive) [ 09/Mar/12 ]

Suspected issue, as the _id index is created after the clone.
Comment by Kristina Chodorow (Inactive) [ 09/Mar/12 ]

I'm sorry it took me so long to get back to you. We're a bit overloaded and I lost track of things. I'll link this to the bug I think caused the problem and mark it as Cannot Reproduce.
Comment by Y. Wayne Huang [ 09/Mar/12 ]

I'm sorry, but I no longer remember enough details about this issue to provide any further information. I suggest that when you mark a bug as 'debugging with submitter', the debugging not occur months after the issue is reported.
Comment by Kristina Chodorow (Inactive) [ 09/Mar/12 ]

This might be the same as
Comment by Kristina Chodorow (Inactive) [ 19/Jan/12 ]

Sorry for the delay! Can you send the output of running:

Also, if you can try initial syncing with -vvvvv and paste the logs, it would be very helpful.
Comment by Eliot Horowitz (Inactive) [ 09/Nov/11 ]

Kristina has been out, but should continue working on this tomorrow.
Comment by Y. Wayne Huang [ 08/Nov/11 ]

Any updates?
Comment by Y. Wayne Huang [ 01/Nov/11 ]

Updates?
Comment by Y. Wayne Huang [ 31/Oct/11 ]

serverStatus: { ... }
Comment by Y. Wayne Huang [ 31/Oct/11 ]

Okay, after a bit more debugging it looks like the missing { _id: 1 } indexes do eventually get built, but this occurs a few hours after the sync is complete and the recovering node has already become a secondary. Isn't this problematic, because replaying update operations from the oplog will result in a full collection scan without this index? Here is a portion of the log during a full resync (remove all secondary data + restart) showing one of the affected collections:

Sat Oct 29 21:11:39 [rsSync] 413286 objects cloned so far from collection mydb_0.mycoll_0 ...

For what it's worth, our initial assessment about a corrupt index on the primary is probably incorrect. We tested the behavior after rebuilding the indexes in the affected collections, and the same behavior occurs (the _id index is built a few hours later).
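The concern above, that applying oplog updates keyed on _id degrades to a full collection scan when the index is missing, can be illustrated with a small self-contained simulation. This is plain JavaScript with invented data, not MongoDB internals:

```javascript
// Illustrative simulation: an update keyed on _id examines one document
// when an _id -> position index exists, but every document up to the
// match when it does not.
function applyUpdateWithIndex(index, docs, id, newValue) {
  // Index maps _id to array position: one lookup, one write.
  const pos = index.get(id);
  if (pos !== undefined) docs[pos].value = newValue;
  return pos !== undefined ? 1 : docs.length; // documents examined
}

function applyUpdateWithoutIndex(docs, id, newValue) {
  // No _id index: linear scan until a match is found.
  let examined = 0;
  for (const doc of docs) {
    examined++;
    if (doc._id === id) {
      doc.value = newValue;
      break;
    }
  }
  return examined;
}

// Build a hypothetical 100,000-document collection and its _id index.
const docs = Array.from({ length: 100000 }, (_, i) => ({ _id: i, value: 0 }));
const index = new Map(docs.map((d, i) => [d._id, i]));

console.log(applyUpdateWithIndex(index, docs, 99999, 7));  // 1
console.log(applyUpdateWithoutIndex(docs, 99999, 8));      // 100000
```

With thousands of updates queued in the oplog against a large collection, the scan cost compounds per operation, which is consistent with the replay appearing to hang.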
Comment by Eliot Horowitz (Inactive) [ 29/Oct/11 ]

Can you attach all logs from the secondary that is failing?
Comment by Scott Hernandez (Inactive) [ 29/Oct/11 ]

Can you post db.serverStatus() on the secondaries please?
Comment by Y. Wayne Huang [ 29/Oct/11 ]

On the two affected collections, here is .stats() from the primary:

PRIMARY> db.mycoll.stats() { ... }