[SERVER-569] better invalid object debugging (WAS: 1.1.3 -> 1.2.1 replica pair (slave) initial cloning fails) Created: 26/Jan/10 Updated: 12/Jul/16 Resolved: 26/Jan/10 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 1.2.1 |
| Fix Version/s: | 1.2.2, 1.3.2 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Erwan Arzur | Assignee: | Eliot Horowitz (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Linux ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:34:28 EST 2008 x86_64 x86_64 x86_64 GNU/Linux |
||
| Participants: |
| Description |
|
Trying to upgrade our replica pairs to 1.2.1, I started from a fresh slave (no initial data) running with the following command line:

/opt/mongodb-linux-x86_64-1.2.1/bin/mongod --dbpath=/mongo/data --nssize 160 --noauth --pairwith=production-shard1-001

After a while (about 30GB cloned), the slave issues the following message:

Tue Jan 26 09:38:12 Assertion: Invalid dbref/code/string/symbol size

Given the delay between the first message (38:12) and the next one (38:33), I'm not even sure the object that causes the error is in this collection.

I thought that 1.2.1 would report this error, including the _id of the culprit, and just go on. That is not the case: strace on the mongod process shows that it is just sitting there on a wait4 call:

Process 8175 attached - interrupt to quit

As 1.1.3 doesn't have the newly introduced bsonsize() API call, how can I identify and get rid of those invalid objects in the master's database? |
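For readers unfamiliar with the assertion quoted above: in the public BSON layout, string, code and symbol values carry an int32 length prefix, and a prefix that disagrees with the bytes actually present is the kind of inconsistency such a message points at. The following is only a minimal sketch of that kind of check, written against the BSON specification and not taken from mongod; the function name `check_document` and the small set of supported element types are assumptions made for illustration.

```python
import struct

# Payload widths for a few fixed-size BSON element types (illustrative subset).
FIXED_WIDTH = {0x01: 8, 0x07: 12, 0x08: 1, 0x09: 8, 0x0A: 0, 0x10: 4, 0x12: 8}
# Types stored as: int32 length, then that many bytes, the last one being NUL.
LENGTH_PREFIXED = {0x02: "string", 0x0D: "code", 0x0E: "symbol"}


def check_document(buf, start=0, limit=None):
    """Walk one BSON document starting at buf[start]; return the offset past it.

    Raises ValueError as soon as a size field disagrees with the data.
    Hypothetical helper, not part of any MongoDB API.
    """
    limit = len(buf) if limit is None else limit
    if start + 5 > limit:
        raise ValueError("truncated document")
    (total,) = struct.unpack_from("<i", buf, start)
    end = start + total
    if total < 5 or end > limit or buf[end - 1] != 0:
        raise ValueError("invalid document size")
    pos = start + 4
    while pos < end - 1:
        etype = buf[pos]
        # Field names are NUL-terminated; index() raises ValueError if missing.
        name_end = buf.index(b"\x00", pos + 1, end - 1)
        name = buf[pos + 1:name_end].decode("utf-8", "replace")
        pos = name_end + 1
        if etype in LENGTH_PREFIXED:
            if pos + 4 > end - 1:
                raise ValueError(f"truncated size prefix in field {name!r}")
            (size,) = struct.unpack_from("<i", buf, pos)
            body_end = pos + 4 + size
            if size < 1 or body_end > end - 1 or buf[body_end - 1] != 0:
                kind = LENGTH_PREFIXED[etype]
                raise ValueError(f"invalid {kind} size in field {name!r}")
            pos = body_end
        elif etype in (0x03, 0x04):
            # Embedded document or array: must fit inside the parent.
            pos = check_document(buf, pos, end - 1)
        elif etype in FIXED_WIDTH:
            pos += FIXED_WIDTH[etype]
        else:
            raise ValueError(f"unsupported element type 0x{etype:02x}")
    if pos != end - 1:
        raise ValueError("element overruns document")
    return end


# {"a": "hi"} encoded by hand, then the same bytes with a bogus string length.
good = struct.pack("<i", 15) + b"\x02a\x00" + struct.pack("<i", 3) + b"hi\x00\x00"
bad = struct.pack("<i", 15) + b"\x02a\x00" + struct.pack("<i", 300) + b"hi\x00\x00"
```

Calling `check_document(good)` returns the offset past the document, while `check_document(bad)` raises a `ValueError` that names the offending field, which is the sort of detail the reporter was hoping the server would log.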
| Comments |
| Comment by Eliot Horowitz (Inactive) [ 26/Jan/10 ] |
|
You could upgrade the master to 1.2.1 so you can use bsonsize() |
| Comment by Erwan Arzur [ 26/Jan/10 ] |
|
I agree with your latest comment, but I wonder how much time it's going to take. Let me explain. The replica pair showing the issue holds a 230 GB database. When the problem occurs, the cloning process is well under 10% complete, and it takes a few hours just to get there. Restarting the slave means restarting the cloning process from zero, with a more than probable chance of hitting another corrupted object, which would restart cloning from scratch yet again. This is just not practical for us. |
| Comment by Eliot Horowitz (Inactive) [ 26/Jan/10 ] |
|
No - that's not a bug, that's the correct behavior in our opinion. |
| Comment by Erwan Arzur [ 26/Jan/10 ] |
|
Thanks Eliot, did you fix the problem with the slave stopping cloning the database at the same time? Erwan |
| Comment by Eliot Horowitz (Inactive) [ 26/Jan/10 ] |
|
You'll be able to see the object id now.