[SERVER-569] better invalid object debugging (WAS: 1.1.3 -> 1.2.1 replica pair (slave) initial cloning fails) Created: 26/Jan/10  Updated: 12/Jul/16  Resolved: 26/Jan/10

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 1.2.1
Fix Version/s: 1.2.2, 1.3.2

Type: Bug Priority: Major - P3
Reporter: Erwan Arzur Assignee: Eliot Horowitz (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:34:28 EST 2008 x86_64 x86_64 x86_64 GNU/Linux


Participants:

 Description   

Trying to upgrade our replica pairs to 1.2.1, I started from a fresh slave (no initial data) running with the following command line:

/opt/mongodb-linux-x86_64-1.2.1/bin/mongod --dbpath=/mongo/data --nssize 160 --noauth --pairwith=production-shard1-001

After a while (about 30GB cloned), the slave issues the following messages:

Tue Jan 26 09:38:12 Assertion: Invalid dbref/code/string/symbol size
skipping corrupt object from production.messages_dxxxxxxxxxxxxxxxxxx
Tue Jan 26 09:38:33 invalid object size: 11031214
Tue Jan 26 09:38:33 Assertion: Invalid BSONObj spec size
Tue Jan 26 09:38:33 repl: AssertionException Invalid BSONObj spec size
Tue Jan 26 09:38:33 repl: sleep 2sec before next pass

Given the delay between the first message (38:12) and the next one (38:33), I'm not even sure the object that causes the error is in this collection. I thought that 1.2.1 would report this error, including the _id of the culprit, and just go on. This is not the case: strace on the mongod process shows that it is just sitting there on a wait4 call:

Process 8175 attached - interrupt to quit
wait4(-1,
...

As 1.1.3 doesn't have the newly introduced bsonsize() API call, how can I identify and get rid of those invalid objects in the master's database?



 Comments   
Comment by Eliot Horowitz (Inactive) [ 26/Jan/10 ]

You could upgrade the master to 1.2.1 so you can use bsonsize()
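For illustration only, here is a minimal Python sketch of the kind of size check that would flag an object like the one in the log above. It is not mongod's code and not the bsonsize() shell helper. It relies only on the BSON framing rule that a document starts with a little-endian int32 giving its total length and ends with a NUL byte. The 4 MB limit, the function names (check_object, find_invalid) and the idea of having the raw bytes of each object available alongside an identifier are all assumptions made for the example.

```python
import struct

# Assumption: 4 MB is used here as the maximum object size; adjust to
# whatever limit the server version in use actually enforces.
MAX_BSON_SIZE = 4 * 1024 * 1024


def declared_size(raw: bytes) -> int:
    """Return the little-endian int32 length prefix of a BSON document."""
    if len(raw) < 4:
        raise ValueError("buffer too short to hold a BSON length prefix")
    return struct.unpack_from("<i", raw, 0)[0]


def check_object(raw: bytes, max_size: int = MAX_BSON_SIZE):
    """Return None if the buffer is framed like a BSON document,
    otherwise a short string describing what is wrong."""
    if len(raw) < 5:  # 4-byte length prefix + trailing NUL is the minimum
        return "too short"
    size = declared_size(raw)
    if size < 5 or size > max_size:
        return f"invalid object size: {size}"
    if size != len(raw):
        return f"declared size {size} != actual {len(raw)}"
    if raw[-1] != 0:
        return "missing trailing NUL"
    return None


def find_invalid(docs):
    """docs: iterable of (identifier, raw_bytes) pairs (hypothetical input).
    Returns a list of (identifier, reason) for every suspicious object."""
    bad = []
    for ident, raw in docs:
        reason = check_object(raw)
        if reason is not None:
            bad.append((ident, reason))
    return bad
```

Reporting the identifier next to the reason is the point of the sketch: it is what lets an operator locate and remove the offending object on the master rather than only seeing that some object failed.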

Comment by Erwan Arzur [ 26/Jan/10 ]

I agree with your latest comment, but I wonder how much time it's going to take. Let me explain.

The replica pair showing the issue holds a 230 GB database. When the problem occurs, the cloning process is well under 10% complete, and it takes a few hours just to get there.

Restarting the slave means restarting the cloning process from zero, with a more than probable chance of hitting another corrupted object, which would restart the cloning from scratch yet again.

This is just not practical for us.

Comment by Eliot Horowitz (Inactive) [ 26/Jan/10 ]

No - that's not a bug, that's the correct behavior in our opinion.

Comment by Erwan Arzur [ 26/Jan/10 ]

Thanks Eliot,

Did you fix the problem with the slave stopping cloning the database at the same time?

Erwan

Comment by Eliot Horowitz (Inactive) [ 26/Jan/10 ]

You'll be able to see the object id now.

Generated at Thu Feb 08 02:54:33 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.