[SERVER-10939] Unable to start Secondary due to error: Assertion: 10320:BSONElement: bad type 101 Created: 27/Sep/13 Updated: 15/Nov/21 Resolved: 06/Nov/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.2.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker - P1 |
| Reporter: | Appnique OMS | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL | ||||
| Comments |
| Comment by Appnique OMS [ 01/Oct/13 ] |
|
Hi Mike, Is there any way to repair the database without doubling the data size on disk? What operations does repair perform internally? |
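For context on the disk-space question: elsewhere in this thread it is noted that repair needs free disk space at least equal to the size of the database being repaired. The following is a minimal Python sketch of a pre-flight check for that condition. The function names `dir_size` and `has_room_for_repair` and the `scratch_path` parameter are hypothetical helpers for illustration, not part of MongoDB, and the on-disk size of the dbpath is only an approximation of what repair will need.

```python
import os
import shutil


def dir_size(path):
    """Total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def has_room_for_repair(dbpath, scratch_path=None):
    """Compare free space with the current size of the data files.

    scratch_path is the volume where the repaired copy would be written;
    it defaults to the dbpath volume (an assumption of this sketch).
    Returns (enough_room, bytes_needed, bytes_free).
    """
    needed = dir_size(dbpath)
    free = shutil.disk_usage(scratch_path or dbpath).free
    return free >= needed, needed, free
```

Running `has_room_for_repair` against the dbpath before starting a repair gives a quick yes/no answer plus the two byte counts behind it.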
| Comment by Appnique OMS [ 30/Sep/13 ] |
|
Yes Mike, running behind means that the last oplog entry they applied is behind that of the problematic secondary. |
| Comment by Michael Grundy [ 30/Sep/13 ] |
|
When you say "running behind" do you mean that the last oplog entry they applied is behind that secondary, or that you have a chained replication setup where those other secondaries are replicating from the problem secondary? |
| Comment by Appnique OMS [ 30/Sep/13 ] |
|
But Mike, the other secondaries are running behind the problematic one... |
| Comment by Michael Grundy [ 30/Sep/13 ] |
|
If the other secondaries are fine, I would lean towards resyncing the problem secondary. There is no specific bug in 2.2.2 that has been found that matches your issue. There have been similar occurrences, caused by a faulty driver or more frequently by disk / network based corruption. A driver issue would have affected the other secondaries when they applied the changes that caused the problem, hence my inclination towards doing a resync of the problematic one. Mike |
| Comment by Appnique OMS [ 29/Sep/13 ] |
|
Hi Mike - I validated the collection as well, and it is fine. I am a bit confused about where the actual problem is, i.e. in the primary or in the secondary. The rest of the secondaries are behind the secondary that has the problem, and they are working fine. About the application driver: the back end uses Moped as the MongoDB driver (Ruby). The application itself uses Mongoid, which is an ODM layer whose only supported driver is Moped. Is there any bug in MongoDB version 2.2.2 regarding this? Ashraf. |
| Comment by Michael Grundy [ 27/Sep/13 ] |
|
Hi Ashraf - Good that the oplog validated OK. Did you also run it on the collection? At this stage, the quickest way forward may be to re-sync the node from the Primary. This is only an option if you have enough other nodes available to keep a majority up. Are all Secondary nodes of the Replica Set impacted in the same way? If you don't have enough nodes for a majority, you might have to add an arbiter, but given that this secondary can't start, this probably doesn't apply. Directions for resyncing a stale member are at http://docs.mongodb.org/manual/tutorial/resync-replica-set-member/#replica-set-auto-resync-stale-member ; if you have space, move the database files out of your dbpath instead of deleting them. If the resync is unable to move forward, then repairing the database on the primary is your next step, followed by resyncing the secondaries. A repair will fix most issues, and all of the indexes will be rebuilt. Please make backup copies when possible before proceeding. I'm concerned about where the invalid BSON objects came from. What driver and version is your application using? Prior to this problem, did you have any issues with crashes or network failures? Are you running with journalling turned off? 2.2.2 can be fine for some installations and problematic for others depending on config and usage patterns. The full list of changes available after 2.2.2 can be reviewed here Thanks! |
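The majority condition mentioned above is simple arithmetic: a replica set needs more than half of its voting members reachable to keep a primary. The sketch below illustrates that check in Python. The function names and the member counts in the usage note are hypothetical and are not taken from this ticket.

```python
def majority(voting_members):
    """Smallest number of voting members that is more than half the set."""
    return voting_members // 2 + 1


def can_take_offline(voting_members, healthy_members, taking_down=1):
    """True if a majority is still up after taking members offline.

    healthy_members counts members that are currently up, including the
    ones about to be taken down for a resync.
    """
    return healthy_members - taking_down >= majority(voting_members)
```

For example, in a three-member set with all members healthy, one member can be taken down for a resync (`can_take_offline(3, 3)` is true). If one member is already down, taking a second one offline loses the majority (`can_take_offline(3, 2)` is false), which is the situation where adding an arbiter could help.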
| Comment by Appnique OMS [ 27/Sep/13 ] |
|
Hi Mike, regarding possible corruption in the oplog: I validated the oplog and it was fine. Will repairing the database definitely solve this error? I am running version 2.2.2; is that fine? Thanks |
| Comment by Michael Grundy [ 27/Sep/13 ] |
|
Hi Ashraf - You're getting bad types and bad size (SERVER-10938) because invalid BSON is being inserted or something is corrupting the BSON objects. As a first step, I recommend restarting all instances with --objcheck; this will throw an error any time an invalid object is inserted on the primary. There may be corruption in your oplog or in the collection; validating both may help isolate the issue (validate docs are here). Please be aware that validate does have a performance overhead, so run it during off-peak hours or in a maintenance window. Repairing the database may get you past this, but be aware that it will require free space on the disk at least equal to the size of the database being repaired. Finally, you should upgrade to the latest version in the 2.2 release line, currently at 2.2.6. There are significant fixes available since 2.2.2. Thanks |
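To illustrate what a "bad type" means here: every BSON element starts with a one-byte type code, and 101 (0x65) is not one of the codes defined by the public BSON specification. The Python sketch below walks a raw BSON document and raises on the first unknown type byte or inconsistent size. It is a simplified illustration under stated assumptions, not the validation code that mongod or --objcheck actually runs. The function names are hypothetical, and the type table follows the current BSON spec, which includes types added after the 2.2 series.

```python
import struct

# Fixed-size value lengths keyed by BSON element type byte.
FIXED_SIZES = {0x01: 8, 0x06: 0, 0x07: 12, 0x08: 1, 0x09: 8, 0x0A: 0,
               0x10: 4, 0x11: 8, 0x12: 8, 0x13: 16, 0x7F: 0, 0xFF: 0}
STRING_LIKE = {0x02, 0x0D, 0x0E}  # int32 length, then that many bytes
NESTED_DOCS = {0x03, 0x04}        # embedded document or array


def read_int32(buf, pos):
    """Read a little-endian signed 32-bit integer at pos."""
    return struct.unpack_from("<i", buf, pos)[0]


def value_size(buf, pos, type_byte):
    """Size in bytes of the value starting at pos; raises on unknown types."""
    if type_byte in FIXED_SIZES:
        return FIXED_SIZES[type_byte]
    if type_byte in STRING_LIKE:
        return 4 + read_int32(buf, pos)
    if type_byte in NESTED_DOCS:
        size = read_int32(buf, pos)
        check_document(buf[pos:pos + size])  # recurse into the sub-document
        return size
    if type_byte == 0x0F:  # code with scope: length prefix covers the value
        return read_int32(buf, pos)
    if type_byte == 0x05:  # binary: int32 length, subtype byte, payload
        return 4 + 1 + read_int32(buf, pos)
    if type_byte == 0x0C:  # DBPointer: string followed by a 12-byte id
        return 4 + read_int32(buf, pos) + 12
    if type_byte == 0x0B:  # regex: two NUL-terminated strings
        pattern_end = buf.index(b"\x00", pos)
        options_end = buf.index(b"\x00", pattern_end + 1)
        return options_end + 1 - pos
    raise ValueError(f"bad type {type_byte}")


def check_document(buf):
    """Raise ValueError if buf is not a structurally valid BSON document."""
    if len(buf) < 5:
        raise ValueError("bad size: document too short")
    size = read_int32(buf, 0)
    if size != len(buf) or buf[-1] != 0:
        raise ValueError(f"bad size {size}")
    pos = 4
    while pos < size - 1:
        type_byte = buf[pos]
        name_end = buf.index(b"\x00", pos + 1)  # end of the element name
        value_start = name_end + 1
        next_pos = value_start + value_size(buf, value_start, type_byte)
        if next_pos <= pos:
            raise ValueError("bad size: negative length in element")
        pos = next_pos
    if pos != size - 1:
        raise ValueError("bad size: elements overrun the document")
```

Calling `check_document` on the raw bytes of a document returns nothing when the structure is sound and raises `ValueError` otherwise. A single flipped byte in a type position is enough to trigger the error, which is consistent with the disk or network corruption scenarios discussed in this thread.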
| Comment by Appnique OMS [ 27/Sep/13 ] |
|
Hi Mike, Here are the log entries: , uc: 160 } }, o: { $set: { until: new Date(1380235717000) } } } ***aborting after fassert() failure Fri Sep 27 09:26:10 Got signal: 6 (Aborted). Fri Sep 27 09:26:10 Backtrace: Thanks, |
| Comment by Michael Grundy [ 27/Sep/13 ] |
|
Hi Ashraf - Could you provide the log entries from the secondary that contain the assertion and the actions around it? (complete log is ok). Thanks |