[SERVER-2822] secondary cannot recover after failure Created: 23/Mar/11  Updated: 12/Jul/16  Resolved: 24/Mar/11

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Blocker - P1
Reporter: ofer samocha Assignee: Kristina Chodorow (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

v.1.8.0


Operating System: ALL

 Description   

Two of our 80 mongo servers hung for a while, then the machines were restarted.
Now the secondary cannot repair itself.

Please help.

Wed Mar 23 01:10:39 [initandlisten] Assertion: 10334:Invalid BSONObj size: 1845624949 (0x7500026E) first element: : ?type=115
0x562059 0x4ee45e 0x7840d8 0x64cfad 0x751db3 0x75b387 0x75c101 0x59b1dc 0x59b6f8 0x725981 0x726e04 0x72811d 0x781cda 0x8a7d48 0x8a8c84 0x8a9fcc 0x8aab98 0x8b1b8c 0x2ad8679378a4 0x4e0ff9
/usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x129) [0x562059]
/usr/bin/mongod(_ZNK5mongo7BSONObj14_assertInvalidEv+0x46e) [0x4ee45e]
/usr/bin/mongod(_ZN5mongo11BasicCursor7currentEv+0x68) [0x7840d8]
/usr/bin/mongod(_ZN5mongo14processGetMoreEPKcixRNS_5CurOpEiRb+0x58d) [0x64cfad]
/usr/bin/mongod(_ZN5mongo15receivedGetMoreERNS_10DbResponseERNS_7MessageERNS_5CurOpE+0x1f3) [0x751db3]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_8SockAddrE+0x14f7) [0x75b387]
/usr/bin/mongod(_ZN5mongo14DBDirectClient4callERNS_7MessageES2_bPSs+0x81) [0x75c101]
/usr/bin/mongod(_ZN5mongo14DBClientCursor11requestMoreEv+0x30c) [0x59b1dc]
/usr/bin/mongod(_ZN5mongo14DBClientCursor4moreEv+0x58) [0x59b6f8]
/usr/bin/mongod(_ZN5mongo6Cloner4copyEPKcS2_bbbbNS_5QueryE+0x5c1) [0x725981]
/usr/bin/mongod(_ZN5mongo6Cloner2goEPKcRSsRKSsbbbb+0xdf4) [0x726e04]
/usr/bin/mongod(_ZN5mongo9cloneFromEPKcRSsRKSsbbbb+0x3d) [0x72811d]
/usr/bin/mongod(_ZN5mongo14repairDatabaseESsRSsbb+0x43a) [0x781cda]
/usr/bin/mongod(_ZN5mongo11doDBUpgradeERKSsSsPNS_14DataFileHeaderE+0x68) [0x8a7d48]
/usr/bin/mongod [0x8a8c84]
/usr/bin/mongod(_ZN5mongo14_initAndListenEiPKc+0x41c) [0x8a9fcc]
/usr/bin/mongod(_ZN5mongo13initAndListenEiPKc+0x18) [0x8aab98]
/usr/bin/mongod(main+0x6e8c) [0x8b1b8c]
/lib64/libc.so.6(__libc_start_main+0xf4) [0x2ad8679378a4]
/usr/bin/mongod(__gxx_personality_v0+0x3b1) [0x4e0ff9]
Wed Mar 23 01:10:39 [initandlisten] getmore local.oplog.rs cid:3581083591747888217 getMore: {} exception 10334 Invalid BSONObj size: 1845624949 (0x7500026E) first element: : ?type=115 bytes:20 nreturned:0 75ms
Wed Mar 23 01:10:39 [initandlisten] exception in initAndListen std::exception: getMore: cursor didn't exist on server, possible restart or timeout?, terminating
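
The assertion above is the server's BSON sanity check: every BSON document begins with its total length as a little-endian int32, and a declared length this large (around 1.8 GB whichever way the bytes are read) cannot be a real document, which points at on-disk corruption of local.oplog.rs rather than a bad query. A minimal illustrative sketch in Python of the same check (the 16 MB cap is the documented 1.8 max object size; the server's internal bound may differ):

import struct

MAX_BSON_SIZE = 16 * 1024 * 1024  # documented max object size in 1.8; the
                                  # server's internal sanity bound may differ
MIN_BSON_SIZE = 5                 # 4-byte length prefix + trailing NUL byte

def bson_size_ok(prefix: bytes) -> bool:
    """A BSON document starts with its total length as a little-endian int32;
    a declared length outside [MIN, MAX] means the bytes are not a document."""
    (declared,) = struct.unpack_from("<i", prefix, 0)
    return MIN_BSON_SIZE <= declared <= MAX_BSON_SIZE

# The size from the assertion above, read either way around (the log prints
# a decimal form and a byte-swapped hex form); both are ~1.8-1.9 GB, so the
# oplog bytes cannot be a real document.
for declared in (1845624949, 0x7500026E):
    print(declared, bson_size_ok(struct.pack("<i", declared)))  # -> False, False

Since the stack trace shows the crash inside repairDatabase's clone of the oplog, repair apparently re-reads the same bad bytes on every startup, which is why a resync (discussed in the comments below) is the way out.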



 Comments   
Comment by Eliot Horowitz (Inactive) [ 24/Mar/11 ]

There is no known issue with exhausting memory.

Comment by ofer samocha [ 24/Mar/11 ]

We have Zabbix, but it only shows loadavg increasing until the agent stopped responding.

I'll check and try Munin next time.

Anyway, if there is a known issue with memory exhaustion, I think that was the problem.

Comment by Eliot Horowitz (Inactive) [ 24/Mar/11 ]

Hard to tell without any monitoring or logs.
Do you have Munin-style graphs for mongo stats?

Comment by ofer samocha [ 24/Mar/11 ]

It worked OK.

Is there any known issue with mongod that could have killed the machines?
I found nothing in the logs, but loadavg was above 15 (the last time our Zabbix agent checked) and I couldn't even SSH to those machines, so I assume there was some kind of memory exhaustion.

Comment by ofer samocha [ 24/Mar/11 ]

Will do.

Thanks.

Comment by Eliot Horowitz (Inactive) [ 24/Mar/11 ]

Running with --journal or not is up to you.
There are pros and cons.

Yes, for a resync you just need to wipe the data.
Do you have an arbiter?
If so, it can be done with no downtime.
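
A minimal illustrative sketch, assuming pymongo and placeholder host names (nothing here is from this ticket), of checking that no-downtime condition before taking the broken secondary down:

from pymongo import MongoClient

# Host and port are placeholders; point this at any healthy member.
client = MongoClient("mongodb://healthy-member:27017")
status = client.admin.command("replSetGetStatus")

states = {m["name"]: m["stateStr"] for m in status["members"]}
print(states)  # e.g. {"a:27017": "PRIMARY", "b:27017": "ARBITER", ...}

# Wiping the broken secondary is downtime-free only while a primary is up
# and an arbiter (or another voting member) preserves the election majority.
assert "PRIMARY" in states.values(), "no primary -- wiping a member now means downtime"
assert "ARBITER" in states.values(), "no arbiter -- verify a voting majority remains"

Once that holds, the resync itself is exactly as described: stop mongod on the bad secondary, empty its dbpath, and restart; the member then performs a full initial sync from the primary.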

Comment by ofer samocha [ 24/Mar/11 ]

The mongo process killed the machine until it restarted itself (which killed the mongod process).
It wasn't running with --journal; should it have been?
For a resync, do I only need to delete all the files on the secondary machine and restart the process?

Comment by Eliot Horowitz (Inactive) [ 24/Mar/11 ]

You did a hard reboot?
The data looks fairly corrupt.
Was this running with --journal since it was 1.8.0?
If not, I would recommend a resync.

Comment by ofer samocha [ 23/Mar/11 ]

This bug is in the core server.
