[SERVER-19101] ERROR: writer worker caught exception: :: caused by :: 10287 insertDocument :: caused by :: 10287 btree: key+recloc already in index on Created: 24/Jun/15  Updated: 25/Jun/15  Resolved: 24/Jun/15

Status: Closed
Project: Core Server
Component/s: Admin
Affects Version/s: 2.6.1
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: rujun Assignee: Sam Kleinman (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

Two members of a replica set cannot restart after an unclean shutdown (power loss). We are running version 2.6.1.

First, the primary lost power. After it restarted, we found these logs:

2015-05-30T14:31:54.231+0800 [repl writer worker 1] omm_spc.vcm_lpr_info  caught assertion addKeysToIndex omm_spc.vcm_lpr_info.$src_1_id: ObjectId('556955b9498ebd989bba8da3')
2015-05-30T14:31:54.232+0800 [repl writer worker 1] ERROR: writer worker caught exception:  :: caused by :: 10287 insertDocument :: caused by :: 10287 btree: key+recloc already in index on: { ts: Timestamp 1432966585000|10, h: -2776666315254503545, v: 2, op: "i", ns: "omm_spc.vcm_lpr_info", o: { _id: ObjectId('556955b9498ebd989bba8da3'), ctime: 20150530141625, alarm: "637503ee069311e582bad8490bc947d4004588", cmrid: 14291, dflag: false, lpr: { bright: 1, color: 0, coords: { bottom: 0.983, left: 0.231, right: 0.282, top: 0.962 }, direc: 0, match: 91, pcolor: 6, pnr: "AT664T", ptype: 1 }, src: 2, stime: 20150530141620, vdid: "", vsid: 421 } }
2015-05-30T14:31:54.232+0800 [repl writer worker 1] Fatal Assertion 16360
2015-05-30T14:31:54.232+0800 [repl writer worker 1] 
 
***aborting after fassert() failure

After restarting it 10 times, we found these logs:

2015-05-30T16:34:45.752+0800 [rsBackgroundSync] replSet syncing to: BH-DCMN:8888
2015-05-30T16:34:45.754+0800 [rsBackgroundSync] replset setting syncSourceFeedback to BH-DCMN:8888
2015-05-30T16:34:45.766+0800 [rsSync] replSet still syncing, not yet to minValid optime 556955bb:6
2015-05-30T16:34:45.922+0800 [repl writer worker 1] ERROR: writer worker caught exception:  :: caused by :: 13440 bad offset:0 accessing file: /sre_storage/data1/mongo/omm_spc.0. See http://dochub.mongodb.org/core/data-recovery on: { ts: Timestamp 1432966583000|8, h: 6965683350597498871, v: 2, op: "i", ns: "omm_spc.omm_case_file", o: { _id: ObjectId('556955b7498ee6b560cd1c61'), _fileId: "aceaa504-9867-450b-a9f0-8dcf23d29fd8", _len: 449594, _host: [ { _m_id: "6", _s_inx: 4 } ], _objectType: "RESOURCE_TYPE", fdescr: "", _ctime: new Date(1432966583201), online: false, longi: "117.30741844508651", timestamp: 0, cmrId: 4655, height: 0, offset: "3728:42", vhflag: 0, _acls: [], msStatus: 1, _mtime: new Date(1432966583201), _resourceLastRead: new Date(1432966583201), vendor: 0, fps: 0, width: 0, fcount: 0, _musr: "admin", vusrId: 1, cmrdr: "", srvId: 2, oldSrvId: 0, lati: "31.875971894223834", ftype: 1, _cusr: "admin", _poid: "542af3e6ed501b43fe57e9e2", _utime: new Date(1432966583201), _name: "32_20150530141621_269.jpg" } }
2015-05-30T16:34:45.922+0800 [repl writer worker 1] Fatal Assertion 16360
2015-05-30T16:34:45.922+0800 [repl writer worker 1] 
 
***aborting after fassert() failure

About 30 minutes later, a secondary lost power. After it restarted, we found these logs:

2015-05-30T16:19:49.538+0800 [rsBackgroundSync] replSet syncing to: BH-DCMN:8888
2015-05-30T16:19:49.539+0800 [rsSync] replSet still syncing, not yet to minValid optime 55695d7d:61
2015-05-30T16:19:49.582+0800 [rsBackgroundSync] replset setting syncSourceFeedback to BH-DCMN:8888
2015-05-30T16:19:49.640+0800 [repl writer worker 1] omm_spc.vcm_alarm_files  caught assertion addKeysToIndex omm_spc.vcm_alarm_files.$alarm_1_id: ObjectId('55695d7d498e2fa3545a614f')
2015-05-30T16:19:49.641+0800 [repl writer worker 1] ERROR: writer worker caught exception:  :: caused by :: 10287 insertDocument :: caused by :: 10287 btree: key+recloc already in index on: { ts: Timestamp 1432968573000|97, h: 7186593903883142160, v: 2, op: "i", ns: "omm_spc.vcm_alarm_files", o: { _id: ObjectId('55695d7d498e2fa3545a614f'), imgid: ObjectId('55695d7d498ef202c3e00044'), alarm: "076f9c44069811e582b578d752971959006024", dflag: 0 } }
2015-05-30T16:19:49.641+0800 [repl writer worker 1] Fatal Assertion 16360
2015-05-30T16:19:49.641+0800 [repl writer worker 1] 
 
***aborting after fassert() failure

Finally, we cleaned up the dbpath of the two affected members and copied the data files over from the remaining member, which resolved the issue.
However, this process took three hours, during which our service was down, because we have a large amount of data.
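
Roughly, the manual reseed we performed looked like the sketch below (the source host name and the config file path are placeholders, and we assume the source member's data files are copied while that member is cleanly shut down, or from a consistent snapshot):

# stop the broken member and empty its dbpath
mongod --dbpath /sre_storage/data1/mongo --shutdown
rm -rf /sre_storage/data1/mongo/*
# copy the data files from a healthy, cleanly shut down member
# ("healthy-host" is a placeholder for our other member)
rsync -a healthy-host:/sre_storage/data1/mongo/ /sre_storage/data1/mongo/
# restart the member; it rejoins the set and catches up from the oplog
mongod --config /etc/mongod.conf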
Is there any other way to resolve or avoid this problem more quickly?



 Comments   
Comment by Sam Kleinman (Inactive) [ 25/Jun/15 ]

I just want to make it clear that I'm very sorry you've encountered this issue, and I definitely understand its severity and impact. Without more information it's difficult to determine the exact source of the issue. If you can provide a reproduction script or instructions against the current stable release (i.e. 3.0.x), we can investigate. Issues like this, however, may also be the result of errors at the storage or networking layer, and they are difficult to isolate without the ability to reproduce them, particularly on older versions.

Using additional replica set members, and re-syncing when you encounter a data validity bug, is the correct way to manage issues like this in the future. If you are able to reliably reproduce this error, we'd be eager to revisit this issue and investigate further.
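
As a rough sketch of that resync (the dbpath is taken from your logs; the config file path is an assumption), stopping the member and restarting it with an empty data directory triggers an automatic initial sync from another member of the set:

# stop the affected member and move its data directory aside
mongod --dbpath /sre_storage/data1/mongo --shutdown
mv /sre_storage/data1/mongo /sre_storage/data1/mongo.bad.$(date +%Y%m%d)
mkdir -p /sre_storage/data1/mongo
# restart; with an empty dbpath the member performs an initial sync over the network
mongod --config /etc/mongod.conf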

Sorry again that you encountered this issue.

Regards,
sam

Comment by rujun [ 25/Jun/15 ]

Thank you, Sam.
Is this a bug in MongoDB, and is there a plan to fix it? Once this problem occurs, it is quite serious.

Comment by Sam Kleinman (Inactive) [ 24/Jun/15 ]

It looks as if this issue is caused by an invalid document in the oplog, which could be a result of the power failure or another issue. In general, the most expedient solution in this case is to resync the member as soon as you encounter a data validity issue like this. You may also consider increasing the number of replica set members to increase redundancy and allow the system to remain available during disruptions and maintenance.
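
For example, adding another data-bearing member from the mongo shell looks roughly like the sketch below ("new-member-host:27017" is a placeholder, and we're assuming BH-DCMN:8888 from your logs is the current primary):

# run against the current primary; the new member's host:port is a placeholder
mongo --host BH-DCMN --port 8888 --eval 'rs.add("new-member-host:27017")'
# confirm the new member reaches SECONDARY state
mongo --host BH-DCMN --port 8888 --eval 'printjson(rs.status().members)'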

Additionally, we recommend always running the latest release in your release series (2.6.10 for the 2.6 series).

I hope this helps.

Regards,
sam
