[SERVER-12413] Assertion on config servers Created: 20/Jan/14 Updated: 29/Sep/15 Resolved: 29/Sep/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.4.8 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | igor lasic | Assignee: | Bruce Lucas (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
centos, vmware, nexgen |
||
| Attachments: |
|
| Operating System: | ALL |
| Steps To Reproduce: | Had major SAN outages, and this started showing. |
| Participants: |
| Description |
|
Mon Jan 20 16:35:48.110 [conn2699] update config.mongos query: { _id: "render-mu04.colo:27017" }update: { $set: { ping: new Date(1390253748104), up: 111620, waiting: true, mongoVersion: "2.4.8" } } idhack:1 fastmod:1 keyUpdates:0 exception: assertion src/mongo/db/pdfile.cpp:1816 locks(micros) w:17346 8ms update: { $set: { ping: new Date(1390253748288), up: 111618, waiting: false, mongoVersion: "2.4.8" } } idhack:1 fastmod:1 keyUpdates:0 exception: assertion src/mongo/db/pdfile.cpp:1816 locks(micros) w:17495 8ms update: { $set: { ping: new Date(1390253748298), up: 111618, waiting: true, mongoVersion: "2.4.8" } } idhack:1 fastmod:1 keyUpdates:0 exception: assertion src/mongo/db/pdfile.cpp:1816 locks(micros) w:17230 8ms update: { $set: { ping: new Date(1390253748931) } } nscanned:1 keyUpdates:1 exception: assertion src/mongo/db/pdfile.cpp:1816 locks(micros) w:16427 8ms |
| Comments |
| Comment by Bruce Lucas (Inactive) [ 21/Jan/14 ] | |||||||
|
Hi Igor, Glad to hear things are working now. Very happy to have been of assistance. Bruce | |||||||
| Comment by igor lasic [ 21/Jan/14 ] | |||||||
|
Copied the surviving configuration around. Restarted. So far so good. Closing. Thank you for your help. | |||||||
| Comment by Bruce Lucas (Inactive) [ 21/Jan/14 ] | |||||||
|
Yes, just stop that config server (to make sure the database files are static), copy its data, and replicate it to the other two config servers. Bruce | |||||||
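The copy step described above can be sketched roughly as follows. This is only an illustration, assuming the source config server has already been stopped and that both directories are reachable as local paths; the function names and paths are hypothetical, and in practice the copy would usually go between hosts with a tool such as scp or rsync. The checksum comparison is an extra precaution added here, not part of the advice in the ticket.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of one file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root: Path) -> dict:
    """Map each file's path (relative to root) to its digest."""
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def copy_and_verify(src: Path, dst: Path) -> bool:
    """Copy a stopped server's data directory and confirm the copy matches.

    Assumes dst does not exist yet (shutil.copytree refuses to overwrite).
    """
    shutil.copytree(src, dst)
    return tree_digests(src) == tree_digests(dst)
```

Stopping the source server first matters because the digests are only meaningful if the files do not change while they are being copied and compared.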
| Comment by igor lasic [ 21/Jan/14 ] | |||||||
|
Definitely SAN related. I am following the config server restore instructions. Should I copy the data of the surviving one around or is there a different | |||||||
| Comment by Bruce Lucas (Inactive) [ 21/Jan/14 ] | |||||||
|
Hi Igor, Thanks for uploading the log. It looks like there is corruption in one of the files of the local database on the config server. I think it's reasonable to assume that it was caused by a storage error related to the SAN outage; what was the timing of that? The first sign of trouble in the log is the following entry at 12:50:
However, because the local.oplog.$main collection is a circular capped collection, it is possible that the corruption occurred some time earlier and was only seen by mongod when the collection wrapped around to the corrupted region. To recover from this you can reinitialize that config server, after ensuring that the storage is working normally, following the same procedure as for replacing a config server. Before doing that you should be sure to run any relevant hardware diagnostics and fsck the filesystem. If you would like us to investigate and look for more definitive evidence that this corruption was caused by the SAN outage, before recovering that config server please take the following steps:
If the resulting two files are less than 150MB and you are comfortable with them being publicly visible you can attach them to this ticket; if they are too large or you would like to keep them private we can provide a secure private location for you to upload them. Thanks, | |||||||
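The wrap-around reasoning in the comment above can be illustrated with a toy model. This is a simplified sketch of a fixed-size circular log, not how mongod actually lays out capped collections on disk; the class `CappedLog`, its `corrupt` method, and `writes_until_failure` are hypothetical names invented for the illustration.

```python
class CappedLog:
    """Fixed-size circular log: each new entry overwrites the oldest slot."""

    def __init__(self, capacity: int):
        self.slots = [None] * capacity
        self.next_slot = 0
        self.bad_slots = set()  # slots silently damaged by the storage layer

    def corrupt(self, slot: int) -> None:
        """Simulate storage damage; nothing is detected at this point."""
        self.bad_slots.add(slot)

    def append(self, entry) -> int:
        """Write into the next slot; damage surfaces only when it is touched."""
        slot = self.next_slot
        if slot in self.bad_slots:
            raise RuntimeError(f"assertion: slot {slot} is corrupt")
        self.slots[slot] = entry
        self.next_slot = (slot + 1) % len(self.slots)
        return slot


def writes_until_failure(log: CappedLog, max_writes: int):
    """Count how many writes succeed before the damaged slot is reached."""
    for i in range(max_writes):
        try:
            log.append(i)
        except RuntimeError:
            return i
    return None


log = CappedLog(capacity=8)
for entry in ("a", "b", "c"):
    log.append(entry)          # fills slots 0..2; the next write goes to slot 3
log.corrupt(1)                 # damage happens now, behind the write position
print(writes_until_failure(log, 100))  # 6 writes succeed before slot 1 is reached
```

In this model the delay between the damage and the first assertion depends only on how far the write position is from the damaged slot, which is why the time of the first logged error does not by itself date the corruption.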
| Comment by igor lasic [ 21/Jan/14 ] | |||||||
|
Log from one of the config servers. Errors start around Jan 19 11:00. First error below. Sun Jan 19 11:17:21.750 [conn60] end connection 10.84.150.52:43273 (27 connections now open) update: { $set: { ping: new Date(1390148992191), up: 6901, waiting: false, mongoVersion: "2.4.8" } } idhack:1 nupdated:1 fastmod:1 keyUpdates:0 locks(micros) w:958524 479ms | |||||||
| Comment by Daniel Pasette (Inactive) [ 21/Jan/14 ] | |||||||
|
Can you upload the log files from the config server that is showing the exception, starting from just before the time of the SAN outage? | |||||||
| Comment by igor lasic [ 20/Jan/14 ] | |||||||
|
The previous log was from the config servers. This is what mongos says: Mon Jan 20 16:57:29.728 [Balancer] SyncClusterConnection connecting to [mongo-c02:27019] update: { $set: { ping: new Date(1390255073928) }} gle1: { err: "!loc.isNull()", n: 0, connectionId: 101, waited: 30, ok: 1.0 }gle2: { updatedExisting: true, n: 1, lastOp: Timestamp 1390255074000|1, connectionId: 100, waited: 29, err: null, ok: 1.0 } Mon Jan 20 16:57:59.778 [Balancer] SyncClusterConnection connecting to [mongo-c01:27019] update: { $set: { ping: new Date(1390255053085) }} gle1: { err: "!loc.isNull()", n: 0, connectionId: 81, waited: 28, ok: 1.0 }gle2: { updatedExisting: true, n: 1, lastOp: Timestamp 1390255053000|1, connectionId: 80, waited: 2, err: null, ok: 1.0 } Mon Jan 20 16:57:37.616 [Balancer] SyncClusterConnection connecting to [mongo-c01:27019] update: { $set: { ping: new Date(1390255083205) }} gle1: { err: "!loc.isNull()", n: 0, connectionId: 102, waited: 33, ok: 1.0 }gle2: { updatedExisting: true, n: 1, lastOp: Timestamp 1390255083000|1, connectionId: 101, waited: 27, err: null, ok: 1.0 }render-mu04.colo: update: { $set: { ping: new Date(1390255041851) }} gle1: { err: "!loc.isNull()", n: 0, connectionId: 79, waited: 16, ok: 1.0 }gle2: { updatedExisting: true, n: 1, lastOp: Timestamp 1390255041000|1, connectionId: 78, waited: 16, err: null, ok: 1.0 } Mon Jan 20 16:57:26.336 [Balancer] SyncClusterConnection connecting to [mongo-c01:27019] update: { $set: { ping: new Date(1390255071970) }} gle1: { err: "!loc.isNull()", n: 0, connectionId: 99, waited: 34, ok: 1.0 }gle2: { updatedExisting: true, n: 1, lastOp: Timestamp 1390255072000|1, connectionId: 98, waited: 25, err: null, ok: 1.0 } Mon Jan 20 16:57:56.392 [Balancer] SyncClusterConnection connecting to [mongo-c01:27019] update: { $set: { ping: new Date(1390255044423) }} gle1: { err: "!loc.isNull()", n: 0, connectionId: 80, waited: 10, ok: 1.0 }gle2: { updatedExisting: true, n: 1, lastOp: Timestamp 1390255044000|1, connectionId: 79, waited: 
19, err: null, ok: 1.0 } Mon Jan 20 16:57:26.753 [Balancer] SyncClusterConnection connecting to [mongo-c01:27019] |