[SERVER-11647] segfault during rs.stepdown() while background index build in progress on 2.5.3, in a mixed version cluster with 2.4 Created: 08/Nov/13  Updated: 11/Jul/16  Resolved: 04/Dec/13

Status: Closed
Project: Core Server
Component/s: Index Maintenance, Replication
Affects Version/s: 2.5.3
Fix Version/s: 2.5.5

Type: Bug Priority: Major - P3
Reporter: Michael O'Brien Assignee: Matt Dannenberg
Resolution: Done Votes: 0
Labels: 26qa
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Operating System: ALL
Participants:

 Description   

i'm not sure exactly how this happened, as it only surfaced after a while of aggressively playing with replset stepdown and background index builds with different index types (hashed, text, etc)
in this case i tried triggering multiple hashed index builds at once, in the background.
The 2.5.3 node is the one that crashed, the other nodes were all running 2.4.4.

log output below. sorry for the dodgy formatting, i had to grab it quickly from my tmux session before it went off the scrollback buffer.

m31000| Fri Nov  8 17:10:20.343 [conn597] end connection 127.0.0.1:52894 (5 connections now open)
m31000| Fri Nov  8 17:10:20.343 [initandlisten] connection accepted from 127.0.0.1:52946 #600 (6 connections now open)
rsrs1.getPrimary().getDB("poo").test.dropIndexes()
m31001| Fri Nov  8 17:10:20.466 [conn1] CMD: dropIndexes poo.test
{
	    "nIndexesWas": 1,
		  "msg": "non-_id indexes dropped for collection",
		    "ok": 1
}
>  m31002| Fri Nov  8 17:10:20.472 [repl writer worker 1] CMD: dropIndexes poo.test
m31000| Fri Nov  8 17:10:20.473 [repl writer worker 1] CMD: dropIndexes poo.test
m27000| 2013-11-08T17:10:20.476-0500 [repl writer worker 1] CMD: dropIndexes poo.test
kk m31000| Fri Nov  8 17:10:22.419 [conn598] end connection 127.0.0.1:52895 (5 connections now open)
m31000| Fri Nov  8 17:10:22.419 [initandlisten] connection accepted from 127.0.0.1:52947 #601 (6 connections now open)
m31002| Fri Nov  8 17:10:25.110 [conn553] end connection 127.0.0.1:52897 (3 connections now open)
m31002| Fri Nov  8 17:10:25.110 [initandlisten] connection accepted from 127.0.0.1:52949 #556 (4 connections now open)
m31002| Fri Nov  8 17:10:25.499 [conn554] end connection 127.0.0.1:52900 (3 connections now open)
m31002| Fri Nov  8 17:10:25.500 [initandlisten] connection accepted from 127.0.0.1:52950 #557 (4 connections now open)
rsrs1.getPrimary().getDB("poo").test.ensureIndex({x:"text"},{background:true})
m31001| Fri Nov  8 17:10:26.787 [conn1] build index poo.test { _fts: "text", _ftsx: 1 } background
m31001| Fri Nov  8 17:10:29.000 [conn1]                Background Index Build Progress: 751300/1010000 74%
m31001| Fri Nov  8 17:10:29.773 [conn1] build index done.  scanned 1010000 total records. 2.986 secs
m31001| Fri Nov  8 17:10:29.773 [conn1] insert poo.system.indexes ninserted:1 keyUpdates:0 numYields: 29 locks(micros) w:5928168 2986ms
m31000| Fri Nov  8 17:10:29.774 [repl writer worker 1] info: indexing in foreground on this replica; was a background index build on the primary
m31000| Fri Nov  8 17:10:29.774 [repl writer worker 1] build index poo.test { _fts: "text", _ftsx: 1 }
>  m31002| Fri Nov  8 17:10:29.774 [repl writer worker 1] info: indexing in foreground on this replica; was a background index build on the primary
m31002| Fri Nov  8 17:10:29.774 [repl writer worker 1] build index poo.test { _fts: "text", _ftsx: 1 }
m31002| Fri Nov  8 17:10:31.249 [repl writer worker 1] build index done.  scanned 1010000 total records. 1.475 secs
m31000| Fri Nov  8 17:10:31.253 [repl writer worker 1] build index done.  scanned 1010000 total records. 1.479 secs
m31000| Fri Nov  8 17:10:31.254 [conn580] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1383948437000|1 } } cursorid:22090358664672348 ntoreturn:0 keyUpdates:0 numYields: 1 locks(micros) r:141 nreturned:1 reslen:186 5759ms
m27000| 2013-11-08T17:10:31.254-0500 [repl index builder 20] build index on: poo.test properties: { "v" : 1, "key" : { "_fts" : "text", "_ftsx" : 1 }, "ns" : "poo.test", "name" : "x_text", "background" : true, "weights" : { "x" : 1 }, "default_language" : "english", "language_override" : "language", "textIndexVersion" : 2 }
m31001| Fri Nov  8 17:10:33.688 [conn553] end connection 127.0.0.1:52922 (8 connections now open)
m31001| Fri Nov  8 17:10:33.689 [initandlisten] connection accepted from 127.0.0.1:52952 #556 (9 connections now open)
m27000| 2013-11-08T17:10:34.000-0500 [repl index builder 20]           Background Index Build Progress: 407300/101000040%
m31001| Fri Nov  8 17:10:35.859 [conn554] end connection 127.0.0.1:52927 (8 connections now open)
m31001| Fri Nov  8 17:10:35.859 [initandlisten] connection accepted from 127.0.0.1:52953 #557 (9 connections now open)
m27000| 2013-11-08T17:10:37.000-0500 [repl index builder 20]           Background Index Build Progress: 852500/101000084%
rsrs1.getSecondaries().forEach(function(x){printjson(x.getDB("poo").test.getIndexes())})
[
{
"v": 1,
"key": {
"_id": 1
},
"ns": "poo.test",
"name": "_id_"
},
{
"v": 1,
"key": {
"_fts": "text",
"_ftsx": 1
},
"ns": "poo.test",
"name": "x_text",
"background": true,
"weights": {
"x": 1
},
"default_language": "english",
"language_override": "language",
"textIndexVersion": 1
}
]
[
{
"v": 1,
"key": {
"_id": 1
},
"ns": "poo.test",
"name": "_id_"
},
{
"v": 1,
"key": {
"_fts": "text",
"_ftsx": 1
},
"ns": "poo.test",
"name": "x_text",
"background": true,
"weights": {
"x": 1
},
"default_language": "english",
"language_override": "language",
"textIndexVersion": 1
}
]
[
{
"v": 1,
"key": {
"_id": 1
},
"ns": "poo.test",
"name": "_id_"
},
{
"v": 1,
"key": {
"_fts": "text",
"_ftsx": 1
},
"ns": "poo.test",
"name": "x_text",
"background": true,
"weights": {
"x": 1
},
"default_language": "english",
"language_override": "language",
"textIndexVersion": 2
}
]
>  m27000| 2013-11-08T17:10:38.061-0500 [repl index builder 20] build index done.  scanned 1010000 total records. 6.806 secs
m31001| Fri Nov  8 17:10:39.360 [conn555] end connection 127.0.0.1:52933 (8 connections now open)
m27000| 2013-11-08T17:10:39.360-0500 [conn234] end connection 127.0.0.1:52932 (3 connections now open)
m27000| 2013-11-08T17:10:39.360-0500 [conn233] end connection 127.0.0.1:52934 (3 connections now open)
m27000| 2013-11-08T17:10:39.360-0500 [conn235] end connection 127.0.0.1:52935 (3 connections now open)
m31001| Fri Nov  8 17:10:39.361 [initandlisten] connection accepted from 127.0.0.1:52957 #558 (9 connections now open)
m27000| 2013-11-08T17:10:39.361-0500 [initandlisten] connection accepted from 127.0.0.1:52955 #236 (2 connections now open)
m27000| 2013-11-08T17:10:39.361-0500 [initandlisten] connection accepted from 127.0.0.1:52954 #237 (3 connections now open)
m27000| 2013-11-08T17:10:39.362-0500 [initandlisten] connection accepted from 127.0.0.1:52956 #238 (4 connections now open)
m31000| Fri Nov  8 17:10:41.506 [conn599] end connection 127.0.0.1:52943 (5 connections now open)
m31000| Fri Nov  8 17:10:41.506 [initandlisten] connection accepted from 127.0.0.1:52958 #602 (6 connections now open)
>  m31002| Fri Nov  8 17:10:47.691 [conn555] end connection 127.0.0.1:52945 (3 connections now open)
m31002| Fri Nov  8 17:10:47.692 [initandlisten] connection accepted from 127.0.0.1:52959 #558 (4 connections now open)
m31000| Fri Nov  8 17:10:50.359 [conn600] end connection 127.0.0.1:52946 (5 connections now open)
m31000| Fri Nov  8 17:10:50.360 [initandlisten] connection accepted from 127.0.0.1:52960 #603 (6 connections now open)
m31000| Fri Nov  8 17:10:52.432 [conn601] end connection 127.0.0.1:52947 (5 connections now open)
m31000| Fri Nov  8 17:10:52.432 [initandlisten] connection accepted from 127.0.0.1:52961 #604 (6 connections now open)
m31002| Fri Nov  8 17:10:55.125 [conn556] end connection 127.0.0.1:52949 (3 connections now open)
m31002| Fri Nov  8 17:10:55.126 [initandlisten] connection accepted from 127.0.0.1:52964 #559 (4 connections now open)
m31002| Fri Nov  8 17:10:55.511 [conn557] end connection 127.0.0.1:52950 (3 connections now open)
m31002| Fri Nov  8 17:10:55.511 [initandlisten] connection accepted from 127.0.0.1:52965 #560 (4 connections now open)
m31001| Fri Nov  8 17:11:03.700 [conn556] end connection 127.0.0.1:52952 (8 connections now open)
m31001| Fri Nov  8 17:11:03.700 [initandlisten] connection accepted from 127.0.0.1:53022 #559 (9 connections now open)
m31001| Fri Nov  8 17:11:05.875 [conn557] end connection 127.0.0.1:52953 (8 connections now open)
m31001| Fri Nov  8 17:11:05.876 [initandlisten] connection accepted from 127.0.0.1:53023 #560 (9 connections now open)
m31001| Fri Nov  8 17:11:09.376 [conn558] end connection 127.0.0.1:52957 (8 connections now open)
m27000| 2013-11-08T17:11:09.376-0500 [conn238] end connection 127.0.0.1:52956 (3 connections now open)
m27000| 2013-11-08T17:11:09.376-0500 [conn236] end connection 127.0.0.1:52955 (3 connections now open)
m27000| 2013-11-08T17:11:09.376-0500 [conn237] end connection 127.0.0.1:52954 (3 connections now open)
m31001| Fri Nov  8 17:11:09.376 [initandlisten] connection accepted from 127.0.0.1:53027 #561 (9 connections now open)
m27000| 2013-11-08T17:11:09.376-0500 [initandlisten] connection accepted from 127.0.0.1:53024 #239 (2 connections now open)
m27000| 2013-11-08T17:11:09.377-0500 [initandlisten] connection accepted from 127.0.0.1:53025 #240 (3 connections now open)
m27000| 2013-11-08T17:11:09.377-0500 [initandlisten] connection accepted from 127.0.0.1:53026 #241 (4 connections now open)
m31000| Fri Nov  8 17:11:11.517 [conn602] end connection 127.0.0.1:52958 (5 connections now open)
m31000| Fri Nov  8 17:11:11.517 [initandlisten] connection accepted from 127.0.0.1:53033 #605 (6 connections now open)
m31002| Fri Nov  8 17:11:17.706 [conn558] end connection 127.0.0.1:52959 (3 connections now open)
m31002| Fri Nov  8 17:11:17.706 [initandlisten] connection accepted from 127.0.0.1:53035 #561 (4 connections now open)
m31000| Fri Nov  8 17:11:20.374 [conn603] end connection 127.0.0.1:52960 (5 connections now open)
m31000| Fri Nov  8 17:11:20.374 [initandlisten] connection accepted from 127.0.0.1:53036 #606 (6 connections now open)
m31000| Fri Nov  8 17:11:22.447 [conn604] end connection 127.0.0.1:52961 (5 connections now open)
m31000| Fri Nov  8 17:11:22.447 [initandlisten] connection accepted from 127.0.0.1:53037 #607 (6 connections now open)
m31002| Fri Nov  8 17:11:25.140 [conn559] end connection 127.0.0.1:52964 (3 connections now open)
m31002| Fri Nov  8 17:11:25.140 [initandlisten] connection accepted from 127.0.0.1:53039 #562 (4 connections now open)
rsrs1.getPrimary().getDB("poo").test.createIndex({x:"hashed"},{background:true})
m31001| Fri Nov  8 17:11:25.352 [conn1] build index poo.test { x: "hashed" } background
>  m31002| Fri Nov  8 17:11:25.525 [conn560] end connection 127.0.0.1:52965 (3 connections now open)
m31002| Fri Nov  8 17:11:25.526 [initandlisten] connection accepted from 127.0.0.1:53041 #563 (4 connections now open)
m31001| Fri Nov  8 17:11:26.554 [journal] old journal file will be removed: /data/db/test-1/journal/j._2
m31001| Fri Nov  8 17:11:26.555 [journal] old journal file will be removed: /data/db/test-1/journal/j._3
m31001| Fri Nov  8 17:11:28.000 [conn1]                Background Index Build Progress: 283500/1010000 28%
>  m31001| Fri Nov  8 17:11:31.000 [conn1]              Background Index Build Progress: 500000/1010000 49%
m31001| Fri Nov  8 17:11:33.713 [conn559] end connection 127.0.0.1:53022 (8 connections now open)
m31001| Fri Nov  8 17:11:33.714 [initandlisten] connection accepted from 127.0.0.1:53044 #562 (9 connections now open)
m31001| Fri Nov  8 17:11:34.001 [conn1]                Background Index Build Progress: 681500/1010000 67%
m31001| Fri Nov  8 17:11:35.889 [conn560] end connection 127.0.0.1:53023 (8 connections now open)
m31001| Fri Nov  8 17:11:35.889 [initandlisten] connection accepted from 127.0.0.1:53046 #563 (9 connections now open)
m31001| Fri Nov  8 17:11:37.000 [conn1]                Background Index Build Progress: 847000/1010000 83%
rsrs1.getPrimary().getDB("admin").runCommand({replSetStepDown:null})
m31001| Fri Nov  8 17:11:39.390 [conn561] end connection 127.0.0.1:53027 (8 connections now open)
m27000| 2013-11-08T17:11:39.390-0500 [conn239] end connection 127.0.0.1:53024 (3 connections now open)
m27000| 2013-11-08T17:11:39.390-0500 [conn241] end connection 127.0.0.1:53026 (3 connections now open)
m27000| 2013-11-08T17:11:39.390-0500 [conn240] end connection 127.0.0.1:53025 (3 connections now open)
m27000| 2013-11-08T17:11:39.390-0500 [initandlisten] connection accepted from 127.0.0.1:53048 #242 (2 connections now open)
m31001| Fri Nov  8 17:11:39.390 [initandlisten] connection accepted from 127.0.0.1:53050 #564 (9 connections now open)
m27000| 2013-11-08T17:11:39.390-0500 [initandlisten] connection accepted from 127.0.0.1:53047 #243 (3 connections now open)
m27000| 2013-11-08T17:11:39.390-0500 [initandlisten] connection accepted from 127.0.0.1:53049 #244 (4 connections now open)
m31001| Fri Nov  8 17:11:40.042 [conn1]                Background Index Build Progress: 990500/1010000 98%
m31001| Fri Nov  8 17:11:40.425 [conn1] build index done.  scanned 1010000 total records. 15.072 secs
m31001| Fri Nov  8 17:11:40.426 [conn1] insert poo.system.indexes ninserted:1 keyUpdates:0 numYields: 111 locks(micros) w:24721819 15074ms
m31000| Fri Nov  8 17:11:40.428 [repl writer worker 1] info: indexing in foreground on this replica; was a background index build on the primary
m31002| Fri Nov  8 17:11:40.428 [repl writer worker 1] info: indexing in foreground on this replica; was a background index build on the primary
m31000| Fri Nov  8 17:11:40.431 [repl writer worker 1] build index poo.test { x: "hashed" }
m31002| Fri Nov  8 17:11:40.431 [repl writer worker 1] build index poo.test { x: "hashed" }
{
"closest": NumberLong("1383948629"),
"difference": NumberLong("71"),
"ok": 0,
"errmsg": "no secondaries within 10 seconds of my optime"
}
>  m31000| Fri Nov  8 17:11:41.533 [conn605] end connection 127.0.0.1:53033 (5 connections now open)
m31000| Fri Nov  8 17:11:41.533 [initandlisten] connection accepted from 127.0.0.1:53051 #608 (6 connections now open)
m31000| Fri Nov  8 17:11:46.905 [repl writer worker 1] old journal file will be removed: /data/db/test-0/journal/j._7
m31000| Fri Nov  8 17:11:46.910 [repl writer worker 1] old journal file will be removed: /data/db/test-0/journal/j._8
m31002| Fri Nov  8 17:11:46.921 [repl writer worker 1] build index done.  scanned 1010000 total records. 6.49 secs
m31000| Fri Nov  8 17:11:46.965 [repl writer worker 1] build index done.  scanned 1010000 total records. 6.534 secs
m27000| 2013-11-08T17:11:47.024-0500 [repl index builder 21] build index on: poo.test properties: { "v" : 1, "key" : { "x" : "hashed" }, "ns" : "poo.test", "name" : "x_hashed", "background" : true }
m27000| 2013-11-08T17:11:47.026-0500 [rsBackgroundSync] replSet syncing to: mikes-MacBook-Pro.local:31001
m31000| Fri Nov  8 17:11:47.026 [conn580] end connection 127.0.0.1:52813 (5 connections now open)
m31001| Fri Nov  8 17:11:47.026 [initandlisten] connection accepted from 127.0.0.1:53052 #565 (10 connections now open)
m31001| Fri Nov  8 17:11:47.027 [initandlisten] connection accepted from 127.0.0.1:53053 #566 (11 connections now open)
m27000| Invalid access at address: 0x50 from thread: rsSyncNotifier
m27000| Got signal: 11 (Segmentation fault: 11).
 



 Comments   
Comment by Githook User [ 04/Dec/13 ]

Author:

{u'username': u'dannenberg', u'name': u'Matt Dannenberg', u'email': u'matt.dannenberg@10gen.com'}

Message: SERVER-11647 fix potential seg fault in syncSourceFeedback
Branch: master
https://github.com/mongodb/mongo/commit/1ed9cd9471ff1b3f7abfc8d1afd5696064a88772

Comment by Eric Milkie [ 11/Nov/13 ]

mattd@10gen.com can you determine which null pointer we called a function on?

Comment by Michael O'Brien [ 11/Nov/13 ]

Didn't see any other stacktrace info after "got signal 11". I'm working on trying to reproduce it again.

I had triggered lots of stepdowns and index builds in rapid succession, but this was the first occurrence where "no secondaries within 10 seconds" was logged so perhaps that situation is what led to the crash.

Comment by Eric Milkie [ 09/Nov/13 ]

Was there a stack trace following the Segmentation fault message?
This is unlikely to be related to background indexing; more likely it is a result of the new upstream sync notification mechanism that replaced an oplog query.

Generated at Thu Feb 08 03:26:23 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.