[SERVER-38177] Repair with bind_ip results in a null pointer dereference
Created: 16/Nov/18  Updated: 08/Jan/24  Resolved: 02/Jan/19

Status: Closed
Project: Core Server
Component/s: Networking
Affects Version/s: 4.0.3, 4.1.3
Fix Version/s: 4.0.6, 4.1.7

Type: Bug
Priority: Major - P3
Reporter: Daniel Gottlieb (Inactive)
Assignee: Mira Carey
Resolution: Fixed
Votes: 1
Labels: neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-38714 Standalone replica set nodes with unf... Closed
Duplicate
is duplicated by SERVER-39389 --repair crash mongod when repair wor... Closed
Problem/Incident
is caused by SERVER-28990 when started with --repair mongod sho... Closed
Related
is related to SERVER-38683 Restarting mongod with --repair and u... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v4.0
Steps To Reproduce:

./mongod --repair --bind_ip 0.0.0.0

Sprint: Service Arch 2018-12-17, Service Arch 2018-12-31, Service Arch 2019-01-14
Participants:

 Description   

The null pointer dereference: https://github.com/mongodb/mongo/blob/dc9e1ee045af74c74360ffce2bec88868b08e1dc/src/mongo/executor/network_interface_tl.cpp#L133-L134

Also consider changing this test to examine the error code on a repair run:
https://github.com/mongodb/mongo/blob/abb1b353648260175c3dfe02ac8ae54c083956f7/jstests/multiVersion/downgrade_to_36_only_with_recovered_data.js#L72-L76

This patch would have surfaced the problem sooner, though I don't think anyone would reasonably pass in --bind_ip with --repair:

diff --git a/jstests/multiVersion/downgrade_to_36_only_with_recovered_data.js b/jstests/multiVersion/downgrade_to_36_only_with_recovered_data.js
index 87e816f629..99fadbd7ab 100644
--- a/jstests/multiVersion/downgrade_to_36_only_with_recovered_data.js
+++ b/jstests/multiVersion/downgrade_to_36_only_with_recovered_data.js
@@ -25,6 +25,9 @@
     const name = "rs";
     let conn = MongoRunner.runMongod(
         {binVersion: "latest", shardsvr: "", replSet: name, syncdelay: "600"});
+    let dbpath = conn.dbpath;
+    let port = conn.port;
+    printjson({"DBPath": dbpath, "Port": port});
 
     assert.neq(conn, null, "mongod was unable to start up");
 
@@ -71,10 +74,17 @@
     MongoRunner.stopMongod(conn);
 
     jsTestLog("Running repair. The process will automatically exit when complete.");
-    options.shardsvr = undefined;
-    options.replSet = undefined;
-    options.repair = '';
-    MongoRunner.runMongod(options);
+    // options.shardsvr = undefined;
+    // options.replSet = undefined;
+    // options.repair = '';
+    // MongoRunner.runMongod(options);
+
+    // Fake a port because `runMongoProgram` thinks that is required.
+    // --bind_ip 0.0.0.0 is necessary to reproduce crash
+    assert.eq(
+        0,
+        runMongoProgram(
+            "mongod", "--repair", "--dbpath", dbpath, "--port", port, "--bind_ip", "0.0.0.0"));
 
     jsTestLog("Restarting after repair with replication.");
     options.shardsvr = '';
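
The exit-code check in the patch above can be factored into a small helper. This is a sketch only: `assertRepairExitsCleanly` is a hypothetical name, and the `runProgram` parameter stands in for the shell's `runMongoProgram` so the logic can be exercised outside the mongo shell. It assumes, as the patch does, that the runner returns the process exit code.

```javascript
// Hypothetical helper, not part of the jstest library.
// `runProgram` is assumed to behave like the shell's runMongoProgram:
// it takes the program name followed by its arguments and returns the exit code.
function assertRepairExitsCleanly(runProgram, dbpath, port, extraArgs = []) {
    const args = ["mongod", "--repair", "--dbpath", dbpath, "--port", port, ...extraArgs];
    const exitCode = runProgram(...args);
    if (exitCode !== 0) {
        // Surface the failing command line so a crash during repair is not silently ignored.
        throw new Error("mongod --repair exited with code " + exitCode + ": " + args.join(" "));
    }
    return args;
}
```

In a jstest this might be called as `assertRepairExitsCleanly(runMongoProgram, dbpath, port, ["--bind_ip", "0.0.0.0"])`.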



 Comments   
Comment by Githook User [ 15/Jan/19 ]

Author:

{'username': 'hanumantmk', 'email': 'jcarey@argv.me', 'name': 'Jason Carey'}

Message: SERVER-38177 Fix --repair with --bind_ip

Setting bind ips in server global params causes an error when spinning
up an egress only transport layer. It's more appropriate in that case
to ignore the bind ips.

(cherry picked from commit 94f6c4d2832e4ec88b30045ceb1907af54725c78)
Branch: v4.0
https://github.com/mongodb/mongo/commit/04c72f98a31ea1824acbacb1accb6a1672e6a2db
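
To illustrate the idea in the commit message, here is a toy model in plain JavaScript. It is not the actual C++ change. All names (`resolveListenAddresses`, `mode`, `bindIps`) are hypothetical, and the `127.0.0.1` fallback is an assumption made for the example. The only behavior taken from the ticket is that an egress-only transport layer should ignore configured bind IPs rather than fail on them.

```javascript
// Toy model only. The real fix lives in the server's C++ transport layer code.
// mode: "ingress" (accepts connections) or "egressOnly" (outbound connections only).
function resolveListenAddresses(mode, bindIps) {
    if (mode === "egressOnly") {
        // Nothing listens in this mode, so any configured bind IPs are dropped.
        return [];
    }
    if (mode !== "ingress") {
        throw new Error("unknown mode: " + mode);
    }
    // Ingress mode: use the configured addresses, or an assumed localhost default.
    return bindIps.length > 0 ? bindIps.slice() : ["127.0.0.1"];
}
```

Under this model, `--repair --bind_ip 0.0.0.0` would correspond to `resolveListenAddresses("egressOnly", ["0.0.0.0"])`, which yields an empty list instead of an error.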

Comment by Githook User [ 02/Jan/19 ]

Author:

{'username': 'hanumantmk', 'email': 'jcarey@argv.me', 'name': 'Jason Carey'}

Message: SERVER-38177 Fix --repair with --bind_ip

Setting bind ips in server global params causes an error when spinning
up an egress only transport layer. It's more appropriate in that case
to ignore the bind ips.
Branch: master
https://github.com/mongodb/mongo/commit/94f6c4d2832e4ec88b30045ceb1907af54725c78

Comment by Mira Carey [ 19/Dec/18 ]

gregory.wlodarek, it's not a particularly difficult fix. The error should be better, but actually fixing the behavior is just a one-liner.

Should be able to get something out in the next couple of days.

Comment by Gregory Wlodarek [ 19/Dec/18 ]

mira.carey@mongodb.com, I've separated the repair work that depends on this issue into a different ticket, which isn't urgent, but it would be nice to have this unblocked in the next couple of sprints. Thanks!

Comment by Gregory Wlodarek [ 19/Dec/18 ]

mira.carey@mongodb.com, this issue is currently blocking project work to modify --repair behavior and add some new jstest testing coverage for standalones (SERVER-38351 & SERVER-37637). Do you have any idea when this might be fixed, and whether it's difficult or simple? It would be great to fast-track it a little, if feasible.

Comment by Dianna Hohensee (Inactive) [ 18/Dec/18 ]

Cool. I just wanted to make sure that the connection was noted somewhere so it would be explained eventually, which you just did. Linking doesn't notify anyone, so it's very subtle and sometimes verification is missed as a result.

Comment by Daniel Gottlieb (Inactive) [ 18/Dec/18 ]

dianna.hohensee that's not quite the case. I believe the crash only happened because the repro also used --bind_ip when it launched mongod. I don't think the in-progress index builds had any effect on the outcome.

Logs from running the SERVER-38683 repro including --bind_ip (via MongoRunner):

[js_test:repro] 2018-12-18T15:31:13.065-0500 2018-12-18T15:31:13.065-0500 I -        [js] shell: started program (sh15459):  /home/dgottlieb/xgen/mongo/mongod <snip> --repair --port 20021 --bind_ip 0.0.0.0 <snip>
<snip>
[js_test:repro] 2018-12-18T15:31:14.702-0500 d20021| 2018-12-18T15:31:14.702-0500 W ASIO     [initandlisten] No TransportLayer configured during NetworkInterface startup
[js_test:repro] 2018-12-18T15:31:14.702-0500 d20021| 2018-12-18T15:31:14.702-0500 F -        [initandlisten] Invalid access at address: 0
[js_test:repro] 2018-12-18T15:31:14.760-0500 d20021| 2018-12-18T15:31:14.760-0500 F -        [initandlisten] Got signal: 11 (Segmentation fault).
[js_test:repro] 2018-12-18T15:31:14.760-0500 d20021|  0x561bc6b682d3 0x561bc6b67da5 0x561bc6b677b1 0x7f36dcbf5390 0x561bc5e63325 0x561bc5e32c60 0x561bc5e322ee 0x561bc5e32e89 0x561bc40d79ef 0x561bc40d0433 0x561bc48e4676 0x561bc48e4403 0x561bc60c27a8 0x561bc60c2593 0x561bc5ad2e2b 0x561bc4cfce5b 0x561bc477da99 0x561bc477aab7 0x561bc40c92c9 0x561bc40c6be3 0x561bc40c4f8f 0x561bc40c4aaa 0x7f36dc83a830 0x561bc40c4839
[js_test:repro] 2018-12-18T15:31:14.760-0500 d20021| ----- BEGIN BACKTRACE -----
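
For anyone triaging similar runs, here is a small sketch in plain JavaScript that pulls the fatal (`F`) lines and the signal number out of log text. The function name `findFatalLines` is hypothetical, and the regular expression assumes the line layout shown in the excerpt above (timestamp, severity, component, bracketed thread name, message). Other log formats would need a different pattern.

```javascript
// Hypothetical triage helper; assumes the log layout seen in this ticket's excerpt.
function findFatalLines(logText) {
    // timestamp, then severity "F", then component, then [thread], then the message
    const fatalRe = /\d{4}-\d{2}-\d{2}T\S+ F (\S+)\s+\[([^\]]+)\] (.*)$/;
    const fatal = [];
    let signal = null;
    for (const line of logText.split("\n")) {
        const m = fatalRe.exec(line);
        if (!m) {
            continue;  // not a fatal-severity line
        }
        fatal.push({component: m[1], thread: m[2], message: m[3]});
        const sig = /Got signal: (\d+)/.exec(m[3]);
        if (sig) {
            signal = Number(sig[1]);
        }
    }
    return {fatal, signal};
}
```

The function returns the fatal entries in order of appearance, plus the last signal number it saw, or `null` if none was logged.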

Comment by Dianna Hohensee (Inactive) [ 18/Dec/18 ]

This was repro'ed in SERVER-38683, too: --repair on a node with any in-progress index builds causes a crash.

Comment by Spencer Jackson [ 19/Nov/18 ]

We took a look at this, and the crash seems to be occurring in the transport layer.
