[SERVER-4926] segfault in mongo shell
Created: 10/Feb/12  Updated: 15/Aug/12  Resolved: 04/Mar/12

Status: Closed
Project: Core Server
Component/s: Shell
Affects Version/s: 2.0.2
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Zac Witte
Assignee: Tad Marshall
Resolution: Duplicate
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Ubuntu 11.04

Issue Links:
Duplicate
duplicates SERVER-2986 Mongo CLI responds poorly to the use ... Closed

 Description   

I issued a query in a sharded environment, hit Ctrl+C in the shell and got this strange error message about "unknown shell/collection.js". Hit Ctrl+C again and got this segfault.

mongos> db.hourly_stats.findOne({log_file_name:/^2012-02-02/, $and: [{sub_key:null}, {sub_key:{$exists:true}}], publish:{$gt:0}})
^CFri Feb 10 03:08:13 Error: error doing query: unknown shell/collection.js:151
^CFri Feb 10 03:10:06 ERROR: MessagingPort::call() wrong id got:1b4 expect:1b6
  toSend op: 2004
  response msgid:269662
  response len:  203
  response op:  1
  remote: 127.0.0.1:27017
Fri Feb 10 03:10:06   Assertion failure false util/net/message_port.cpp 245
0x4a2196 0x4a4bfb 0x5ec8b2 0x5ecae4 0x4c15f4 0x4e515d 0x4b960c 0x4cffa1 0x4c1883 0x4c2392 0x4d144a 0x471dd9 0x475232 0x4764c6 0x7f93b9b8f30d 0x46efc9
 bin/mongo(_ZN5mongo12sayDbContextEPKc+0x96) [0x4a2196]
 bin/mongo(_ZN5mongo8assertedEPKcS1_j+0xfb) [0x4a4bfb]
 bin/mongo(_ZN5mongo13MessagingPort4recvERKNS_7MessageERS1_+0x272) [0x5ec8b2]
 bin/mongo(_ZN5mongo13MessagingPort4callERNS_7MessageES2_+0x34) [0x5ecae4]
 bin/mongo(_ZN5mongo18DBClientConnection4callERNS_7MessageES2_bPSs+0x34) [0x4c15f4]
 bin/mongo(_ZN5mongo14DBClientCursor4initEv+0xad) [0x4e515d]
 bin/mongo(_ZN5mongo12DBClientBase5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii+0x3ac) [0x4b960c]
 bin/mongo(_ZN5mongo18DBClientConnection5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii+0xa1) [0x4cffa1]
 bin/mongo(_ZN5mongo17DBClientInterface5findNERSt6vectorINS_7BSONObjESaIS2_EERKSsNS_5QueryEiiPKS2_i+0xa3) [0x4c1883]
 bin/mongo(_ZN5mongo17DBClientInterface7findOneERKSsRKNS_5QueryEPKNS_7BSONObjEi+0x72) [0x4c2392]
 bin/mongo(_ZN5mongo18DBClientConnection10runCommandERKSsRKNS_7BSONObjERS3_i+0x7a) [0x4d144a]
 bin/mongo(_Z21sayReplSetMemberStatev+0x399) [0x471dd9]
 bin/mongo(_Z5_mainiPPc+0x26c2) [0x475232]
 bin/mongo(main+0x26) [0x4764c6]
 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f93b9b8f30d]
 bin/mongo(_ZNSt15basic_streambufIcSt11char_traitsIcEE6xsputnEPKcl+0x59) [0x46efc9]
> exit
bye



 Comments   
Comment by Tad Marshall [ 03/Mar/12 ]

@Zac,

Are you able to answer the questions I asked in the final paragraph of my earlier response? It would be helpful in reproducing this problem, which in turn would help us fix it.

Thanks!

Tad Marshall

Comment by Tad Marshall [ 10/Feb/12 ]

The message "error doing query: unknown shell/collection.js:151" consists of an error and a location: "error doing query: unknown" is the error and "shell/collection.js:151" is the location. shell/collection.js is a JavaScript helper file compiled into the shell as text, and 151 is the line number in that source file. Here is the relevant part of shell/collection.js:

149 DBCollection.prototype.findOne = function( query , fields ){
150     var cursor = this._mongo.find( this._fullName , this._massageObject( query ) || {} , fields , 
151         -1 /* limit */ , 0 /* skip*/, 0 /* batchSize */ , 0 /* options */ );
152     if ( ! cursor.hasNext() )
153         return null;
154     var ret = cursor.next();
155     if ( cursor.hasNext() ) throw "findOne has more than 1 result!";
156     if ( ret.$err )
157         throw "error " + tojson( ret );
158     return ret;
159 }

So the error was raised in the findOne() helper function: it had called the .find() function, and that call was interrupted by the ^C.
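
To make the error/location split concrete, here is a minimal sketch that separates such a message into its parts. The helper name splitShellError is hypothetical and not part of the mongo shell; it assumes the location is always the last whitespace-separated token, in the form file:line.

```javascript
// Hypothetical helper, not part of the mongo shell: splits a shell error
// string of the form "<error text> <file>:<line>" into its parts.
function splitShellError(message) {
    // Assumption: the location is the last whitespace-separated token and
    // looks like "path/to/file.js:123".
    var match = /^(.*)\s+(\S+):(\d+)$/.exec(message);
    if (!match) {
        // No recognizable location; treat the whole string as the error text.
        return { error: message, file: null, line: null };
    }
    return { error: match[1], file: match[2], line: parseInt(match[3], 10) };
}

var parsed = splitShellError("error doing query: unknown shell/collection.js:151");
console.log(parsed.error);                    // error doing query: unknown
console.log(parsed.file + ":" + parsed.line); // shell/collection.js:151
```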

Ideally, the ^C should have caused a clean abort of the operation and more polite and sensible messages ending with a return to the prompt. Apparently, it continued to process something, and the second ^C interrupted the MessagingPort::call() function.

The actual crash happened after control had returned to the main prompt loop, where the sayReplSetMemberState() routine is used to generate text to display at the prompt (e.g. "ReplSetName:PRIMARY> "). As a guess, it seems like whatever was making the findOne() query take too long was also making the query to get the replication state take too long (hence the second ^C), but the shell's internal state was not consistent due to incorrect handling of the first ^C, leading to the crash.
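
The "wrong id got:1b4 expect:1b6" line in the log is consistent with this kind of inconsistency: a reply to an earlier, interrupted request is still unread on the connection and gets picked up as the answer to a later request. The following is a toy model of that situation only. FakeConnection and its send, recv and call methods are hypothetical names invented for illustration; this is not the shell's MessagingPort code, and the IDs do not match the ones in the log.

```javascript
// Toy model only: FakeConnection and its methods are hypothetical and do not
// mirror the real MessagingPort implementation.
function FakeConnection() {
    this.nextId = 1;
    this.pendingReplies = []; // replies queued by the simulated server, oldest first
}

// Send a request; the simulated server always queues a reply for it.
FakeConnection.prototype.send = function (payload) {
    var id = this.nextId++;
    this.pendingReplies.push({ responseTo: id, payload: "reply to " + payload });
    return id;
};

// Read the oldest queued reply and check that it answers the given request.
FakeConnection.prototype.recv = function (expectedId) {
    var reply = this.pendingReplies.shift();
    if (reply.responseTo !== expectedId) {
        throw new Error("wrong id got:" + reply.responseTo + " expect:" + expectedId);
    }
    return reply.payload;
};

// A normal round trip: send, then read the matching reply.
FakeConnection.prototype.call = function (payload) {
    return this.recv(this.send(payload));
};

var conn = new FakeConnection();
conn.send("findOne"); // request sent, then "interrupted" before recv() runs

var outcome;
try {
    // The next round trip reads the stale findOne reply first.
    outcome = conn.call("prompt status query");
} catch (e) {
    outcome = e.message;
}
console.log(outcome); // wrong id got:1 expect:2
```

In this model, the mismatch goes away if the interrupted request's reply is read or discarded before the next call, which is one way to picture what a clean abort would need to guarantee.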

I assume that the findOne() operation was taking a long time and that's why you hit ^C. To help us reproduce this, can you tell us a little about the configuration of the database and the hourly_stats collection? How long had you been waiting for a response, and do you know why the response would have been delayed (e.g. known problems at the time of the query)? Have you used ^C on slow responses like this before and had it work properly? Is there anything else you can think of that might help us set up an environment to reproduce this? Thanks!

Generated at Thu Feb 08 03:07:22 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.