RLP fails to tokenize Chinese strings with ESC control characters


    • Type: Bug
    • Resolution: Done
    • Priority: Minor - P4
    • Affects Version/s: 3.1.2
    • Component/s: Text Search
    • ALL
    • Steps To Reproduce:
      var t = db.rlp;
      t.drop();
      
      assert.commandWorked(t.ensureIndex({a: 'text'}));
      assert.eq(t.find({$text: {$search: '\u001b', $language: 'zht'}}).itcount(), 0);
      

      This is a bug in RLP (Rosette Linguistics Platform), but it affects MongoDB users who query for, or index, Chinese strings containing ESC (U+001B) control characters.

      It affects both Traditional Chinese (zht) and Simplified Chinese (zhs) strings.

      > var t = db.rlp;
      > t.drop();
      true
      
      > t.ensureIndex({a: 'text'});
      {
      	"createdCollectionAutomatically" : true,
      	"numIndexesBefore" : 1,
      	"numIndexesAfter" : 2,
      	"ok" : 1
      }
      
      // Traditional Chinese
      > t.find({$text: {$search: '\u001b', $language: 'zht'}});
      Error: error: {
      	"$err" : "Unable to process the document with return code: -10005, and document '\u001b'.",
      	"code" : 28627
      }
      
      // Simplified Chinese
      > t.find({$text: {$search: '\u001b', $language: 'zhs'}});
      Error: error: {
      	"$err" : "Unable to process the document with return code: -10005, and document '\u001b'.",
      	"code" : 28627
      }
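
Until the underlying RLP bug is fixed, one possible client-side mitigation is to drop ESC characters from strings before they reach a $text query or a text-indexed field. The sketch below is not part of the original report. It assumes the failure is triggered only by the ESC (U+001B) character, as the transcript above suggests, and that discarding it is acceptable for the application. `stripEsc` and `buildTextQuery` are hypothetical helper names, not MongoDB or RLP APIs.

```javascript
// Hypothetical helper (not a MongoDB or RLP API): removes every
// ESC (U+001B) character from a string.
function stripEsc(s) {
    return s.replace(/\u001b/g, '');
}

// Hypothetical helper: builds a $text query document whose search
// string has had ESC characters removed. `language` is passed through
// unchanged, e.g. 'zht' or 'zhs'.
function buildTextQuery(search, language) {
    return {$text: {$search: stripEsc(search), $language: language}};
}

// Possible use in the mongo shell, with the collection from the report:
//   t.find(buildTextQuery('\u001b', 'zht')).itcount();
```

The same `stripEsc` call could be applied to field values before insert, so that indexed documents never contain the character.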
      

              Assignee:
              DO NOT USE - Backlog - Platform Team
              Reporter:
              Kamran K. (Inactive)
              Votes:
              0
              Watchers:
              4
