[SERVER-2265] Map reduce failed with special character (utf-8) and a pipe ( | ) Created: 21/Dec/10  Updated: 24/Jun/13  Resolved: 24/Jun/13

Status: Closed
Project: Core Server
Component/s: JavaScript
Affects Version/s: 1.6.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Sandro Munda Assignee: Unassigned
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

osx x86_64 - mongodb 1.6.5


Attachments: File testcase.js    
Issue Links:
Depends
depends on SERVER-2407 Switch to v8 Closed
Operating System: ALL
Participants:

 Description   

Hi,

I want to use MongoDB's map-reduce feature, so I wrote a simple map_index function in JavaScript:

function map_index()
{
var _translit = [
[/ä|æ/g, 'ae'],
];
...
...
...

After that, I tried to load the file (index.js) into the mongo shell and got:

> load('index.js');
map reduce failed: {
"assertion" : "map compile failed: JS Error: SyntaxError: unterminated regular expression literal nofile_b:1",
"assertionCode" : 9012,
"errmsg" : "db assertion failure",
"ok" : 0
}

There is no problem if I remove "|æ":

var _translit = [
[/ä/g, 'ae'],
];

The same holds when I remove "ä" and keep "æ".

Of course, the file is encoded in UTF-8. The full test code is attached.



 Comments   
Comment by Tad Marshall [ 24/Jun/13 ]

SpiderMonkey has been replaced by V8.

Comment by Tad Marshall [ 15/Aug/12 ]

It's a SpiderMonkey bug. You don't need to involve the database or map-reduce at all.

> var regex = /ä|æ/g
> regex
/ä|æ/g
> function return_regex() { return /ä|æ/g ; };
> return_regex
function return_regex() {
    return /ä|æ;
}
> 

The function has corrupted the regex already. Apparently, it is using the count of Unicode characters as the byte count, so two UTF-8 characters of two bytes each cause two bytes to be lost from the end of the string.

> function trash_3() { return /äää987654321/ }
> trash_3
function trash_3() {
    return /äää9876543;
}
> function trash_2() { return /ää987654321/ }
> trash_2
function trash_2() {
    return /ää98765432;
}
> function trash_1() { return /ä987654321/ }
> trash_1
function trash_1() {
    return /ä987654321;
}
> 

Three 2-byte UTF-8 characters cause three bytes to be lost from the end; two such characters lose two bytes; one such character loses one byte.
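
The pattern above can be modeled with a short sketch. This is a hypothetical model of the symptom, not SpiderMonkey's actual code: it assumes the source is held as UTF-8 bytes and that only as many bytes as there are characters get copied. The function name simulateTruncation is made up for illustration, and the snippet relies on TextEncoder/TextDecoder, so it is meant for a modern runtime such as Node.js rather than the 1.6.5 shell.

```javascript
// Hypothetical model of the bug: treat the character count as the byte count
// when copying a UTF-8 encoded regex literal.
function simulateTruncation(source) {
  var bytes = new TextEncoder().encode(source);   // UTF-8 bytes of the literal
  var charCount = Array.from(source).length;      // number of Unicode characters
  var kept = bytes.slice(0, charCount);           // copy too few bytes
  return new TextDecoder().decode(kept);
}

console.log(simulateTruncation("/ä|æ/g"));         // "/ä|æ"
console.log(simulateTruncation("/äää987654321/")); // "/äää9876543"
console.log(simulateTruncation("/ää987654321/"));  // "/ää98765432"
console.log(simulateTruncation("/ä987654321/"));   // "/ä987654321"
```

Under this assumption, each 2-byte character costs exactly one trailing byte, which matches the truncated literals shown in the shell output above.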

Comment by Antoine Girbal [ 12/Oct/11 ]

It does work correctly in V8:
> str = "/ä|æ/g"
/ä|æ/g
> re = /ä|æ/g
/ä|æ/g
> var f = function() { var re = /ä|æ/g;}
> f
function () { var re = /ä|æ/g;}

We should consider opening a bug against SpiderMonkey (SM), though it may already be fixed in SM 1.8.

Comment by Antoine Girbal [ 12/Oct/11 ]

It looks like the server sees only part of the expression: the closing "/g" is missing.

Tue Oct 11 22:04:22 [conn35] JS Error: SyntaxError: unterminated regular expression literal nofile_b:1
Tue Oct 11 22:04:22 [conn35] compile failed for: function map_index() {
var _translit = [[/ä|æ, "ae"], ];
emit(this._id, {title:[this.title]});
}

A simpler view of the problem is:

> em = function map_index(){
... var _translit = [
... [/ä|æ/g, 'ae'],
... ];
... emit(this._id, { title: [this.title] });
... }
function map_index() {
var _translit = [[/ä|æ, "ae"], ];
emit(this._id, {title:[this.title]});
}

As you can see, the last 2 characters of the regular expression have disappeared.
This is due to the JS engine miscalculating the length of the expression.
The special characters use 2 bytes each, but the engine appears to assume every character is 1 byte.

Within a string the result is correct:
> str = "/ä|æ/g"
/ä|æ/g

Even a simple regular expression object works:
> var re = [ /ä|æ/g, "ae"]
> re
[ /ä|æ/g, "ae" ]

But if it's within a function, the bug appears:
> var f = function() { var re = /ä|æ/g;}
> f
function () {
var re = /ä|æ;
}

This looks like a bug in SpiderMonkey.
Will try with V8.
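
Since the truncation seems tied to multi-byte characters in the function source, one possible workaround is to keep the source ASCII-only by writing the characters as \u escapes. This is an assumption and has not been verified against the affected SpiderMonkey build; the helper name transliterate is hypothetical and only illustrates the idea.

```javascript
// The escaped literal matches the same characters as /ä|æ/g,
// but the function source itself contains only ASCII.
function transliterate(text) {
  return text.replace(/\u00e4|\u00e6/g, 'ae');
}

console.log(transliterate("Käse æther")); // "Kaese aether"
```

With an ASCII-only source, the character count and the byte count are equal, so a length mix-up of the kind described above would have nothing to cut off.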

Comment by Sandro Munda [ 22/Dec/10 ]

Hello,

I wrote a simple test case; you can find it in the attachment. Before loading the file, you can run this in the mongo shell:

> db.test.drop()
true
> db.test.save({title: "foo"})
>

Thanks !

Comment by Eliot Horowitz (Inactive) [ 21/Dec/10 ]

Certain fields need to be present; an empty object doesn't trigger it.

Also, we don't know what "slug" is.

Comment by Sandro Munda [ 21/Dec/10 ]

This example fails with any collection; no specific collection is needed.
Thanks.

Comment by Eliot Horowitz (Inactive) [ 21/Dec/10 ]

Can you also attach some sample data?
Ideally a mongodump file?
