[CDRIVER-2159] Ignore non-ASCII when downcasing domain names Created: 10/May/17  Updated: 28/Oct/23  Resolved: 13/Jun/18

Status: Closed
Project: C Driver
Component/s: libmongoc, network
Affects Version/s: None
Fix Version/s: 1.11.0

Type: Task
Priority: Minor - P4
Reporter: A. Jesse Jiryu Davis
Assignee: Unassigned
Resolution: Fixed
Votes: 0
Labels: neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Backwards Compatibility: Fully Compatible

 Description   

SDAM requires us to downcase domain names of MongoDB servers:

https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#hostnames-are-normalized-to-lower-case

Our decision was to translate uppercase ASCII chars to lowercase and leave non-ASCII chars unchanged.

The C Driver uses mongoc_lowercase for this. I'm concerned that a multibyte UTF-8 character could be corrupted by its algorithm:

void
mongoc_lowercase (const char *src, char *buf /* OUT */)
{
   for (; *src; ++src, ++buf) {
      *buf = tolower (*src);
   }
}

It should probably use bson_utf8_next_char to advance through the string, instead of using "++".
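
As a minimal sketch of the "translate uppercase ASCII, leave non-ASCII unchanged" decision, the loop below avoids tolower entirely and only rewrites bytes in the range 'A'..'Z'. The name lowercase_ascii_only is hypothetical and this is not the driver's actual code; it also assumes the caller provides a buffer of at least strlen (src) + 1 bytes and wants the output NUL-terminated, which the function quoted above does not do itself.

```c
#include <assert.h>

/* Hypothetical helper (not part of libmongoc): lowercase only the ASCII
 * letters 'A'..'Z' and copy every other byte through unchanged. */
void
lowercase_ascii_only (const char *src, char *buf /* OUT */)
{
   assert (src && buf);

   for (; *src; ++src, ++buf) {
      /* Inspect the byte as unsigned so values >= 0x80 are never negative. */
      unsigned char c = (unsigned char) *src;

      if (c >= 'A' && c <= 'Z') {
         /* ASCII upper- and lowercase letters differ by 'a' - 'A' (0x20). */
         *buf = (char) (c + ('a' - 'A'));
      } else {
         /* Covers digits, punctuation, and every byte of a multibyte
          * UTF-8 sequence (all of which are >= 0x80). */
         *buf = *src;
      }
   }

   /* Assumption of this sketch: terminate the output string. */
   *buf = '\0';
}
```

Because each byte is handled independently and bytes outside 'A'..'Z' are copied verbatim, advancing with "++" cannot split or alter a multibyte character in this version. For example, the input "Ab\xC3\x89Z" yields "ab\xC3\x89z".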



 Comments   
Comment by Githook User [ 13/Jun/18 ]

Author: Spencer McKenney (spencemc) <spencermck@me.com>

Message: CDRIVER-2159 ignore non-ascii in lowercase
Branch: master
https://github.com/mongodb/mongo-c-driver/commit/c36e7a9c42a9aaa70447da858a49752845ac3e50

Comment by Kevin Albertson [ 13/Jun/18 ]

spencer.mckenney@10gen.com, when you're ready, let's go through the changes for this and the workflow for working on a ticket.

Comment by A. Jesse Jiryu Davis [ 08/May/18 ]

Good point. Still, this code seems to work by accident. Wouldn't it be better to always iterate UTF-8 characters using bson_utf8_next_char instead of ++?

Comment by Kevin Albertson [ 08/May/18 ]

From the man page of tolower:

If the argument is an upper-case letter, the tolower() function returns the corresponding lower-case letter if there is one; otherwise, the argument is returned unchanged.

Multi-byte UTF-8 characters consist entirely of bytes with the leading bit set to 1. So I think those should get returned unaltered.
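
The bit-level claim can be checked with a small sketch: in UTF-8, lead bytes of multibyte sequences and their continuation bytes are all 0x80 or greater, so none of them collides with an ASCII letter. The helper name all_bytes_have_high_bit is hypothetical. Note that this only shows the bytes are outside the ASCII range; whether tolower itself leaves such bytes alone can still depend on the current locale and on passing the value as an unsigned char, so treat that part as an assumption rather than a guarantee.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical check (not part of libmongoc): returns true when every byte
 * of `s` has its most significant bit set, i.e. no byte is ASCII. */
bool
all_bytes_have_high_bit (const char *s)
{
   assert (s);

   for (; *s; ++s) {
      /* 0x80 masks the leading bit; ASCII bytes have it cleared. */
      if (((unsigned char) *s & 0x80) == 0) {
         return false;
      }
   }

   return true;
}
```

Calling it on the UTF-8 encoding of a single non-ASCII character, such as "\xC3\xA9" (two bytes) or "\xE2\x82\xAC" (three bytes), returns true, while any string containing an ASCII character such as "A" returns false.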
 
