
I think the new emoji begin at a given Unicode code point, but the older ones are mixed up.

You have the complete list here (take care, the page is slow to load): https://www.unicode.org/emoji/charts/full-emoji-list.html

Clearly Racket needs a char-emoji? predicate. :grin:

What does char->integer produce on that character?

I am no expert on UTF-8, so to be clear: are you saying you think that string of UTF-8 bytes isn’t the right UTF-8 encoding of that code point, or that you don’t know how to correlate the UTF-8 encoding with the code point?

Also: some emojis are multi-codepoint sequences, IIUC.

Does string->list on that string produce a single element? (I’d check, but I’m on my phone.)
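A quick sketch of how one might check both questions at a REPL. It assumes the character under discussion is U+1F601 (the code point mentioned later in the thread), and the thumbs-up sequence is just an illustrative multi-code-point emoji, not something from the original question.

```racket
#lang racket/base
;; Assumption: the emoji in question is U+1F601. It is written as an
;; escape so the snippet does not depend on the editor's encoding.
(define grin "\U1F601")

(char->integer (string-ref grin 0)) ; the code point as an integer
(length (string->list grin))        ; number of chars in the string

;; Illustrative multi-code-point emoji: U+1F44D (thumbs up) followed by
;; the skin-tone modifier U+1F3FD. It renders as one glyph, but the
;; string holds two chars.
(define thumbs-up "\U1F44D\U1F3FD")
(length (string->list thumbs-up))
```

If the first string yields one element and the second yields two, the emoji itself is a single char and only sequences like the second need special handling.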

I didn’t think UTF-8 used surrogate pairs (that’s UTF-16), but I could be wrong; again, not an expert.

(Sorry, had to step away for a moment.) In any case, whether UTF-8 technically uses things called “surrogate pairs” is probably not helpful; the point is that it’s a variable-length encoding. If you want a regexp pattern, can you use regexp-quote?
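To make the variable-length point concrete, here is a minimal sketch, again assuming the character is U+1F601. It shows the string holding one char while its UTF-8 encoding takes four bytes, and then uses regexp-quote to build a pattern for the literal character.

```racket
#lang racket/base
(define grin "\U1F601") ; assumed example character, U+1F601

;; One char in the string, but four bytes once encoded as UTF-8.
(string-length grin)
(bytes->list (string->bytes/utf-8 grin)) ; the bytes F0 9F 98 81, in decimal

;; regexp-quote escapes regexp metacharacters. An emoji has none, so the
;; quoted pattern simply matches the literal character.
(define pat (regexp (regexp-quote grin)))
(regexp-match pat (string-append "so " grin " happy"))
```

The pattern is built from a string, so matching is done in terms of chars; the byte-level encoding never has to appear in the pattern.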

I think you need “\U1f601” (capital “U”), because “\u” only reads up to 4 hex digits.

> (string-length "\u1f601")
2
> (string-length "\U1f601")
1

I forget what backward-compatibility problem made the u/U distinction necessary.


I am less scared of refactoring in Racket.

@mbutterick I thought that racket strings are logically sequences of code points, independent of an encoding, and the fact that they’re internally represented with UTF8 bytestrings was an implementation detail. So why would surrogate pairs affect regexp matching?

Do emoji characters work in a regexp character range?
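One way to try this out: a small sketch that builds a character class covering U+1F600 through U+1F64F (the Unicode "Emoticons" block) and tests it. The choice of block is only an example, and this assumes that string regexps treat a range as a range of code points, which is my understanding of Racket's regexp syntax.

```racket
#lang racket/base
;; Build the class [U+1F600-U+1F64F] from code points so the source
;; file itself stays ASCII.
(define emoticon-rx
  (regexp (string #\[ (integer->char #x1F600) #\- (integer->char #x1F64F) #\])))

(regexp-match? emoticon-rx "\U1F601") ; a char inside the range
(regexp-match? emoticon-rx "abc")     ; plain ASCII, outside the range
```

Note that a range like this only covers single-code-point emoji in one block; multi-code-point sequences would need a longer pattern.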

@notjack they’re internally represented as UTF32 (or maybe UCS4) so that character indexing is constant time. This was probably not the right decision but it was less obvious in 2005(?) when Racket switched to unicode

I think this choice has worked out well, and Chez Scheme uses roughly the same representation, except that individual characters are tagged. Symbols are represented in UTF-8 in Racket, but they use strings in Chez Scheme / Racket CS.

I’m curious why you’d pick UTF32 again, as opposed to UTF8. Rust and Go have both chosen UTF8, and Swift also doesn’t provide constant-time random access and they seem pretty happy with that.

In the case of Chez it could be to support string-set! more efficiently.

@samdphillips yes, that’s the traditional approach in Scheme-family languages — provide constant time access to every part of the string

Also string->list and list->string don’t require as much codec logic.
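The operations mentioned here can be sketched as follows. This only illustrates the API-level behavior; the claim that each slot is a fixed-width code point comes from the discussion above, not from anything the snippet can observe.

```racket
#lang racket/base
;; make-string returns a mutable string; each slot holds one char.
(define s (make-string 3 #\a))

;; Overwrite the middle slot with U+1F601. In a UTF-8 representation this
;; element would grow from 1 byte to 4; with one slot per char, the
;; neighbouring elements do not need to move.
(string-set! s 1 (integer->char #x1F601))

(string-ref s 2)                ; still #\a, reached by direct indexing
(list->string (string->list s)) ; round-trips char by char
```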

These are all very scheme-y reasons

Right, I think also all of those operations are less sensible than they were in the pre-Unicode era

I’ve wondered about having (in any language) an immutable UTF-8 encoded string type and a mutable rope as standard.

(actually not quite mutable for the rope, but at least a way to build up “changed” strings)

Mainly I’m glad I don’t need to know or care what the internal representation is; if some people are surprised it’s actually UTF-16 or UCS4 or EBCDIC, and had no idea otherwise, then I think that’s a wonderful thing. [Edit: To be clear, I had no idea which, myself.]

@samth Is the argument in favor of a UTF-8 internal rep mainly space?

Also that you usually want to interact in UTF-8 and that avoiding the conversion is faster. Also iterating over UTF-8 encoded ASCII text (pretty common) is faster.

IME, UTF-32 sounds like it has some advantages, since you get constant-time access to code points, and a “code point” sounds kind of like something that might mean “single unit of text,” in contrast to things like “surrogate pairs” which are just building blocks. The problem is that this isn’t actually true; lots of things that render as a single glyph are made from multiple code points. IIUC, “grapheme clusters” are the closest thing defined by Unicode to “a single unit of text,” and these are themselves made from variable-width strings of code points (of arbitrary length), so a fixed-width encoding is hopeless from the start.
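A small sketch of the gap between code points and what a reader sees as one character. The two examples (an accented "e" and a flag) are my own picks for illustration.

```racket
#lang racket/base
;; "e" followed by U+0301 (combining acute accent): one visible glyph,
;; two code points.
(define decomposed "e\u0301")
(string-length decomposed)

;; NFC normalization happens to have a precomposed form here (U+00E9) ...
(string-length (string-normalize-nfc decomposed))

;; ... but a flag is two regional-indicator code points (here the ones
;; for F and R) with no single-code-point form.
(define flag "\U1F1EB\U1F1F7")
(string-length flag)
(string-length (string-normalize-nfc flag))
```

So normalization can hide the problem for some accented letters, but it does not make "one char per visible character" true in general.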

I don’t know of any programming language in existence that represents entire grapheme clusters with its char/character datatype. The only choices I think I’ve ever seen are “a character is a byte,” “a character is a UTF-16 code unit,” and “a character is a Unicode code point/scalar value.”

@lexi.lambda I think Swift does that

Really? Fascinating.


That’s cool! I wonder what problems that choice has. :grimacing: (It seems like all of them have at least some…)

I think “Swift strings are complicated” is probably one of them

Fair, though I think that’s still better than the alternative, which is an API that tricks people into thinking strings are not complicated, followed by doing things that are wrong.

In any case, I agree with both Sam and Greg: I’m not sure what advantage UTF-32 has over UTF-8 as an internal representation, but I’m glad Racket’s API doesn’t expose that implementation detail (C API notwithstanding).

(The discussion of what char? means is different, but it seems unlikely that’s ever going to change for Racket!)

But but but strings are simple, just like dates and time zones and people’s names and addresses and … :smile:

Strings would be easy if it wasn’t for all of those other pesky non-English languages.

Oh and math

I think naïve and clichéd are perfectly cromulent English words. :grin:

Madame, this is a Wendy’s. We call them freedom strings.

To hell with it, I’m just gonna make a text? type that’s a pair of an immutable bytestring and a charset name
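For what it's worth, a rough sketch of what such a type might look like. Everything here is hypothetical: the struct name `text`, its field names, and the `text->string` helper are made up for illustration, and decoding is only sketched for UTF-8.

```racket
#lang racket/base
;; Hypothetical `text` struct: the guard copies the bytes into an
;; immutable byte string and checks that the charset name is a string.
(struct text (bytes charset)
  #:guard (lambda (bs cs name)
            (unless (and (bytes? bs) (string? cs))
              (error name "expected a byte string and a charset name"))
            (values (bytes->immutable-bytes bs) cs)))

;; Hypothetical helper: only UTF-8 is handled in this sketch.
(define (text->string t)
  (unless (string-ci=? (text-charset t) "UTF-8")
    (error 'text->string "unsupported charset: ~a" (text-charset t)))
  (bytes->string/utf-8 (text-bytes t)))

(define t (text (string->bytes/utf-8 "na\u00EFve") "UTF-8"))
(text->string t)
```

The struct definition also gives you the text? predicate for free; supporting other charsets would mean plugging in a real converter, which is left out here.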

There’s now a “mini-bar-plot” collection in the “racket-benchmarks” package. (I considered other places, but dumped it there as fairly specific to “racket-benchmarks”, at least for now.) It’s not documented, so let me know if you become interested in trying to use it.