Racket Slack Archive

jerome.martin.dev

2019-5-10 14:18:03

I think the new emoji for UTF-8 begin at a given Unicode code point. But the older ones are mixed up

jerome.martin.dev

2019-5-10 14:20:11

You have the complete list here (take care, the page is slow to load): https://www.unicode.org/emoji/charts/full-emoji-list.html

lexi.lambda

2019-5-10 15:09:31

Clearly Racket needs a char-emoji? predicate. :grin:

lexi.lambda

2019-5-10 15:20:52

What does char->integer produce on that character?

lexi.lambda

2019-5-10 15:23:14

I am no expert on UTF-8, so to be clear: are you saying you think that string of UTF-8 bytes isn’t the right UTF-8 encoding of that code point, or that you don’t know how to correlate the UTF-8 encoding with the code point?

lexi.lambda

2019-5-10 15:25:31

Also: some emojis are multi-codepoint sequences, IIUC.

lexi.lambda

2019-5-10 15:26:51

Does string->list on that string produce a single element? (I’d check, but I’m on my phone.)

lexi.lambda

2019-5-10 15:28:36

I didn’t think UTF-8 used surrogate pairs (that’s UTF-16), but I could be wrong; again, not an expert.

lexi.lambda

2019-5-10 15:38:40

(Sorry, had to step away for a moment.) In any case, whether or not UTF-8 technically uses things called “surrogate pairs” is probably not helpful; the point is it’s a variable-length encoding. If you want a regexp pattern, can you use regexp-quote?

mflatt

2019-5-10 15:39:20

I think you need “\U1f601” (capital “U”), because “\u” only uses 4 characters.

mflatt

2019-5-10 15:40:01

> (string-length "\u1f601")
2
> (string-length "\U1f601")
1
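
The two lengths above can be unpacked a little further. The following is a minimal sketch, not something posted in the thread; the names `short-escape` and `long-escape` are made up for illustration. It also touches the earlier questions about `char->integer` and `string->list`.

```racket
#lang racket/base

;; "\u" reads at most four hex digits, so this is U+1F60 followed by the digit "1".
(define short-escape "\u1f601")
;; "\U" reads up to eight hex digits, so this is the single code point U+1F601.
(define long-escape "\U1f601")

(string-length short-escape)                      ; 2
(map char->integer (string->list short-escape))   ; '(8032 49), i.e. #x1F60 and #\1
(string-length long-escape)                       ; 1
(char->integer (string-ref long-escape 0))        ; 128513, i.e. #x1F601
;; The string holds one code point; its UTF-8 encoding is four bytes long.
(bytes-length (string->bytes/utf-8 long-escape))  ; 4
```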

mflatt

2019-5-10 15:40:57

I forget what backward-compatibility problem made the u/U distinction necessary.

soegaard2

2019-5-10 16:25:23

https://twitter.com/sperbsen/status/1126866140028968961

soegaard2

2019-5-10 16:26:08

I am less scared of refactoring in Racket.

notjack

2019-5-10 17:43:39

@mbutterick I thought that racket strings are logically sequences of code points, independent of an encoding, and the fact that they’re internally represented with UTF8 bytestrings was an implementation detail. So why would surrogate pairs affect regexp matching?

soegaard2

2019-5-10 17:54:33

Do emoji characters work in a regexp character range?
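
One way to check this is to try it at the REPL. The sketch below is an assumption-laden experiment rather than an answer given in the thread: the range U+1F600 to U+1F64F is picked only as an example block, and `grin`, `emoticon-rx`, and `literal-rx` are hypothetical names. It relies on string regexps matching characters rather than encoded bytes.

```racket
#lang racket/base

(define grin "\U1F601")

;; "." in a string regexp consumes one character, so the whole emoji is the match.
(regexp-match #rx"." grin)

;; A character range whose endpoints lie outside the Basic Multilingual Plane.
(define emoticon-rx (regexp "[\U1F600-\U1F64F]"))
(regexp-match? emoticon-rx grin)
(regexp-match? emoticon-rx "plain ASCII")

;; regexp-quote escapes the metacharacters (the parentheses here), so the
;; resulting pattern matches the text literally.
(define literal-rx (regexp (regexp-quote "(\U1F601)")))
(regexp-match? literal-rx "smile (\U1F601) here")
```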

samth

2019-5-10 18:17:26

@notjack they’re internally represented as UTF32 (or maybe UCS4) so that character indexing is constant time. This was probably not the right decision but it was less obvious in 2005(?) when Racket switched to unicode

mflatt

2019-5-10 18:26:13

I think this choice has worked out well, and Chez Scheme uses roughly the same representation, except that individual characters are tagged. Symbols are represented in UTF-8 in Racket, but they use strings in Chez Scheme / Racket CS.

samth

2019-5-10 18:35:06

I’m curious why you’d pick UTF32 again, as opposed to UTF8. Rust and Go have both chosen UTF8, and Swift also doesn’t provide constant-time random access and they seem pretty happy with that.

samdphillips

2019-5-10 18:37:13

In the case of Chez it could be to support string-set! more efficiently.

samth

2019-5-10 18:38:02

@samdphillips yes, that’s the traditional approach in Scheme-family languages — provide constant time access to every part of the string

samdphillips

2019-5-10 18:44:01

Also string->list and list->string don’t require as much codec logic.

samdphillips

2019-5-10 18:44:31

These are all very scheme-y reasons

samth

2019-5-10 18:45:39

Right, I think also all of those operations are less sensible than they were in the pre-Unicode era

samdphillips

2019-5-10 18:54:30

I’ve wondered about having (in any language) an immutable UTF-8 encoded string type and a mutable rope as standard.

samdphillips

2019-5-10 18:57:20

(actually not quite mutable for the rope, but at least a way to build up “changed” strings)

greg

2019-5-10 19:03:04

Mainly I’m glad I don’t need to know or care what the internal representation is; if some people are surprised it’s actually UTF-16 or UCS4 or EBCDIC, and had no idea otherwise, then I think that’s a wonderful thing. [Edit: To be clear, I had no idea which, myself.]

greg

2019-5-10 19:03:29

@samth Is the argument in favor of a UTF-8 internal rep mainly space?

samth

2019-5-10 19:04:48

Also that you usually want to interact in UTF-8 and that avoiding the conversion is faster. Also iterating over UTF-8 encoded ASCII text (pretty common) is faster.

lexi.lambda

2019-5-10 19:20:23

IME, UTF-32 sounds like it has some advantages, since you get constant-time access to code points, and a “code point” sounds kind of like something that might mean “single unit of text,” in contrast to things like “surrogate pairs” which are just building blocks. The problem is that this isn’t actually true; lots of things that render as a single glyph are made from multiple code points. IIUC, “grapheme clusters” are the closest thing defined by Unicode to “a single unit of text,” and these are themselves made from variable-width strings of code points (of arbitrary length), so a fixed-width encoding is hopeless from the start.
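
A small sketch of the point about code points versus visible units, using two examples chosen for illustration (they are not from the thread): a combining accent and a zero-width-joiner emoji sequence. The names `decomposed` and `family` are hypothetical.

```racket
#lang racket/base

;; "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points, one visible letter.
(define decomposed "e\u0301")
(string-length decomposed)                         ; 2
;; NFC normalization happens to compose this pair into the single code point U+00E9.
(string-length (string-normalize-nfc decomposed))  ; 1

;; Three emoji code points joined by U+200D ZERO WIDTH JOINER. There is no
;; precomposed form, so this stays five code points however it is rendered.
(define family "\U1F468\u200D\U1F469\u200D\U1F467")
(string-length family)                             ; 5
(string-length (string-normalize-nfc family))      ; 5
;; string-ref indexes by code point, so it returns one piece of the sequence.
(char->integer (string-ref family 1))              ; 8205, i.e. #x200D
```

Normalization only helps where a precomposed code point exists, which is why constant-time indexing by code point does not amount to constant-time indexing by what a reader would call a character.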

lexi.lambda

2019-5-10 19:22:39

I don’t know of any programming language in existence that represents entire grapheme clusters with its char/character datatype. The only choices I think I’ve ever seen are “a character is a byte,” “a character is a UTF-16 code unit,” and “a character is a Unicode code point/scalar value.”

samth

2019-5-10 19:24:53

@lexi.lambda I think Swift does that

lexi.lambda

2019-5-10 19:25:03

Really? Fascinating.

samth

2019-5-10 19:25:17

https://developer.apple.com/documentation/swift/character

lexi.lambda

2019-5-10 19:26:41

That’s cool! I wonder what problems that choice has. :grimacing: (It seems like all of them have at least some…)

samth

2019-5-10 19:28:16

I think “Swift strings are complicated” is probably one of them

lexi.lambda

2019-5-10 19:29:36

Fair, though I think that’s still better than the alternative, which is an API that tricks people into thinking strings are not complicated, followed by doing things that are wrong.

lexi.lambda

2019-5-10 19:30:32

In any case, I agree with both Sam and Greg: I’m not sure what advantage UTF-32 has over UTF-8 as an internal representation, but I’m glad Racket’s API doesn’t expose that implementation detail (C API notwithstanding).

lexi.lambda

2019-5-10 19:31:30

(The discussion of what char? means is different, but it seems unlikely that’s ever going to change for Racket!)

greg

2019-5-10 19:35:27

But but but strings are simple, just like dates and time zones and people’s names and addresses and … :smile:

samdphillips

2019-5-10 19:39:25

Strings would be easy if it weren’t for all of those other pesky non-English languages.

samdphillips

2019-5-10 19:39:30

Oh and math

lexi.lambda

2019-5-10 19:40:50

I think naïve and clichéd are perfectly cromulent English words. :grin:

greg

2019-5-10 19:48:06

Madame, this is a Wendy’s. We call them freedom strings.

notjack

2019-5-10 19:51:21

To hell with it, I’m just gonna make a text? type that’s a pair of an immutable bytestring and a charset name

mflatt

2019-5-11 01:59:02

There’s now a “mini-bar-plot” collection in the “racket-benchmarks” package. (I considered other places, but dumped it there as fairly specific to “racket-benchmarks”, at least for now.) It’s not documented, so let me know if you become interested in trying to use it.