jerome.martin.dev
2019-5-10 14:18:03

I think the new emoji begin at a given Unicode code point, but the older ones are mixed up.


jerome.martin.dev
2019-5-10 14:20:11

You have the complete list here (take care, the page is slow to load): https://www.unicode.org/emoji/charts/full-emoji-list.html


lexi.lambda
2019-5-10 15:09:31

Clearly Racket needs a char-emoji? predicate. :grin:


lexi.lambda
2019-5-10 15:20:52

What does char->integer produce on that character?


lexi.lambda
2019-5-10 15:23:14

I am no expert on UTF-8, so to be clear: are you saying you think that string of UTF-8 bytes isn’t the right UTF-8 encoding of that code point, or that you don’t know how to correlate the UTF-8 encoding with the code point?


lexi.lambda
2019-5-10 15:25:31

Also: some emojis are multi-codepoint sequences, IIUC.


lexi.lambda
2019-5-10 15:26:51

Does string->list on that string produce a single element? (I’d check, but I’m on my phone.)


lexi.lambda
2019-5-10 15:28:36

I didn’t think UTF-8 used surrogate pairs (that’s UTF-16), but I could be wrong; again, not an expert.


lexi.lambda
2019-5-10 15:38:40

(Sorry, had to step away for a moment.) In any case, whether UTF-8 technically uses things called “surrogate pairs” is probably not helpful; the point is that it’s a variable-length encoding. If you want a regexp pattern, can you use regexp-quote?


mflatt
2019-5-10 15:39:20

I think you need “\U1f601” (capital “U”), because “\u” only reads up to 4 hex digits.


mflatt
2019-5-10 15:40:01
> (string-length "\u1f601")
2
> (string-length "\U1f601")
1
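
A minimal sketch of what the two escapes produce, assuming a current Racket reader (the names `four-digit` and `full` are only for illustration). `\u` stops after four hex digits, so the trailing `1` becomes a second character, while `\U` consumes the whole code point:

```racket
#lang racket/base

;; "\u1f601" is read as U+1F60 followed by the literal character #\1.
(define four-digit "\u1f601")
;; "\U1f601" is read as the single code point U+1F601.
(define full "\U1f601")

(string-length four-digit)                  ; 2
(char->integer (string-ref four-digit 0))   ; 8032, i.e. #x1F60
(string-ref four-digit 1)                   ; #\1
(string-length full)                        ; 1
(char->integer (string-ref full 0))         ; 128513, i.e. #x1F601
(length (string->list full))                ; 1
(bytes-length (string->bytes/utf-8 full))   ; 4 bytes once encoded as UTF-8
```

This also covers the earlier questions about char->integer and string->list: the string holds one character, and the four bytes only appear when it is encoded.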

mflatt
2019-5-10 15:40:57

I forget what backward-compatibility problem made the u/U distinction necessary.



soegaard2
2019-5-10 16:26:08

I am less scared of refactoring in Racket.


notjack
2019-5-10 17:43:39

@mbutterick I thought that Racket strings are logically sequences of code points, independent of an encoding, and the fact that they’re internally represented with UTF-8 byte strings was an implementation detail. So why would surrogate pairs affect regexp matching?


soegaard2
2019-5-10 17:54:33

Do emoji characters work in a regexp character range?
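
One way to check, sketched on the assumption that string regexps compare range endpoints by code point. The range U+1F600 to U+1F64F is the Unicode Emoticons block, picked here only as an example; it does not cover every emoji, and `emoticon-rx` is a made-up name:

```racket
#lang racket/base

;; Hypothetical pattern: a single character range over the Emoticons block.
(define emoticon-rx (regexp "[\U1F600-\U1F64F]"))

(regexp-match? emoticon-rx "hello \U1F601")   ; #t
(regexp-match? emoticon-rx "hello")           ; #f

;; regexp-quote builds a pattern that matches one specific string literally.
;; Positions for string input are counted in characters, not bytes.
(regexp-match-positions (regexp-quote "\U1F601") "a\U1F601b")   ; '((1 . 2))
```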


samth
2019-5-10 18:17:26

@notjack they’re internally represented as UTF-32 (or maybe UCS-4) so that character indexing is constant time. This was probably not the right decision, but that was less obvious in 2005(?) when Racket switched to Unicode.


mflatt
2019-5-10 18:26:13

I think this choice has worked out well, and Chez Scheme uses roughly the same representation, except that individual characters are tagged. Symbols are represented in UTF-8 in Racket, but they use strings in Chez Scheme / Racket CS.


samth
2019-5-10 18:35:06

I’m curious why you’d pick UTF-32 again, as opposed to UTF-8. Rust and Go have both chosen UTF-8, and Swift also doesn’t provide constant-time random access, and they seem pretty happy with that.


samdphillips
2019-5-10 18:37:13

In the case of Chez it could be to support string-set! more efficiently.


samth
2019-5-10 18:38:02

@samdphillips yes, that’s the traditional approach in Scheme-family languages — provide constant time access to every part of the string


samdphillips
2019-5-10 18:44:01

Also, string->list and list->string don’t require as much codec logic.


samdphillips
2019-5-10 18:44:31

These are all very Scheme-y reasons.


samth
2019-5-10 18:45:39

Right, and I think all of those operations are less sensible than they were in the pre-Unicode era.


samdphillips
2019-5-10 18:54:30

I’ve wondered about having (in any language) an immutable UTF-8 encoded string type and a mutable rope as standard.


samdphillips
2019-5-10 18:57:20

(actually not quite mutable for the rope, but at least a way to build up “changed” strings)


greg
2019-5-10 19:03:04

Mainly I’m glad I don’t need to know or care what the internal representation is; if some people are surprised it’s actually UTF-16 or UCS-4 or EBCDIC, and had no idea otherwise, then I think that’s a wonderful thing. [Edit: To be clear, I had no idea which, myself.]


greg
2019-5-10 19:03:29

@samth Is the argument in favor of a UTF-8 internal rep mainly space?


samth
2019-5-10 19:04:48

Also that you usually want to interact in UTF-8 and that avoiding the conversion is faster. Also iterating over UTF-8 encoded ASCII text (pretty common) is faster.


lexi.lambda
2019-5-10 19:20:23

IME, UTF-32 sounds like it has some advantages, since you get constant-time access to code points, and a “code point” sounds kind of like something that might mean “single unit of text,” in contrast to things like “surrogate pairs” which are just building blocks. The problem is that this isn’t actually true; lots of things that render as a single glyph are made from multiple code points. IIUC, “grapheme clusters” are the closest thing defined by Unicode to “a single unit of text,” and these are themselves made from variable-width strings of code points (of arbitrary length), so a fixed-width encoding is hopeless from the start.
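
A small sketch of that point, assuming only racket/base (the variable names are made up for illustration). Each string below normally renders as one glyph, yet string-length reports several code points:

```racket
#lang racket/base

;; "e" followed by U+0301 COMBINING ACUTE ACCENT renders as one glyph.
(define decomposed "e\u0301")
(string-length decomposed)                          ; 2 code points
(string-length (string-normalize-nfc decomposed))   ; 1 (precomposed U+00E9)

;; Thumbs-up plus a skin-tone modifier: there is no precomposed form.
(define thumbs-up "\U1F44D\U1F3FD")
(string-length thumbs-up)                           ; 2 code points
(bytes-length (string->bytes/utf-8 thumbs-up))      ; 8 bytes in UTF-8

;; Man, ZWJ, woman, ZWJ, girl: one family emoji built from 5 code points.
(define family "\U1F468\u200D\U1F469\u200D\U1F467")
(string-length family)                              ; 5
```

Normalization rescues the accented letter but not the emoji sequences, so constant-time indexing by code point still does not give constant-time indexing by what a reader would call a character.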


lexi.lambda
2019-5-10 19:22:39

I don’t know of any programming language in existence that represents entire grapheme clusters with its char/character datatype. The only choices I think I’ve ever seen are “a character is a byte,” “a character is a UTF-16 code unit,” and “a character is a Unicode code point/scalar value.”


samth
2019-5-10 19:24:53

@lexi.lambda I think Swift does that


lexi.lambda
2019-5-10 19:25:03

Really? Fascinating.



lexi.lambda
2019-5-10 19:26:41

That’s cool! I wonder what problems that choice has. :grimacing: (It seems like all of them have at least some…)


samth
2019-5-10 19:28:16

I think “Swift strings are complicated” is probably one of them


lexi.lambda
2019-5-10 19:29:36

Fair, though I think that’s still better than the alternative, which is an API that tricks people into thinking strings are not complicated, followed by doing things that are wrong.


lexi.lambda
2019-5-10 19:30:32

In any case, I agree with both Sam and Greg: I’m not sure what advantage UTF-32 has over UTF-8 as an internal representation, but I’m glad Racket’s API doesn’t expose that implementation detail (C API notwithstanding).


lexi.lambda
2019-5-10 19:31:30

(The discussion of what char? means is different, but it seems unlikely that’s ever going to change for Racket!)


greg
2019-5-10 19:35:27

But but but strings are simple, just like dates and time zones and people’s names and addresses and … :smile:


samdphillips
2019-5-10 19:39:25

Strings would be easy if it weren’t for all of those other pesky non-English languages.


samdphillips
2019-5-10 19:39:30

Oh and math


lexi.lambda
2019-5-10 19:40:50

I think naïve and clichéd are perfectly cromulent English words. :grin:


greg
2019-5-10 19:48:06

Madame, this is a Wendy’s. We call them freedom strings.


notjack
2019-5-10 19:51:21

To hell with it, I’m just gonna make a text? type that’s a pair of an immutable bytestring and a charset name


mflatt
2019-5-11 01:59:02

There’s now a “mini-bar-plot” collection in the “racket-benchmarks” package. (I considered other places, but dumped it there as fairly specific to “racket-benchmarks”, at least for now.) It’s not documented, so let me know if you become interested in trying to use it.