
for did the job very nicely, thanks again

Two somewhat related questions:
• How can I get the inode number of a file on POSIX?
• How can I create a hardlink? make-file-or-directory-link ( https://docs.racket-lang.org/reference/Filesystem.html#%28def._%28%28quote._~23~25kernel%29._make-file-or-directory-link%29%29 ) seems to create only softlinks.

I don’t know a portable way to do it (read: using the Racket functions), but if you’re limiting yourself to *nix, I imagine you could just use system to do both of those?
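Something like this non-portable sketch, maybe (the helper names are made up; it just shells out to ln and ls -i via racket/system):

#lang racket
(require racket/system)

;; Create a hard link by shelling out to `ln` (POSIX only).
(define (make-hard-link! existing new)
  (system* (find-executable-path "ln") existing new))

;; Read a file's inode number from `ls -i`, which prints "INODE NAME".
(define (inode-of path)
  (define out
    (with-output-to-string
      (lambda () (system* (find-executable-path "ls") "-i" path))))
  (string->number (first (string-split out))))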

Calling a command line tool to make the hardlinks may actually be feasible, because (probably) I need to create only a few hardlinks.
For the inodes though, I need to determine the inodes for all files in a directory tree, and I suspect this will be very slow. In theory, I could speed this up by piping a list of paths into xargs and parsing its output, but that would be quite cumbersome, especially since I want to call a function for each file via fold-files ( https://docs.racket-lang.org/reference/Filesystem.html#%28def._%28%28lib._racket%2Ffile..rkt%29._fold-files%29%29 ).

So I’m now torn between experimenting with the Racket/C API or using a different programming language. :smile:

fold-files looked so nice. sigh

I was a bit surprised to not find the equivalent of a stat call.

Do you know Rash? Not sure if it could be useful to you: https://docs.racket-lang.org/rash/

is file-or-directory-identity what you want for inodes?
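For what it's worth, a rough sketch of combining it with fold-files to group every file in a tree by identity, assuming the identity really is shared by hard-linked files:

#lang racket

;; Map each file-or-directory-identity to the list of paths that have it;
;; hard links should end up in the same bucket.
(define (files-by-identity root)
  (fold-files
   (lambda (path type acc)
     (if (eq? type 'file)
         (hash-update acc
                      (file-or-directory-identity path)
                      (lambda (paths) (cons path paths))
                      '())
         acc))
   (hash)
   root))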

I think adding an option to make-file-or-directory-link for hard links would be a good idea

Can someone point me to docs/tutorial for building tests for a package (something the pkgd will run automatically as well when published)?

I don’t know of any tutorials. But here are some things to read:
- https://pkg-build.racket-lang.org/about.html
- https://docs.racket-lang.org/raco/test.html
- https://github.com/jeapostrophe/automata/tree/master/automata-test
- https://pkgs.racket-lang.org/package/automata-test
Basically the package server runs raco test --drdr after a package has been successfully installed.
And the description of --drdr is:
> --drdr: Configures defaults to imitate the DrDr continuous testing system: ignore non-modules, run tests in separate processes (unless --thread or --direct is specified), use as many jobs as available processors (unless --jobs is specified), set the default timeout to 90 seconds (unless --timeout is specified), create a fresh PLTUSERHOME and TMPDIR for each test, count stderr output as a test failure, quiet program output, provide empty program input, and print a table of results.
The second link describes how to configure tests.
The third link is an example of how jeapostrophe has tests for the automata library. (This was picked more or less at random.)
On the fourth link, click the “passing tests” icon to see the test results.

The rich FFI is one of the joys of Racket. You should take the plunge!

thanks

There is a scary amount of options for raco test…

@massung Unless you’re doing something exotic, I think the answer is you don’t have to do anything. By default it’s going to run all your .rkt files. I think that’s the super short story for most packages?

Most of the options have to do with not running all your .rkt files, or not running all of each of them.

If a .rkt file has a test submodule, then it will run only that (not the other submodules). That can be handy if, say, a file also has a main submodule that does things like connect to a server or whatever, that you don’t want to or can’t have happen in a test.
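For example, a made-up file sketching that situation:

#lang racket

(define (parse-line s) (string->number s))

(module+ main
  ;; Runs when you execute the file directly; this is the part that might
  ;; hit a server or read real data, which you don't want during testing.
  (displayln (parse-line (vector-ref (current-command-line-arguments) 0))))

(module+ test
  ;; raco test runs only this submodule when it exists.
  (require rackunit)
  (check-equal? (parse-line "42") 42))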

Sure, but that just validates that they don’t have any syntax errors. But - for example - what if I want to ensure that CSV parsing code is working and I have a simple csv file in a test/ path or something?

> Is file-or-directory-identity what you want for inodes?
The description of file-or-directory-identity is rather vague. I thought the description might mean that two files with the same “identity” would have the same “real path” (see https://man7.org/linux/man-pages/man3/realpath.3.html ), which would give different “identities” for hardlinks.
I now tried file-or-directory-identity. It gives me 150018644241532918675078402 for both files that are hardlinked to each other. The inode listed by ls -i is 8132527. It makes sense that the identity is not just the inode, because the same inode number may occur in different mounted file systems. I can imagine that the bits of the inode are somehow included in the result of file-or-directory-identity.
So it seems I can use file-or-directory-identity, but from the vague description I would never know for sure that it will always give the same id for hardlinked files (see realpath for another definition of “identity”; they don’t use the word “identity”, but you get the point :slightly_smiling_face: ). The fact that hardlinked files currently get the same result might be an implementation detail.
If file-or-directory-identity is actually supposed to have “file system plus inode” semantics, this should be made clearer in the documentation. But I don’t know what the intended semantics are.

There’s also something like a compile-test-omit-files setting in info.rkt, to exclude whole files or subdirectories.
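If I remember right, the key documented for raco test is test-omit-paths; a sketch with made-up paths:

;; info.rkt
#lang info
(define test-omit-paths '("scratch.rkt" "experiments"))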

> If a .rkt file has a test submodule
Do you mean, if I did something like (module+ test (run-test-function-here))?

The “old way” was you would have some test/ sub dir and all the things in it would be just more .rkt files that get run. Or even, say, a single tests.rkt file. And people still do that.

The “new way” is you can use test submodules.

So I just mean that by default the build server will run all your files. If some of them have tests, then it runs your tests.

The options are mostly about when you don’t want to run some files or parts of them.

Do you know of a simple package that does the test submodules I can just take a look at? If you don’t know one off-hand then I’ll look.

The problem is that Windows doesn’t have inodes, so the description is intended to be cross-platform. Maybe it should say that it includes the filesystem and inode information on some platforms, and give more information about what it does on Windows as well.

I usually just use rackunit ( https://docs.racket-lang.org/rackunit/ ). Here’s an example: https://git.sr.ht/~sschwarzer/sudoku-solver/tree/main/item/games/sudoku-solver/solver.rkt#L365 . As greg said, the tests are executed when you run the file with raco test.

Here’s one of mine: https://github.com/greghendershott/sha/blob/master/sha/main.rkt

> The problem is that Windows doesn’t have inodes
I think that’s only true for FAT32 (which, admittedly, is likely the majority of Windows machines out there). NTFS has i-nodes, IIRC.

Thank you both

I could have put those tests in a separate test.rkt file that did (require "main.rkt"). But for lightweight things, (module+ test ___) is handy, and keeps the tests near the code that they test.
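The separate-file style would look roughly like this (frobnicate is hypothetical, standing in for whatever main.rkt actually provides):

;; test.rkt
#lang racket
(require rackunit "main.rkt")

;; frobnicate is a made-up example export of main.rkt
(check-equal? (frobnicate 1) 2)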

Either way, the build server will run them.

Well, as I read the docs, it has numbers that are similar but are not called inodes.

Could be interesting to check what other languages/platforms use in the stat result on Windows.
Here’s a description from the Python documentation ( https://docs.python.org/3/library/os.html#os.stat_result ):
> st_ino
> Platform dependent, but if non-zero, uniquely identifies the file for a given value of st_dev. Typically:
> the inode number on Unix,
> the file index on Windows
“file index” is a link to https://msdn.microsoft.com/en-us/library/aa363788 .

Another thing about module+ is that it will splice multiple instances together into the same single submodule. So you can alternate (define thing __) and (module+ test test-thing) forms in the file, and the code and tests are adjacent.
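For instance, all the (module+ test ...) forms below get spliced into a single test submodule:

#lang racket

(module+ test (require rackunit))

(define (double x) (* 2 x))
(module+ test (check-equal? (double 3) 6))

(define (halve x) (/ x 2))
(module+ test (check-equal? (halve 10) 5))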

Whether you find mixing like that handy or distracting is a matter of opinion, of course. The good news is you or your team can pick whatever you want.

Another advantage of having a test submodule inside the file/module under test is that you can easily test code that’s private to the module (i.e. not provided).

Yes, that’s better than using something evil like require/expose ( https://docs.racket-lang.org/rackunit/api.html#%28form._%28%28lib._rackunit%2Fmain..rkt%29._require%2Fexpose%29%29 ) :smile:

> The rich FFI is one of the joys of Racket. You should take the plunge!
Here comes the next Racket project. :wink: But thanks for the encouragement. :slightly_smiling_face:

I had seen require/expose at some point, but never tried it due to the warnings. :smile:

yeah, sadly there’s no direct binding to fstat; it would be interesting to have a more low-level POSIX API in Racket too

Nim ( https://nim-lang.org ) uses a FileInfo type ( https://nim-lang.org/docs/os.html#FileInfo ) which has a tuple of device id and file id. I’m quite sure on POSIX the file id is the inode number.

Random comment: In Herbie we had a major inefficiency where we enumerated over the cartesian product of several lists. But sometimes one of the last lists was null, which meant loads and loads of time spent just to produce the empty list.
cartesian-product could be improved by handling that case specially.
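e.g. a wrapper that short-circuits; just a sketch of the special case, not the actual fix inside racket/list:

#lang racket
(require racket/list)

;; If any input list is empty the product is empty, so return immediately
;; instead of grinding through the other lists.
(define (cartesian-product/fast-empty . lists)
  (if (ormap null? lists)
      '()
      (apply cartesian-product lists)))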

Is there any form for disabling GC while a body executes?

At the Racket level, there isn’t. On Racket CS, you could use vm-eval to adjust the collect-request-handler parameter, although I don’t know if there are potential bad side effects of doing that.
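A totally untested sketch of that idea (Racket CS only; whether clobbering Racket’s own handler like this is safe is exactly the open question):

#lang racket
(require ffi/unsafe/vm)

;; Save the handler Racket CS installed, replace it with a no-op so GC
;; requests are ignored, run the work, then restore it and collect once.
(define (call-ignoring-gc thunk)
  (define saved (vm-eval '(collect-request-handler)))
  (vm-eval '(collect-request-handler (lambda () (void))))
  (dynamic-wind
    void
    thunk
    (lambda ()
      (vm-eval `(collect-request-handler ',saved))
      (collect-garbage 'major))))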

The incremental GC is killing my perf, whereas if I just let it go and do a single, big collect at the end it would save me a good 15s (according to time).

I’m not sure exactly how you’re measuring that but do you mean you’re using incremental GC? Or just that the minor collections are taking a lot of time?

(time ...) ;=> cpu time: 65295 real time: ... gc time: 15892
Basically, ~20–25% of the runtime is just spent doing incremental collects over a long runtime. I’d happily let memory spike, drop the runtime by 15s, then do one big collect that only takes ~1s at the end.

Why do you think the collection at the end would take only 1 sec?

You might also be interested in measuring how much you’re allocating; my gc-stats package is useful for that.

Just from past experience: when I see the memory usage in DrRacket go up to 3 GB or so, I call (collect-garbage 'major) and it takes very little time to reduce usage down to 500 MB or so.

My guess is that your application is allocating a lot over those 65 seconds, and just doing no collection would use a lot of memory. But you can try it out.

A fairly typical allocation rate in Racket code would be 500 MB/sec, so if you just ran for the 50 seconds of mutator time you’d hit 25 GB resident before the end.

ok, good to know

You can try it out easily with PLTDISABLEGC=1

Ah, ok… so testing (with a smaller data set):
; gc enabled
(time ...)
;=> cpu time: 12765 real time: 12716 gc time: 2156
;=> peak memory: ~1.1 GB
; gc disabled
(time ...)
;=> cpu time: 9453 real time: 9596 gc time: 0
;=> peak memory: ~5.0 GB

5x is a bit more than I was expecting, certainly makes the “no gc” option untenable.

Is this an application where you’re trying to optimize latency or throughput?

no. just reading data off disk and parsing it during load

once in memory I’m fine

that sounds like throughput is the goal (ie finish the whole thing as fast as possible)

so, i guess, technically yes to “throughput”

Was hoping to not have to drop down into C

I think my next advice would be “allocate less”

rather than switch to C

You could potentially also use vm-eval with collect-trip-bytes to increase the size of the young generation
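Something along these lines, I think (Racket CS only; the 32 MB figure is purely illustrative):

#lang racket
(require ffi/unsafe/vm)

;; Ask the host Chez Scheme to allow more allocation between minor
;; collections, so they trigger less often.
(vm-eval '(collect-trip-bytes (* 32 1024 1024)))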

My one addition is that rackunit makes it easy to define test-suites, aggregating cases, etc., but doing that in (module+ test …) means you have to explicitly call (run-tests …) from rackunit/text-ui.
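i.e. something along these lines (a sketch):

#lang racket

(module+ test
  (require rackunit rackunit/text-ui)
  (define arithmetic-tests
    (test-suite "arithmetic"
      (test-case "addition" (check-equal? (+ 1 1) 2))
      (test-case "negation" (check-equal? (- 1) -1))))
  ;; Without this explicit call, the suite is just a value and nothing runs.
  (run-tests arithmetic-tests))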

I am confused about the documentation of identifier-binding. It lists what happens in the cases where the identifier has a local binding, a module binding, or a top-level binding. But an identifier can also be unbound - in which case #f is returned.

“The result is #f if id-stx has a top-level binding and top-level-symbol? is #f or if id-stx is unbound.”

Thanks - missed that somehow. I think I skimmed the bullet points and stopped at “if _id-stx_ has a top-level binding”. I noticed there were three bullet points, and then read this paragraph with four cases:
A _top-level binding_ is a binding from a definition at the top level; a _module binding_ is a binding from a definition in a module; all other bindings are _local bindings_. Within a module, references to top-level bindings are disallowed. An identifier without a binding is _unbound_.

Is (equal? (identifier-binding #'foo #f) #f) the best way to check whether an identifier is unbound?

Wait - does it say what happens if the identifier is unbound and top-level-symbol? is #t ?

That’s in the third bullet

• The result is (list _top-sym_) if _id-stx_ has a top-level binding and _top-level-symbol?_ is true. The _top-sym_ can be different from the name returned by syntax->datum when the binding definition was generated by a macro invocation.

It only mentions “if it has a top-level binding”.

Oh hm

Sounds like that needs to be clarified

I can guess it simply returns #f.

Right but the docs should say that

What I really was looking for was an “is-identifier-unbound?” predicate. Is there a better choice than identifier-binding?

No, identifier-binding is the right choice for that

For finite sequences, lexicographic ordering of some kind seems like the least surprising option and should probably be the default. For infinite sequences, lexicographic ordering is disappointing-but-not-surprising.
I think the notion of “not caring” about the ordering is a little enigmatic and can be broken up into at least three separate non-default options:
• I intend my code not to depend on what the order is, but just in case my code is wrong, I would prefer the testing of this code to use an ordering that’s likely to surprise me.
• I prefer for the traversal to avoid wasting computational resources, even if that makes the order more surprising.
• I need the order to eventually visit everything, which the default order doesn’t necessarily do.
None of those axes is quite “I don’t care about the ordering,” because if someone really doesn’t care about the ordering at all, they’ll just use the default.
However, the most obvious naming conventions for the first two would be “hey fuzzer, who has two thumbs and doesn’t think they care about the ordering,” and “hey optimizer, by all means, don’t care about the ordering on my account,” so after dropping a lot of context and inflection to make these reasonably sized keywords, they’d sound just like “I don’t care about the ordering.” That makes them a little hard to tell apart, and maybe they should be a little hard to tell apart, because the caller is really declaring the same information and intending two different systems to see it.
I think the third one, the “eventually visit everything” option, is easier to keep distinct. The keyword could be #:thorough or #:search-strategy or #:fair or #:eventuality or something (adding a question mark if appropriate).
I also think #:ordering is quite fair as a way to bundle all these concerns together, especially if there are three or more specific orderings to choose between. There could be a “fast ordering” or a “surprising ordering” that’s just an option like any other.

I think only the second of those three options matters

The first option doesn’t make sense as a single isolated feature of a random function on sequences, and is better explored in a holistic manner as part of some property testing framework that possibly integrates with the contract library or something like that. It’s a complex problem domain with complex solutions, and I don’t think there’s much point to designing a special-case API for it that’s only for in-cartesian-product.

The third option doesn’t make sense to me. If you have infinite sequences, no ordering visits everything, by definition. If what you mean is “For an arbitrary element of any input sequence, I need the order to eventually visit that element” then I don’t think there’s any reasonable use cases where 1) you need that and 2) the maximally lazy order doesn’t work for you.

just implementing two orderings is complex enough already, trying to implement more - especially rarely used and as a result, likely buggier - doesn’t seem worthwhile to me

honestly an identifier-bound? predicate would be worth having in the stdlib just so people don’t have to go crawling through the docs for identifier-binding to work out what the return value is for unbound identifiers

Since identifier-binding produces a true value for bound identifiers and a false value for unbound ones, it suggests that we should just clarify the docs
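A minimal sketch of such a predicate, assuming #f really is what unbound identifiers produce once top-level-symbol? is #t:

#lang racket

;; #t when id-stx has any binding at the current phase; passing #t for
;; top-level-symbol? keeps top-level bindings from also reporting #f.
(define (identifier-bound? id-stx)
  (and (identifier-binding id-stx (syntax-local-phase-level) #t) #t))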

“it eventually visits everything” and “for each thing, that thing is eventually visited” sound the same to me, so I think dismissing one as nonsense doesn’t matter to me as long as the other works for you XD

heh, very fair :p

there’s a lot of value in making a predicate look like a predicate

If I intended to visit every pair of naturals, I think I would find a “maximally lazy” ordering to be the least surprising way, but I think I’m willing to be a little surprised there if there’s an easier or more standard approach. It sounds like cons/e implements two standard approaches.

cons/e implements the maximally lazy approach (the “elegant pairing function” version) and the traditional Cantor pairing function approach. It doesn’t implement the lexicographic approach at all. I think the two standard approaches should be the lexicographic one and the maximally lazy one.
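A quick way to poke at the order cons/e actually uses, via data/enumerate (from-nat just maps an index to the i-th element of an enumeration):

#lang racket
(require data/enumerate data/enumerate/lib)

;; Inspect the first few pairs of naturals in cons/e's default order
;; (the "boxy" elegant-pairing order, if I'm reading it right).
(define pair/e (cons/e natural/e natural/e))
(for/list ([i (in-range 8)])
  (from-nat pair/e i))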

it sounds like you think I was saying it was a different option… oh wait, that’s because you’re treating the maximally lazy ordering as the implementation of the “avoid wasting computational resources” option, and you’re considering “visit all the things” to be a use case that can invoke that other option to achieve its ends

is that right?

right, yes, in the three options you gave, I think options 2 and 3 would both be best addressed by the maximally lazy iteration order

I think option 1 doesn’t need to be supported, and I think the default behavior (maybe call that option zero?) should be lexicographic ordering

I think I may have buried the point of laying out those three options there. I was looking for what to name the option that says “I want to use the maximally lazy ordering,” because I think it doesn’t have to do with not caring about orderings. I think it has to do with wanting all the things to be visited.

ah, so #:ordering 'lexicographic and #:ordering 'lazy or something?

(Incidentally, I would expect the maximally lazy ordering to be rather inefficient compared to lexicographic ordering. The fast traversal would probably do lexicographic traversal as much as it can, with some rearranging of the sequences so that the ones backed by small vectors are being traversed in the inner loops for cache locality.)

“fast” depends on how the sequences are implemented. if we’re talking abstractly, the only yardstick to measure by is how much of the input sequences are computed.

yeah, for finite sequences, there’s a full traversal to measure, but for infinite stuff it gets to be more like “how fast do you want to go?” “do you want to go faster early on even if it’s slower later?” and “how smooth should it be?” and maybe even some stuff involving memory footprint XD

> ah, so #:ordering 'lexicographic and #:ordering 'lazy or something?
Personally, considering everything discussed here, I’d like that. Maybe 'nested or 'greedy would be less intimidating than 'lexicographic.
Depth-first vs breadth-first is an intuition I have liked to apply to this distinction myself in the past, but I doubt that’s a standard way of referring to it… and the more technical a term sounds, the more standard it should probably be, or it’ll just make the language feel arbitrarily obscure. :)
I’m not familiar with terminology that’s actually standard for this, so I’ll stop short of having an opinion about that. I’m glad @sorawee has been able to speak about existing work here.

(I feel like Z-order curves are related work here too. Perhaps opportunistic use of that kind of ordering would provide good cache locality when multiple sequences involved are large vectors.)

honestly maybe just #:lexicographic? boolean? would be best

99% of the time you want it so it can default to true, so it being a wordy technical term doesn’t really matter much

and I don’t think there’s much value in naming the non-lexicographic case

Hmm, I don’t mind this design. :) It’s kind of funny. It doesn’t say much about what it’s using instead, but what it does do is rule out a few least-surprise options so that the principle of least surprise can now suggest the next-least-surprise behavior.
(Also, it gives people something distinctive to search for if they haven’t formed this level of familiarity with the topic yet.)

Is it possible to implement a variant of box-cas! that is without spurious failure?
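For context, a sketch of the usual retry-loop workaround, which only helps when the surrounding code is a compare-and-swap loop anyway (the eq? re-check is just a rough way to guess that a failure was spurious, and it ignores ABA effects):

#lang racket

;; Retry box-cas! while the box still holds `old`, so a spurious failure
;; just triggers another attempt; return #f once the value has really changed.
(define (box-cas/retry! bx old new)
  (let loop ()
    (cond
      [(box-cas! bx old new) #t]
      [(eq? (unbox bx) old) (loop)]
      [else #f])))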