
for did the job very nicely, thanks again

Two somewhat related questions:
• How can I get the inode number of a file on POSIX?
• How can I create a hardlink? make-file-or-directory-link ( https://docs.racket-lang.org/reference/Filesystem.html#%28def._%28%28quote._~23~25kernel%29._make-file-or-directory-link%29%29 ) seems to create only softlinks.

I don’t know a portable way to do it (read: using the Racket functions), but if you’re limiting yourself to *nix, I imagine you could just use system to do both of those?
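Something like this non-portable sketch, maybe (the helper names are made up; it just shells out to ln and ls -i via racket/system):

#lang racket
(require racket/system)

;; Create a hard link by shelling out to `ln` (POSIX only).
(define (make-hard-link! existing new)
  (system* (find-executable-path "ln") existing new))

;; Read a file's inode number from `ls -i`, which prints "INODE NAME".
(define (inode-of path)
  (define out
    (with-output-to-string
      (lambda () (system* (find-executable-path "ls") "-i" path))))
  (string->number (first (string-split out))))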

Calling a command line tool to make the hardlinks may actually be feasible, because (probably) I need to create only a few hardlinks.
For the inodes though, I need to determine the inodes for all files in a directory tree, and I suspect this will be very slow. In theory, I could speed this up by piping a list of paths into xargs and parsing its output, but that would be quite cumbersome, especially since I want to call a function for each file via fold-files ( https://docs.racket-lang.org/reference/Filesystem.html#%28def._%28%28lib._racket%2Ffile..rkt%29._fold-files%29%29 ).

So I’m now torn between experimenting with the Racket/C API or using a different programming language. :smile:

fold-files looked so nice. sigh

I was a bit surprised to not find the equivalent of a stat call.

Do you know Rash? Not sure if it could be useful to you: https://docs.racket-lang.org/rash/

is file-or-directory-identity what you want for inodes?
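For what it's worth, a rough sketch of combining it with fold-files to group every file in a tree by identity, assuming the identity really is shared by hard-linked files:

#lang racket

;; Map each file-or-directory-identity to the list of paths that have it;
;; hard links should end up in the same bucket.
(define (files-by-identity root)
  (fold-files
   (lambda (path type acc)
     (if (eq? type 'file)
         (hash-update acc
                      (file-or-directory-identity path)
                      (lambda (paths) (cons path paths))
                      '())
         acc))
   (hash)
   root))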

I think adding an option to make-file-or-directory-link for hard links would be a good idea

Can someone point me to docs/tutorial for building tests for a package (something the pkgd will run automatically as well when published)?

I don’t know of any tutorials. But here are some things to read:
- https://pkg-build.racket-lang.org/about.html
- https://docs.racket-lang.org/raco/test.html
- https://github.com/jeapostrophe/automata/tree/master/automata-test
- https://pkgs.racket-lang.org/package/automata-test
Basically the package server runs raco test --drdr after a package has been successfully installed.
And the description of --drdr is:
> --drdr: Configures defaults to imitate the DrDr continuous testing system: ignore non-modules, run tests in separate processes (unless --thread or --direct is specified), use as many jobs as available processors (unless --jobs is specified), set the default timeout to 90 seconds (unless --timeout is specified), create a fresh PLTUSERHOME and TMPDIR for each test, count stderr output as a test failure, quiet program output, provide empty program input, and print a table of results.
The second link describes how to configure tests.
The third link is an example of how jeapostrophe has tests for the automata library. (This was picked more or less at random.)
On the fourth link, click the “passing tests” icon to see the test results.

The rich FFI is one of the joys of Racket. You should take the plunge!

thanks

There is a scary amount of options for raco test…

@massung Unless you’re doing something exotic, I think the answer is you don’t have to do anything. By default it’s going to run all your .rkt files. I think that’s the super short story for most packages?

Most of the options have to do with not running all your .rkt files, or not running all of each of them.

If a .rkt file has a test submodule, then it will run only that (not the other submodules). That can be handy if, say, a file also has a main submodule that does things like connect to a server or whatever, that you don’t want to or can’t have happen in a test.
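For example, a made-up file sketching that situation:

#lang racket

(define (parse-line s) (string->number s))

(module+ main
  ;; Runs when you execute the file directly; this is the part that might
  ;; hit a server or read real data, which you don't want during testing.
  (displayln (parse-line (vector-ref (current-command-line-arguments) 0))))

(module+ test
  ;; raco test runs only this submodule when it exists.
  (require rackunit)
  (check-equal? (parse-line "42") 42))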

Sure, but that just validates that they don’t have any syntax errors. But - for example - what if I want to ensure that CSV parsing code is working and I have a simple csv file in a test/ path or something?

> Is file-or-directory-identity what you want for inodes?
The description of file-or-directory-identity is rather vague. I thought the description might mean that two files with the same “identity” would have the same “real path” (see https://man7.org/linux/man-pages/man3/realpath.3.html ), which would give different “identities” for hardlinks.
I now tried file-or-directory-identity. It gives me 150018644241532918675078402 for both files that are hardlinked to each other. The inode listed by ls -i is 8132527. It makes sense that the identity is not just the inode, because the same inode number may occur in different mounted file systems. I can imagine that the bits of the inode are somehow included in the result of file-or-directory-identity.
So it seems I can use file-or-directory-identity, but from the vague description I would never know for sure that it will always give the same id for hardlinked files (see realpath for another definition of “identity”; they don’t use the word “identity”, but you get the point :slightly_smiling_face: ). The fact that hardlinked files currently get the same result might be an implementation detail.
If file-or-directory-identity is actually supposed to have “file system plus inode” semantics, this should be made clearer in the documentation. But I don’t know what the intended semantics are.

There’s also something like a compile-test-omit-files setting in info.rkt, to exclude whole files or subdirectories.
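If I remember right, the key documented for raco test is test-omit-paths; a sketch with made-up paths:

;; info.rkt
#lang info
(define test-omit-paths '("scratch.rkt" "experiments"))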

> If a .rkt file has a test submodule
Do you mean, if I did something like (module+ test (run-test-function-here))?

The “old way” was you would have some test/ sub dir and all the things in it would be just more .rkt files that get run. Or even, say, a single tests.rkt file. And people still do that.

The “new way” is you can use test submodules.

So I just mean that by default the build server will run all your files. If some of them have tests, then it runs your tests.

The options are mostly about when you don’t want to run some files or parts of them.

Do you know of a simple package that does the test submodules I can just take a look at? If you don’t know one off-hand then I’ll look.

The problem is that Windows doesn’t have inodes, so the description is intended to be cross-platform. Maybe it should say that it includes the filesystem and inode information on some platforms, and give more information about what it does on Windows as well.

I usually just use rackunit ( https://docs.racket-lang.org/rackunit/ ). Here’s an example: https://git.sr.ht/~sschwarzer/sudoku-solver/tree/main/item/games/sudoku-solver/solver.rkt#L365 . As greg said, the tests are executed when you run the file with raco test.

Here’s one of mine: https://github.com/greghendershott/sha/blob/master/sha/main.rkt

> The problem is that Windows doesn’t have inodes
I think that’s only true for FAT32 (which, admittedly, is likely the majority of Windows machines out there). NTFS has i-nodes, IIRC.

Thank you both

I could have put those tests in a separate test.rkt file that did (require "main.rkt"). But for lightweight things, (module+ test ___) is handy, and keeps the tests near the code that they test.
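The separate-file style would look roughly like this (frobnicate is hypothetical, standing in for whatever main.rkt actually provides):

;; test.rkt
#lang racket
(require rackunit "main.rkt")

;; frobnicate is a made-up example export of main.rkt
(check-equal? (frobnicate 1) 2)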

Either way, the build server will run them.

Well, as I read the docs, it has numbers that are similar but are not called inodes.

Could be interesting to check what other languages/platforms use in the stat result on Windows.
Here’s a description from the Python documentation ( https://docs.python.org/3/library/os.html#os.stat_result ):
> st_ino
> Platform dependent, but if non-zero, uniquely identifies the file for a given value of st_dev. Typically:
> the inode number on Unix,
> the file index on Windows
“file index” is a link to https://msdn.microsoft.com/en-us/library/aa363788 .

Another thing about module+ is that it will splice multiple instances together into the same single submodule. So you can alternate (define thing __) and (module+ test test-thing) forms in the file, and the code and tests are adjacent.
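For instance, all the (module+ test ...) forms below get spliced into a single test submodule:

#lang racket

(module+ test (require rackunit))

(define (double x) (* 2 x))
(module+ test (check-equal? (double 3) 6))

(define (halve x) (/ x 2))
(module+ test (check-equal? (halve 10) 5))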

Whether you find mixing like that handy or distracting is a matter of opinion, of course. The good news is you or your team can pick whatever you want.

Another advantage of having a test submodule inside the file/module under test is that you can easily test code that’s private to the module (i.e. not provided).

Yes, that’s better than using something evil like require/expose ( https://docs.racket-lang.org/rackunit/api.html#%28form._%28%28lib._rackunit%2Fmain..rkt%29._require%2Fexpose%29%29 ) :smile:

> The rich FFI is one of the joys of Racket. You should take the plunge!
Here comes the next Racket project. :wink: But thanks for the encouragement. :slightly_smiling_face:

I had seen require/expose at some point, but never tried it due to the warnings. :smile:

yeah, sadly there’s no direct binding to fstat; it would be interesting to have a more low-level POSIX API in Racket too

Nim ( https://nim-lang.org ) uses a FileInfo type ( https://nim-lang.org/docs/os.html#FileInfo ) which has a tuple of device id and file id. I’m quite sure on POSIX the file id is the inode number.

Random comment: In Herbie we had a major inefficiency where we enumerated over the cartesian product of several lists. But sometimes one of the last lists was null, which meant loads and loads of time spent just to produce the empty list.
cartesian-product could be improved by handling that case specially.
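e.g. a wrapper that short-circuits; just a sketch of the special case, not the actual fix inside racket/list:

#lang racket
(require racket/list)

;; If any input list is empty the product is empty, so return immediately
;; instead of grinding through the other lists.
(define (cartesian-product/fast-empty . lists)
  (if (ormap null? lists)
      '()
      (apply cartesian-product lists)))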

Is there any form for disabling GC while a body executes?

At the Racket level, there isn’t. On Racket CS, you could use vm-eval to adjust the collect-request-handler parameter, although I don’t know if there are potential bad side effects of doing that.
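A totally untested sketch of that idea (Racket CS only; whether clobbering Racket’s own handler like this is safe is exactly the open question):

#lang racket
(require ffi/unsafe/vm)

;; Save the handler Racket CS installed, replace it with a no-op so GC
;; requests are ignored, run the work, then restore it and collect once.
(define (call-ignoring-gc thunk)
  (define saved (vm-eval '(collect-request-handler)))
  (vm-eval '(collect-request-handler (lambda () (void))))
  (dynamic-wind
    void
    thunk
    (lambda ()
      (vm-eval `(collect-request-handler ',saved))
      (collect-garbage 'major))))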

The incremental GC is killing my perf, whereas if I just let it go and do a single, big collect at the end it would save me a good 15s (according to time).

I’m not sure exactly how you’re measuring that but do you mean you’re using incremental GC? Or just that the minor collections are taking a lot of time?

(time ...) ;=> cpu time: 65295 real time: ... gc time: 15892
Basically, ~20–25% of the runtime is just spent doing incremental collects over a long runtime. I’d happily let memory spike, drop the runtime by 15s, then do one big collect that only takes ~1s at the end.

Why do you think the collection at the end would take only 1 sec?

You might also be interested in measuring how much you’re allocating; my gc-stats package is useful for that.

Just from past experience: when I see the memory usage in DrRacket go up to 3 GB or so, I call (collect-garbage 'major) and it takes very little time to reduce usage down to 500 MB or so.

My guess is that your application is allocating a lot over those 65 seconds, and just doing no collection would use a lot of memory. But you can try it out.

A fairly typical allocation rate in Racket code would be 500 MB/sec, so if you just ran for the 50 seconds of mutator time you’d hit 25 GB resident before the end.

ok, good to know

You can try it out easily with PLTDISABLEGC=1

Ah, ok… so testing (with a smaller data set):
; gc enabled
(time ...)
;=> cpu time: 12765 real time: 12716 gc time: 2156
;=> peak memory: ~1.1 GB
; gc disabled
(time ...)
;=> cpu time: 9453 real time: 9596 gc time: 0
;=> peak memory: ~5.0 GB

5x is a bit more than I was expecting, certainly makes the “no gc” option untenable.

Is this an application where you’re trying to optimize latency or throughput?

no. just reading data off disk and parsing it during load

once in memory I’m fine

that sounds like throughput is the goal (ie finish the whole thing as fast as possible)

so, i guess, technically yes to “throughput”

Was hoping to not have to drop down into C

I think my next advice would be “allocate less”

rather than switch to C

You could potentially also use vm-eval with collect-trip-bytes to increase the size of the young generation
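Something along these lines, I think (Racket CS only; the 32 MB figure is purely illustrative):

#lang racket
(require ffi/unsafe/vm)

;; Ask the host Chez Scheme to allow more allocation between minor
;; collections, so they trigger less often.
(vm-eval '(collect-trip-bytes (* 32 1024 1024)))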

My one addition is that rackunit makes it easy to define test-suites, aggregating cases, etc., but doing that in (module+ test …) means you have to explicitly call (run-tests …) from rackunit/text-ui.
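i.e. something along these lines (a sketch):

#lang racket

(module+ test
  (require rackunit rackunit/text-ui)
  (define arithmetic-tests
    (test-suite "arithmetic"
      (test-case "addition" (check-equal? (+ 1 1) 2))
      (test-case "negation" (check-equal? (- 1) -1))))
  ;; Without this explicit call, the suite is just a value and nothing runs.
  (run-tests arithmetic-tests))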

I am confused about the documentation of identifier-binding. It lists what happens in the cases where the identifier has a local binding, a module binding, or a top-level binding. But an identifier can also be unbound - in which case #f is returned.

“The result is #f if id-stx has a top-level binding and top-level-symbol? is #f or if id-stx is unbound.”

Thanks - missed that somehow. I think I skimmed the bullet points and stopped at “if _id-stx_ has a top-level binding”. I noticed there were three bullet points, and then read this paragraph with four cases:
A _top-level binding_ is a binding from a definition at the top level; a _module binding_ is a binding from a definition in a module; all other bindings are _local bindings_. Within a module, references to top-level bindings are disallowed. An identifier without a binding is _unbound_.

Is (equal? (identifier-binding #'foo #f) #f) the best way to check whether an identifier is unbound?

Wait - does it say what happens if the identifier is unbound and top-level-symbol? is #t ?

That’s in the third bullet

• The result is (list _top-sym_) if _id-stx_ has a top-level binding and _top-level-symbol?_ is true. The _top-sym_ can be different from the name returned by syntax->datum when the binding definition was generated by a macro invocation.

It only mentions “if it has a top-level binding”.

Oh hm

Sounds like that needs to be clarified

I can guess it simply returns #f.

Right but the docs should say that

What I really was looking for was an “is-identifier-unbound?” predicate. Is there a better choice than identifier-binding?

No, identifier-binding is the right choice for that

For finite sequences, lexicographic ordering of some kind seems like the least surprising option and should probably be the default. For infinite sequences, lexicographic ordering is disappointing-but-not-surprising.
I think the notion of “not caring” about the ordering is a little enigmatic and can be broken up into at least three separate non-default options:
• I intend my code not to depend on what the order is, but just in case my code is wrong, I would prefer the testing of this code to use an ordering that’s likely to surprise me.
• I prefer for the traversal to avoid wasting computational resources, even if that makes the order more surprising.
• I need the order to eventually visit everything, which the default order doesn’t necessarily do.
None of those axes is quite “I don’t care about the ordering,” because if someone really doesn’t care about the ordering at all, they’ll just use the default.
However, the most obvious naming conventions for the first two would be “hey fuzzer, who has two thumbs and doesn’t think they care about the ordering,” and “hey optimizer, by all means, don’t care about the ordering on my account,” so after dropping a lot of context and inflection to make these reasonably sized keywords, they’d sound just like “I don’t care about the ordering.” That makes them a little hard to tell apart, and maybe they should be a little hard to tell apart, because the caller is really declaring the same information and intending two different systems to see it.
I think the third one, the “eventually visit everything” option, is easier to keep distinct. The keyword could be #:thorough or #:search-strategy or #:fair or #:eventuality or something (adding a question mark if appropriate).
I also think #:ordering is quite fair as a way to bundle all these concerns together, especially if there are three or more specific orderings to choose between. There could be a “fast ordering” or a “surprising ordering” that’s just an option like any other.

I think only the second of those three options matters

The first option doesn’t make sense as a single isolated feature of a random function on sequences, and is better explored in a holistic manner as part of some property testing framework that possibly integrates with the contract library or something like that. It’s a complex problem domain with complex solutions, and I don’t think there’s much point to designing a special-case API for it that’s only for in-cartesian-product.

The third option doesn’t make sense to me. If you have infinite sequences, no ordering visits everything, by definition. If what you mean is “For an arbitrary element of any input sequence, I need the order to eventually visit that element” then I don’t think there’s any reasonable use cases where 1) you need that and 2) the maximally lazy order doesn’t work for you.

just implementing two orderings is complex enough already, trying to implement more - especially rarely used and as a result, likely buggier - doesn’t seem worthwhile to me

honestly an identifier-bound? predicate would be worth having in the stdlib just so people don’t have to go crawling through the docs for identifier-binding to work out what the return value is for unbound identifiers

Since identifier-binding produces a true value for bound identifiers and a false value for unbound ones, it suggests that we should just clarify the docs
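A minimal sketch of such a predicate, assuming #f really is what unbound identifiers produce once top-level-symbol? is #t:

#lang racket

;; #t when id-stx has any binding at the current phase; passing #t for
;; top-level-symbol? keeps top-level bindings from also reporting #f.
(define (identifier-bound? id-stx)
  (and (identifier-binding id-stx (syntax-local-phase-level) #t) #t))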

“it eventually visits everything” and “for each thing, that thing is eventually visited” sound the same to me, so I think dismissing one as nonsense doesn’t matter to me as long as the other works for you XD

heh, very fair :p

there’s a lot of value in making a predicate look like a predicate

If I intended to visit every pair of naturals, I think I would find a “maximally lazy” ordering to be the least surprising way, but I think I’m willing to be a little surprised there if there’s an easier or more standard approach. It sounds like cons/e implements two standard approaches.

cons/e implements the maximally lazy approach (the “elegant pairing function” version) and the traditional Cantor pairing function approach. It doesn’t implement the lexicographic approach at all. I think the two standard approaches should be the lexicographic one and the maximally lazy one.
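A quick way to poke at the order cons/e actually uses, via data/enumerate (from-nat just maps an index to the i-th element of an enumeration):

#lang racket
(require data/enumerate data/enumerate/lib)

;; Inspect the first few pairs of naturals in cons/e's default order
;; (the "boxy" elegant-pairing order, if I'm reading it right).
(define pair/e (cons/e natural/e natural/e))
(for/list ([i (in-range 8)])
  (from-nat pair/e i))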

it sounds like you think I was saying it was a different option… oh wait, that’s because you’re treating the maximally lazy ordering as the implementation of the “avoid wasting computational resources” option, and you’re considering “visit all the things” to be a use case that can invoke that other option to achieve its ends

is that right?

right, yes, in the three options you gave, I think options 2 and 3 would both be best addressed by the maximally lazy iteration order

I think option 1 doesn’t need to be supported, and I think the default behavior (maybe call that option zero?) should be lexicographic ordering

I think I may have buried the point of laying out those three options there. I was looking for what to name the option that says “I want to use the maximally lazy ordering,” because I think it doesn’t have to do with not caring about orderings. I think it has to do with wanting all the things to be visited.

ah, so #:ordering 'lexicographic and #:ordering 'lazy or something?

(Incidentally, I would expect the maximally lazy ordering to be rather inefficient compared to lexicographic ordering. The fast traversal would probably do lexicographic traversal as much as it can, with some rearranging of the sequences so that the ones backed by small vectors are being traversed in the inner loops for cache locality.)

“fast” depends on how the sequences are implemented. if we’re talking abstractly, the only yardstick to measure by is how much of the input sequences are computed.

yeah, for finite sequences, there’s a full traversal to measure, but for infinite stuff it gets to be more like “how fast do you want to go?” “do you want to go faster early on even if it’s slower later?” and “how smooth should it be?” and maybe even some stuff involving memory footprint XD

> ah, so #:ordering 'lexicographic and #:ordering 'lazy or something?
Personally, considering everything discussed here, I’d like that. Maybe 'nested or 'greedy would be less intimidating than 'lexicographic.
Depth-first vs breadth-first is an intuition I have liked to apply to this distinction myself in the past, but I doubt that’s a standard way of referring to it… and the more technical a term sounds, the more standard it should probably be, or it’ll just make the language feel arbitrarily obscure. :)
I’m not familiar with terminology that’s actually standard for this, so I’ll stop short of having an opinion about that. I’m glad @sorawee has been able to speak about existing work here.

(I feel like Z-order curves are related work here too. Perhaps opportunistic use of that kind of ordering would provide good cache locality when multiple sequences involved are large vectors.)

honestly maybe just #:lexicographic? boolean? would be best

99% of the time you want it so it can default to true, so it being a wordy technical term doesn’t really matter much

and I don’t think there’s much value in naming the non-lexicographic case

Hmm, I don’t mind this design. :) It’s kind of funny. It doesn’t say much about what it’s using instead, but what it does do is rule out a few least-surprise options so that the principle of least surprise can now suggest the next-least-surprise behavior.
(Also, it gives people something distinctive to search for if they haven’t formed this level of familiarity with the topic yet.)

Is it possible to implement a variant of box-cas! that is without spurious failure?
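For context, a sketch of the usual retry-loop workaround, which only helps when the surrounding code is a compare-and-swap loop anyway (the eq? re-check is just a rough way to guess that a failure was spurious, and it ignores ABA effects):

#lang racket

;; Retry box-cas! while the box still holds `old`, so a spurious failure
;; just triggers another attempt; return #f once the value has really changed.
(define (box-cas/retry! bx old new)
  (let loop ()
    (cond
      [(box-cas! bx old new) #t]
      [(eq? (unbox bx) old) (loop)]
      [else #f])))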