spdegabrielle
2022-1-3 11:47:10

What does NumPy have that is missing from the Flomat and math libraries?


spdegabrielle
2022-1-3 11:47:55

By Jax do you mean MathJax?



spdegabrielle
2022-1-3 12:26:08

I’m not sure what ‘automatically differentiate native Python and NumPy code’ means?


spdegabrielle
2022-1-3 12:44:33

> JAX uses XLA to compile and run your NumPy code on accelerators, like GPUs and TPUs. https://www.tensorflow.org/xla#xla_frontends

and > standalone tfcompile tool, which converts TensorFlow graph into executable code (for x86–64 CPU only).


spdegabrielle
2022-1-3 12:49:54

I’m not familiar with automatic differentiation, but I know what it is. What I’m really unclear on is what automatic differentiation of code is:

> It can differentiate through a large subset of Python’s features, including loops, ifs, recursion, and closures,




seanbunderwood
2022-1-3 13:08:11

Flomat only does floating point matrices. Numpy supports a variety of data types as the elements, including arbitrary objects.

That’s allowed it to be not just a linear algebra library, but the de facto standard for working with and sharing multidimensional hunks of data. Which then allowed the Python data ecosystem to blossom. Packages can easily interoperate with no or minimal need for glue code.


spdegabrielle
2022-1-3 13:11:16

Oh, it is clear from the first 5 min of the video that I do not know what AD is. I was thinking of symbolic diff.
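
For anyone else who was thinking of symbolic diff: a minimal sketch of what differentiating "through" code can look like, using forward-mode AD with dual numbers in plain Python. The `Dual` class and the `derivative`, `power` and `piecewise` helpers are hypothetical names made up for illustration; this is not JAX's API or how JAX is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value together with its derivative with respect to the input."""
    val: float
    dot: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__


def _lift(x):
    # Plain numbers are constants: their derivative is 0.
    return x if isinstance(x, Dual) else Dual(float(x))


def derivative(f, x):
    """Run f on a dual number seeded with derivative 1 and read the result."""
    return f(Dual(float(x), 1.0)).dot


def power(x, n):
    # An ordinary loop; no symbolic expression is ever built.
    result = 1.0
    for _ in range(n):
        result = result * x
    return result


def piecewise(x):
    # An ordinary branch on the value being computed.
    if x.val > 0:
        return x * x
    return 3 * x


print(derivative(lambda x: power(x, 3), 2.0))
print(derivative(piecewise, -1.0))
```

The program runs as usual; each arithmetic step just carries a derivative along with its value, which is why loops and ifs need no special treatment.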


seanbunderwood
2022-1-3 13:11:20

For example, I mostly do NLP, so my numpy arrays tend to be full of strings and arrays or sets of strings.


spdegabrielle
2022-1-3 14:03:28

it is a #lang !


spdegabrielle
2022-1-3 14:04:05

thank you


seanbunderwood
2022-1-3 14:08:21

It’s a pretty new thing. If I recall correctly, the first big paper on it was published less than 10 years ago.


spdegabrielle
2022-1-3 14:09:56

Oh


seanbunderwood
2022-1-3 14:26:47

But, for example, Pandas is built on top of numpy, even for the non-numeric column types.


seanbunderwood
2022-1-3 14:27:21

Versus, I believe Racket’s data-frame is doing its own thing with the data.



seanbunderwood
2022-1-3 14:31:18

Which, I haven’t tried doing data science for real in Racket. Scala’s data science ecosystem, which is what I mostly use at work right now, has a similar thing going on to what I think I see in the equivalent Racket ecosystem, though: the linear algebra library only does linear algebra, and so the data and scientific computing packages end up creating their own private types to add support for all the kinds of data they need to support beyond floating point numbers. And then, with so little overlap in their underlying data models, what one is working on doesn’t need to get too complex before you’re suddenly spending a surprising amount of time writing glue code.


samth
2022-1-3 14:45:10

automatic differentiation is quite old; here’s a paper about it from 1964: https://dl.acm.org/doi/10.1145/355586.364791


spdegabrielle
2022-1-3 14:52:35

Fwiw I’ve asked the author to add a Racket-compatible license and whether they would be OK with it going on the package catalog, via a PR: https://github.com/ots22/rackpropagator


seanbunderwood
2022-1-3 14:54:08

Differentiable programming, though?


spdegabrielle
2022-1-3 14:54:47

Fascinating either way


samth
2022-1-3 14:55:31

To the degree that “differentiable programming” means “automatic differentiation of programs in a conventional programming language” then it is also very old


laurent.orseau
2022-1-3 14:56:51

Rascas has a few algorithmic operations that it can differentiate through, in particular _if.


spdegabrielle
2022-1-3 14:57:31

I thought Rascas was symbolic diff?


laurent.orseau
2022-1-3 14:57:43

That’s symbolic AD, yes, there are pros and cons


laurent.orseau
2022-1-3 14:58:37

Rascas also has a _let* form and can make a lot of simplifications with it


laurent.orseau
2022-1-3 14:59:18

There’s a smart-simplify operator that tries various combinations of contract/expand to compress the expression


laurent.orseau
2022-1-3 15:00:53

Differentiation based on finite differences, (f(x+h)-f(x))/h, can quickly become inaccurate with compound functions
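
A small sketch of that inaccuracy, assuming a made-up compound function sin(exp(x^2)) and a hand-derived exact derivative for comparison. The function names and the step sizes are illustrative choices, not taken from Rascas or any library.

```python
import math


def f(x):
    # A compound function: sin(exp(x^2))
    return math.sin(math.exp(x * x))


def exact_derivative(x):
    # Chain rule by hand: cos(exp(x^2)) * exp(x^2) * 2x
    e = math.exp(x * x)
    return math.cos(e) * e * 2 * x


def forward_difference(func, x, h):
    # The finite-difference quotient (f(x+h) - f(x)) / h
    return (func(x + h) - func(x)) / h


x0 = 1.5
errors = {
    h: abs(forward_difference(f, x0, h) - exact_derivative(x0))
    for h in (1e-1, 1e-4, 1e-8, 1e-13)
}
for h, err in errors.items():
    print(f"h={h:g}  abs error={err:.3e}")
```

A large step suffers from truncation error and a very small one typically suffers from floating-point cancellation, so the step size has to be tuned per function. AD, symbolic or not, avoids choosing h altogether.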


laurent.orseau
2022-1-3 15:12:56

_let* is what allows sequential style AD, but it constructs a symbolic expr nonetheless, which allows for further manipulation



badkins
2022-1-3 15:21:37

So, it looks like NumPy is pretty fundamental in the Python data science ecosystem, and the ndarray seems pretty fundamental w/in NumPy. @seanbunderwood does creating the Racket equivalent of ndarray seem like a good starting point to up Racket’s game in this space?


badkins
2022-1-3 15:24:55

I suppose a 2-dimensional ndarray of floats would just use flomat internally, since that already has the BLAS bindings.


seanbunderwood
2022-1-3 16:06:04

It’s just my personal take, but yeah, my impression is that that’s what really distinguishes the Python ecosystem from most others: numpy isn’t just a linear algebra package; it’s a lingua franca for all sorts of data manipulation and interchange that everyone can share.

That said, my best hunch is that the future in this space is Apache Arrow. I would guess that integrating with it is the best way to become a data contender nowadays. It’s language-agnostic, and is already seeing some traction with other communities. I would hope that means that language communities that congregate on it can share more, duplicate less, and move faster.


badkins
2022-1-3 16:16:28

<http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf|A Microsoft paper> referenced in <https://news.ycombinator.com/item?id=26018827|a Hacker News post> that seems relevant.


badkins
2022-1-3 16:23:05

In particular, “Magpie uses Arrow [2] and ArrowFlight [40] as a common format for data access across each of its backend engines”


badkins
2022-1-3 16:25:29

One of the comments in that HN thread mentions Arrow not supporting n-dimensional arrays. That seems odd.


seanbunderwood
2022-1-3 16:31:38

Aha, that’s annoying.


massung
2022-1-3 16:35:00

I’m not an expert in NNs (by a long shot), but simple matrix x vector multiplication doesn’t require 2D matrices as opposed to just row (or column) vectors, which would be more efficient anyway.


massung
2022-1-3 16:35:54

Plus for the big stuff they likely load everything into GPU registers anyway.


badkins
2022-1-3 16:41:10

I found some value in the “Getting Things Done” book. In particular, trying to identify the specific “next task” in a complex set of tasks. I think this NumPy.ndarray / Apache Arrow “thing” is a likely “next task”.


badkins
2022-1-3 16:44:46

I’m not entirely clear on the benefits of Arrow if this is implemented in Racket/Scheme code vs. C.


badkins
2022-1-3 16:45:22

i.e. I’m not sure I’d want to go through the effort of using the Arrow format, if the efficiency benefits are nullified in practice


spdegabrielle
2022-1-3 16:49:06

Jax is interesting;

> Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more

seems like a task that would be simpler with parenthetical syntax (The jit to gpu/tpu bit seems to be done by a call out to XLA)


badkins
2022-1-3 16:51:11

Ah, now I remember @seanbunderwood :) I know we discussed Arrow before in this context. I’m reading the spec now, and I’m wondering if the physical layout is compatible with Racket. If not, then it seems this would need to be implemented in C/C++ with bindings in Racket possibly.


badkins
2022-1-3 16:56:43

“The representation of a flomat therefore matches the expectation of the routines in BLAS and LAPACK closely.” from flomat - this seems to be the important part. I would think Arrow has this same format to allow using BLAS/LAPACK w/ zero copy.


soegaard2
2022-1-3 17:17:50

The representation is as follows:

; m = rows, n = cols, a = mxn array of doubles
; lda = leading dimension of a (see below)
(struct flomat (m n a lda) ...)

An “array of doubles” is simply a pointer to a piece of memory:

(define _flomat (_cpointer 'flomat))

It is tedious, but trivial to extend flomat to support complex numbers.
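
To make the role of m, n, a and lda concrete, here is a rough Python sketch. It assumes column-major storage, which is the BLAS/LAPACK (Fortran) convention; the `ColMajor` class and its methods are hypothetical and only mimic the idea, so check flomat's source for the real details.

```python
from array import array


class ColMajor:
    """m x n doubles in one flat buffer, column-major, with leading dimension lda."""

    def __init__(self, m, n, data, lda=None):
        self.m, self.n = m, n
        # lda is the distance (in elements) between the starts of two columns.
        self.lda = m if lda is None else lda
        self.a = data  # flat buffer; stands in for the C pointer

    def ref(self, i, j):
        # Element (i, j) lives at offset i + j*lda.
        return self.a[i + j * self.lda]

    def submatrix(self, i0, j0, m, n):
        # A view on a block: same lda, shifted start, nothing copied.
        start = i0 + j0 * self.lda
        return ColMajor(m, n, memoryview(self.a)[start:], self.lda)


# The 3x2 matrix [[1, 4], [2, 5], [3, 6]] stored column by column.
A = ColMajor(3, 2, array("d", [1, 2, 3, 4, 5, 6]))
B = A.submatrix(1, 0, 2, 2)  # rows 1..2, both columns: [[2, 5], [3, 6]]
print(A.ref(2, 1), B.ref(0, 1))
```

Keeping lda separate from m is what lets a submatrix share memory with its parent, which is the same trick BLAS routines rely on.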


badkins
2022-1-3 17:23:00

@soegaard2 any interest in converting flomat to using the <https://arrow.apache.org/docs/format/Columnar.html|Arrow format> ? :) I think I’d be willing to put some serious time into it. I just need to research some more to see if it makes sense.


badkins
2022-1-3 17:29:13

I’ve been reading for a while, and I still haven’t seen any info on Arrow <-> BLAS/LAPACK integration. :(


massung
2022-1-3 17:30:36

It’s probably less about forcing that integration as opposed to making it trivial for anyone who wants to pass Arrow data directly to BLAS


soegaard2
2022-1-3 17:31:09

Skimming the documentation on Arrow, it looks like a big project. If someone (cough) writes the lower levels then replacing the upper level of flomat wouldn’t be too hard. But the idea of flomat was to use as many BLAS/LAPACK routines as possible, since they can be trusted for numerical accuracy.


badkins
2022-1-3 17:33:04

I have a reasonable idea of what it would take to get started on creating a subset of NumPy.ndarray where it would simply use flomat in the specific case where the elements are floating point numbers, and something else, otherwise. But I have much less of a handle on how Arrow impacts this, and I’m less sure of the benefits of Arrow.


soegaard2
2022-1-3 17:33:29

During December I worked on implementing SRFI 179 in (untyped) Racket. https://srfi.schemers.org/srfi-179/srfi-179.html

Maybe we can use that as a starting point for an alternative implementation of general arrays in Racket?


badkins
2022-1-3 17:33:31

Having said that, it seems naive to ignore Arrow, so I’ll need to keep reading …


soegaard2
2022-1-3 17:36:00

One of the ideas of the SRFI is that some arrays are “specialized”, i.e. their data are stored in a contiguous memory area.


badkins
2022-1-3 17:36:39

I only briefly looked at the srfi, but it really does seem like the industry is getting behind Arrow.


badkins
2022-1-3 17:37:06

flomat is probably 90% or more compatible already.


badkins
2022-1-3 17:38:56

I think I’ll continue on my current course, which is to blast through a ton of info on Python & the ecosystem, to become a competent user of the Python ecosystem, and then try and figure out how I might incrementally move Racket forward in this space.


soegaard2
2022-1-3 17:39:39

It is “just” a question of how to package the array of numbers in a larger data structure?


badkins
2022-1-3 17:40:15

I think that’s the bulk of it, but the details may require a fair amount of work.


badkins
2022-1-3 17:41:29

I think my initial goal would be to port an important subset of NumPy.ndarray to racket.


soegaard2
2022-1-3 17:42:13

If you decide to tackle Arrow, feel free to use whatever you can from flomat. I predict I won’t have time to do any implementing though.


soegaard2
2022-1-3 17:43:16

Does Numpy support Arrow?



badkins
2022-1-3 17:45:11

My impression is NumPy.ndarray is its own thing currently, but maybe NumPy is moving toward using Arrow - not really sure.


seanbunderwood
2022-1-3 17:46:07

The advantage is being able to interop with other ecosystems. Yes, you would be wrapping the C/C++ libraries for the most part, rather than writing it in pure Racket. What you theoretically get in return, though, is that you can keep getting more things at low cost by wrapping more native libraries from the same ecosystem. Ballista for distributed computing, for example, or libgdf for CUDA-accelerated computation.


badkins
2022-1-3 17:48:45

<https://wesmckinney.com/blog/apache-arrow-pandas-internals/|This blog post> by the creator of Pandas is probably worth reading.


badkins
2022-1-3 17:50:07

Bjarne Stroustrup talks about “two types of languages: the ones people complain about and the ones nobody uses” :) I hear of people complaining about Pandas a lot, but it seems everyone uses it!


seanbunderwood
2022-1-3 17:50:08

Long story short, ndarray and Arrow’s types both have the same underlying data layout. So to create an ndarray of a pyarrow array, what you do is create a new ndarray that shares pointers with the pyarrow array.
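
A standard-library sketch of the pointer-sharing idea, with `array` and `memoryview` standing in for the Arrow array and the ndarray. These are stand-ins chosen for illustration only; the snippet does not use or describe the NumPy or pyarrow APIs.

```python
from array import array

# One owner of the memory: a contiguous buffer of 64-bit floats.
column = array("d", [1.0, 2.0, 3.0, 4.0])

# Two "array objects" that describe the same memory instead of copying it.
view_a = memoryview(column)
view_b = memoryview(column)[1:3]  # a window onto elements 1 and 2

view_a[1] = 20.0  # a write through one view...

print(view_b[0])   # ...is visible through the other
print(column[1])   # ...and in the owning buffer
```

Because only the description (start, length, element type) is new, the conversion cost does not depend on how much data the buffer holds.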


badkins
2022-1-3 17:51:23

And that layout is simply columnar, right? So, in theory, the same would apply to flomat.


badkins
2022-1-3 17:52:03

probably 2 functions then: Arrow -> flomat, and flomat -> Arrow


badkins
2022-1-3 17:52:44

both would be essentially zero copy i.e. the main elements wouldn’t be relocated


massung
2022-1-3 17:53:07

> people complaining about Pandas a lot, but it seems everyone uses it

I have to constantly remind people at work showing demos or project progress to others that “no comments” is a bad sign. Everyone commenting to you all the things they wish your demo did or features you need to add is what you want. They aren’t knocking the hard work you’ve done. They are able to quickly see how what you’ve done benefits them, their work, and their brains are flooding them with all the things they can now do, wish they could do easier, etc.


badkins
2022-1-3 17:55:37

I’m hoping there are other benefits besides the interop though, because I’m not sure Racket would be that great for interop :( After I made the comment above, I looked through the source for flomat, and it looks like flomat is already using a compatible layout, for the most part. So, I think much of the code could be Racket code. I wouldn’t be too interested in creating a large C++ library for use in Racket. I paid my dues with C++, and would rather not get back into it :)


badkins
2022-1-3 17:57:02

In my humble opinion, the main benefit to this theoretical work is to have a fantastic ndarray-like thing that other pure Racket libs could make use of. I could be wrong though, I should really look into what it would take to get some Python interop - I think I might be ok with that compromise if it offered enough functionality.


badkins
2022-1-3 17:57:57

But, the only reason to avoid Arrow is probably if it would involve a ton more work, and at this point, I don’t think that’s the case. If something is going to be built from scratch, some layout needs to be used anyway, so why not the Arrow layout?


massung
2022-1-3 17:58:19

IME, pandas can be broken down into just a few pieces:

• Reading/writing data FAST (for all the various formats). This is actually the most difficult part to do well.
• Data frame processing. This is the easiest part if you throw out multi-threading/nodes
• Math (eg. scipy). Implementations of common equations and algorithms that are trusted and work on all the various data frame classes: series, groups, etc. This is also fairly simple, but getting it trusted for publishing not so much.


soegaard2
2022-1-3 17:58:35

I am beginning to see where Arrow fits in. It would definitely make it easier to work with libraries written in other languages.


badkins
2022-1-3 18:00:21

Are there standard C (or other) libs that SciPy is using analogous to BLAS/LAPACK? Creating bindings is one thing, but I have no interest, or sufficient skill, to implement a ton of stats functions, for example.


soegaard2
2022-1-3 18:12:08

I think it varies. The linear algebra functions seem to be wrappers of BLAS/LAPACK mostly - but the stats folder seems to be in Python.


badkins
2022-1-3 18:38:04

Wow, somehow I missed the fact that Wes McKinney, the creator of Pandas, is also a co-creator of Apache Arrow! <https://learning.acm.org/techtalks/apache|Apache Arrow and the Future of Data Frames with Wes McKinney>


badkins
2022-1-3 19:02:34

Impressive. At 24:10 in that video Wes compares Pandas’ read_csv(...) to Arrow’s read_csv(...) when reading a 2.5 GB file. Pandas, single threaded, took about 30 seconds. Arrow’s multi-threaded took 1.5 seconds.


badkins
2022-1-3 19:03:21

That’s promising for implementing arrow in Racket - if for no other reason than to be able to easily transfer data between Python and Racket via files.


massung
2022-1-3 19:10:38

I’m usually very wary of benchmark comparisons like that because it’s usually apples/oranges in the comparison. I’m sure Arrow does several smart things that Python can’t do, but here’s likely a list of the “optimizations” done… cough deferred cough:

• Slurped the entire file into memory
• Made no attempt to “parse” data types as opposed to just slicing the file into byte offset/length references
• Likely made no effort to handle N/A values, double quotes, escaped characters, etc.
• Returned as soon as the frame was created and “ready” for use - meaning data isn’t materialized and will be materialized on-demand.

The same work needs to be done, but there’s hope that the work can be handled across multiple threads, in the background while readying other things (e.g. reading other CSV files), and that - when needed - the entire file won’t need to be materialized, but just a subset of it. Those are smart, good things to do. But it also means - again - apples and oranges comparison. It’d be better to compare a fully materialized table: what was the time footprint? I’m sure it’s still faster, but likely 5–10x longer than the 1.5s claimed.
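
A toy sketch of what that kind of deferral could look like: record byte offsets at "read" time and parse fields only when a row is requested. The `LazyCSV` class is hypothetical and only illustrates the guess above; it makes no claim about what Arrow's CSV reader actually does.

```python
class LazyCSV:
    """Index row boundaries up front; parse fields only when a row is requested."""

    def __init__(self, raw: bytes):
        self.raw = raw    # whole file slurped into memory
        self.rows = []    # (offset, length) per line, nothing parsed yet
        pos = 0
        for line in raw.splitlines(keepends=True):
            self.rows.append((pos, len(line.rstrip(b"\r\n"))))
            pos += len(line)
        self.parsed = 0   # how many rows have been materialized so far

    def __len__(self):
        return len(self.rows)

    def row(self, i):
        off, n = self.rows[i]
        self.parsed += 1
        # Naive split: no quoting, escapes, or N/A handling.
        text = self.raw[off:off + n].decode("ascii")
        return [float(x) for x in text.split(",")]


table = LazyCSV(b"1,2,3\n4,5,6\n7,8,9\n")
print(len(table), table.parsed)  # the row count is known, nothing is parsed
print(table.row(1), table.parsed)
```

Timing only the constructor of something like this against a reader that fully parses every cell would be exactly the apples/oranges comparison described above.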


massung
2022-1-3 19:17:58

Speaking of… if I could have one Racket wishlist item magically fulfilled, it’d be slicing. I so want racket to support true [byte]vector/string slices.


samdphillips
2022-1-3 19:18:33

Support in core or just support?


seanbunderwood
2022-1-3 19:18:39

I haven’t watched the video, but I’m also wondering if he used the pure Python implementation of read_csv, or the C implementation. 30s sounds to me like it might be the former.


massung
2022-1-3 19:18:39

In core


samdphillips
2022-1-3 19:18:45

Do you want it to work like a vector/bytes?


massung
2022-1-3 19:20:56

IMO, all sequences should be slices at the interface level. If I do #(1 2 3 4) there may be an underlying vector created, but what I - the programmer - get back is a slice referencing it. All vector-, sequence-, string-, etc. functions act on the high-level slices.


samdphillips
2022-1-3 19:21:40

Why couldn’t it (or rather why wouldn’t you want it) be in a library do you think?


massung
2022-1-3 19:22:17

It’s such a fundamental pattern of programming. Hash tables could be in a library, too, but they aren’t. :wink:


massung
2022-1-3 19:23:00

But it could be, sure.


massung
2022-1-3 19:23:17

Maybe I’ll make one


badkins
2022-1-3 19:23:30

Forgive my ignorance, but is a slice simply a 3-tuple of (vector, beg, end) ?


massung
2022-1-3 19:24:03

yeah.
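
A minimal sketch of that (vector, beg, end) idea, written in Python for brevity. The `Slice` class and its `sub` method are hypothetical names for illustration, not an existing Racket or Python API.

```python
class Slice:
    """A window (seq, beg, end) onto an existing sequence; no elements are copied."""

    def __init__(self, seq, beg=0, end=None):
        self.seq = seq
        self.beg = beg
        self.end = len(seq) if end is None else end

    def __len__(self):
        return self.end - self.beg

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        return self.seq[self.beg + i]

    def sub(self, beg, end):
        # Slicing a slice only adjusts the offsets.
        if not 0 <= beg <= end <= len(self):
            raise IndexError((beg, end))
        return Slice(self.seq, self.beg + beg, self.beg + end)


data = [10, 20, 30, 40, 50]
s = Slice(data, 1, 4)  # sees 20, 30, 40
t = s.sub(1, 3)        # sees 30, 40
data[2] = 99           # mutating the underlying list shows through both
print(list(s), list(t))
```

Creating a slice costs the same regardless of its length, but every slice observes later mutations of the container it points into.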


badkins
2022-1-3 19:25:16

Sorry for the red herring of csv reading above, ignore that, but the video is a nice intro.


massung
2022-1-3 19:25:40

This is #random - no need to be sorry about CSV reading. :slightly_smiling_face:


samdphillips
2022-1-3 19:41:05

Yeah if a slice acts like the underlying then I could see how having it in core would be good. You could reuse all of the existing functions.


samdphillips
2022-1-3 19:41:30

Although slices of mutable containers can get dicey


seanbunderwood
2022-1-3 19:44:38

I think the only language where I can see myself habitually passing around slices of mutable containers is Rust.

But I suppose they can still be useful even when confined to a single function body.


badkins
2022-1-3 20:00:39

From the <https://arrow.apache.org/overview/|Arrow Overview> “Arrow libraries for C (Glib), MATLAB, Python, R, and Ruby are built on top of the C++ library.” So, I’m thinking the way to go may be to just provide a wrapper to the Arrow C++ library. There has been, and will continue to be, a lot of effort put into the C++ library, so it seems we could use it in the same way flomat uses BLAS/LAPACK. Does that sound right?


samdphillips
2022-1-3 20:02:34

I would use the C library; it would probably be easier to work with from the FFI


badkins
2022-1-3 20:03:33

I wondered about that, but the C library is built on the C++ library.


samdphillips
2022-1-3 20:05:00

The C library also uses GIR so you might be able to get bindings for “free” (https://pkgs.racket-lang.org/package/gir)


badkins
2022-1-3 20:06:35

Wow


samdphillips
2022-1-3 20:11:40

I don’t think Racket can call into C++ without a C shim.


badkins
2022-1-3 20:55:06

@samdphillips do you know about gir ? I installed Apache Arrow via brew install apache-arrow-glib, but I have no idea what the object is called. The gir example for using Gtk has: (define gtk (gi-ffi "Gtk")) How does one determine the correct name?


samdphillips
2022-1-3 20:56:14

Hmm there should be an introspection “document” somewhere. Let me take a shot on my mac


badkins
2022-1-3 21:01:16

It seems like: (define arrow (gi-ffi "Arrow")) should do it


samdphillips
2022-1-3 21:01:21

While I wait for brew … here are some examples using Lua https://github.com/apache/arrow/tree/master/c_glib/example

It may be under the name “Arrow”


samdphillips
2022-1-3 21:01:29

:smile:


badkins
2022-1-3 21:01:31

Yeah, tried that.


badkins
2022-1-3 21:01:43

% racket arrow.rkt
g-irepository-require: implementation not found; arguments: "Arrow" #f 0 #<cpointer>
  context...:
   /Users/badkins/sync/github/racket/racket/share/pkgs/gir/gir/main.rkt:34:0: gi-ffi
   /Users/badkins/sync/github/racket/racket/collects/racket/contract/private/arrow-val-first.rkt:489:18
   body of "/Users/badkins/tmp/arrow.rkt"


badkins
2022-1-3 21:09:47

The docs for gir make no mention of how it finds libraries.


samdphillips
2022-1-3 21:13:11

Looks like the arrow lib installs the GIR XML files but doesn’t generate the .typelib files which the library uses.


samdphillips
2022-1-3 21:14:23

You can use an environment variable to influence the lookup (once the typelib file is generated) https://gnome.pages.gitlab.gnome.org/gobject-introspection/girepository/GIRepository.html


samdphillips
2022-1-3 21:14:48

It is possible to control the search paths programmatically, using g_irepository_prepend_search_path(). It is also possible to modify the search paths by using the GI_TYPELIB_PATH environment variable. The environment variable takes precedence over the default search path and the g_irepository_prepend_search_path() calls.


badkins
2022-1-3 21:16:00

I have:

% ls -l lib/girepository-1.0
total 384
-r--r--r-- 1 badkins admin 138664 Nov 9 21:07 Arrow-1.0.typelib
-r--r--r-- 1 badkins admin 9456 Nov 9 21:07 ArrowDataset-1.0.typelib
-r--r--r-- 1 badkins admin 9172 Nov 9 21:07 ArrowFlight-1.0.typelib
-r--r--r-- 1 badkins admin 16440 Nov 9 21:07 Gandiva-1.0.typelib
-r--r--r-- 1 badkins admin 4440 Nov 9 21:07 Parquet-1.0.typelib
-r--r--r-- 1 badkins admin 3736 Nov 9 21:07 Plasma-1.0.typelib


samdphillips
2022-1-3 21:16:17

Ah those should work.


samdphillips
2022-1-3 21:16:38

I probably missed them in my install


badkins
2022-1-3 21:19:12

Tried all sorts of values for GI_TYPELIB_PATH to no avail.


samdphillips
2022-1-3 21:20:07

Same, sorry.


samdphillips
2022-1-3 21:25:27

Oh it may be because Racket cannot load the C part of the gir code.


badkins
2022-1-3 21:28:57

Oh well, it seemed like a promising idea.


samdphillips
2022-1-3 21:32:24

$ PLTSTDERR=debug@ffi-lib racket
Welcome to Racket v8.3.0.8 [cs].
> ,req gir
ffi-lib: failed for (ffi-lib "libgobject-2.0" ""), tried:
  #<path:/Users/sphillips/Library/Racket/snapshot/lib/libgobject-2.0.dylib> (no such file)
  #<path:/Users/sphillips/Library/Racket/snapshot/lib/libgobject-2.0> (no such file)
  #<path:/Applications/Racket v8.3.0.8/lib/libgobject-2.0.dylib> (no such file)
  #<path:/Applications/Racket v8.3.0.8/lib/libgobject-2.0> (no such file)
  "libgobject-2.0.dylib" (using OS library search path)
  "libgobject-2.0" (using OS library search path)
  #<path:/usr/local/Cellar/apache-arrow-glib/6.0.1_1/lib/girepository-1.0/libgobject-2.0.dylib> (no such file)
  #<path:/usr/local/Cellar/apache-arrow-glib/6.0.1_1/lib/girepository-1.0/libgobject-2.0> (no such file)
ffi-lib: failed for (ffi-lib "libgirepository-1.0" ""), tried:
  #<path:/Users/sphillips/Library/Racket/snapshot/lib/libgirepository-1.0.dylib> (no such file)
  #<path:/Users/sphillips/Library/Racket/snapshot/lib/libgirepository-1.0> (no such file)
  #<path:/Applications/Racket v8.3.0.8/lib/libgirepository-1.0.dylib> (no such file)
  #<path:/Applications/Racket v8.3.0.8/lib/libgirepository-1.0> (no such file)
  "libgirepository-1.0.dylib" (using OS library search path)
  "libgirepository-1.0" (using OS library search path)
  #<path:/usr/local/Cellar/apache-arrow-glib/6.0.1_1/lib/girepository-1.0/libgirepository-1.0.dylib> (no such file)
  #<path:/usr/local/Cellar/apache-arrow-glib/6.0.1_1/lib/girepository-1.0/libgirepository-1.0> (no such file)


samdphillips
2022-1-3 21:33:03

Oh it’s searching in the last two there because it is cwd


samdphillips
2022-1-3 21:43:09

Well, they load if you are in /usr/local/lib. I’m not enough of a Mac expert to know how to get these properly in the library search path.


massung
2022-1-3 21:44:46

Not sure how you have Homebrew set up, but my stuff doesn’t all go to /usr/local; I use brew --prefix to find out where. For example, I usually do something like:

export PATH=$(brew --prefix)/bin:$PATH
export DYLD_LIBRARY_PATH=$(brew --prefix)/lib:$DYLD_LIBRARY_PATH

etc.


badkins
2022-1-3 21:45:24

I tried getting it to work with Ruby as a test using <https://github.com/apache/arrow/tree/master/c_glib|this info> , but all sorts of C++ build errors occurred. I’m getting a pretty bad impression of Arrow thus far…


samdphillips
2022-1-3 21:45:53

DYLD_LIBRARY_PATH that’s the part I’m missing


badkins
2022-1-3 21:46:05

I tried running my Racket program while being in various directories to no avail. @samdphillips are you saying you got this to work?


samdphillips
2022-1-3 21:46:36

I have it loading the libs and maybe the typelib now


samdphillips
2022-1-3 21:49:32

In /usr/local/lib:

Welcome to Racket v8.3.0.8 [cs].
> ,req gir
ffi-lib: loaded "libgobject-2.0.dylib"
ffi-lib: loaded "libgirepository-1.0.dylib"
> (gi-ffi "Arrow")
#<procedure:...ib/gir/gir/main.rkt:37:2>
> (define x (gi-ffi "Arrow"))
> (x 'Field 'new "uint8" (x 'UInt8DataType 'new)) ;; takes a bit of time
#<procedure:.../gir/gir/object.rkt:44:4>


samdphillips
2022-1-3 21:49:48

I think it’s working but I don’t have time right now to try a bigger example


badkins
2022-1-3 21:49:52

Hmm… I set PATH and DYLD_LIBRARY_PATH as above, still can’t find it.


badkins
2022-1-3 21:50:46

I don’t get the two ffi-lib... lines you get after (require gir)


badkins
2022-1-3 21:51:19

requiring gir works fine, but it still can’t find Arrow


samdphillips
2022-1-3 21:51:23

you get that if you turn on PLTSTDERR=debug@ffi-lib


badkins
2022-1-3 21:52:02

I just turned that on. Still no output for ffi-lib


massung
2022-1-3 21:53:44

> I set PATH and DYLD_LIBRARY_PATH

Annoying… did you restart [Dr]Racket so it got the new environment? If you get the env vars from the REPL in racket, does it show the right thing?


badkins
2022-1-3 21:56:09

I’m running racket from command line.


massung
2022-1-3 21:57:32

massung
2022-1-3 21:57:52

talks about DYLD_FALLBACK_FRAMEWORK_PATH


massung
2022-1-3 21:57:57

haven’t heard of that one before tho


badkins
2022-1-3 21:58:45

@massung yes, (getenv "DYLD_LIBRARY_PATH") and PATH show correctly


badkins
2022-1-3 22:05:45

I tried loading the 2 libs in Sam’s output manually. First one loaded, second one did not. For the second, I have 2 versions in /usr/local/lib - I tried both to no avail. No clue what’s going on here.


badkins
2022-1-3 22:07:12

Welcome to Racket v8.2.0.8 [cs].
> (require ffi/unsafe)
> (ffi-lib "libgirepository-1.0.1.dylib")
"ffi-lib: could not load foreign library\n path: libgirepository-1.0.1.dylib\n system error: dlopen(libgirepository-1.0.1.dylib, 6): Symbol not found: __cg_jpeg_resync_to_restart\n Referenced from: /System/Library/Frameworks/ImageIO.framework/Versions/A/ImageIO\n Expected in: /usr/local/lib/libJPEG.dylib\n in /System/Library/Frameworks/ImageIO.framework/Versions/A/ImageIO"
(exn:fail:filesystem "ffi-lib: could not load foreign library\n path: libgirepository-1.0.1.dylib\n system error: dlopen(libgirepository-1.0.1.dylib, 6): Symbol not found: __cg_jpeg_resync_to_restart\n Referenced from: /System/Library/Frameworks/ImageIO.framework/Versions/A/ImageIO\n Expected in: /usr/local/lib/libJPEG.dylib\n in /System/Library/Frameworks/ImageIO.framework/Versions/A/ImageIO" #<continuation-mark-set>)
>


badkins
2022-1-3 22:07:50

So some missing jpeg thing is gumming up the works ?!?!


badkins
2022-1-3 22:11:55

What the heck - this works fine in DrRacket, but not command line Racket ?!?!?!


badkins
2022-1-3 22:12:43

I did a search for that jpeg lib and it kept showing up in various racket install dirs.



badkins
2022-1-3 22:16:29

The only path I had was /usr/local/lib though:

% echo $DYLD_LIBRARY_PATH
/usr/local/lib


soegaard2
2022-1-3 22:18:18

Maybe it’s one of the other PATH variables?


badkins
2022-1-3 22:20:49

I’m just wondering why DrRacket works and command line racket fails.


badkins
2022-1-3 22:58:39

Solved it. Setting DYLD_LIBRARY_PATH was a mistake :)


samdphillips
2022-1-3 22:59:05

So, probably an “easy” way to make it work is just to fix the path in the gir package https://github.com/Kalimehtar/gir/blob/master/gir/loadlib.rkt


samdphillips
2022-1-3 22:59:56

@badkins you have it running from an arbitrary directory?


badkins
2022-1-3 23:00:26

yes


badkins
2022-1-3 23:01:46

I went through so many bloody permutations, I’m not exactly sure what happened. But now, after closing all terminal windows, quitting Terminal, and opening it again, it just works.


samdphillips
2022-1-3 23:03:52

Ohhh


badkins
2022-1-3 23:03:59

Ah! I did brew install gobject-introspection at one point.


samdphillips
2022-1-3 23:04:03

I kinda hate macs


badkins
2022-1-3 23:05:01

It must’ve been that last brew install that did it. Setting DYLD_LIBRARY_PATH just confused the library loading process.


massung
2022-1-3 23:06:07

> I kinda hate ~macs~ anything that doesn’t “just work” :slightly_smiling_face:


massung
2022-1-3 23:06:36

I’m getting old. Shit needs to just work or I move onto something else


badkins
2022-1-3 23:08:21

@samdphillips where did you get the idea for this syntax? (x 'Field 'new "uint8" (x 'UInt8DataType 'new))


samdphillips
2022-1-3 23:08:41

From the lua example



samdphillips
2022-1-3 23:10:05

The racket gir library wraps everything in procedures though so it’s a little wonky


badkins
2022-1-3 23:14:52

I’m confused re: how to navigate this GObject stuff. If you didn’t see that Lua example, how would you get to your code snippet from https://arrow.apache.org/docs/c_glib/arrow-glib/ ?


samdphillips
2022-1-3 23:16:18

<dusts off my old GObject hat> GObject functions follow a convention, so there should be a function called something_new to make a thing


samdphillips
2022-1-3 23:17:03

Also most “classes” will have a description section in the docs.


badkins
2022-1-3 23:19:25

yes, <https://arrow.apache.org/docs/c_glib/arrow-glib/basic-array-classes.html#garrow-double-array-new|garrow_double_array_new ()>


badkins
2022-1-3 23:19:53

I just don’t see how, from the documentation, one would know what params to pass to gir


badkins
2022-1-3 23:21:25

And really, I have no idea what (x 'Field 'new "uint8" (x 'UInt8DataType 'new)) is supposed to do. The inner form appears to create a new UInt8DataType, but then what, you’re creating a new Field on that object ?!?!


samdphillips
2022-1-3 23:27:00

The gir definition in /usr/local/Cellar/apache-arrow-glib/6.0.1_1/share/gir-1.0 helps.

For double_array it’s on line 8655 of Arrow-1.0.gir


samdphillips
2022-1-3 23:29:00

So to call the new function you’d want something like:
(define arrow-lib (gi-ffi "Arrow"))
(arrow-lib 'DoubleArray 'new an-array-size an-arrow-buffer #f 0) ;; a closure wrapping the array


badkins
2022-1-3 23:29:05

Thus far, this seems really awful :) I’ve looked at quite a few web pages, tutorials, etc., and it’s just a mess. I can’t imagine anyone being productive with this.


samdphillips
2022-1-3 23:29:48

It’s like APIs such as Vulkan: you’re supposed to put something nicer on top of it.


badkins
2022-1-3 23:30:11

Sure, but all the docs are so kludgy, it makes me wonder about the quality of the implementation also.


badkins
2022-1-3 23:32:05

Even something as simple as knowing to use DoubleArray and not GArrowDoubleArray , etc. is not easy to know.
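
The naming relationship between the C symbols in the docs and the introspected names can be sketched roughly as follows. This is only an illustration, assuming the `GArrow` / `garrow_` prefixes seen in the names quoted above; `camel->snake`, `c-type-name`, and `c-function-name` are hypothetical helpers, not part of the gir library or of Arrow.

```racket
#lang racket/base

;; Hypothetical helpers for illustration only; not part of any library.

;; "DoubleArray" -> "double_array"
(define (camel->snake s)
  (string-downcase
   (regexp-replace* #rx"([a-z0-9])([A-Z])" s "\\1_\\2")))

;; Introspected class name -> C type name (assumes the "GArrow" prefix).
(define (c-type-name class-name)
  (string-append "GArrow" class-name))

;; Introspected class + method -> C function name (assumes the "garrow_" prefix).
(define (c-function-name class-name method)
  (string-append "garrow_" (camel->snake class-name) "_" method))

(c-type-name "DoubleArray")            ; "GArrowDoubleArray"
(c-function-name "DoubleArray" "new")  ; "garrow_double_array_new"
```

Going the other way (from a C name in the docs to what gir expects) means dropping the prefix, which matches the `'DoubleArray 'new` form used earlier.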


samdphillips
2022-1-3 23:32:27

Part of it is that the choice of using closure ‘objects’ in the racket gir library is a bit klunky


badkins
2022-1-3 23:34:24

I don’t really care about the aesthetics of gir, but if there is a cost to the abstraction, it’s a deal breaker.


samdphillips
2022-1-3 23:35:50

Probably mostly mental anguish


badkins
2022-1-3 23:45:04

It appears there’s no way to get Arrow to create an array. All the functions expect a pointer to already-allocated memory to be passed in.



badkins
2022-1-3 23:58:53

Thanks for all your help @samdphillips. I’ve decided that if I get involved with this, it will be a direct implementation, and not a GObject wrapper.


samdphillips
2022-1-3 23:59:02

You need to pass in something the collector won’t trash though


samdphillips
2022-1-4 01:53:38

Very tiny example, loads an Arrow Double Array with some numbers: <https://gist.github.com/samdphillips/db507867cebff520c2affccc73dba6c9>


badkins
2022-1-4 02:10:15

Thanks for that, @samdphillips. It doesn’t look quite so bad now that you’ve connected all the dots! I’m still concerned about the long-term implications of being tied to the Arrow C library, which in turn is built on the C++ library. If that can be abstracted over, it may not be an issue.


badkins
2022-1-4 02:12:25

I may take your example, and expand it to use flomat and see if I can get some zero-copy functionality going e.g. create a flomat matrix, convert to an Arrow array w/o copying, then slice that, and convert the slice to a flomat matrix w/o copying. I’ll have to look at the flomat source to see if the latter is possible.


badkins
2022-1-4 02:23:37

Yes, it should be possible. A flomat is just a struct: (struct flomat (m n a lda) so setting those 4 fields from Arrow info is trivial.


badkins
2022-1-4 02:24:06

I’m assuming I can cast the raw Arrow data to the appropriate C type.


alexharsanyi
2022-1-4 03:37:16

I just read the last hundred or so messages in this channel, and I thought I would add my two cents to this discussion, since I have done some work which would count as data science in Racket. Also, for full disclosure, I am the author of the existing data-frame package.

The Python data science libraries are the result of an enormous amount of work from a large number of contributors, some of these working full time on the various libraries. Here is a summary for 2021: matplotlib: 175 contributors made 4127 commits, numpy: 270 contributors made 3539 commits, scipy: 225 contributors made 2210 commits, and pandas: 408 contributors made 3171 commits.

For comparison, in 2021, the Racket plot library had 17 commits from 7 contributors, while the Racket math library had 27 commits from 9 contributors. The racket repository itself had 984 commits from 73 contributors.

Even if we assume that Racket is 100 times more productive than Python, there is no real chance of ever matching the features and performance of the Python ecosystem. I would love to be proven wrong on this.

On the other hand, not every data science task needs the absolute best: I was able to process, analyze, and visualize decent amounts of data using Racket. To do this, I had to make some improvements to the plot library and wrote the data frame package, but overall I was able to use existing Racket facilities.

I think the data science environment in Racket could be incrementally improved by looking at how to use existing libraries to solve specific problems and contributing improvements to those libraries. For my part, I am happy to assist people with the plot and data-frame libraries and I suspect maintainers for the other packages would also be happy to help out.


samth
2022-1-4 03:44:24

I basically agree with @alexharsanyi. When @hazel and I set out to build sawzall and graphite, (a) lots of things were already easy, (b) having a complete implementation relative to Python/R/etc. would be totally impossible, and (c) the only effective way to make progress is to start with something you want to do, do it using Racket, and see how you can improve.


massung
2022-1-4 04:23:31

For those who were part of the “slice” discussion earlier, I whipped this up in a couple hours: https://github.com/massung/sliver

I’d like to get thoughts before putting it up on raco. There are also a couple of things I’d probably do to it first as well. I’m not well versed in Typed Racket, so if someone wanted it updated with type signatures, I’d be down for a PR.


massung
2022-1-4 04:29:01

Just noting that I’d really prefer that something like this just be part of Racket core. Likely with a few extra features.

For example, I’d very much like to allow slices of strings and bytes to be used as input ports.