
What does NumPy have that is missing from the Flomat and math libraries?

By Jax do you mean MathJax?


I’m not sure what ‘automatically differentiate native Python and NumPy code’ means?

> JAX uses XLA to compile and run your NumPy code on accelerators, like GPUs and TPUs. https://www.tensorflow.org/xla#xla_frontends
and
> standalone tfcompile tool, which converts TensorFlow graph into executable code (for x86–64 CPU only).

I’m not that familiar with automatic differentiation, though I know what it is. I’m really unclear what automatic differentiation of code is:
> It can differentiate through a large subset of Python’s features, including loops, ifs, recursion, and closures,
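(If it helps make “differentiating code” concrete: one common flavor is forward-mode AD, where you run the program on “dual numbers” that carry a value plus a derivative, so whatever loops/ifs/closures the code actually executes get differentiated along the way. A minimal Racket sketch; all the names here (dual, d+, d*, deriv) are made up for illustration and are not from JAX or any package mentioned above:)
;; each dual number carries a value and its derivative w.r.t. the input
(struct dual (val der) #:transparent)

(define (d+ a b)
  (dual (+ (dual-val a) (dual-val b))
        (+ (dual-der a) (dual-der b))))

(define (d* a b)
  (dual (* (dual-val a) (dual-val b))                    ; product value
        (+ (* (dual-val a) (dual-der b))                 ; product rule
           (* (dual-der a) (dual-val b)))))

;; derivative of f at x: seed the input with derivative 1 and read it back out
(define (deriv f x) (dual-der (f (dual x 1.0))))

(deriv (lambda (x) (d+ (d* x x) x)) 3.0) ; => 7.0, since d/dx (x^2 + x) = 2x + 1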



Flomat only does floating point matrices. Numpy supports a variety of data types as the elements, including arbitrary objects.
That’s allowed it to be not just a linear algebra library, but the de facto standard for working with and sharing multidimensional hunks of data. Which then allowed the Python data ecosystem to blossom. Packages can easily interoperate with no or minimal need for glue code.

Oh, it’s clear within the first 5 min of the video that I did not know what AD is. I was thinking of symbolic diff

For example, I mostly do NLP, so my numpy arrays tend to be full of strings and arrays or sets of strings.

it is a #lang
!

thank you

It’s a pretty new thing. If I recall correctly, the first big paper on it was published less than 10 years ago.

Oh

Things like data-frame are not multidimensional https://docs.racket-lang.org/data-frame/index.html

But, for example, Pandas is built on top of numpy, even for the non-numeric column types.

Versus, I believe Racket’s data-frame is doing its own thing with the data.

Ah I see it now: https://github.com/numpy/numpy/blob/main/numpy/core/src/multiarray/_multiarray_tests.c.src

Which, I haven’t tried doing data science for real in Racket. Scala’s data science ecosystem, which is what I mostly use at work right now, has a similar thing going on to what I think I see in the equivalent Racket ecosystem, though: the linear algebra library only does linear algebra, and so the data and scientific computing packages end up creating their own private types to add support for all the kinds of data they need to support beyond floating point numbers. And then, with so little overlap in their underlying data models, what one is working on doesn’t need to get too complex before you’re suddenly spending a surprising amount of time writing glue code.

automatic differentiation is quite old; here’s a paper about it from 1964: https://dl.acm.org/doi/10.1145/355586.364791

Fwiw I’ve asked the author (via a PR) to add a Racket-compatible license and whether they would be ok with it going on the package catalog: https://github.com/ots22/rackpropagator

Differentiable programming, though?

Fascinating either way

To the degree that “differentiable programming” means “automatic differentiation of programs in a conventional programming language” then it is also very old

Rascas has a few algorithmic operations that it can differentiate through, in particular _if.

I thought Rascas was symbolic diff?

That’s symbolic AD, yes, there are pros and cons

Rascas also has a _let* form and can make a lot of simplifications with it

There’s a smart-simplify operator that tries various combinations of contract/expand to compress the expression

Numeric differentiation based on (f(x+h) - f(x))/h can quickly become inaccurate with compound functions
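(A quick illustration of that inaccuracy, nothing to do with Rascas; fd here is just a made-up helper for the forward-difference formula:)
;; forward-difference approximation of f'(x); as h shrinks, the subtraction
;; f(x+h) - f(x) cancels most significant digits and rounding error dominates
(define (fd f x h) (/ (- (f (+ x h)) (f x)) h))

(fd exp 1.0 1e-5)  ; close to e = 2.718281828...
(fd exp 1.0 1e-13) ; only the first few digits are still right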

_let* is what allows sequential-style AD, but it constructs a symbolic expr nonetheless, which allows for further manipulation


So, it looks like NumPy is pretty fundamental in the Python data science ecosystem, and the ndarray seems pretty fundamental w/in NumPy. @seanbunderwood does creating the Racket equivalent of ndarray seem like a good starting point to up Racket’s game in this space?

I suppose a 2-dimensional ndarray of floats would just use flomat internally, since that already has the BLAS bindings.

It’s just my personal take, but yeah, my impression is that that’s what really distinguishes the Python ecosystem from most others: numpy isn’t just a linear algebra package; it’s a lingua franca for all sorts of data manipulation and interchange that everyone can share.
That said, my best hunch is that the future in this space is Apache Arrow. I would guess that integrating with it is the best way to become a data contender nowadays. It’s language-agnostic, and is already seeing some traction with other communities. I would hope that means that language communities that congregate on it can share more, duplicate less, and move faster.

<http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf|A Microsoft paper> referenced in <https://news.ycombinator.com/item?id=26018827|a Hacker News post> that seems relevant.

In particular, “Magpie uses Arrow [2] and ArrowFlight [40] as a common format for data access across each of its backend engines”

One of the comments in that HN thread mentions Arrow not supporting n-dimensional arrays. That seems odd.

Aha, that’s annoying.

Though, if that’s the case, I wonder what <https://arrow.apache.org/docs/python/generated/pyarrow.Tensor.html|pyarrow.Tensor> is doing.

I’m not an expert in NNs (by a long shot), but simple matrix x vector multiplication doesn’t require 2D matrices as opposed to just row (or column) vectors, which would be more efficient anyway.

Plus for the big stuff they likely load everything into GPU registers anyway.

I found some value in the “Getting Things Done” book. In particular, trying to identify the specific “next task” in a complex set of tasks. I think this NumPy.ndarray / Apache Arrow “thing” is a likely “next task”.

I’m not entirely clear of the benefits of Arrow if this is implemented in Racket/Scheme code vs. C.

i.e. I’m not sure I’d want to go through the effort of using the Arrow format, if the efficiency benefits are nullified in practice

JAX is interesting:
> Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
seems like a task that would be simpler with parenthetical syntax. (The JIT to GPU/TPU bit seems to be done by a call out to XLA.)

Ah, now I remember @seanbunderwood :) I know we discussed Arrow before in this context. I’m reading the spec now, and I’m wondering if the physical layout is compatible with Racket. If not, then it seems this would need to be implemented in C/C++ with bindings in Racket possibly.

“The representation of a flomat therefore matches the expectation of the routines in BLAS and LAPACK closely.” from flomat - this seems to be the important part. I would think Arrow has this same format to allow using BLAS/LAPACK w/ zero copy.

The representation is as follows:
; m = rows, n = cols, a = mxn array of doubles
; lda = leading dimension of a (see below)
(struct flomat (m n a lda) ...)
An “array of doubles” is simply a pointer to a piece of memory:
(define _flomat (_cpointer 'flomat))
It is tedious, but trivial to extend flomat to support complex numbers.
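(To make the layout concrete, here’s a hedged sketch, not part of flomat itself, of how those fields address an element, assuming the usual column-major storage that BLAS/LAPACK expect; raw-ref is a made-up name:)
(require ffi/unsafe)

;; element (i, j) of an m x n column-major matrix lives at index i + j*lda
;; in the array of doubles (lda >= m; lda = m when the columns are packed)
(define (raw-ref a lda i j)
  (ptr-ref a _double (+ i (* j lda))))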

@soegaard2 any interest in converting flomat to using the <https://arrow.apache.org/docs/format/Columnar.html|Arrow format>? :) I think I’d be willing to put some serious time into it. I just need to research some more to see if it makes sense.

I’ve been reading for a while, and I still haven’t seen any info on Arrow <-> BLAS/LAPACK integration. :(

It’s probably less about forcing that integration as opposed to making it trivial for anyone who wants to pass Arrow data directly to BLAS

Skimming the documentation on Arrow, it looks like a big project. If someone (cough) writes the lower levels then replacing the upper level of flomat wouldn’t be too hard. But the idea of flomat was to use as many BLAS/LAPACK routines as possible, since they can be trusted for numerical accuracy.

I have a reasonable idea of what it would take to get started on creating a subset of NumPy.ndarray where it would simply use flomat in the specific case where the elements are floating point numbers, and something else, otherwise. But I have much less of a handle on how Arrow impacts this, and I’m less sure of the benefits of Arrow.

During December I worked on implementing SRFI 179 in (untyped) Racket: https://srfi.schemers.org/srfi-179/srfi-179.html
Maybe we can use that as a starting point for an alternative implementation of general arrays in Racket?

Having said that, it seems naive to ignore Arrow, so I’ll need to keep reading …

One of the ideas of the SRFI is that some arrays are “specialized”, i.e. their data is stored in a contiguous memory area.

I only briefly looked at the srfi, but it really does seem like the industry is getting behind Arrow.

flomat is probably 90% or more compatible already.

I think I’ll continue on my current course, which is to blast through a ton of info on Python & the ecosystem, to become a competent user of the Python ecosystem, and then try and figure out how I might incrementally move Racket forward in this space.

It is “just” a question of how to package the array of numbers in a larger data structure?

I think that’s the bulk of it, but the details may require a fair amount of work.

I think my initial goal would be to port an important subset of NumPy.ndarray to Racket.

If you decide to tackle Arrow, feel free to use whatever you can from flomat. I predict I won’t have time to do any implementing though.

Does Numpy support Arrow?


My impression is NumPy.ndarray is its own thing currently, but maybe NumPy is moving toward using Arrow - not really sure.

The advantage is being able to interop with other ecosystems. Yes, you would be wrapping the C/C++ libraries for the most part, rather than writing it in pure Racket. What you theoretically get in return, though, is that you can keep getting more things at low cost by wrapping more native libraries from the same ecosystem. Ballista for distributed computing, for example, or libgdf for CUDA-accelerated computation.

<https://wesmckinney.com/blog/apache-arrow-pandas-internals/|This blog post> by the creator of Pandas is probably worth reading.

Bjarne Stroustrup talks about “two types of languages: the ones people complain about and the ones nobody uses” :) I hear of people complaining about Pandas a lot, but it seems everyone uses it!

Long story short, ndarray and Arrow’s types both have the same underlying data layout. So to create an ndarray of a pyarrow array, what you do is create a new ndarray that shares pointers with the pyarrow array.

And that layout is simply columnar, right? So, in theory, the same would apply to flomat.

probably 2 functions then: Arrow -> flomat, and flomat -> Arrow

both would be essentially zero copy i.e. the main elements wouldn’t be relocated

> people complaining about Pandas a lot, but it seems everyone uses it
I have to constantly remind people at work showing demos or project progress to others that “no comments” is a bad sign. Everyone commenting to you all the things they wish your demo did or features you need to add is what you want. They aren’t knocking the hard work you’ve done. They are able to quickly see how what you’ve done benefits them, their work, and their brains are flooding them with all the things they can now do, wish they could do easier, etc.

I’m hoping there are other benefits besides the interop though, because I’m not sure Racket would be that great for interop :( After I made the comment above, I looked through the source for flomat, and it looks like flomat is already using a compatible layout, for the most part. So, I think much of the code could be Racket code. I wouldn’t be too interested in creating a large C++ library for use in Racket. I paid my dues with C++, and would rather not get back into it :)

In my humble opinion, the main benefit to this theoretical work is to have a fantastic ndarray-like thing that other pure Racket libs could make use of. I could be wrong though, I should really look into what it would take to get some Python interop - I think I might be ok with that compromise if it offered enough functionality.

But, the only reason to avoid Arrow is probably if it would involve a ton of more work, and at this point, I don’t think that’s the case. If something is going to be built from scratch, some layout needs to be used anyway, so why not the Arrow layout.

IME, pandas can be broken down into just a few pieces:
• Reading/writing data FAST (for all the various formats). This is actually the most difficult part to do well.
• Data frame processing. This is the easiest part if you throw out multi-threading/nodes.
• Math (eg. scipy). Implementations of common equations and algorithms that are trusted and work on all the various data frame classes: series, groups, etc. This is also fairly simple, but getting it trusted for publishing not so much.

I am beginning to see where Arrow fits in. It would definitely make it easier to work with libraries written in other languages.

Are there standard C (or other) libs that SciPy is using analogous to BLAS/LAPACK? Creating bindings is one thing, but I have no interest, or sufficient skill, to implement a ton of stats functions, for example.

I think it varies. The linear algebra functions seem to be mostly wrappers of BLAS/LAPACK, but the stats folder seems to be in Python.

Wow, somehow I missed the fact that Wes McKinney, the creator of Pandas, is also a co-creator of Apache Arrow! <https://learning.acm.org/techtalks/apache|Apache Arrow and the Future of Data Frames with Wes McKinney>

Impressive. At 24:10 in that video Wes compares Pandas’ read_csv(...) to Arrow’s read_csv(...) when reading a 2.5 GB file. Pandas, single threaded, took about 30 seconds. Arrow’s multi-threaded version took 1.5 seconds.

That’s promising for implementing arrow in Racket - if for no other reason than to be able to easily transfer data between Python and Racket via files.

I’m usually very wary of benchmark comparisons like that because it’s usually apples/oranges in the comparison. I’m sure Arrow does several smart things that Python can’t do, but here’s likely a list of the “optimizations” done… cough deferred cough:
• Slurped the entire file into memory
• Made no attempt to “parse” data types as opposed to just slicing the file into byte offset/length references
• Likely made no effort to handle N/A values, double quotes, escaped characters, etc.
• Returned as soon as the frame was created and “ready” for use - meaning data isn’t materialized and will be materialized on-demand. The same work needs to be done, but there’s hope that the work can be handled across multiple threads, in the background while readying other things (e.g. reading other CSV files), and that - when needed - the entire file won’t need to be materialized, but just a subset of it.
Those are smart, good things to do. But it also means - again - an apples and oranges comparison. It’d be better to compare a fully materialized table: what was the time footprint? I’m sure it’s still faster, but likely 5–10x longer than the 1.5s claimed.

Speaking of… if I could have one Racket wishlist item magically fulfilled, it’d be slicing. I so want racket to support true [byte]vector/string slices.

Support in core or just support?

I haven’t watched the video, but I’m also wondering if he used the pure Python implementation of read_csv, or the C implementation. 30s sounds to me like it might be the former.

In core

Do you want it to work like a vector/bytes?

IMO, all sequences should be slices at the interface level. If I do #(1 2 3 4) there may be an underlying vector created, but what I - the programmer - get back is a slice referencing it. All vector-, sequence-, string-, etc. functions act on the high-level slices.

Why couldn’t it (or rather why wouldn’t you want it) be in a library do you think?

It’s such a fundamental pattern of programming. Hash tables could be in a library, too, but we don’t. :wink:

But it could be, sure.

Maybe I’ll make one

Forgive my ignorance, but is a slice simply a 3-tuple of (vector, beg, end) ?

yeah.
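(Roughly that, yes. A tiny sketch of the 3-tuple idea in Racket; these names are made up for illustration and are not taken from any existing package:)
(struct slice (vec beg end) #:transparent)

(define (slice-length s) (- (slice-end s) (slice-beg s)))

(define (slice-ref s i)
  (vector-ref (slice-vec s) (+ (slice-beg s) i)))

;; taking a sub-slice shares the underlying vector; nothing is copied
(define (subslice s from to)
  (slice (slice-vec s) (+ (slice-beg s) from) (+ (slice-beg s) to)))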

Sorry for the red herring of csv reading above, ignore that, but the video is a nice intro.

This is #random - no need to be sorry about CSV reading. :slightly_smiling_face:

Yeah if a slice acts like the underlying then I could see how having it in core would be good. You could reuse all of the existing functions.

Although slices of mutable containers can get dicey

I think the only language where I can see myself habitually passing around slices of mutable containers is Rust.
But I suppose they can still be useful even when confined to a single function body.

From the <https://arrow.apache.org/overview/|Arrow Overview> “Arrow libraries for C (Glib), MATLAB, Python, R, and Ruby are built on top of the C++ library.” So, I’m thinking the way to go may be to just provide a wrapper to the Arrow C++ library. There has been, and will continue to be, a lot of effort put into the C++ library, so it seems we could use it in the same way flomat uses BLAS/LAPACK. Does that sound right?

I would use the C library; it would probably be easier to work with from the FFI

I wondered about that, but the C library is built on the C++ library.

The C library also uses GIR so you might be able to get bindings for “free” (https://pkgs.racket-lang.org/package/gir)

Wow

I don’t think Racket can call into C++ without a C shim.

@samdphillips do you know about gir? I installed Apache Arrow via brew install apache-arrow-glib, but I have no idea what the object is called. The gir example for using Gtk has:
(define gtk (gi-ffi "Gtk"))
How does one determine the correct name?

Hmm there should be an introspection “document” somewhere. Let me take a shot on my mac

It seems like (define arrow (gi-ffi "Arrow")) should do it

While I wait for brew … here are some examples using Lua https://github.com/apache/arrow/tree/master/c_glib/example
It may be under the name “Arrow”

:smile:

Yeah, tried that.

% racket arrow.rkt
g-irepository-require: implementation not found; arguments: "Arrow" #f 0 #<cpointer>
context...:
/Users/badkins/sync/github/racket/racket/share/pkgs/gir/gir/main.rkt:34:0: gi-ffi
/Users/badkins/sync/github/racket/racket/collects/racket/contract/private/arrow-val-first.rkt:489:18
body of "/Users/badkins/tmp/arrow.rkt"

The docs for gir make no mention of how it finds libraries.

Looks like the arrow lib installs the GIR XML files but doesn’t generate the .typelib files which the library uses.

You can use an environment variable to influence the lookup (once the typelib file is generated) https://gnome.pages.gitlab.gnome.org/gobject-introspection/girepository/GIRepository.html

It is possible to control the search paths programmatically, using g_irepository_prepend_search_path(). It is also possible to modify the search paths by using the GI_TYPELIB_PATH environment variable. The environment variable takes precedence over the default search path and the g_irepository_prepend_search_path() calls.

I have:
% ls -l lib/girepository-1.0
total 384
-r--r--r-- 1 badkins admin 138664 Nov 9 21:07 Arrow-1.0.typelib
-r--r--r-- 1 badkins admin 9456 Nov 9 21:07 ArrowDataset-1.0.typelib
-r--r--r-- 1 badkins admin 9172 Nov 9 21:07 ArrowFlight-1.0.typelib
-r--r--r-- 1 badkins admin 16440 Nov 9 21:07 Gandiva-1.0.typelib
-r--r--r-- 1 badkins admin 4440 Nov 9 21:07 Parquet-1.0.typelib
-r--r--r-- 1 badkins admin 3736 Nov 9 21:07 Plasma-1.0.typelib

Ah those should work.

I probably missed them in my install

Tried all sorts of values for GI_TYPELIB_PATH to no avail.

Same, sorry.

Oh it may be because Racket cannot load the C part of the gir code.

Oh well, it seemed like a promising idea.

$ PLTSTDERR=debug@ffi-lib racket
Welcome to Racket v8.3.0.8 [cs].
> ,req gir
ffi-lib: failed for (ffi-lib "libgobject-2.0" ""), tried:
#<path:/Users/sphillips/Library/Racket/snapshot/lib/libgobject-2.0.dylib> (no such file)
#<path:/Users/sphillips/Library/Racket/snapshot/lib/libgobject-2.0> (no such file)
#<path:/Applications/Racket v8.3.0.8/lib/libgobject-2.0.dylib> (no such file)
#<path:/Applications/Racket v8.3.0.8/lib/libgobject-2.0> (no such file)
"libgobject-2.0.dylib" (using OS library search path)
"libgobject-2.0" (using OS library search path)
#<path:/usr/local/Cellar/apache-arrow-glib/6.0.1_1/lib/girepository-1.0/libgobject-2.0.dylib> (no such file)
#<path:/usr/local/Cellar/apache-arrow-glib/6.0.1_1/lib/girepository-1.0/libgobject-2.0> (no such file)
ffi-lib: failed for (ffi-lib "libgirepository-1.0" ""), tried:
#<path:/Users/sphillips/Library/Racket/snapshot/lib/libgirepository-1.0.dylib> (no such file)
#<path:/Users/sphillips/Library/Racket/snapshot/lib/libgirepository-1.0> (no such file)
#<path:/Applications/Racket v8.3.0.8/lib/libgirepository-1.0.dylib> (no such file)
#<path:/Applications/Racket v8.3.0.8/lib/libgirepository-1.0> (no such file)
"libgirepository-1.0.dylib" (using OS library search path)
"libgirepository-1.0" (using OS library search path)
#<path:/usr/local/Cellar/apache-arrow-glib/6.0.1_1/lib/girepository-1.0/libgirepository-1.0.dylib> (no such file)
#<path:/usr/local/Cellar/apache-arrow-glib/6.0.1_1/lib/girepository-1.0/libgirepository-1.0> (no such file)

Oh it’s searching in the last two there because it is cwd

Well they load if you are in /usr/local/lib. I’m not much of a Mac expert to know how to get these properly in the library search path.

not sure how you have homebrew set up, but my stuff doesn’t go to /usr/local; rather, I use brew --prefix to find out where. For example, I usually do something like:
export PATH=$(brew --prefix)/bin:$PATH
export DYLD_LIBRARY_PATH=$(brew --prefix)/lib:$DYLD_LIBRARY_PATH
etc.

I tried getting it to work with Ruby as a test using <https://github.com/apache/arrow/tree/master/c_glib|this info> , but all sorts of C++ build errors occurred. I’m getting a pretty bad impression of Arrow thus far…

DYLD_LIBRARY_PATH - that’s the part I’m missing

I tried running my Racket program while being in various directories to no avail. @samdphillips are you saying you got this to work?

I have it loading the libs and maybe the typelib now

In /usr/local/lib:
Welcome to Racket v8.3.0.8 [cs].
> ,req gir
ffi-lib: loaded "libgobject-2.0.dylib"
ffi-lib: loaded "libgirepository-1.0.dylib"
> (gi-ffi "Arrow")
#<procedure:...ib/gir/gir/main.rkt:37:2>
> (define x (gi-ffi "Arrow"))
> (x 'Field 'new "uint8" (x 'UInt8DataType 'new))
;; takes a bit of time
#<procedure:.../gir/gir/object.rkt:44:4>

I think it’s working but I don’t have time right now to try a bigger example

Hmm… I set PATH and DYLD_LIBRARY_PATH as above, still can’t find it.

I don’t get the two ffi-lib... lines you get after (require gir)

requiring gir works fine, but it still can’t find Arrow

you get that if you turn on PLTSTDERR=debug@ffi-lib

I just turned that on. Still no output for ffi-lib

> I set PATH and DYLD_LIBRARY_PATH Annoying… did you restart [Dr]Racket so it got the new environment? If you get the env vars from the REPL in racket, does it show the right thing?

I’m running racket from the command line.

A few years old, but maybe helpful: http://sushihangover.github.io/mono-unable-to-find-the-ldylib-native-library/

talks about DYLD_FALLBACK_FRAMEWORK_PATH

haven’t heard of that one before tho

@massung yes, (getenv "DYLD_LIBRARY_PATH") and PATH show correctly

I tried loading the 2 libs in Sam’s output manually. First one loaded, second one did not. For the second, I have 2 versions in /usr/local/lib - I tried both to no avail. No clue what’s going on here.

Welcome to Racket v8.2.0.8 [cs].
> (require ffi/unsafe)
> (ffi-lib "libgirepository-1.0.1.dylib")
"ffi-lib: could not load foreign library\n path: libgirepository-1.0.1.dylib\n system error: dlopen(libgirepository-1.0.1.dylib, 6): Symbol not found: __cg_jpeg_resync_to_restart\n Referenced from: /System/Library/Frameworks/ImageIO.framework/Versions/A/ImageIO\n Expected in: /usr/local/lib/libJPEG.dylib\n in /System/Library/Frameworks/ImageIO.framework/Versions/A/ImageIO"
(exn:fail:filesystem "ffi-lib: could not load foreign library\n path: libgirepository-1.0.1.dylib\n system error: dlopen(libgirepository-1.0.1.dylib, 6): Symbol not found: __cg_jpeg_resync_to_restart\n Referenced from: /System/Library/Frameworks/ImageIO.framework/Versions/A/ImageIO\n Expected in: /usr/local/lib/libJPEG.dylib\n in /System/Library/Frameworks/ImageIO.framework/Versions/A/ImageIO" #<continuation-mark-set>)
>

So some missing jpeg thing is gumming up the works ?!?!

What the heck - this works fine in DrRacket, but not command line Racket ?!?!?!

I did a search for that jpeg lib and it kept showing up in various racket install dirs.


The only path I had was /usr/local/lib though:
% echo $DYLD_LIBRARY_PATH
/usr/local/lib

Maybe it’s one of the other PATH variables?

I’m just wondering why DrRacket works and command line racket fails.

This one is newer. It mentions JPEG with capital letters. https://stackoverflow.com/questions/63024402/solution-to-python3-tkinter-import-error-symbol-not-found-cg-jpeg-resync-to

Solved it. Setting DYLD_LIBRARY_PATH was a mistake :)

So, probably an “easy” way to make it work is just to fix the path in the gir package https://github.com/Kalimehtar/gir/blob/master/gir/loadlib.rkt

@badkins you have it running from an arbitrary directory?

yes

I went through so many bloody permutations, I’m not exactly sure what happened. But now, after closing out of all terminal windows, quitting Terminal, and opening it again, it just works.

Ohhh

Ah! I did brew install gobject-introspection at one point.

I kinda hate macs

It must’ve been that last brew install that did it. Setting DYLD_LIBRARY_PATH just confused the library loading process.

> I kinda hate ~macs~ anything that doesn’t “just work” :slightly_smiling_face:

I’m getting old. Shit needs to just work or I move onto something else

@samdphillips where did you get the idea for this syntax? (x 'Field 'new "uint8" (x 'UInt8DataType 'new))

From the lua example


The racket gir library wraps everything in procedures though so it’s a little wonky

I’m confused re: how to navigate this GObject stuff. If you didn’t see that Lua example, how would you get to your code snippet from https://arrow.apache.org/docs/c_glib/arrow-glib/ ?

<dusts off my old GObject hat> GObject functions follow a convention so there should be a function called something_new to make a thing

Also most “classes” will have a description section in the docs.

yes, <https://arrow.apache.org/docs/c_glib/arrow-glib/basic-array-classes.html#garrow-double-array-new|garrow_double_array_new ()>

I just don’t see how, from the documentation, one would know what params to pass to gir

And really, I have no idea what (x 'Field 'new "uint8" (x 'UInt8DataType 'new)) is supposed to do. The inner form appears to create a new UInt8DataType, but then what, you’re creating a new Field on that object ?!?!

The gir definition in /usr/local/Cellar/apache-arrow-glib/6.0.1_1/share/gir-1.0 helps.
For double_array it’s on line 8655 of Arrow-1.0.gir


So to call the new function you’d want something like:
(define arrow-lib (gi-ffi "Arrow"))
(arrow-lib 'DoubleArray 'new an-array-size an-arrow-buffer #f 0) ;; a closure wrapping the array

Thus far, this seems really awful :) I’ve looked at quite a few web pages, tutorials, etc., and it’s just a mess. I can’t imagine anyone being productive with this.

It’s like APIs like Vulkan: you’re supposed to put something nicer on top of it.

Sure, but all the docs are so kludgy, it makes me wonder about the quality of the implementation also.

Even something as simple as knowing to use DoubleArray and not GArrowDoubleArray, etc. is not easy.

Part of it is that the choice of using closure ‘objects’ in the racket gir library is a bit klunky

I don’t really care about the aesthetics of gir, but if there is a cost to the abstraction, it’s a deal breaker.

Probably mostly mental anguish

It appears there’s no way to get Arrow to create an array. All the functions expect a pointer to already-allocated memory to be passed in.


Thanks for all your help @samdphillips. I’ve decided that if I get involved with this, it will be a direct implementation, and not a GObject wrapper.

You need to pass in something the collector won’t trash though

Very tiny example, loads a Arrow Double Array with some numbers: <https://gist.github.com/samdphillips/db507867cebff520c2affccc73dba6c9>

Thanks for that @samdphillips it doesn’t look quite so bad, now that you connected all the dots! I’m still concerned about the long-term implications of being tied to the Arrow C library which in turn is built on the C++ library. If that can be abstracted over, it may not be an issue.

I may take your example, and expand it to use flomat and see if I can get some zero-copy functionality going e.g. create a flomat matrix, convert to an Arrow array w/o copying, then slice that, and convert the slice to a flomat matrix w/o copying. I’ll have to look at the flomat source to see if the latter is possible.

Yes, it should be possible. A flomat is just a struct: (struct flomat (m n a lda) ...), so setting those 4 fields from Arrow info is trivial.
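(A hedged sketch of the Arrow -> flomat direction, assuming the flomat struct constructor is accessible and that buf-ptr is a cpointer to the Arrow buffer’s float64 data stored column-major; none of these names come from an actual binding:)
(require flomat) ; assumes the struct constructor is exported, which may not be the case

;; wrap existing Arrow data as a flomat without copying: point the `a` field
;; at the Arrow buffer; with packed columns the leading dimension is just m
(define (arrow-buffer->flomat buf-ptr m n)
  (flomat m n buf-ptr m))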

I’m assuming I can cast the raw Arrow data to the appropriate C type.

I just read the last hundred or so messages in this channel, and I thought I would add my two cents to this discussion, since I have done some work which would count as data science in Racket. Also, for full disclosure, I am the author of the existing data-frame package.
The Python data science libraries are the result of an enormous amount of work from a large number of contributors, some of them working full time on the various libraries. Here is a summary for 2021: matplotlib: 175 contributors made 4127 commits, numpy: 270 contributors made 3539 commits, scipy: 225 contributors made 2210 commits, and pandas: 408 contributors made 3171 commits.
For comparison, in 2021, the Racket plot library had 17 commits from 7 contributors, while the Racket math library had 27 commits from 9 contributors. The racket repository itself had 984 commits from 73 contributors.
Even if we assume that Racket is 100 times more productive than Python, there is no real chance of ever matching the features and performance of the Python ecosystem. I would love to be proven wrong on this.
On the other hand, not every data science task needs the absolute best: I was able to process, analyze, and visualize decent amounts of data using Racket. To do this, I had to make some improvements to the plot library and wrote the data frame package, but overall I was able to use existing Racket facilities.
I think the data science environment in Racket could be incrementally improved by looking at how to use existing libraries to solve specific problems and contributing improvements to those libraries. For my part, I am happy to assist people with the plot and data-frame libraries, and I suspect the maintainers of the other packages would also be happy to help out.

I basically agree with @alexharsanyi. When @hazel and I set out to build sawzall and graphite, (a) lots of things were already easy, (b) having a complete implementation relative to python/r/etc would be totally impossible, and (c) the only effective way to make progress is to start with something you want to do, do it using Racket, and see how you can improve.

For those who were part of the “slice” discussion earlier, I whipped this up in a couple hours: https://github.com/massung/sliver
I’d like to get thoughts before putting it up on raco. There are also a couple things I’d probably do to it first as well. I’m not well versed in typed racket, so if someone wanted it updated with type signatures, I’d be down for a PR.

Just noting that I’d really prefer that something like this be just part of Racket core, likely with a few extra features.
For example, I’d very much like to allow slices of strings and bytes to be used as input ports.