
For lists, it looks like you store the entire list in the slice and convert a `slice-ref` call into a `list-ref` with the slice start offset. For a list of 1 million items and a slice of `(slice my-list 900000 1000000)`, a `(slice-ref 0)` will be converted into a `(list-ref 900000)`. This is especially bad if the user calls `(slice-ref 0)`, `(slice-ref 1)`, etc.

For lists, you could drop the first `start` elements of the list from the slice, so `(slice-ref 0)` becomes `(list-ref 0)`; this would make slices more efficient for lists.
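
For illustration, a minimal sketch of that idea; the names and representation here are assumptions, not the package’s actual internals:

```
#lang racket

;; Sketch: when slicing a list, drop the first `start` cons cells up
;; front so element 0 of the slice is the head of the stored list.
(struct lst-slice (items length))

(define (make-list-slice xs start end)
  ;; list-tail walks the `start` cons cells exactly once, at construction
  (lst-slice (list-tail xs start) (- end start)))

(define (list-slice-ref s i)
  (unless (< -1 i (lst-slice-length s))
    (error 'list-slice-ref "index out of range: ~a" i))
  ;; (list-slice-ref s 0) is now O(1) instead of O(start)
  (list-ref (lst-slice-items s) i))
```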

> I’d very much like to allow slices of strings and bytes to be used as input ports.

You can use `make-input-port` to implement a `call-with-input-slice` or an `open-input-slice`, similar to `call-with-input-string` or `open-input-string`.

You probably also want to implement `in-slice` to allow slices to be iterated over using `for` loops, although they probably won’t be as fast as using `in-list` or `(in-vector v start end)`, since the `for` macros know how to expand those…

^Relatedly, it could be nice if there were a way to build forms that co-operated with the `for` macros to say “this thing is iterable and I’m supplying the relevant optimization, so don’t worry about it”

> For lists, it looks like you store the entire list in the slice and convert …

Yes. This is one of those “I’d like to do better” things. Your suggestion (@alexharsanyi) regarding dropping the initial `start` cons cells is what I was going to do, just not for the initial pass, which was meant to let everyone take a look-see.
> You can use `make-input-port` …

I was hoping that `open-input-string` and `open-input-bytes` allowed `start` and `end` arguments to limit where the input port reads from, but sadly they do not. Before diving into writing custom input-port code, my thought was that someone might know a simpler workaround to get what I wanted.
> You probably also want to implement `in-slice` …

Why? The slices are already sequences and can be used in `for` loops as is. Is the concern just that the one level of indirection makes it a bit slower than looping over the underlying structure natively?

Also, regarding ports, I wasn’t sure if I wanted to special-case string and byte slices -> ports as opposed to letting any slice be an input port. Aside from the fact that a list or vector could store non-character/byte values, is there any particular reason not to allow an input port to read values from a list or vector or …?
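
For what it’s worth, a minimal sketch of the `make-input-port` route for a byte-string slice; the name `open-input-bytes-slice` and its `start`/`end` arguments are inventions for illustration, not an existing API:

```
#lang racket

;; Sketch: an input port over a sub-range of a byte string, without
;; first copying the range into a fresh byte string.
(define (open-input-bytes-slice bs start end)
  (define pos start)
  (make-input-port
   'bytes-slice
   ;; read-in: fill as much of the destination buffer as remains
   (lambda (dest)
     (if (>= pos end)
         eof
         (let ([n (min (bytes-length dest) (- end pos))])
           (bytes-copy! dest 0 bs pos (+ pos n))
           (set! pos (+ pos n))
           n)))
   #f      ; peek: #f tells Racket to simulate peeking via read-in
   void))  ; close: nothing to release

;; (port->string (open-input-bytes-slice #"hello, world!" 0 5)) ; => "hello"
```

Note that regular ports only carry bytes/characters, so reading arbitrary values out of a list or vector slice would need the port’s “special” value hooks or a different abstraction entirely.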

@alexharsanyi I agree that it’s not feasible for Racket to reproduce the functionality of the Python ecosystem. However, one lesson I learned from developing web applications with Ruby/Rails for over a decade, and successfully switching to Racket, is that I didn’t need 100% of Rails (which has become quite bloated), and creating the percentage that I needed wasn’t that difficult. I’m still very much a newbie when it comes to data science, but I expect something similar may apply, as you relate from your personal experience. I’m interested in having a substantial subset of NumPy.ndarray as a start.

As a side note, I naively believed the hype that Julia was poised to supplant Python in this space. On paper, Julia looks fantastic, and I do think it is a much nicer language than Python, and the community is largely focused on scientific computing, but it seems their progress is quite slow - probably due to the massive Python community and inertia. Who knows, maybe in a couple decades they will do it. If Julia is having such a difficult time, I expect Racket will also, but Alex’s point about “not needing the best” is a good one. An example is my <https://lojic.com/covid.html|simple covid stats page> - I wrote a couple thousand lines of Julia (lots of boilerplate due to being new) because I thought I’d do “something fancy” with the data. In that instance, I could have provided the same functionality with Racket easily, and simply “shelled out” to Python if I needed fancy stuff later, so incrementally improving Racket is very worthwhile IMO.

@badkins How large is your input data to that page? Not sure what kind of operations you’re doing to get the data you care about, but may want to take a look at (shameless plug) `tabular-asa`.


If there is some feature you need that it lacks, I’d be happy to implement it.

The data is the <https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series|Johns Hopkins dataset> - it’s not huge. I will probably port the code to Python first, as another learning exercise. My current plan is to become competent with the Python ecosystem first, and then see what I might be able to contribute on the Racket side.

I find the idea of having a solid `ndarray` that many other things could be built on top of fairly compelling.

But, the difference between practice and theory is much greater in practice than in theory :) Hence, my current Python research :)

Unless you mean to use an ndarray as a cell’s value (or need matrix math), I’ve almost never had a need for them wrt tables/dataframes. You almost always want columnar data anyway, for performance reasons.

I expect ndarray uses columnar, right? Isn’t Pandas built on ndarray? That’s what I’m talking about.

Another way to look at this is, take ndarray away from the Python ecosystem, what breaks?

Pandas likely does lots of things under the hood to handle specialized situations efficiently. But storing data frames as ndarrays would likely not be one of them. It would just be fundamentally too slow for lots of things.

That said, maybe Pandas does use it for certain things.

I could very well be mistaken. Like I said, I need to become competent first :)

But most of the time, you want column-major data

Not intending to toot, but I’m speaking as someone who worked professionally writing databases for a period of time and writes dataframe code every day currently (mostly using Python, but R and other languages as well)

God I hate R. :stuck_out_tongue:

Column-major is a given; hence, the interest in Apache Arrow.

Arrow - once it merged w/ Parquet - has a lot of nice things going for it. But until your dataset is large enough to really benefit from distributed work across nodes (and/or files), you likely won’t see much benefit in using it unless you just want to take advantage of various aspects of it (like the Parquet format or running on GPUs)

But if it’s just a learning exercise, I can’t knock that effort. I do that every day “for fun” too :slightly_smiling_face:

And my wife rolls her eyes at me constantly for it.

I suppose if you ended up never interoperating with other things, it could be a waste of time, but otherwise, since you have to decide on a layout anyway, why not choose the Arrow layout?

I’m pretty sure I could receive Arrow data and use it in flomat with zero copy - seems like a plus.

And if the computing functionality of Arrow ever expanded to be on par with NumPy, you could get it for free via the C API.

I don’t think the latter is a stated goal, but they do provide some <https://arrow.apache.org/docs/c_glib/arrow-glib/compute.html|basic stuff> now.

I’m not knocking using Arrow. Go for it! :slightly_smiling_face:
But, there are areas where - due to the optimizations it’s implementing - you may run into issues given whatever your data is.
For example, columns must be a specific type for every value in that column. It’s not possible to mix floats and strings in the same column. This is typically a great thing, but can cause issues if your data isn’t cleaned first or you’re pulling from crazy sources (e.g. clinicians — ugh).
I just want to note that until the data size hits a critical mass, most of the performance is algorithmic, which you’ll get with any columnar library, be it Pandas, tabular-asa, or something else. The Arrow stuff is about taking all those algorithmic advantages and then giving them crack for taking advantage of multi-core, cache coherency, SIMD, etc. All great things. It just depends on what your goals are.
If it’s just to process some 1M rows of data quickly, don’t bother with it, as you won’t see a benefit and may even see some slowdown (due to FFI, processing setup, reductions, etc.).
If your goal is to learn some awesome stuff, add an Arrow package to the Racket community, or you have some serious data to process, then absolutely go for it! :smile:

Quotes like “the pandas library relies heavily on the NumPy array for the implementation of pandas data objects” from <https://medium.com/@ericvanrees/pandas-series-objects-and-numpy-arrays-15dfe05919d7|this post> are what made me think the NumPy.ndarray is “more fundamental” than the Pandas data frame.

Arrow has Union though, right?


Pandas columns have types associated with them (obviously). Quite often the type is `object`, which means the cell value can be any Python value and there’s literally no advantage being gained from NumPy. But if the type is something like `numpy.int64`, then yes, you’ll get some speed-ups for sure.

> Arrow has Union though, right?

I’ve never used it, so I can’t speak to it. I’m sure it’s efficient, though. :slightly_smiling_face:

re: your comment above, my “goal” is a simple one - to minimize the need to use other languages than Racket :) I’m currently trying to find the appropriate dividing line between Racket and Python, and then to nudge that line ever so slightly toward Racket :)

Given the layered approach to much of software, it makes sense to me to consider what the more foundational pieces are, and make sure they’re worth building upon.

It sounds like I may be mistaken though re: Pandas relying on ndarray. If Pandas is doing “its own thing” for data, that’s a different story.

Regardless, I need to go get familiar with the existing Python ecosystem, then things will become more clear to me.

In this particular area, the biggest issue Racket has is in data loading. There are no packages for loading Parquet or Feather data files (that I know of — there’s a parquet package in raco, but it has no docs and doesn’t appear to do anything). CSV loading is really slow, as is JSON parsing. I’m not sure why reading data in Racket is so slow, but I’m hoping to speed up CSV parsing for tabular-asa (hence some of this slicing work)

Once the data is in Racket, though, processing tables, columns, etc. is pretty darn speedy.

I can see slow loading is an issue, but it doesn’t seem “fundamental” to me, in the sense that there isn’t something lower level it depends on that needs fixing. AFAICT.

I’m pretty sure it’s just a metric ton of copying/allocating that is totally unnecessary.

I suppose I’m really thinking of stable APIs. If the API is solid, the implementation can be improved later, but if the API is not good, then higher layers will need rework later, and that’s a more serious problem.

sure. that makes sense.

It is amazing how fast I/O can be in interpreted languages like Ruby and Python though!

Because usually it isn’t (it’s in C). :wink:

Hopefully it’s not necessary to do csv reading in C for Racket.

Yes, I know it’s in C :) My point is that they’ve optimized some really important stuff.

FWIW, `in-slice` is already a thing (https://docs.racket-lang.org/reference/sequences.html#%28def._%28%28lib._racket%2Fsequence..rkt%29._in-slice%29%29), and it means something else, so we should pick a different name.
`in-...` could be defined using `define-sequence-syntax` + `:do-in`, which could boost the performance significantly.
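
To illustrate the `define-sequence-syntax` + `:do-in` idea, here is a sketch under an assumed name (`in-slice-range`) and a stand-in, vector-only slice struct; a real implementation would dispatch on the slice’s underlying type:

```
#lang racket
(require (for-syntax racket/base))

;; Stand-in representation, not the package's real one.
(struct slice (vec start end))

;; Fallback, used when the form escapes a `for` clause position.
(define (in-slice-range/proc s)
  (in-vector (slice-vec s) (slice-start s) (slice-end s)))

(define-sequence-syntax in-slice-range
  (lambda () #'in-slice-range/proc)
  (lambda (stx)
    (syntax-case stx ()
      [[(x) (_ s-expr)]
       #'[(x)
          (:do-in
           ;; outer bindings: pull the fields out once
           ([(v lo hi) (let ([s s-expr])
                         (values (slice-vec s) (slice-start s) (slice-end s)))])
           #t                        ; outer check
           ([i lo])                  ; loop binding
           (< i hi)                  ; position guard
           ([(x) (vector-ref v i)])  ; inner binding
           #t #t                     ; pre- and post-guards
           ((+ i 1)))]])))           ; loop step

;; (for ([x (in-slice-range (slice #(a b c d e) 1 4))]) (displayln x))
```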

oh, lol. yeah for sure.

I think what each community is good at - for successful languages - is knowing what their audience needs and making it happen. Racket just hasn’t prioritized I/O because, for its users, it hasn’t been a priority. I don’t believe anyone is trying to use Racket to process GBs worth of data.

I work with a decent size database where I need to reload 12 GB of data four times a year. That’s coming up on Saturday, so I may take the opportunity to do some benchmarking of I/O in Racket & Python.

I converted Ruby code to Racket, made use of places to parallelize things, and the process is dramatically faster. Although, that’s not quite fair, because I also changed how I load the data into Postgres.

I’m unclear as to the benefits of having a `slice-range` (or whatever it’s called) function as opposed to just implementing `#:property prop:sequence`.
Obviously the current implementation I have sucks for lists (so it needs improving), but for all the other sequence types it works just fine in `for` loops.
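
For context, this is the `prop:sequence` option being referred to, sketched with assumed field names:

```
#lang racket

;; Attaching prop:sequence makes the struct itself usable anywhere a
;; sequence is expected, including directly in `for` loops.
(struct slice (vec start end)
  #:property prop:sequence
  (lambda (s)
    (in-vector (slice-vec s) (slice-start s) (slice-end s))))

;; (for/list ([x (slice #(1 2 3 4 5) 1 4)]) x) ; => '(2 3 4)
```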

IIRC the idea is that, without any annotation, those for loops have to go through some fairly generic machinery, which can be expensive relative to expanding to some special machinery it can handle inline (I think it’s really the cost of some kind of generic dispatch versus being able to call unsafe operations directly.)

@massung somewhat germane to this discussion, just out of curiosity, what are the reasons you sometimes need to use R vs. Python?

I work with computational biologists who know R well enough to cause problems, but don’t know Python. :slightly_smiling_face:

So usually it’s taking their code and either fixing it or (rarely) porting it.

It’s not officially stated on the website, but the project’s founder has said it’s a major goal in some conference talks. My guess is it’s not on the immediate roadmap because Python, R, and Julia already have their own packages that provide that functionality.
I’m guessing that, if DataFusion takes off, that will attract more interest from Rustaceans, and that will lead to a fairly robust set of analytic functions in a C-compatible library.

For my part, I prefer to use R when I’m doing type A data science, and I prefer to use Python for type B work.
Python is still not as robust or easy to use for statistical analysis and data visualization, and doesn’t have a great answer to RStudio or RMarkdown. And R is still a chore to productionize.

RStudio is nice. I just dislike how in R there are literally 50 ways to accomplish the same thing. And I don’t mean that in the way Lisp/Scheme has many solutions to an algorithmic problem. More like how - in the same file - I’ll come across…
```
x <- 10
x + 1 -> x
x = 12
```
The inconsistency of something as basic as variable assignment drives me crazy.
And then there’s all the operator overloading (a la Haskell). Unless you “know” it, it’s totally unobvious what an operator like `%>%` does.

It just adds unnecessary cognitive load IMO.

@seanbunderwood - I’m curious what you mean by “type A” vs. “type B”?

Also, do you use Jupyter much? I have the list of things I dislike about it, but I generally use RStudio the same way I use Jupyter, so there’s not much of a win for RStudio over it (for me). I’m curious how your use differs.

Absolutely true. But it also doesn’t really bother me for the things I do in R. Maintainability isn’t a huge concern for code that doesn’t need to be maintained.


TL;DR: A stands for Analysis and B stands for Building.

Okay, I’ve pushed an update that adds the following changes:
• Better list slicing: the slice now drops the start of the source list, so refs are always relative to the head of the slice (`(slice-ref xs 0)` will be O(1) instead of O(start)).
• A `slice-range` function, which defaults to the proper `in-` method for the fundamental types, plus a better `sequence-map` for slices that handles lists without using `slice-ref` to index into them, for much improved performance.

Still more to do, but I’m generally curious as to whether or not others think that this - or some heavily modified version implemented by the Racket team - would be beneficial in `racket/base`?

Makes sense. Hadn’t heard that before. I’d say that I primarily work with “type A” and get the analysis thrown over the wall at me to build. But, I’ll sometimes dip my feet into type A work when needed. :wink:

I use Jupyter all the time. It’s my Python REPL.
I don’t really consider Jupyter and RStudio to be all that comparable. RStudio is a full-on analytical IDE with code editor, environment inspector, REPL, and a report-oriented flavor of notebooks.
Jupyter started as an improved Python REPL, and grew to include a nice GUI and the ability to organize code a bit better. But it’s still mostly just a REPL, and tends to become troublesome if you try to use it as much more than that.

I’d also like to get thoughts on how best to print slices. In my ideal world you’d see the materialized slice printed, but without having to actually materialize it. But, without doing so, I’m not sure if there’s a simple way of getting the default printing of each type?
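
One possibility, sketched with assumed field names: walk the slice and hand each element to the regular printer, so the elements keep their default printed forms but no materialized copy is built (the result prints element-by-element rather than exactly like the materialized value would):

```
#lang racket

;; Sketch: custom printing for a slice via prop:custom-write.
(struct slice (vec start end)
  #:property prop:custom-write
  (lambda (s port mode)
    ;; mode is #t for write, #f for display, 0 or 1 for print
    (define recur
      (case mode
        [(#t) write]
        [(#f) display]
        [else (lambda (v p) (print v p mode))]))
    (write-string "#<slice:" port)
    (for ([x (in-vector (slice-vec s) (slice-start s) (slice-end s))])
      (write-string " " port)
      (recur x port))
    (write-string ">" port)))

;; (slice #(1 "two" #\3) 0 3) ; prints as #<slice: 1 "two" #\3>
```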

It honestly feels a little weak as a REPL, too, sometimes. R’s REPL is more lisp-like, in that you can save full images. I don’t believe Jupyter lets you do that; all you can save is code and outputs. It also doesn’t let you just run individual snippets from a file you’re working on. You have to copy/paste them into the notebook. So, in my experience, there can be a lot of copying back and forth when you get to really hacking on something.

Yeah. I’ve never used RStudio as anything more than a REPL. I think I’ve opened a source file in it only a handful of times to edit and run.

@massung <https://www.infoq.com/articles/apache-arrow-java/|this article> seems to support your conjecture that Arrow might not be worth it “in the small”. I think it’s a given Racket is not going to be used for cluster-sized big data tasks. Just using an internal format that is column-major, as flomat does, may be sufficient for easily getting data into/out of Arrow should that need arise.

In the end, I’ll note that while this package works as a POC, it’s definitely not what I would want from built-in slicing.
I’d love to be able to do something like `(string-upcase (slice "hello, world!" 0 5)) ;=> "HELLO"`, but it’s not possible to use existing functions like `string-upcase` on slices without materializing them, which really sucks. :disappointed:

It probably depends a bit on what components you use? DataFusion is still new, but its stated goal is to be the fastest single-node data table engine. Already it’s posting fairly impressive numbers.

@seanbunderwood at a very fundamental level, I don’t have a good grasp on the implications of committing to zero-copy. It seems you have to work on the data in-place in the columnar format, so maybe there is a lot of duplicated functionality required.

But that’s for more general purpose data manipulation and SQL queries, not just number crunching.

That’s fine if you’re doing matrix stuff, and you can simply assign the pointer in flomat to the data, and then do some linear algebra, but in other contexts, it seems like we’d need to get the data into native Racket objects anyway.

For example, I can see the benefit of storing all the strings in a column contiguously, if you’re writing imperative C code, but not for Racket code.

I guess the idea is that you have to “vectorize” everything? I’m using “vectorize” loosely, i.e. pushing functionality down into the middleware.

So you don’t operate on a string, but you ask the middleware to do something on the list of strings, and then very late in the pipeline, you convert to a Racket string? I dunno, my mind is a bit overwhelmed at this point :)

Well, it would be just the pointers that are contiguous, anyway. Column-major is more of an advantage for compression than cache locality for non-atomic types, when you’re talking raw performance.
The other advantage for column major, though, is that data table manipulations tend to be more column-oriented than row-oriented in practice. And it’s cheaper to add, drop, or replace one column of n values than it is to mutate or replace n rows with m fields.

No, I think the point of Arrow, as I’ve just read, is that the strings are actually stored contiguously, not the pointers.

See <https://wesmckinney.com/blog/apache-arrow-pandas-internals/|Wes’ blog post>

In both the file format and the memory format?

“In pandas, an array of strings is an array of `PyObject` pointers, and the actual string data lives inside `PyBytes` or `PyUnicode` structs that live all over the process heap.” -> “In Arrow, each string is right next to the previous one in memory, so you can scan all of the data in a column of strings without any cache misses. Processing contiguous bytes right against the metal, guaranteed.”
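
To make that concrete, here’s a toy model of an Arrow-style variable-length string column; this is purely illustrative, not Arrow’s actual API:

```
#lang racket

;; All character data lives in one contiguous buffer; an offsets array
;; marks where each value begins, so value i spans the byte range
;; [offsets[i], offsets[i+1]). No per-string heap objects.
(define data    #"hellohihowdy")
(define offsets (vector 0 5 7 12))

(define (column-ref i)
  (bytes->string/utf-8
   (subbytes data (vector-ref offsets i) (vector-ref offsets (+ i 1)))))

;; (column-ref 1) ; => "hi"
;; A scan over the column walks `data` sequentially: cache-friendly.
```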

Yes, in the memory format.

Sounds great until you start thinking about how you’d work with those strings in Racket :)

Seems like this is an age-old issue - disparity between binary formats and particular language implementation details.

I think Julia might have facilities for making nice type wrappers around just “bits”.

Ah, that’s nice to see. Explains why they’ve been working so hard on adding string intrinsics to the library.

Which is why I’d probably do what Pandas does and encourage people to use the string intrinsics whenever possible, rather than mapping arbitrary functions over the data.

I’m thinking there’s a catch-22 here - once (if??) Arrow has rich enough functionality internally, then maybe it can do all the aggregate stuff that’s needed and you only convert to Racket later, but until then, it looks pretty rough.

Re: strings in columns, this is why DBs break out `VARCHAR` vs. `TEXT`. A column of 1M `VARCHAR(40)`s will be 40M bytes (assuming no compression), while 1M `TEXT` values will actually be 1M pointers to strings.

Well, that depends on the DB. In Postgres, there’s explicitly no difference in how they’re stored.

(“bytes” above should be “chars” if talking about UTF encoding, wide chars, etc)

I thought Postgres did not distinguish between varchar and text now.

Ok, @seanbunderwood sounds like you’re confirming my suspicion about the need to push more/most functionality down into the engine.

If that’s the case, then I think the only feasible way to use Arrow in Racket is to use the C library - way too much work to create this functionality natively in Racket using the Arrow memory layout.

But that’s likely going to require data conversion in most cases, so the zero copy benefits will be lost.

It’s generally the direction everyone is going in. Another advantage, in terms of performance, is that it gives more leeway to the optimizer.

I hope these Arrow developers are good then :)

The team includes a lot of the people who pushed Python to the center of the data universe, and unseated R and Java in the process. They might not succeed, but it wouldn’t be because they aren’t good.

I would say, the serialization chunks are already established, insofar as they’ve been embraced by the Spark project. That probably guarantees a decent level of support for as far in the future as it’s possible to project in this domain.
The bigger question is if the compute components, DataFusion, or Ballista really get any traction. And that’s probably down to how much different communities decide to collaborate on tools like that, versus how much they would rather do their own thing.

> Quotes like “the pandas library relies heavily on the NumPy array for the implementation of pandas data objects” from <https://medium.com/@ericvanrees/pandas-series-objects-and-numpy-arrays-15dfe05919d7|this post> are what made me think the NumPy.ndarray is “more fundamental” than the Pandas data frame.

Yes, I agree here. NumPy is more generic and used by several other libraries.
