alexharsanyi
2022-1-4 08:54:22

For lists, it looks like you store the entire list in the slice and convert a slice-ref call into a list-ref with the slice start offset.

For a list of 1 million items and a slice of (slice my-list 900000 1000000), a (slice-ref 0) will be converted into a (list-ref 900000). This is especially bad if the user calls (slice-ref 0), (slice-ref 1), etc.

For lists, you could drop the first start elements of the list from the slice, so (slice-ref 0) becomes (list-ref 0); this would make slices more efficient for lists.
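Something like this (a minimal sketch with hypothetical struct and accessor names, not the package’s actual ones):

```
#lang racket/base

;; Drop the first `start` cons cells once, when the slice is created,
;; so later refs index from the slice head. Hypothetical names.
(struct slice (data start end))

(define (make-list-slice xs start end)
  (slice (list-tail xs start) start end))

(define (slice-ref s i)
  ;; (slice-ref s 0) is now O(1) instead of O(start)
  (list-ref (slice-data s) i))
```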


alexharsanyi
2022-1-4 08:55:37

> I’d very much like to allow slices of strings and bytes to be used as input ports.

You can use make-input-port to implement a call-with-input-slice or open-input-slice, similar to call-with-input-string or open-input-string.
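A rough sketch for the bytes case (hypothetical name, assuming the slice’s backing byte string and its [start, end) window are available; it never copies the backing bytes):

```
#lang racket/base

;; An input port over a window [start, end) of a byte string.
(define (open-input-bytes-slice bs start end)
  (define pos start)
  (make-input-port
   'bytes-slice
   ;; read-in: fill the supplied buffer from the remaining window
   (lambda (out)
     (cond
       [(>= pos end) eof]
       [else
        (define n (min (bytes-length out) (- end pos)))
        (bytes-copy! out 0 bs pos (+ pos n))
        (set! pos (+ pos n))
        n]))
   #f      ; let Racket derive peeking from read-in
   void))  ; nothing to close

;; (read-bytes 5 (open-input-bytes-slice #"hello, world!" 0 5)) ; => #"hello"
```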


alexharsanyi
2022-1-4 08:58:21

You probably also want to implement in-slice to allow slices to be iterated over using for loops, although they probably won’t be as fast as using in-list or (in-vector v start end), since the for macros know how to expand those…


ben.knoble
2022-1-4 12:41:07

^Relatedly, it would be nice if there were a way to build forms that cooperate with the for macros to say “this thing is iterable and I’m supplying the relevant optimization, so don’t worry about it”


massung
2022-1-4 14:30:06

> For lists, it looks like you store the entire list in the slice and convert

Yes. This is one of those “I’d like to do better” things. Your suggestion (@alexharsanyi) of dropping the initial start cons cells is what I was going to do, just not in the initial pass, which was meant to let everyone take a look-see.

> You can use make-input-port…

I was hoping that open-input-string and open-input-bytes allowed start and end arguments to limit where the input port reads from, but sadly they do not. Before diving into writing custom input port code, I thought someone might know a simpler workaround to get what I wanted.

> You probably also want to implement in-slice…

Why? The slices are sequences already and can be used in for loops as is. Is the concern just that the one level of indirection makes them a bit slower than looping over the underlying structure natively?


massung
2022-1-4 14:32:31

Also, regarding ports, I wasn’t sure if I wanted to special-case string and byte slices -> ports as opposed to letting any slice be an input port. Aside from the fact that a list or vector could store non-character/byte values, is there any particular reason not to allow an input port to read values from a list or vector or …?


badkins
2022-1-4 14:50:08

@alexharsanyi I agree that it’s not feasible for Racket to reproduce the functionality of the Python ecosystem. However, one lesson I learned from developing web applications with Ruby/Rails for over a decade, and successfully switching to Racket, is that I didn’t need 100% of Rails (which has become quite bloated), and creating the percentage that I needed wasn’t that difficult. I’m still very much a newbie when it comes to data science, but I expect something similar may apply, as you relate from your personal experience. I’m interested in having a substantial subset of NumPy.ndarray as a start.


badkins
2022-1-4 15:04:16

As a side note, I naively believed the hype that Julia was poised to supplant Python in this space. On paper, Julia looks fantastic, I do think it is a much nicer language than Python, and the community is largely focused on scientific computing, but their progress seems quite slow - probably due to the massive Python community and inertia. Who knows, maybe in a couple of decades they will do it. If Julia is having such a difficult time, I expect Racket will too, but Alex’s point about “not needing the best” is a good one. An example is my <https://lojic.com/covid.html|simple covid stats page> - I wrote a couple thousand lines of Julia (lots of boilerplate due to being new) because I thought I’d do “something fancy” with the data. In that instance, I could have provided the same functionality easily with Racket and simply “shelled out” to Python if I needed fancy stuff later, so incrementally improving Racket is very worthwhile IMO.


massung
2022-1-4 15:05:58

@badkins How large is your input data for that page? Not sure what kind of operations you’re doing to get the data you care about, but you may want to take a look at (shameless plug) tabular-asa.



massung
2022-1-4 15:07:17

If there is some feature it lacks you need, I’d be happy to implement it.


badkins
2022-1-4 15:09:19

The data is the <https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series|Johns Hopkins dataset> - it’s not huge. I will probably port the code to Python first, as another learning exercise. My current plan is to become competent with the Python ecosystem first, and then see what I might be able to contribute on the Racket side.


badkins
2022-1-4 15:10:05

I find the idea of having a solid ndarray that many other things could be built on top of fairly compelling.


badkins
2022-1-4 15:10:56

But, the difference between practice and theory is much greater in practice than in theory :) Hence, my current Python research :)


massung
2022-1-4 15:11:15

unless you mean to use an ndarray as a cell’s value (or need matrix math), I’ve almost never had a need for them wrt tables/dataframes. you almost always want columnar data anyway for performance reasons


badkins
2022-1-4 15:12:06

I expect ndarray uses columnar, right? Isn’t Pandas built on ndarray? That’s what I’m talking about.


badkins
2022-1-4 15:12:32

Another way to look at this is, take ndarray away from the Python ecosystem, what breaks?


massung
2022-1-4 15:13:30

Pandas likely does lots of things under the hood to handle specialized situations efficiently. But storing data frames as ndarrays would likely not be one of them. It would just be fundamentally too slow for many operations.


massung
2022-1-4 15:14:12

That said, maybe Pandas does use it for certain things.


badkins
2022-1-4 15:14:28

I could very well be mistaken. Like I said, I need to become competent first :)


massung
2022-1-4 15:14:37

But most of the time, you want column-major data


massung
2022-1-4 15:15:30

Not intending to toot, but I’m speaking as someone who worked professionally writing databases for a period of time and writes dataframe code every day currently (mostly using Python, but R and other languages as well)


massung
2022-1-4 15:15:37

God I hate R. :stuck_out_tongue:


badkins
2022-1-4 15:16:48

Column-major is a given; hence, the interest in Apache Arrow.


massung
2022-1-4 15:19:20

Arrow - once it merged w/ Parquet - has a lot of nice things going for it. But until your dataset is large enough to really benefit from distributed work across nodes (and/or files), you likely won’t see much benefit in using it unless you just want to take advantage of various aspects of it (like the Parquet format or running on GPUs)


massung
2022-1-4 15:21:06

But if it’s just a learning exercise, I can’t knock that effort. I do that every day “for fun” too :slightly_smiling_face:


massung
2022-1-4 15:21:25

And my wife rolls her eyes at me constantly for it.


badkins
2022-1-4 15:22:43

I suppose if you ended up never interoperating with other things, it could be a waste of time, but otherwise, since you have to decide on a layout anyway, why not choose the Arrow layout?


badkins
2022-1-4 15:23:19

I’m pretty sure I could receive Arrow data and use it in flomat with zero copy - seems like a plus.


badkins
2022-1-4 15:25:29

And if the computing functionality of Arrow ever expanded to be on par with NumPy, you could get it for free via the C API.


badkins
2022-1-4 15:27:29

I don’t think the latter is a stated goal, but they do provide some <https://arrow.apache.org/docs/c_glib/arrow-glib/compute.html|basic stuff> now.


massung
2022-1-4 15:31:16

I’m not knocking using Arrow. Go for it! :slightly_smiling_face:

But, there are areas where - due to the optimizations it’s implementing - you may run into issues given whatever your data is.

For example, columns must be a specific type for every value in that column. It’s not possible to mix floats and strings in the same column. This is typically a great thing, but can cause issues if your data isn’t cleaned first or you’re pulling from crazy sources (e.g. clinicians — ugh).

I just want to note that until the data size hits a critical mass, most of the performance is algorithmic, which you’ll get with any columnar library, be it Pandas, tabular-asa, or something else. The Arrow stuff is about taking all those algorithmic advantages and then giving them crack for taking advantage of multi-core, cache coherency, SIMD, etc. All great things. It just depends on what your goals are.

If it’s just to process some 1M rows of data quickly, don’t bother with it, as you won’t see a benefit and may even see some slowdown (due to FFI, processing setup, reductions, etc.).

If your goal is to learn some awesome stuff, add an Arrow package to the Racket community, or you have some serious data to process, then absolutely go for it! :smile:


badkins
2022-1-4 15:32:35

Quotes like “the pandas library relies heavily on the NumPy array for the implementation of pandas data objects” from <https://medium.com/@ericvanrees/pandas-series-objects-and-numpy-arrays-15dfe05919d7|this post> are what made me think the NumPy.ndarray is “more fundamental” than the Pandas data frame.


badkins
2022-1-4 15:34:08

Arrow has Union though, right?



massung
2022-1-4 15:34:26

Pandas columns have types associated with them (obviously). Quite often the type is object, which means the cell value can be any Python value and literally no advantage is gained from NumPy. But if the type is something like numpy.int64, then yes, you’ll get some speed-ups for sure.


massung
2022-1-4 15:35:47

> Arrow has Union though, right?

I’ve never used it, so I can’t speak to it. I’m sure it’s efficient, though. :slightly_smiling_face:


badkins
2022-1-4 15:37:48

re: your comment above, my “goal” is a simple one - to minimize the need to use other languages than Racket :) I’m currently trying to find the appropriate dividing line between Racket and Python, and then to nudge that line ever so slightly toward Racket :)


badkins
2022-1-4 15:38:43

Given the layered approach to much of software, it makes sense to me to consider what the more foundational pieces are, and make sure they’re worth building upon.


badkins
2022-1-4 15:40:07

It sounds like I may be mistaken though re: Pandas relying on ndarray. If Pandas is doing “its own thing” for data, that’s a different story.


badkins
2022-1-4 15:41:24

Regardless, I need to go get familiar with the existing Python ecosystem, then things will become more clear to me.


massung
2022-1-4 15:41:46

In this particular area, the biggest issue Racket has is data loading. There are no packages for loading Parquet or Feather data files (that I know of — there’s a parquet package in raco, but it has no docs and doesn’t appear to do anything). CSV loading is really slow, as is parsing JSON. I’m not sure why reading data in Racket is so slow, but I’m hoping to speed up CSV parsing for tabular-asa (hence some of this slicing work).


massung
2022-1-4 15:42:38

Once the data is in Racket, though, processing tables, columns, etc. is pretty darn speedy.


badkins
2022-1-4 15:43:18

I can see slow loading is an issue, but it doesn’t seem “fundamental” to me, in the sense that there’s nothing lower level, on which it depends, that needs fixing. AFAICT.


massung
2022-1-4 15:44:06

I’m pretty sure it’s just a metric ton of copying/allocating that is totally unnecessary.


badkins
2022-1-4 15:45:42

I suppose I’m really thinking of stable APIs. If the API is solid, the implementation can be improved later, but if the API is not good, then higher layers will need rework later, and that’s a more serious problem.


massung
2022-1-4 15:45:57

sure. that makes sense.


badkins
2022-1-4 15:46:50

It is amazing how fast I/O can be in interpreted languages like Ruby and Python though!


massung
2022-1-4 15:47:18

Because usually it isn’t (it’s in C). :wink:


badkins
2022-1-4 15:47:24

Hopefully it’s not necessary to do csv reading in C for Racket.


badkins
2022-1-4 15:48:16

Yes, I know it’s in C :) My point is that they’ve optimized some really important stuff.


sorawee
2022-1-4 15:48:37

FWIW, in-slice is already a thing (https://docs.racket-lang.org/reference/sequences.html#%28def._%28%28lib._racket%2Fsequence..rkt%29._in-slice%29%29), and it means something else. So we should pick a different name.

in-... could be defined using define-sequence-syntax + :do-in, which could boost the performance significantly.
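For example, a minimal sketch of that approach (using a hypothetical in-slice* name to avoid the clash, and assuming a vector-backed slice struct with data/start/end fields):

```
#lang racket/base
(require (for-syntax racket/base))

(struct slice (data start end))

;; Fallback when the form is used as a plain expression.
(define (in-slice*/proc s)
  (in-vector (slice-data s) (slice-start s) (slice-end s)))

;; Fast path: in a for clause, expand directly to an indexed loop.
(define-sequence-syntax in-slice*
  (lambda () #'in-slice*/proc)
  (lambda (stx)
    (syntax-case stx ()
      [[(x) (_ s-expr)]
       #'[(x)
          (:do-in
           ;; outer bindings: fetch the backing vector and bounds once
           ([(v lo hi) (let ([s s-expr])
                         (values (slice-data s) (slice-start s) (slice-end s)))])
           #t                        ; outer check (nothing to verify here)
           ([i lo])                  ; loop binding
           (< i hi)                  ; position guard
           ([(x) (vector-ref v i)])  ; inner binding: the current element
           #t #t                     ; pre/post guards
           [(+ i 1)])]])))           ; loop argument: next index

;; (for ([x (in-slice* (slice #(1 2 3 4 5) 1 4))]) (displayln x)) ; => 2 3 4
```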


massung
2022-1-4 15:48:40

oh, lol. yeah for sure.


massung
2022-1-4 15:49:51

I think what each community is good at - for successful languages - is knowing what their audience needs and making it happen. Racket just hasn’t made IO fast because it hasn’t been a priority: I don’t believe anyone is trying to use Racket to process GBs worth of data.


badkins
2022-1-4 15:49:57

I work with a decent size database where I need to reload 12 GB of data four times a year. That’s coming up on Saturday, so I may take the opportunity to do some benchmarking of I/O in Racket & Python.


badkins
2022-1-4 15:51:32

I converted Ruby code to Racket, made use of places to parallelize things, and the process is dramatically faster. Although, that’s not quite fair, because I also changed how I load the data into Postgres.


massung
2022-1-4 15:52:01

I’m unclear as to the benefits of having a slice-range (or whatever it’s called) function as opposed to just implementing #:property prop:sequence.

Obviously the current implementation I have sucks for lists (so it needs improvement), but for all the other sequence types it works just fine in for loops.
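For context, the prop:sequence version looks roughly like this for a list-backed slice (hypothetical names; every element then flows through the generic sequence machinery, which is where the overhead comes from):

```
#lang racket/base
(require racket/list)

;; Any slice becomes iterable by for via prop:sequence, at the cost
;; of generic dispatch per iteration.
(struct slice (data start end)
  #:property prop:sequence
  (lambda (s)
    (in-list (take (drop (slice-data s) (slice-start s))
                   (- (slice-end s) (slice-start s))))))

;; (for ([x (slice '(a b c d e) 1 4)]) (displayln x)) ; => b c d
```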


ben.knoble
2022-1-4 15:55:39

IIRC the idea is that, without any annotation, those for loops have to go through some fairly generic machinery, which can be expensive relative to expanding to some special machinery it can handle inline (I think it’s really the cost of some kind of generic dispatch versus being able to call unsafe operations directly.)


badkins
2022-1-4 16:09:49

@massung somewhat germane to this discussion, just out of curiosity, what are the reasons you sometimes need to use R vs. Python?


massung
2022-1-4 16:11:15

I work with computational biologists who know R well enough to cause problems, but don’t know Python. :slightly_smiling_face:


massung
2022-1-4 16:11:47

So usually it’s taking their code and either fixing it or (rarely) porting it.


seanbunderwood
2022-1-4 16:21:50

It’s not officially stated on the website, but the project’s founder has said it’s a major goal in some conference talks. My guess is it’s not on the immediate roadmap because Python, R, and Julia already have their own packages that provide that functionality.

I’m guessing that, if DataFusion takes off, that will attract more interest from Rustaceans, and that will lead to a fairly robust set of analytic functions in a C-compatible library.


seanbunderwood
2022-1-4 16:28:11

For my part, I prefer to use R when I’m doing type A data science, and I prefer to use Python for type B work.

Python is still not as robust or easy to use for statistical analysis and data visualization, and doesn’t have a great answer to RStudio or RMarkdown. And R is still a chore to productionize.


massung
2022-1-4 16:32:32

RStudio is nice. I just dislike how in R there is literally 50 ways to accomplish the same thing. And I don’t mean that in the way Lisp/scheme has many solutions to an algorithmic problem. More like how - in the same file - I’ll come across…

x <- 10
x + 1 -> x
x = 12

The inconsistency of something as basic as variable assignment drives me crazy.

And then there’s all the operator overloading (ala Haskell). Unless you “know” it, it’s totally unobvious what an operator like %>% does.


massung
2022-1-4 16:35:25

It just adds unnecessary cognitive load IMO.


massung
2022-1-4 16:35:44

@seanbunderwood - I’m curious what you mean by “type A” vs. “type B”?


massung
2022-1-4 16:36:41

Also, do you use Jupyter much? I have the list of things I dislike about it, but I generally use RStudio the same way I use Jupyter, so there’s not much of a win for RStudio over it (for me). I’m curious how your use differs.


seanbunderwood
2022-1-4 16:36:44

Absolutely true. But it also doesn’t really bother me for the things I do in R. Maintainability isn’t a huge concern for code that doesn’t need to be maintained.



seanbunderwood
2022-1-4 16:38:35

TL;DR: A stands for Analysis and B stands for Building.


massung
2022-1-4 16:41:13

Okay, I’ve pushed an update that adds the following changes:

• There’s now better list slicing: it drops the start of the source list, so refs are always from the head of the slice ((slice-ref xs 0) will be O(1) instead of O(start)).
• There’s slice-range, which defaults to the proper in- method for the fundamental types, and a better sequence-map for slices that handles lists without using slice-ref to index into them, for much improved performance.
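For anyone curious, the slice-range dispatch is roughly in this spirit (a hedged sketch with hypothetical accessors, not the actual package code; note that take copies the list prefix, which the real version would want to avoid):

```
#lang racket/base
(require racket/list)

(struct slice (data start end))

;; Pick the specialized in- form for the backing type.
(define (slice-range s)
  (define data (slice-data s))
  (define start (slice-start s))
  (define end (slice-end s))
  (cond
    [(vector? data) (in-vector data start end)]
    [(string? data) (in-string data start end)]
    [(bytes? data)  (in-bytes data start end)]
    ;; list slices already store the dropped head of the source list
    [(list? data)   (in-list (take data (- end start)))]
    [else (error 'slice-range "unsupported backing type")]))
```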


massung
2022-1-4 16:41:56

Still more to do, but I’m generally curious whether others think that this - or some heavily modified version implemented by the Racket team - would be beneficial in racket/base.


massung
2022-1-4 16:45:02

Makes sense. Hadn’t heard that before. I’d say that I primarily work with “type A” and get the analysis thrown over the wall at me to build. But, I’ll sometimes dip my feet into type A work when needed. :wink:


seanbunderwood
2022-1-4 16:45:11

I use Jupyter all the time. It’s my Python REPL.

I don’t really consider Jupyter and RStudio to be all that comparable. RStudio is a full-on analytical IDE with code editor, environment inspector, REPL, and a report-oriented flavor of notebooks.

Jupyter started as an improved Python REPL, and grew to include a nice GUI and the ability to organize code a bit better. But it’s still mostly just a REPL, and tends to become troublesome if you try to use it as much more than that.


massung
2022-1-4 17:00:35

I’d also like to get thoughts on how best to print slices. In my ideal world you’d see the materialized slice printed, but without having to actually materialize it. Without materializing, though, I’m not sure there’s a simple way of getting the default printing of each type.


seanbunderwood
2022-1-4 17:02:15

It honestly feels a little weak as a REPL, too, sometimes. R’s REPL is more lisp-like, in that you can save full images. I don’t believe Jupyter lets you do that; all you can save is code and outputs. It also doesn’t let you just run individual snippets from a file you’re working on. You have to copy/paste them into the notebook. So, in my experience, there can be a lot of copying back and forth when you get to really hacking on something.


massung
2022-1-4 17:04:50

Yeah. I’ve never used RStudio as anything more than a REPL. I think I’ve opened a source file in it only a handful of times to edit and run.


badkins
2022-1-4 17:06:05

@massung <https://www.infoq.com/articles/apache-arrow-java/|this article> seems to support your conjecture that Arrow might not be worth it “in the small”. I think it’s a given Racket is not going to be used for cluster-sized big data tasks. Just using an internal format that is column-major, as flomat does, may be sufficient for easily getting data into/out of Arrow should that need arise.


massung
2022-1-4 17:23:12

In the end, I’ll note that while this package works as a POC, it’s definitely not what I would want from built-in slicing.

I’d love to be able to do something like (string-upcase (slice "hello, world!" 0 5)) ;=> "HELLO", but it’s not possible to use existing functions like string-upcase on slices without materializing them, which really sucks. :disappointed:


seanbunderwood
2022-1-4 17:42:43

It probably depends a bit on what components you use? DataFusion is still new, but its stated goal is to be the fastest single-node data table engine. Already it’s posting fairly impressive numbers.


badkins
2022-1-4 17:45:54

@seanbunderwood at a very fundamental level, I don’t have a good grasp on the implications of committing to zero-copy. It seems you have to work on the data in-place in the columnar format, so maybe there is a lot of duplicated functionality required.


seanbunderwood
2022-1-4 17:46:28

But that’s for more general purpose data manipulation and SQL queries, not just number crunching.


badkins
2022-1-4 17:46:45

That’s fine if you’re doing matrix stuff, and you can simply assign the pointer in flomat to the data, and then do some linear algebra, but in other contexts, it seems like we’d need to get the data into native Racket objects anyway.


badkins
2022-1-4 17:47:35

For example, I can see the benefit of storing all the strings in a column contiguously, if you’re writing imperative C code, but not for Racket code.


badkins
2022-1-4 17:48:20

I guess the idea is that you have to “vectorize” everything ? I’m using “vectorize” loosely i.e. pushing down functionality into the middleware.


badkins
2022-1-4 17:49:27

So you don’t operate on a string, but you ask the middleware to do something on the list of strings, and then very late in the pipeline, you convert to a Racket string? I dunno, my mind is a bit overwhelmed at this point :)


seanbunderwood
2022-1-4 17:52:07

Well, it would be just the pointers that are contiguous, anyway. Column-major is more of an advantage for compression than cache locality for non-atomic types, when you’re talking raw performance.

The other advantage for column major, though, is that data table manipulations tend to be more column-oriented than row-oriented in practice. And it’s cheaper to add, drop, or replace one column of n values than it is to mutate or replace n rows with m fields.


badkins
2022-1-4 17:52:36

No, I think the point of Arrow, as I’ve just read, is that the strings are actually stored contiguously, not the pointers.


badkins
2022-1-4 17:53:03

See <https://wesmckinney.com/blog/apache-arrow-pandas-internals/|Wes’ blog post>


seanbunderwood
2022-1-4 17:53:12

In both the file format and the memory format?


badkins
2022-1-4 17:53:43

“In pandas, an array of strings is an array of PyObject pointers, and the actual string data lives inside PyBytes or PyUnicode structs that live all over the process heap.” —> “In Arrow, each string is right next to the previous one in memory, so you can scan all of the data in a column of strings without any cache misses. Processing contiguous bytes right against the metal, guaranteed.”
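In other words, roughly this layout (a toy illustration of the idea, not Arrow’s actual API):

```
#lang racket/base

;; One contiguous data buffer plus an offsets vector: string i
;; occupies the bytes in [offsets[i], offsets[i+1]).
(define data #"helloworld!")   ; "hello" + "world" + "!"
(define offsets (vector 0 5 10 11))

(define (column-ref i)
  (bytes->string/utf-8
   (subbytes data
             (vector-ref offsets i)
             (vector-ref offsets (add1 i)))))

;; (column-ref 1) ; => "world"
```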


badkins
2022-1-4 17:53:59

Yes, in the memory format.


badkins
2022-1-4 17:54:22

Sounds great until you start thinking about how you’d work with those strings in Racket :)


badkins
2022-1-4 17:55:03

Seems like this is an age-old issue - disparity between binary formats and particular language implementation details.


badkins
2022-1-4 17:55:37

I think Julia might have facilities for making nice type wrappers around just “bits”.


seanbunderwood
2022-1-4 17:56:21

Ah, that’s nice to see. Explains why they’ve been working so hard on adding string intrinsics to the library.


seanbunderwood
2022-1-4 17:57:34

Which I’d probably do like Pandas does: encourage people to use the string intrinsics whenever possible, rather than mapping arbitrary functions over the data.


badkins
2022-1-4 17:58:01

I’m thinking there’s a catch-22 here - once (if??) Arrow has rich enough functionality internally, then maybe it can do all the aggregate stuff that’s needed, and you only convert to Racket later, but until then, it looks pretty rough.


massung
2022-1-4 17:58:17

Re: strings in columns, this is why DBs break out VARCHAR vs. TEXT. A column of 1M VARCHAR(40)s will be 40M bytes (assuming no compression), while 1M TEXT values will actually be 1M pointers to strings.


seanbunderwood
2022-1-4 17:59:01

Well, that depends on the DB. In Postgres, there’s explicitly no difference in how they’re stored.


massung
2022-1-4 17:59:03

(“bytes” above should be “chars” if talking about UTF encoding, wide chars, etc)


badkins
2022-1-4 17:59:06

I thought Postgres did not distinguish between varchar and text now.


badkins
2022-1-4 18:00:35

Ok, @seanbunderwood sounds like you’re confirming my suspicion about the need to push more/most functionality down into the engine.


badkins
2022-1-4 18:01:16

If that’s the case, then I think the only feasible way to use Arrow in Racket is to use the C library - way too much work to create this functionality natively in Racket using the Arrow memory layout.


badkins
2022-1-4 18:01:54

But that’s likely going to require data conversion in most cases, so the zero copy benefits will be lost.


seanbunderwood
2022-1-4 18:01:59

It’s generally the direction everyone is going in. Another advantage, in terms of performance, is that it gives more leeway to the optimizer.


badkins
2022-1-4 18:02:41

I hope these Arrow developers are good then :)


seanbunderwood
2022-1-4 18:03:57

The team includes a lot of the people who pushed Python to the center of the data universe, and unseated R and Java in the process. They might not succeed, but it wouldn’t be because they aren’t good.


seanbunderwood
2022-1-4 18:09:24

I would say, the serialization chunks are already established, insofar as they’ve been embraced by the Spark project. That probably guarantees a decent level of support for as far in the future as it’s possible to project in this domain.

The bigger question is if the compute components, DataFusion, or Ballista really get any traction. And that’s probably down to how much different communities decide to collaborate on tools like that, versus how much they would rather do their own thing.


sschwarzer
2022-1-4 18:51:52

> Quotes like “the pandas library relies heavily on the NumPy array for the implementation of pandas data objects” from <https://medium.com/@ericvanrees/pandas-series-objects-and-numpy-arrays-15dfe05919d7|this post> are what made me think the NumPy.ndarray is “more fundamental” than the Pandas data frame.

Yes, I agree here. NumPy is more generic and used by several other libraries.

