gknauth
2021-8-9 12:43:01

People at work are doing a lot of Python data-sciency things with tables, data frames, parquet files, … I’ll have to see if this is useful in that context. The files we deal with are often in the 2–3GB range.


massung
2021-8-9 13:04:24

It’s definitely not ready for multi-GB files yet (and I’d love to support parquet files). I do TB-sized genetics processing at work, so if Racket can handle the load, I’m down for the challenge.

But a lot of things need to happen before that’s possible. Most of them deal with things like:

• Better file parsing
• More conservative “reading” of CSV (e.g. when do you try to convert a cell value -> number, date, etc.?)
• Parallel processing of column data when appropriate (e.g. aggregating groups)

Consider the current version the entry point: it has 90% of the features implemented and is only 2–6x slower than Pandas (except on data load, where it’s 10+x slower).


massung
2021-8-9 13:05:23

I have loaded and tested it with files in the 600 MB range (10M+ rows).


samdphillips
2021-8-9 14:43:02

IIRC the python csv library has an advantage in that most of the heavy lifting is in C


ben.knoble
2021-8-9 14:49:47

That sounds right to me; it’s the same with numpy (C and maybe Fortran?). Doing something similar here could make a difference.


samdphillips
2021-8-9 16:16:12

csv-reading isn’t the most performant it could be for Racket. I have an experimental CSV reader (it doesn’t handle all of the edge cases) that reads in 4 KB buffers and operates on bytes (not strings), and it’s ~4x faster. Benchmarked with a 741 MB file:

$ raco run bench.rkt csv-reading < raw/systems_800.csv
cpu time: 64015 real time: 64017 gc time: 163
5848000
$ raco run bench.rkt csv-reading < raw/systems_800.csv
cpu time: 64422 real time: 64436 gc time: 166
5848000
$ raco run bench.rkt fast-csv < raw/systems_800.csv
cpu time: 16907 real time: 16908 gc time: 226
5848000
$ raco run bench.rkt fast-csv < raw/systems_800.csv
cpu time: 16850 real time: 16851 gc time: 227
5848000

I don’t think the csv-reading library does any number reading, but avoiding string->number can improve performance by avoiding Unicode decoding and parameter lookups on each call (these lookups can also be avoided by setting all of the optional arguments in the call).
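
To make the buffer-and-bytes idea concrete, here is a minimal sketch in Python rather than Racket. It is not the experimental reader itself: `iter_rows` and its `bufsize` parameter are hypothetical names, and the sketch assumes there are no quoted fields at all, so it skips even more edge cases.

```python
import io


def iter_rows(stream, bufsize=4096):
    """Yield each CSV row as a list of bytes cells.

    Assumption: no quoting or escaping; cells never contain "," or newlines.
    """
    tail = b""  # partial row left over from the previous buffer
    while True:
        chunk = stream.read(bufsize)
        if not chunk:
            break
        lines = (tail + chunk).split(b"\n")
        tail = lines.pop()  # the last piece may be an incomplete row
        for line in lines:
            yield line.rstrip(b"\r").split(b",")
    if tail:  # final row without a trailing newline
        yield tail.rstrip(b"\r").split(b",")


if __name__ == "__main__":
    sample = io.BytesIO(b"id,name\r\n1,ada\n2,grace")
    for row in iter_rows(sample, bufsize=4):
        print(row)
```

Cells stay as bytes the whole way, so nothing is decoded unless a caller asks for it.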


samdphillips
2021-8-9 16:20:37

According to my notes I wrote it around Nov 2020 https://gist.github.com/a032514dc0093f00875922bc7c2c8b00


soegaard2
2021-8-9 16:23:02

Would it be worth implementing a bytes->number?
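
For illustration only, this is roughly what such a conversion could look like, sketched in Python instead of Racket. `bytes_to_int` is a hypothetical name, and the sketch assumes ASCII decimal integers only (no floats, no exponents, no whitespace).

```python
def bytes_to_int(cell):
    """Parse an optionally signed decimal integer straight from ASCII bytes.

    Returns None when the cell is not an integer, so callers can keep the
    raw bytes instead of raising.
    """
    if not cell:
        return None
    start = 1 if cell[0] in b"+-" else 0
    digits = cell[start:]
    if not digits:
        return None  # a lone sign is not a number
    value = 0
    for byte in digits:
        digit = byte - 48  # 48 is ASCII "0"
        if digit < 0 or digit > 9:
            return None
        value = value * 10 + digit
    return -value if cell[0] == 45 else value  # 45 is ASCII "-"


if __name__ == "__main__":
    print(bytes_to_int(b"-17"), bytes_to_int(b"4x"))
```

The point is that the digits are read directly from the byte values, with no intermediate string.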


massung
2021-8-9 16:24:07

Yeah, CSV is tricky to do well and fast. Ideally, it wouldn’t even cons cells in a row either, as opposed to just being a sequence that returned cells in-order (as parsed) and a special symbol for 'end-of-row.
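
A rough sketch of that shape, in Python rather than Racket: a generator that hands back cells in parse order plus a sentinel at each row boundary, so no per-row list is ever built. `iter_cells` and `END_OF_ROW` are hypothetical names, and quoting is assumed not to occur.

```python
END_OF_ROW = object()  # sentinel marking a row boundary

COMMA = 44    # ASCII ","
NEWLINE = 10  # ASCII "\n"


def iter_cells(data):
    """Yield cells (as bytes) in order, with END_OF_ROW after each row.

    Assumption: unquoted CSV with "\n" line endings.
    """
    start = 0
    pending = False  # True once the current row has produced a cell
    for i, byte in enumerate(data):
        if byte == COMMA:
            yield data[start:i]
            start = i + 1
            pending = True
        elif byte == NEWLINE:
            yield data[start:i]
            yield END_OF_ROW
            start = i + 1
            pending = False
    if pending or start < len(data):  # last row had no trailing newline
        yield data[start:]
        yield END_OF_ROW


if __name__ == "__main__":
    for item in iter_cells(b"a,b\n1,2"):
        print("<end-of-row>" if item is END_OF_ROW else item)
```

A consumer that only needs column 3, say, can count cells and ignore the rest without ever materializing a row.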


samdphillips
2021-8-9 16:24:10

Maybe? Some of these formats need custom readers.


samdphillips
2021-8-9 16:24:18

Like the JSON one is hand coded.


massung
2021-8-9 16:25:01

JSON can be significantly sped up as well by not being strict. For example, if you read a t, just assume it’s true and skip the next 3 bytes.
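
A lenient literal reader along those lines might look like the following Python sketch. `read_literal` is a hypothetical helper, not part of any JSON library, and it deliberately trusts the first byte, so malformed input such as `tree` would be misread as `true`.

```python
def read_literal(buf, pos):
    """Return (value, next_pos) for a JSON literal starting at buf[pos].

    Non-strict on purpose: only the first byte is inspected and the
    remaining bytes of the literal are skipped without being checked.
    """
    first = buf[pos]
    if first == 0x74:  # "t": assume "true", skip the remaining 3 bytes
        return True, pos + 4
    if first == 0x66:  # "f": assume "false", skip the remaining 4 bytes
        return False, pos + 5
    if first == 0x6E:  # "n": assume "null", skip the remaining 3 bytes
        return None, pos + 4
    raise ValueError(f"not a literal at offset {pos}")


if __name__ == "__main__":
    print(read_literal(b"[true,false]", 1))
```

The trade-off is speed against validation: a strict parser would compare all four bytes before accepting `true`.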


sorawee
2021-8-9 16:32:01

This spam again :(


sorawee
2021-8-9 17:16:00

Feeling like READMEs in various repos should really be more descriptive.

• The “description” field in every repo should be set.
• There should be at least a sentence describing what the repo is for.
• Either link to Scribble doc, or have a lot more sentences to describe what it does.

I would love to help where possible, but my English sucks, so it might be easier for other people to do it.


massung
2021-8-9 17:16:55

> but my English sucks

Your English is better than my <insert any other language here> :smile:


spdegabrielle
2021-8-9 17:18:27

I take it you mean repos linked from the packages in the official package catalog?


sorawee
2021-8-9 17:18:59

Oh, I meant repos under racket organization. Sorry that was unclear


ben.knoble
2021-8-9 17:19:08

I agree in the large, but (at least with the last two packages I’ve published) I haven’t wanted to spend a lot of effort duplicating docs. I should add links though


spdegabrielle
2021-8-9 17:19:26

That’s tractable


sorawee
2021-8-9 17:19:30

For example, in racket/scribble, you have:

> This the source for the Racket packages: “scribble”, “scribble-doc”, “scribble-html-lib”, “scribble-lib”, “scribble-test”, “scribble-text-lib”.


spdegabrielle
2021-8-9 17:20:32

The description can’t be set by a PR - only the repo owners can do the description


sorawee
2021-8-9 17:20:54

Someone who doesn’t know Scribble would have a hard time figuring out what Scribble is.


shu--hung
2021-8-9 17:21:26

READMEs are in the individual package directories actually


shu--hung
2021-8-9 17:22:12

oh wait I am confused by scribble


shu--hung
2021-8-9 17:23:09

ah I see


sorawee
2021-8-9 17:24:15

My question originates from not knowing what https://github.com/racket/pkg-push does. After reading the code, I think it’s a system to convert packages in the previous package system to the new one, but I’m still not sure. It would save me a lot of time if this were clear from the README.



soegaard2
2021-8-9 17:29:04

Mor -> more


sorawee
2021-8-9 17:29:37

Yes, something like that.


shu--hung
2021-8-9 17:29:47

CIs are gonna fail until the next snapshot build is up..


sorawee
2021-8-9 17:31:00

welp


spdegabrielle
2021-8-9 17:31:09

Fixed


spdegabrielle
2021-8-9 17:33:28

I don’t know what pkg-push does either :sob: - there are no docs or readme but it is only ~300 lines - if I go hunting I can probably work it out from the pkg that requires pkg-push, but I’m cooking dinner now. Or meant to be


shu--hung
2021-8-9 17:34:33

pkg-push is >= 7 years old. Probably some old code that is not in use


shu--hung
2021-8-9 17:37:06

Small suggestion: this renders a little better on GitHub

# Scribble: The Racket Documentation Tool
Matthew Flatt and Eli Barzilay
Scribble is a collection of tools for ....

(sorry I don’t have the merge permission)


spdegabrielle
2021-8-9 17:44:52

I agree with @sorawee in general, but not for pkg-push - packages like that probably just need a generic pointer readme to the racket homepage:

> Welcome! You probably don’t want this package :grinning:
> If you are interested in Racket check the homepage at https://racket-lang.org for downloads, documentation, libraries and community - where you are welcome to ask questions.



ayushhh.sh
2021-8-9 18:38:39

@ayushhh.sh has joined the channel


ben.knoble
2021-8-9 18:51:00

I’m pretty sure even with GFM (GitHub Flavored Markdown) you need a blank line after headers, or things get confused. At least this was true in the last couple of years


rokitna
2021-8-9 19:15:32

There’s something I’m looking for in the docs but can’t recall the name of. I seem to recall there’s a feature that lets errortrace-compiled modules coexist alongside other modules, but is general enough that other tools like errortrace could use it too. I think it takes a string as an argument so that it can be incorporated into the filename. Does someone know what I’m looking for?


laurent.orseau
2021-8-9 19:24:40

Do you mean changing the default compiled directory, like DrRacket does?


spdegabrielle
2021-8-9 20:20:25

Tweaked it a bit more.


hazel
2021-8-9 20:53:16

all my Racket repos have the README:

NAME
A Racket library for X. This software is under rapid development (if necessary)

INSTALLATION
raco pkg install the-package

DOCUMENTATION
<links to Scribble docs>


hazel
2021-8-9 20:54:12

which is pretty much the bare minimum IMO


rokitna
2021-8-9 21:29:56

That sounds like it could be exactly what I’m looking for. :D Do you know of a link to that?



rokitna
2021-8-9 21:33:14

I was about to paste the same URL :D


rokitna
2021-8-9 21:33:26

I think this is exactly it, thank you so much


shu--hung
2021-8-10 02:37:34

Does Syntax Parse Bee accept illustration-only examples? That is,

• Non-practical examples that illustrate a specific syntax of syntax-parse
• Non-examples that show possible error messages of a specific syntax
• Examples that partly overlap with the existing ones in the documentation of syntax-parse


zachmclark
2021-8-10 03:15:47

@zachmclark has joined the channel