A feature upgrade that would be useful for csv_read_file/3

Playing with the standard libraries and csv_read_file/3 while comparing it to other programming languages … I feel that a useful feature (one that shows off the “intelligence” of Prolog) would be an option where you can give header names to filter the columns you want to import, instead of just using the arity. Example of option: header('header_name_1', 'header_name_2', 'header_name_3') to get only the columns with those names, whatever their position among the CSV columns. What do you think about it?
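
Just to make the idea concrete, here is a minimal sketch of what such filtering could look like as post-processing on the output of csv_read_file/2 (the names columns_by_name/3, header_index/3, etc. are hypothetical, not part of library(csv)):

:- use_module(library(lists)).   % nth1/3
:- use_module(library(apply)).   % maplist/3

% columns_by_name(+Names, +RowsIn, -RowsOut): keep only the columns
% whose header is in Names, in the order of Names. RowsIn is the
% [Header|Rows] list produced by csv_read_file/2.
columns_by_name(Names, [Header|Rows], RowsOut) :-
    Header =.. [row|AllNames],
    maplist(header_index(AllNames), Names, Indexes),
    maplist(project_row(Indexes), [Header|Rows], RowsOut).

header_index(AllNames, Name, Index) :-
    nth1(Index, AllNames, Name), !.

project_row(Indexes, RowIn, RowOut) :-
    maplist(row_arg(RowIn), Indexes, Cells),
    RowOut =.. [row|Cells].

row_arg(Row, Index, Cell) :-
    arg(Index, Row, Cell).

Usage (file name and header names are placeholders):

?- csv_read_file('data.csv', Rows),
   columns_by_name([header_name_1, header_name_2], Rows, Slice).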

Talking about features worth standardizing because they exist in other languages’ CSV libraries, I also feel these ones could make a nice extension: csv_headers(+File, -Headers), where Headers = ['header_name_1', 'header_name_2'] etc., to preprocess the filtering; csv_check(+File, -Stats, +Options) to get statistics on the number of lines, empty cells, number of commented-out lines, etc.; and csv_write_file_row(+File, +Row, +Data, +Options) to have a simple way to update a CSV file “on the fly” when some data is missing or corrupted.
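
For csv_headers/2 (again a proposed name, not in library(csv)), a sketch could be as simple as taking the first solution of the non-deterministic csv_read_file_row/3, so the rest of the file is never read:

% csv_headers(+File, -Headers): read just the header row,
% without loading the rest of the file.
csv_headers(File, Headers) :-
    once(csv_read_file_row(File, Header, [])),
    Header =.. [row|Headers].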

I use library(csv) on a regular basis. I also had some ideas about how to extend it when I first started using it, but with time I learned to live with it instead :sweat_smile:

One difficulty with options is how they should interact with each other: you get the usual combinatorial explosion. Another thing I noticed early on is that my common tasks are trivial in plain Prolog, and for some of them the CSV parsing is not the best place to handle them.

This is NOT an argument against extending the library; it is just a comment based on my experience with it.

EDIT: Another comment: a CSV file is a container for data, it is most certainly not a data structure. Depending on what I’m doing, I just read the data and populate a data structure that I will actually use. For example, an R-style “data frame” is maybe better suited for selecting columns?

EDIT2: (and I promise I will stop editing this message) there are also two fundamentally different ways of using the library: loading all the data into memory, or streaming the data and filtering on the fly. I remember those being difficult to treat consistently once you decide to add more options to the interface.

I opened this discussion about those features because people take what they are given and rely more and more on libraries, since libraries save them time. It is also one of the reasons why some other languages are much more popular: they have tons of snippets and ready-made stuff. Getting the headers is easy, but extending csv_read_file means either writing a new predicate or having the feature already implemented … In a classical use case you often receive external CSV files produced elsewhere, and I find it useful to have a smart way to integrate them.

I tend to agree with @Boris’s worries. One option might be a predicate that reads a CSV file into something that resembles a data frame. One issue is that there is no set-in-stone representation for arrays in Prolog. We could have a predicate that reads a CSV into a list of dicts and then use dicts_slice/3 to do what you want. For reading a CSV into dicts, see the csv_read_file/3 example.
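
A minimal sketch along those lines, assuming the header fields are read as atoms (dict keys must be atoms or small integers, so headers that parse as numbers would need converting first); csv_dicts/2 and row_dict/3 are made-up names:

:- use_module(library(dicts)).   % dicts_slice/3
:- use_module(library(pairs)).   % pairs_keys_values/3

csv_dicts(File, Dicts) :-
    csv_read_file(File, [Header|Rows]),
    Header =.. [row|Keys],
    maplist(row_dict(Keys), Rows, Dicts).

row_dict(Keys, Row, Dict) :-
    Row =.. [row|Values],
    pairs_keys_values(Pairs, Keys, Values),
    dict_create(Dict, row, Pairs).

Then the column selection from the opening post becomes:

?- csv_dicts('data.csv', Dicts),
   dicts_slice([header_name_1, header_name_2], Dicts, Sliced).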

We could also opt for a library that slices columns from a list of compounds.

I think that sooner or later we must have a (foreign) data-frame-like library. Foreign, as it would allow handing such an array to foreign code and efficiently performing number crunching on it. Not sure how much of that is in the ffi_matrix add-on by @friguzzi.

My suggestion today would be to write convenience predicates that work with the output of csv_read_file/3 and csv_read_row/3. Maybe those could become part of the library? Mostly because options really do become unwieldy, as they don’t combine nicely.

For a data frame I guess we would instead have to transpose and have a single dict, with column names as keys and columns as arrays (flat terms?). The order within the arrays is important: the same index means the same row.
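
A minimal sketch of that transposition, using transpose/2 from library(clpfd) and a compound col/N as the “array” (rows_frame/2 and the col functor are assumptions, not an agreed representation):

:- use_module(library(clpfd)).   % transpose/2
:- use_module(library(pairs)).   % pairs_keys_values/3

rows_frame([Header|Rows], Frame) :-
    Header =.. [row|Keys],
    maplist(row_list, Rows, Lists),        % row(..) -> list
    transpose(Lists, Columns),             % rows -> columns
    maplist(list_column, Columns, Arrays), % list -> col(..)
    pairs_keys_values(Pairs, Keys, Arrays),
    dict_create(Frame, frame, Pairs).

row_list(Row, List) :- Row =.. [row|List].
list_column(List, Col) :- Col =.. [col|List].

Cell access is then O(1) via arg/3, e.g. arg(RowIndex, Frame.header_name_1, Value).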

My idea in this open discussion is to try to “standardize” things that Prolog users do on their own and redo each time for their own use, though they could be shared as a common basis … Tons of useful stuff stays hidden somewhere though it could enhance the standard libraries. Example: you have a CSV extract of 10 columns by 10,000 rows … you just need 2 … what do you do?

:slight_smile: my knee-jerk reaction would probably be to use cut and paste or awk before I let this touch my beautiful Prolog code.

Now seriously, I would probably use csv_read_file_row/3 and only get the columns I need, in the order I need.

OK, so in terms of process: read each line, then pick from it what you need and shape it your own way, and if needed preprocess the header to know which column numbers you need.
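
A minimal sketch of exactly that, streaming with csv_read_file_row/3 so the full table never sits in memory (columns_pairs/4 is a made-up name; note the file is opened twice, once for the header and once for the data):

:- use_module(library(lists)).   % nth1/3

columns_pairs(File, NameA, NameB, Pairs) :-
    once(csv_read_file_row(File, Header, [])),   % header only
    Header =.. [row|Names],
    once(nth1(IA, Names, NameA)),                % column numbers
    once(nth1(IB, Names, NameB)),
    findall(A-B,
            ( csv_read_file_row(File, Row, [line(Line)]),
              Line > 1,                          % skip the header
              arg(IA, Row, A),
              arg(IB, Row, B)
            ),
            Pairs).

So the 10-columns-need-2 example becomes (placeholder names):

?- columns_pairs('big.csv', header_name_1, header_name_2, Pairs).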

Yes, that is what I would do / have done. Again, I am not claiming this is the best way to do it, but sometimes I just write the code so that I can keep moving…

Good talking, as it gives ideas … I usually let an idea mature for some time and then write some stuff; for now I am a rookie on Prolog, as I need to get back to it … Moreover I need to think about how I will organize the data I want to play with … in this case a referential linked to thousands of financial instruments, with long lists of characteristics to pick from web pages and to manage on a daily basis … 365 days x 1,000 instruments x tens of daily data points per instrument makes a funny challenge …

Dear Wisermans,

The advice by Boris and Jan is solid.

In practice, if you have a large input file you either transform it in memory once (and write out the narrowed version) or write simple code to enact the transformation. If you have enough memory to load everything, it is straightforward to write your own code, although pack(mtx) already provides mtx_column_select/4:
http://eu.swi-prolog.org/pack/file_details/mtx/src/mtx_column_select.pl

library(csv) was (and is) an excellent tool that has allowed many of us to do data analytics within SWI-Prolog. So I very much see the point that it should not try to do too much.

Jan has structured and exposed the innards of csv well enough by now to make it easy to implement specialised versions. For instance see:
http://eu.swi-prolog.org/pack/file_details/mtx/src/mtx_read_table.pl

Having said that, processing each row on the fly is such a Prologese operation; I probably argued for it in the past.

I just published pack(mtx) 0.6, which incorporates on-the-fly transformations:

% only_c_b(?Cb, +Ln, +RowIn, -RowOut): on the header line (Ln == 1)
% find the argument position Cb of column c_b; on later lines Cb is
% still bound from that first call (the only_c_b(_) closure passed
% via row_call/1 below is reused across rows, as the output shows),
% so the same position is projected out of every row.
?- assert( (
          only_c_b(Cb, Ln, RowIn, RowOut) :-
               ( Ln =:= 1 ->
                    once(arg(Cb, RowIn, c_b)),   % locate the column
                    RowOut = row(c_b)
               ;
                    arg(Cb, RowIn, CbItem),      % reuse the bound index
                    RowOut = row(CbItem)
               )
          )
        ).

?-  tmp_file( testo, TmpF ),
    csv_write_file( TmpF, [row(c_a,c_b,c_c),row(1,a,b),row(2,aa,bb)], [] ),
    mtx( TmpF, Mtx, row_call(only_c_b(_)) ).

TmpF = '/tmp/swipl_testo_8588_1',
Mtx = [row(c_b), row(a), row(aa)].

?- mtx( '/tmp/swipl_testo_8588_1', Full ).
Full = [row(c_a, c_b, c_c), row(1, a, b), row(2, aa, bb)].

I suspect this will at some point be extended to also have an option to trigger ignoring of rows for which the call fails.

Regards,

Nicos Angelopoulos

https://stoics.org.uk/~nicos

Thx Nicos. In the end, rather than making things more complex, my CSV process should end up as something simple, row by row … eheh, and as far as I understood, @Boris is also going to give some more examples soon from his own day-to-day library :slight_smile:

On my side, my idea is to grab financial data from different web sites, restructure / control it and play with it … for fun at coffee time … so let’s say whenever I have time and feel like playing with it to relax and change my mind :-/ programming is fun … nothing to do with my professional life, even if I worked in the financial sector too … more like a hobby since I was 8 y/o, and I’m much older now … even if I have done some serious programming projects too … give me time to look at your links too …

For my part, once I grab that CSV table it is just the start of tons of other data to grab and structure … so in the meantime I will also need to think about structuring all that. Rome was not built in a day; moreover I need to get back to Prolog programming more seriously …

PS: My approach is always to minimize memory use and make things work as well as possible on “upgraded” oldies (Windows 10 64-bit, 8 GB RAM), which is also why I didn’t want to load the full table into memory but just the needed columns, looking at what already existed to do so.

PS2: I looked again at your mtx library … thx again … it looks really interesting … eheh, I need to spend some more time on the source code :slight_smile:

As for what I mentioned earlier, there were 2 simple tools that could be quite “normalized” in a library:
1/ csv_header = the list of column names, read from the first line (it also makes a good beginners’ “2 lines of code” demo too)
2/ csv_scan = a stats / log report on a CSV file from some kind of usual scanner: number of rows, number of columns, missing cells, types changing or looking bad within a column (a string opened but not closed, strings in some cells and numbers in others within the same column, etc.) (see the sketch below)
… the advantage of libraries being that the same predicate names get used for classical operations.
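
A rough sketch of the csv_scan idea for the simplest of those statistics (the name comes from the proposal above, not from library(csv); it naively streams the file several times and assumes empty cells are read as the empty atom '', the default conversion):

:- use_module(library(aggregate)).   % aggregate_all/3
:- use_module(library(apply)).       % include/3

csv_scan(File, stats(rows(NRows), columns(NCols), empty_cells(NEmpty))) :-
    once(csv_read_file_row(File, Header, [])),
    functor(Header, row, NCols),
    aggregate_all(count, csv_read_file_row(File, _, []), NAll),
    NRows is NAll - 1,                        % do not count the header
    aggregate_all(sum(E),
                  ( csv_read_file_row(File, Row, []),
                    Row =.. [row|Cells],
                    include(==(''), Cells, Empty),
                    length(Empty, E)
                  ),
                  NEmpty).

Per-column type checks would fit in the same loop, but need the kind of unspecified decisions discussed below.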

Did you read the example under csv_read_file/3? This is a one-and-a-half liner, depending on whether you want it as a list or just as a row(...) like the rest:

csv_read_file(File, [Header|Rows]),
Header =.. [row|Colnames] % if needed

That isn’t too useful, I am afraid; too many things are left unspecified.

That’s why I talk about a simple 2-line example dedicated to just getting a header with the column names … in the same style as examples for students or the 99 exercises = Hello World style.

Think about how apps that import data work … they do such checks on structured data, otherwise they would get garbage all day long. Personally, when I import CSV data (till now in other languages) I always check what I get, and I try to do it in a general way so as not to spend time redoing my personal libraries = general checks first + then extra specific ones depending on the use. Typically in a CSV the cells are of the same type within a column, and when some fields/cells are missing you look at where and why, except if it is non-table-style data where rows matter more than columns.

This isn’t a CSV importer; this is a “tab-delimited, no escape sequences, no quoting” file importer. A CSV file could break it in many different ways :slight_smile:

A more complete implementation is already available with csv_read_file_row/3, including logical line numbers:

csv_read_file_row(File, Row, [line(N)])

See RFC 4180. The CSV “spec” says, among other things, that both the field and the line separators can be embedded without escaping inside double-quoted fields. The double quote itself is escaped by doubling it. It says a lot more, but you can read it yourself. Not sure how useful it is though.
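
For illustration, here is a single three-field record that exercises all three rules (an embedded comma, an escaped double quote, and an embedded line break):

"field, with comma","she said ""hi""","spans
two lines"

A naive split-on-comma-and-newline reader breaks on all three, while library(csv) handles this kind of input.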

The only point I was trying to make is that CSV is not an Excel file, nor is it a data structure. It is a ubiquitous non-standard that is not strictly followed by its biggest users. The library in SWI-Prolog is useful because it handles the CSV files I have seen in the wild pretty well.

The other common format in the wild is “tab-separated values”, often with the extension .tsv. The “specs” I have encountered seem to gravitate around “no embedding of field and record separators within fields”.

In bioinformatics at least there are a whole bunch of other formats that almost look like those two but aren’t.

I am not sure where this discussion is going. library(csv) has provisions for using a different field separator: the option is called separator and it is documented. See the csv//2 docs.
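
For example, reading a tab-separated file (the file name is a placeholder):

?- csv_read_file('data.tsv', Rows, [separator(0'\t)]).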

The other important part of the puzzle is that Microsoft has always made sure to break compatibility with third-party tools maliciously. Their broken export from Excel spreadsheets is only one example.

You can always “save as” whatever, but once you have anything more than aaa,bbb,ccc in your data, the quoting/escaping/embedding becomes relevant. Excel as a tool is fine; but sharing anything that has been through an Excel spreadsheet has been torture in so many different ways.

I got busy during the week, but it may also be interesting to look at @nicos’

“mtx” pack for SWI-Prolog

which I hadn’t noticed until he pointed it out.