A feature upgrade that would be useful for csv_read_file/3

My suggestion today would be to write convenience predicates that work with the output of csv_read_file/3 and csv_read_row/3. Maybe those could become part of the library? Options really become unwieldy, because they don't combine nicely.

For a data frame I guess we would instead have to transpose and use a single dict, with column names as keys and columns as lists (or flat terms?). The order within each column is important: the same index means the same row.
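A minimal sketch of that idea, assuming the first row is the header and using lists for the columns (rows_to_frame/2 is a hypothetical name, not a library predicate):

:- use_module(library(apply)).                  % maplist/3
:- use_module(library(clpfd), [transpose/2]).
:- use_module(library(pairs)).
:- use_module(library(yall)).

% rows_to_frame(+Rows, -Frame): transpose a list of row/N terms into
% a dict keyed by column name.  Assumes all rows have the same arity.
rows_to_frame([Header|Rows], Frame) :-
    Header =.. [row|Names],
    maplist([Row, Cells]>>(Row =.. [row|Cells]), Rows, Lists),
    transpose(Lists, Columns),
    pairs_keys_values(Pairs, Names, Columns),
    dict_pairs(Frame, frame, Pairs).

For example:

?- rows_to_frame([row(a,b), row(1,2), row(3,4)], F).
F = frame{a:[1, 3], b:[2, 4]}.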


My idea in this open discussion is to try to "standardize" things that Prolog users build on their own side and redo each time for their own use, though they could be shared as a common basis. Tons of useful stuff stays hidden somewhere although it could enhance the standard libraries. Example: you have a CSV extract of 10 columns by 10,000 rows and you just need 2 of them. What do you do?

My knee-jerk reaction would probably be to use cut and paste, or awk, before I let this touch my beautiful Prolog code. :-)

Now seriously, I would probably use csv_read_file_row/3 and only get the columns I need, in the order I need.

OK, so in terms of process: read each line, then pick from it what you need in your own way, and if needed preprocess the header to know which column numbers you need.
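A rough sketch of that process (columns/3 is a hypothetical helper, not part of library(csv)): read the header once to find the wanted column positions, then collect just those columns from every data row.

:- use_module(library(csv)).
:- use_module(library(apply)).
:- use_module(library(lists)).
:- use_module(library(yall)).

columns(File, Names, Rows) :-
    once(csv_read_file_row(File, Header, [])),   % first row = header
    Header =.. [row|AllNames],
    maplist([Name, I]>>nth1(I, AllNames, Name), Names, Indices),
    findall(Picked,
            ( csv_read_file_row(File, Row, [line(Line)]),
              Line > 1,                          % skip the header row
              maplist([I, V]>>arg(I, Row, V), Indices, Values),
              Picked =.. [row|Values]
            ),
            Rows).

This gives the selected columns in the order you ask for them, e.g. columns(File, [c_b, c_a], Rows).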


Yes, that is what I would do / have done. Again, I am not claiming this is the best way to do it, but sometimes I just write the code so that I can keep moving…


Good talking, as it gives ideas. I usually let them mature for some time and then write some code; for now I am a rookie in Prolog, as I need to get back to it. Moreover, I need to think about how I will organize the data I want to play with: in this case a referential linked to thousands of financial instruments, with long lists of characteristics to pick from web pages and to manage on a daily basis. 365 days x 1,000 instruments x tens of daily data points per instrument makes for a funny challenge…

Dear Wisermans,

The advice by Boris and Jan is solid.

In practice, if you have a large input file, you either transform it in memory once (and write out the narrowed version) or write simple code to enact the transformation. If you have enough memory to load everything, it is straightforward to write your own code, although pack(mtx) already provides mtx_column_select/4:
http://eu.swi-prolog.org/pack/file_details/mtx/src/mtx_column_select.pl

library(csv) was (and is) an excellent tool that allowed many of us to do data analytics within SWI. So I very much see the point that it should not try to do too much.

Jan has structured and exposed the innards of csv well enough by now to make it easy to implement specialised versions. For instance see:
http://eu.swi-prolog.org/pack/file_details/mtx/src/mtx_read_table.pl

Having said that, processing each row on the fly is such a Prologese operation that I probably argued for it in the past.

I just published pack(mtx) 0.6 which incorporates on-the-fly transformations:

?- assert( (
          % only_c_b(+Cb, +Ln, +RowIn, -RowOut): on the header line
          % (Ln == 1) find the position Cb of column c_b; on data
          % lines project that column out of RowIn.
          only_c_b(Cb, Ln, RowIn, RowOut) :-
               (  Ln =:= 1
               -> once(arg(Cb, RowIn, c_b)),
                  RowOut = row(c_b)
               ;  arg(Cb, RowIn, CbItem),
                  RowOut = row(CbItem)
               )
          )
        ).

?-  tmp_file( testo, TmpF ),
    csv_write_file( TmpF, [row(c_a,c_b,c_c),row(1,a,b),row(2,aa,bb)], [] ),
    mtx( TmpF, Mtx, row_call(only_c_b(_)) ).

TmpF = '/tmp/swipl_testo_8588_1',
Mtx = [row(c_b), row(a), row(aa)].

?- mtx( '/tmp/swipl_testo_8588_1', Full ).
Full = [row(c_a, c_b, c_c), row(1, a, b), row(2, aa, bb)].

I suspect this will at some point be extended with an option to ignore rows for which the call fails.

Regards,

Nicos Angelopoulos

https://stoics.org.uk/~nicos


Thx Nicos. In the end, rather than making things more complex, my CSV process should end up with something simple, row by row. And as far as I understood, @Boris is also going to give some more examples soon from his own day-to-day library :-)

On my side the idea is to grab financial data from different web sites, restructure / check it, and play with it, for fun at coffee time; let's say whenever I have time and find it a fun way to relax and change my mind. Programming is fun. It has nothing to do with my professional life, even if I worked in the financial sector too; it is more like a hobby I have had since I was 8 years old, and I am much older now, even if I have done some serious programming projects as well. Give me time to look at your links too…

On my part, once I grab that CSV table, it is just the start of tons of other data to grab and structure, so in the meantime I will also need to think about structuring all that. Rome was not built in a day; moreover, I need to get back to Prolog programming more seriously…

PS: My approach is always to minimize memory use and make things work as well as possible on "upgraded" oldies (Windows 10 64-bit, 8 GB RAM), which is also why I didn't want to load the full table into memory but just the needed columns, and why I was looking at what already exists to do so.

PS2: I looked again at your mtx library. Thx again, it looks really interesting; I need to spend some more time on the source code :-)

As for what I mentioned earlier, there are two simple tools that could be "normalized" in a library:
1/ csv_header = the list of column names, read from the first line (it also makes a good "two lines of code" demo for beginners)
2/ csv_scan = a stats / log report on a CSV file from some kind of usual scanner: number of rows, number of columns, missing cells, values changing type or looking bad within a column (a string opened but not closed, strings in some cells and numbers in others within the same column, etc.)
… the advantage of libraries being that the same predicate names get used for classical operations.

Did you read the example under csv_read_file/3? This is a one-and-a-half liner, depending on whether you want it as a list or just as a row(...) term like the rest:

csv_read_file(File, [Header|Rows]),
Header =.. [row|Colnames] % if needed

This isn’t too useful I am afraid, too many things left unspecified.

That is why I talk about a simple two-line example dedicated to just getting a header with the column names, in the same style as examples for students or the 99 exercises: "Hello world" style.
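For what it is worth, a minimal sketch of that idea (csv_header/2 is the proposed, hypothetical name; it reads only the first row rather than the whole file):

:- use_module(library(csv)).

csv_header(File, Colnames) :-
    once(csv_read_file_row(File, Header, [])),
    Header =.. [row|Colnames].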

Think about how apps that import data work: they do such checks on structured data, otherwise they would get garbage all day long. Personally, when I import CSV data (until now in other languages), I always check what I get, and I try to do it in a general way so as not to spend time redoing my personal libraries: general checks first, then extra specific ones depending on the use. Typically in CSV the cells are of the same type within a column, and when some fields/cells are missing you look at where and why, except if it is non-tabular data where rows matter more than columns.
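A small sketch in the direction of the proposed csv_scan (csv_scan/1 is a hypothetical name; it relies on the documented match_arity(false) option, which stops library(csv) from complaining about rows with a deviating number of cells):

:- use_module(library(csv)).
:- use_module(library(apply)).
:- use_module(library(lists)).
:- use_module(library(yall)).

csv_scan(File) :-
    csv_read_file(File, [Header|Rows], [match_arity(false)]),
    functor(Header, _, NCols),
    length(Rows, NRows),
    format("~d data rows, ~d columns~n", [NRows, NCols]),
    % Report rows whose cell count deviates from the header.
    forall(( nth1(N, Rows, Row),
             functor(Row, _, A),
             A =\= NCols ),
           format("row ~d has ~d cells, expected ~d~n", [N, A, NCols])),
    % With the default convert(true), numeric fields become numbers;
    % flag columns holding both numbers and non-numbers.
    forall(between(1, NCols, Col),
           ( findall(V,
                     ( member(Row, Rows),
                       functor(Row, _, A), Col =< A,
                       arg(Col, Row, V) ),
                     Vs),
             (   (   maplist(number, Vs)
                 ;   maplist([V]>>(\+ number(V)), Vs)
                 )
             ->  true
             ;   arg(Col, Header, Name),
                 format("column ~w mixes numbers and non-numbers~n", [Name])
             ))).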

This isn't a CSV importer, this is a "tab-delimited, no escape sequences, no quoting" file importer. A CSV file could break this in many different ways :-)

A more complete implementation is already available with csv_read_file_row/3, including logical line numbers:

csv_read_file_row(File, Row, [line(N)])

See RFC 4180. The CSV "spec" says, among other things, that both the field and the line separators can be embedded, without escaping, inside double-quoted fields. The double quote itself is escaped by doubling it. It says a lot more, but you can read it yourself. Not sure how useful it is though.
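For illustration, a query against library(csv)'s DCG with all three tricks on one line; assuming the documented defaults, this should succeed:

?- string_codes("a,\"b,c\",\"say \"\"hi\"\"\"\n", Codes),
   phrase(csv(Rows), Codes),
   Rows == [row(a, 'b,c', 'say "hi"')].

The embedded comma and the doubled "" survive, while the enclosing quotes are stripped.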

The only point I was trying to make is that CSV is not an Excel file, nor is it a data structure. It is a ubiquitous non-standard that is not strictly followed by its biggest users. The library in SWI-Prolog is useful because it handles the CSV files I have seen in the wild pretty well.

The other common format in the wild is the “tab-separated values”, often with the extension .tsv. The “specs” I have encountered seem to gravitate around “no embedding of field and record separators within fields”.

In bioinformatics at least there are a whole bunch of other formats that almost look like those two but aren’t.

I am not sure where this discussion is going. library(csv) has provisions for using a different field separator; the option is called separator and it is documented. See the csv//2 docs.
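For example, reading a tab-separated file should be as simple as this (data.tsv is a placeholder name; separator takes the character code of the separator, and 0'\t is the code of TAB):

?- csv_read_file('data.tsv', Rows, [separator(0'\t)]).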

The other important part of the puzzle is that Microsoft has always made sure to break compatibility with third-party tools, maliciously. Their broken export from Excel spreadsheets is only one example.

You can always "save as" whatever, but once you have anything more than aaa,bbb,ccc in your data, the quoting/escaping/embedding becomes relevant. Excel as a tool is fine; but sharing anything that has been in an Excel spreadsheet has been torture in so many different ways.


I got busy during the week, but it is maybe also interesting to look at the "mtx" pack for SWI-Prolog by @nicos, which I hadn't noticed until he pointed it out.

On my side, for now I am looking at what exists and how it is done, to think about how I would use it to grab data from a CSV file of which I just need some columns. If you look at UNdata, for example, there are tons of CSV files to play with. The same goes for ISO 3166-3 country codes, etc.

You may wish to check out SWISH -- SWI-Prolog for SHaring. This link points at an example for accessing external CSV data from SWISH. It includes a shorthand notation for accessing columns (but it always downloads all of them).

I had to read very large CSV files too. For me Prolog is the fastest way to read data, because the code is very short and it is my favorite language; maybe I got addicted to it.

I tried the standard csv_read_file too, but it uses too much memory with large files. csv_read_row is an option, but then you have to manage the stream and where and how you are reading.

A lot simpler is to use library(readutil) and its read_line_to_string(Sea, Sx), as for example in this code:

:- use_module(library(readutil)).     % read_line_to_string/2

read_file_sea(Sea, Fnbase) :-
    \+ at_end_of_stream(Sea),
    read_line_to_string(Sea, Sx00), !,
    process_handle_your_string(Sx00),
    read_file_sea(Sea, Fnbase).
read_file_sea(_, _).

then use:

split_string(Sx00, ";", "", Lis)

to get all the column values in a list.
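For example, that call behaves like this:

?- split_string("1;aa;bb", ";", "", Lis).
Lis = ["1", "aa", "bb"].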

Before you split the line string, you have to check that there are no separators inside column values; if there are, you have to write a predicate that removes them.

In this way you can handle millions of rows very fast and keep the memory very clean, holding only what you need.

Indeed, in my work too, reading Excel or CSV files is a recurrent task. I always convert them to CSV with ; as the separator character, and quoted cells if needed.

What is also a problem is the encoding in which files are saved, for example with a Windows code page, as UTF-8, or as UTF-16: you can lose the special characters in strings, and in most cases you only find that out later.

Is there a way to find out with which encoding (ascii, unicode_le, unicode_be, utf8, octet, etc.) files were saved?

Maybe something similar occurs when you have to read large XML files. When I used the Prolog XML tool, I found no other option than to read the whole XML file into one Prolog term; that term is so big that processing it is memory-inefficient. You would then have to write predicates that read parts of the XML into separate terms, but then you would have to know where you have a valid begin and end XML tag.

Then it is a lot easier to read the XML file line by line (with the Prolog read-line tool, assuming there are newlines in the XML file) and then trigger in memory on certain tags, to be able to extract the information you want.

I also have a question: what is the best, most efficient way to trim a string? Trim means removing trailing spaces and tabs, i.e. spaces and tabs after the normal characters.

The following works, but is it very inefficient?

trim(StrA, El) :- split_string(StrA, "", "\s\t\n", L), nth0(0, L, El, _), !.

Or something like the following, but then you have to know whether you are before, inside, or after the string:

str_codes_remove_code(_, [], [], []) :- !.
str_codes_remove_code(Q, [Co|Codes], Coremain, [Co|Co_weg]) :-
    Co = Q, !,
    str_codes_remove_code(Q, Codes, Coremain, Co_weg).
str_codes_remove_code(Q, [Co|Codes], [Co|Coremain], Co_weg) :-
    str_codes_remove_code(Q, Codes, Coremain, Co_weg).

str_remove_code(Q, Str, Str2) :-
    string_codes(Str, Codes),
    str_codes_remove_code(Q, Codes, Codes2, _),
    string_codes(Str2, Codes2).

And I made this predicate, which I need a lot; can it be implemented better / more efficiently?

substr_between(Src, Btag, Etag, Btws) :-
    sub_string(Src, Sta, _, _, Btag),
    sub_string(Src, Sta2, _, _, Etag),
    Sta2 > Sta,
    string_length(Btag, Le1),
    X is Sta + Le1,
    Y is Sta2 - X,
    sub_string(Src, X, Y, _, Btws), !.

You can use csv_read_file_row/3, which is really easy to use. Sure, it is slower than what you use; that is the price you pay for a general solution that deals with most of the CSV (non-)standard pitfalls :-)

No, unless the file has a BOM marker. By default, SWI-Prolog's open in text mode checks for a BOM marker. Otherwise you have to guess; I think there are tools around that try to do the guessing.
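For example ('data.csv' is a placeholder): after opening in text mode you can ask the stream which encoding it ended up with; if the file starts with a BOM, the encoding is taken from it, otherwise you get the default text encoding.

?- open('data.csv', read, In, []),
   stream_property(In, encoding(Enc)).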

Of course, if you have XML files from a specific source, that works fine. The XML parser has a mode with callbacks that allows parsing infinite files. The trick is to set a callback on the tag open; if you find the tag you are interested in, you ask the parser to read the entire element and you process it. The RDF/XML parser works this way.
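A sketch of that approach, following the stream-processing example in the library(sgml) documentation; the element name person and the predicate process_person/2 are placeholders:

:- use_module(library(sgml)).

process_file(File) :-
    setup_call_cleanup(
        open(File, read, In, [type(binary)]),
        ( new_sgml_parser(Parser, []),
          sgml_parse(Parser,
                     [ source(In),
                       dialect(xml),
                       call(begin, on_begin)
                     ])
        ),
        close(In)).

% Called on every opening tag.  For the element we care about, ask
% the parser to read the complete element into Content; for other
% elements the hook simply fails and parsing continues normally.
on_begin(person, Attrs, Parser) :-
    sgml_parse(Parser,
               [ document(Content),
                 parse(content)
               ]),
    process_person(Attrs, Content).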

You can get rid of the nth0/3 (why /4?) call using this. As there is no split character, you know the output list always holds exactly one element. This is surely as fast as it gets in SWI-Prolog; even a user-written foreign implementation would probably only barely beat it.

split_string(StrA, "", "\s\t\n",[Stripped]).
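For example:

?- split_string("  hello\t\n", "", "\s\t\n", [Stripped]).
Stripped = "hello".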