A feature upgrade that would be useful for csv_read_file/3

On my side, for now I am looking at what exists and how it is done, to think about how I would use it to grab data from a CSV file of which I just need some columns. If you look at UNdata, for example, there are tons of CSV files to play with. The same goes for ISO 3166-3 country codes, etc.

You may wish to check out SWISH -- SWI-Prolog for SHaring. This link points at an example of accessing external CSV data from SWISH. It includes a shorthand notation for accessing columns (but it always downloads all of them).

I had to read very large CSV files too.

For me Prolog is the fastest way to read data, because the code is very short and it is my favorite language; maybe I got addicted to it.

I tried the standard csv_read_file too, but it uses too much memory with large files. csv_read_row is an option, but then you have to manage the stream yourself and keep track of where and how you are reading.

A lot simpler is to use library(readutil) and its predicate read_line_to_string/2, as in this code:

    :- use_module(library(readutil)).

    read_file_sea(Sea, Fnbase) :-
        read_line_to_string(Sea, Sx00),
        (   Sx00 == end_of_file
        ->  true
        ;   process_handle_your_string(Sx00),
            read_file_sea(Sea, Fnbase)
        ).

then use:

    split_string(Sx00, ";", "", Lis),

to get all the column values in a list.

Before you split the line string, you have to check whether there are separator characters inside column values, and if so you have to write a predicate that deals with them.

In this way you can handle millions of rows very fast and keep memory usage limited to just what you need.
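A minimal end-to-end sketch of this line-by-line approach (the file name, the utf8 encoding, and the per-row goal are placeholders, and quoted separators are not handled):

```prolog
:- use_module(library(readutil)).   % read_line_to_string/2

% Read File line by line, split each line on ';' and hand the
% resulting list of column strings to RowGoal.
csv_lines(File, RowGoal) :-
    setup_call_cleanup(
        open(File, read, Sea, [encoding(utf8)]),
        read_rows(Sea, RowGoal),
        close(Sea)).

read_rows(Sea, RowGoal) :-
    read_line_to_string(Sea, Line),
    (   Line == end_of_file
    ->  true
    ;   split_string(Line, ";", "", Columns),
        call(RowGoal, Columns),
        read_rows(Sea, RowGoal)
    ).
```

Only one line is in memory at a time, so this scales to files of arbitrary size.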

Indeed, in my work too, reading Excel or CSV files is a recurrent task. I always convert them to CSV with ; as separator and quoted cells if needed.

Another problem is the encoding in which files are saved, for example with a Windows code page, as UTF-8, or as UTF-16: open a file with the wrong encoding and you lose the special characters in strings, and in most cases you only find that out much later.

Is there a way to find out with which encoding (ascii, unicode_le, unicode_be, utf8, octet, etc.) a file was saved?

Maybe something similar occurs when you have to read large XML files. When I use the Prolog XML tool, I found no option other than reading the whole XML file into one Prolog term. That term is so big that processing it is inefficient in memory. You would then have to write predicates that read parts of this XML into separate terms, but then you would have to know where there is a valid begin and end XML tag.

Then it is a lot easier to read the XML file line by line (with the Prolog read-line tool, assuming there are newlines in the XML file) and then trigger on certain tags in memory to extract the information you want.

I also have a question: what is the best, most efficient way to trim a string? By trim I mean removing trailing spaces and tabs, i.e. spaces and tabs after the normal characters.

The following works, but is it very inefficient?

    trim(StrA, El) :- split_string(StrA, "", "\s\t\n", L), nth0(0, L, El, _), !.

Or something like the following, but then you have to know whether you are before, inside, or after the string:

    str_codes_remove_code(_, [], [], []).
    str_codes_remove_code(Q, [Q|Codes], Coremain, [Q|Co_weg]) :- !,
        str_codes_remove_code(Q, Codes, Coremain, Co_weg).
    str_codes_remove_code(Q, [Co|Codes], [Co|Coremain], Co_weg) :-
        str_codes_remove_code(Q, Codes, Coremain, Co_weg).

    str_remove_code(Q, Str, Str2) :-
        string_codes(Str, Codes),
        str_codes_remove_code(Q, Codes, Codes2, _),
        string_codes(Str2, Codes2).
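For the specific job of deleting every occurrence of one code, the manual recursion can be replaced with exclude/3 from library(apply); a sketch with the same interface as above:

```prolog
:- use_module(library(apply)).   % exclude/3

% Remove every occurrence of code Q from Str, giving Str2.
str_remove_code(Q, Str, Str2) :-
    string_codes(Str, Codes),
    exclude(==(Q), Codes, Codes2),
    string_codes(Str2, Codes2).
```

For example, `str_remove_code(0';, "a;b;c", S)` gives `S = "abc"`.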

And I made this predicate, which I need a lot; can it be implemented better / more efficiently?

    substr_between(Src, Btag, Etag, Btws) :-
        sub_string(Src, Sta, _, _, Btag),
        sub_string(Src, Sta2, _, _, Etag),
        Sta2 > Sta,
        string_length(Btag, Le1),
        X is Sta + Le1,
        Y is Sta2 - X,
        sub_string(Src, X, Y, _, Btws), !.
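One possible improvement (a sketch, not benchmarked): take the length of Btag from the third argument of sub_string/5 instead of calling string_length/2, and search for Etag only in the part after Btag, so a match of Etag that precedes Btag is never considered:

```prolog
% Btws is the substring of Src between the first occurrence of
% Btag and the first occurrence of Etag after it.
substr_between(Src, Btag, Etag, Btws) :-
    sub_string(Src, Sta, Le1, _, Btag), !,   % first match of Btag
    X is Sta + Le1,
    sub_string(Src, X, _, 0, Rest),          % everything after Btag
    sub_string(Rest, Y, _, _, Etag), !,      % first Etag in Rest
    sub_string(Rest, 0, Y, _, Btws).
```

For example, `substr_between("ab<x>cd</x>ef", "<x>", "</x>", B)` gives `B = "cd"`.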

You can use csv_read_file_row/3, which is really easy to use. Sure, it is slower than what you use; that is the price you pay for a general solution that deals with most of the CSV (non-)standard pitfalls :-)
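A small sketch of that interface: csv_read_file_row/3 backtracks over the rows, so memory stays bounded even for huge files. The file layout and the sum over the first column are made up for illustration:

```prolog
:- use_module(library(csv)).

% Sum the first column of a ;-separated CSV file, visiting one
% row(...) term at a time via backtracking.
sum_first_column(File, Sum) :-
    aggregate_all(sum(V),
                  ( csv_read_file_row(File, Row, [separator(0';)]),
                    arg(1, Row, V)
                  ),
                  Sum).
```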

No, unless the file has a BOM marker. By default, opening a file in text mode in SWI-Prolog checks for a BOM marker. Otherwise you have to guess. I think there are tools around that try to do the guessing.
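A sketch of what that gives you from Prolog: open the file in text mode (BOM detection is on by default for read mode) and inspect the encoding the stream ended up with; anything beyond that is guesswork:

```prolog
% Report the encoding SWI-Prolog selected for File.  With a BOM the
% encoding follows the BOM; without one it is the process default.
file_encoding(File, Enc) :-
    setup_call_cleanup(
        open(File, read, In),
        stream_property(In, encoding(Enc)),
        close(In)).
```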

Of course, if you have XML files from a specific source, that works fine. The XML parser has a mode that makes callbacks, which allows parsing infinitely large files. The trick is to set a callback on the tag open. When you find the tag you are interested in, you ask the parser to read the entire element and you process it. The RDF/XML parser works this way.
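A sketch of that callback pattern with library(sgml); the element name `item` and the `process_item/1` goal are placeholders for whatever your documents contain:

```prolog
:- use_module(library(sgml)).

% Parse File and, for every <item> element, hand its parsed content
% to process_item/1, without materialising the whole document.
parse_items(File) :-
    setup_call_cleanup(
        open(File, read, In, [type(binary)]),
        (   new_sgml_parser(Parser, []),
            set_sgml_parser(Parser, dialect(xml)),
            sgml_parse(Parser,
                       [ source(In),
                         call(begin, on_begin)
                       ])
        ),
        close(In)).

on_begin(item, _Attrs, Parser) :- !,
    sgml_parse(Parser,
               [ document(Content),
                 parse(content)      % read up to the matching end tag
               ]),
    process_item(Content).
on_begin(_, _, _).

process_item(Content) :-            % placeholder handler
    writeln(item(Content)).
```

The nested sgml_parse/2 call consumes exactly one element, after which the outer parse continues with the next tag.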

You can get rid of the nth0/3 (why /4?) call using this. As there is no split character, the output list always holds exactly one element. This is surely as fast as it gets in SWI-Prolog. Even a foreign-language implementation will probably only barely beat this.

split_string(StrA, "", "\s\t\n",[Stripped]).
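Wrapped up as a predicate (pad characters are stripped from both ends, so this trims leading as well as trailing whitespace):

```prolog
% Strip leading and trailing spaces, tabs and newlines from StrA.
trim(StrA, Stripped) :-
    split_string(StrA, "", "\s\t\n", [Stripped]).
```

For example, `trim("  hello \t\n", S)` gives `S = "hello"`.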