Can you please provide advice on big data handling strategy?

Rscho314 · October 2, 2020, 6:00pm

Hi,

I’d like to work on a big database (~30 million facts), that currently is in SQL format. I am aware of libs such as prosqlite or CQL, but I would much prefer to use plain prolog.

The SQL data is as follows:

HEADER (‘1’,‘0006-2944’,‘1975 Jun’,‘1975-6-1’,1975),(‘10’,‘1873-2968’,‘1975 Sep 01’,‘1975-9-1’,1975) END

And I parse it with the following DCGs:

... --> [] | [_], ... .

list([]) --> [].
list([L|Ls]) --> [L], list(Ls).

citation_facts([C|Cs]) --> ..., citation_fact(C), ",",  citation_facts(Cs).
citation_facts([C]) --> citation_fact(C), ... .

citation_fact([Pmid, Issn, Edat, Pyear]) --> "('",
                                                  list(P), { atom_codes(Pmid, P) }, "','",
                                                  list(I), { atom_codes(Issn, I) }, "','",
                                                  list(_),                          "','",
                                                  list(Ey), "-", list(Em), "-", list(Ed),
                                                  { atom_codes(Ay, Ey), atom_number(Ay, Ny),
                                                    atom_codes(Am, Em), atom_number(Am, Nm),
                                                    atom_codes(Ad, Ed), atom_number(Ad, Nd),
                                                    date_time_stamp(date(Ny,Nm,Nd,0,0,0,0,-,-), Edat) }, "',",
                                                  list(Y), { atom_codes(Py, Y), atom_number(Py, Pyear) },
                                             ")", !.

once(phrase_from_file(citation_facts(Cs), 'db.sql')).

Now of course if I do that on the entire database I blow up the stack because it puts everything in a single huge list. My idea is to use lib persistency and I’d like to use phrase_from_stream/2 to stream the data in constant memory from SQL to fact files. I read that the stream must be buffered. What does that mean in practice? Is using memory files the way to go? Can you give me an example on how to proceed to stream the data, please?

Also regarding lib persistency, my understanding is that it won’t require to keep all facts in memory at all times. Is that correct? My database is 67Gb uncompressed and I have 96Gb RAM available. Would you be able to provide a ballpark estimate of how much RAM I’ll need?

Thanks a bunch!

EricGT · October 2, 2020, 6:52pm

While I am a fan of library(persistency) personally I would look at Quick Load Files first.
See related topic: Quick Load Files

The topic of large fact files has also been discussed before, see: Scaling to billions of facts?

While you may need library(persistency) to create the first fact file. Once the file is created I would convert the data into a qlf.

I can’t help with this as I have not done that.

AFAIK any fact you want access to has to be in memory. If the fact is not in memory then Prolog will not know about it. As I use Windows I would think that Windows Virtual Memory should solve this, see: How To Manage Virtual Memory (Pagefile) In Windows 10

While I do have some experience parsing very large files into Prolog facts using DCGs, the one thing I know for sure is to make sure your DCG is deterministic. If it is not it will take quite some time to run though the variations which might not be what you want.

I can’t.

While these answers are based on my experience, you should wait to see if Jan W. gives an answer. If he has a different answer then both you and I will have learned something.

EDIT

If you just want to save the facts, then I would use library(persistency) as used here.

After you have parsed a single entry, then just persist it using something like

add_imports(Import_module,Export_module) :-
    (
        imports(Import_module,Export_module), !
    ;
        assert_imports(Import_module,Export_module)
    ).

where the functor for the fact is imports and values for fact is Import_module,Export_module.

The line

imports(Import_module,Export_module), !

checks to see if the fact is already persisted and if so, the cut then prunes the running the next statement.

The next statement

assert_imports(Import_module,Export_module)

is executed if the fact is not persisted and just adds an assert to the persistency journal file.

HTH

EDIT

atom_codes(Ay, Ey), atom_number(Ay, Ny)

can this not be replaced with number_codes/2 ?

The whole list thing had me confused until I remembered that the DCG tutorials do it that way, e.g. Using Definite Clause Grammars in SWI-Prolog or Prolog DCG Primer

I think you might find basics.pl – Various general DCG utilities of more value, especially string_without//2.

Personally I prefer the way Wouter Beek does parsing with difference list, dcg.pl

CapelliC · October 2, 2020, 7:21pm

Maybe library(intercept) could help? If the DB is mainly static data, I would try to translate it to Prolog clauses, at then save as QLF. Then reloading should be a lot faster. Maybe there is a way to avoid the translation in Prolog, try first on a small subset to assert facts (declared dynamic) and then issue qcompile.

Rscho314 · October 2, 2020, 8:59pm

Thank you for this stellar answer!

I did not know of the Quick Load Files solution, and it seems like it fits my use case perfectly. My data will be completely static. I will probably assert new facts in my programs, but I definitely will not retract anything.

My DCG is deterministic but leaves a redundant choice point which is why I used once/1 (although I suspect it’s just a matter of finding the right place for an additional cut).

Indeed it should be, but I get an ‘illegal number’ error when using number_codes/2 and I haven’t been able to find why. I would expect my current workaround to also fail, so I’m a bit stumped.

Thanks again!

Rscho314 · October 2, 2020, 9:03pm

I didn’t know of this lib, but judging from the doc it seems to fit my use case perfectly. I’ll give it a try ASAP

I do not really understand what you are thinking of, here. You mean that I would perhaps be able to make a QLF without compiling the SQL rows to Prolog facts? Can you please give me some clues about that?

Thanks!

Boris · October 2, 2020, 9:03pm

Your choice point comes from citations_facts//1, most probably, and the use of the “skipping” DCG ...//0.

Edit: I suspect everything can be much easier if you used the DCG libraries:

library(dcg/basics)
library(dcg/high_order)

I wouldn’t mind showing how but the format of your input file is not completely clear to me tbh. For example, there seems to be some redundancy in the last three columns but are they really the same date/data?

Rscho314 · October 2, 2020, 9:34pm

Thanks, indeed I modified the code as such:

list([]) --> [].
list([L|Ls]) --> [L], list(Ls).

citation_facts([C|Cs]) --> list(_), citation_fact(C), ",",  citation_facts(Cs), list(_) .
citation_facts([]) --> [].

citation_fact([Pmid, Issn, Edat, Pyear]) --> "('",
                                                  list(P), { atom_codes(Pmid, P) }, "','",
                                                  list(I), { atom_codes(Issn, I) }, "','",
                                                  list(_),                          "','",
                                                  list(Ey), "-", list(Em), "-", list(Ed),
                                                  { number_codes(Ny, Ey),
                                                    number_codes(Nm, Em),
                                                    atom_codes(Ad, Ed), atom_number(Ad, Nd),
                                                    date_time_stamp(date(Ny,Nm,Nd,0,0,0,0,-,-), Edat) }, "',",
                                                  list(Y), { atom_codes(Py, Y), atom_number(Py, Pyear) },
                                             ")", !.

Which allowed to roughly divide the inference number by 2. The choice point is still here, though. And there are still ‘illegal number’ errors triggered if I try to replace the remaining atom_codes/2 atom_number/2 combinations by the normally equivalent number_codes/2.

A longer and more realistic subset of the data would be:

INSERT INTO `CITATIONS` VALUES ('1','0006-2944','1975 Jun','1975-6-1',1975),('10','1873-2968','1975 Sep 01','1975-9-1',1975),('100','0547-6844','1975','1975-1-1',1975),('1000','0264-6021','1975 Sep','1975-9-1',1975),('10000','0006-3002','1976 Sep 28','1976-9-28',1976),('100000','0160-3450','1978 Sep','1978-9-1',1978),('1000000','0006-3363','1976 Dec','1976-12-1',1976);

I can manage with just the time stamp. I thought keeping the redundant year from the original data would be practical since the year is the most important part of the date. The date columns are redundant in the original data, but they actually all represent slightly different events.

Boris · October 2, 2020, 9:41pm

Are the last three columns truly redundant? Or do you really want a “date” from the Year-Month-Day column and additionally a year from the last column?

Either way, if you used library(dcg/basics), you can save yourself some typing. It might also be more efficient, who knows

Could you fix your last comment to have the “realistic subset of data” formatted as code? At the moment the single quotes are turned into smart quotes by the overeager styler of discourse and I cannot copy-paste the data.

peter.ludemann · October 2, 2020, 9:54pm

30million facts isn’t a lot. I have some test data that’s 650K facts that look fairly similar to yours - it takes 126MB in text form and 58MB when compiled; so 30M facts would take about 2.5GB compiled. (I could get you more specific numbers, but it takes me a couple of hours of processing to generate 30M facts; the 650K facts take 14 seconds to compile and 0.5 seconds to load, so I’d extrapolate that to 10 minutes to compile 30M facts and 25 seconds to load.)

One easy way to use QLF format is to create an empty .qlf file (e.g., using touch) and then load the file without the .pl or .qlf – this will compile the facts the first time it’s done and replace the empty .qlf file.

EricGT · October 2, 2020, 10:57pm

Have you tried wrapping the line with a catch/3 and just printing the error when it occurs with the input that is causing the error. For that much data, which is probably hand generated, I would not be surprised if there were errors.

The other thing to remember is that you might be the first person to run certain checks against the data, either knowingly or unknowingly and thus find some invalid data. This has happened to me a few times with very large datasets where when using DCGs, some unexpected semantics passed into the data but the DCG would catch the error.

jan · October 3, 2020, 8:19am

Note that if you have the SQL data anyway and possibly loaded in a database, why not use the ODBC driver to get it out of there and write it to a file as clauses in the format you want? Finally qcompile/1 the file and you’re done.

30M facts isn’t much of a problem for SWI-Prolog, but depending on the shape and indexes you need it may require quite a bit of memory.

CapelliC · October 3, 2020, 10:57am

Forgive me for not tackling on your problem too much directly.

Here is is a generic SQL parser ( just INSERT right now ), and a ‘driver’ that attempts to save what have been parsed in QLF. This last doesn’t work, sorry, qcompile(data_in) apparently just dump the source - that is, an empty module declaration.

Hope that you can take inspiration from the parser tough, there is value in separating it from the actual statement processing - that in my example, happens in driver.pl.
Note how compact it is, making use of sequence//3 and string//1. There is also an hint for handling syntax errors.

sql_parser.pl

/*  File:    sql_parser.pl
    Author:  Carlo,,,
    Created: Oct  3 2020
    Purpose: help on https://swi-prolog.discourse.group/t/can-you-please-provide-advice-on-big-data-handling-strategy/2997
*/

:- module(sql_parser,
          [sql_parser/2]).

:- use_module(library(intercept)).
:- use_module(library(pure_input)).
:- use_module(library(dcg/basics)).
:- use_module(library(dcg/high_order)).

:- meta_predicate sql_parser(+, 1).

sql_parser(Sql,HandleStatement) :-
    intercept(phrase_from_file(sql, Sql),
              Statement,
              call(HandleStatement,Statement)).

sql --> [].
sql --> statement(S), {send_signal(S)}, !, sql.

statement(insert([Table_name|Values])) -->
    "INSERT INTO ",
    table_name(Table_name),
    " VALUES ",
    sequence(value,",",Values),
    ";",
    blanks. % just in case last line is missing EOL
statement(_) -->
    syntax_error(insert_expected).

table_name(Table_name) --> "`", string(Table_name), "`".
value(Value) --> "(",sequence(field,",",Value),")".
field(Field) --> integer(Field).
field(Field) --> "'", string(Codes), "'", {string_codes(Field,Codes)}. % note: ignoring escaped quoting...

/* TBD
"SELECT", ... this is challenging
"UPDATE", ... handle after SELECT, to reuse WHERE analysis
"DELETE", ... same advice as UPDATE
*/

driver.pl

/*  File:    driver.pl
    Author:  Carlo,,,
    Created: Oct  3 2020
    Purpose: help on https://swi-prolog.discourse.group/t/can-you-please-provide-advice-on-big-data-handling-strategy/2997
*/

:- module(driver,
          [db_to_qlf/0]).

:- use_module(sql_parser).
:- use_module(data_in).

db_to_qlf :-
    clear,
    sql_parser('data_in.sql', handle_sql_statement).

clear :-
    data_in:abolish('CITATIONS'/5),
    data_in:abolish('OTHERTABLE'/5).

handle_sql_statement(insert([TableNameString|Values])) :-
    atom_string(TableName,TableNameString),
    maplist({TableName}/[Fields]>>(
                Fact =.. [TableName|Fields],
                data_in:assertz(Fact)),
           Values).

data_in.pl

:- module(data_in, []).

Last, a dummy test file, derived from your example duplicating and tweaking some rows.
Worth to note that blanks//0 simplify a lot the end of line processing.

data_in.sql

INSERT INTO `CITATIONS` VALUES ('1','0006-2944','1975 Jun','1975-6-1',1975),('10','1873-2968','1975 Sep 01','1975-9-1',1975),('100','0547-6844','1975','1975-1-1',1975),('1000','0264-6021','1975 Sep','1975-9-1',1975),('10000','0006-3002','1976 Sep 28','1976-9-28',1976),('100000','0160-3450','1978 Sep','1978-9-1',1978),('1000000','0006-3363','1976 Dec','1976-12-1',1976);
INSERT INTO `CITATIONS` VALUES ('2','0007-2944','1975 Jun','1975-6-1',1975),('10','1873-2968','1975 Sep 01','1975-9-1',1975),('100','0547-6844','1975','1975-1-1',1975),('1000','0264-6021','1975 Sep','1975-9-1',1975),('10000','0006-3002','1976 Sep 28','1976-9-28',1976),('100000','0160-3450','1978 Sep','1978-9-1',1978),('1000000','0006-3363','1976 Dec','1976-12-1',1976);
INSERT INTO `CITATIONS` VALUES ('3','0008-2944','1975 Jun','1975-6-1',1975),('10','1873-2968','1975 Sep 01','1975-9-1',1975),('100','0547-6844','1975','1975-1-1',1975),('1000','0264-6021','1975 Sep','1975-9-1',1975),('10000','0006-3002','1976 Sep 28','1976-9-28',1976),('100000','0160-3450','1978 Sep','1978-9-1',1978),('1000000','0006-3363','1976 Dec','1976-12-1',1976);

INSERT INTO `OTHERTABLE` VALUES ('12','0016-2944','1995 Jun','1975-6-1',1975),('10','1873-2968','1975 Sep 01','1975-9-1',1975),('100','0547-6844','1975','1975-1-1',1975),('1000','0264-6021','1975 Sep','1975-9-1',1975),('10000','0006-3002','1976 Sep 28','1976-9-28',1976),('100000','0160-3450','1978 Sep','1978-9-1',1978),('1000000','0006-3363','1976 Dec','1976-12-1',1976);
INSERT INTO `OTHERTABLE` VALUES ('13','0026-2944','1995 Jun','1975-6-1',1975),('10','1873-2968','1975 Sep 01','1975-9-1',1975),('100','0547-6844','1975','1975-1-1',1975),('1000','0264-6021','1975 Sep','1975-9-1',1975),('10000','0006-3002','1976 Sep 28','1976-9-28',1976),('100000','0160-3450','1978 Sep','1978-9-1',1978),('1000000','0006-3363','1976 Dec','1976-12-1',1976);
INSERT INTO `OTHERTABLE` VALUES ('14','0036-2944','1995 Jun','1975-6-1',1975),('10','1873-2968','1975 Sep 01','1975-9-1',1975),('100','0547-6844','1975','1975-1-1',1975),('1000','0264-6021','1975 Sep','1975-9-1',1975),('10000','0006-3002','1976 Sep 28','1976-9-28',1976),('100000','0160-3450','1978 Sep','1978-9-1',1978),('1000000','0006-3363','1976 Dec','1976-12-1',1976);

You could try by means of

$ swipl driver.pl
...
?- db_to_qlf.
true .

?- data_in:listing.

:- dynamic'CITATIONS'/5.

'CITATIONS'("1", "0006-2944", "1975 Jun", "1975-6-1", 1975).
'CITATIONS'("10", "1873-2968", "1975 Sep 01", "1975-9-1", 1975).
'CITATIONS'("100", "0547-6844", "1975", "1975-1-1", 1975).
...

I hope @jan or @peter.ludemann could comment on the possibility to generate a QLF without saving the file - that by the way, could be as simple as issuing

?- tell('temp.pl'),data_in:listing,told,qcompile(temp). after a successful parse.

edit

I’ve found qsave_program/2, now looking for inner details.

edit again

Sorry, but cannot find a proper way to reuse qsave_program/2 (actually, src/pl-wic.c builtins) on this purpose. Seems we have to stick to dumping the source file for qcompile/1.

last edit, i promise

Without inlining of qcompile/1, there is little point to assertz/1 each record. Better then dump them while parsing…

simplified driver

:- module(driver,
          [db_to_qlf/0
          ]).

:- use_module(sql_parser).

db_to_qlf :-
    open('data_in.pl',write,S),
    time(sql_parser('data_in.sql', handle_sql_statement(S))),
    close(S),
    time(qcompile('data_in.pl')).

handle_sql_statement(S,insert([TableNameString|Values])) :-
    atom_string(TableName,TableNameString),
    maplist(
      {S,TableName}/[Fields]>>(
        F=..[TableName|Fields],
        format(S,'~q.~n',[F])
      ),
      Values).

HTH Carlo

EricGT · October 3, 2020, 12:06pm

It seems that you have the bad fortune of asking a question that is simple enough to play with but complex enough for us to have fun and demonstrate some useful code.

Here is my rendition of a possible solution. Don’t think you have to use this as I know when you look at it you will think some crazy guy wrote it with an overly complex set of DCGs, but having used DCGs to parse hundreds of files, this is what happens.

Details

Directory: C:/Users/Groot/Documents - Change as needed.

File: data.sql

INSERT INTO `CITATIONS` VALUES ('1','0006-2944','1975 Jun','1975-6-1',1975),('10','1873-2968','1975 Sep 01','1975-9-1',1975),('100','0547-6844','1975','1975-1-1',1975),('1000','0264-6021','1975 Sep','1975-9-1',1975),('10000','0006-3002','1976 Sep 28','1976-9-28',1976),('100000','0160-3450','1978 Sep','1978-9-1',1978),('1000000','0006-3363','1976 Dec','1976-12-1',1976);
INSERT INTO `CITATIONS` VALUES ('2','0006-2945','1975 Jun','1975-6-1',1975),('20','1873-2969','1975 Sep 01','1975-9-1',1975),('200','0547-6845','1975','1975-1-1',1975),('2000','0264-6022','1975 Sep','1975-9-1',1975),('20000','0006-3003','1976 Sep 28','1976-9-28',1976),('200000','0160-3451','1978 Sep','1978-9-1',1978),('2000000','0006-3364','1976 Dec','1976-12-1',1976);

For the example data above the second line is a copy of the first with the id for each citation changed.

File: persist_citations.pl
Note: The library(persistency) code is not needed to create the Quick Load File, but since you noted it in your question, there are not many working examples of library(persistency) code and it was easy, it was added.

:- module(persist_citations,[
    main/0
]).

:- use_module(library(persistency)).

:- set_prolog_flag(double_quotes, codes).
:- set_prolog_flag(back_quotes,string).

:- debug(citation).

:- persistent
    citation(pmid:number,issn:atom,edat:float,pyear:number).

% :- initialization(main).

file(input,'data.sql').

main :-
    db_attach('citation.journal', []),
    load_and_persist_data.

add_citation(Pmid, Issn, Edat, Pyear) :-
    debug(citation,'Pmid: ~w, Issn: ~w, Edat: ~w, Pyear: ~w',[Pmid,Issn,Edat,Pyear]),
    (
        (
            is_of_type(number, Pmid),
            is_of_type(atom, Issn),
            is_of_type(float, Edat),
            is_of_type(number, Pyear)
        )
    ->
        (
            citation(Pmid, Issn, Edat, Pyear), !
        ;
            assert_citation(Pmid, Issn, Edat, Pyear)
        )
    ;
        debug(citation,'Invalid citation data!~n',[])
    ).

load_and_persist_data :-
    setup_call_cleanup(
        (
            file(input,Filename),
            open(Filename,read,Stream)
        ),
        (
            % set_stream(Stream, newline(posix)), % If you need to preserve cr lf endings.
            read_stream_to_codes(Stream,Codes),
            DCG = 'citation_lines+',
            phrase(DCG,Codes,[])
        ),
        close(Stream)
    ).

'citation_lines+' -->
    citation_line, !,
    'citation_lines*'.
'citation_lines*' -->
    citation_line, !,
    'citation_lines*'.
'citation_lines*' --> [].

citation_line -->
    "INSERT INTO `CITATIONS` VALUES ",
    'citations+',
    semicolon,
    'eol?'.

'citations+' -->
    citation, !,
    'citations*'.
'citations*' -->
    comma,
    citation, !,
    'citations*'.
'citations*' --> [].

citation -->
    open_paren,
    pmid(Pmid),
    comma,
    issn(Issn),
    comma,
    date_unknown,
    comma,
    edat(Edat),
    comma,
    pyear(Pyear),
    close_paren,
    { add_citation(Pmid, Issn, Edat, Pyear) }.

pmid(Number) -->
    single_quote,
    'digit+'(Digit_codes,[]),
    single_quote,
    { number_codes(Number,Digit_codes) }.

issn(Atom) -->
    single_quote,
    'digit+'(T0,T1),
    dash(T1,T2),
    'digit+'(T2,[]),
    single_quote,
    { atom_codes(Atom,T0) }.

date_unknown -->
    single_quote,
    'digit+',
    'month_3_letter?',
    'date?',
    single_quote.

'month_3_letter?' --> sp, month_3_letter, !.
'month_3_letter?' --> [].

'date?' --> sp, 'digit+', !.
'date?' --> [].

edat(TimeStamp) -->
    single_quote,
    'digit+'(Year_codes,[]),
    dash,
    'digit+'(Month_codes,[]),
    dash,
    'digit+'(Day_codes,[]),
    single_quote,
    {
        number_codes(Year,Year_codes),
        number_codes(Month,Month_codes),
        number_codes(Day,Day_codes),
        date_time_stamp(date(Year,Month,Day,0,0,0,0,-,-), TimeStamp)
    }.

pyear(Number) -->
    'digit+'(T0,[]),
    { number_codes(Number,T0) }.

% Recognizers
% If the sequence is successfully recognized the side effect is that the sequence is removed from the input.
month_3_letter --> month_3_letter(_,_).
'digit+'       --> 'digit+'(_,_).
'digit*'       --> 'digit*'(_,_).
digit          --> digit(_,_).
'wsp+'         --> 'wsp+'(_,_).
'wsp*'         --> 'wsp*'(_,_).
wsp            --> wsp(_,_).
'cr?'          --> 'cr?'(_,_).
'eol?'         --> 'eol?'(_,_).
htab           --> htab(_,_).
cr             --> cr(_,_).
lf             --> lf(_,_).
sp             --> sp(_,_).
dash           --> dash(_,_).
semicolon      --> semicolon(_,_).
comma          --> comma(_,_).
open_paren     --> open_paren(_,_).
close_paren    --> close_paren(_,_).
single_quote   --> single_quote(_,_).

month_3_letter([0'J,0'a,0'n|T],T) --> "Jan", !.
month_3_letter([0'F,0'e,0'b|T],T) --> "Feb", !.
month_3_letter([0'M,0'a,0'r|T],T) --> "Mar", !.
month_3_letter([0'A,0'p,0'r|T],T) --> "Apr", !.
month_3_letter([0'M,0'a,0'y|T],T) --> "May", !.
month_3_letter([0'J,0'u,0'n|T],T) --> "Jun", !.
month_3_letter([0'J,0'u,0'l|T],T) --> "Jul", !.
month_3_letter([0'A,0'u,0'g|T],T) --> "Aug", !.
month_3_letter([0'S,0'e,0'p|T],T) --> "Sep", !.
month_3_letter([0'O,0'c,0't|T],T) --> "Oct", !.
month_3_letter([0'N,0'o,0'v|T],T) --> "Nov", !.
month_3_letter([0'D,0'e,0'c|T],T) --> "Dec".

'digit+'(T0,T) -->
    digit(T0,T1), !,
    'digit*'(T1,T).

 'digit*'(T0,T) -->
    digit(T0,T1), !,
    'digit*'(T1,T).
 'digit*'(T,T) --> [].

 digit([C|T],T) -->
    [C],
    { between(0'0,0'9,C) }.

'wsp+'(T0,T) -->
    wsp(T0,T1), !,
    'wsp*'(T1,T).

'wsp*'(T0,T) -->
    wsp(T0,T1), !,
    'wsp*'(T1,T).
'wsp*'(T,T) --> [].

wsp(T0,T) -->
    sp(T0,T), !.
wsp(T0,T) -->
    htab(T0,T).

'cr?'(T0,T) --> cr(T0,T), !.
'cr?'(T,T) --> [].

'eol?'(T0,T) -->
    'cr?'(T0,T1),
    lf(T1,T), !.
'eol?'(T,T) --> [].

htab([0x09|T],T) --> [0x09].
cr([0x0D|T],T)   --> [0x0D].
lf([0x0A|T],T)   --> [0x0A].
sp([0x20|T],T)   --> [0x20].

dash([0'-|T],T)         --> "-".
semicolon([0';|T],T)    --> ";".
comma([0',|T],T)        --> ",".
open_paren([0'(|T],T)   --> "(".
close_paren([0')|T],T)  --> ")".
single_quote([0''|T],T) --> "'".

If you have questions please ask; it would takes pages just to explain why the code is written like this. I know this is not beginner DCGs and does lots of subtle things that will not be be easily understood.

File: citation.journal - This is where the data is persisted after running the code.

created(1601723402.391526).
assert(citation(1,'0006-2944',170812800.0,1975)).
assert(citation(10,'1873-2968',178761600.0,1975)).
assert(citation(100,'0547-6844',157766400.0,1975)).
assert(citation(1000,'0264-6021',178761600.0,1975)).
assert(citation(10000,'0006-3002',212716800.0,1976)).
assert(citation(100000,'0160-3450',273456000.0,1978)).
assert(citation(1000000,'0006-3363',218246400.0,1976)).
assert(citation(2,'0006-2945',170812800.0,1975)).
assert(citation(20,'1873-2969',178761600.0,1975)).
assert(citation(200,'0547-6845',157766400.0,1975)).
assert(citation(2000,'0264-6022',178761600.0,1975)).
assert(citation(20000,'0006-3003',212716800.0,1976)).
assert(citation(200000,'0160-3451',273456000.0,1978)).
assert(citation(2000000,'0006-3364',218246400.0,1976)).

Example run with debug enabled.

?- working_directory(_,'C:/Users/Groot/Documents').
true.

?- ['persist_citations.pl'].
true.

?- main.
% Pmid: 1, Issn: 0006-2944, Edat: 170812800.0, Pyear: 1975
% Pmid: 10, Issn: 1873-2968, Edat: 178761600.0, Pyear: 1975
% Pmid: 100, Issn: 0547-6844, Edat: 157766400.0, Pyear: 1975
% Pmid: 1000, Issn: 0264-6021, Edat: 178761600.0, Pyear: 1975
% Pmid: 10000, Issn: 0006-3002, Edat: 212716800.0, Pyear: 1976
% Pmid: 100000, Issn: 0160-3450, Edat: 273456000.0, Pyear: 1978
% Pmid: 1000000, Issn: 0006-3363, Edat: 218246400.0, Pyear: 1976
% Pmid: 2, Issn: 0006-2945, Edat: 170812800.0, Pyear: 1975
% Pmid: 20, Issn: 1873-2969, Edat: 178761600.0, Pyear: 1975
% Pmid: 200, Issn: 0547-6845, Edat: 157766400.0, Pyear: 1975
% Pmid: 2000, Issn: 0264-6022, Edat: 178761600.0, Pyear: 1975
% Pmid: 20000, Issn: 0006-3003, Edat: 212716800.0, Pyear: 1976
% Pmid: 200000, Issn: 0160-3451, Edat: 273456000.0, Pyear: 1978
% Pmid: 2000000, Issn: 0006-3364, Edat: 218246400.0, Pyear: 1976
true.

?- halt.

Example run with :- debug(citation). commented out.

?- working_directory(_,'C:/Users/Groot/Documents').
true.

?- ['persist_citations.pl'].
true.

?- main.
true.

?- halt.

Note: If you have not worked with library(persistency) before, you need to run halt at the end so that the persistency file will be updated and closed. Until halt the file will be open but not contain all of the data.

Example run that creates the Quick Load File.

?- working_directory(_,'C:/Users/Groot/Documents').
true.

?- ['persist_citations.pl'].
true.

?- main.
true.

?- tell('citation_facts.pl'),listing(persist_citations:citation/4),told.

With a text editor open `citation_facts.pl` and remove the line `:- dynamic citation/4.` I don't know how to create a listing of a dynamic predicate without that appearing.

?- qcompile('citation_facts.pl').
true.

Check to make sure `citation_facts.qlf` was created; it is a binary file so not worth trying to read in an editor.

?- halt.

Start up a new SWI-Prolog instance.

?- working_directory(_,'C:/Users/Groot/Documents').
true.

?- ['citation_facts.qlf'].
true.

?- citation(A,B,C,D).
A = 1,
B = '0006-2944',
C = 170812800.0,
D = 1975 ;
A = 10,
B = '1873-2968',
C = 178761600.0,
D = 1975 ;
A = 100,
B = '0547-6844',
C = 157766400.0,
D = 1975 ;
...

?- citation(Pmid,Issn,Edat_timestamp,Pyear),stamp_date_time(Edat_timestamp,Edat,0).
Pmid = 1,
Issn = '0006-2944',
Edat_timestamp = 170812800.0,
Pyear = 1975,
Edat = date(1975, 6, 1, 0, 0, 0.0, 0, -, -) ;
Pmid = 10,
Issn = '1873-2968',
Edat_timestamp = 178761600.0,
Pyear = 1975,
Edat = date(1975, 9, 1, 0, 0, 0.0, 0, -, -) ;
...

HTH

Rscho314 · October 3, 2020, 10:02pm

Well, actually… that’s a much better solution! I was so preoccupied with transcribing to facts that I missed the forest for the trees. Thank you!

Rscho314 · October 3, 2020, 10:05pm

Thank you for showing those examples. I’m gonna make good use of all that!

peter.ludemann · October 3, 2020, 11:01pm

I find SQL so painful that I’ve sometimes converted SQL databases into Prolog facts and query with Prolog.

Rscho314 · October 4, 2020, 2:03am

This is exactly what I’m going to do. I aim at exploratory analysis in this database, so SQL would be very impractical.

Boris · October 4, 2020, 5:55am

Finally, I found some time to show what I mean by “use library(dcg/basics)”. The code is available on SWISH, here:

https://swish.swi-prolog.org/p/JqbnsjEW.pl

Use ?- run. to run it.

This code also shows how to use library(intercept). It makes it possible to write a DCG without arguments, so less typing, which I like. Because I cannot assert on SWISH, the third argument in the intercept/3 call just writes to standard output; for your use-case, you could be doing an assertz there.

A practical note: this all depends on your setup and such code is probably completely unnecessary. Even if you don’t want to setup odbc connectivity for your database, but your data is already in a database anyway, you could just query the data, output it in tabular format, then use library(csv) to read that and assertz.

jan · October 4, 2020, 7:51am

Me too, but SELECT * from table is doable and enough to import the database.

You can.

P.s. Install a local copy of SWISH and you have a great tool for exploring data as it is easy to write little programs and notebooks, dump tables, charts, diagrams, etc.

Rscho314 · October 4, 2020, 9:10am

Thanks for taking the time to write all that. Indeed, I it looks far simpler and more modular. I’ll try using library(dcg/basics) from now on. I’m not sure yet of how I’ll end up tackling this task, but most likely it will be a mix of all of you guys have shown me here.

Topic		Replies	Views
I've put an introductory Prolog tut for SQL programmers on Swish Nice to know	47	2480	March 4, 2021
Memory use for database Help!	16	913	June 19, 2022
Database micro-benchmark (Discussion) Discussion	22	840	April 25, 2020
Scaling to billions of facts? General	18	4565	November 28, 2024
Does prolog fit my use case? Help!	21	872	September 17, 2019

Can you please provide advice on big data handling strategy?

Related topics