Can you please provide advice on big data handling strategy?

Boris · October 4, 2020, 9:45am

~~I probably can but I don’t know how.~~ I was trying to use listing/1 to show the results; this was probably the problem. I have now edited the program at the same link:

https://swish.swi-prolog.org/p/JqbnsjEW.pl

It is not so bad. Prolog for data analysis is somewhat similar to R. The sweet spot is when the data volume is not huge, but the analysis is complex. I guess a mixed approach is to export the useful data first, then use Prolog only for the relevant subset of the data.

The difficulty with that is that it becomes quite difficult to properly automate and document the workflow (query relational database with SQL, import to Prolog, analyze, visualize results and so on).

EricGT · October 4, 2020, 10:23am

You replied to many who posted here, but you did not reply to those of us who used the hide-details toggle.

In case you are not aware, if you see a post with something with a small triangle in front of it like this

Summary

This text will be hidden

then you need to click on the line to expand the section.

Doing so for this example will reveal This text will be hidden. In the above post, there is lots of valuable code in those hidden details.

We use them to hide the details so that those not wanting to see the detail don’t have to spend time scrolling past it.

EDIT

See: Expandable text sections

CapelliC · October 4, 2020, 10:25am

Although I love Prolog, I find your claim about SQL difficult to understand, unless you’re going to use some higher-level library you have available in Prolog.

But are you already acquainted with aggregation or tabling in SWI-Prolog?
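For readers who have not met tabling yet, here is a minimal sketch of what it does in SWI-Prolog. The `edge/2` facts and the `connected/2` predicate are made up for illustration; the only feature being shown is the `:- table` directive. Without that directive, the left-recursive clause below would not terminate on this cyclic graph.

```prolog
% Tabling (memoisation) makes left recursion and cyclic data terminate.
:- table connected/2.

% Hypothetical facts: a small directed graph with a cycle a -> b -> c -> a.
edge(a, b).
edge(b, c).
edge(c, a).

% connected(?X, ?Y): Y is reachable from X over one or more edges.
connected(X, Y) :- connected(X, Z), edge(Z, Y).
connected(X, Y) :- edge(X, Y).
```

A query such as `?- connected(a, X).` then enumerates a, b and c (in no guaranteed order) and terminates.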

Right now I’m working with SqlServer (T-SQL), and I’ve replaced a long-standing Delphi program (glue code on top of ADO+DB extensions to handle in-memory DBs) with an SQL script that is way shorter and easier to debug/extend, more accurate and way faster than the original.

What are you missing in your SQL exploration work?

Rscho314 · October 4, 2020, 11:00am

Well, I’m trying some metaresearch based on Semantic MEDLINE for my domain of focus (anesthesiology, am M.D.). I’m not claiming anything about SQL apart from the fact that I don’t like to read or write it. I’m planning to explore what is possible in ways similar to those guys, who are using MiniKanren instead of Prolog. However, I’d like to see what’s possible using extensions such as probabilistic logic programming or inductive logic programming. It’s fairly likely that I’ll end up with nothing of value.

Cannot say I’m acquainted, but I do know that those extensions exist, and will possibly use them.

I have no doubt that the SQL script is faster and simpler. Modern databases are finely polished engineering products supporting most of the business world, so it’s no surprise that they are extremely well optimized. However, I’m not simply searching for needles in a haystack. I’d like to try new things that might (highly unlikely) result in innovative biomedical results pertaining specifically to anesthesiology. Given the required flexibility (very high) and data volume (large, but not that large), it seems to me that Prolog is a way better candidate than SQL+<random functional/statistical language>. Don’t you agree?

EricGT · October 4, 2020, 11:11am

While you will not often see people who use Prolog for bioinformatics here, they are around.

Reactome Pengine (GitHub) (paper)
SWI-Prolog pack bio_analytics.

If you looked at my examples in the QLF post you will see that it was using author references, e.g.

uniProt_reference_authors(reference_id(entry_name(swiss_prot,"001R","FRG3G"),1),"Tan W.G.").

EricGT · October 4, 2020, 11:40am

Normally I would not post back to back, but since you are new here you might miss this if I did an edit to an existing post, e.g.

EDIT

Updated info.

So here is the new info in a new post.

See:
SWI-Prolog pack: cplint - A suite of programs for reasoning with probabilistic logic programs.
SWI-Prolog pack: bims - Bayesian inference of model structure. (Thanks Nicos (ref))

See SWI-Prolog pack: aleph - Aleph Inductive Logic Programming system.

I have no experience with any of them.

EDIT

Also see: Installing a SWI-Prolog pack

CapelliC · October 4, 2020, 12:14pm

To be honest, this is exactly the point… In Prolog, matching text is restricted to identity, while in SQL you have approximate matching and easy aggregation, you can ignore (if you want) casing and, to some extent, weird problems related to text encoding, and all major SQL engines have full-text search (FTS) extensions that can make exploring large amounts of natural-language documents feasible.
Anyway, I would add to @EricGT’s suggestion this pack (see https://terminusdb.com/), which could be a useful tool, especially because of its relation to SWI-Prolog.

Success!

EricGT · October 4, 2020, 12:22pm

Of additional note is that TerminusDB is built using SWI-Prolog (main.pl) and also has a public forum using Discourse and a chat using Discord.

They are also a sponsor of SWI-Prolog.

Rscho314 · October 4, 2020, 12:31pm

Yes, I completely agree. But myriads of people are already working on biomedical NLP mining, with limited success when it comes to real-world, clinically applicable results, I must say (although they do fare better for more fundamental results such as metabolic pathways). Unfortunately, that’s not a surprise. I have done enough ‘classic’ clinical research to know that we have a long way to go until NLP is reliable enough for the use cases we MDs would like to see. That’s why I’m trying to use semantically encoded data. The rigidity of the format makes it much more amenable to single-nerd research. This view is also supported by the fact that the US precision medicine initiative obtained encouraging real-world results based upon semantically encoded info.

Thanks, I did not know of that pack. TerminusDB is very interesting, and I will definitely try it someday. For now, I’m unfortunately stuck with handling misbehaving Excel spreadsheets from my head of unit (which is my very own corner of hell on earth).

peter.ludemann · October 4, 2020, 5:39pm

It’s easy to do approximate matching in Prolog; you can’t do it with unification, that’s all. For full-text search, wouldn’t something like Lucene be better than SQL add-ons?
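As a small illustration of matching beyond unification, here is a sketch of case-insensitive substring matching using the built-ins `string_lower/2` and `sub_string/5`. The `author/2` facts and the `icase_contains/2` helper are hypothetical names chosen for this example, and this is only one of many possible notions of "approximate".

```prolog
% Hypothetical facts: author(Id, Name), with inconsistent casing.
author(1, "Tan W.G.").
author(2, "TAN w.g.").
author(3, "Smith J.").

% icase_contains(+Text, +Needle): Needle occurs in Text, ignoring case.
icase_contains(Text, Needle) :-
    string_lower(Text, LowerText),
    string_lower(Needle, LowerNeedle),
    once(sub_string(LowerText, _, _, _, LowerNeedle)).
```

With these facts, `?- author(Id, Name), icase_contains(Name, "tan").` succeeds for ids 1 and 2, which plain unification against `"Tan W.G."` would not do.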

For easy aggregation, see library(aggregate).
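A minimal sketch of what library(aggregate) offers, assuming a made-up `citation/2` table. `aggregate/3` groups on the free variable (here `Year`) much like `GROUP BY`, while `aggregate_all/3` computes a single result over all solutions.

```prolog
:- use_module(library(aggregate)).

% Hypothetical facts: citation(PaperId, Year).
citation(p1, 2018).
citation(p2, 2018).
citation(p3, 2019).

% papers_per_year(?Year, -Count): roughly
%   SELECT year, COUNT(*) FROM citation GROUP BY year
papers_per_year(Year, Count) :-
    aggregate(count, Paper^citation(Paper, Year), Count).

% total_papers(-Count): roughly SELECT COUNT(*) FROM citation
total_papers(Count) :-
    aggregate_all(count, citation(_, _), Count).

% latest_year(-Year): roughly SELECT MAX(year) FROM citation
latest_year(Year) :-
    aggregate_all(max(Y), citation(_, Y), Year).
```

On backtracking, `?- papers_per_year(Year, Count).` yields one solution per year.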

EricGT · October 4, 2020, 5:59pm

Lucene is what Neo4j uses. (ref)

EDIT

Redis, which was a recent addition to SWI-Prolog, also does Full Text Search (ref)

CapelliC · October 4, 2020, 6:02pm

I feel your pain.
If your spreadsheets are in xlsx format, and you want to try using Prolog on them, you could consider this pack

drspro · October 6, 2020, 6:23am

In my experience, Prolog and SWI-Prolog are very suitable for analysing data, including large amounts of data.

In most cases it is of course not necessary to keep everything in memory in order to consider the total amount of data.

In recent job descriptions I have seen that data mining, building the data warehouse, and being a data engineer form a new area. There exist several data analysis tools that you have to be familiar with to get the job.

It seems that I missed something in the past years, because I always thought that data analysis was a standard job for the programmer.

Again there are new tools that of course I don’t know yet, and that you have to have experience with to be able to participate in these jobs.

Why are there yet again new data analysis tools, while we have been able to do data mining with SWI-Prolog for years?

jan · October 6, 2020, 6:45am

At CWI we have been using SWI-Prolog in a data science pipeline. I wrote a blog post about that for Amsterdam Data Science. See also SWISH DataLab: A Web Interface for Data Exploration and Analysis

Sorry for the shameless self-advertising

nicos · October 6, 2020, 8:18am

Dear Rscho314,

pack(db_facts) has facilities for automatically mapping SQL tables to Prolog facts. The added bonus is that you can use a common interface for ODBC and SQLite (via pack(proSQLite)). In my experience ODBC is much more stable and complete, and appropriate for multi-table databases, but has a high setup cost and low interoperability, whereas SQLite is much more portable and extremely easy to set up.

pack(bio_db) has many tricks for serving data from databases as Prolog facts. It also comes with a number of large tables that can be used to test in-memory vs. DB performance.

In terms of probabilistic programming, there is also pack(bims). In terms of analysis, there are interface packages to R: pack(Real) for the terminal, and Rserve (https://swish.swi-prolog.org/example/Rserve.swinb) for SWISH.

Hope it helps,

Nicos Angelopoulos

EricGT · October 6, 2020, 8:39am

With regards to using SWI-Prolog with SQL databases these might be of use:
Wiki Discussion: Prolog in the mind set of SQL
SWI-Prolog connecting to PostgreSQL via ODBC
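As a rough sketch of the "query the database, then use the rows as facts" workflow, the following uses SWI-Prolog's bundled library(odbc). The DSN, the table name `predication` and its three columns are assumptions made up for this example, and a working ODBC driver and data source are required for `load_predications/1` to run.

```prolog
:- use_module(library(odbc)).

:- dynamic predication/3.

% load_predications(+DSN): copy every row of a table into Prolog facts.
% The DSN, the table name and the column names are hypothetical.
load_predications(DSN) :-
    setup_call_cleanup(
        odbc_connect(DSN, Conn, []),
        forall(odbc_query(Conn,
                          'SELECT subject, predicate, object FROM predication',
                          Row),
               store_row(Row)),
        odbc_disconnect(Conn)).

% store_row(+Row): odbc_query/3 returns each result row as a row/N term.
store_row(row(Subject, Predicate, Object)) :-
    assertz(predication(Subject, Predicate, Object)).
```

After a call like `?- load_predications(my_dsn).` the data can be queried as ordinary `predication/3` facts, so it makes sense to select only the relevant subset in the SQL query.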

EricGT · October 6, 2020, 9:16am

Getting back to the title of the topic Can you please provide advice on big data handling strategy? and now knowing

Do you need to load and transform the data from the data source into Prolog on a regular basis, e.g. daily or weekly? Or are the loads further apart, e.g. monthly or quarterly? Or is this a one-time transfer?

The reason I ask is that if the load and transform will be used quite often, then it is worth the effort to look into automating the process as much as possible, i.e. via SQL, but if the load and transform is rare, then I would just do what you are doing and load from an SQL dump, XML dump, or similar.

Rscho314 · October 6, 2020, 10:14am

This would be a periodic update with at least some months interval, so SQL dump in QLF appears fine to me. I made a QLF last night and it went ok.
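For reference, a minimal sketch of the "compile the dump to QLF once, load the QLF afterwards" step, using the built-in `qcompile/1`. The helper names `qlf_path/2` and `build_qlf/2` and the file name in the usage note are hypothetical.

```prolog
% qlf_path(+Source, -Qlf): name of the .qlf file that qcompile/1 writes
% next to Source, e.g. 'facts.pl' -> 'facts.qlf'.
qlf_path(Source, Qlf) :-
    file_name_extension(Base, _, Source),
    file_name_extension(Base, qlf, Qlf).

% build_qlf(+Source, -Qlf): compile a (large) file of facts to QLF once.
build_qlf(Source, Qlf) :-
    qcompile(Source),
    qlf_path(Source, Qlf).
```

Assuming the dump has been converted to a file of facts called `semmed_facts.pl`, `?- build_qlf('semmed_facts.pl', Qlf).` would be run after each periodic update, and later sessions should be able to load the result with `?- load_files('semmed_facts.qlf', []).`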

A bigger problem, though, is that I was wrong in stating 30x10^6 facts. The QLF is actually on the order of 150x10^6 facts (38.2 GB). Unfortunately, 96 GB of RAM is not enough, so now I have 2 possibilities:

1. upgrade to server-class hardware
2. abandon hopes of having the data all in-memory

I am worried that option 2 will get in the way later when experimenting with pack(cplint) and others. BTW, I have no idea whether pack(cplint) or pack(aleph) will be able to handle that much data. I also looked at potential alternatives such as answer-set programming, Datalog and SQL-based solutions, but found nothing too convincing.

What do industry guys use when they want to combine deductive and statistical inference à la cplint? There are many possibilities if you just want to query and compute stats on that, the most evident being SQL+R, but what if you want to generate a list of deduced facts with their probability of occurrence?

EricGT · October 6, 2020, 10:29am

In case you or others are not aware of this.

Fabrizio Riguzzi (friguzzi) is the developer of cplint and related code and is a member of this forum.
Members of this forum can be notified in this forum using @ with the user name, e.g. @EricGT
You can also create private/personal messages with one or more users.

If you click on a user’s icon, then in the upper right click on message, this will create a personal/private message. The admins can read the personal/private messages if need be, but we haven’t had the need. You can also invite more than one person into the message.

To see your mailbox of private/personal messages, click on the inbox in the upper right of the page.

Do not take this info to mean that you are required to use personal/private messages, it is just an option for everyone.
