Get html out of package(http)

The thing is: I need use_module(library(http/html_write)), which is part of package http. No problem, but the more technical parts of http depend on ssl (and ssl depends on other things), so the additional dependency is a bit overkill.

Suggestion:

  • create a separate package html that only has html_write (and the quasiquotation)
  • let http depend on html and ssl

Does this sound reasonable?

It is funny you mention this right now. A couple of days ago I was trying to scrape a web site; here is a tiny incantation that gets the links inside a page (oversimplified for demonstration):

html_links(Source, Link) :-
    load_html(Source, DOM, []),
    xpath(DOM, //a, Link).

Now, load_html/3 is documented as:

Availability: :- use_module(library(sgml)). (can be autoloaded)

And xpath/3 is documented as:

Availability: :- use_module(library(xpath)). (can be autoloaded)

But if you try to run this, you hit two problems. First, the // operator is not yet defined, so:

?- [scrape].
ERROR: scrape.pl:4:16: Syntax error: Operator expected
true.

Then, if you do load library(xpath), it compiles but you still get:

?- html_links("https://swi-prolog.org", Links).
ERROR: iri_scheme `https' does not exist

Loading library(sgml) does not help. A minimal working example is:

:- use_module(library(http/http_open)).
:- use_module(library(xpath)).

html_links(Source, Link) :-
    load_html(Source, DOM, []),
    xpath(DOM, //a, Link).

Why can I skip library(http/http_ssl_plugin)? I don’t know that either.

PS: I probably don’t quite understand autoloading.

Autoloading operators gets kind of hard :frowning:

You need library(http/http_open) because it acts as a plugin to open_any/5, which load_html/3 uses, to provide support for http and https URLs. Without it you end up opening the URL as a file. For some versions now, all basic Prolog predicates that accept a file name also accept a URL, provided the plugins for the scheme in use are loaded. That is currently used for the res:// URLs that provide access to the resources compiled into a saved state. It should also be added for http, etc. (I guess).

All in all, autoloading is great for predicates that

  • Take arguments that do not require operators defined in the same library.
  • Do not depend on plugins (or you must load the plugin beforehand, but that mostly defeats the purpose of autoloading).
  • Do not use goal_expansion/2 on their arguments. This indirectly also affects meta-predicates, whose meta-arguments are usually subject to goal_expansion/2; since we do not know that the predicate is a meta-predicate before it is loaded, this expansion is not applied.

That still leaves a large part of the library where it works great, including calling most of the development tools such as listing/1, gtrace/0, etc.

For actual product development I typically explicitly import required libraries. Using PceEmacs you can do so using ^C^D for a specific file (still has some issues, but on most files it works great).
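
A minimal sketch of what explicit imports look like for the scraping example above. The predicate name page_link/2 is made up for illustration; the libraries are the ones already mentioned in this thread.

```prolog
% Explicit imports: nothing here relies on the autoloader.
:- use_module(library(sgml)).            % load_html/3
:- use_module(library(xpath)).           % xpath/3 and the // operator
:- use_module(library(http/http_open)).  % plugin so URLs are opened over http(s)

%!  page_link(+Source, -Href) is nondet.
%
%   Hypothetical helper: enumerate the href attribute of each
%   <a> element found in Source (a file name or a URL).
page_link(Source, Href) :-
    load_html(Source, DOM, []),
    xpath(DOM, //a(@href), Href).
```

Because the directives come before the clause, the // operator is already defined when the clause is read, and the http plugin is in place before load_html/3 is called.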


Good question. It is surely the case that the HTTP package contains a lot of stuff that is also useful without using the HTTP protocol. Think for example about JSON read/write (which should not be in library(http/…) anyway). Also HTTP comes with a client and server side and many applications only want one of them.

If we go this way, I fear we’ll have to chop many of the packages into smaller chunks. Think of the clib package, which is a more or less assorted collection of stuff that started off as a collection of libraries that required (simple) foreign language support.

There is surely some value in splitting, but it also comes at a significant cost in terms of code reorganization and maintenance while the only reward seems the ability to build a lean installation that supports a particular application. Given an application you can always generate a stand-alone binary that just contains the stuff needed by that application.
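
As a rough sketch of the stand-alone route, assuming a small program that only needs html_write: the file name hello.pl, the output name hello and the predicates main/0 and build/0 are made up for illustration.

```prolog
% hello.pl (hypothetical file name)
:- use_module(library(http/html_write)).

% Entry point of the saved program: print a small HTML fragment.
main :-
    phrase(html(p('Hello from a stand-alone binary')), Tokens),
    print_html(Tokens),
    nl.

% Write the executable `hello` (hypothetical name). The saved state
% contains the loaded program together with the library code it uses.
build :-
    qsave_program(hello, [ stand_alone(true),
                           goal(main),
                           toplevel(halt)
                         ]).
```

Running something like `swipl -g build -t halt hello.pl` should then produce the executable. Note that foreign plugins loaded by the program are a separate concern and are not covered by this sketch.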

As is, the whole system, including the Qt console, is 75 MB on Linux. If we strip the C debug symbols we end up with 38 MB. A quick analysis says this consists of four roughly equally sized pieces: the library, the binaries (including the plugins), xpce, and the documentation plus example code.

If we want to reduce that, the quickest solution might be to allow libraries (and docs) to be packed in zip archives.

So, what are you after?

My use case is again the R library, for that I would like to have the nice html functions, but remove the dependency on SSL with the problematic warning on the deprecated ‘SecKeychainOpen’ („Custom keychain management is no longer supported“) on MacOS. The warning is one thing, but having the html_write functions separated from http (and ssl) is also a nice feature.

I’ve just finished the experiment. In this particular case, the relevant changes are:

  • PackageSelection.cmake
  • A few changes in CMakeLists
  • replace library(http/html_write) by library(html/html_write) at many places

What do you think? I would send you a bunch of PRs then :blush:

PS. Now, there are people who have use_module(library(http/html_write)) in their code, and they may not be amused by the change. I thought I can insert a little stub into http/html_write that reexports everything from html/html_write, but this did not work out of the box.
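
For reference, the kind of stub meant here would look roughly like the sketch below. It is untested and only illustrates the idea: library(html/html_write) is the location from the experiment, not an existing library, and the module name html_write_compat is made up so that it does not clash with the html_write module itself.

```prolog
% http/html_write.pl (hypothetical compatibility stub)
%
% Keeps existing code that does
%     :- use_module(library(http/html_write)).
% working by passing on everything exported from the new location.
:- module(html_write_compat, []).
:- reexport(library(html/html_write)).
```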

I prefer to consider these separate issues. I do not really have a solution for the warning. The code provides access to the MacOS system TLS certificate chain that is used to validate HTTPS certificates. I have not yet found an alternative way to get these certificates (that may have changed). We could have a build option to use a file-based root certificate set. If R already provides certificates this would not be that hard.

It should also not be that hard to ensure the http package can be included without including ssl (isn’t that the case already)?

I think I’m not in favor. To me, it seems an ad-hoc solution for a warning on one specific OS. We can fix that differently. A more useful discussion is about the dependency management between extension packages and the granularity of these packages. As is, this dependency management is virtually absent and you have only a few realistic scenarios:

  • Build only the core
  • Build the core with all but a few large packages (drop e.g., odbc, jpl, xpce, the Qt console)
  • Build all (but some) and drop the docs.

I see no reason to go through a couple of iterations to get all builds, distributions, etc. properly working again for a single ad-hoc change in this area.

Dear Jan,

regarding „It should also not be that hard to ensure the http package can be included without including ssl (isn’t that the case already)?”

Actually, no. If I include http in the build, I cannot switch ssl off. That was the point of my exercise. Well, approximately, since I just took out html from http which still needs ssl.

Otherwise, FWIW, I agree. Though I don’t know how to untangle the dependencies between http and SSL.

Thank you for your consideration.

Best wishes,

Matthias

Got that. Once upon a time they surely were independent :slight_smile: Some of the dependencies are there more to avoid ordering issues during the parallel build. For example, it doesn’t really matter whether ssl is there, but if it is there all of ssl should be present, not (for example) just ssl.pl. If we do not create a CMake dependency, cmake will run build steps for other packages (notably the docs) while only part of ssl has been built.

But, I fear we need to backtrack a little further. SSL support is also required to install user add-ons (pack_install/1) because it needs to query https://www.swi-prolog.org. I don’t think you would like to drop that, no? If we agree on that, our target becomes SSL (certificates) on MacOS …

„SSL support is also required to install user add-ons (pack_install/1) because it needs to query https://www.swi-prolog.org. I don’t think you would like to drop that, no? If we agree on that, our target becomes SSL (certificates) on MacOS…“

Technically correct. Well, I mean, I would not even need pack_install for the R module, but this is very specific. Everyone else would probably want even a minimal system to be extensible.

Ok. Did some more research and found an alternative way to access the MacOS root certificates using SecTrustCopyAnchorCertificates(). That is not deprecated and significantly simplifies the code as well :slight_smile: Pushed as SWI-Prolog/packages-ssl@2cbb093 (PORT: Avoid deprecated SecKeychainOpen()).

I think this is fine. Sent a mail to Matt Lilley who wrote this code to verify I didn’t miss anything vital.
