Interesting StackOverflow question on replacing text in HTML. Looking for better way to do this

EricGT · April 9, 2019, 10:08pm

I am a regular on StackOverflow, Guy Coder, and saw this question on the language-agnostic tag. I find the tag is a great place to show off what Prolog can do compared to other languages.

The problem is:

I have a couple of thousands of .html files and I need to search and replace a hardcoded server name to a relative path, but ONLY in the footer.

e.g.

<body>
   <a href="http://hardcoded/something">This is ok</a>      
   ... much more content here
   <div class="footer">
       <a href="http://hardcoded/something">Change this one</a>      
   </div>
</body>

Is there any tool to do this kind of search and replace?

I gave a proof of concept answer, almost ashamed to leave it up, but it was my first time parsing HTML with DCGs. Then I refined it some more into another answer.

Now I am looking at refining it more, looking at SWI-Prolog SGML/XML parser and possibly The library(http/html_write) library, but know that others here can do a much better job from which I can learn something.

Curious to see other Prolog and DCG solutions. No need to answer soon, as this is out of curiosity than need.

Regards,
Eric

jamesnvc · April 9, 2019, 11:49pm

Here’s my approach, using library(sgml) to parse it into something structured. I was first thinking about XPath, but I don’t think that’s easy to use for editing a document (just for selecting).

:- module(html, []).

:- use_module(library(sgml)).
:- use_module(library(sgml_write)).

change_server_in_footer(HtmlIn, HtmlOut) :-
    open_string(HtmlIn, HtmlInStream),
    load_html(stream(HtmlInStream), DOM, []),
    transform(DOM, HtmlOut).

transform([Elt|Elts], [TElt|TElts]) :- !,
    transform(Elt, TElt),
    transform(Elts, TElts).
transform(element(div, Attrs, Body),
          element(div, Attrs, TBody)) :-
    memberchk(class=footer, Attrs), !,
    replace_path(Body, TBody).
transform(element(E, A, B), element(E, A, TB)) :- !,
    transform(B, TB).
transform(Elt, Elt).

replace_path([E|Es], [TE|TEs]) :- !,
    replace_path(E, TE),
    replace_path(Es, TEs).
replace_path(element(a, Attrs, Body), element(a, EditAttrs, Body)) :- !,
    selectchk(href=_OldPath, Attrs, AttrsRest),
    EditAttrs = [href='http://new.path'|AttrsRest].
replace_path(element(E, A, B), element(E, A, RB)) :- !,
    replace_path(B, RB).
replace_path(X, X).

test :-
    HTML = "
<body>\n
   <a href=\"http://hardcoded/something\">This is ok</a>      \n
   <div class=\"footer\">\n
       <a href=\"http://hardcoded/something\">Change this one</a>     \n
     <span class=\"outer\"><span class=\"inner\">
          <a href=\"http://hardcoded/something\">This too</a>
     </span></span>\n
   </div>\n
</body>\n
    ",
    change_server_in_footer(HTML, HtmlOut),
    html_write(user_output, HtmlOut, []).

Edit: handle more deeply nested links in footer

jan · April 10, 2019, 6:47am

I think this is roughly about it if you need to do this once. You could write something along the same lines as xpath that would allow you to edit the DOM. You could also use xpath/3 to find the target node to edit and use a general subterm replacement predicate (based on same_term/2) to make the replacement. That is less work, but quite cumbersome and inefficient if multiple edits are required.

You can probably also write a generic predicate that deals with this type of rewrites. One would be a predicate that is passed a mapping closure and recursively walks down a compound term. This predicate could make a list of parent nodes available to the mapper such that you can check the (footer) context.
Without a parent context it also works: first call it with a mapper that recognises the footer. This mapper calls the meta-predicate with a mapper that rewrites the href. That might be more elegant.

jan · April 10, 2019, 6:56am

That is pretty hard. Think of the syntax variations to write elements and their attributes, mismatches in escaped CDATA, comments, etc. People typically ignore all such thing and use a regex for such tasks, but that is almost invariably too simple, often leading to security issues (= injection attacks). Reliably modifying a document that reflects some formal language typically requires a parse tree. In some cases a token list (created with the complete token grammar for the target language) is sufficient. That at least avoids issues with quoting, comments, etc.

EricGT · April 10, 2019, 9:41am

To Jamesnvc:

Thanks for doing that variation. I have never done anything using those libraries but will study your example and expand my toolbox.

To Jan

Thanks for the feedback. Haven’t considered xpath , but might give that variation a try. Glad you touched on security and regex and noted that parsing HTML with DCGs is pretty hard. I knew about the problems you noted about parsing HTML being hard because people break the syntax specification rules all the time, but lends to answering why seeing DCGs that parse HTML is uncommon.

jan · April 10, 2019, 10:50am

I’m more trying to tell that working on an HTML document (or any formal language document) typically requires a full parser or at least a full tokenizer to be reliable. Reliable means you will not accidentally replace in comments, CDATA or similar, making unwanted changes to the document and you really find the thing you want to replace and do not miss it because they some some syntax variation you did not expect.

Surely you can write an HTML parser using DCGs and most likely it will be a lot shorter than e.g., a C written parser. It will also be a lot slower as parsers typically spent most of their time on the low level tokenizing stuff at the character level and processing text as a C array is simply cheaper than dealing with Prolog lists.

Boris · April 11, 2019, 8:23am

A bit tangential to the discussion, this kind of tasks are easy to solve in a clean way using a stylesheet. For example:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" version="4.0" />

<!-- Identity template, provides default behavior that copies all content into the output -->
<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

<!-- Change href attribute of links inside a footer div -->
<xsl:template match="body/div[@class='footer']/a/@href">
  <xsl:attribute name="href">
    <xsl:value-of select="'http://something/new'"/>
  </xsl:attribute>
</xsl:template>
</xsl:stylesheet>

You can use xsltproc to apply this stylesheet to an HTML document.

$ cat example.html 
<html>
<head>
    <meta charset="utf-8">
    <title>Page Title</title>
</head>
<body>
   <a href="http://hardcoded/something">This is ok</a>
   ... much more content here
   <div class="footer">
       <a href="http://hardcoded/something">Change this one</a>
   </div>
</body>
</html>
$ xsltproc --html replace.xsl example.html 
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta charset="utf-8">
    <title>Page Title</title>
</head>
<body>
   <a href="http://hardcoded/something">This is ok</a>
   ... much more content here
   <div class="footer">
       <a href="http://something/new">Change this one</a>
   </div>
</body>
</html>

The stylesheet solution shows one possible interface for a Prolog solution. I guess it would still involve using the XML parser to parse the HTML to a Prolog term.

EricGT · April 11, 2019, 9:58am

Thanks Boris.

I would not consider this tangential but out of the box thinking. I am glad to see answers like this because they expand ways of solving these problems and now I can think about using this technique in the future.

While I haven’t looked at this in detail yet it may also answers a question I had related to Prince (software) and why it is using CSS.

Boris · April 11, 2019, 10:46am

Hmm, thank you I guess? Using a stylesheet is not “out of the box thinking”, it is the diametrical opposite of that. It is literally what I was taught in university (at least the theory of it).

EricGT · January 16, 2020, 2:03pm

Today for something unrelated needed to parse some HTML, (XMTML) and in looking around in the documentation and source code found new_sgml_parser/2 which is not immediately obvious when searching for HTML and parsing, so I am mentioning it here for others that find this topic.

Also the use of new_sgml_parser/2 is not a simple one line predicate, but this and this help to understand it’s usage.

Thanks.

Here is a very simple test case I created for parsing a single element; it probably leaves streams open and such, but didn’t want to make the simple test case to hard to understand.

:- begin_tests(new_sgml_parser).

test(001) :-
    String = "<span class=\"title\">My Title</span>",
    open_string(String,In),
    new_sgml_parser(Parser,[]),
    sgml_parse(Parser,
               [ source(In),
                 document(Content)
               ]),

    assertion( Content == [element(span,[class='title'],['My Title'])] ).

:- end_tests(new_sgml_parser).

jan · January 16, 2020, 2:58pm

Of course there are many ways to write such a test. This is what I’d probably do:

test(001, Content == [element(span,[class='title'],['My Title'])]) :-
    parse_string("<span class=\"title\">My Title</span>", Content).

parse_string(String, Content) :-
    setup_call_cleanup(
        open_string(String,In),
        ( new_sgml_parser(Parser,[]),
          sgml_parse(Parser,
                     [ source(In),
                       document(Content)
                     ])
        ),
        close(In)).

On thing that does make some sense is to put your expectation in the test declaration. As the meaning of these declarations on the expected result is known you get more precise error messages. Asssertion statements can be useful if you want to test secondary effects or intermediate results. Creating the helper predicate makes debugging a bit easier and probably makes related tests shorter to write.

EricGT · January 16, 2020, 3:23pm

Thanks.

I have been diligently working as I write on more real-world test cases for using new_sgml_parser/2 and
setup_call_cleanup/3 was what I was using with open_string/2.

Some related question I have are:

While close/1 will close a stream, in trying to verify that a stream is closed, I first used, var(In), but then moved onto using catch/3 like

   catch(
        close(In),
        Error,
        Error =@= 'error(existence_error(stream,_),context(system:close/1,'already closed'))'
    )

but that code as you know doesn’t work because =@= fails.

What would be a good way to verify a resource such as a stream is closed in a test case?

Similar question with freeing the resources with new_sgml_parser/2.

What would be a good way to verify the resources associated with new_sgml_parser/2 are closed in a test case?

I am currently using free_sgml_parser/1 with setup_call_cleanup/3, but don’t know what resources are used and how to verify they are free?

My current code looks like this

test(003) :-
    String = "<span class=\"title\">My Title</span>",
    setup_call_cleanup(
        open_string(String,In),
        (
            setup_call_cleanup(
                new_sgml_parser(Parser,[]),
                sgml_parse(Parser,
                        [ source(In),
                          document(Content)
                        ]),
                free_sgml_parser(Parser)
            )
        ),
        close(In)
    ),

    assertion( Content == [element(span,[class='title'],['My Title'])] ).
    % Need assertion to verify In is closed
    % Need assertion to verify Parser resources are free.

Also thanks for

which I have used on occasion, (the long used habit of using assertion statements from other programming languages is hard to break) but I also like to use the forall option often, so will have to experiment more on using them together.

jan · January 16, 2020, 3:43pm

In general that isn’t easy. If there is reason to worry I normally create a loop and run it for a while, possibly using some memory leak detection tool such as valgrind or heaptrack.

For blob references it is typically safe to try and interact with them after the close and expect an existence_error. Some (older) code tests to represent pointers as Prolog integers and interacting with them after freeing can easily cause a crash. In newer code such reference are only used if it concerns stuff the user should never see.

Still, the fact that interacting with a closed stream produces an existing error doesn’t verify all associated memory and OS resources are properly reclaimed. The above loop is (I think) the only sensible way to test that.

Topic		Replies	Views
Xpath and markdown -- how to interoperate Help!	4	579	April 7, 2020
Can SWI-Tinker use a library(markup)? General	14	293	April 6, 2025
A parsing example General	4	7792	March 27, 2019
Adding error detection to DCG rules Algorithm	18	1864	May 7, 2020
SWISH term_singletons/2 and easy text coloring? General	4	112	June 27, 2024

Interesting StackOverflow question on replacing text in HTML. Looking for better way to do this

Related topics