Interesting StackOverflow question on replacing text in HTML. Looking for better way to do this

I am a regular on StackOverflow, Guy Coder, and saw this question on the language-agnostic tag. I find the tag is a great place to show off what Prolog can do compared to other languages.


The problem is:

I have a couple of thousands of .html files and I need to search and replace a hardcoded server name to a relative path, but ONLY in the footer.

e.g.

<body>
   <a href="http://hardcoded/something">This is ok</a>      
   ... much more content here
   <div class="footer">
       <a href="http://hardcoded/something">Change this one</a>      
   </div>
</body>

Is there any tool to do this kind of search and replace?


I gave a proof of concept answer, almost ashamed to leave it up, but it was my first time parsing HTML with DCGs. Then I refined it some more into another answer.

Now I am looking at refining it more, looking at SWI-Prolog SGML/XML parser and possibly The library(http/html_write) library, but know that others here can do a much better job from which I can learn something.

Curious to see other Prolog and DCG solutions. No need to answer soon, as this is out of curiosity than need.

Regards,
Eric

Here’s my approach, using library(sgml) to parse it into something structured. I was first thinking about XPath, but I don’t think that’s easy to use for editing a document (just for selecting).

:- module(html, []).

:- use_module(library(sgml)).
:- use_module(library(sgml_write)).

change_server_in_footer(HtmlIn, HtmlOut) :-
    open_string(HtmlIn, HtmlInStream),
    load_html(stream(HtmlInStream), DOM, []),
    transform(DOM, HtmlOut).

transform([Elt|Elts], [TElt|TElts]) :- !,
    transform(Elt, TElt),
    transform(Elts, TElts).
transform(element(div, Attrs, Body),
          element(div, Attrs, TBody)) :-
    memberchk(class=footer, Attrs), !,
    replace_path(Body, TBody).
transform(element(E, A, B), element(E, A, TB)) :- !,
    transform(B, TB).
transform(Elt, Elt).

replace_path([E|Es], [TE|TEs]) :- !,
    replace_path(E, TE),
    replace_path(Es, TEs).
replace_path(element(a, Attrs, Body), element(a, EditAttrs, Body)) :- !,
    selectchk(href=_OldPath, Attrs, AttrsRest),
    EditAttrs = [href='http://new.path'|AttrsRest].
replace_path(element(E, A, B), element(E, A, RB)) :- !,
    replace_path(B, RB).
replace_path(X, X).

test :-
    HTML = "
<body>\n
   <a href=\"http://hardcoded/something\">This is ok</a>      \n
   <div class=\"footer\">\n
       <a href=\"http://hardcoded/something\">Change this one</a>     \n
     <span class=\"outer\"><span class=\"inner\">
          <a href=\"http://hardcoded/something\">This too</a>
     </span></span>\n
   </div>\n
</body>\n
    ",
    change_server_in_footer(HTML, HtmlOut),
    html_write(user_output, HtmlOut, []).

Edit: handle more deeply nested links in footer

1 Like

I think this is roughly about it if you need to do this once. You could write something along the same lines as xpath that would allow you to edit the DOM. You could also use xpath/3 to find the target node to edit and use a general subterm replacement predicate (based on same_term/2) to make the replacement. That is less work, but quite cumbersome and inefficient if multiple edits are required.

You can probably also write a generic predicate that deals with this type of rewrites. One would be a predicate that is passed a mapping closure and recursively walks down a compound term. This predicate could make a list of parent nodes available to the mapper such that you can check the (footer) context.
Without a parent context it also works: first call it with a mapper that recognises the footer. This mapper calls the meta-predicate with a mapper that rewrites the href. That might be more elegant.

That is pretty hard. Think of the syntax variations to write elements and their attributes, mismatches in escaped CDATA, comments, etc. People typically ignore all such thing and use a regex for such tasks, but that is almost invariably too simple, often leading to security issues (= injection attacks). Reliably modifying a document that reflects some formal language typically requires a parse tree. In some cases a token list (created with the complete token grammar for the target language) is sufficient. That at least avoids issues with quoting, comments, etc.

To Jamesnvc:

Thanks for doing that variation. I have never done anything using those libraries but will study your example and expand my toolbox.

To Jan

Thanks for the feedback. Haven’t considered xpath , but might give that variation a try. Glad you touched on security and regex and noted that parsing HTML with DCGs is pretty hard. I knew about the problems you noted about parsing HTML being hard because people break the syntax specification rules all the time, but lends to answering why seeing DCGs that parse HTML is uncommon.

I’m more trying to tell that working on an HTML document (or any formal language document) typically requires a full parser or at least a full tokenizer to be reliable. Reliable means you will not accidentally replace in comments, CDATA or similar, making unwanted changes to the document and you really find the thing you want to replace and do not miss it because they some some syntax variation you did not expect.

Surely you can write an HTML parser using DCGs and most likely it will be a lot shorter than e.g., a C written parser. It will also be a lot slower as parsers typically spent most of their time on the low level tokenizing stuff at the character level and processing text as a C array is simply cheaper than dealing with Prolog lists.

2 Likes

A bit tangential to the discussion, this kind of tasks are easy to solve in a clean way using a stylesheet. For example:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" version="4.0" />

<!-- Identity template, provides default behavior that copies all content into the output -->
<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

<!-- Change href attribute of links inside a footer div -->
<xsl:template match="body/div[@class='footer']/a/@href">
  <xsl:attribute name="href">
    <xsl:value-of select="'http://something/new'"/>
  </xsl:attribute>
</xsl:template>
</xsl:stylesheet>

You can use xsltproc to apply this stylesheet to an HTML document.

$ cat example.html 
<html>
<head>
    <meta charset="utf-8">
    <title>Page Title</title>
</head>
<body>
   <a href="http://hardcoded/something">This is ok</a>
   ... much more content here
   <div class="footer">
       <a href="http://hardcoded/something">Change this one</a>
   </div>
</body>
</html>
$ xsltproc --html replace.xsl example.html 
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta charset="utf-8">
    <title>Page Title</title>
</head>
<body>
   <a href="http://hardcoded/something">This is ok</a>
   ... much more content here
   <div class="footer">
       <a href="http://something/new">Change this one</a>
   </div>
</body>
</html>

The stylesheet solution shows one possible interface for a Prolog solution. I guess it would still involve using the XML parser to parse the HTML to a Prolog term.

1 Like

Thanks Boris.

I would not consider this tangential but out of the box thinking. I am glad to see answers like this because they expand ways of solving these problems and now I can think about using this technique in the future.

While I haven’t looked at this in detail yet it may also answers a question I had related to Prince (software) and why it is using CSS.

Hmm, thank you I guess? Using a stylesheet is not “out of the box thinking”, it is the diametrical opposite of that. It is literally what I was taught in university (at least the theory of it).

Today for something unrelated needed to parse some HTML, (XMTML) and in looking around in the documentation and source code found new_sgml_parser/2 which is not immediately obvious when searching for HTML and parsing, so I am mentioning it here for others that find this topic.

Also the use of new_sgml_parser/2 is not a simple one line predicate, but this and this help to understand it’s usage.

Thanks.


Here is a very simple test case I created for parsing a single element; it probably leaves streams open and such, but didn’t want to make the simple test case to hard to understand.

:- begin_tests(new_sgml_parser).

test(001) :-
    String = "<span class=\"title\">My Title</span>",
    open_string(String,In),
    new_sgml_parser(Parser,[]),
    sgml_parse(Parser,
               [ source(In),
                 document(Content)
               ]),

    assertion( Content == [element(span,[class='title'],['My Title'])] ).

:- end_tests(new_sgml_parser).

Of course there are many ways to write such a test. This is what I’d probably do:

test(001, Content == [element(span,[class='title'],['My Title'])]) :-
    parse_string("<span class=\"title\">My Title</span>", Content).

parse_string(String, Content) :-
    setup_call_cleanup(
        open_string(String,In),
        ( new_sgml_parser(Parser,[]),
          sgml_parse(Parser,
                     [ source(In),
                       document(Content)
                     ])
        ),
        close(In)).

On thing that does make some sense is to put your expectation in the test declaration. As the meaning of these declarations on the expected result is known you get more precise error messages. Asssertion statements can be useful if you want to test secondary effects or intermediate results. Creating the helper predicate makes debugging a bit easier and probably makes related tests shorter to write.

2 Likes

Thanks.

I have been diligently working as I write on more real-world test cases for using new_sgml_parser/2 and
setup_call_cleanup/3 was what I was using with open_string/2.

Some related question I have are:

  1. While close/1 will close a stream, in trying to verify that a stream is closed, I first used, var(In), but then moved onto using catch/3 like
   catch(
        close(In),
        Error,
        Error =@= 'error(existence_error(stream,_),context(system:close/1,'already closed'))'
    )

but that code as you know doesn’t work because =@= fails.

What would be a good way to verify a resource such as a stream is closed in a test case?

  1. Similar question with freeing the resources with new_sgml_parser/2.

What would be a good way to verify the resources associated with new_sgml_parser/2 are closed in a test case?

I am currently using free_sgml_parser/1 with setup_call_cleanup/3, but don’t know what resources are used and how to verify they are free?

My current code looks like this

test(003) :-
    String = "<span class=\"title\">My Title</span>",
    setup_call_cleanup(
        open_string(String,In),
        (
            setup_call_cleanup(
                new_sgml_parser(Parser,[]),
                sgml_parse(Parser,
                        [ source(In),
                          document(Content)
                        ]),
                free_sgml_parser(Parser)
            )
        ),
        close(In)
    ),

    assertion( Content == [element(span,[class='title'],['My Title'])] ).
    % Need assertion to verify In is closed
    % Need assertion to verify Parser resources are free.

Also thanks for

which I have used on occasion, (the long used habit of using assertion statements from other programming languages is hard to break) but I also like to use the forall option often, so will have to experiment more on using them together.

In general that isn’t easy. If there is reason to worry I normally create a loop and run it for a while, possibly using some memory leak detection tool such as valgrind or heaptrack.

For blob references it is typically safe to try and interact with them after the close and expect an existence_error. Some (older) code tests to represent pointers as Prolog integers and interacting with them after freeing can easily cause a crash. In newer code such reference are only used if it concerns stuff the user should never see.

Still, the fact that interacting with a closed stream produces an existing error doesn’t verify all associated memory and OS resources are properly reclaimed. The above loop is (I think) the only sensible way to test that.

1 Like