Interesting StackOverflow question on replacing text in HTML. Looking for a better way to do this

#1

I am a regular on StackOverflow, Guy Coder, and saw this question on the language-agnostic tag. I find the tag is a great place to show off what Prolog can do compared to other languages.


The problem is:

I have a couple of thousands of .html files and I need to search and replace a hardcoded server name to a relative path, but ONLY in the footer.

e.g.

<body>
   <a href="http://hardcoded/something">This is ok</a>      
   ... much more content here
   <div class="footer">
       <a href="http://hardcoded/something">Change this one</a>      
   </div>
</body>

Is there any tool to do this kind of search and replace?


I gave a proof of concept answer, almost ashamed to leave it up, but it was my first time parsing HTML with DCGs. Then I refined it some more into another answer.

Now I am looking at refining it further, using the SWI-Prolog SGML/XML parser and possibly library(http/html_write), but I know that others here can do a much better job, and I could learn something from that.

Curious to see other Prolog and DCG solutions. No need to answer soon, as this is more out of curiosity than need.

Regards,
Eric

#2

Here’s my approach, using library(sgml) to parse it into something structured. I was first thinking about XPath, but I don’t think that’s easy to use for editing a document (just for selecting).

:- module(html, []).

:- use_module(library(sgml)).
:- use_module(library(sgml_write)).

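change_server_in_footer(HtmlIn, HtmlOut) :-
    % Close the string stream even if parsing throws or fails.
    setup_call_cleanup(
        open_string(HtmlIn, HtmlInStream),
        load_html(stream(HtmlInStream), DOM, []),
        close(HtmlInStream)),
    transform(DOM, HtmlOut).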

transform([Elt|Elts], [TElt|TElts]) :- !,
    transform(Elt, TElt),
    transform(Elts, TElts).
transform(element(div, Attrs, Body),
          element(div, Attrs, TBody)) :-
    memberchk(class=footer, Attrs), !,
    replace_path(Body, TBody).
transform(element(E, A, B), element(E, A, TB)) :- !,
    transform(B, TB).
transform(Elt, Elt).

replace_path([E|Es], [TE|TEs]) :- !,
    replace_path(E, TE),
    replace_path(Es, TEs).
replace_path(element(a, Attrs, Body), element(a, EditAttrs, Body)) :- !,
    (   selectchk(href=_OldPath, Attrs, AttrsRest)
    ->  EditAttrs = [href='http://new.path'|AttrsRest]
    ;   EditAttrs = Attrs    % anchor without href: leave it unchanged
    ).
replace_path(element(E, A, B), element(E, A, RB)) :- !,
    replace_path(B, RB).
replace_path(X, X).

test :-
    HTML = "
<body>\n
   <a href=\"http://hardcoded/something\">This is ok</a>      \n
   <div class=\"footer\">\n
       <a href=\"http://hardcoded/something\">Change this one</a>     \n
     <span class=\"outer\"><span class=\"inner\">
          <a href=\"http://hardcoded/something\">This too</a>
     </span></span>\n
   </div>\n
</body>\n
    ",
    change_server_in_footer(HTML, HtmlOut),
    html_write(user_output, HtmlOut, []).

Edit: handle more deeply nested links in footer
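
The code above sets every footer href to one fixed value, while the original question asks for the hardcoded server name to be replaced by a relative path. A minimal sketch of that step, assuming the server is passed as an atom without a trailing slash (for example 'http://hardcoded'); relative_href/3 and relativise_attrs/3 are hypothetical helper names, not part of any library:

```prolog
:- use_module(library(lists)).

%!  relative_href(+Server, +Href, -Rel) is det.
%
%   Rel is Href with the leading Server removed, provided that what
%   remains starts with a slash. Any other Href is returned unchanged.
relative_href(Server, Href, Rel) :-
    atom_concat(Server, Rel0, Href),
    sub_atom(Rel0, 0, 1, _, '/'),
    !,
    Rel = Rel0.
relative_href(_, Href, Href).

%!  relativise_attrs(+Server, +Attrs0, -Attrs) is det.
%
%   Rewrite the href attribute in place, keeping the attribute order.
%   Attribute lists without an href are returned unchanged.
relativise_attrs(Server, Attrs0, Attrs) :-
    (   selectchk(href=Href0, Attrs0, href=Href, Attrs)
    ->  relative_href(Server, Href0, Href)
    ;   Attrs = Attrs0
    ).
```

With this, the clause for a elements in replace_path/2 could call relativise_attrs('http://hardcoded', Attrs, EditAttrs) instead of building the attribute list by hand.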

#3

I think this is roughly about it if you need to do this once. You could write something along the same lines as xpath that would allow you to edit the DOM. You could also use xpath/3 to find the target node to edit and use a general subterm replacement predicate (based on same_term/2) to make the replacement. That is less work, but quite cumbersome and inefficient if multiple edits are required.
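
For the selection half of that idea, here is a rough sketch of using xpath/3 from library(xpath) to find the links inside the footer. It only collects the href values and does not edit anything; footer_hrefs/2 is a hypothetical name, and the class=footer test assumes the div carries exactly that single class:

```prolog
:- use_module(library(sgml)).
:- use_module(library(xpath)).

%!  footer_hrefs(+Html:string, -Hrefs:list(atom)) is det.
%
%   Hrefs are the href values of all a elements that appear at any
%   depth below a <div class="footer"> in Html.
footer_hrefs(Html, Hrefs) :-
    setup_call_cleanup(
        open_string(Html, In),
        load_html(stream(In), DOM, []),
        close(In)),
    findall(Href,
            xpath(DOM, //div(@class=footer)//a(@href), Href),
            Hrefs).
```

The nodes found this way would still have to be put back into the DOM, which is the cumbersome part mentioned above.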

You can probably also write a generic predicate that deals with this type of rewrite. One would be a predicate that is passed a mapping closure and recursively walks down a compound term. This predicate could make a list of parent nodes available to the mapper such that you can check the (footer) context.
Without a parent context it also works: first call it with a mapper that recognises the footer. This mapper calls the meta-predicate with a mapper that rewrites the href. That might be more elegant.
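
The variant without a parent context could look roughly like the sketch below. All names (map_dom/3, footer_div/2, new_href/2, rewrite_footer_links/2) are made up for illustration, and '/new/path' is a placeholder value. The walker tries the mapper on each node first and only descends into a node the mapper does not handle:

```prolog
:- use_module(library(apply)).
:- use_module(library(lists)).

:- meta_predicate map_dom(2, ?, ?).

%!  map_dom(:Mapper, +Node0, -Node) is det.
%
%   Walk a DOM term top-down. If call(Mapper, Node0, Node) succeeds,
%   its result is used and the walker does not descend further;
%   otherwise lists and element/3 bodies are walked recursively.
map_dom(Mapper, Node0, Node) :-
    call(Mapper, Node0, Node1),
    !,
    Node = Node1.
map_dom(Mapper, Nodes0, Nodes) :-
    is_list(Nodes0),
    !,
    maplist(map_dom(Mapper), Nodes0, Nodes).
map_dom(Mapper, element(E, A, B0), element(E, A, B)) :-
    !,
    map_dom(Mapper, B0, B).
map_dom(_, Node, Node).

% Outer mapper: recognises the footer and restarts the walk on its
% body with the mapper that rewrites links.
footer_div(element(div, Attrs, Body0), element(div, Attrs, Body)) :-
    memberchk(class=footer, Attrs),
    map_dom(new_href, Body0, Body).

% Inner mapper: replaces the href of an a element (placeholder value).
new_href(element(a, Attrs0, Body),
         element(a, [href='/new/path'|Attrs], Body)) :-
    selectchk(href=_, Attrs0, Attrs).

rewrite_footer_links(DOM0, DOM) :-
    map_dom(footer_div, DOM0, DOM).
```

Because a mapper that succeeds stops the descent, links outside the footer are never offered to new_href/2.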

#4

That is pretty hard. Think of the syntax variations to write elements and their attributes, mismatches in escaped CDATA, comments, etc. People typically ignore all such things and use a regex for such tasks, but that is almost invariably too simple, often leading to security issues (= injection attacks). Reliably modifying a document that reflects some formal language typically requires a parse tree. In some cases a token list (created with the complete token grammar for the target language) is sufficient. That at least avoids issues with quoting, comments, etc.

#5

To Jamesnvc:

Thanks for doing that variation. I have never done anything using those libraries but will study your example and expand my toolbox.

To Jan

Thanks for the feedback. I haven’t considered xpath, but might give that variation a try. Glad you touched on security and regex and noted that parsing HTML with DCGs is pretty hard. I knew that parsing HTML is hard because people break the syntax specification rules all the time, and that helps explain why DCGs that parse HTML are uncommon.

#6

What I’m mostly trying to say is that working on an HTML document (or any formal language document) typically requires a full parser, or at least a full tokenizer, to be reliable. Reliable means two things: you will not accidentally replace text in comments, CDATA or similar and so make unwanted changes to the document, and you really find the thing you want to replace instead of missing it because the author used some syntax variation you did not expect.

Surely you can write an HTML parser using DCGs, and most likely it will be a lot shorter than, e.g., a parser written in C. It will also be a lot slower, as parsers typically spend most of their time on the low-level tokenizing work at the character level, and processing text as a C array is simply cheaper than dealing with Prolog lists.

#7

A bit tangential to the discussion: this kind of task is easy to solve in a clean way using a stylesheet. For example:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" version="4.0" />

<!-- Identity template, provides default behavior that copies all content into the output -->
<xsl:template match="@*|node()">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

<!-- Change href attribute of links inside a footer div -->
<xsl:template match="body/div[@class='footer']/a/@href">
  <xsl:attribute name="href">
    <xsl:value-of select="'http://something/new'"/>
  </xsl:attribute>
</xsl:template>
</xsl:stylesheet>

You can use xsltproc to apply this stylesheet to an HTML document.

$ cat example.html 
<html>
<head>
    <meta charset="utf-8">
    <title>Page Title</title>
</head>
<body>
   <a href="http://hardcoded/something">This is ok</a>
   ... much more content here
   <div class="footer">
       <a href="http://hardcoded/something">Change this one</a>
   </div>
</body>
</html>
$ xsltproc --html replace.xsl example.html 
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta charset="utf-8">
    <title>Page Title</title>
</head>
<body>
   <a href="http://hardcoded/something">This is ok</a>
   ... much more content here
   <div class="footer">
       <a href="http://something/new">Change this one</a>
   </div>
</body>
</html>

The stylesheet solution shows one possible interface for a Prolog solution. I guess it would still involve using the XML parser to parse the HTML to a Prolog term.

#8

Thanks Boris.

I would not consider this tangential but out of the box thinking. I am glad to see answers like this because they expand ways of solving these problems and now I can think about using this technique in the future.

While I haven’t looked at this in detail yet, it may also answer a question I had related to Prince (software) and why it is using CSS.

#9

Hmm, thank you I guess? Using a stylesheet is not “out of the box thinking”, it is the diametrical opposite of that. It is literally what I was taught in university (at least the theory of it).