Insanely silly string slicing session, help needed!

emacstheviking · April 29, 2020, 9:13pm

OK, I have a string like this
this/is/the/type/char*

and I want a predicate that can remove the char* from the end returning it as the type, and then return everything before “/char*” as the resultant variable name as that’s how it works. In any imperative language I care to mention this would be an absolute doddle but for some reason the likes of sub_string and split_string are driving me nuts. The problem is if you give it a string with a trailing slash.

I know I am using split_string with the same pad and sep chars to remove duplicated occurrences. I am wondering if the predicate below is even on the right track.

var_type(Var, VarOut,Type) :-
    split_string(Var, "/", "/", Parts),
    length(Parts, L),
    L > 1,
    !,
    last(Parts, Type),
    string_length(Var, VLen),      % Now we just drop the prefix
    string_length(Type, TLen),     % from the end of the source
    PLen is VLen - TLen -1 ,       % including the trailing "/"
    print([VLen, TLen, PLen]),
    sub_string(Var, 0, PLen, _, VarOut).

ast:  ?- var_type("score/longint/the/last/char*/", Var, Type).
[29,5,23]
Var = "score/longint/the/last/",
Type = "char*".

ast:  ?- var_type("score/longint/the/last/char*", Var, Type).
[28,5,22]
Var = "score/longint/the/last",
Type = "char*".

ast:  ?- var_type("score/longint", Var, Type).
[13,7,5]
Var = "score",
Type = "longint".

ast:  ?- var_type("score/", Var, Type).
false.

ast:  ?- var_type("score", Var, Type).
false.

The last two fail but that’s fine as it indicates no trailing type was given in the string. Essentially if the string ends with “/” I should probably reject it in the first place, so I will follow that but I can’t believe how ugly it all feels at this point.

Any guidance / notes on more idiomatic style would be very much appreciated at this juncture!

Thanks,
Sean.

emacstheviking · April 29, 2020, 9:19pm

I smell append/3… play time… string_concat… homing in…!!!

pmoura · April 29, 2020, 9:33pm

append/3 is too often a code smell…

emacstheviking · April 29, 2020, 9:35pm

It would be the FIRST time I have EVER used it…

emacstheviking · April 29, 2020, 9:35pm

looking at library(pcre) for some salvation… it’s getting to the nitty gritty lately so maybe time I read some of the libraries anyway to see what I can use…

pmoura · April 29, 2020, 9:40pm

Try something like:

?- String = "this/is/the/type/char*",
   sub_string(String, Before, 5, 0, "char*"),
   sub_string(String, 0, Before, _, Prefix).
String = "this/is/the/type/char*",
Before = 17,
Prefix = "this/is/the/type/".

If working with atoms, use instead the standard sub_atom/5 predicate.

emacstheviking · April 29, 2020, 9:42pm

I am working with string objects from the tokeniser @pmoura, thanks for the above. I will REPL it to death… sometimes the multiple modalities of predicates blow my mind, it’s like guitars and so many different tunings!

kind of.

Thanks again.

emacstheviking · April 29, 2020, 10:01pm

Thanks @pmoura …once again a fine solution. I suspect if I had your knowledge of Prolog I would have finished my project at least a week before I started it…

var_type(Var, VarOut, Type) :-
    split_string(Var, "/", "/", Parts),
    length(Parts, L),
    L > 1,
    !,
    last(Parts, Type),
    string_length(Type, TLen),
    sub_string(Var, Before, TLen, 0, Type),
    B1 is Before-1,
    sub_string(Var, 0, B1, _, VarOut).

This also has the advantage of failing if the input ends with “/”, which I have now decided is a syntax error on the part of the user and basically not my problem per se, as it will be reflected in the rendered code… sounds harsh but my system is a facilitator, not judge and jury.

Thanks again.
Sean.

swi · April 30, 2020, 2:04am

You can also use re_matchsub/4 with the proper regular expressions and capture groups.
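A minimal sketch of the regular expression route, assuming library(pcre) is installed. The predicate name re_var_type/3 and the group names var and type are made up for illustration; the pattern relies on the type being everything after the last “/”.

```prolog
:- use_module(library(pcre)).

% re_var_type/3 is a hypothetical helper, not part of library(pcre).
% The greedy (.+) swallows everything up to the last "/", and
% ([^/]+) requires a non-empty type that contains no "/".
re_var_type(String, Var, Type) :-
    re_matchsub("^(?<var>.+)/(?<type>[^/]+)$", String, Sub, []),
    get_dict(var, Sub, Var),      % named groups become dict keys
    get_dict(type, Sub, Type).
```

With this pattern, a query such as `re_var_type("this/is/the/type/char*", Var, Type)` should bind Var to the part before the last “/” and Type to "char*", while inputs without a “/” or with a trailing “/” should simply fail.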

Boris · April 30, 2020, 5:37am

It is a double-edged sword, to some degree. If you follow the canon while learning Prolog, you learn early that you can use append/3 to split, but then, why is it called “append”? And if there were a better word for what append/3 can do, what would it be?

sub_atom/5 and its cousin sub_string/5 are already ahead of append/3, at least the name is not the antonym of what the predicate does. In the docs, there is a snippet saying:

The implementation minimises non-determinism and creation of atoms. This is a flexible predicate that can do search, prefix- and suffix-matching, etc.

(see also the extensive comments by users under the predicate documentation)

How do you advertise those features of sub_atom/5 to the uninitiated? (To be fair, the SWI-Prolog docs do as good an effort as I have seen…)
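To make the two points above concrete, here is a small sketch. The helper names split_last/3 and slash_positions/2 are invented for this example: the first uses append/3 “backwards” to split a list, the second uses sub_atom/5 as a search that enumerates every position of “/”.

```prolog
% split_last/3 (hypothetical name): append/3 run in "split" mode.
% Front is everything before the final element Last.
split_last(List, Front, Last) :-
    append(Front, [Last], List).

% slash_positions/2 (hypothetical name): sub_atom/5 used as a search.
% With Before unbound it enumerates, on backtracking, every offset
% at which the one-character sub-atom '/' occurs.
slash_positions(Atom, Positions) :-
    findall(Before, sub_atom(Atom, Before, 1, _, '/'), Positions).
```

For instance, `split_last([this, is, the, type], F, L)` gives F = [this, is, the] and L = type, and `slash_positions('a/b/c', Ps)` gives Ps = [1, 3].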

jan · April 30, 2020, 6:58am

This assumes you know char* is the last bit. Note that you do not need the 5 as this is implied by the given output string.
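As a sketch of that remark, the length argument can be left unbound when the suffix is known. The name strip_suffix/3 is hypothetical; the two sub_string/5 calls are the ones from the earlier reply with the 5 replaced by `_`.

```prolog
% strip_suffix/3 (hypothetical name): Prefix is String without Suffix.
% The length is implied by Suffix, so it is left as an anonymous variable.
strip_suffix(String, Suffix, Prefix) :-
    sub_string(String, Before, _, 0, Suffix),   % Suffix must end String
    sub_string(String, 0, Before, _, Prefix).   % everything before it
```

`strip_suffix("this/is/the/type/char*", "char*", P)` gives P = "this/is/the/type/", the same prefix as before, trailing “/” included.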

I’d go for

?- split_string("this/is/the/type/char*", "/", "", Parts),
   append(Before, [Type], Parts),
   atomics_to_string(Before, "/", Var).
Parts = ["this", "is", "the", "type", "char*"],
Before = ["this", "is", "the", "type"],
Type = "char*",
Var = "this/is/the/type" .

Needs a little tweak if you want a “/” at the end of the var.

The clean Prolog way is of course a DCG. Using regular expressions is a good alternative, and finally, if your string uses “/” as the separator, you can use the file handling predicates (file_base_name/2, file_directory_name/2, etc.).
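One possible shape for the DCG route, as a sketch only. The non-terminal names name_type//2, any_codes//1 and type_codes//1 and the wrapper string_name_type/3 are invented here; the grammar works on a code list and treats the last “/” as the separator.

```prolog
% name_type//2 (hypothetical): Name "/" Type, where Type has no "/".
name_type(Name, Type) -->
    any_codes(NameCs), "/", type_codes(TypeCs),
    { NameCs \== [], TypeCs \== [],          % reject empty name or type
      string_codes(Name, NameCs),
      string_codes(Type, TypeCs) }.

% Any sequence of codes, shortest first; backtracking extends it.
any_codes([]) --> [].
any_codes([C|Cs]) --> [C], any_codes(Cs).

% A run of codes that are not "/".
type_codes([C|Cs]) --> [C], { C \== 0'/ }, type_codes(Cs).
type_codes([]) --> [].

% Wrapper: phrase/2 insists the whole input is consumed, which forces
% the "/" in name_type//2 to be the last one in the string.
string_name_type(String, Name, Type) :-
    string_codes(String, Codes),
    once(phrase(name_type(Name, Type), Codes)).
```

`string_name_type("this/is/the/type/char*", N, T)` gives N = "this/is/the/type" and T = "char*"; "score" and "score/" both fail, matching the behaviour wanted earlier in the thread.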

The final remark is that this may be fine for reading data, but internally you should use a term to represent the name and type, as in var(Name, Type). As many people have advertised: strings as a data representation are evil: slow, easy to get wrong and (therefore) vulnerable to security issues.
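Putting the last two suggestions together, a sketch that uses the file-name predicates to turn the token text into a var(Name, Type) term once, at the input boundary. The predicate name string_var_type/2 is hypothetical, and treating a directory of "." as “no type given” is an assumption of this example.

```prolog
% string_var_type/2 (hypothetical name): "name/type" text to a term.
string_var_type(String, var(Name, Type)) :-
    file_directory_name(String, Dir),   % everything before the last "/"
    file_base_name(String, Base),       % everything after it
    atom_string(Dir, Name),
    Name \== ".",                       % "." means there was no "/" at all
    atom_string(Base, Type).
```

`string_var_type("this/is/the/type/char*", T)` gives T = var("this/is/the/type", "char*"), and a bare "score" fails. How these predicates treat a trailing “/” is not checked here, so such input may still need to be rejected separately.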

emacstheviking · April 30, 2020, 7:39am

Hi Jan.
My internal form is as you suggest, here’s a sample:

X = [sexp((22, 2, 0), [defvar(tk((30, 2, 8), "toplevelvar/char*"), s2((48, 2, 26),
 "global!"))]), sexp((84, 5, 0), [defun("testing", [vardec(tk((100, ..., ...), "arg1")),
 vardec(tk((..., ...), "arg2"))], sexp((113, 6, 2), [defvar(tk((..., ...), "name/char*"),
s2((..., ...), "Sean Charles"))])), sexp((150, 7, 2), [defvar(tk((..., ...),
"code/EnumType"), kw((..., ...), "working"))]), sexp((184, 8, 2),
[defvar(tk(..., ...), list(..., ...))]), sexp((255, ..., ...), [defvar(..., ...)]),
sexp((..., ...), [...])])].

Using the filename predicates is a stroke of genius as I had not considered that, and as for DCG-s, I so far have used them for AST building, consuming the above terms rather than the raw source text from the file i.e DCG-s eating output from the tokeniser.

I have an FSM that does the tokenisation and somewhere in the back of my mind I know I could have done that as a set of DCG rules too but I don’t have the experience yet to have pulled that off… I couldn’t really understand how to accurately track the line and column positions etc and also how to pass state, although in my DCG ast rules I do use the push-back notation once or twice to do a lookahead at times.

“The nice thing about software is it only has to be good enough” – me.

“One day” I will go back and review it all and refactor it to high heaven but this year I just want to finish something that works. The tests all pass, they grow in number by the hour as I go.

Thanks again everybody for your very knowledgeable and informative input. I read everything ten times in the hope of understanding it once.

Sean.

emacstheviking · April 30, 2020, 7:42am

Strings are evil!?!?! Should I be using character code lists everywhere instead then? I thought strings were a more compact and thus more memory efficient representation. I have lots of little strings from the tokeniser and I rather thought they would be “shared” i.e. if I had the string “defvar” a hundred times, internally it would be pointing to the same little bit of memory.

Is that not the case?

At the lowest level, the tokeniser operates on char code lists, turning the token into a string at the last moment as it is accumulated… I can always leave them as char code lists… would that be more memory efficient?

Thanks again,
Sean.

emacstheviking · April 30, 2020, 7:56am

Yes @j4n_bur53, I did. I knew it was non-ISO but, as I say, I was getting a little frustrated at how hard such a simple task appeared to be getting! I’ve used once() once or twice, very useful too!

jan · April 30, 2020, 7:58am

This is not about their representation. It is about the so popular way to represent all sorts of structured data as a string in the abstract sense. You often need strings to make systems communicate over byte streams (sockets, etc). These strings typically represent some complex object (XML DOM Model, SQL Query, Prolog call, etc.). Putting these strings together, modifying them at the string level or using general string primitives to get information out of them is slow and dangerous.

emacstheviking · April 30, 2020, 8:01am

Ah! OK, I get it now. Yes, I couldn’t agree more about that actually. My mantra for any job in any language is pretty much “input-process-output” and the input stage involves turning low level byte streams into structured data.

I used to spend my days writing PLC and remote logger station protocols and I ended up writing a tool to turn the spec into C functions…with hindsight I could have used ASN.1 or something but it was a long time ago (mid 1980-s) and I was a lot less experienced.
