Read_stream_to_codes

mcculloch · March 11, 2020, 11:06pm

read_stream_to_codes doesn’t work for binary data. When there is a FF in the data I get a code of FFFD. Why would this be? How can I get a stream read to codes with the actual binary bytes in the file?

pmoura · March 11, 2020, 11:47pm

Codes mean character codes, not bytes, and thus depend on the text encoding.

I assume that you open the stream using the open/4 option type(binary)?

mcculloch · March 12, 2020, 1:48am

Actually the stream I’m using is simply user_input. I run the program from the command line with input redirection. Is there no binary data version of the predicate? Do I have to do single byte fetches of each byte? I’m not up on “text encoding” as such. I especially don’t understand how you get a two byte value from one byte, but that’s just me. I am parsing a binary file that is actually a 6502 executable. DCG is really good at this, but I need the bytes in a list to do it.

Boris · March 12, 2020, 5:17am

There is set_stream/2. You can set type(binary) on user_input or current_input, just as @pmoura suggested . With this file:

$ cat read_stream.pl 
read_text :-
    read_stream_to_codes(current_input, Codes),
    format("~w~n", [Codes]).

read_bin :-
    set_stream(current_input, type(binary)),
    read_stream_to_codes(current_input, Codes),
    format("~w~n", [Codes]).

I get:

$ echo føø bær | swipl -g read_text -t halt read_stream.pl
[102,248,248,32,98,230,114,10]
$ echo føø bær | swipl -g read_bin -t halt read_stream.pl
[102,195,184,195,184,32,98,195,166,114,10]

(I guess the output I get depends on my locale and so on…)

Boris · March 12, 2020, 8:45am

I was wondering about this, so I actually tested it:

$ echo -e '\xFF' | swipl -g read_bin -t halt read_stream.pl 
[255,10]
$ echo -e '\xFF' | swipl -g read_text -t halt read_stream.pl 
[65533,10]

If you really want to know what is happening, you would need to read a bit into Unicode and UTF-8 encoding. To cut a long story short, to me it seems that the invalid 0xFF is considered an encoding error, and is correctly replaced with 0xFFFD. This happens because you tried to read binary content as if it was text.

EricGT · March 12, 2020, 10:07am

If you are writing the predicates that access the data at the Prolog level directly then yes. But while it may seem like a real problem at first, after you get a few predicates working that pull a few bytes into a word, double word etc, and then use those predicates you quickly find yourself not writing code with predicates where you are directly using get_byte/1 and start to build up higher and higher levels of code. But still under it all is accessing bytes one at a time.

Related post: Example parse of binary data using DCG

If you change the terminology of that sentence it will make more sense. The raw data stored on a file for any encoded data is stored and accesses as raw bytes, when the data is read it is read as one or more groups of bytes then passed through a decoder that converts the bytes into a value. Now the value may itself represent something but in textual display it will most likely be a glyph that is represented as a bit pattern for display on a display device, or the value could be converted to some variation of vector graphics and that is used to create a glyph.

Recently I spent time with UTF-8 which encodes Unicode into bytes for storage with the smallest storage unit being 8 bits or one byte. There is also UTF-16 that encodes Unicode into bytes for storage with the smallest storage unit being 16 bits or two bytes. See: Unicode encoding UTF-8 for more details.

I use to program in 6502 assembly, it was hard having very few general purpose resisters.

mcculloch · March 12, 2020, 11:57am

Thanks, Boris. set_stream was what I was looking for.

EricGT · March 12, 2020, 12:57pm

Just thinking out loud, but you should be cognizant of this.

You noted that you are parsing and using DCGs which are great at parsing. However with executable code it can be context sensitive instead of context free and this can get you into a problem if you are not aware of it and know how to deal with it, especially with DCGs.

In other words if you do not have a formal grammar that describes the 6502 executable format that is context free and which is easily translated into DCGs, then might I suggest that you switch to reading the data as bytes, analyzing the bytes to figure out the displacement, size and types of the current instruction for use in accessing the offset and number of bytes of the next instruction, data, or meta data encoded in the executable file format.

Boris · March 12, 2020, 1:25pm

I am probably completely misunderstanding you, but as long as you allow the head of your DCG rules to have arguments, it is no longer a context-free grammar (and anyway DCGs in Prolog, on some mechanistic level, are just a way to avoid the extra arguments when dealing with difference lists). The textbook example is a DCG that recognizes a string of n a’s followed by n b’s. There are surely examples of this floating around the internet.

(Using “successor” terms is probably the most Prolog-y way to do it, like:

anbn --> a(N), b(N).

a(0) --> [].
a(s(N)) --> [a], a(N).

b(0) --> [].
b(s(N)) --> [b], b(N).

With this definition, you can do like this:

?- phrase(anbn, X).
X = [] ;
X = [a, b] ;
X = [a, a, b, b] ;
X = [a, a, a, b, b, b] .

?- phrase(anbn, [a,a|R]).
R = [b, b] ;
R = [a, b, b, b] ;
R = [a, a, b, b, b, b] .

?- phrase(anbn, [a,b,b]).
false.

?- phrase(anbn, []).
true ;
false.

?- phrase(anbn, [a,a,b,b]).
true ;
false.

… although I strongly suspect that there might be non-termination lurking somewhere in this code.

mcculloch · March 12, 2020, 1:41pm

The file format is pretty straightforward. My program works great now.

EricGT · March 12, 2020, 1:48pm

I take it that you are referring to my post.

Yes that was a bit over the top so here is a more concrete example related to parsing/reading CPU instructions as binary from an executable file format. Sorry for all the specifics, but each one slices off a different part of the set of possibilities and I have been down many of them.

In reading a binary encoded instruction, now a days on most popular CPUs the instructions are one or more 8-bit bytes. To determine what follows requires decoding the bits in the first byte to see if the next part is an extension to the opcode, an operand, a displacement value, and index value, etc. For instructions that have follow on bytes, sometimes those bytes also have bits indicating how many more bytes to read.

So if you are creating a DCG, then the DCG needs to take the bytes apart and look at the bits to pass along for the indexing into the next rule, or have a large lookup table of all of the values. While this can be done with DCGs, unless you have a grammar for all of this that can be translated to a DCG, odds are you will be building up the parsing or reading rules as you progress and I personally find it easier to process the data without using DCGs at first to get a feel for what the patterns will be, then to switch to DCGs if it makes sense. The good thing about the OPs problem is that it is a much older CPU (6502) as compared with the modern Intel processors.

However the 6502 was not created by Motorola which made the 6800, and 68XXX line of CPUs which were much easier to program in assembly and parse the instructions, but by MOS Technology which I am as not well versed in parsing the instructions, but that should be much easier than parsing the instructions from the Intel x86 and AMD86 instructions which requires tens of translation tables for all the different encodings of sets of bits.

6502 instruction set

After having looked at those instructions, they look simple enough that using DCGs from the start might not present the problems I have seen with other instruction sets, but I have not parsed 6502 with DCGs as of present, so don’t know for sure.

Boris · March 12, 2020, 4:58pm

This is a well understood use case for DCGs. The naive (and efficient) way to deal with it is to do exactly as you describe: read the beginning, then use that as the first argument for the rule that parses the rest. This way you get rid of unnecessary choice points without cuts (first-argument indexing). It also feels “declarative”, in the sense that you have a table that you need to extend only in one dimension (by adding rows) and the order of the rows does not matter.

I know of the following techniques for making the process of writing a DCG unnecessarily hard:

Avoid using library(dcg/basics) (so that you can enjoy writing the same things again and again)
Try too hard or too early to make DCGs that generate and parse in the same code path (this way you can enjoy debugging non-termination issues)
Completely forget about getting rid of choice points when you should (this way you can enjoy long running times and running out of memory)

This list is certainly not exhaustive – I have worked on those techniques personally so I know them best.

My own rule of thumb is: if my “input” is a list and it isn’t an obvious map or reduce (fold), then a DCG will save me quite a bit of typing, on average.

Off-topic achieved successfully

EricGT · March 12, 2020, 5:13pm

OK but what about transformations of input streams to output streams, e.g. a hex dump. This is not a list and is not just a parsing of input. How does that fit into your picture?

The tricky part is that the input of bytes becomes three parallel streams during the output.

Address
Hex representation
Display character representation

Recently I have been looking at EDCGs for this.

Boris · March 13, 2020, 7:14am

TL;DR: a lot of words and no code. Probably not worth reading

Because I don’t want to leave your question hanging: it fits somehow. You can use the arguments to the DCG rule to keep the state (I have an irrational dislike of the “semi-context” technique). You can emit from inside the rule. You can just format; you can also use the new library(intercept), along with format.

To me it looks like it is one stream, not three. The address is state (see above) that you need to emit along with the hex dump of the content. I don’t quite understand the difference between the hex representation and the “display character representaion”. Or rather, I don’t know what you mean by “display character representation”.

Where it does get annoying is if you want to chain such transformations, “unix pipe”-style. I have used separate invocations or Prolog (from the shell) for this but I am not very happy about it.

However (warning, it gets even more hand-waivy at this point), I know of three programming concepts that seem to aim at roughly the same problem space.

Unix pipes
Lazy evaluation Haskell style (functional programming style)
Backtracking Prolog style

Erlang-style messages seem to be in the same general area but I haven’t programmed enough Erlang myself to be able to say much about it.

EricGT · March 13, 2020, 8:58am

Thanks, I should have given an actual rendition of what I was meaning.

Here is an actual test case I am using in my current code to create a Hex Dump

hex_dump_test_case_generator(24,success,4,4,1,2,1,[0'a,0'b,0'c,0'd,0'e,0'f,0'g,0'h,0'i,0'j,0'k,0'l,0'm,0'n,0'o,0'p,
                                                   0'q,0'r,0's,0't,0'u,0'v,0'w,0'x,0'y,0'z,0'0,0'1,0'2,0'3,0'4,0'5,
                                                   0'6]                                                                 ,"61 62 63 64  65 66 67 68  69 6A 6B 6C  6D 6E 6F 70  abcd efgh ijkl mnop\n\c
                                                                                                                          71 72 73 74  75 76 77 78  79 7A 30 31  32 33 34 35  qrst uvwx yz01 2345\n\c
                                                                                                                          36                                                  6                  " ).

The input stream is bytes, notice the [ ]

[0'a,0'b,0'c,0'd,0'e,0'f,0'g,0'h,0'i,0'j,0'k,0'l,0'm,0'n,0'o,0'p, 
 0'q,0'r,0's,0't,0'u,0'v,0'w,0'x,0'y,0'z,0'0,0'1,0'2,0'3,0'4,0'5,
 0'6]

which is formatted to wrap at 16 values to help in reading, but it is really just a sequential stream. For those not familiar with 0'a representation, it is a character code but using a displayable character.

?- 0'a = C.
C = 97.

?- format("~2r", 97).
1100001
true.

?- format("~16r", 97).
61
true.

?- format("~c",97).
a
true.

The output is also a stream but this time it is text characters, notice the " "

"61 62 63 64  65 66 67 68  69 6A 6B 6C  6D 6E 6F 70  abcd efgh ijkl mnop\n\c
 71 72 73 74  75 76 77 78  79 7A 30 31  32 33 34 35  qrst uvwx yz01 2345\n\c
 36                                                  6                  "

In my code so far I have not added the first output stream for the address but this example has two streams, the first as a hex representation of each byte and and the second as a 7-bit ACSII character representation of each byte.

Where the problem came in for me when creating this is that as the code is processing one byte on the input, it needs to convert a single input byte to more than one output value. As the bytes are being processed, two independent streams are being created, but when the 17th byte is read, the text representation of the two output streams needs to be joined together to create a single text line and start again for the next text line.

I can guess how Unix pipes and lazy evaluation would be used to solve this, but for the backtracking I am not certain what you are implying because on backtracking the data for the separate streams would be lost. Does the backtracking idea do what I am doing by combing the multiple streams into one before backtracking out of predicates that hold the state for the separate streams?

Another variation I am aware of to solve this is reifying backtracking, but that is still beyond my skill set.

EricGT · March 25, 2020, 11:08am

Now that I have a better grasp on how to combine multiple streams (difference list) using EDCG here is the code. As I am not an expert with EDCG this may not be the best way, but it does work.

Click triangle to see example code

% -----------------------------------------------------------------------------

:- module(edcg_example,
    [
        dl_append_edcg/6
    ]).

% -----------------------------------------------------------------------------

:- use_module(library(edcg)).

% -----------------------------------------------------------------------------

% These are just two versions of the classic difference list append.
% These are here  for use with the test cases to show that these and the EDCG
% version all pass the same test cases.

dl_append_expanded(Source_1,Hole_1,Source_2,Hole_2,Destination,Destination_hole) :-
    Destination = Source_1,
    Hole_1 = Source_2,
    Hole_2 = Destination_hole.

dl_append(Source_1,Source_2,Source_2,Hole_2,Source_1,Hole_2).

% -----------------------------------------------------------------------------

edcg:acc_info(source_1, Destination, Source_1, _Hole_1, Destination = Source_1).
edcg:acc_info(source_2, Hole_1, Source_2, _Hole_2, Hole_1 = Source_2).
edcg:acc_info(destination, Hole_2, _Destination, Destination_hole, Hole_2 = Destination_hole).

edcg:pred_info(dl_append_edcg,0,[source_1,source_2,destination]).

% NB Originally tried using EDCG notation Destination/destination/Hole_2 which failed miserably.

dl_append_edcg -->>
    Destination/destination,   % Set (unify) start of difference list
    [Destination]:source_1,    % Set (unify) hole of difference list
    Hole_1/source_1,
    [Hole_1]:source_2,
    Hole_2/source_2,
    [Hole_2]:destination.

% ?- listing(dl_append_edcg/6).
% edcg_example:dl_append_edcg(A,
%                             Hole_1,
%                             B,
%                             Hole_2,
%                             Destination,
%                             C) :-
%     true,
%     Destination=A,
%     true,
%     true,
%     Hole_1=B,
%     true,
%     true,
%     Hole_2=C,
%     true.

% Refined variable names by hand
%
% edcg_example:dl_append_edcg(Source_1,
%                             Hole_1,
%                             Source_2,
%                             Hole_2,
%                             Destination,
%                             Destination_hole) :-
%     true,
%     Destination=Source_1,
%     true,
%     true,
%     Hole_1=Source_2,
%     true,
%     true,
%     Hole_2=Destination_hole,
%     true.

% Refined variable names by hand
% Removed true statements
%
% edcg_example:dl_append_edcg(Source_1,
%                             Hole_1,
%                             Source_2,
%                             Hole_2,
%                             Destination,
%                             Destination_hole) :-
%     Destination=Source_1,
%     Hole_1=Source_2,
%     Hole_2=Destination_hole.

% Refined variable names by hand
% Removed true statements
% Refactored unification into head
%
% edcg_example:dl_append_edcg(Source_1, Source_2, Source_2, Hole_2, Source_1, Hole_2).

% -----------------------------------------------------------------------------

% Predicates used by test that conform to same signature.

dl_append_expanded(Source_1,Hole_1,Source_2,Hole_2,Output) :-
    dl_append_expanded(Source_1,Hole_1,Source_2,Hole_2,Output,Output_hole),
    Output_hole = [].  % Convert open list to closed list

dl_append(Source_1,Hole_1,Source_2,Hole_2,Output) :-
    dl_append(Source_1,Hole_1,Source_2,Hole_2,Output,Output_hole),
    Output_hole = [].  % Convert open list to closed list

dl_append_edcg(Source_1,Hole_1,Source_2,Hole_2,Output) :-
    dl_append_edcg(Source_1,Hole_1,Source_2,Hole_2,Output,Output_hole),
    Output_hole = [].  % Convert open list to closed list

:- begin_tests(dl_append).

append_list_test_case_generator(        Hole_1 , Hole_1,        Hole_2 , Hole_2,            [] ).
append_list_test_case_generator(     [1|Hole_1], Hole_1,     [a|Hole_2], Hole_2,         [1,a] ).
append_list_test_case_generator(   [1,2|Hole_1], Hole_1,   [a,b|Hole_2], Hole_2,     [1,2,a,b] ).
append_list_test_case_generator( [1,2,3|Hole_1], Hole_1, [a,b,c|Hole_2], Hole_2, [1,2,3,a,b,c] ).
append_list_test_case_generator(     [1|Hole_1], Hole_1,        Hole_2 , Hole_2,           [1] ).
append_list_test_case_generator(        Hole_1 , Hole_1,     [a|Hole_2], Hole_2,           [a] ).

test(01,[forall(append_list_test_case_generator(Source_1,Hole_1,Source_2,Hole_2,Output))]) :-
    dl_append_expanded(Source_1,Hole_1,Source_2,Hole_2,Output).

test(02,[forall(append_list_test_case_generator(Source_1,Hole_1,Source_2,Hole_2,Output))]) :-
    dl_append(Source_1,Hole_1,Source_2,Hole_2,Output).

test(03,[forall(append_list_test_case_generator(Source_1,Hole_1,Source_2,Hole_2,Output))]) :-
    dl_append_edcg(Source_1,Hole_1,Source_2,Hole_2,Output).

:- end_tests(dl_append).

Topic		Replies	Views
Read a number of bytes from a stream Help!	2	282	October 2, 2023
Parsing of binary data Nice to know	0	1316	October 9, 2019
Read_string/3 counts code points, right? Help!	5	490	January 28, 2022
Binary with_output_to/2 PlDoc how-to	2	344	March 7, 2023
How to send a string + string length on a stream? Help!	20	1360	June 24, 2021

Read_stream_to_codes

Related topics