read_string/3 counts code points, right?

From read_string/3:

read_string(+Stream, ?Length, -String)

Read at most Length characters from Stream and return them in the string String. If Length is unbound, Stream is read to the end and Length is unified with the number of characters read.

I want to double check here since character encoding can get slippery.

The “… at most Length characters…” means “… at most Length Unicode code points…” that are read from Stream, right?

Edit: Assuming the answer is yes: Is there a predicate like read_string that will read N bytes from a stream and convert to a string?

Yes.

No 🙂 But you can use stream_range_open/3, which creates a filter stream with a defined length in bytes. You can then simply read up to end-of-file. So,

    setup_call_cleanup(
         stream_range_open(Input, Tmp, [size(Bytes)]),
         ( set_stream(Tmp, encoding(utf8)),
           read_string(Tmp, _, String)
         ),
         close(Tmp)).

Now, if you need to read JSON, a Prolog term, or anything else that can be read directly from a stream, you do not need the intermediate string. This technique is used to deal with Content-Length: Bytes in HTTP streams.
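As a rough sketch of reading directly from the range stream, the snippet below parses a JSON document of a known byte length without building an intermediate string. The predicate name read_json_body/3 is made up for illustration, and the sketch assumes the payload is UTF-8 encoded JSON and that library(http/http_stream) and library(http/json) are available.

```prolog
:- use_module(library(http/http_stream)).  % stream_range_open/3
:- use_module(library(http/json)).         % json_read_dict/2

%!  read_json_body(+Input, +Bytes, -Dict) is det.
%
%   Hypothetical helper: parse the next Bytes bytes of Input as
%   UTF-8 encoded JSON. Closing the range stream Tmp leaves
%   Input itself open.
read_json_body(Input, Bytes, Dict) :-
    setup_call_cleanup(
        stream_range_open(Input, Tmp, [size(Bytes)]),
        ( set_stream(Tmp, encoding(utf8)),
          json_read_dict(Tmp, Dict)
        ),
        close(Tmp)).
```

Here Bytes plays the role of the Content-Length value; anything on Input beyond those bytes is invisible to the JSON parser.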

Note that with text encoded in UTF-8, that approach misbehaves if the number of bytes you provide ends in the middle of a multi-byte character, so that the last Unicode character is not read in full.

Can you clarify with an example where the bug occurs? I ask because my line of thought, shown below, seems to differ from yours.

Welcome to SWI-Prolog (threaded, 64 bits, version 8.5.5)
...

?- current_prolog_flag(encoding,E).
E = text.

?- open_string("→",Stream),read_string(Stream,L,String),string_codes(String,Codes).
Stream = <stream>(000000000655DEC0),
L = 1,
String = "→",
Codes = [8594].

?- set_prolog_flag(encoding,utf8).
true.

?- current_prolog_flag(encoding,E).
E = utf8.

?- open_string("→",Stream),read_string(Stream,L,String),string_codes(String,Codes).
Stream = <stream>(000000000655DEC0),
L = 1,
String = "→",
Codes = [8594].

Rightwards Arrow → (U+2192)

When Length is set to 1, it still reads the whole Unicode character.

?- open_string("→",Stream),read_string(Stream,1,String),string_codes(String,Codes).
Stream = <stream>(0000000006257D00),
String = "→",
Codes = [8594].

Without going too deep, a byte is 8 bits, so the largest integer you can represent with a byte is 255. In your example, the code 8594 cannot be a byte; it is a code point, which UTF-8 encodes as several bytes.

For the gory details, start reading here: UTF-8 - Wikipedia
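To make the byte versus code point distinction concrete, here is a small sketch that lists the UTF-8 bytes of a piece of text. The helper name utf8_bytes/2 is made up for illustration; it relies on utf8_codes//1 from library(utf8), which translates between a code point list and its UTF-8 byte list.

```prolog
:- use_module(library(utf8)).   % utf8_codes//1

%!  utf8_bytes(+Text, -Bytes) is det.
%
%   Hypothetical helper: Bytes is the list of byte values that
%   encode Text in UTF-8.
utf8_bytes(Text, Bytes) :-
    text_to_string(Text, String),
    string_codes(String, Codes),          % Unicode code points
    phrase(utf8_codes(Codes), Bytes).     % encode them as UTF-8
```

?- utf8_bytes("→", Bytes), length(Bytes, N).
Bytes = [226, 134, 146],
N = 3.

So the single code point 8594 occupies three bytes, and a byte count of 1 or 2 would cut it in the middle.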

?- open_string("😀", Stream), 
   stream_range_open(Stream, Tmp, [size(1)]), 
   read_string(Tmp, 1, String), 
   close(Tmp), close(Stream).
Warning: <stream>(0x55eac48fbdb0):1:1: Illegal UTF-8 continuation
Stream = <stream>(0x55eac48fb620),
Tmp = <stream>(0x55eac48fbdb0),
String = "�".