Ann: SWI-Prolog 10.1.7

Dear SWI-Prolog user,

SWI-Prolog 10.1.7 is ready for download. This release lands the
long-running overhaul of Unicode handling — both in the
source-language syntax and in the editor/terminal rendering — together
with a handful of behaviour changes to align with the Unicode
recommendations for programming languages. A few of these change
observable defaults; users who rely on the old behaviour should read
the highlights below before upgrading.

Status

Unicode handling is under discussion in the PIP working group. The
current implementation reflects where I think the consensus is heading.
SWI-Prolog will follow the decisions of the PIP group, so further
incompatible changes are not unlikely, though they should only touch
edge cases. Notably:

  • Classification of solo characters.
  • Handling of bracket pairs (Ps/Pe) and quote pairs (Pi/Pf).
    In particular, the term representation of quote pairs is new and
    likely to be debated. These features may also be restricted using
    Prolog flags and/or options.
  • Non-ASCII digits are not supported by number_codes/2, etc.
    This may change.

Highlights

  • Unicode source syntax now follows UAX #31 (XID_Start / XID_Continue),
    with super-/subscript-digit profile additions for variables and atoms.
  • Default source encoding on Windows is now UTF-8 (was the legacy ANSI
    codepage).
  • Pluggable Unicode normalisation, controlled by the Prolog flag
    unicode_atoms.
  • Unicode bidi-override / isolate code points are now unconditionally
    rejected in source as a Trojan-source defence (CVE-2021-42574).
  • xpce editor and terminal now render and edit NFD combining marks,
    CJK wide glyphs, and supplementary-plane code points (emoji, CJK
    Extensions B–G) correctly on every platform.

Unicode source syntax

The classifier follows the Unicode release reported by the new read-only
Prolog flag unicode_syntax_version (currently '17.0.0'). See section
Unicode Prolog source in the manual for a worked example. Changes:

  • Identifiers are XID_Start followed by XID_Continue per UAX #31,
    with the super- and subscript digits (², ³, ¹, ⁰..⁹, ₀..₉) added as
    XID_Continue so variables like X₁ work.
  • Variable vs. atom: a token is a variable iff it starts with _ or
    a code point in general category Lu. MODIFIED: titlecase
    letters (Lt, e.g. Dž) now start an atom, not a variable; this
    differs from earlier releases that used the broader derived
    Uppercase property.
  • MODIFIED: All Unicode symbol classes (Sm, Sc, Sk, So)
    and the connector/dash/other-punctuation classes (Pc, Pd, Po)
    are now treated as solo: each forms an atom on its own and does
    not glue with adjacent symbols. This is a deliberate
    break from earlier releases, in which Unicode symbols glued into
    compound atoms in the same way as ASCII symbols.
  • MODIFIED: NBSP (U+00A0) is no longer treated as whitespace.
    Outside quoted material it raises a stray-character syntax error.
    The layout set is now exactly Unicode Pattern_White_Space.
  • Bracket pairs (Ps/Pe) and quote pairs (Pi/Pf) are recognised
    as reader-level operators. An opening character followed by a
    Prolog term and the matching closing character reads as a unary
    compound '<open><close>'(Term), generalising the existing
    {Term} ⇒ '{}'(Term) form to the full Unicode Ps/Pe set.
    Quote pairs read literal text in the form selected by
    double_quotes; e.g. «hello, world» reads as
    '«»'("hello, world").
  • Seven Pattern_White_Space code points now act as line terminators
    everywhere (LF, VT, FF, CR, NEL, LS, PS). FIXED: %-comments,
    line-counter, and \<newline> continuation were silently ignoring
    NEL (U+0085), LS (U+2028), PS (U+2029) as well as CR/VT/FF.
  • In source code (read_term/2), numeric literals still use ASCII
    digits 0–9 only. Conversion via atom_number/2, number_codes/2 and
    number_chars/2 additionally accepts any Unicode Nd block; in a
    single number all digits must come from the same block.
  • New code_type/2 / char_type/2 categories prolog_layout (the eleven
    Pattern_White_Space code points) and prolog_end_of_line (the seven
    line-terminator-like ones). paren(Close) and the new
    quote(Close) cover the full Unicode Ps/Pe and Pi/Pf sets.
  • unicode_block/3 updated to Unicode 17.0.
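
A minimal sketch for trying the rules above, assuming a 10.1.7 (or later) build. It uses only the flag and code_type/2 categories named in this announcement plus the long-standing term_string/2; classify/1 and demo/0 are hypothetical helper names. No expected output is shown, since several of these cases are still under discussion in the PIP group.

```prolog
% Sketch only: assumes SWI-Prolog 10.1.7 or later.
% classify/1 and demo/0 are hypothetical helpers, not part of the release.

% Read Text as a term and report how the reader classified it.
classify(Text) :-
    catch(term_string(Term, Text), Error, true),
    (   nonvar(Error)
    ->  format("~w: syntax error~n", [Text])
    ;   var(Term)
    ->  format("~w: variable~n", [Text])
    ;   format("~w: ~q~n", [Text, Term])
    ).

demo :-
    current_prolog_flag(unicode_syntax_version, Version),
    format("Unicode syntax version: ~w~n", [Version]),
    maplist(classify,
            [ "X\u2081",                   % subscript digit in a variable name
              "\u01C5",                    % titlecase letter (general category Lt)
              "\u00ABhello, world\u00BB",  % quote pair (Pi/Pf)
              "a\u00A0b"                   % NBSP outside quoted material
            ]),
    % NEL, LS and PS are among the seven line terminators listed above.
    forall(member(Code, [0x0085, 0x2028, 0x2029]),
           (   code_type(Code, prolog_end_of_line)
           ->  format("U+~16R is a line terminator~n", [Code])
           ;   format("U+~16R is not a line terminator~n", [Code])
           )).
```

Run `?- demo.` at the toplevel. The escapes are used instead of literal characters so the file itself loads on older versions too.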

Unicode atoms / normalisation

  • ADDED: Handling of Unicode atoms is controlled by the four-valued
    option unicode_atoms. The values are accept (default),
    nfc, error, and reject. It follows the same three-tier
    hierarchy as encoding: Prolog flag → stream property → per-call
    option of read_term/2,3, read_clause/2,3, set_stream/2 and open/4.
  • Modes nfc and error auto-load library(unicode) at every entry
    point (previously only the flag’s setter did so).
  • Mode error falls back to a wcwidth-based check when no
    normalisation hook is registered, which can over-reject scripts
    (e.g. Thai) that use combining marks in NFC. A new read-only
    atom_normalize_hook flag tells you whether the precise check is
    available.
  • Unicode bidi-override and isolate code points (U+202A..U+202E and
    U+2066..U+2069) are unconditionally rejected in source tokens,
    quoted strings, and comments as a defence against the Trojan-source
    attack (CVE-2021-42574).
  • Loading library(unicode) now installs an NFC normaliser into the
    kernel via PL_atom_normalize_hook(), which makes unicode_atoms(nfc)
    work without further setup.
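
The three tiers can be sketched as follows. This assumes the option is spelled unicode_atoms(Mode) at the stream and per-call tiers, as the unicode_atoms(nfc) form above suggests; read_terms/3 and read_all/3 are hypothetical helper names.

```prolog
% Sketch only: assumes SWI-Prolog 10.1.7 or later and the option
% spelling unicode_atoms(Mode). read_terms/3 is a hypothetical helper.

% Tier 1, process-wide default (uncomment to enable):
% :- set_prolog_flag(unicode_atoms, nfc).

%!  read_terms(+File, +Mode, -Terms)
%
%   Read all terms from File, applying Mode (accept, nfc, error or
%   reject) on each read_term/3 call.
read_terms(File, Mode, Terms) :-
    setup_call_cleanup(
        open(File, read, In, [encoding(utf8)]),
        read_all(In, Mode, Terms),
        close(In)).

% Tier 2 alternative: set_stream(In, unicode_atoms(Mode)) right after
% opening, instead of passing the option on every call.

read_all(In, Mode, Terms) :-
    read_term(In, Term, [unicode_atoms(Mode)]),   % tier 3: per-call option
    (   Term == end_of_file
    ->  Terms = []
    ;   Terms = [Term|Rest],
        read_all(In, Mode, Rest)
    ).
```

For example, `?- read_terms('data.pl', nfc, Terms).` where data.pl is a placeholder file name.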

Windows

  • MODIFIED: Default source encoding on Windows is now UTF-8. The
    system locale on Windows usually reports a legacy codepage
    (Windows-1252 or similar), which used to make the default Prolog
    encoding flag ANSI/Latin-1 and caused UTF-8 source files to be
    read as their byte-wise Latin-1 interpretation.
  • FIXED: init_locale also sets LC_CTYPE to UTF-8. Without
    this, mbrtowc/wcrtomb stayed ASCII-only and any UTF-8 byte above
    0x7F flowing through PL_canonicalise_text returned EILSEQ —
    surfacing as Syntax error: illegal_multibyte_sequence the moment
    a user typed an emoji or other non-ASCII char in the libedit prompt.
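
If you still have source files saved in a legacy 8-bit codepage, you can state their encoding explicitly instead of relying on the default. A minimal sketch using the existing encoding option of load_files/2; load_legacy/1 and show_default_encoding/0 are hypothetical helper names. Note that iso_latin_1 matches the old byte-wise Latin-1 reading, which differs from Windows-1252 in the range 0x80..0x9F.

```prolog
% Sketch only. load_legacy/1 and show_default_encoding/0 are
% hypothetical helpers built on existing built-ins.

% Load a file that was saved in a legacy 8-bit encoding.
load_legacy(File) :-
    load_files(File, [encoding(iso_latin_1)]).

% Alternatively, put this directive at the top of the legacy file itself:
%
%   :- encoding(iso_latin_1).

% Print the process-wide default encoding.
show_default_encoding :-
    current_prolog_flag(encoding, Enc),
    format("Default encoding: ~w~n", [Enc]).
```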

C API

  • ADDED: PL_is_id_start, PL_is_id_continue, PL_is_uppercase,
    PL_is_decimal, PL_is_layout. Thin shims over the Unicode flag
    table so foreign extensions and embedded toolkits classify code
    points exactly as SWI-Prolog does, without needing the
    locale-dependent POSIX iswX() family.
  • ADDED: PL_wcwidth(int). Locale-portable display-column width
    for a Unicode code point — same answer as the rest of SWI-Prolog,
    regardless of the process LC_CTYPE or platform. On Unix/macOS it
    forwards to system wcwidth(3); on Windows it goes through the
    bundled mk_wcwidth() table.
  • ADDED: foreign_library_property/2 to query foreign library
    properties.

Editor and terminal (xpce)

  • FIXED: Editor column functions now account for CJK wide chars
    (each counts as 2 columns) and combining marks (each counts as 0).
    getColumnEditor() and getColumnLocationEditor() previously
    counted every non-tab character as one visual column, breaking caret
    positioning after CJK double-width and NFD content.
  • FIXED: Forward/backward char and delete now move by grapheme
    cluster, not code point. For NFD text a visible character may span
    multiple code points; the four basic editing operations previously
    stepped one code point at a time, leaving the caret stranded inside
    a cluster or deleting only part of a grapheme.
  • FIXED: Selection drag no longer corrupts text at grapheme
    cluster boundaries (Thai above-base vowel signs etc.).
  • FIXED: Selection endpoints snap to grapheme cluster boundaries.
  • FIXED: Terminal paint_chunks splits per cluster for non-ASCII
    text. When a system fixed-width font has no glyphs for a script
    (e.g. Thai), Pango falls back to a proportional font and shapes the
    run at its natural advance — so a selection within such a line no
    longer drifts horizontally.
  • FIXED: paint_line renders Unicode highlights without visual
    drift.
  • FIXED: Selection in display->copy sets both primary and
    secondary selection.
  • ADDED: nfd_style instance variable for visual NFD cluster
    highlighting on Editor and TerminalImage. Default @nil (disabled).
  • MODIFIED: auto_copy class variable on text_item, editor
    and terminal_image defaults to @off on Windows and macOS and
    @on elsewhere.
  • FIXED (Windows): supplementary-plane code points (emoji,
    CJK Extensions B–G) no longer render as their PUA-area low 16 bits
    or .notdef glyphs. Several wint_t → uchar_t fixes; charW
    now falls back to uint32_t when wchar_t is too narrow; the
    text-buffer file format now round-trips SMP correctly; c_width,
    uchar_display_width, paint_line, regex CHR width and
    ENC_WCHAR stream paths all honour full code points.
  • ENHANCED: SDL backend Windows font specs now cover CJK,
    symbol/emoji, Thai/Lao (Leelawadee UI) and Yi syllables (Microsoft
    Yi Baiti) in the mono/sans/serif fallback chains. Korean, Japanese,
    Chinese and many symbols rendered as tofu before.
  • ENHANCED: font ->member walks the Pango fallback chain by
    default, so it returns @on for code points the system can render
    via fallback (e.g. U+1F600). An optional family=[bool] argument
    forces the old “primary font only” check.
  • ENHANCED: font <-domain likewise unions Pango fontset coverage.
  • ENHANCED: display-level methods to control and observe SDL’s
    on-screen-keyboard policy from Prolog.
  • FIXED: PceEmacs class menu (raised a type error) and XPCE native
    finder icons.
  • FIXED: Icons for the thread monitor (use new SVG icons).

libedit

  • ADDED: el_get/2 to query editline properties. Initially
    supports editor(?Editor) (emacs / vi). Unknown properties
    raise domain_error(editline_property, _).
  • ADDED: bracketed_paste(?Boolean) property on el_set/2 and
    el_get/2. Tracked per el_context and toggled at runtime; skipped
    (and unbound) in vi mode.
  • ADDED: ^Z is bound to send EOF on Windows (matches the
    platform convention; Unix uses ^D).
  • FIXED: el_cursor() takes wchar_t units while electric() passes
    code points; caret motion past wide chars now lands on the right
    cell.
  • FIXED (Windows): pl_line and el_history_encoded use REP_EL
    not REP_MB so libedit’s UTF-8 bytes are decoded correctly under the
    legacy ANSI codepage.
  • ENHANCED: libedit uses PL_wcwidth as its wcwidth
    implementation, so swipl, swipl-win, pl-write/pl-fmt and xpce all
    share one column-width table.
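
A minimal sketch of querying the new properties. It assumes that el_get/2 takes the wrapped input stream as its first argument, like the other library(editline) predicates; that calling convention is an assumption, and report_editline/0 is a hypothetical helper name.

```prolog
% Sketch only: assumes SWI-Prolog 10.1.7 or later and that el_get/2
% is called as el_get(+Input, ?Property).
:- use_module(library(editline)).

report_editline :-
    (   el_wrapped(user_input)
    ->  el_get(user_input, editor(Editor)),
        format("Editing mode: ~w~n", [Editor]),
        el_get(user_input, bracketed_paste(Paste)),
        (   var(Paste)      % left unbound in vi mode, per the notes above
        ->  format("Bracketed paste: not available in this mode~n")
        ;   format("Bracketed paste: ~w~n", [Paste])
        )
    ;   format("user_input is not wrapped by libedit~n")
    ).
```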

Other

  • ADDED: library(macros) accepts dicts as valid macros.
  • ADDED: library(unicode) is now included in the WASM build.
  • FIXED: unload_file/1 clears isfile so a subsequent
    use_module/1 actually reloads — notably re-running
    :- use_foreign_library/1.
  • FIXED: tnodebug/0 no longer raises a domain error.
  • FIXED: code_type/2 no longer reports csymf twice.
  • C++ binding: PlTerm::unify_blob() now takes
    std::unique_ptr<T> for any T derived from PlBlob directly,
    without an upcast.

Documentation pipeline

  • The PDF manual is now built with lualatex (was pdflatex). Source
    files can carry literal Unicode without the \urldef ASCII-routing
    ceremony.

  • SWI manual sources renamed *.doc → *.plx (Prolog LaTeX) so file
    managers, editors and contributors no longer mistake them for MS
    Word documents. .gitattributes maps *.plx to TeX for editor
    syntax highlighting and git diff hunks.

    Enjoy — Jan
