%% This is part of the OpTeX project, see http://petr.olsak.net/optex

\_codedecl \pdfunidef {PDFunicode strings for outlines <2021-02-08>} % preloaded in format

   \_doc -----------------------------
   \`\_hexprint` is a command defined in Lua that scans a number and expands
   to its UTF-16 Big Endian encoded form for use in PDF hexadecimal strings.
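   \nl
   A small sketch of a direct call (an assumption for illustration only:
   this file itself calls the macro just from `\_pdfunidefB`). The number
   can be given in any form accepted by \TeX:
   \begtt
   \message{\_hexprint `A}       % code 65, should write 0041
   \message{\_hexprint "1D544 }  % code above "FFFF, should write D835DD44
   \endtt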
   \_cod -----------------------------

\bgroup
\_catcode`\%=12
\_gdef\_hexprint{\_directlua{
   local num = token.scan_int()
   if num < 0x10000 then
      tex.print(string.format("%04X", num))
   else
      num = num - 0x10000
      local high = bit32.rshift(num, 10) + 0xD800
      local low = bit32.band(num, 0x3FF) + 0xDC00
      tex.print(string.format("%04X%04X", high, low))
   end
}}
\egroup

   \_doc -----------------------------
   \`\pdfunidef``\macro{<text>}` defines `\macro` as <text> converted to
   Big Endian UTF-16 and enclosed in \code{<>}. Example of usage:
   `\pdfunidef\infoauthor{Petr Olšák} \pdfinfo{/Author \infoauthor}`.\nl
   \^`\pdfunidef` does more than just convert to a hexadecimal PDF string.
   The <text> can be scanned in verbatim mode (this is the case because \^`\_Xtoc`
   reads the <text> in verbatim mode). The first `\edef` does
   `\_scantextokens\unexpanded` and the second `\edef` expands the parameter
   according to the current values of selected macros from `\_regoul`. Then
   \`\_removeoutmath` converts `..$x^2$..` to `..x^2..`, i.e.\ removes dollars.
   Then \`\_removeoutbraces` converts `..{x}..` to `..x..`.
   Finally, the <text> is detokenized, spaces are preprocessed using \^`\replstring`
   and then \`\_pdfunidefB` is repeated on each character. It calls the
   macro \^`\_hexprint`, whose `\directlua` chunk prints the hexadecimal numbers.\nl
   Characters for quotes (and separators for quotes) are activated by the first
   `\_scantextokens` and they are defined as the same non-active characters.
   But `\_regoul` can change this definition.
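   \nl
   A sketch of these steps on a made-up title (`\tmp` is a hypothetical
   macro name; the intermediate forms are traced by hand and shown as
   comments, they are not output of the macro):
   \begtt
   \pdfunidef\tmp{Energy $E=mc^2$ in {\bf bold}}
   % after the \edef steps (\bf is empty):  Energy $E=mc^2$ in {bold}
   % after \_removeoutmath:                 Energy E=mc^2 in {bold}
   % after \_removeoutbraces:               Energy E=mc^2 in bold
   % \tmp then holds this text as a UTF-16BE hexadecimal string
   \endtt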
   \_cod -----------------------------

\_def\_pdfunidef#1#2{%
   \_begingroup
      \_catcodetable\_optexcatcodes \_adef"{"}\_adef'{'}%
      \_the\_regoul \_relax % \_regmacro alternatives of logos etc.
      \_ifx\_savedttchar\_undefined \_def#1{\_scantextokens{\_unexpanded{#2}}}%
      \_else \_lccode`\;=\_savedttchar \_lowercase{\_prepinverb#1;}{#2}\_fi
      \_edef#1{#1}%
      \_escapechar=-1
      \_edef#1{#1\_empty}%
      \_escapechar=`\\
      \_ea\_edef \_ea#1\_ea{\_ea\_removeoutmath   #1$\_fin$}%  $x$ -> x
      \_ea\_edef \_ea#1\_ea{\_ea\_removeoutbraces #1{\_fin}}%  {x} -> x
      \_edef#1{\_detokenize\_ea{#1}}%
      \_replstring#1{ }{{ }}%  text text -> text{ }text
      \_catcode`\\=12 \_let\\=\_bslash
      \_edef\_out{<FEFF}
      \_ea\_pdfunidefB#1^%  text -> \_out in hex (UTF-16BE)
      \_ea
   \_endgroup
   \_ea\_def\_ea#1\_ea{\_out>}
}
\_def\_pdfunidefB#1{%
   \_ifx^#1\_else
      \_edef\_out{\_out \_hexprint `#1}
   \_ea\_pdfunidefB \_fi
}

\_def\_removeoutbraces #1#{#1\_removeoutbracesA}
\_def\_removeoutbracesA #1{\_ifx\_fin#1\_else #1\_ea\_removeoutbraces\_fi}
\_def\_removeoutmath #1$#2${#1\_ifx\_fin#2\_else #2\_ea\_removeoutmath\_fi}

   \_doc -----------------------------
   The \`\_prepinverb``<macro><separator>{<text>}`,
   e.g.\ `\_prepinverb\tmpb|{aaa |bbb| cccc |dd| ee}`
   does `\def\tmpb{<su>{aaa }bbb<su>{ cccc }dd<su>{ ee}}` where
   <su> is `\scantextokens\unexpanded`. It means that the in-line verbatim
   parts are not arguments of `\scantextokens`. The first `\edef\tmpb` tokenizes
   the <text> again, but not the parts which were in the in-line verbatim.
   \_cod -----------------------------

\_def\_prepinverb#1#2#3{\_def#1{}%
   \_def\_dotmpb ##1#2##2{\_addto#1{\_scantextokens{\_unexpanded{##1}}}%
      \_ifx\_fin##2\_else\_ea\_dotmpbA\_ea##2\_fi}%
   \_def\_dotmpbA ##1#2{\_addto#1{##1}\_dotmpb}%
   \_dotmpb#3#2\_fin
}

   \_doc -----------------------------
   The \^`\regmacro` is used in order to set the values of macros
   `\em`, `\rm`, `\bf`, `\it`, `\bi`, `\tt`, `\/` and `~` to values usable in
   PDF outlines.
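   \nl
   A user document can register further macros in the same way. A sketch,
   where `\mylogo` is a hypothetical macro that is not part of \OpTeX/:
   \begtt
   \def\mylogo{{\it My}\kern-.1em Logo}   % typeset form, unusable in outlines
   \regmacro {}{}{\def\mylogo{MyLogo}}    % plain form used only for outlines
   \endtt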
   \_cod -----------------------------

\_regmacro {}{}{\_let\em=\_empty \_let\rm=\_empty \_let\bf=\_empty
    \_let\it=\_empty \_let\bi=\_empty \_let\tt=\_empty \_let\/=\_empty
    \_let~=\_space
}
\public \pdfunidef ;

\_endcode % --------------------------------

There are only two encodings for PDF strings (used in PDFoutlines, PDFinfo,
etc.). The first one is PDFDocEncoding, which is a single-byte encoding, but it
lacks most international characters.

The second encoding is Big Endian UTF-16, which is implemented in this file. It
encodes a single character in either two or four bytes.
This encoding is inconvenient for \TeX/ because it looks like

\begtt
<FEFF 0043 0076 0069 010D 0065 006E 00ED 0020 006A 0065 0020 007A 00E1 0074
011B 017E 0020 0061 0020 0078 2208 D835DD44>
\endtt

This example shows a hexadecimal PDF string (enclosed in \code{<>} as opposed
to the literal PDF string enclosed in `()`). In these strings each byte is
represented by two hexadecimal characters (`0-9`, `A-F`). You can tell the
encoding is UTF-16BE, because it starts with the \"Byte order mark" `FEFF`. Each
Unicode character is then encoded in one or two byte pairs. The example string
corresponds to the text \"Cvičení je zátěž a ${\rm x} ∈ 𝕄$". Notice the 4 bytes
for the last character, $𝕄$. (The whitespace shown here would even be OK in a
PDF file, because it should be ignored by PDF viewers, but \LuaTeX\ doesn't
allow it.)
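
The two byte pairs of the last character can be traced by hand. The
following worked arithmetic is only an illustration; it follows the `else` branch
of `\_hexprint` for the code U+1D544 of the character $𝕄$, where `div` and
`mod` stand for integer division and remainder:

\begtt
0x1D544 - 0x10000 = 0xD544
0xD544 div 1024   = 0x35     high = 0xD800 + 0x35  = 0xD835
0xD544 mod 1024   = 0x144    low  = 0xDC00 + 0x144 = 0xDD44
\endtt

The result `D835DD44` agrees with the last item of the example string.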

\_endinput

2021-02-08 \_octalprint -> \_hexprint
2020-03-12 Released
