Digraphs and trigraphs (programming)

From Wikipedia, the free encyclopedia

In computer programming, digraphs and trigraphs are sequences of two and three characters, respectively, that appear in source code and, according to a programming language's specification, should be treated as if they were single characters.

Various reasons exist for using digraphs and trigraphs: keyboards may not have keys to cover the entire character set of the language, input of special characters may be difficult, text editors may reserve some characters for special use and so on. Trigraphs might also be used for some EBCDIC code pages that lack characters such as { and }.

History

The basic character set of the C programming language is a subset of the ASCII character set that includes nine characters which lie outside the ISO 646 invariant character set. This can pose a problem for writing source code when the encoding (and possibly keyboard) being used does not support one or more of these nine characters. The ANSI C committee invented trigraphs as a way of entering source code using keyboards that support any national version of the ISO 646 character set.[1]

With the widespread adoption of ASCII and Unicode/UTF-8, trigraph use is limited today, and trigraph support has been removed from C as of C23.[2]

Implementations

Trigraphs are not commonly encountered outside compiler test suites.[3] Some compilers support an option to turn recognition of trigraphs off, or disable trigraphs by default and require an option to turn them on. Some can issue warnings when they encounter trigraphs in source files. Borland supplied a separate program, the trigraph preprocessor (TRIGRAPH.EXE), to be used only when trigraph processing is desired (the rationale was to maximise speed of compilation).

Language support

Different systems define different sets of digraphs and trigraphs, as described below.

ALGOL

Early versions of ALGOL predated the standardized ASCII and EBCDIC character sets, and were typically implemented using a manufacturer-specific six-bit character code. A number of ALGOL operations either lacked codepoints in the available character set or were not supported by peripherals, leading to a number of substitutions including := for ← (assignment) and >= for ≥ (greater than or equal).

Pascal

The Pascal programming language supports digraphs (., .), (* and *) for [, ], { and } respectively. Unlike all other cases mentioned here, (* and *) were and still are in wide use. However, many compilers treat them as a different type of commenting block rather than as actual digraphs, that is, a comment started with (* cannot be closed with } and vice versa.

J

The J programming language is a descendant of APL but uses the ASCII character set rather than APL symbols. Because the printable range of ASCII is smaller than APL's specialized set of symbols, . (dot) and : (colon) characters are used to inflect ASCII symbols, effectively interpreting unigraphs, digraphs or rarely trigraphs as standalone "symbols".[4]

Unlike the use of digraphs and trigraphs in C and C++, there are no single-character equivalents to these in J.

C

Trigraph Equivalent
??= #
??/ \
??' ^
??( [
??) ]
??! |
??< {
??> }
??- ~

In versions of C before C23,[5] the C preprocessor (used for C and, with slight differences, for C++; see below) replaces all occurrences of the nine trigraph sequences in this table with their single-character equivalents before any other processing.[6][7]
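The replacement step described above can be sketched as a single left-to-right pass over a buffer. This is only an illustration under simplifying assumptions, not how any particular compiler implements it; the names trigraph_equivalent and replace_trigraphs are invented for this example. The question marks in the code are written so that the listing itself contains no trigraph.

```c
#include <stddef.h>

/* Hypothetical helper: map the third character of a "??x" sequence to
   its replacement, or return 0 if it does not complete a trigraph. */
static char trigraph_equivalent(char third)
{
    switch (third) {
    case '=':  return '#';
    case '/':  return '\\';
    case '\'': return '^';
    case '(':  return '[';
    case ')':  return ']';
    case '!':  return '|';
    case '<':  return '{';
    case '>':  return '}';
    case '-':  return '~';
    default:   return 0;
    }
}

/* Hypothetical helper: replace trigraphs in the NUL-terminated string
   buf, in place, and return the new length. The result is never longer
   than the input, so no extra storage is needed. */
size_t replace_trigraphs(char *buf)
{
    size_t in = 0, out = 0;

    while (buf[in] != '\0') {
        char eq = 0;

        /* Only look at the third character once two '?' have been seen;
           this also keeps every read inside the string. */
        if (buf[in] == '?' && buf[in + 1] == '?')
            eq = trigraph_equivalent(buf[in + 2]);

        if (eq != 0) {
            buf[out++] = eq;        /* three characters become one */
            in += 3;
        } else {
            buf[out++] = buf[in++]; /* copy everything else unchanged */
        }
    }
    buf[out] = '\0';
    return out;
}
```

Applied to a buffer holding ??=define X, this sketch leaves #define X. Three question marks followed by - become ?~, because the first question mark is copied and the remaining three characters form a trigraph.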

A programmer may want to place two question marks together yet not have the compiler treat them as introducing a trigraph. The C grammar does not permit two consecutive ? tokens, so the only places in a C file where two question marks in a row may be used are in multi-character constants, string literals, and comments. This is particularly a problem for the classic Mac OS, where the constant '????' may be used as a file type or creator.[8] To safely place two consecutive question marks within a string literal, the programmer can use string concatenation "...?""?..." or an escape sequence "...?\?...".
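The two workarounds can be compared directly. The sketch below is illustrative only, and the identifiers via_concatenation, via_escape and spellings_agree are invented for it. Both literals should contain the same characters whether or not the compiler translates trigraphs, because neither spelling places two question marks next to each other in the source text.

```c
#include <string.h>

/* Adjacent string literals are joined after trigraph replacement has
   already happened, so the two question marks never meet in the source. */
const char via_concatenation[] = "Really?" "?!";

/* The escape sequence \? denotes a plain question mark. */
const char via_escape[] = "Really?\?!";

/* Returns 1 if both spellings produced identical strings. */
int spellings_agree(void)
{
    return strcmp(via_concatenation, via_escape) == 0;
}
```

Written naively with the question marks adjacent, the last three characters of the literal would be read as the trigraph for | on a compiler that still performs trigraph replacement.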

??? is not itself a trigraph sequence, but when followed by a character such as - it will be interpreted as ? + ??-, as in the example below which has 16 ?s before the /.

The ??/ trigraph can be used to introduce an escaped newline for line splicing; this must be taken into account for correct and efficient handling of trigraphs within the preprocessor. It can also cause surprises, particularly within comments. For example:

    // Will the next line be executed????????????????/
    a++;

which is a single logical comment line (used in C++ and C99), and

    /??/
    * A comment *??/
    /

which is a correctly formed block comment. The concept can be used to check for trigraphs as in the following C99 example, where only one return statement will be executed.

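    int trigraphsavailable() // returns 0 or 1; language standard C99 or later
    {
        // are trigraphs available??/
        return 0;
        return 1;
    }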

Alternative digraphs introduced in the C standard in 1994
Digraph Equivalent
<: [
:> ]
<% {
%> }
%: #

In 1994, a normative amendment to the C standard, C95,[9][10] included in C99, supplied digraphs as more readable alternatives to five of the trigraphs.

Unlike trigraphs, digraphs are handled during tokenization, and any digraph must always represent a full token by itself, or compose the token %:%: replacing the preprocessor concatenation token ##. If a digraph sequence occurs inside another token, for example a quoted string, or a character constant, it will not be replaced.
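A short sketch of the difference, assuming a compiler that accepts the 1994 digraphs (C95 or later). The function names second_of_three and digraph_in_string_length are invented for this example.

```c
#include <string.h>

/* Written with digraphs; equivalent to
   int second_of_three(void) { int a[3] = {10, 20, 30}; return a[1]; } */
int second_of_three(void)
<%
    int a<:3:> = <%10, 20, 30%>;
    return a<:1:>;
%>

/* Inside a string literal the same two characters are ordinary text,
   so the string below holds '<' and ':' rather than a single '['. */
size_t digraph_in_string_length(void)
{
    return strlen("<:");
}
```

Here second_of_three() returns 20 and digraph_in_string_length() returns 2. A trigraph in the same position inside the string would have been replaced, since trigraph replacement takes place before tokenization.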

C++

C++ (through C++14, see below) behaves like C, including the C99 additions.[11]

As a note, %:%: is treated as a single token, rather than two occurrences of %:.

In the sequence <::, if the subsequent character is neither : nor >, the < is treated as a preprocessing token by itself and not as the first character of the alternative token <:. This is done so that certain uses of templates, such as a template argument that begins with the scope operator ::, are not broken by the substitution.

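The C++ Standard notes that the term "digraph" (a token consisting of two characters) is not perfectly descriptive, since one of the alternative preprocessing tokens is %:%: and several primary tokens also contain two characters; the alternative tokens that are not lexical keywords are nonetheless colloquially known as "digraphs".[12]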

Trigraphs were proposed for deprecation in C++0x, which was released as C++11.[13] This was opposed by IBM, speaking on behalf of itself and other users of C++,[14] and as a result trigraphs were retained in C++11. Trigraphs were then proposed again for removal (not only deprecation) in C++17.[15] This passed a committee vote, and trigraphs (but not the additional tokens) were removed in C++17 despite the opposition from IBM.[16] Existing code that uses trigraphs can be supported by translating the source files (parsing trigraphs) to the basic source character set, which does not include trigraphs.[15]

RPL

Hewlett-Packard calculators supporting the RPL language and input method provide support for a large number of trigraphs (also called TIO codes) to reliably transcribe non-seven-bit ASCII characters of the calculators' extended character set[17][18][19] on foreign platforms, and to ease keyboard input without using the CHARS application.[20][21][18][19] The first character of every TIO code is a \, followed by two other ASCII characters vaguely resembling the glyph to be substituted.[20][21][18][19][22] All other characters can be entered using the special \nnn TIO code syntax, where nnn is the three-digit decimal number (with leading zeros if necessary) of the corresponding code point, which formally makes such a code a tetragraph.[20][18][19]
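The numeric \nnn form is simple enough to sketch. The function below is a hypothetical helper named decode_numeric_tio, not part of any Hewlett-Packard software. It handles only the three-digit decimal form and assumes nothing about the mnemonic codes or about the valid range of code points.

```c
#include <ctype.h>

/* Hypothetical helper: decode a numeric TIO code of the form \nnn,
   where nnn is exactly three decimal digits. Returns the code point,
   or -1 if s does not start with such a code. No range check is made. */
int decode_numeric_tio(const char *s)
{
    if (s[0] != '\\')
        return -1;

    /* Stop at the first non-digit, which also covers a string that
       ends early, so no character past the terminator is read. */
    for (int i = 1; i <= 3; i++) {
        if (!isdigit((unsigned char)s[i]))
            return -1;
    }
    return (s[1] - '0') * 100 + (s[2] - '0') * 10 + (s[3] - '0');
}
```

With this sketch, the input \065 decodes to 65, while \65 is rejected because the leading zero is missing.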

Application support

Vim

The Vim text editor supports digraphs for actual entry of text characters, following RFC 1345. The entry of digraphs is bound to Ctrl+K by default.[23] The list of all possible digraphs in Vim can be displayed by typing :digraphs.

GNU Screen

GNU Screen has a digraph command, bound to Ctrl+A Ctrl+V by default.[24]

Lotus

Lotus 1-2-3 for DOS uses Alt+F1 as a compose key to allow easier input of many special characters of the Lotus International Character Set (LICS)[25] and Lotus Multi-Byte Character Set (LMBCS).

References

  1. [citation text missing]
  2. [citation text missing]
  3. [citation text missing; reference name: C]
  4. [citation text missing; reference name: Hui_2015]
  5. [citation text missing]
  6. [citation text missing; reference name: BSI_2003_C]
  7. [citation text missing; reference name: Rationale_2003_C]
  8. [citation text missing]
  9. [citation text missing; ISO standard]
  10. [citation text missing]
  11. [citation text missing; reference name: Stroustrup_1994_DEC]
  12. [citation text missing; reference name: OpenSTD]
  13. [citation text missing; reference name: N2837]
  14. [citation text missing; reference name: N2910]
  15. [citation text missing; reference name: N3981]
  16. [citation text missing; reference name: N4210]
  17. [citation text missing; reference name: HP82240B_1989]
  18. [citation text missing; reference name: HP48G_UG]
  19. [citation text missing; reference name: HP50G_AUR]
  20. [citation text missing; reference name: HP-TIO]
  21. [citation text missing; reference name: Heinz_2005]
  22. [citation text missing; reference name: Finseth_2012]
  23. [citation text missing; reference name: vim]
  24. [citation text missing; reference name: Screen]
  25. [citation text missing; reference name: HP_1991_95LXUG]