added more formal definition, elaborated on languages TAGs can and cannot generate, added more visuals, moved comments on variants of TAG to a new section.
Line 71:
Line 71:
==Description==
==Description==
The rules in a TAG are trees with a special leaf node known as the ''foot node'', which is anchored to a word.
[[File:TAG-schematic-adjunction.svg|thumb|Schematic illustration of the adjunction operation: The tree <math>\alpha</math> combines with an auxiliary tree <math>\beta</math> at a node labelled with non-terminal symbol <math>X</math> that needs to be also the root and foot node of the auxiliary tree. The resulting tree is deeper than the original one.]]
There are two types of basic trees in TAG: ''initial'' trees (often represented as '<math>\alpha</math>') and ''auxiliary'' trees ('<math>\beta</math>'). Initial trees represent basic valency relations, while auxiliary trees allow for recursion.<ref name="jurafsky-martin2000">{{cite book
[[File:TAG-schematic-substitution.svg|thumb|Schematic illustration of the substitution operation: two trees (<math>\alpha</math> and <math>\beta</math>, these don't need to be elementary trees) are joined on a node labelled with nonterminal <math>X</math>: this node is one of the leave nodes marked for substitution in <math>\alpha</math>, and the root of <math>\beta</math> is a node with the same nonterminal.]]
A TAG can be defined as a 5-[[tuple]] <math>\langle \Sigma, NT, I, A, S\rangle</math> with<ref name="joshi-shabes-1991">{{Cite tech report
|last1=Joshi
|first1=Aravind K. Joshi
|last2=Shabes
|first2=Yves
|date=March 1991
|title=Tree-adjoning grammars and lexicalized grammars
|publisher=Department of Computer and Information Science, University of Pennsylvania
}}</ref>
* <math>\Sigma</math> as the [[finite set]] of [[Terminal symbol|terminal symbols]];
* <math>NT</math> as the finite set of non-terminal symbols, disjunct from <math>\Sigma</math>
* <math>I</math> as finite set of finite trees, which are called ''initial trees''.
** Initial trees have non-terminals as inner nodes. The frontier can consist of terminals and non-terminals. Non-terminals at the frontier are marked for substitution (typically by the symbol <math>\downarrow</math> added after the non-terminal). Nodes marked for substitution cannot be adjoined to.
* <math>A</math> as the finite set of finite trees, which are called ''auxiliary trees''.
** Auxiliary trees have a special leaf node known as the ''foot node'' (typically marked by <math>\ast</math>) which needs to have the same non-terminal symbol as the root of this tree. All other non-terminal nodes at the frontier are marked for substitution. Same as for initial trees, again inner nodes have non-terminal symbols.
* <math>S</math> as the special start symbol, belonging to the set of non-terminals.
Additionally, TAGs with adjunction constraints on nodes have been introduced. An adjunction constraint on a node can completely disallow adjunction (NA = null adjunction), make it obligatory (OA) or only allow selected auxiliary trees to adjoin (SA).<ref name="joshi-shabes-1991" />
The two types of basic trees in TAG initial trees (often represented as '<math>\alpha</math>') and auxiliary trees ('<math>\beta</math>') are together called ''elementary trees''. Initial trees represent basic valency relations, while auxiliary trees allow for recursion.<ref name="jurafsky-martin2000">{{cite book
| last = Jurafsky
| last = Jurafsky
| first = Daniel
| first = Daniel
Line 82:
Line 105:
| publisher = Prentice Hall
| publisher = Prentice Hall
| location = Upper Saddle River, NJ }}</ref>
| location = Upper Saddle River, NJ }}</ref>
Auxiliary trees have the root (top) node and foot node labeled with the same symbol.
A derivation starts with an initial tree, combining via either ''substitution'' or ''adjunction''. Substitution replaces a frontier node with another tree whose top node has the same label. The root/foot label of the auxiliary tree must match the label of the node at which it adjoins. Adjunction can thus have the effect of inserting an auxiliary tree into the center of another tree.<ref name="joshi-rambow2003"/>
Other variants of TAG allow [[multi-component tree adjoining grammars|multi-component trees]], trees with multiple foot nodes, and other extensions.
A derivation starts with an initial tree, combining via either ''substitution'' or ''adjunction''. Substitution replaces a frontier node with another tree whose top node has the same label. The root/foot label of the auxiliary tree must match the label of the node at which it adjoins. Adjunction can thus have the effect of inserting an auxiliary tree into the center of another tree.<ref name="joshi-rambow2003" />
==Complexity and application==
==Complexity and application==
Tree-adjoining grammars are more powerful (in terms of [[weak generative capacity]]) than [[context-free grammar]]s, but less powerful than [[linear context-free rewriting system]]s,<ref>Kallmeyer, Laura (2010). Parsing Beyond Context-Free Grammars. Springer. Here: p.215-216</ref> [[indexed grammar|indexed]]<ref group=note>since for each tree-adjoining grammar, a linear indexed grammar can be found producing the same language, see [[#Equivalences|below]], and for the latter, a weakly equivalent (proper) indexed grammar can be found, in turn, see [[Indexed grammar#Computational Power]]</ref> or [[context-sensitive grammar|context-sensitive]] grammars.
For every [[context-free grammar]] a tree-adjoining grammar can be generated which accepts the same string language. Thus, TAGs can generate all [[Context-free language|context-free languages]].<ref name="joshi1985" /> Aside from context-free languages they can generate some, but not all [[Context-sensitive language|context-sensitive languages]].
Two examples of context-sensitive, non-context-free languages that TAGs (with adjunction constraints) can generate are<ref name="joshi1985" />
* The copy language (language of squares) in which an arbitrary string is repeated: <math>\left\{ ww \mid w \in \Sigma^* \right\}</math>[[File:Copy-language-tag.jpg|thumb|The three elementary trees necessary to generate the copy language with the alphabet containing only letters a and b.]]
* The count-4 language <math>\{a^n b^n c^n d^n | 1 \le n \}</math> [[File:Anbncndn.jpg|alt=The picture displays two trees. The first one consists only of the nonterminal S at the root and the empty string as its yield. The second one is rooted in a nonterminal S too, this time with three branches: the child nodes are, from left to right, the terminal symbol a, again a nonterminal S, and the terminal symbol d. The S-child itself has three children again: the terminal symbol b, again a nonterminal S, and the terminal symbol c. Therefore the tree has a chain of three S nodes. The grandparent S and grandchild S can be used for adjunction.|thumb|Trees necessary to generate the Count-4-language including the [[Empty word|empty word ]]<math>\epsilon</math>, formally defined as <math>\text{Count}_4 = \left\{ a^n b^n c^n d^n \mid n \geq 0 \right\}</math>.]]
Tree-adjoining grammars are more powerful (in terms of [[weak generative capacity]]) than context-free grammars, but less powerful than [[linear context-free rewriting system]]s,<ref>Kallmeyer, Laura (2010). Parsing Beyond Context-Free Grammars. Springer. Here: p.215-216</ref> [[indexed grammar|indexed]]<ref group="note">since for each tree-adjoining grammar, a linear indexed grammar can be found producing the same language, see [[#Equivalences|below]], and for the latter, a weakly equivalent (proper) indexed grammar can be found, in turn, see [[Indexed grammar#Computational Power]]</ref> or [[context-sensitive grammar|context-sensitive]] grammars.
Two examples for context-sensitive languages that TAGs cannot generate are<ref name="joshi1985" />
* the language of triplicated strings (language of cubes): <math>\left\{ www \mid w \in \Sigma^* \right\}</math>
* languages with more than four distinct character strings of equal length, e.g. the count-5 language <math>\{a^n b^n c^n d^n f^n | 1 \le n \}</math>
A TAG can describe the language of squares (in which some arbitrary string is repeated), and the language <math>\{a^n b^n c^n d^n | 1 \le n \}</math>. This type of processing can be represented by an [[embedded pushdown automaton]].
The processing of languages that TAGs can generate be represented by an [[embedded pushdown automaton]].
Languages with cubes (i.e. triplicated strings) or with more than four distinct character strings of equal length cannot be generated by tree-adjoining grammars.
For these reasons, tree-adjoining grammars are often described as [[Mildly context-sensitive grammar|mildly context-sensitive]].
Tree-adjoining grammars are often described as [[Mildly context-sensitive grammar|mildly context-sensitive]].
These grammar classes are conjectured to be powerful enough to model [[natural language]]s while remaining efficiently [[parser|parsable]] in the general case.<ref name="joshi1985">{{cite book
These grammar classes are conjectured to be powerful enough to model [[natural language]]s while remaining efficiently [[parser|parsable]] in the general case.<ref name="joshi1985">{{cite book
| last = Joshi
| last = Joshi
Line 113:
Line 145:
Vijay-Shanker and Weir (1994)<ref name="vijayshankarAndWeir1995">Vijay-Shanker, K. and Weir, David J. 1994. ''The Equivalence of Four Extensions of Context-Free Grammars''. Mathematical Systems Theory 27(6): 511–546.</ref> demonstrate that [[Indexed grammar#Linear indexed grammars|linear indexed grammars]], [[combinatory categorial grammar]], tree-adjoining grammars, and [[head grammar]]s are [[Weak equivalence (formal languages)|weakly equivalent]] formalisms, in that they all define the same string languages.
Vijay-Shanker and Weir (1994)<ref name="vijayshankarAndWeir1995">Vijay-Shanker, K. and Weir, David J. 1994. ''The Equivalence of Four Extensions of Context-Free Grammars''. Mathematical Systems Theory 27(6): 511–546.</ref> demonstrate that [[Indexed grammar#Linear indexed grammars|linear indexed grammars]], [[combinatory categorial grammar]], tree-adjoining grammars, and [[head grammar]]s are [[Weak equivalence (formal languages)|weakly equivalent]] formalisms, in that they all define the same string languages.
==Lexicalized ==
Lexicalized tree-adjoining grammars (LTAG) are a variant of TAG in which each elementary tree (initial or auxiliary) is associated with a lexical item. A lexicalized grammar for English has been developed by the XTAG Research Group of the Institute for Research in Cognitive Science at the University of Pennsylvania.<ref name="xtagenglish"/>
==Variants ==
''Lexicalized tree-adjoining grammars (LTAG)'' are a variant of TAG in which each elementary tree (initial or auxiliary) is associated with a lexical item. Each tree has at least one terminal as a leaf node, which is then called the (lexical) anchor of the tree. A lexicalized grammar for English has been developed by the XTAG Research Group of the Institute for Research in Cognitive Science at the University of Pennsylvania.<ref name="xtagenglish"/>
Other variants of TAG allow [[multi-component tree adjoining grammars|multi-component trees]], trees with multiple foot nodes, and other extensions.
==See also==
==See also==
Latest revision as of 13:30, 27 June 2025
Template:Short descriptionTree-adjoining grammar (TAG) is a grammar formalism defined by Aravind Joshi. Tree-adjoining grammars are somewhat similar to context-free grammars, but the elementary unit of rewriting is the tree rather than the symbol. Whereas context-free grammars have rules for rewriting symbols as strings of other symbols, tree-adjoining grammars have rules for rewriting the nodes of trees as other trees (see tree (graph theory) and tree (data structure)).
TAG originated in investigations by Joshi and his students into the family of adjunction grammars (AG),[1]
the "string grammar" of Zellig Harris.[2] AGs handle exocentric properties of language in a natural and effective way, but do not have a good characterization of endocentric constructions; the converse is true of rewrite grammars, or phrase-structure grammar (PSG). In 1969, Joshi introduced a family of grammars that exploits this complementarity by mixing the two types of rules. A few very simple rewrite rules suffice to generate the vocabulary of strings for adjunction rules. This family is distinct from the Chomsky-Schützenberger hierarchy but intersects it in interesting and linguistically relevant ways.[3] The center strings and adjunct strings can also be generated by a dependency grammar, avoiding the limitations of rewrite systems entirely.[4][5]
Description
File:TAG-schematic-adjunction.svgSchematic illustration of the adjunction operation: The tree combines with an auxiliary tree at a node labelled with non-terminal symbol that needs to be also the root and foot node of the auxiliary tree. The resulting tree is deeper than the original one.File:TAG-schematic-substitution.svgSchematic illustration of the substitution operation: two trees ( and , these don't need to be elementary trees) are joined on a node labelled with nonterminal : this node is one of the leave nodes marked for substitution in , and the root of is a node with the same nonterminal.
as the finite set of non-terminal symbols, disjunct from
as finite set of finite trees, which are called initial trees.
Initial trees have non-terminals as inner nodes. The frontier can consist of terminals and non-terminals. Non-terminals at the frontier are marked for substitution (typically by the symbol added after the non-terminal). Nodes marked for substitution cannot be adjoined to.
as the finite set of finite trees, which are called auxiliary trees.
Auxiliary trees have a special leaf node known as the foot node (typically marked by ) which needs to have the same non-terminal symbol as the root of this tree. All other non-terminal nodes at the frontier are marked for substitution. Same as for initial trees, again inner nodes have non-terminal symbols.
as the special start symbol, belonging to the set of non-terminals.
Additionally, TAGs with adjunction constraints on nodes have been introduced. An adjunction constraint on a node can completely disallow adjunction (NA = null adjunction), make it obligatory (OA) or only allow selected auxiliary trees to adjoin (SA).[6]
The two types of basic trees in TAG initial trees (often represented as '') and auxiliary trees ('') are together called elementary trees. Initial trees represent basic valency relations, while auxiliary trees allow for recursion.[7]
A derivation starts with an initial tree, combining via either substitution or adjunction. Substitution replaces a frontier node with another tree whose top node has the same label. The root/foot label of the auxiliary tree must match the label of the node at which it adjoins. Adjunction can thus have the effect of inserting an auxiliary tree into the center of another tree.[4]
Two examples of context-sensitive, non-context-free languages that TAGs (with adjunction constraints) can generate are[8]
The copy language (language of squares) in which an arbitrary string is repeated: File:Copy-language-tag.jpgThe three elementary trees necessary to generate the copy language with the alphabet containing only letters a and b.
Tree-adjoining grammars are often described as mildly context-sensitive.
These grammar classes are conjectured to be powerful enough to model natural languages while remaining efficiently parsable in the general case.[8]
Lexicalized tree-adjoining grammars (LTAG) are a variant of TAG in which each elementary tree (initial or auxiliary) is associated with a lexical item. Each tree has at least one terminal as a leaf node, which is then called the (lexical) anchor of the tree. A lexicalized grammar for English has been developed by the XTAG Research Group of the Institute for Research in Cognitive Science at the University of Pennsylvania.[5]
Other variants of TAG allow multi-component trees, trees with multiple foot nodes, and other extensions.