0603:cs0603028/AI.tex

1: \documentclass[11pt]{article}

2:

3: \usepackage{alltt}

4: \usepackage{graphics}

5: \usepackage{latexsym}

6: \usepackage{textcomp}

7: \usepackage{amssymb}

8: \usepackage{amsmath}

9: \usepackage{amsthm}

10: \usepackage{pst-tree}

11: \psset{levelsep=5ex,nodesep=3pt,radius=1ex,fillcolor=black,hatchsep=1.5pt}

12: \psset{tnpos=r}

13:

14: \renewcommand{\leq}{\leqslant}

15: \renewcommand{\geq}{\geqslant}

16: \renewcommand{\emptyset}{\varnothing}

17: \newcommand{\aco}{\texttt{\symbol{'173}}}

18: \newcommand{\act}{\texttt{\symbol{'175}}}

19:

20: \newtheorem{theorem}{Theorem}

21: \theoremstyle{definition}

22: \newtheorem{definition}{Definition}

23: \theoremstyle{remark}

24: \newtheorem{remark}{Remark}[section]

25:

26: \title{On the tree-transformation power of XSLT}

27:

28: \author{Wim Janssen \and

29: Alexandr Korlyukov$^{\text{\textdied}}$

30: \and

31: Jan Van den Bussche\thanks{Wim Janssen and Jan Van den Bussche are with the University of Hasselt, Belgium.}}

32:

33: \date{}

34:

35: \begin{document}

36: 	\maketitle

37: 	\renewcommand{\thefootnote}{}

38: 		\footnotetext{Alexandr Korlyukov, who was with Grodno State University,

39: Belarus, sadly passed away shortly after we agreed to write a joint paper.}

40: 	\renewcommand{\thefootnote}{\arabic{footnote}}

41:

42:

43: 		\begin{abstract}

44: 			XSLT is a standard rule-based programming language for expressing

45: 			transformations of XML data.  The language is currently in transition from

46: 			version 1.0 to 2.0.  In order to understand the computational consequences of

47: 			this transition, we restrict XSLT to its pure tree-transformation

48: 			capabilities.  Under this focus, we observe that XSLT~1.0 was not yet a

49: 			computationally complete tree-transformation language: every 1.0 program can

50: 			be implemented in exponential time.

51: 			A crucial new feature of version~2.0,

52: 			however, which allows node sets over temporary trees, yields completeness.  We

53: 			provide a formal operational semantics for XSLT programs, and establish

54: 			confluence for this semantics.

55: 		\end{abstract}

56:

57:

58:

59:

60:

61: 		\section{Introduction}

62:

63: 			XSLT is a powerful rule-based programming language, relatively widely used,

64: 			for expressing transformations of XML data, and is developed by the W3C (World

65: 			Wide Web Consortium) \cite{xslt,xslt2,xslt_kay}.  An XSLT program is run on

66: 			an XML document as input, and produces another XML document as output.  (XSLT

67: 			programs are actually called ``stylesheets'', as one of their main uses is to

68: 			produce stylised renderings of the input data, but we will continue to call

69: 			them programs here.)

70:

71: 			The language is actually in a transition period: the current standard,

72: 			version~1.0, is being replaced by version~2.0.  It is important to understand

73: 			what the new features of 2.0 really add.  In the present paper, we focus on

74: 			the tree-transformation capabilities of XSLT\@.  Indeed, XML documents are

75: 			essentially ordered, node-labeled trees.

76:

77: 			From the perspective of tree-transformation capabilities, the most important

78: 			new feature is that of ``node sets over temporary trees''.  We will show that

79: 			this feature turns XSLT into a computationally complete tree-transformation

80: 			language.  Indeed, as we will also show, XSLT~1.0 was \emph{not} yet complete

81: 			in this sense.  Specifically, any 1.0 program can be implemented within

82: 			exponential time in the worst case.  Some programs actually express

83: 			PSPACE-complete problems, because we will show that any linear-space turing

84: 			machine can be simulated by an XSLT~1.0 program.

85:

86: 			To put our results in context, we note that the designers of XSLT will most

87: 			probably regard the incompleteness of their language as a feature, rather than

88: 			a defect.  Indeed, in the requirements document for 2.0, turning XSLT into a

89: 			general-purpose programming language is explicitly stated as a ``non-goal''

90: 			\cite{xsltreq}.  In that respect, our result on the completeness of 2.0

91: 			exposes (albeit in a narrow sense) a failure to meet the requirements!

92:

93: 			At this point we should be a little clearer on what we mean by ``focusing on

94: 			the tree-transformation capabilities of XSLT''\@.  As already mentioned, XML

95: 			documents are essentially trees where the nodes are labeled by arbitrary

96: 			strings.  We make abstraction of this string content by regarding the node

97: 			labels as coming from some finite alphabet.  Accordingly, we strip XSLT of its

98: 			string-manipulation functions, and restrict its arithmetic to arbitrary

99: 			polynomial-time functions on counters, i.e., integers in the range

100: 			$\{1,2,\dots,n\}$ with $n$ the number of nodes in the input tree.  It is,

101: 			incidentally, quite easy to see that XSLT~1.0

102: 			\emph{without} these restrictions

103: 			can express all computable functions on strings (or integers).  Indeed, rules

104: 			in XSLT can be called recursively, and we all know that arbitrary recursion

105: 			over the strings or the integers gives us completeness.

106:

107: 			We will provide a formal operational semantics for the substantial fragment of

108: 			XSLT discussed in this paper.  A formal semantics has not been available,

109: 			although the W3C specifications represent a fine effort in defining it

110: 			informally.  Of course we have tried to make our formalisation faithful to

111: 			those specifications.  Our semantics does not impose an order on operations

112: 			when there is no need to, and as a result the resulting transition relation is

113: 			non-deterministic.  We establish, however, a confluence property, so that any

114: 			two terminating runs on the same input yield the same final result.

115: 			Confluence was not yet proven rigorously for XSLT, and can help in providing a

116: 			formal justification for alternative processing strategies that XSLT

117: 			implementations may follow for the sake of optimisation.

118:

119:

120: 			\section{Data model}

121:

122: 			\newcommand{\X}{\mathcal{X}}

123: 			\newcommand{\T}{\mathcal{T}}

124: 			\newcommand{\V}{\mathcal{V}}

125: 			\renewcommand{\t}{\mathbf{t}}

126: 			\renewcommand{\S}{\mathbf{S}}

127: 			\newcommand{\E}{\mathbf{E}}

128: 			\newcommand{\n}{\mathbf{n}}

129: 			\newcommand{\Input}{\textsf{Input}}

130: 			\newcommand{\nodes}{\textsf{nodes}}

131: 			\newcommand{\mixed}{\textsf{mixed}}

132: 			\newcommand{\eval}{\mathit{eval}}

133: 			\newcommand{\tstring}{\mathit{string}}

134: 			\newcommand{\maketree}{\mathit{maketree}}

135: 			\newcommand{\tta}{\texttt{a}}

136: 			\newcommand{\ttb}{\texttt{b}}

137: 			\newcommand{\ttc}{\texttt{c}}

138: 			\newcommand{\ttdoc}{\texttt{doc}}

139:

140: 			\subsection{Data trees} \label{secdatatrees}

141:

142: 				Let $\Sigma$ be a finite alphabet, including the special label \ttdoc.

143: 				By a \emph{data tree} we simply mean a

144: 				finite ordered tree, in which the nodes are labeled by elements of $\Sigma$.

145: 				Up to isomorphism, we can describe a data tree

146: 				$\t$ by a string $\tstring(\t)$

147: 				over the alphabet $\Sigma$ extended with the two symbols \aco\ and \act: if

148: 				the root of $\t$ is labeled \tta\

149: 				and its sequence of top-level subtrees is

150: 				$\t_1,\dots,\t_k$, then $$ \tstring(\t) = \tta \aco\tstring(\t_1)\dots

151: 				\tstring(\t_k) \act $$

152: 				Thus, for the data tree shown in Figure~\ref{figdatatree}, the string

153: 				representation equals

154: 				$$ \tta \aco \ttb \aco \act \ttc \aco \tta \aco \act \ttb \aco \act \act \ttc

155: 				\aco \act \act. $$

156:

157: 				\begin{figure}

158: 					\begin{center}

159: 						\pstree{\TR{$\tta$}}

160: 						{

161: 						  \TR{$\ttb$}

162: 						  \pstree{\TR{$\ttc$}}

163: 						  {

164: 						    \TR{$\tta$}

165: 						    \TR{$\ttb$}

166: 						  }

167: 						  \TR{$\ttc$}

168: 						}

169: 					\end{center}

170: 			%		\nocaptionrule

171: 					\caption{A data tree.}

172: 					\label{figdatatree}

173: 				\end{figure}

174:

175: 				A \emph{data forest} is a finite sequence of data trees.  Forests arise

176: 				naturally in XSLT, and for uniformity reasons we need to be able to

177: 				present them as data trees.  This can easily be done as follows:

178:

179: 				\begin{definition}[maketree] \label{defmaketree}

180: 					Let $F$ be a data forest.  Then $\maketree(F)$ is

181: 					the data tree obtained by affixing a root node on top of $F$,

182: 					and labeling this root node with \ttdoc.\footnote{The root node added by

183: 					$\maketree$ models what is called the ``document root'' in the XPath data

184: 					model \cite{xquerypath_datamodel}, although we do not model it entirely

185: 					faithfully, as we do not formally distinguish ``document nodes'' from

186: 					``element nodes''.  This is only for simplicity; it is no problem to

187: 					incorporate this distinction in our formalism, and our technical results do

188: 					not depend on our simplification.}

189: 				\end{definition}

190:

191: 			\subsection{Stores and values}

192:

193: 				Let $\T$ be a supply of \emph{tree variables},

194: 				including the special tree variable \Input.  We define:

195:

196: 				\begin{definition}

197: 					A \emph{store} is a finite set $\S$ of pairs of the form

198: 					$(x,\t)$, where $x \in \T$ and $\t$ is a data tree, such that

199: 					(1)~\Input\ occurs in $\S$;

200: 					(2)~no tree variable occurs twice in $\S$; and

201: 					(3)~all data trees occurring in $\S$ have disjoint sets of nodes.

202:

203: 					The tree assigned to \Input\ is called the \emph{input tree}; the

204: 					other trees are called the \emph{temporary trees}.

205: 				\end{definition}

206:

207: 				\begin{definition}

208: 					A \emph{value over $\S$} is a finite sequence consisting

209: 					of nodes from trees in $\S$, and counters

210: 					over $\S$.  Here, a \emph{counter over $\S$} is an integer in the range

211: 					$\{1,2,\dots,n\}$, where $n$ is the total number of nodes in $\S$.

212: 				\end{definition}

213:

214: 				Values as defined above formalise the kind of values that can be returned by

215: 				XPath expressions.  XPath \cite{xpath,xpath2} is a language that is used as a

216: 				sublanguage in XSLT for the purpose of selecting nodes from trees.  But XPath

217: 				expressions can also return numbers, which is useful as an aid in making

218: 				node selections (e.g., the $i$-th child of a node, or the $i$-th

219: 				node of the tree in preorder).  We limit these numbers to counters, in order

220: 				to concentrate on pure tree transformations.

221:

222:

223: 		\section{XPath abstraction}

224:

225: 			\newcommand{\rootexp}{\texttt{/*}}

226: 			\newcommand{\childexp}{\texttt{child::*}}

227:

228: 			Since the language XPath is already well understood

229: 			\cite{wadler_xpath,xpath_nutshell,xquery_formal,hidders_light}, and its study

230: 			in itself is not our focus, we will work with an abstraction of XPath, which

231: 			we denote by $\X$.  For our purposes it will suffice to divide the

232: 			$\X$-expressions in only two different types, which we denote by \nodes\ and

233: 			\mixed.  A value is of type \nodes\ if it consists exclusively of

234: 			nodes; otherwise it is of type \mixed.

235:

236: 			In order to define the semantics of $\X$, we need some definitions, which

237: 			reflect those from the XPath specification.

238: 			Let $\V$ be a supply of \emph{value variables}, disjoint from $\T$.

239:

240: 			\begin{definition}

241: 				An \emph{environment over $\S$}

242: 				is a finite set $\E$ of pairs of the form

243: 				$(x,v)$, where $x \in \V$ and $v$ is a value over $\S$, such that no value

244: 				variable occurs twice in $\E$.

245: 			\end{definition}

246:

247: 			\begin{definition}

248: 				A \emph{context triple over $\S$} is a triple $(z,i,k)$ where $z$ is a node

249: 				from $\S$ or a counter over $\S$,

250: 				and $i$ and $k$ are counters over $\S$ such that $i \leq k$.

251: 				We call $z$ the \emph{context item}, $i$ the \emph{context position}, and $k$

252: 				the \emph{context size}.

253: 			\end{definition}

254:

255: 			\begin{definition}

256: 				A \emph{context} is a triple $(\S,\E,c)$ where $\S$ is a store, $\E$ is an

257: 				environment over $\S$, and $c$ is a context triple over $\S$.

258: 			\end{definition}

259:

260: 			If we denote the universe of all possible contexts by \textit{Contexts}, the

261: 			semantics of $\X$ is now given by a partial function $\eval$ on $\X \times

262: 			\mathit{Contexts}$, such that whenever defined, $\eval(e,C)$ is a value over

263: 			$C$'s store, and this value has the same type as $e$.

264:

265: 			\begin{remark} \label{xmlschema}

266: 				A static type

267: 				system, based on XML Schema \cite{xmlschema,essencexml}, can be put on

268: 				contexts to ensure definedness of expressions \cite{xquery_formal}, but we

269: 				omit that as safety is not the focus of the present paper.

270: 				\qed

271: 			\end{remark}

272:

273: 			In general we do not assume much from $\X$, except for the

274: 			availability of the

275: 			following basic expressions, also present in real XPath:

276: 			\begin{itemize}

277: 				\item

278: 					An expression `\rootexp', such that $\eval(\rootexp,C)$ equals

279: 					the root node of the input tree in $C$'s store.

280: 				\item

281: 					An expression `\childexp', such that $\eval(\childexp,\allowbreak C)$

282: 					is defined whenever $C$'s context item is a node $\n$, and then

283: 					equals the list of children of $\n$.

284: 			\end{itemize}

285:

286:

287:

288: 		\section{Syntax} \label{secsyntax}

289:

290: 			In this section, we define the syntax of a sizeable fragment of XSLT~2.0.  The

291: 			reader familiar with XSLT will notice that we have simplified and cleaned up

292: 			the language in a few places.  These modifications are only for the sake of

293: 			simplicity of exposition, and our technical results do not depend on them.  We

294: 			discuss our deviations from the real language further in

295: 			Section~\ref{deviations}.

296:

297: 			Also, the concrete syntax of real

298: 			XSLT is XML-based and rather unwieldy.  For the sake of presentation, we

299: 			therefore give a syntax of our own, which is non-XML, but otherwise follows the

300: 			same lines as the real syntax.

301:

302: 			The grammar is shown in Figure~\ref{figgrammar}.  The only typing condition we

303: 			need is that in an apply-statement or in a vcopy-statement, \textit{expr} must

304: 			be of type \nodes.  Also, no two different rules can have the same

305: 			\textit{name}, and the \textit{name} in a call-statement must be the

306: 			\textit{name} of some rule.

307:

308: 			\begin{figure}

309: 				\begin{tabbing}

310: 					Program $\to$ Rule* \\

311: 					Rule $\to$ \=\textsf{template} \textit{name} \textsf{match} \textit{expr} (\textsf{mode} \textit{name})?

312: 																\aco\ Template \act \\

313: 					Template $\to$ Statement* \\

314: 					Statement $\to$ \=$|$ \=\textsf{cons}

315:                                         \textit{label} \aco\ Template

316:                                         \act\kill

317: 					Statement $\to$ \>\> \textsf{cons} \textit{label} \aco\ Template \act\\

318: 						\> $|$ \textsf{apply} \textit{expr} (\textsf{mode} \textit{name})?\\

319: 						\> $|$ \textsf{call} \textit{name}\\

320: 						\> $|$ \textsf{foreach} \textit{expr} \aco\ Template \act\\

321: 						\> $|$ \textsf{val} \textit{value\_variable} \textit{expr}\\

322: 						\> $|$ \textsf{tree} \textit{tree\_variable} \aco\ Template \act\\

323: 						\> $|$ \textsf{vcopy} \textit{expr}\\

324: 						\> $|$ \textsf{tcopy} \textit{tree\_variable}\\

325: 						\> $|$ \textsf{if} \textit{expr} \aco\ Template \act\ \textsf{else} \aco\ Template \act

326:

327: 				\end{tabbing}

328: %				\nocaptionrule

329: 				\caption{Our syntax.

330: 				The terminal symbol \textit{expr} stands for an

331: 				$\X$-expression; \textit{label} stands for an element of our alphabet

332: 				$\Sigma$; \textit{value\_variable} and \textit{tree\_variable} stand for

333: 				elements of $\V$ and $\T$, respectively; and \textit{name} is

334: 				self-explanatory.  As usual we use * to denote repetition, ? to denote

335: 				optionality, and use (~and ) for lexical grouping.}

336: 				\label{figgrammar}

337: 			\end{figure}

338:

339: 			We will often identify a template $M$ with its \emph{syntax tree}.  This tree

340: 			consists of all occurrences of statements in $M$ and represents how they

341: 			follow each other and how they are nested in each other; we omit the formal

342: 			definition.  Observe that only cons-, foreach-, tree-, and if-statements can

343: 			have children.

344: 			Note also that, since a

345: 			template is a sequence of statements, the syntax ``tree'' is actually a

346: 			forest, i.e., a sequence of trees, but we will still call it a tree.

347:

348: 			Variable definitions happen through val- and tree-statements.

349: 			We will need the notion of a statement being in the scope of some variable

350: 			definition; this is defined in the standard way as follows.

351:

352: 			\begin{definition}

353: 				Let $M$ be a template, and let $S_1$ and $S_2$ be two

354: 				statements occurring in $M$.  We say that $S_2$ is \emph{in the scope of}

355: 				$S_1$ if $S_2$ is a right sibling of $S_1$

356: 				in the syntax tree of $M$, or a descendant of such a right sibling.

357: 				An illustration is in Figure~\ref{figscope}.

358: 			\end{definition}

359:

360: 			\begin{figure}

361: 				\begin{center}

362: 					\pstree{\TC}

363: 					{

364: 					  \pstree{\TC}

365: 					  {

366: 					    \TC

367: 					    \pstree{\TC[fillstyle=solid]}

368: 					    {

369: 					      \TC

370: 					      \TC

371: 					    }

372: 					    \TC[fillstyle=vlines]

373: 					    \pstree{\TC[fillstyle=vlines]}

374: 					    {

375: 					      \TC[fillstyle=vlines]

376: 					      \TC[fillstyle=vlines]

377: 					    }

378: 					  }

379: 					  \TC

380: 					  \pstree{\TC}

381: 					  {

382: 					    \TC

383: 					    \TC

384: 					  }

385: 					}

386: 				\end{center}

387: 	%			\nocaptionrule

388: 				\caption{Depiction of a syntax tree.  The nodes in the scope of the black node

389: 				are those that are striped.}

390: 				\label{figscope}

391: 			\end{figure}

392:

393: 			One final definition:

394:

395: 			\begin{definition}

396: 				Template $M'$ is called a \emph{subtemplate} of template $M$ if $M'$ consists

397: 				of a sequence of consecutive sibling statements occurring in  $M$.

398: 			\end{definition}

399:

400:

401: \section{Operational semantics} \label{secsemantics}

402:

403: \newcommand{\context}{\textit{context}}

404: \newcommand{\confrepl}[2]{(#1\gets #2)}

405:

406: Fix a program $P$ and a data tree $\t$.  We will describe the semantics of $P$

407: on input $\t$ as a rewrite relation $\Rightarrow$ among configurations.

408:

409: \begin{definition} \label{defconfig}

410: A \emph{configuration} consists of a template $M$

411: together with a partial function that assigns a context to some of

412: the statements of $M$ (more precisely, the nodes of its syntax tree).

413: The statements that have a context are called \emph{active}; we require that

414: the descendants of an inactive node are inactive too.

415: Cons-statements are never active.

416:

417: We use the following notation concerning configurations:

418: \begin{itemize}

419: \item

420: $S \lhd \gamma$ denotes that $S$ is a statement occurring in the template of

421: configuration $\gamma$.

422: \item

423: If $S \lhd \gamma$, then

424: $\gamma(S) = C$ denotes that $S$ is active in $\gamma$, having context $C$.

425: \item

426: If $M$ is a subtemplate of a configuration $\gamma$, then $M$

427: itself can be taken as a configuration by inheriting all the context

428: assignments done by $\gamma$.  We call such a configuration a

429: \emph{subconfiguration}.

430: \item

431: If $M$ is a subconfiguration of $\gamma$, and $\gamma'$ is another

432: configuration, then

433: $\gamma\confrepl{M}{\gamma'}$

434: denotes the configuration obtained from $\gamma$ by

435: replacing $M$ by $\gamma'$.

436: \end{itemize}

437: \end{definition}

438:

439: The initial configuration is defined as follows.

440:

441: \begin{definition}

442: \begin{enumerate}

443: \item

444: The \emph{initial context} equals

445: $$ \bigl (

446: \{(\Input,\t)\},\,\emptyset,\,(\mathbf{r},1,1)

447: \bigr ) $$

448: where $\mathbf{r}$ is the root of $\t$.

449: \item

450: The \emph{initial template} equals the single statement `\textsf{apply}

451: \rootexp'.

452: \item

453: The \emph{initial configuration} consists of the initial template, whose

454: single statement is assigned the initial context.

455: \end{enumerate}

456: \end{definition}

457:

458: The goal will be to rewrite the initial configuration into a \emph{terminal

459: template}; this is a configuration consisting exclusively of cons-statements.

460: Observe that terminal templates can be viewed as data forests; indeed, simply

461: by removing the \textsf{cons}'s from a terminal template, we obtain the string

462: representation of a data forest.

463:

464: For the rewrite relation $\Rightarrow$ we are going to define, terminal

465: configurations will be normal forms, i.e., cannot be rewritten further.  If,

466: for two configurations $\gamma_0$ and $\gamma_1$, we have $\gamma_0

467: \Rightarrow \cdots \Rightarrow \gamma_1$ and $\gamma_1$ is a normal form, we

468: denote that by $\gamma_0 \Rightarrow^! \gamma_1$.  The relation $\Rightarrow$

469: will be defined in such a way that if $\gamma_0$ is the initial configuration

470: and $\gamma_0 \Rightarrow^! \gamma_1$, then $\gamma_1$ will be terminal.

471: Moreover, we will prove in Theorem~\ref{confluence} that each configuration

472: $\gamma_0$ has at most one such normal form $\gamma_1$.  We thus define:

473:

474: \begin{definition} \label{deffinal}

475: Given $P$ and $\t$, let $\gamma_0$ be the initial configuration and let

476: $\gamma_0 \Rightarrow^! \gamma_1$.  Then the \emph{final result

477: tree of applying $P$ to $\t$} is defined to be $\maketree(\gamma_1)$.

478: \end{definition}

479:

480: In the above definition, we can indeed apply $\maketree$, defined on data

481: forests (Definition~\ref{defmaketree}),

482: to $\gamma_1$, since $\gamma_1$ is terminal and we just observed that

483: terminal templates describe forests.

484: Note that the final result tree is only determined up to isomorphism.

485:

486: \subsection{If-statements}

487:

488: \newcommand{\transif}{\stackrel{\mathsf{if}}{\Rightarrow}}

489: \newcommand{\normif}{\mathrel{\transif{}^!}}

490:

491: If-statements are the only ones that generate control flow, so we treat

492: them by a separate rewrite relation $\transif$, defined

493: by the semantic rules shown in Figure~\ref{figif}.

494:

495: \begin{figure}

496: \begin{itemize}

497: \item

498: $

499: \begin{array}[t]{l}

500: S = \text{\textsf{if} $e$ \aco\ $M_{\rm true}$ \act\

501: \textsf{else} \aco\ $M_{\rm false}$ \act} \lhd \gamma \\

502: \gamma(S)=C \\

503: \eval(e,C) \neq \emptyset \\

504: \hline

505: \gamma \transif \gamma\confrepl{S}{M_{\rm true}}

506: \end{array}

507: $

508: \medskip

509: \item

510: $

511: \begin{array}[t]{l}

512: S = \text{\textsf{if} $e$ \aco\ $M_{\rm true}$ \act\

513: \textsf{else} \aco\ $M_{\rm false}$ \act} \lhd \gamma \\

514: \gamma(S)=C \\

515: \eval(e,C) = \emptyset \\

516: \hline

517: \gamma \transif \gamma\confrepl{S}{M_{\rm false}}

518: \end{array}

519: $

520: \end{itemize}

521: %\nocaptionrule

522: \caption{Semantics of if-statements; $\emptyset$ denotes the empty sequence.}

523: \label{figif}

524: \end{figure}

525:

526: It is not difficult to show that $\transif$ is terminating and

527: locally confluent, whence confluent,

528: so that every configuration has a unique normal form w.r.t.\

529: $\transif$ \cite{terese}.  This normal form no longer contains any active

530: if-statements.  (Quite obviously, the most efficient way to get to this normal

531: form is to work out the if-statements top-down.)

532: We write $\gamma \normif \gamma'$ to denote that $\gamma'$ is the

533: normal form of $\gamma$ w.r.t.\ $\transif$.

534:

535: \begin{remark}

536: Our main rewrite relation $\Rightarrow$ is not terminating in general.

537: The reason why we treat if-statements separately is to avoid

538: nonsensical rewritings such as

539: where we execute a non-terminating statement

540: in the else-branch of an if-statement whose test

541: evaluates to true.

542: \end{remark}

543:

544: \subsection{Apply-, call-, and foreach-statements}

545:

546: \newcommand{\init}{\mathit{init}}

547:

548: For the semantics of apply-statements, we need the following definitions.

549:

550: \begin{definition}[ruletoapply] \label{defruletoapply}

551: Let $C$ be a context, let

552: $\n$ be a node, and let $m$ be a name.

553: Then $\mathit{ruletoapply}(C,\n)$ (respectively,

554: $\mathit{ruletoapply}(C,\n,m)$)

555: equals the template belonging to the

556: first rule in $P$ (respectively, with \textsf{mode} name equal to $m$)

557: whose \textit{expr} satisfies

558: $\n \in \eval(\textit{expr},C)$.

559:

560: If no such rule exists, both $\mathit{ruletoapply}(C,\n)$

561: and $\mathit{ruletoapply}(C,\n,m)$ default to the

562: single-statement template `\textsf{apply} \childexp'.

563: \end{definition}

564:

565: \begin{definition}[init] \label{definit}

566: Let $M$ be a template, and let $C$ be a context.  Then $\init(M,C)$

567: equals the configuration obtained from $M$ by assigning context $C$ to every

568: statement in $M$, except for all statements in the scope of any variable

569: definition, and all statements that are below a foreach-statement;

570: all those statements remain inactive.

571: \end{definition}

572:

573: We are now ready for the semantic rule for apply-statements, shown in

574: Figure~\ref{figapply}.  We omit the rule for an apply-statement with a

575: \textsf{mode $m$}: the only difference with the rule shown

576: is that we use $\mathit{ruletoapply}(\n_i,C,m)$.

577:

578: \begin{figure}

579: \begin{itemize}

580: \item

581: $

582: \begin{array}[t]{l}

583: S = \text{\textsf{apply $e$}} \lhd \gamma \\

584: \gamma(S) = C = (\S,\E,c) \\

585: \eval(e,C) = (\n_1,\dots,\n_k) \\

586: \mathit{ruletoapply}(\n_i,C) = M_i \quad \text{for $i=1,\dots,k$} \\

587: \init(M_i,(\S,\E,(\n_i,i,k))) = \gamma_i \quad \text{for $i=1,\dots,k$} \\

588: \gamma\confrepl{S}{\gamma_1\dots\gamma_k} \normif \gamma' \\

589: \hline

590: \gamma \Rightarrow \gamma'

591: \end{array}

592: $

593: \medskip

594: \item

595: $

596: \begin{array}[t]{l}

597: S = \text{\textsf{foreach} $e$ \aco\ $M$ \act} \lhd \gamma \\

598: \gamma(S) = C = (\S,\E,c) \\

599: \eval(e,C) = (z_1,\dots,z_k) \\

600: \init(M,(\S,\E,(z_i,i,k))) = \gamma_i \quad \text{for $i=1,\dots,k$} \\

601: \gamma\confrepl{S}{\gamma_1\dots\gamma_k} \normif \gamma' \\

602: \hline

603: \gamma \Rightarrow \gamma'

604: \end{array}

605: $

606: \medskip

607: \item

608: $

609: \begin{array}[t]{l}

610: S = \text{\textsf{call} \textit{name}} \lhd \gamma \\

611: \gamma(S) = C \\

612: \mathit{rulewithname}(\mathit{name}) = M \\

613: \init(M,C) = \gamma_1 \\

614: \gamma\confrepl{S}{\gamma_1} \normif \gamma' \\

615: \hline

616: \gamma \Rightarrow \gamma'

617: \end{array}

618: $

619: \end{itemize}

620: %\nocaptionrule

621: \caption{Semantics of apply-, call-, and foreach-statements.}

622: \label{figapply}

623: \end{figure}

624:

625: The semantic rule for foreach-statements is very similar to that for

626: apply-statements, and is also shown in Figure~\ref{figapply}.

627:

628: For call-statements, we need the following definition.

629:

630: \begin{definition}[rulewithname] \label{defrulewithname}

631: 	For any \textit{name}, let $\mathit{rulewithname}(\mathit{name})$ denote the

632: template of the rule in $P$ with that \textit{name}.

633: \end{definition}

634:

635: The semantic rule for a call-statement is then again shown in Figure~\ref{figapply}.

636:

637: \subsection{Variable definitions}

638:

639: \newcommand{\updateset}{\mathit{updateset}}

640: \newcommand{\contupd}[2]{(#1\colon #2)}

641:

642: For a context $C=(\S,\E,c)$, a value variable $x$, a value $v$, a tree

643: variable $y$, and a data tree $\t$, we denote by

644: \begin{itemize}

645: \item

646: $C\contupd{x}{v}$

647: the context obtained from $C$ by updating $\E$ with the pair $(x,v)$;

648: and by

649: \item

650: $C\contupd{y}{\t}$

651: the context obtained from $C$ by updating $\S$ with the pair

652: $(y,\t)$.

653: \end{itemize}

654:

655: We also define:

656:

657: \begin{definition}[updateset] \label{defupdateset}

658: Let $\gamma$ be a configuration and let $S \lhd \gamma$.

659: Let $M$ be the template underlying $\gamma$.

660: Let $S_1,\dots,S_k$ be the right siblings of $S$ in $M$, in that

661: order.  Let $j$ be the smallest index for which $S_j$ is active in $\gamma$;

662: if all the $S_i$ are inactive, put $j=k+1$.  Then the template $S_1 \dots

663: S_{j-1}$ is denoted by $\updateset(\gamma,S)$.  If $j=1$ then this is the

664: empty template.

665: \end{definition}

666:

667: We are now ready for the semantic rules for variable definitions,

668: shown in Figure~\ref{figvar}.

669:

670: \begin{figure}

671: \begin{itemize}

672: \item

673: $\begin{array}[t]{l}

674: S = \text{\textsf{val} $x$ $e$} \lhd \gamma \\

675: \gamma(S) =  C \\

676: C\contupd{x}{\eval(e,C)} = C' \\

677: \updateset(\gamma,S) = M \\

678: \init(M,C') = \gamma_1 \\

679: \gamma\confrepl{SM}{\gamma_1} \normif \gamma' \\

680: \hline

681: \gamma \Rightarrow \gamma'

682: \end{array}$

683: \medskip

684: \item

685: $\begin{array}[t]{l}

686: S = \text{\textsf{tree} $y$ \aco\ $M$ \act} \lhd \gamma \\

687: \text{$M$ is terminal} \\

688: \gamma(S) =  C \\

689: C\contupd{y}{\maketree(M)} = C' \\

690: \updateset(\gamma,S)=M' \\

691: \init(M',C') = \gamma_3 \\

692: \gamma\confrepl{SM'}{\gamma_3} \normif \gamma' \\

693: \hline

694: \gamma \Rightarrow \gamma'

695: \end{array}$

696: \end{itemize}

697: %\nocaptionrule

698: \caption{Semantics of variable definitions.}

699: \label{figvar}

700: \end{figure}

701:

702: \subsection{Copy-statements}

703:

704: \newcommand{\forest}{\mathit{forest}}

705: \newcommand{\ttemp}{\mathit{ttemp}}

706: \newcommand{\choproot}{\mathit{choproot}}

707:

708: The following definitions are illustrated in Figure~\ref{figforest}.

709:

710: \begin{definition}[forest] \label{defforest}

711: Let $\S$ be a store, and let $(\n_1,\dots,\n_k)$ be a sequence of nodes from

712: $\S$.  For $i=1,\dots,k$, let $\t_i$ be the data subtree rooted at $\n_i$.

713: Then $\forest((\n_1,\dots,\n_k),\S)$ equals the data forest

714: $(\t_1,\dots,\t_n)$.

715: \end{definition}

716:

717: \begin{definition}[ttemp] \label{defttemp}

718: Let $F$ be a data forest.  Then $\ttemp(F)$ equals the

719: terminal template describing $F$.

720: \end{definition}

721:

722: \begin{figure}

723: \begin{center}

724: \pstree{\TR{$\tta$}}

725: {

726:   \TR{$\ttb$}~{$\n_1$}

727:   \pstree{\TR{$\ttc$}~{$\n_2$}}

728:   {

729:     \TR{$\tta$}~{$\n_3$}

730:     \TR{$\ttb$}

731:   }

732:   \TR{$\ttc$}

733: }

734: \qquad

735: \pstree{\TR{$\ttc$}~{$\n_4$}}

736: {

737:   \TR{$\tta$}

738:   \TR{$\ttb$}

739: }

740: \end{center}

741:

742: \begin{tabbing}

743: 	$ \ttemp \bigl ( \forest((\n_4,\n_1,\n_2,\n_3,\n_1),\S) \bigr ) = {}$\=

744: 	\textsf{cons $\ttc$ \aco\ cons $\tta$ \aco \act\ cons $\ttb$ \aco \act\ \act}\\

745: 	\>\textsf{cons $\ttb$ \aco \act}\\

746: 	\>\textsf{cons $\ttc$ \aco\ cons $\tta$ \aco \act\ cons $\ttb$ \aco \act\ \act}\\

747: 	\>\textsf{cons $\tta$ \aco \act}\\

748: 	\>\textsf{cons $\ttb$ \aco \act}

749: \end{tabbing}

750: %\nocaptionrule

751: \caption{Illustration of Definitions \ref{defforest} and \ref{defttemp}.}

752: \label{figforest}

753: \end{figure}

754:

755: We also need:

756:

757: \begin{definition}[choproot]

758: Let $\t$ be a data tree with top-level subtrees $\t_1,\dots,\t_k$, in that

759: order.  Then $\choproot(\t)$ equals the data forest $(\t_1,\dots,\t_k)$.

760: \end{definition}

761:

762: The semantic rules for copy-statements are now shown in Figure~\ref{figcopy}.

763:

764: \begin{figure}

765: \begin{itemize}

766: \item

767: $

768: \begin{array}[t]{l}

769: S = \text{\textsf{vcopy} $e$} \lhd \gamma \\

770: \gamma(S) = C = (\S,\E,c) \\

771: \eval(e,C) = (\n_1,\dots,\n_k) \\

772: \ttemp \bigl ( \forest((\n_1,\dots,\n_k),\S) \bigr ) = M \\

773: \hline

774: \gamma \Rightarrow \gamma\confrepl{S}{M}

775: \end{array}

776: $

777: \medskip

778: \item

779: $

780: \begin{array}[t]{l}

781: S = \text{\textsf{tcopy} $y$} \lhd \gamma \\

782: \gamma(S) = (\S,\E,c) \\

783: (y,\t) \in \S \\

784: \ttemp(\choproot(\t)) = M \\

785: \hline

786: \gamma \Rightarrow \gamma\confrepl{S}{M}

787: \end{array}

788: $

789: \end{itemize}

790: %\nocaptionrule

791: \caption{Semantics of copy-statements.}

792: \label{figcopy}

793: \end{figure}

794:

795: \subsection{Discussion} \label{deviations}

796:

797: The final result of applying $P$ to $\t$ (Definition~\ref{deffinal}) may be

798: undefined for two very different reasons.  The first, fundamental, reason is

799: that the rewriting may be nonterminating.  The second reason is that the

800: rewriting may abort because the evaluation of an $\X$-expression is undefined,

801: or the tree variable in a tcopy-statement is not defined in the store.  This

802: second reason can easily be avoided by a type system on $\X$, as already

803: mentioned in Remark~\ref{xmlschema}, together with scoping rules to keep track

804: of which variables are visible in the XSLT program and which variables are

805: used in the $\X$-expressions.  Such scoping rules are entirely standard, and

806: indeed are implemented in the XSLT processor SAXON \cite{saxon}.

807:

808: In the same vein, we have simplified the parameter passing mechanism of XSLT,

809: and have omitted the feature of global variables.  On the other hand, our

810: mechanism for choosing the rule to apply (Definition~\ref{defruletoapply}) is

811: more powerful than the one provided by XSLT, as ours is context-dependent.

812: It is actually easier to define that way.  As already

813: mentioned at the beginning of Section~\ref{secsyntax}, none of our technical

814: results depend on the modifications we have made.

815:

816: Finally, we note that the XSLT processor SAXON evaluates variable definitions

817: lazily, whereas we simply evaluate them eagerly.  Again, lazy evaluation could

818: have been easily incorporated in our formalism.  Some programs may terminate

819: on some inputs lazily, while they do not terminate eagerly, but for programs

820: that use all the variables they define there is no difference.

821:

822: \subsection{Confluence}

823:

824: \newcommand{\R}{\mathcal{R}}

825:

826: Recall that we call a rewrite relation \emph{confluent} if, whenever we can

827: rewrite a configuration $\gamma_1$ to $\gamma_2$ as well as to $\gamma_3$,

828: then there exists $\gamma_4$ such that we can further rewrite both $\gamma_2$

829: and $\gamma_3$ into $\gamma_4$.  Confluence guarantees that all terminating

830: runs from a common configuration also end in a common configuration

831: \cite{terese}.  Since, for our rewrite relation $\Rightarrow$,

832: either all runs on some input are nonterminating, or none is, the

833: following theorem implies that the same final result of a program $P$ on an

834: input $\t$, if defined at all, will be obtained regardless of the order in

835: which we process active statements.

836:

837: \begin{theorem} \label{confluence}

838: Our rewrite relation $\Rightarrow$ is confluent.

839: \end{theorem}

840:

841: \begin{proof}

842:

843: The proof is a very easy application of

844: a basic theorem of Rosen about subtree replacement systems

845: \cite{rosen_cr}.  A subtree replacement system $\R$ is a (typically infinite)

846: set of pairs of the form $\phi \to \psi$, where $\phi$ and $\psi$ are

847: descriptions up to isomorphism of ordered, node-labeled trees, where the node

848: labels come from some (again typically infinite) set $V$.  Let us refer to

849: such trees as $V$-trees.  Such a system $\R$ naturally induces a rewrite

850: system $\Rightarrow_{\R}$ on $V$-trees: we have $\t \Rightarrow_{\R} \t'$ if

851: there exists a node $\n$ of $\t$ and a pair $\phi \to \psi$ in $\R$ such that

852: the subtree $\t/\n$ is isomorphic to $\phi$, and $\t' =

853: \t\confrepl{\n}{\psi}$.  Here, we use the notation $\t/\n$ for the subtree of

854: $\t$ rooted at $\n$, and the notation $\t\confrepl{\n}{\psi}$ for the tree

855: obtained from $\t$ by replacing $\t/\n$ by a fresh copy of $\psi$.  Rosen's

856: theorem states that if $\R$ is ``unequivocal'' and ``closed'', then

857: $\Rightarrow_{\R}$ is confluent.

858:

859: ``Unequivocal'' means that for each $\phi$ there is at most one $\psi$ such

860: that $\phi\to \psi$ is in $\R$.  The definition of $\R$ being ``closed'' is a

861: bit more complicated.  To state it, we need the notion of a \emph{residue map}

862: from $\phi$ to $\psi$.  This is a mapping $r$

863: from the nonroot nodes of $\phi$ to

864: sets of nonroot nodes of $\psi$, such that for $m \in r(n)$ the subtrees

865: $\phi/n$ and $\psi/m$ are isomorphic.  Moreover, if $n_1$ and $n_2$ are

866: independent (no descendants of each other), then all nodes in $r(n_1)$ must

867: also be independent of all nodes in $r(n_2)$.

868:

869: Now $\R$ being closed means that we can assign a residue map $r[\phi,\psi]$

870: to every $\phi\to\psi$ in $\R$ in such a way that for any $\phi_0 \to \psi_0$

871: in $\R$, and any node $n$ of $\phi_0$, if there exists a pair $\phi_0/n \to

872: \psi$ in $\R$, then the pair $\phi_0\confrepl{n}{\psi} \to

873: \psi_0\confrepl{r[\phi_0,\psi_0](n)}{\psi}$ is also in $\R$.  Denoting the

874: latter pair by $\phi_1 \to \psi_1$, we must moreover have for each node $p$

875: of $\phi_0$ that is independent of $n$, that $r[\phi_1,\psi_1](p) =

876: r[\phi_0,\psi_0](p)$.

877:

878: To apply Rosen's theorem, we view configurations (Definition~\ref{defconfig})

879: as $V$-trees, where $V = \mathit{Statements} \cup (\mathit{Statements} \times

880: \mathit{Contexts})$.  Here, $\mathit{Statements}$ is the set of all possible

881: syntactic forms of statements.  So, given a configuration, we take the syntax

882: tree of the underlying template, and label every inactive node by its

883: corresponding statement, and every active node by its corresponding statement

884: and its context in the configuration.  (Since templates are sequences, we

885: actually get $V$-forests rather than $V$-trees, but that is a minor fuss.)

886:

887: Now consider the subtree replacement system $\R$ consisting of all pairs

888: $\gamma \to \gamma'$ for which $\gamma \Rightarrow \gamma'$ as defined by our

889: semantics, where $\gamma$ consists of a single statement $S_0$, and the active

890: statement being processed to get $\gamma'$ is a direct child of $S_0$.  Since

891: our semantics always substitutes siblings for siblings, it is clear that

892: $\Rightarrow_{\R}$ then coincides with our rewrite relation $\Rightarrow$.

893: Since the processing of every individual statement is always deterministic (up

894: to isomorphism of trees), $\R$ as just defined is clearly unequivocal.

895:

896:

897: We want to show that $\R$ is closed.  Thereto, we define residue

898: maps $r[\gamma,\gamma']$ as follows.

899:

900: The case where $\gamma\to\gamma'$ is the processing of an apply- or

901: call-statement, is depicted in Figure~\ref{figrosen} (top).  The node being

902: processed is shown in black.  The subtemplates to the left and right are left

903: untouched.  Referring to the notation used in Figure~\ref{figapply}, the newly

904: substituted subtemplate $\gamma_{\mathrm{new}}$ is such that $\gamma_1 \dots

905: \gamma_k \normif \gamma_{\mathrm{new}}$ (for apply) or $\gamma_1 \normif

906: \gamma_{\mathrm{new}}$ (for call).  Indeed, since we apply $\normif$ at the

907: end of every processing step, $\gamma$ itself does not contain any active

908: if-statements.  We define $r = r[\gamma,\gamma']$ as follows:

909: \begin{itemize}

910: \item

911: For nodes $n$ in $\gamma_{\mathrm{left}}$ or $\gamma_{\mathrm{right}}$, we put

912: $r(n) := \{n'\}$, where $n'$ is the corresponding node in $\gamma'$.

913: \item

914: For the black node $b$, we put $r(b) := \emptyset$.

915: \end{itemize}

916: The main condition for closedness is clearly satisfied, because statements can

917: be processed independently.  Note that the black node has no children, let

918: alone active children, which allows us to put $r(b)=\emptyset$.  The condition

919: on $p$'s is also satisfied, because both $r[\phi_0,\psi_0]$ and

920: $r[\phi_1,\psi_1]$ will set $r(p)$ to $\{p'\}$.

921:

922: The case where $\gamma\to\gamma'$ is the processing of a foreach-statement is

923: depicted in Figure~\ref{figrosen} (middle).  This case is analogous to the

924: previous one.  The only difference is that the black node now has descendants

925: ($M$ in the figure).  Because the $\init$ function (Definition~\ref{definit})

926: always leaves descendants of a foreach node inactive, however, the nodes in

927: $M$ are inactive at this time, and we can

928: put $r(n) := \emptyset$ for all of them.

929:

930: The case where $\gamma\to\gamma'$ is the processing of a val-statement is

931: depicted in Figure~\ref{figrosen} (bottom).  Since all

932: nodes in the update set are inactive by definition

933: (Definition~\ref{defupdateset}), we can again put $r(n) :=

934: \emptyset$ for all nodes in the update set.

935: The case of a tree-statement is similar; now the black node again

936: has descendants, but again these are all inactive (they are all

937: cons-statements).

938: \begin{figure}

939: \begin{center}

940: \psset{tnpos=a}

941: \pstree{\TC~{$\gamma$}}

942: {

943:   \TR{$\gamma_{\mathrm{left}}$}

944:   \TC[fillstyle=solid]~[tnpos=b]{\textsf{apply/call}}

945:   \TR{$\gamma_{\mathrm{right}}$}

946: }

947: \quad

948: $\to$

949: \quad

950: \pstree{\TC~{$\gamma'$}}

951: {

952:   \TR{$\gamma_{\mathrm{left}}$}

953:   \TR{$\gamma_{\mathrm{new}}$}

954:   \TR{$\gamma_{\mathrm{right}}$}

955: }

956: \bigskip

957:

958: \pstree{\TC~{$\gamma$}}

959: {

960:   \TR{$\gamma_{\mathrm{left}}$}

961:   \pstree{\TC[fillstyle=solid]~[tnpos=r]{\textsf{foreach}}}

962:   {

963:     \TR{$M$}

964:   }

965:   \TR{$\gamma_{\mathrm{right}}$}

966: }

967: \quad

968: $\to$

969: \quad

970: \pstree{\TC~{$\gamma'$}}

971: {

972:   \TR{$\gamma_{\mathrm{left}}$}

973:   \TR{$\gamma_{\mathrm{new}}$}

974:   \TR{$\gamma_{\mathrm{right}}$}

975: }

976: \bigskip

977:

978: \pstree{\TC~{$\gamma$}}

979: {

980:   \TR{$\gamma_{\mathrm{left}}$}

981:   \TC[fillstyle=solid]~[tnpos=b]{\textsf{val}}

982:   \TR{\textit{updateset}}

983:   \TR{$\gamma_{\mathrm{right}}$}

984: }

985: \quad $\to$ \quad

986: \pstree{\TC~{$\gamma'$}}

987: {

988:   \TR{$\gamma_{\mathrm{left}}$}

989:   \TR{$\gamma_{\mathrm{new}}$}

990:   \TR{$\gamma_{\mathrm{right}}$}

991: }

992: \end{center}

993: %\nocaptionrule

994: \caption{Illustration to the proof of Theorem~\ref{confluence}.}

995: \label{figrosen}

996: \end{figure}

997: The case where $\gamma\to\gamma'$ is the processing of a

998: copy-statement, finally, is again analogous.

999: \end{proof}

1000:

1001: \section{Computational completeness}

1002:

1003: \newcommand{\allexp}{\texttt{//*}}

1004: \newcommand{\firstchild}{\texttt{child::*[1]}}

1005: \newcommand{\nextexp}{\texttt{following-sibling::*[1]}}

1006: \newcommand{\flatree}{\mathit{flattree}}

1007:

1008: As defined in Definition~\ref{deffinal},

1009: an XSLT program $P$ expresses a partial

1010: function from data trees to data forests, where the output forest is

1011: represented by a tree by affixing a root node labeled $\ttdoc$ on top

1012: (Definition~\ref{defmaketree}).  The output

1013: is defined up to isomorphism only, and $P$ does not distinguish between

1014: isomorphic inputs.  This leads us to the following definition:

1015:

1016: \begin{definition}

1017: A \emph{tree transformation}

1018: is a partial function from data trees to data

1019: trees with root labeled $\ttdoc$,

1020: mapping isomorphic trees to isomorphic trees.

1021: \end{definition}

1022:

1023: Using the string representation of data trees defined in

1024: Section~\ref{secdatatrees}, we further define:

1025:

1026: \begin{definition}

1027: A tree transformation $f$ is called \emph{computable} if the string function

1028: $\tilde f\colon \tstring(\t) \mapsto \tstring(f(\t))$ is computable in the

1029: classical sense.

1030: \end{definition}

1031:

1032: Up to now, we have assumed from our XPath abstraction $\X$ only the

1033: availability of the expressions `\rootexp' and `\childexp'. For

1034: our proof of the following

1035: theorem, we need to assume the availability of a few more very simple

1036: expressions, also present in real XPath:

1037:

1038: \begin{itemize}

1039:

1040: \item

1041:

1042: $y$\texttt{/*},

1043: for any tree variable $y$,

1044: evaluates to the

1045: root of the tree assigned to $y$.

1046:

1047: \item

1048:

1049: \allexp\ evaluates to the sequence of all nodes in the store (it does not

1050: matter in which order).

1051:

1052: \item

1053:

1054: \firstchild\ evaluates to the first child of the context item (which should

1055: be a node).

1056:

1057: \item

1058:

1059: \nextexp\ evaluates to the immediate right sibling of the context node, or

1060: the empty sequence if the context node has no right siblings.

1061:

1062: \item

1063:

1064: Increment, decrement, and test on counters:

1065: the constant expression

1066: `\texttt{1}', and the expressions

1067: `\texttt{$x$+1}', `\texttt{$x$-1}', and `\texttt{$x$=1}' for any value

1068: variable $x$, which should consist of a single counter.  If

1069: $x$ has the maximal counter value,  then

1070: \texttt{$x$+1} need not be defined, and if

1071: $x$ has value 1, then

1072: \texttt{$x$-1} need not be defined.  The test

1073: \texttt{$x$=1} yields any nonempty sequence for true and the empty sequence

1074: for false.

1075:

1076: \item

1077:

1078: \texttt{name()='a'}, for any $\tta \in \Sigma$, returning any nonempty

1079: sequence if the label of the context node is \tta, and the empty sequence

1080: otherwise.

1081:

1082: \item

1083:

1084: \texttt{()} evaluates to the empty sequence.

1085:

1086: \end{itemize}

1087:

1088: We establish:

1089:

1090: \begin{theorem} \label{completeness}

1091: Every computable tree transformation $f$ can be realised by a program.

1092: \end{theorem}

1093:

1094: \begin{proof}

1095: We can naturally represent

1096: any string $s$ over some finite alphabet as a flat data tree

1097: over the same alphabet.  We denote this flat tree by $\flatree(s)$.

1098: Its root is

1099: labeled \ttdoc, and has $k$

1100: children, where $k$ is the length of $s$, such that the labels of the children

1101: spell out the string $s$.  There are no other nodes.

1102:

1103: The proof now consists of three parts:

1104: \begin{enumerate}

1105: \item

1106: Program the transformation $\t \mapsto \flatree(\tstring(\t))$.

1107: \item

1108: Show that every turing machine (working on strings)

1109: can be simulated by some program working on the

1110: $\flatree$ representation of strings.

1111: \item

1112: Program the transformation $\flatree(\tstring(\t)) \mapsto \t$.

1113: \end{enumerate}

1114: The theorem then follows by composing these three steps, where we simulate a

1115: turing machine for $\tilde f$ in step~2. Note that the composition of three

1116: programs can be written as a single program, using a temporary tree to pass

1117: the intermediate results, and using modes to keep the rules from the different

1118: programs separate.

1119:

1120: The programs for steps 1 and 3 are shown in Figures \ref{figt2s} and

1121: \ref{figs2t}. For simplicity, they are for an alphabet consisting of a single

1122: letter \tta, but it is obvious how to generalise the programs.  The real XSLT

1123: versions are given in the Appendix.  We point out that these programs are

1124: actually 1.0 programs, so it is only for step~2 of the proof that we need

1125: XSLT~2.0.

1126:

1127: \begin{figure}

1128: \begin{tabbing}

1129: \quad\=\kill

1130: \textsf{template \texttt{tree2string} match (\allexp)} \\

1131: \aco \+\\

1132: \textsf{cons \texttt{a} \aco\ \act} \\

1133: \textsf{cons \texttt{lbrace} \aco\ \act} \\

1134: \textsf{apply (\childexp)} \\

1135: \textsf{cons \texttt{rbrace} \aco\ \act} \- \\

1136: \act

1137: \end{tabbing}

1138: %\nocaptionrule

1139: \caption{From $\t$ to $\flatree(\tstring(\t))$.}

1140: \label{figt2s}

1141: \end{figure}

1142:

1143: \begin{figure}

1144: \small

1145: \begin{tabbing}

1146: \quad\=\quad\=\quad\=\quad\=\quad\=\quad\=\kill

1147: \textsf{template \texttt{doc} match (\rootexp)} \\

1148: \aco \\

1149: \>\textsf{apply (\firstchild)} \\

1150: \act \\[\medskipamount]

1151: \textsf{template \texttt{string2tree} match (\allexp)} \\

1152: \aco \+ \\

1153: \textsf{cons \texttt{a}} \\

1154: \aco\ \textsf{apply (\nextexp) mode \texttt{dochildren}} \act \\

1155: \textsf{val \texttt{counter} (\texttt{1})} \\

1156: \textsf{call \texttt{searchnextsibling}} \-\\

1157: \act \\[\medskipamount]

1158: \textsf{template \texttt{dochildren} match (\allexp)

1159: mode \texttt{dochildren}} \\

1160: \aco \+ \\

1161: \textsf{if \texttt{name()='lbrace'}} \\

1162: \aco\ \textsf{apply (\nextexp) mode \texttt{dochildren}} \act \\

1163: \textsf{else} \aco \+\\

1164: \textsf{if \texttt{name()='a'}} \\

1165: \aco\ \textsf{call \texttt{string2tree}} \act \\

1166: \textsf{else} \aco\ \act \-\\

1167: \act \-\\

1168: \act \\[\medskipamount]

1169: \textsf{template \texttt{searchnextsibling} match (\allexp) mode

1170: \texttt{search}} \\

1171: \aco \+ \\

1172: \textsf{if \texttt{name()='lbrace'}} \aco\+\\

1173: \textsf{val \texttt{counter} (\texttt{counter + 1})} \\

1174: \textsf{apply (\nextexp) mode \texttt{search}} \-\\

1175: \act \\

1176: \textsf{else} \aco\+\\

1177: \textsf{if \texttt{name()='a'}} \\

1178: \aco\ \textsf{apply (\nextexp) mode \texttt{search}} \act\\

1179: \textsf{else} \aco\+\\

1180: \textsf{val \texttt{counter} (\texttt{counter - 1})} \\

1181: \textsf{if \texttt{counter = 1}} \\

1182: \aco\ \textsf{apply }\=\textsf{(\nextexp)} \\

1183: \>\textsf{mode \texttt{dochildren}} \act \\

1184: \textsf{else} \\

1185: \aco\ \textsf{apply (\nextexp) mode \texttt{search}} \act \-\\

1186: \act \-\\

1187: \act \-\\

1188: \act

1189: \end{tabbing}

1190: %\nocaptionrule

1191: \caption{From $\flatree(\tstring(\t))$ to $\t$.}

1192: \label{figs2t}

1193: \end{figure}

1194:

1195: For step~2, we can represent a configuration of a turing machine $A$ by two

1196: temporary trees \texttt{left} and \texttt{right}.  At each step, variable

1197: \texttt{right} holds (as a flat tree)

1198: the content of the tape starting at the

1199: head position and ending in the last tape cell;

1200: variable \texttt{left} holds the reverse of the tape portion left of the head

1201: position.  To keep track of the current state of the machine, we use value

1202: variables $q$ for each state $q$ of $A$, such that at each step precisely one

1203: of these is nonempty.  (This is why we need the $\X$-expression \texttt{()}.)

1204: Changing the symbol under the head to an \texttt{a} amounts to assigning a new

1205: content to \texttt{right} by putting in \textsf{cons \tta\ \aco\act},

1206: followed by copies of the nodes in the current content of

1207: \texttt{right}, where we skip the first one.

1208: Moving the head a cell

1209: to the right amounts to assigning a new content to \texttt{left} by putting

1210: in a node labeled with the current symbol, followed by copies of the nodes in

1211: the current content of \texttt{left}. We also assign a new content to

1212: \texttt{right} in the now obvious way; if we were at the end of the

1213: tape we add a new node labeled \texttt{blank}.

1214: Moving the head a cell to the left is

1215: simulated analogously. The only $\X$-expressions we need here are the ones we

1216: have assumed to be available.

1217:

1218: The simulation thus consists of repeatedly calling a big if-then-else that

1219: tests for the transition to be performed, and performs that transition.  We

1220: may assume $A$ is programmed in such a way that the final output is produced

1221: starting from a designated state.  In this way we can build up the final

1222: output string in a fresh temporary tree and pass it to step~3.

1223: \end{proof}

1224:

1225:

1226:

1227: \section{XSLT 1.0}

1228:

1229: \newcommand{\G}{\mathcal{G}}

1230:

1231: In this section we will show that every XSLT~1.0 program can be implemented in

1232: exponential time, in sharp contrast to the computational completeness result

1233: of the previous section.

1234:

1235: A fundamental difference between XSLT 1.0 and 2.0 is that in 1.0,

1236: $\X$-expressions are ``input-only'', defined as follows.

1237:

1238: \begin{definition} \begin{enumerate} \item Let $C=(\S,\E,(z,i,k))$ be a

1239: context. Let the input tree in $\S$ be $\t$. Then we call $C$

1240: \emph{input-only} if every value appearing in $\E$ is already a value over the

1241: store $\{(\Input,\t)\}$, and also $(z,i,k)$ is like that.

1242: \item

1243: By $\hat C$, we mean

1244: the context $(\{(\Input,\t)\},\E,(z,i,k))$.  So, $\hat C$ equals $C$ where we

1245: have removed all temporary trees.

1246: \item

1247: Now an $\X$-expression $e$ is called input-only if for any input-only context

1248: $C$ for which $\eval(e,C)$ is defined, we have $\eval(e,C) = \eval(e,\hat C)$,

1249: and this must be a value over $C$'s input tree only.

1250: \end{enumerate} \end{definition}

1251:

1252: In other words, input-only expressions are oblivious to the temporary trees in

1253: the store; they only see the input tree.

1254:

1255: We further define:

1256:

1257: \begin{definition} An input-only $\X$-expression $e$ is called

1258: \emph{polynomial} if for each input-only context $C$, the computation of

1259: $\eval(e,C)$ can be done in time polynomial in the size of $C$'s input tree.

1260: \end{definition}

1261:

1262: We now define:

1263:

1264: \begin{definition}

1265: A program is called 1.0 if it only uses input-only, polynomial

1266: $\X$-expressions.

1267: \end{definition}

1268:

1269: Essentially, 1.0 programs cannot do anything with temporary trees except copy

1270: them using \textsf{tcopy} statements.  We note that real XPath~1.0 expressions

1271: are indeed input-only and polynomial; actually, real XPath~1.0 is much more

1272: restricted than that, but for our purpose we do not need to assume anything

1273: more.

1274:

1275: In order to establish an exponential upper bound on the time-complexity of 1.0

1276: programs, we cannot use an explicit representation of the output tree.

1277: Indeed, 1.0 programs can produce result trees of size \emph{doubly}

1278: exponential in the size of the input tree.  For example, using subsets of

1279: input nodes, ordered lexicographically, as depth counters, we can produce a

1280: full binary tree of depth $2^n$ from an input tree with $n$ nodes.  Obviously

1281: a doubly exponentially long output could never be computed in singly

1282: exponential time.

1283:

1284: We therefore use a \emph{DAG representation} of trees: an old and well-known

1285: trick \cite{linearunification} that is also used in tree transduction

1286: \cite{maneth_macromplexity}, and that has recently found new

1287: applications in XML \cite{bgk_xmlcompressed}.

1288: Formally, a DAG representation is a collection $\G$ of trees, where trees in

1289: $\G$ can have special leafs which are not labeled, and from which a pointer

1290: departs to the root of another tree in $\G$. On condition that the resulting

1291: pointer graph is acyclic, starting from a designated ``root tree'' in $\G$ we

1292: can naturally obtain a tree by unfolding along the pointers.  An illustration

1293: is shown in Figure~\ref{figdag}.

1294:

1295: \begin{figure}

1296: \begin{center}

1297: \includegraphics{dag.eps}

1298: \end{center}

1299: %\nocaptionrule

1300: \caption{Left, a data tree, and right, a DAG representation of it.}

1301: \label{figdag}

1302: \end{figure}

1303:

1304: We establish:

1305:

1306: \begin{theorem} \label{complexity}

1307: Let $P$ be an 1.0 program. Then the following problem is solvable in

1308: exponential, i.e., $2^{n^{O(1)}}$ time:

1309: \begin{description}

1310: \item[Input:] a data tree $\t$

1311: \item[Output:] a DAG representation of the final result tree of applying $P$

1312: to $\t$, or a message signaling non-termination if $P$ does not terminate on

1313: $\t$.

1314: \end{description}

1315: \end{theorem}

1316:

1317: \begin{proof}

1318: We will generate a DAG representation $\G$ by applying modified versions of

1319: the semantic rules from Section~\ref{secsemantics}.  We initialise $\G$ with

1320: all the subtrees of $\t$.  These trees have no pointers.  Each tree

1321: that will be added to $\G$ will be a configuration, which still has to be

1322: developed further into a final data tree with pointers, using the same

1323: modified rules.  Because we will have to point to the newly added

1324: configurations later, we identify each added configuration by a pair

1325: $(\mathit{name},C)$ where $\mathit{name}$ is the name of a template rule in

1326: $P$ and $C$ is a context.  In the description below, whenever we say that we

1327: ``add'' a configuration to $\G$, identified by some pair $(\mathit{name},C)$,

1328: we really mean that we add it unless a configuration identified by that same

1329: pair already exists in $\G$.

1330:

1331: The modifications are now the following.

1332:

1333: \begin{enumerate}

1334:

1335: \item

1336:

1337: When executing an apply-statement, we do not directly insert copies of the

1338: templates belonging to the rules that must be applied (the $\gamma_i$'s in

1339: Figure~\ref{figapply}). Rather, we add, for $i=1,\dots,k$, the configuration

1340: $\gamma_i'$ to $\G$, where $\gamma_i \normif \gamma_i'$.  We identify

1341: $\gamma_i'$ by the pair $(\mathit{name}_i,C_i)$, with $\mathit{name}_i$ the

1342: name of the rule $\gamma_i$ comes from, and $C_i = (\S,\E,(\n_i,i,k))$ using

1343: the notation of Figure~\ref{figapply}.  Moreover, in place of the

1344: apply-statement we insert a sequence of

1345: $k$ pointer nodes pointing to $(\mathit{name}_1,C_1)$, \dots,

1346: $(\mathit{name}_k,C_k)$, respectively.

1347:

1348: \item

1349:

1350: When executing a call-statement \textsf{call $\mathit{name}$} under context

1351: $C$, we again do not insert $\gamma_1$ (compare Figure~\ref{figapply}), but

1352: add the configuration $\gamma_1'$ to $\G$, where $\gamma_1 \normif \gamma_1'$,

1353: and identify it by the pair $(\mathit{name},C)$. We then replace the

1354: statement by a pointer node pointing to that pair.

1355:

1356: \item

1357:

1358: By making template rules from the bodies of all foreach-statements in $P$, we

1359: may assume without loss of generality that the body of every foreach-statement

1360: is a single call-statement.  A foreach-statement is then processed analogously

1361: to apply- and call-statements.

1362:

1363: \item

1364:

1365: As we did with foreach-statements, we may assume that the body of each

1366: tree-statement is a single call-statement.  When executing a tree-statement,

1367: we may assume that the call-statement has already been turned into a pointer

1368: to some pair $(\mathit{name}_0,C_0)$. We then assign that pair directly to $y$

1369: in the new context $C'$ (compare Figure~\ref{figvar}); we no longer apply

1370: $\maketree$.

1371:

1372: So, in the modified kind of store we use, we assign name--context pairs,

1373: rather than fully specified temporary trees, to tree variables.

1374:

1375: \item

1376:

1377: Correspondingly, when executing a statement \textsf{tcopy $y$}, we now

1378: directly turn it into a pointer to the pair assigned to $y$.

1379:

1380: \item

1381:

1382: Finally, when executing a vcopy-statement, we do not insert the whole

1383: forest generated by $(\n_1,\dots,\n_k)$

1384: in the configuration (compare

1385: Figure~\ref{figcopy}), but merely insert a sequence of

1386: $k$ pointers to the input subtrees rooted at $\n_1$, \dots, $\n_k$,

1387: respectively.

1388:

1389: \end{enumerate}

1390:

1391: We initiate the generation of $\G$ by starting with the initial configuration

1392: as always.  Processing that configuration will add the first tree to $\G$,

1393: which serves as the root tree of the DAG representation.  When all trees in

1394: $\G$ have been fully developed into data trees with pointer nodes, the

1395: algorithm terminates.  In case $P$ does not terminate on $\t$, however, that

1396: will never happen, and we need a way to detect nontermination.

1397:

1398: Thereto, recall that every context consists of an environment $\E$ and a

1399: context triple $c$ on the one hand, and a store $\S$ on the other hand.  Since

1400: all $\X$-expressions used are input-only, and thus oblivious to the store-part

1401: of a context (except for the input tree, which does not change), we are in an

1402: infinite loop from the moment that there is a cycle in $\G$'s pointer graph

1403: \emph{where we ignore the store-part of the contexts}.  More precisely, this

1404: happens when from a pointer node in a tree identified by $(\mathit{name},C_1)$

1405: we can follow pointers and reach a pointer to a pair $(\mathit{name},C_2)$

1406: with the same $\mathit{name}$ and where $C_1$ and $C_2$ are equal in their

1407: $(\E,c)$-parts.  As soon as we detect such a cycle, we terminate the algorithm

1408: and signal nontermination.  Note that thus the algorithm always terminates.

1409: Indeed, since only input-only $\X$-expressions are used, all contexts that

1410: appear in the computation are input-only, and there are only a finite number

1411: of possible $(\E,c)$-parts of input-only configuration over a fixed input

1412: tree.

1413:

1414: Let us analyse the complexity of this algorithm.  Since all $\X$-expressions

1415: used are polynomial, there is a natural number $K$ such that each value that

1416: appears in a context is at most $n^K$ long, where $n$ equals the number of

1417: nodes in $\t$.  Each element of such a length-$n^K$ sequence is a node or a

1418: counter over $\t$, so there are at most $(2n)^{n^K}$ different values.  There

1419: are a constant $c_1$ number of different value variables in $P$, so there are

1420: at most $((2n)^{n^K})^{c_1}$ different environments.  Likewise, the number of

1421: different context triples is $(2n)^3$, so, ignoring the stores, there are in

1422: total at most $(2n)^3 \cdot (2n)^{c_1 n^K} \leq 2^{n^{K'}}$ different

1423: contexts, for some natural number $K' \geq K$.

1424: With a constant $c_2$ number of different template names in $P$, we get a

1425: maximal number of $c_2 2^{n^{K'}}$ different configurations that can be added

1426: to $\G$ before the algorithm will surely terminate.

1427:

1428: It remains to see how long it takes to fully rewrite each of those

1429: configurations into a data tree with pointers.  A configuration initially

1430: consists of at most a constant $c_3$ number of statements.  The evaluation of

1431: $\X$-expressions, which are polynomial, takes at most $c_3 n^K$ time in total.

1432: Processing an apply- or a foreach-statement takes at most $c_3n^K$

1433: modifications to the configuration and to $\G$; for the other statements this

1434: takes at most $c_3$ such operations.  Each such operation, however, involves

1435: the handling of contexts, whose stores can become quite large if treated

1436: naively.  Indeed, tree-statements assign a context to a tree variable,

1437: yielding a new context which may then again be assigned to a tree variable,

1438: and so on.  To keep this under control, we do not copy the contexts literally,

1439: but number them consecutively in the order they are introduced in $\G$.  A map

1440: data structure keeps track of this numbering.  The stores then consist of an

1441: at most constant $c_4$ number of assignments of pairs (name, context number)

1442: to tree variables.  As there are at most $2^{n^{K'}}$ different contexts, each

1443: number is at most $n^{K'}$ bits long.  Looking up whether a given context is

1444: already in $\G$, and if so, finding its number, takes $O(\log 2^{n^{K'}}) =

1445: O(n^{K'})$ time using a suitable map data structure.

1446:

1447: We conclude that the processing of $\G$ takes a total time of $c_2 2^{n^{K'}}

1448: \cdot O(n^{K'}) = 2^{n^{O(1)}}$, as had to be proven.

1449: \end{proof}

1450:

1451: A legitimate question is whether the complexity bound given by

1452: Theorem~\ref{complexity} can still be improved.  In this respect we can show

1453: that, even within the limits of real XSLT~1.0, any linear-space turing machine

1454: can be simulated by a 1.0 program.  Note that some PSPACE-complete problems,

1455: such as QBF-SAT \cite{papadimitriou}, are solvable in linear space. This shows

1456: that the time complexity upper bound of Theorem~\ref{complexity} cannot be

1457: improved without showing that PSPACE is properly included in EXPTIME (a famous

1458: open problem).

1459:

1460: The simulation gets as input a flat tree representing an input string, and

1461: uses the $n$ child nodes to simulate the $n$ tape cells.  For each letter

1462: \tta\ of the tape alphabet, a value variable $\mathit{cell}_{\text{\tta}}$

1463: holds the nodes representing the tape cells that have an \tta.  A value

1464: variable $\it head$ holds the node representing the cell seen by the machine's

1465: head.  The machine's state is kept by additional value variables ${\it

1466: state}_q$ for each state $q$, such that ${\it state}_q$ is nonempty iff the

1467: machine is in state $q$.  Writing a letter in a cell, moving the head left or

1468: right, or changing state, are accomplished by easy updates on the

1469: value-variables, which can be expressed by real XPath~1.0 expressions.

1470: Choosing the right transition is done by a big if-then-else statement.

1471: Successive transitions are performed by recursively applying the simulating

1472: template rule until a halting state is reached.

1473:

1474: \begin{remark}

1475: A final remark is that our results imply that XSLT~1.0 is not closed under

1476: composition.  Indeed, building up a tree of

1477: doubly exponential size (as we already remarked is possible in XSLT~1.0),

1478: followed by the building up of a tree of exponential

1479: size, amounts to building up a tree of triply exponential size.  If that would

1480: be possible by a single program, then a DAG representation of a triply

1481: exponentially large tree would be computable

1482: in singly exponential time.  It is well known, however, that a DAG

1483: representation cannot be more than singly

1484: exponentially smaller than the tree it

1485: represents.  Closure under composition is another sharp contrast

1486: between XSLT~1.0 and 2.0, as the latter is indeed

1487: closed under composition as already noted in the proof of

1488: Theorem~\ref{completeness}.

1489: \end{remark}

1490:

1491: \section{Conclusions}

1492:

1493: W3C recommendations such as the XSLT specifications are no Holy scriptures.

1494: Theoretical scrutinising of W3C work, which is what we have done here, can

1495: help in better understanding the possibilities and limitations of various

1496: newly proposed programming languages related to the Web, eventually leading to

1497: better proposals.

1498:

1499: A formalisation of the full XSLT~2.0 language, with all the dirty details both

1500: concerning the language itself as concerning the XPath~2.0 data model, is

1501: probably something that should be done.  We believe our work gives a clear

1502: direction how this could be done.

1503:

1504: Note also that XSLT contains a lot of redundancies.  For example,

1505: foreach-statements are eliminable, as are call-statements, and the match

1506: attribute of template rules.  A formalisation such as ours can provide a

1507: rigorous foundation to prove such redundancies, or to prove correct various

1508: processing strategies or optimisation techniques XSLT implementations may use.

1509:

1510: A formal tree transformation model denoted by TL, in part inspired by XSLT,

1511: but still omitting many of its features, has already been studied by Maneth

1512: and his collaborators \cite{frank_xsltformal,maneth_pods}.  The TL model can

1513: be compiled into the earlier formalism of ``macro tree transducers''

1514: \cite{mtt,ps_macroforest}.  It is certainly an interesting topic for further

1515: research to similarly translate our XSLT formalisation (even partially) into

1516: macro tree transducers, so that techniques already developed for these

1517: transducers can be applied.  For example, under regular expression types

1518: \cite{xduce} (known much earlier under the name of ``recognisable tree

1519: languages''), exact automated typechecking is possible for compositions of

1520: macro tree transducers, using the method of ``inverse type inference''

1521: \cite{msv_typecheckxml}.  This method has various other applications, such as

1522: deciding termination on all possible inputs \cite{maneth_pods}.  Being able to

1523: apply this method to our XSLT~1.0 formalism would improve the analysis

1524: techniques of Dong and Bailey \cite{bailey_xslt}, which are not complete.

1525:

1526: \section*{Acknowledgment}

1527:

1528: We are indebted to Frank Neven for his initial participation in this research.

1529:

1530: \bibliographystyle{plain}

1531: \begin{thebibliography}{10}

1532:

1533: \bibitem{xpath}

1534: {XML} path language ({XPath}) version 1.0.

1535: \newblock W3C Recommendation, November 1999.

1536:

1537: \bibitem{xslt}

1538: {XSL} transformations ({XSLT}) version 1.0.

1539: \newblock W3C Recommendation, November 1999.

1540:

1541: \bibitem{xsltreq}

1542: {XSLT} requirements version 2.0.

1543: \newblock W3C Working Draft, February 2001.

1544:

1545: \bibitem{xmlschema}

1546: {XML} schema.

1547: \newblock W3C Recommendation, October 2004.

1548:

1549: \bibitem{xpath2}

1550: {XML} path language ({XPath}) version 2.0.

1551: \newblock W3C Working Draft, April 2005.

1552:

1553: \bibitem{xquerypath_datamodel}

1554: {XQuery} 1.0 and {XPath} 2.0 data model.

1555: \newblock W3C Working Draft, April 2005.

1556:

1557: \bibitem{xquery_formal}

1558: {XQuery} 1.0 and {XPath} 2.0 formal semantics.

1559: \newblock W3C Working Draft, June 2005.

1560:

1561: \bibitem{xslt2}

1562: {XSL} transformations ({XSLT}) version 2.0.

1563: \newblock W3C Working Draft, April 2005.

1564:

1565: \bibitem{frank_xsltformal}

1566: G.J. Bex, S.~Maneth, and F.~Neven.

1567: \newblock A formal model for an expressive fragment of {XSLT}.

1568: \newblock {\em Information Systems}, 27(1):21--39, 2002.

1569:

1570: \bibitem{bgk_xmlcompressed}

1571: P.~Buneman, M.~Grohe, and C.~Koch.

1572: \newblock Path queries on compressed {XML}.

1573: \newblock In J.C. Freytag, P.C. Lockemann, et~al., editors, {\em Proceedings

1574:   29th International Conference on Very Large Data Bases}, pages 141--152.

1575:   Morgan Kaufmann, 2003.

1576:

1577: \bibitem{bailey_xslt}

1578: C.~Dong and J.~Bailey.

1579: \newblock Static analysis of {XSLT} programs.

1580: \newblock In K.D. Schewe and H.E. Williams, editors, {\em Database

1581:   technologies---Proceedings ADC 2004}, pages 151--160. Australian Computer

1582:   Society, 2004.

1583:

1584: \bibitem{mtt}

1585: J.~Engelfriet and H.~Vogler.

1586: \newblock Macro tree transducers.

1587: \newblock {\em Journal of Computer and System Sciences}, 31(1):71--146, 1985.

1588:

1589: \bibitem{xpath_nutshell}

1590: G.~Gottlob, C.~Koch, and R.~Pichler.

1591: \newblock {XPath} processing in a nutshell.

1592: \newblock {\em SIGMOD Record}, 32(2):21--27, 2003.

1593:

1594: \bibitem{hidders_light}

1595: J.~Hidders, J.~Paredaens, R.~Vercammen, et~al.

1596: \newblock A light but formal introduction to {XQuery}.

1597: \newblock In Z.~Bellahs\`ene, T.~Milo, M.~Rys, et~al., editors, {\em Database

1598:   and XML Technologies---Proceedings XSym}, volume 3186 of {\em Lecture Notes

1599:   in Computer Science}, pages 5--20. Springer, 2004.

1600:

1601: \bibitem{xduce}

1602: H.~Hosoya and B.C. Pierce.

1603: \newblock {XDuce}: A statically typed {XML} processing language.

1604: \newblock {\em ACM Transactions on Internet Technology}, 3(2):117--148, 2003.

1605:

1606: \bibitem{saxon}

1607: M.~Kay.

1608: \newblock {SAXON}: The {XSLT} and {XQuery} processor.

1609: \newblock http://saxon.sourceforge.net.

1610:

1611: \bibitem{xslt_kay}

1612: M.~Kay.

1613: \newblock {\em {XSLT} 2.0 Programmer's Reference}.

1614: \newblock Wrox, 3rd edition, 2004.

1615:

1616: \bibitem{maneth_macromplexity}

1617: S.~Maneth.

1618: \newblock The complexity of compositions of deterministic tree transducers.

1619: \newblock In M.~Agrawal and A.~Seth, editors, {\em FST TCS 2002 Proceedings},

1620:   volume 2556 of {\em Lecture Notes in Computer Science}, pages 265--276.

1621:   Springer, 2002.

1622:

1623: \bibitem{maneth_pods}

1624: S.~Maneth, A.~Berlea, T.~Perst, and H.~Seidl.

1625: \newblock {XML} type checking with macro tree transducers.

1626: \newblock In {\em Proceedings 24th ACM Symposium on Principles of Database

1627:   Systems}, pages 283--294. ACM Press, 2005.

1628:

1629: \bibitem{msv_typecheckxml}

1630: T.~Milo, D.~Suciu, and V.~Vianu.

1631: \newblock Typechecking for {XML} transformers.

1632: \newblock {\em Journal of Computer and System Sciences}, 66(1):66--97, 2003.

1633:

1634: \bibitem{papadimitriou}

1635: C.H. Papadimitriou.

1636: \newblock {\em Computational Complexity}.

1637: \newblock Addison-Wesley, 1994.

1638:

1639: \bibitem{linearunification}

1640: M.S. Paterson and M.N. Wegman.

1641: \newblock Linear unification.

1642: \newblock {\em Journal of Computer and System Sciences}, 16(2):158--167, 1978.

1643:

1644: \bibitem{ps_macroforest}

1645: T.~Perst and H.~Seidl.

1646: \newblock Macro forest transducers.

1647: \newblock {\em Information Processing Letters}, 89(3):141--149, 2004.

1648:

1649: \bibitem{rosen_cr}

1650: B.K. Rosen.

1651: \newblock Tree-manipulating systems and {Church}-{Rosser} theorems.

1652: \newblock {\em Journal of the ACM}, 20(1):160--187, 1973.

1653:

1654: \bibitem{essencexml}

1655: J.~Sim\'eon and P.~Wadler.

1656: \newblock The essence of {XML}.

1657: \newblock In {\em Proceedings 30th ACM Symposium on Principles of Programming

1658:   Languages}, pages 1--13. ACM Press, 2003.

1659:

1660: \bibitem{terese}

1661: Terese.

1662: \newblock {\em Term Rewriting Systems}.

1663: \newblock Cambridge University Press, 2003.

1664:

1665: \bibitem{wadler_xpath}

1666: P.~Wadler.

1667: \newblock A formal semantics of patterns in {XSLT} and {XPath}.

1668: \newblock {\em Markup Languages: Theory and Practice}, 2(2):183--202, 2000.

1669:

1670: \end{thebibliography}

1671:

1672: \clearpage

1673: \appendix

1674:

1675: \section{Real XSLT programs}

1676: 	\subsection{Figure~\ref{figt2s} in real XSLT}

1677:

1678:

1679:

1680: 	\begin{alltt}

1681:

1682:

1683: 	<xsl:transform

1684: 	  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"

1685: 	  version="1.0">

1686:

1687: 	  <xsl:template name="tree2string" match="//*">

1688: 	    <a/>

1689: 	    <lbrace/>

1690: 	    <xsl:apply-templates select="child::*"/>

1691: 	    <rbrace/>

1692: 	  </xsl:template>

1693:

1694: 	</xsl:transform>

1695:

1696: 	\end{alltt}

1697:

1698: 	\subsection{Figure~\ref{figs2t} in real XSLT}

1699:

1700: \begin{alltt}

1701:

1702: <xsl:transform

1703:   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"

1704:   version="1.0">

1705:

1706: <xsl:template match="/doc">

1707:   <xsl:apply-templates select="child::*[1]"/>

1708: </xsl:template>

1709:

1710: <xsl:template name="string2tree" match="/doc//*">

1711:   <a>

1712:     <xsl:apply-templates select="following-sibling::*[1]" mode="dochildren"/>

1713:   </a>

1714:   <xsl:call-template name="searchnextsibling">

1715:     <xsl:with-param name="counter" select="1"/>

1716:   </xsl:call-template>

1717: </xsl:template>

1718:

1719: <xsl:template match="//*" mode="dochildren">

1720:   <xsl:if test="name()='lbrace'">

1721:     <xsl:apply-templates select="following-sibling::*[1]" mode="dochildren"/>

1722:   </xsl:if>

1723:   <xsl:if test="name()='a'">

1724:     <xsl:call-template name="string2tree"/>

1725:   </xsl:if>

1726: </xsl:template>

1727:

1728: <xsl:template name="searchnextsibling" match="//*" mode="search">

1729:   <xsl:param name="counter"/>

1730:   <xsl:if test="name()='lbrace'">

1731:     <xsl:apply-templates select="following-sibling::*[1]" mode="search">

1732:       <xsl:with-param name="counter" select="$counter + 1"/>

1733:     </xsl:apply-templates>

1734:   </xsl:if>

1735:   <xsl:if test="name()='a'">

1736:     <xsl:apply-templates select="following-sibling::*[1]" mode="search">

1737:       <xsl:with-param name="counter" select="$counter"/>

1738:     </xsl:apply-templates>

1739:   </xsl:if>

1740:   <xsl:if test="name()='rbrace'">

1741:     <xsl:if test="$counter=2">

1742:       <xsl:apply-templates select="following-sibling::*[1]" mode="dochildren"/>

1743:     </xsl:if>

1744:     <xsl:if test="$counter>2">

1745:       <xsl:apply-templates select="following-sibling::*[1]" mode="search">

1746:         <xsl:with-param name="counter" select="$counter - 1"/>

1747:       </xsl:apply-templates>

1748:     </xsl:if>

1749:   </xsl:if>

1750: </xsl:template>

1751:

1752: </xsl:transform>

1753:

1754: \end{alltt}

1755:

1756: \end{document}

1757:

1758: