0311:cs0311033/slova.tex

1: \documentstyle[12pt,graphicx,natbib,ogonek]{article}

2: %\documentclass{article}

3: %\usepackage{natbib}

4: %\usepackage{graphicx}

5:

6: \textwidth=18cm %16cm

7: \textheight=24cm

8: \hoffset=-2cm %-1cm

9: \voffset=-3cm

10:

11: \begin{document}

12: \title{The Rank--Frequency Analysis for the Functional Style Corpora in the Ukrainian Language}

13: \author{Solomija N.~Buk$^*$, Andrij A.~Rovenchak$^{**}$\\

14: $^*$ Department for General Linguistics, Ivan Franko National University of Lviv,\\

15: 1 Universytetska St., Lviv, UA-79000, Ukraine\\

16: $^{**}$ Department for Theoretical Physics, Ivan Franko National University of Lviv,\\

17: 12 Drahomanov St., Lviv, UA-79005, Ukraine

18: }

19:

20: \maketitle

21: %Short Title:\\

22: %Rank--Frequency Dependencies in Ukrainian

23:

24: \abstract{

We use rank--frequency analysis to estimate the Kernel Vocabulary size within
specific corpora of Ukrainian. The high-rank behaviour is extrapolated to
estimate the total vocabulary size.

28:

29: {\bf Key words:} corpus, Ukrainian, rank--frequency dependence, vocabulary size, entropy}

30:

31: \section{Introduction}

Rank--frequency analysis of texts has a number of practical applications.
Applied to natural languages, it provides information needed for compiling
dictionaries (in particular, professionally oriented ones), creating text compressors,
determining the basic vocabulary for learners of a foreign language, etc.

37:

In recent years, the development of computational techniques has made it possible to study
large amounts of text. Such analysis usually involves the so-called Zipf's law \citep{Zipf49},
which relates the rank of a word to its frequency.
It has been shown that the initially assumed linear behaviour (in log--log coordinates)
breaks down on large samples of text \citep{Nes87,Mon01,CanSol01}.

43:

Several decades ago, statistical studies of the Ukrainian language were carried out at the Potebnja
Institute of Linguistics in Kyiv \citep{Per67,Str74}. In these works, however,
the results were only reported, and no special analysis was made.
Unfortunately, such studies in Ukraine were stalled for years, and only now are they
being revived with the application of modern techniques \citep{Dem01}.

49:

In this paper we present results for different functional styles of the Ukrainian language.
Such material is novel, since both statistical analysis and corpus studies of Ukrainian
are still at an early stage. While the volume of material involved in this
work is quite small compared with that available for, e.~g., English, we hope that the described
techniques together with the preliminary results will be useful in the future.

55:

56:

The paper is organized as follows. In the next section, the sources and
text processing are described.
Section~3 contains the analysis of rank--frequency dependencies for the different corpora,
taking some specific features into account. Possible techniques for estimating the vocabulary
size are given in Section~4. A brief discussion is presented in Section~5.

62:

63: \section{Material Overview}

64:

65: \subsection{Definition of Terms}

66: In this work we use the following terms:

67: \begin{itemize}

\item {\bf Corpus} --- a body or collection of linguistic data, especially one considered complete
      and representative, from a particular language or languages, in the form of recorded
      utterances or written text, which is available for theoretical and/or applied
      linguistic investigation \citep{Bur98}. In the present paper we consider a
      {\bf text corpus}, which must be distinguished from a {\bf corpus of language (national corpus)},
      the latter being a structured representative collection of texts from a given language;

74:

75: \item {\bf Token} --- a word in any form (a sequence of letters between two spaces)

76:       in a text, e.~g., the sentence {\it I have not seen her yet} contains

77:       six tokens;

78:

79: \item {\bf Corpus size} --- total number of tokens in the given corpus;

80:

81: \item {\bf Vocabulary size} --- number of different words in the given corpus generated by the

82:       {\bf lemmatisation} process;

83:

84: \item {\bf Lemmatisation} --- process of the reduction of word-forms to the initial (vocabulary)

85:       form, e.~g., verbs to the Infinitive, nouns to Nominative Singular, etc.

86:

\item {\bf Vocabulary volume} --- estimated number of possible different words of the language
      (in the context of this work we mean it within a specific functional style).

89: \end{itemize}

90:

91: \subsection{Corpus Description}

In this work, we analyse a middle-sized corpus of the Ukrainian language.
The size classification of corpora uses the Brown Standard Corpus of American English
\citep{Brown} as a reference point. Its parameters are as
follows: a)~one million words of running text; b)~500 text samples;
c)~2 thousand words per sample.
Corpora with fewer than one million words are considered small,
corpora with 1--10 million words are middle-sized, and corpora containing
more than 10 million words are large.

100:

The total size of the corpus analysed in this work is about 1.7 million tokens.
It consists of five sub-corpora corresponding to the five main functional styles of speech (genres).

103:

1.~The sub-corpus of {\it Belles-lettres Prose} contains 500 thousand tokens.
The frequency data were taken from \citep{Per81}. This frequency dictionary was compiled
on the basis of 25 creative works, with several text pieces extracted from different
places in each work. Although the texts were written between 1945 and 1970, we assume
that the changes in the three thousand most frequent words since then are not significant.

109:

2.~The sub-corpus of {\it Colloquial Style} contains about 300 thousand tokens.
It consists of 45 text pieces of approximately 6,000 tokens each.
Since large collections of `pure' Ukrainian colloquial speech do not exist, we used modern
dramas written within the last two decades~\citep{Buk03a}. The equivalence of these two
types of speech might be disputable, but the same principle was used, e.~g., in \citep{JuiBro70}
and \citep{KurLew90}.

116: %Colloquial: 45 pieces, 286,490 tokens of raw text

117:

118: 3.~The sub-corpus of {\it Scientific Style} was collected from 104 pieces each containing

119: about 3,000 tokens. Its total size slightly exceeds 300 thousand tokens.

The following scientific areas are represented in approximately equal
parts: biology, chemistry, psychology and pedagogy, physics, mathematics, engineering,
geography and geology, history, and linguistics~\citep{Buk03b}.

123: %Scientific: (biology 39,842); chemistry and medicine (37,655); psychology (46,853);

124: %geography (37,482); history (37,675); mathematics (48,594); physics (36,941); linguistics (42,889).

125: %327,931. ??technics (35,525)

126:

4.~The {\it Official (business) Style} sub-corpus was composed of texts of different kinds of
documents: the Constitution of Ukraine, legal codes, Ukrainian and international laws,
international treaties, conventions, memoranda, declarations, speeches, economic documents,
contracts, all types of administrative documents, etc.
The size of the sub-corpus is about 300 thousand tokens.

132:

5.~{\it Journalistic Style} frequency statistics were taken from \citep{UkrPub}.
The corresponding corpus was built from texts of several all-Ukrainian newspapers
issued in 1994. These newspapers are addressed to both city-dwellers and villagers,
and to people of different ages. The size of the sub-corpus is also about 300 thousand tokens.

139:

140:

141: \subsection{Text processing}

At the first stage, several types of items were removed from the texts: numbers,
words containing numbers, punctuation marks (see the comment on dashes below),
and words written in a non-Ukrainian script.
Then, the texts were processed manually for homonyms. This is a very important stage, as some
of these words appear with high frequency. Examples of homonym pairs are
(note that stress is usually not marked in written Ukrainian\footnote{Hereafter, for the sake of convenience, we use
transliteration for representing Ukrainian words according to the table given in the Appendix.}):
{\it br\'aty} (`to take', verb in the Infinitive) and {\it brat\'y} (`brothers', noun
in Plural, Nominative); {\it m\'aty} (`to have', verb in the Infinitive) and
{\it m\'aty} (`mother', noun in Singular, Nominative); {\it ni\v{z}} (`than', particle)
and {\it ni\v{z}} (`knife', noun in Singular, Nominative); {\it \v{s}\v{c}o}, which can be a particle,
a conjunction (`which'), or a pronoun (`what'); etc.

154:

155:

156: \section{Frequency Analysis}

157:

158: \subsection{Low ranks}

159:

The behaviour at low ranks is significantly influenced by some
specific features of the Ukrainian language. Several very frequent
words have different forms due to the so-called principle of
euphony. Namely, the word
{\it i} (`and') may also appear in the
forms {\it j} and {\it ta}. The word {\it v} (`in') may also have the
forms {\it u} and {\it vvi} or {\it uvi} (the last two are rare).

167:

The verb `to be', very frequent in corpora of different languages, can in Ukrainian be
replaced by a dash (---) or omitted altogether. Therefore, it appears somewhat less
frequently than in other languages, especially in the spoken language.
Note, however, that the converse is not true, i.~e., not every dash represents
this verb.

173:

In the table below we present the five most frequent words from different corpora,
together with their relative frequencies.
The English statistics are based on the British National Corpus \citep{BNC}.
The German statistics were kindly provided by Sabine Schulte from
the University of Stuttgart. The Croatian corpus data are taken from \citep{HNC}, and the Polish
data from \citep{PWN}. The Ukrainian statistics were collected by the authors.

179:

180: \bigskip

181: \noindent

182: \begin{center}

183: \begin{tabular}{c|ll|ll|ll|ll|ll}

184: \hline

185: \hline

186: Rank&\multicolumn{2}{c|}{English}&\multicolumn{2}{c|}{German}

187:     &\multicolumn{2}{c|}{Croatian}&\multicolumn{2}{c|}{Polish}

188:     &\multicolumn{2}{c}{Ukrainian}\\

189: \hline

190: 1& the &0.0619& die  &0.0702& i  &0.0314& w       &0.0317& i  &0.0371\\

191: 2& be  &0.0424& sein &0.0289& u  &0.0276& i       &0.0282& v  &0.0303\\

192: 3& of  &0.0309& in   &0.0274& je &0.0264& si\k{e} &0.0192& na &0.0173\\

193: 4& and &0.0268& der  &0.0245& se &0.0156& na      &0.0167& z  &0.0166\\

194: 5& a   &0.0219& ein  &0.0234& da &0.0130& z       &0.0159& ne &0.0157\\

195: \hline

196: \hline

197: \end{tabular}

198: \end{center}

199:

200: %\bigskip

201: %\begin{tabular}{l|ll|ll|ll|ll|ll}

202: %\hline

203: %\hline

204: %%art; sci; publ; ofic; coll

205: %1& i  &0.0379& v  &0.0427&i  &0.0234& i  &0.0387& ja &0.0410\\

206: %2& v  &0.0256& i  &0.0422&na &0.0173& v  &0.0309& i  &0.0315\\

207: %3& ne &0.0221& z  &0.0154&u  &0.0167& na &0.0178& ne &0.0299\\

208: %4& na &0.0208& na &0.0151&v  &0.0142& z  &0.0162& ty &0.0226\\

209: %\hline

210: %\hline

211: %\end{tabular}

212: %%publ:

213: %%i  7024; ta: 2248, j: 817

214: %%na 5204;

215: %%u  5007;

216: %%v  4274;

217: %%    ; z: 4041, zi: 125, iz: 734

218:

219: \bigskip

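This table shows that our data are consistent with those for other Slavic languages:
in Croatian, Polish and Ukrainian alike, the two most frequent words are the conjunction
`and' and the preposition `in'.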

221:

222: \subsection{Kernel Vocabulary}

Zipf formulated the relation between the frequency $f$ of a word and its rank $r$
on the basis of the `principle of least effort', which he considered one of the most
important features of human behaviour, by analogy with Poincar\'e's principle
of least action in physics. In a form slightly modified with respect to the original,
this dependence reads:
\begin{equation}\label{Zipf}
f_r=A/r^z,
\end{equation}
where $A$ and $z$ are parameters, and the exponent $z$ deviates slightly from unity
(originally Zipf put $z=1$). In what follows, we refer to this relation as Zipf's law.

233:


238:

We have analysed the rank--frequency dependencies for our corpora in the following way.
Since Zipf's law (\ref{Zipf}) becomes linear after taking the logarithm of both sides,
it is common to present rank--frequency relations in a log--log plot
(see the figures below).
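
As a short worked step, taking the natural logarithm of both sides of (\ref{Zipf}) gives
\[
\ln f_r=\ln A - z\,\ln r,
\]
so that in the coordinates $(\ln r,\ln f_r)$ an ideal Zipf dependence is a straight line
with slope $-z$ and intercept $\ln A$. A systematic change of slope along the curve
therefore signals a departure from a single power law.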

243:

244: \bigskip

245: \centerline{\includegraphics[angle=-90,width=70mm,clip]{art.ps}\

246: \includegraphics[angle=-90,width=70mm,clip]{col.ps}}

\centerline{Belles-lettres \hfil Colloquial}

248: \centerline{\includegraphics[angle=-90,width=70mm,clip]{sci.ps}\

249: \includegraphics[angle=-90,width=70mm,clip]{ofi.ps}}

250: \centerline{Scientific \hfil Official}

251: \centerline{\includegraphics[angle=-90,width=70mm,clip]{pub.ps}\

252: \includegraphics[angle=-90,width=70mm,clip]{BNC.ps}}

253: \centerline{Journalistic \hfil BNC}

254: \bigskip

255:

The idea of selecting the Kernel Vocabulary is based on the assumption
that the deviation of the rank--frequency curve from linear (Zipf's) behaviour corresponds
to a transition to a different type of vocabulary \citep{Mon01}.
That author carried out the analysis using the British National Corpus \citep{BNC}.
Although our corpus is far smaller than the British National Corpus,
such scales, as we show below, already allow conclusions to be drawn
on some statistical features of the texts under consideration.

263:

One can easily notice a slight change in the slope of the curve when moving to the higher-rank region.
To find where this change occurs, a detailed analysis is required.
We divided the ranks into overlapping domains of 200 ranks each, shifted by 100:
from 1 to 200, from 101 to 300, from 201 to 400, and so on.
Then, for each domain, the best-fit parameters of Zipf's law (\ref{Zipf}) were calculated.
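
The fitting procedure is not specified above; as a minimal sketch, assume an ordinary
least-squares fit in log--log coordinates. For a domain $D$ containing $n=200$ ranks, write
$x_r=\ln r$ and $y_r=\ln f_r$ (the symbols $D$, $n$, $x_r$, $y_r$, $z_D$ and $A_D$ are introduced
here for illustration only). Since (\ref{Zipf}) implies $y_r=\ln A - z\,x_r$, the least-squares
estimates for the domain would be
\[
z_D=-\frac{n\sum_{r\in D}x_r y_r-\sum_{r\in D}x_r\sum_{r\in D}y_r}
          {n\sum_{r\in D}x_r^2-\bigl(\sum_{r\in D}x_r\bigr)^2},
\qquad
\ln A_D=\frac{1}{n}\Bigl(\sum_{r\in D}y_r+z_D\sum_{r\in D}x_r\Bigr).
\]
Under this assumption, plotting $z_D$ against the position of the domain shows where the
exponent starts to change.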

269:

270: After making the detailed numerical analysis of data for each sub-corpus

271: we noticed the following specific features:

272: \begin{itemize}

\item In the official, journalistic and scientific sub-corpora, the value of $z$
      changes significantly at some rank $r_{\rm max}$,
      which corresponds to a transition to a different part
      of the vocabulary. The values are $r_{\rm max}\simeq800$ for the official sub-corpus,
      $r_{\rm max}\simeq1000$ for the scientific one, and
      $r_{\rm max}\simeq 1600$ for the journalistic one.
\item In the colloquial sub-corpus, the deviation from (\ref{Zipf}) is less significant:
      Zipf's law with $z=1.09$ describes the whole domain of ranks quite well. However,
      the numerical analysis suggests a value of $r_{\rm max}$ close to that
      of the journalistic sub-corpus.

283: \end{itemize}

284: %As the text coverage in these domains of ranks reaches similar values for both cases,

285: %we propose to consider the Kernel Vocabulary limit $r_{\rm max}=963$ for the scientific

286: %and $r_{\rm max}=1622$ for journalistic style, the corresponding text coverage is 70 per cent.

287:

To illustrate the behaviour of Zipf's exponent,
we give a visual interpretation in Fig.~\ref{ExpVisual} below.

290:

\begin{figure}[h]
\centerline{\includegraphics[angle=-90,width=100mm,clip]{bl-c-j.eps}}
\centerline{\includegraphics[angle=-90,width=100mm,clip]{bnc-s-o.eps}}
\caption{Zipf's exponent behaviour showing the transition between different types of
vocabulary.}
\label{ExpVisual}
\end{figure}

297:

298: \subsection{Entropy Comparison}

It is interesting to analyse the frequency dependencies in terms of the entropy $S$:
\begin{equation}
S_N=-\sum_{r=1}^{N} f_r\,\ln f_r,
\end{equation}
where $N$ is a large number. Putting $N=3000$ for each sub-corpus, we obtained
the following values: belles-lettres prose (BP) 2.192; colloquial style (CS) 2.356;
journalistic (publicistic) style (PS) 2.368; scientific style (SS) 2.602; official style (OS) 2.750.
While the smallest value of entropy for the belles-lettres sub-corpus looks somewhat unexpected,
we propose the following interpretation of the remaining data. In physics, entropy
is a measure of disorder in a system. As we know from experience, official texts are
usually hard to read and therefore require more effort to be understood. For scientific
texts this is somewhat less true, given that they are read by
their addressees, specialists in the respective field. From this point of
view, journalistic texts must be quite close to everyday speech, and this is what
the numbers show.
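
To see how the entropy is tied to the rank--frequency parameters, consider the idealised
case (an assumption made only for illustration) in which the frequencies follow
(\ref{Zipf}) exactly for all $r\le N$. Substituting $\ln f_r=\ln A - z\,\ln r$ into the
definition of $S_N$ gives
\[
S_N=-\ln A\sum_{r=1}^{N}f_r+z\sum_{r=1}^{N}f_r\ln r ,
\]
so in this case the entropy is determined by the text coverage $\sum_{r\le N}f_r$ of the first
$N$ words and by the frequency-weighted sum of $\ln r$. Since the natural logarithm is used
in the definition, the values of $S_N$ are expressed in natural units; dividing by $\ln 2$
converts them to bits.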

313:

314:

315:

316:

317: \section{Vocabulary size estimation}

Suppose one has the whole language corpus, and its vocabulary size is $\cal R$.
This means that $\cal R$ is the maximal possible rank, so the frequency
$f_{{\cal R}+1}$ of the next-ranked word is zero. If one accepts Zipf's dependence
(\ref{Zipf}) as valid, this situation never occurs, since $A/r^z$ remains positive
for every finite $r$. Let us therefore adopt a slightly
modified function \citep{Lua94}:

323:

324: \begin{equation}

325: f_r^t=-A+B r^t,

326: \end{equation}

where the exponent $t$ is a small positive number.
In this case, the value of $\cal R$ is determined by the vanishing of the right-hand side:
${\cal R}=(A/B)^{1/t}$.
Typical values of $t$ are of order 0.1. Thus, the estimation of the vocabulary size for a
specific functional genre gives values of 200 to 700 hundred different words. A more precise
estimation will be made after a larger corpus is analysed.
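
The expression for $\cal R$ can be checked in one line by requiring the right-hand side of the
modified law to vanish at the maximal rank (for large $\cal R$ the difference between
$\cal R$ and ${\cal R}+1$ is negligible):
\[
-A+B{\cal R}^t=0\quad\Longrightarrow\quad{\cal R}^t=A/B\quad\Longrightarrow\quad{\cal R}=(A/B)^{1/t}.
\]
Because the exponent $1/t$ is large, the estimate is very sensitive to the fitted ratio:
to first order, $\delta{\cal R}/{\cal R}=(1/t)\,\delta(A/B)/(A/B)$, so for $t=0.1$ a relative
error in $A/B$ is amplified tenfold. As a purely hypothetical illustration (these numbers
are not fitted values from the corpora), $t=0.1$ and $A/B=3$ give
${\cal R}=3^{10}=59\,049$, whereas $A/B=3.4$ gives ${\cal R}=3.4^{10}\approx2.06\times10^{5}$.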

332:

333: \section{Discussion}

334:

We analysed the rank--frequency relations for a middle-sized corpus
of the Ukrainian language. The data for the most frequent (lowest-rank) words are consistent
with those for other Slavic languages. The presented results allow the
Kernel Vocabulary to be established and the vocabulary size to be estimated. The entropy was
calculated for the different functional genres.
We hope that our data will be useful when compiling the National Corpus
of the Ukrainian language. More precise results will be available after a larger corpus
is considered.

343:

344:

345:

346:

347: \begin{thebibliography}{}

348:

349: %\bibitem[Besters-Dilger 2002]{Bes02}Besters-Dilger, J. (2002).

350: %``Deutsche lexikalische Entlehnungen im Ukrainischen''.

351: %{\it Litteraria Humanitas} XI, 25--49.

352:

353: \bibitem[BNC http]{BNC}British National Corpus (BNC):\\ ftp://ftp.itri.bton.ac.uk/bnc/;

354: \ \ \ http://www.natcorp.ox.ac.uk

355:

356: \bibitem[Brown http]{Brown}Brown Standard Corpus:\\

357: \ \ \ http://khnt.hit.uib.no/icame/manuals/brown/INDEX.HTM\#bc9

358:

359: \bibitem[Buk 2003a]{Buk03a}Buk, S. (2003).

360: ``\v{C}astotnyj slovnyk rozmovno-pobutovoho stylju su\v{c}asnoji ukrajinsjkoji movy''.

361: [Colloquial Genre Frequency Dictionary of Modern Ukrainian Language]

362: {\it Lingvisty\v{c}ni Studiji (Donetsk University)} 11 (Part I), 266--271.


368:

\bibitem[Buk 2003b]{Buk03b}Buk, S. (2003).

370: ``\v{C}astotnyj slovnyk naukovoho stylju su\v{c}asnoji ukrajinsjkoji movy''.

371: [Scientific Genre Frequency Dictionary of Modern Ukrainian Language]

372: {\it Cherkasy University Herald. Ser. philol.} 44, 90--96.


377:

378: \bibitem[Burkhanov 1998]{Bur98}Burkhanov, I. (1998).

379: {\it Lexicography: A Dictionary of Basic Terminology}.

380: Rzesz\'ow: Wydawnictwo wy\.zszej szko\l{}y pedagogicznej.

381:

382: \bibitem[Cancho \& Sol\'e 2001]{CanSol01}Cancho, R. F., \& Sol\'e, R. V. (2001).

383: ``Two regimes in the frequency of words and the origins of complex lexicons: Zipf's law revisited''.

384: {\it Journal of Quantitative Linguistics} 8(3), 165--173.

385:

386: \bibitem[HNC http]{HNC}Croatian National Corpus (HNC, 9.156.446 tokens):\\

387: http://www.hnk.ffzg.hr/corpus.htm

388:

389: \bibitem[FDP http]{UkrPub}\v{C}astotnyj slovnyk publicystyky.

390: [Frequency dictionary of publicism]:

391: http://80.78.37.38/freqcard.aspx?sl=publicist

392:

\bibitem[Demska-Kulchytska 2001]{Dem01}Demska-Kulchytska, O. (2001).
``Korpus tekstov ukrainskoj periodiki''.  [Ukrainian periodicals text corpus]
In: {\it Issledovanie slavjanskix jazykov v rusle tradicij sravnitel'no-istori\v{c}eskogo
i sopostavitel'nogo jazykoznanija}. Moscow. 26--28.

397:

398: %\bibitem[Grigoryan \& Manasyan 1988]{GriMan88}Grigoryan, S., \& Manasyan, N. (1988).

399: %``On some quantitative and linguistical pecularities of the Armenian basic vocabulary''.

400: %{\it Tartu Riikliku \"Ulikooli Toimetised} 827, 62--73.

401:

402: \bibitem[Juilland et al. 1970]{JuiBro70}Juilland, A., Brodin, D., Davidovich, C. (1970).

403: {\it Frequency Dictionary of French Words}. The Hague--Paris.

404:

405: %\bibitem[Kornai 2002]{Kor02}Kornai, A. (2002).

406: %``How many words are there?''.

407: %{\it Glottometrics} 4, 61--86.

408:

409: \bibitem[PWN http]{PWN}Korpus J\k{e}zyka Polskiego Wydawnictwa Naukowego PWN.

410: [Polish language corpus of Scientific publishing house PWN]:

411: http://korpus.pwn.pl/

412:

413: \bibitem[Kurcz et al. 1990]{KurLew90}Kurcz, I., Lewicki, A., Sambor, J., Szafran, K., Woronczak, J. (1990).

414: {\it S\l{}ownik frekwencyjny polszczyzny wsp\'o\l{}czesnej}.

415: [Frequency Dictionary of Contemporary Polish]. Krak\'ow: PAN, Instytut J\k{e}zyka Polskiego.

416:

417: \bibitem[Lua 1994]{Lua94}Lua, K. T. (1994).

418: ``Frequency--Rank Curves and Entropy for Chinese Characters and Words''.

419: {\it Computer Processing of Chinese \& Oriental Languages} 8, 37--52.

420:

421: \bibitem[Montemurro 2001]{Mon01}Montemurro, M. A. (2001).

422: ``Beyond the Zipf--Mandelbrot law in quantitative linguistics''.

423: {\it Physica A} 300, 567--578.

424:

425: \bibitem[Montemurro \& Zanette 2002]{MonZan02}Montemurro, M. A., \& Zanette, D. H. (2002).

426: ``New perspectives of Zipf's law in linguistics: from single texts to large corpora''.

427: {\it Glottometrics} 4, 87--99.

428:

429: \bibitem[Muravytska \& Oleksijenko 1974]{Str74}Muravytska, M. P. and Oleksijenko, L. A. (eds.) (1974).

430: {\it Struktura movy i statystyka movlennja}.

431: [Structure of language and statistics of speech].

432: Kyiv: Naukova Dumka.

433:

434: \bibitem[Ne\v{s}itoj 1987]{Nes87}Ne\v{s}itoj, V. V. (1987).

435: ``About the form of representing rank distributions''.

436: {\it Tartu Riikliku \"Ulikooli Toimetised} 774, 123--134.

437:

438: \bibitem[Perebyjnis 1967]{Per67}Perebyjnis, V. S. (1967).

439: {\it Statysty\v{c}ni parametry styliv}.

440: [Statistical parameters of styles].

441: Kyiv: Naukova Dumka.

442:

443: \bibitem[Perebyjnis 1981]{Per81}Perebyjnis, V. S. (ed.) (1981).

444: {\it \v{C}astotnyj slovnyk su\v{c}asnoji ukrajinsjkoji xudo\v{z}njoji prozy}.

445: Kyiv: Naukova Dumka.

446:

447: %\bibitem[Saloni 1990]{Sal90}Saloni, Z. (ed.) (1990).

448: %{\it S\l{}ownik frekwencyjny polszczyzny wspo\l{}czesnej}. Krakow: Uniwersytet Jagiello\'nski.

449:

450: %\bibitem[Simon 1955]{Sim55}Simon, H. A. (1955).

451: %``On a Class of Skew Distribution Functions''.

452: %{\it Biometrika} 42, 425--440.

453:

454: %\bibitem[Tuldava 1980]{Tul80}Tuldava, J. (1980).

455: %``On the analytical expression of the relation between size of vocabulary and size of text''.

456: %{\it Tartu Riikliku \"Ulikooli Toimetised: T\"oid keelestatistika alalt. VI.} 549, 113--144.

457:

458: \bibitem[Zipf 1949]{Zipf49}Zipf, G. K. (1949).

459: {\it Human behavior and the principle of least effort.}

460: Reading, Mass.: Addison-Wesley.

461:

462: \end{thebibliography}

463:

464:

465: \section*{Appendix}

466:

467: \bigskip

468: \begin{figure}[h]

469: %\epsfxsize=60mm

470: %\epsfbox{F2.eps}

471: \centerline{\includegraphics[width=120mm,clip]{Translit.eps}}

472: \caption{Ukrainian Transliteration Table.\protect\\

This transliteration scheme is free of ambiguity and allows bi-directional
transliteration. While in some places it seems a bit complicated, in practical
applications difficult letter combinations appear very rarely. In addition, it
is concordant with some Slavic writing systems based on the Latin script.}

477: \label{TranslitTable}

478: \end{figure}

479: %\smallskip

480:

481: \end{document}

482: