cs0311033/slova.tex
1: \documentstyle[12pt,graphicx,natbib,ogonek]{article}
2: %\documentclass{article}
3: %\usepackage{natbib}
4: %\usepackage{graphicx}
5: 
6: \textwidth=18cm %16cm
7: \textheight=24cm
8: \hoffset=-2cm %-1cm
9: \voffset=-3cm
10: 
11: \begin{document}
12: \title{The Rank--Frequency Analysis for the Functional Style Corpora in the Ukrainian Language}
13: \author{Solomija N.~Buk$^*$, Andrij A.~Rovenchak$^{**}$\\
14: $^*$ Department for General Linguistics, Ivan Franko National University of Lviv,\\
15: 1 Universytetska St., Lviv, UA-79000, Ukraine\\
16: $^{**}$ Department for Theoretical Physics, Ivan Franko National University of Lviv,\\
17: 12 Drahomanov St., Lviv, UA-79005, Ukraine
18: }
19: 
20: \maketitle
21: %Short Title:\\
22: %Rank--Frequency Dependencies in Ukrainian
23: 
24: \abstract{
25: We use rank--frequency analysis to estimate the Kernel Vocabulary size within
26: specific corpora of Ukrainian. Extrapolation of the high-rank behaviour is used to
27: estimate the total vocabulary size.
28: 
29: {\bf Key words:} corpus, Ukrainian, rank--frequency dependence, vocabulary size, entropy}
30: 
31: \section{Introduction}
32: Rank--frequency analysis of texts is a problem of considerable interest.
33: Applied to natural languages, it yields information
34: needed for compiling dictionaries (in particular, professionally oriented
35: ones), creating text compressors, determining the basic vocabulary for studying
36: a language as a foreign one, etc.
37: 
38: In recent years, the development of computational techniques has made it possible to study
39: large amounts of text. Such analysis usually involves the so-called Zipf's law \citep{Zipf49},
40: which relates the rank of a word to its frequency.
41: It has been shown that the initially assumed linear behaviour (in a log--log plot) breaks down
42: on large samples of text \citep{Nes87,Mon01,CanSol01}.
43: 
44: Several decades ago, statistical studies of the Ukrainian language were carried out at the Potebnja
45: Institute of Linguistics in Kyiv \citep{Per67,Str74}. In these works, however,
46: the results were merely reported, and no dedicated analysis was made.
47: Unfortunately, such studies in Ukraine were stalled for years, and only now are they
48: being revived with the application of modern techniques \citep{Dem01}.
49: 
50: In this paper we present results for different functional styles of the Ukrainian language.
51: Such material is novel, since both statistical analysis and corpus studies of Ukrainian
52: are still at an early stage. While the volume of material involved in this
53: work is quite small compared with, e.~g., English, we hope that the described techniques
54: together with the preliminary results will be useful in the future.
55: 
56: 
57: The paper is organized as follows. In the next section, the
58: sources and text processing are described.
59: Section~3 contains the analysis of rank--frequency dependencies for different corpora
60: with regard to some specific features. Possible techniques for estimating the vocabulary
61: size are presented in Section~4. A brief discussion is given in Section~5.
62: 
63: \section{Material Overview}
64: 
65: \subsection{Definition of Terms}
66: In this work we use the following terms:
67: \begin{itemize}
68: \item {\bf Corpus} --- a body or collection of linguistic data, especially one considered complete
69:       and representative, from a particular language or languages, in the form of recorded
70:       utterances or written text, which is available for theoretical and/or applied
71:       linguistic investigation \citep{Bur98}. In the present paper we consider a
72:       {\bf text corpus}, which must be distinguished from a {\bf corpus of language (national corpus)},
73:       the latter being a structured representative collection of texts from a given language;
74: 
75: \item {\bf Token} --- a word in any form (a sequence of letters between two spaces)
76:       in a text, e.~g., the sentence {\it I have not seen her yet} contains
77:       six tokens;
78: 
79: \item {\bf Corpus size} --- total number of tokens in the given corpus;
80: 
81: \item {\bf Vocabulary size} --- number of different words in the given corpus generated by the
82:       {\bf lemmatisation} process;
83: 
84: \item {\bf Lemmatisation} --- the process of reducing word-forms to the initial (dictionary)
85:       form, e.~g., verbs to the Infinitive, nouns to the Nominative Singular, etc.;
86: 
87: \item {\bf Vocabulary volume} --- the estimated number of possible different words of the language
88:       (in the context of this work, within a specific functional style).
89: \end{itemize}
90: 
91: \subsection{Corpus Description}
92: In this work, we analyse a middle-sized corpus of the Ukrainian language.
93: The size classification of corpora uses the Brown Standard Corpus of American English
94: \citep{Brown} as a reference point. Its parameters are as
95: follows: a)~one million words of running text; b)~500 text samples;
96: c)~2 thousand words per sample.
97: Corpora with less than one million words are considered as small,
98: corpora with 1--10 million words are middle-sized, and corpora containing
99: more than 10 million words are large.
100: 
101: The total size of the corpus analysed in this work is about 1.7 million tokens.
102: It consists of five sub-corpora corresponding to the five main functional styles of speech (genres).
103: 
104: 1.~The sub-corpus of {\it Belles-lettres Prose} contains 500 thousand tokens.
105: The frequency data were taken from \citep{Per81}. This frequency dictionary was compiled
106: on the basis of 25 literary works, with several text pieces extracted from different
107: places in each work. Although the writings date from 1945--1970 only, we assume
108: that the changes in the three thousand most frequent words since then are not significant.
109: 
110: 2.~The sub-corpus of {\it Colloquial Style} contains about 300 thousand tokens.
111: It consists of 45 text pieces of approximately 6,000 tokens each.
112: Since large collections of `pure' Ukrainian colloquial speech do not exist, we used modern
113: dramas written within the last two decades~\citep{Buk03a}. The equivalence of these two
114: types of speech is debatable, but the same approach was used, e.~g., in \citep{JuiBro70}
115: and \citep{KurLew90}.
116: %Colloquial: 45 pieces, 286,490 tokens of raw text
117: 
118: 3.~The sub-corpus of {\it Scientific Style} was collected from 104 pieces, each containing
119: about 3,000 tokens. Its total size slightly exceeds 300 thousand tokens.
120: The following scientific areas are represented in approximately equal
121: parts: biology, chemistry, psychology and pedagogy, physics, mathematics, engineering,
122: geography and geology, history, linguistics~\citep{Buk03b}.
123: %Scientific: (biology 39,842); chemistry and medicine (37,655); psychology (46,853);
124: %geography (37,482); history (37,675); mathematics (48,594); physics (36,941); linguistics (42,889).
125: %327,931. ??technics (35,525)
126: 
127: 4.~The sub-corpus of {\it Official (Business) Style} was composed of texts of different kinds of
128: documents: the Constitution of Ukraine, legal codes, Ukrainian and international laws,
129: international treaties, conventions, memoranda, declarations, speeches, economic documents,
130: contracts, all types of administrative documents, etc.
131: The size of the sub-corpus is about 300 thousand tokens.
132: 
133: 5.~The {\it Journalistic Style} frequency statistics were taken from \citep{UkrPub}.
134: The corresponding corpus was built on the basis of texts from several all-Ukrainian newspapers
135: %"Урядовий кур'єр", "Голос України", "Сільські вісті", "Культура і життя", "Україна молода",
136: %"Літературна Україна", "Молодь України", "Вісті з України", "Республіка", "Золоті ворота"
137: issued in 1994. These newspapers are addressed to both city-dwellers and villagers,
138: and to people of different ages. The size of the sub-corpus is also about 300 thousand tokens.
139: 
140: 
141: \subsection{Text processing}
142: At the first stage, several types of items were removed from the texts: numbers,
143: words containing numbers, punctuation marks (see the comment on dashes below),
144: and words written in a non-Ukrainian script.
145: Then, the texts were processed manually to resolve homonyms. This is a very important stage, as some
146: of these words appear with high frequency. As an example, we give some homonym pairs
147: (note that stress is usually not marked in Ukrainian texts\footnote{Hereafter, for the sake of convenience, we use
148: transliteration for representing Ukrainian words according to the table given in the Appendix.}):
149: {\it br\'aty} (`to take', verb in the Infinitive) and {\it brat\'y} (`brothers', noun
150: in Plural, Nominative); {\it m\'aty} (`to have', verb in the Infinitive) and
151: {\it m\'aty} (`mother', noun in Singular, Nominative); {\it ni\v{z}} (`than', particle)
152: and {\it ni\v{z}} (`knife', noun in Singular, Nominative); {\it \v{s}\v{c}o} being a particle,
153: a conjunction (`which'), and a pronoun (`what'); etc.
154: 
155: 
156: \section{Frequency Analysis}
157: 
158: \subsection{Low ranks}
159: 
160: The behaviour at low ranks is significantly influenced by some
161: specific features of the Ukrainian language. Several very frequent
162: words have different forms due to the principle of so-called
163: euphony. Namely, the word
164: {\it i} (`and') may also appear in the
165: forms {\it j} and {\it ta}. The word {\it v} (`in') may also have the
166: forms {\it u} and {\it vvi} or {\it uvi} (the last two are rare).
167: 
168: The verb `to be', very frequent in corpora of various languages, can in Ukrainian be
169: replaced by a dash (---) or omitted altogether; therefore, it appears somewhat less
170: frequently than in other languages, especially in the spoken language.
171: Note, however, that the converse does not hold, i.~e., not every dash represents
172: this verb.
173: 
174: In the table below we present the five most frequent words from different corpora.
175: The English statistics are based on the British National Corpus \citep{BNC}.
176: The German statistics were kindly provided by Sabine Schulte from
177: the University of Stuttgart. The Croatian corpus data are taken from \citep{HNC}, and the Polish data
178: are from \citep{PWN}. The Ukrainian statistics were collected by the authors.
179: 
180: \bigskip
181: \noindent
182: \begin{center}
183: \begin{tabular}{c|ll|ll|ll|ll|ll}
184: \hline
185: \hline
186: Rank&\multicolumn{2}{c|}{English}&\multicolumn{2}{c|}{German}
187:     &\multicolumn{2}{c|}{Croatian}&\multicolumn{2}{c|}{Polish}
188:     &\multicolumn{2}{c}{Ukrainian}\\
189: \hline
190: 1& the &0.0619& die  &0.0702& i  &0.0314& w       &0.0317& i  &0.0371\\
191: 2& be  &0.0424& sein &0.0289& u  &0.0276& i       &0.0282& v  &0.0303\\
192: 3& of  &0.0309& in   &0.0274& je &0.0264& si\k{e} &0.0192& na &0.0173\\
193: 4& and &0.0268& der  &0.0245& se &0.0156& na      &0.0167& z  &0.0166\\
194: 5& a   &0.0219& ein  &0.0234& da &0.0130& z       &0.0159& ne &0.0157\\
195: \hline
196: \hline
197: \end{tabular}
198: \end{center}
199: 
200: %\bigskip
201: %\begin{tabular}{l|ll|ll|ll|ll|ll}
202: %\hline
203: %\hline
204: %%art; sci; publ; ofic; coll
205: %1& i  &0.0379& v  &0.0427&i  &0.0234& i  &0.0387& ja &0.0410\\
206: %2& v  &0.0256& i  &0.0422&na &0.0173& v  &0.0309& i  &0.0315\\
207: %3& ne &0.0221& z  &0.0154&u  &0.0167& na &0.0178& ne &0.0299\\
208: %4& na &0.0208& na &0.0151&v  &0.0142& z  &0.0162& ty &0.0226\\
209: %\hline
210: %\hline
211: %\end{tabular}
212: %%publ:
213: %%i  7024; ta: 2248, j: 817
214: %%na 5204;
215: %%u  5007;
216: %%v  4274;
217: %%    ; z: 4041, zi: 125, iz: 734
218: 
219: \bigskip
220: This table demonstrates that our data are consistent with those for other Slavic languages.
221: 
222: \subsection{Kernel Vocabulary}
223: Zipf formulated the relation between the frequency $f$ of a word and its rank $r$
224: on the basis of the `principle of least effort', which he considered one of the most
225: important features of human behaviour, by analogy with Poincar\'e's principle
226: of least action in physics. In a form slightly modified from the original,
227: this dependence reads:
228: \begin{equation}\label{Zipf}
229: f_r=A/r^z,
230: \end{equation}
231: where $A$ and $z$ are parameters and the exponent $z$ deviates slightly from unity
232: (originally Zipf set $z=1$). Hereafter, we refer to this relation as Zipf's law.
233: 
234: %He also studied other general linguostatistical regularities suggested
235: %to him by the analysis of frequency dictionaries. He and his followers consider the dependencies
236: %between the frequency and rank of a word and its polysemy, between the frequency and the number of words
237: %with a given frequency, and between the frequency or rank of a word and its length.
238: 
239: We have analysed the rank--frequency dependencies for our corpora in the following way.
240: Since Zipf's law (\ref{Zipf}) becomes linear after taking the logarithm of both sides,
241: it is common to present rank--frequency relations in a log--log plot
242: (see the figures below).
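
To make the linearisation explicit, taking the natural logarithm of both sides of (\ref{Zipf}) gives
\[
\ln f_r=\ln A-z\,\ln r ,
\]
so that in the variables $x=\ln r$ and $y=\ln f_r$ the law is a straight line with slope $-z$
and intercept $\ln A$. The natural logarithm is chosen here only for definiteness; any other base
changes the intercept but leaves the slope $-z$ unchanged.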
243: 
244: \bigskip
245: \centerline{\includegraphics[angle=-90,width=70mm,clip]{art.ps}\
246: \includegraphics[angle=-90,width=70mm,clip]{col.ps}}
247: \centerline{Belles-lettres \hfil Colloquial}
248: \centerline{\includegraphics[angle=-90,width=70mm,clip]{sci.ps}\
249: \includegraphics[angle=-90,width=70mm,clip]{ofi.ps}}
250: \centerline{Scientific \hfil Official}
251: \centerline{\includegraphics[angle=-90,width=70mm,clip]{pub.ps}\
252: \includegraphics[angle=-90,width=70mm,clip]{BNC.ps}}
253: \centerline{Journalistic \hfil BNC}
254: \bigskip
255: 
256: The idea of selecting the Kernel Vocabulary is based on the assumption
257: that a deviation from the linear (Zipf's) behaviour on the `rank--frequency' curve corresponds
258: to the transition to a different type of vocabulary \citep{Mon01}.
259: That analysis was made using the British National Corpus \citep{BNC}.
260: Although the size of our corpus is far from that of the British National Corpus,
261: such scales, as we show below, already allow conclusions to be drawn
262: on some statistical features of the texts under consideration.
263: 
264: One can easily notice a slight change in the slope of the curve when moving to the higher-rank region.
265: A detailed analysis is required to find the place where this change occurs.
266: We have divided the ranks into overlapping domains of 200 ranks each: from 1 to 200, from 101 to 300,
267: from 201 to 400, and so on.
268: Then, for each domain, the best-fit parameters of Zipf's law (\ref{Zipf}) were calculated.
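
The fitting procedure is not specified in the text. One common choice, assumed here purely for
illustration, is an ordinary least-squares fit of a straight line to the points
$(x_r,y_r)=(\ln r,\ln f_r)$. For the $k$-th domain, containing the $n=200$ ranks
$r=100(k-1)+1,\ldots,100(k-1)+200$, such a fit would give
\[
z_k=-\,\frac{n\sum x_r y_r-\sum x_r\sum y_r}{n\sum x_r^2-\left(\sum x_r\right)^2},
\qquad
\ln A_k=\frac{1}{n}\left(\sum y_r+z_k\sum x_r\right),
\]
where all sums run over the ranks of the domain. Under this assumption, a change of vocabulary type
would show up as a marked change of $z_k$ between neighbouring domains.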
269: 
270: After making the detailed numerical analysis of data for each sub-corpus
271: we noticed the following specific features:
272: \begin{itemize}
273: \item in the official, journalistic and scientific sub-corpora at some rank $r_{\rm max}$
274:       the value of $z$
275:       changes significantly, which corresponds to the transition to a different part
276:       of the vocabulary. The values are $r_{\rm max}\simeq800$ for the official sub-corpus,
277:       $r_{\rm max}\simeq1000$ for the scientific one, and
278:       $r_{\rm max}\simeq 1600$ for the journalistic sub-corpus;
279: \item in the colloquial sub-corpus, the deviation from (\ref{Zipf}) is less significant:
280:       Zipf's law with $z=1.09$ describes the whole domain of ranks quite well. However,
281:       the numerical analysis suggests a value of $r_{\rm max}$ close to that
282:       of the journalistic sub-corpus.
283: \end{itemize}
284: %As the text coverage in these domains of ranks reaches similar values for both cases,
285: %we propose to consider the Kernel Vocabulary limit $r_{\rm max}=963$ for the scientific
286: %and $r_{\rm max}=1622$ for journalistic style, the corresponding text coverage is 70 per cent.
287: 
288: To give a better understanding of the behaviour of Zipf's exponent,
289: we show it graphically in Fig.~\ref{ExpVisual} below.
290: 
291: \begin{figure}[h]
292: \centerline{\includegraphics[angle=-90,width=100mm,clip]{bl-c-j.eps}}
293: \centerline{\includegraphics[angle=-90,width=100mm,clip]{bnc-s-o.eps}}
294: \caption{Behaviour of Zipf's exponent, showing the transition between different types of
295: vocabulary.}\label{ExpVisual}
296: \end{figure}
297: 
298: \subsection{Entropy Comparison}
299: It is interesting to analyse the frequency dependencies in terms of the entropy $S$:
300: \begin{equation}
301: S_N=-\sum_{r=1}^{N} f_r\,\ln f_r,
302: \end{equation}
303: where $N$ is a large number. Putting $N=3000$ for each sub-corpus, we obtained
304: the following values: belles-lettres prose (BP) 2.192; colloquial style (CS) 2.356; journalistic (publicistic) style (PS) 2.368; scientific style (SS) 2.602; official style (OS) 2.750.
305: While the smallest value of the entropy for the belles-lettres sub-corpus looks somewhat unexpected,
306: we propose the following interpretation of the remaining data. In physics, entropy
307: is a measure of disorder in a system. As we know from experience, official texts are
308: usually hard to read and therefore require more effort to be understood. For scientific
309: texts, a similar statement applies to a lesser extent, given that
310: such texts are read by their addressees, specialists in the respective field. From this point of
311: view, journalistic texts should be quite close to everyday speech, and this is what
312: the numbers show.
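
As a purely illustrative aside (not a result of the analysis above), suppose the frequencies
followed Zipf's law (\ref{Zipf}) exactly for all ranks $1\le r\le N$. Substituting $f_r=A/r^z$
into the definition of $S_N$ would then give
\[
S_N=-\sum_{r=1}^{N}\frac{A}{r^z}\left(\ln A-z\ln r\right)
   =-A\ln A\sum_{r=1}^{N}r^{-z}+zA\sum_{r=1}^{N}r^{-z}\ln r ,
\]
so that, under this idealised assumption, the entropy is fixed by the two fit parameters $A$ and $z$.
Since the real data deviate from (\ref{Zipf}), the values quoted above need not coincide with this expression.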
313: 
314: 
315: 
316: 
317: \section{Vocabulary size estimation}
318: Suppose one has the whole language corpus, and its vocabulary size is $\cal R$.
319: This means that $\cal R$ is the maximal possible rank, so the frequency
320: $f_{{\cal R}+1}$ of the next-ranked word is zero. If one accepts Zipf's dependence
321: (\ref{Zipf}) as valid, such a situation never occurs, since $A/r^z$ is positive for every
322: finite rank. Let us therefore adopt a slightly modified function \citep{Lua94}:
323: 
324: \begin{equation}
325: f_r^t=-A+B r^t,
326: \end{equation}
327: where the exponent $t$ is a small positive number.
328: In this case, the value of $\cal R$ is defined as follows: ${\cal R}=(A/B)^{1/t}$.
329: Typical values of $t$ are of order 0.1. Thus, the estimation of the vocabulary size for a
330: specific functional genre yields values of 200 to 700 hundred different words. A more precise
331: estimate will be made after a larger corpus is analysed.
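
The expression for ${\cal R}$ can be recovered by requiring the extrapolated frequency to vanish:
setting $f_r=0$ at $r={\cal R}$ in the modified function gives
\[
0=-A+B\,{\cal R}^{t}\quad\Longrightarrow\quad {\cal R}=\left(\frac{A}{B}\right)^{1/t}.
\]
Taking logarithms, $\ln{\cal R}=(\ln A-\ln B)/t$, so at fixed $t$ a small relative change in the
ratio $A/B$ is amplified by the factor $1/t$:
\[
\frac{\delta{\cal R}}{{\cal R}}\simeq\frac{1}{t}\,\frac{\delta(A/B)}{A/B}.
\]
For $t\approx0.1$ this factor is about 10, e.~g., a 1 per cent uncertainty in $A/B$ corresponds to
roughly 10 per cent in ${\cal R}$. This sensitivity estimate is only a sketch that assumes $t$ is held
fixed; it illustrates why the extrapolated values should be treated as rough.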
332: 
333: \section{Discussion}
334: 
335: We have analysed the rank--frequency relations for a middle-sized corpus
336: of the Ukrainian language. The data for the most frequent words are consistent with those for other
337: Slavic languages. The presented results make it possible to establish the
338: Kernel Vocabulary and to estimate the vocabulary size. The entropy was calculated for the different
339: functional genres.
340: We hope that our data will be useful when compiling the National Corpus
341: of the Ukrainian language. More precise results will be available after a larger corpus
342: is considered.
343: 
344: 
345: 
346: 
347: \begin{thebibliography}{}
348: 
349: %\bibitem[Besters-Dilger 2002]{Bes02}Besters-Dilger, J. (2002).
350: %``Deutsche lexikalische Entlehnungen im Ukrainischen''.
351: %{\it Litteraria Humanitas} XI, 25--49.
352: 
353: \bibitem[BNC http]{BNC}British National Corpus (BNC):\\ ftp://ftp.itri.bton.ac.uk/bnc/;
354: \ \ \ http://www.natcorp.ox.ac.uk
355: 
356: \bibitem[Brown http]{Brown}Brown Standard Corpus:\\
357: \ \ \ http://khnt.hit.uib.no/icame/manuals/brown/INDEX.HTM\#bc9
358: 
359: \bibitem[Buk 2003a]{Buk03a}Buk, S. (2003).
360: ``\v{C}astotnyj slovnyk rozmovno-pobutovoho stylju su\v{c}asnoji ukrajinsjkoji movy''.
361: [Colloquial Genre Frequency Dictionary of Modern Ukrainian Language]
362: {\it Lingvisty\v{c}ni Studiji (Donetsk University)} 11 (Part I), 266--271.
363: %Бук С.
364: %Частотний словник розмовно-побутового стилю сучасної української мови
365: %// Лінгвістичні студії: Зб. наук. праць. Випуск 11.
366: %У 2 частинах / Укл.: А.  Загнітко (наук. ред) та ін.
367: %Част. І.- Донецьк: ДонНУ, 2003.- 350 c.- С. 266-271.
368: 
369: \bibitem[Buk 2003b]{Buk03b}Buk, S. (2003).
370: ``\v{C}astotnyj slovnyk naukovoho stylju su\v{c}asnoji ukrajinsjkoji movy''.
371: [Scientific Genre Frequency Dictionary of Modern Ukrainian Language]
372: {\it Cherkasy University Herald. Ser. philol.} 44, 90--96.
373: %Бук С.
374: %Частотний словник наукового стилю сучасної української мови
375: %// Вісник Черкаського університету. Серія філологічні науки.
376: %Черкаси: Черкаський державний університет імені Богдана Хмельницького, 2003.- ??.
377: 
378: \bibitem[Burkhanov 1998]{Bur98}Burkhanov, I. (1998).
379: {\it Lexicography: A Dictionary of Basic Terminology}.
380: Rzesz\'ow: Wydawnictwo wy\.zszej szko\l{}y pedagogicznej.
381: 
382: \bibitem[Cancho \& Sol\'e 2001]{CanSol01}Cancho, R. F., \& Sol\'e, R. V. (2001).
383: ``Two regimes in the frequency of words and the origins of complex lexicons: Zipf's law revisited''.
384: {\it Journal of Quantitative Linguistics} 8(3), 165--173.
385: 
386: \bibitem[HNC http]{HNC}Croatian National Corpus (HNC, 9.156.446 tokens):\\
387: http://www.hnk.ffzg.hr/corpus.htm
388: 
389: \bibitem[FDP http]{UkrPub}\v{C}astotnyj slovnyk publicystyky.
390: [Frequency dictionary of publicism]:
391: http://80.78.37.38/freqcard.aspx?sl=publicist
392: 
393: \bibitem[Demska-Kulchytska 2001]{Dem01}Demska-Kulchytska, O. (2001).
394: ``Korpus tekstov ukrainskoj periodiki''.  [Ukrainian periodicals text corpus]
395: In: {\it Issledovanie slavjanskix jazykov v rusle tradicij sravnitel'no-istori\v{c}eskogo
396: i sopostavitel'nogo jazykoznanija}. Moscow, 26--28.
397: 
398: %\bibitem[Grigoryan \& Manasyan 1988]{GriMan88}Grigoryan, S., \& Manasyan, N. (1988).
399: %``On some quantitative and linguistical pecularities of the Armenian basic vocabulary''.
400: %{\it Tartu Riikliku \"Ulikooli Toimetised} 827, 62--73.
401: 
402: \bibitem[Juilland et al. 1970]{JuiBro70}Juilland, A., Brodin, D., Davidovich, C. (1970).
403: {\it Frequency Dictionary of French Words}. The Hague--Paris.
404: 
405: %\bibitem[Kornai 2002]{Kor02}Kornai, A. (2002).
406: %``How many words are there?''.
407: %{\it Glottometrics} 4, 61--86.
408: 
409: \bibitem[PWN http]{PWN}Korpus J\k{e}zyka Polskiego Wydawnictwa Naukowego PWN.
410: [Polish language corpus of Scientific publishing house PWN]:
411: http://korpus.pwn.pl/
412: 
413: \bibitem[Kurcz et al. 1990]{KurLew90}Kurcz, I., Lewicki, A., Sambor, J., Szafran, K., Woronczak, J. (1990).
414: {\it S\l{}ownik frekwencyjny polszczyzny wsp\'o\l{}czesnej}.
415: [Frequency Dictionary of Contemporary Polish]. Krak\'ow: PAN, Instytut J\k{e}zyka Polskiego.
416: 
417: \bibitem[Lua 1994]{Lua94}Lua, K. T. (1994).
418: ``Frequency--Rank Curves and Entropy for Chinese Characters and Words''.
419: {\it Computer Processing of Chinese \& Oriental Languages} 8, 37--52.
420: 
421: \bibitem[Montemurro 2001]{Mon01}Montemurro, M. A. (2001).
422: ``Beyond the Zipf--Mandelbrot law in quantitative linguistics''.
423: {\it Physica A} 300, 567--578.
424: 
425: \bibitem[Montemurro \& Zanette 2002]{MonZan02}Montemurro, M. A., \& Zanette, D. H. (2002).
426: ``New perspectives of Zipf's law in linguistics: from single texts to large corpora''.
427: {\it Glottometrics} 4, 87--99.
428: 
429: \bibitem[Muravytska \& Oleksijenko 1974]{Str74}Muravytska, M. P., \& Oleksijenko, L. A. (eds.) (1974).
430: {\it Struktura movy i statystyka movlennja}.
431: [Structure of language and statistics of speech].
432: Kyiv: Naukova Dumka.
433: 
434: \bibitem[Ne\v{s}itoj 1987]{Nes87}Ne\v{s}itoj, V. V. (1987).
435: ``About the form of representing rank distributions''.
436: {\it Tartu Riikliku \"Ulikooli Toimetised} 774, 123--134.
437: 
438: \bibitem[Perebyjnis 1967]{Per67}Perebyjnis, V. S. (1967).
439: {\it Statysty\v{c}ni parametry styliv}.
440: [Statistical parameters of styles].
441: Kyiv: Naukova Dumka.
442: 
443: \bibitem[Perebyjnis 1981]{Per81}Perebyjnis, V. S. (ed.) (1981).
444: {\it \v{C}astotnyj slovnyk su\v{c}asnoji ukrajinsjkoji xudo\v{z}njoji prozy}.
445: Kyiv: Naukova Dumka.
446: 
447: %\bibitem[Saloni 1990]{Sal90}Saloni, Z. (ed.) (1990).
448: %{\it S\l{}ownik frekwencyjny polszczyzny wspo\l{}czesnej}. Krakow: Uniwersytet Jagiello\'nski.
449: 
450: %\bibitem[Simon 1955]{Sim55}Simon, H. A. (1955).
451: %``On a Class of Skew Distribution Functions''.
452: %{\it Biometrika} 42, 425--440.
453: 
454: %\bibitem[Tuldava 1980]{Tul80}Tuldava, J. (1980).
455: %``On the analytical expression of the relation between size of vocabulary and size of text''.
456: %{\it Tartu Riikliku \"Ulikooli Toimetised: T\"oid keelestatistika alalt. VI.} 549, 113--144.
457: 
458: \bibitem[Zipf 1949]{Zipf49}Zipf, G. K. (1949).
459: {\it Human behavior and the principle of least effort.}
460: Reading, Mass.: Addison-Wesley.
461: 
462: \end{thebibliography}
463: 
464: 
465: \section*{Appendix}
466: 
467: \bigskip
468: \begin{figure}[h]
469: %\epsfxsize=60mm
470: %\epsfbox{F2.eps}
471: \centerline{\includegraphics[width=120mm,clip]{Translit.eps}}
472: \caption{Ukrainian Transliteration Table.\protect\\
473: This transliteration scheme is free of ambiguity and allows for bi-directional
474: transliteration. While in some places it may seem somewhat complicated, in practical
475: applications difficult letter combinations appear very rarely. In addition, it
476: is consistent with some Slavic writing systems based on the Latin script.}
477: \label{TranslitTable}
478: \end{figure}
479: %\smallskip
480: 
481: \end{document}
482: