1: \documentstyle[a4]{article}
2: \textheight = 0.9 \textheight
\title{A new approach to relevancy in Internet searching - the ``Vox Populi Algorithm''}
4: \author{
5: Andreas Schaale ${}^1$, Carsten Wulf-Mathies ${}^2$, S\"onke
6: Lieberam-Schmidt${}^3$
7: \\
8: \\
9: \small \it
10: ${}^1$ Contraco Consulting and Software Ltd., Diepenseer
11: Str. 10, 15732 Waltersdorf, Germany
12: \hfill\\
13: \small \it
14: ${}^2$ T-Online International AG, Waldstr. 3, 64331 Weiterstadt, Germany
15: \hfill\\
16: \small \it ${}^3$ Universit\"at Siegen, FB 5, H\"olderlinstr. 3,
17: 57068 Siegen, Germany}
18: \begin{document}
19: \begin{titlepage}
20: \maketitle
21: \begin{abstract}
In this paper we derive a new algorithm for Internet
searching. The main idea of this algorithm is to extend the
existing algorithms by a component that reflects the interests
of the users more directly than existing methods do. The ``Vox Populi
Algorithm'' (VPA) \cite{patent} creates a feedback loop from the users
to the content of the search index. The information derived from
the analysis of user queries is used to modify the existing crawling
algorithms: the VPA controls how the resources of the crawler are
distributed. Finally, we discuss methods of suppressing
unwanted content (spam), which is necessary for the VPA to
perform efficiently.
33: \end{abstract}
34: \end{titlepage}
The retrieval of relevant information from data sources with a
very complex structure has become a challenging task now that the
number of documents on the Internet has reached several billion.
Only a small part of them is visible
in search engines. The problem of organizing and structuring these
data into catalogues or searchable databases is of theoretical and
significant practical (commercial) interest.
42: \\ \\
Let us define the basic components for the mathematical
description of the interests of the users, the relevancy of the
search results and the crawling process. The users of search
engines express their needs for information through the queries
which they address to a searchable database (index) $I$. Each
query $k$ consists of one or more keywords $q$ addressed to
this index and is written as:
50: \begin{equation}
51: {\vec{q}_k} = (q_1, ..., q_n)_k
52: \label{keyworddef}
53: \end{equation}
Here $n$ is the length of the query $k$. The average number of
keywords per query is $n \approx 2$ (as of 2003). The users are
searching for documents $d_j$ (HTML pages, tables, text processing
documents, pictures, multimedia files, ...) containing
information. These documents are grouped (organized) in domains
$D_k$, i.e. sets of documents under a common editorial
responsibility and address (URL):
61: \begin{equation}
D_k = \bigcup_{j=1}^{n_k} d_j^{(k)} \;\;\; \mbox{$n_k$ = number of
documents in $D_k$}
64: \label{domaindef}
65: \end{equation}
The number of domains is about 6.4 million in Germany \cite{denic}
and the number of documents per domain $n_k$ lies in the range
$10^{0} \ldots 10^{8}$.
69: \\ \\
Each document $d$ contains searchable information, today limited
to text information. Content which is hidden from {\it today's}
search technology in non-indexable formats (bitmaps,
scripts etc.) will be neglected here and in the following. A
document is characterized by the keywords $q$ it contains and the
position of each keyword in certain format elements $e_i$
(metatags, headers, tables, link text etc.):
77: \begin{equation}
78: d^{(k)} = f(q_1, q_2, ...,e_1, e_2, ...)
79: \label{docdef}
80: \end{equation}
81: During the crawling and indexing process, the image of the
82: document $\hat{d}$ in the searchable index $I$ contains a reduced
83: set of information - the keywords and their position in the format
84: elements $e$ of the document. When a query is addressed to the
85: index $I$ a ranking algorithm generates a set of documents (links)
86: which is ordered by the relevancy of the found documents. In order
87: to describe the document ranking process which generates the set
88: of results on each query, one has to introduce the density $\rho$
89: of keywords within the documents:
90: \begin{equation}
91: \rho^j_i = \frac{n_{q_i}}{n_{e_j}} \label{densitydef}
92: \end{equation}
93: where $n_{q_i}$ is the number of the occurrences of the keyword
94: $q_i$ in the format element $e_j$ and $n_{e_j}$ is the total
95: number of words in this format element.
96: \\ \\
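As a worked illustration of (\ref{densitydef}) with assumed numbers
(they are not taken from any measured page): suppose the keyword
$q_1$ occurs once in a title $e_1$ of five words and three times in
a body text $e_2$ of 150 words. Then
$$
\rho^1_1 = \frac{1}{5} = 0.2, \;\;\; \rho^2_1 = \frac{3}{150} = 0.02
$$
so the short title has a tenfold higher keyword density than the
body text, although it contains fewer occurrences of the keyword.
\\ \\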
Today there exist two basic types of ranking algorithms - the
dynamic and the static ranking algorithms. The dynamic rank of a
document depends on two factors only - the keywords $q$ of the
query and the information content of the document. As a
rule of thumb: the higher the keyword density in the document, the
higher the dynamic rank of this document. The relevancy
function $R_d$, defining the dynamic rank of a document, can be
written as:
105: \begin{equation}
106: R_d(q_1) \propto \sum_{k=1}^{N} \mu_k \; \rho^k(q_1)
107: \;\;\;\mbox{N - number of format elements} \; e
108: \label{dynamicrank}
109: \end{equation}
for a single keyword query. The coefficients $\mu_k$ are free
parameters, defining the importance or weight of each format
element. For example, the occurrence of a keyword in a URL is
usually much more important than in the text itself:
$\mu_{URL}>\mu_{text}$. The rank for a query with multiple keywords
can be written as the product of the single keyword ranks:
116: \begin{equation}
R^{n}_d(q_1,q_2,\ldots,q_n) = R^1_d(q_1) \cdot R^1_d(q_2) \cdot \ldots \cdot
R^1_d(q_n)
119: \label{multi-dynamicrank}
120: \end{equation}
Usually these functions are modified for different purposes,
such as the suppression of unwanted information (spam). Other
modifications can take into account the freshness of the document,
the type of the format or other technical parameters.
125: \\ \\
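The dynamic rank (\ref{dynamicrank}) and its multi keyword form
(\ref{multi-dynamicrank}) can be sketched in a few lines of Python.
This is a minimal illustration, not the implementation of any
search engine: the weights in \verb|MU|, the three format elements
and the sample document are assumptions chosen for the example.
\begin{verbatim}
```python
# Hypothetical weights mu_k per format element (assumed values).
MU = {"url": 3.0, "title": 2.0, "text": 1.0}


def density(keyword, words):
    """rho = occurrences of the keyword / total words in the element."""
    if not words:
        return 0.0
    return words.count(keyword) / len(words)


def dynamic_rank(keyword, document):
    """Weighted sum of the keyword densities over all format elements."""
    return sum(MU[element] * density(keyword, words)
               for element, words in document.items())


def dynamic_rank_multi(keywords, document):
    """Multi keyword rank as the product of the single keyword ranks."""
    rank = 1.0
    for keyword in keywords:
        rank *= dynamic_rank(keyword, document)
    return rank


# Hypothetical document, already split into words per format element.
doc = {
    "url": ["mp3", "archive"],
    "title": ["free", "mp3", "downloads"],
    "text": ["download", "free", "mp3", "files", "here", "mp3"],
}
r_mp3 = dynamic_rank("mp3", doc)
r_free = dynamic_rank("free", doc)
r_both = dynamic_rank_multi(["mp3", "free"], doc)
```
\end{verbatim}
With these assumed weights the keyword in the URL dominates the
result. A keyword that does not occur in the document gives a rank
of zero, and because of the product form this also sets the rank of
every multi keyword query containing it to zero.
\\ \\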
The practical work on search engines has shown that using only a
document related, dynamic ranking algorithm is insufficient. In
order to also include the importance or the popularity of a domain
(popularity among the webmasters, not necessarily among Internet
users), a new type of algorithm was invented - the static ranking
\cite{brin1}. The static rank $R_s$ of a document $d_i$ is related
to the importance of the domain in which it is
located. The idea of the static rank of a domain $D$ can be
expressed symbolically in the following form:
135: \begin{equation}
136: R_s(D) \propto \sum_{j=1}^{N_j} R_s^j
137: \label{staticrank}
138: \end{equation}
where $R_s^j$ is the static rank of the $j$-th site linking to the
domain $D$ and $N_j$ is the total number of external links to the
domain. In \cite{gloeggler} a more detailed definition of the page
rank formula is given:
143: \begin{equation}
144: R_s(D) = (1-d) + d \sum_{j=1}^{N_j} R_s^j M_j^{-1}
145: \label{staticrankdetailed}
146: \end{equation}
where $d$ is a free damping parameter (usually $d \approx 0.85$
\cite{gloeggler}; not to be confused with a document $d$) and $M_j$
is the total number of outgoing links
of the referring site $j$. A detailed discussion of the page rank
algorithm used by Google can also be found in \cite{kamvar1} and
\cite{kamvar2}.
152: \\ \\
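Formula (\ref{staticrankdetailed}) defines the static rank only
implicitly, since the ranks of the linking sites appear on the
right hand side. One common way to evaluate it is a fixed point
iteration, sketched below in Python. The three domains and their
links are a hypothetical toy graph, and the number of iterations is
an assumption that is sufficient for this small example.
\begin{verbatim}
```python
def static_rank(links, d=0.85, iterations=100):
    """Iterate R(D) = (1 - d) + d * sum_j R_j / M_j.

    links maps each domain to the list of domains it links to.
    """
    domains = list(links)
    rank = {dom: 1.0 for dom in domains}
    for _ in range(iterations):
        new_rank = {}
        for dom in domains:
            # Sum over all sites j that link to dom; M_j = len(links[j]).
            incoming = sum(rank[src] / len(links[src])
                           for src in domains if dom in links[src])
            new_rank[dom] = (1.0 - d) + d * incoming
        rank = new_rank
    return rank


# Hypothetical link graph: a links to b and c, b links to c, c links to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = static_rank(links)
```
\end{verbatim}
In this toy graph the domain \verb|c| receives links from both other
domains and ends up with the highest static rank, while \verb|b|,
which is only linked once from a site with two outgoing links,
receives the lowest.
\\ \\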
The resulting rank of a document is a function of the dynamic
rank (\ref{dynamicrank}) and the static rank (\ref{staticrank}).
155: There is no unique or even optimal way of constructing this
156: function. A reasonable way is to choose the resulting relevancy
157: $R_{ds}$ as a product of the dynamic and static rank:
158: \begin{equation}
159: R_{ds} = R_d(q) \cdot R_s(d_i) \label{ds-rank}
160: \end{equation}
A common simplification of (\ref{ds-rank}) is to use the domain rank
$R_s(D_i)$ instead of $R_s(d_i)$. In practice, however, the static rank of a
document depends not only on the static rank of the domain $D$
containing $d_i$, but also on its position within the domain (the link
topology of the domain). At present this kind of search algorithm
is in use in every major Internet search engine.
167: \\ \\
168: The algorithms described above do indeed meet the needs of the
169: users. This approach is reasonable from an academic point of view
and it has produced remarkable results in the past. Today it has
become more difficult to make use of the link topology - very
often the links are not set according to the content relevancy,
but for other (economic) reasons. To the extent that the search
engines have become the most important information retrieval tool,
they have also become a target of spamming (site owners try to
deceive the search engines by pretending to offer more important
content than there really is). An effective method of detecting a
certain type of spam is described in the appendix. By applying filter
mechanisms and modifying the parameters of the dynamic and the
static relevancy algorithms, one can ``fine tune'' the quality of
the Internet search engines.
182: \\ \\
The two methods described above explicitly do not take into
account the most important factor, the interest of the users
searching for information. The dynamic and the static relevancy of
a document are influenced by the content of the site and by the
``citation'' by other sites. There is no methodical component that
reflects the voice of the searching people. This will be done by
the ``Vox Populi Algorithm'' (people's voice).
190: \\ \\
191: The main idea of the VPA is to use the information that is
192: extractable from the user query analysis to enhance the quality of
193: the search. This can be done in two different ways, by modifying
194: either the ranking or the crawling algorithm. In this paper the
195: focus is not on the ranking, but on the crawling algorithm. The
196: crawling algorithm defines which domain and how much of the
content will be included in the search index. Sites which are
not included cannot be found even by the best ranking algorithm. At
present only a small fraction ($< 10\%$) of the Internet
sites is indexed by the search engines. The much bigger part of the
Internet (``Deep Web'') is not visible in any of the search
engines.
203: \\ \\
The source of information is the analysis of the queries
$\vec{q}$, reflecting the users' interests and needs. The query set
$Q$ may contain all single and multiple keyword queries of the
users (\ref{keyworddef}). Based on these queries a
multidimensional tensor $\Omega$ can be defined, containing the
information on the multiple keyword correlations, with the
dimension $N_{max}$:
211: \begin{equation}
\dim[\Omega(Q)] = N_{max}
213: \label{Otensor}
214: \end{equation}
215: $N_{max}$ is the maximum length of a query - theoretically it can
216: be infinite. Practically the amount of queries having $>6$
217: keywords is $<1\%$, while the average query consists of about
218: $N=2$ keywords. In order to simplify the further calculations one
219: can reduce the dimension of (\ref{Otensor}) in the following way:
220: \begin{equation}
221: \Omega^{N_{max}}(Q) \rightarrow \Omega^{N=2}(Q) \equiv \Omega
222: \label{reducedOtensor}
223: \end{equation}
In this reduction algorithm, the queries with more than two
keywords are replaced by two keyword queries, containing all
possible paired combinations. For example, a three keyword query
is equivalent to three two keyword queries, a four keyword query to
six, and so on.
228: \\ \\
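The reduction (\ref{reducedOtensor}) can be sketched in Python as
follows. A query with $n$ distinct keywords yields $n(n-1)/2$
unordered pairs. Two points in this sketch are assumptions, not
statements of the original algorithm: single keyword queries are
counted on the diagonal of the pair table, and the raw counts are
left unnormalized. The sample queries are hypothetical.
\begin{verbatim}
```python
from collections import Counter
from itertools import combinations


def reduce_to_pairs(query):
    """Replace a query by all unordered keyword pairs it contains."""
    keywords = sorted(set(query))
    if len(keywords) == 1:
        # Assumption: a single keyword query counts on the diagonal.
        return [(keywords[0], keywords[0])]
    return list(combinations(keywords, 2))


def correlation_counts(queries):
    """Count how often each unordered keyword pair occurs in the query set."""
    counts = Counter()
    for query in queries:
        for pair in reduce_to_pairs(query):
            counts[pair] += 1
    return counts


# Hypothetical query log.
queries = [
    ["mp3"],
    ["mp3", "download"],
    ["free", "mp3", "download"],
]
counts = correlation_counts(queries)
```
\end{verbatim}
Sorting the keywords before pairing makes the table symmetric by
construction, which corresponds to ignoring the order of the
keywords within a query.
\\ \\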
The matrix $\Omega$ is a correlation matrix of all keywords of the
analyzed query set $Q$. $\Omega$ is a positive and
symmetric matrix \footnote{The analysis of the order of the
keywords shows a statistical asymmetry,
$N(1,2) \neq N(2,1)$. Users interested in the explicit order of
the keywords can use the option called ``Exact Phrase'', which is
available on any modern search engine. Therefore it is reasonable
to assume that the order of the keywords is not important for the
users when they make simple queries (more than 90\% of all queries
are of this type). We will use here the approximation $N(1,2) =
N(2,1)$.}. One can calculate the eigenvectors and eigenvalues of
$\Omega$, transforming it into the diagonal form:
241: \begin{equation}
242: K^{-1} \Omega K = \Omega^{diag}
243: \label{Otensordiag}
244: \end{equation}
245: The details of the diagonalization procedure are well known, see
246: \cite{bronstein} or any other standard textbook on mathematics. It
247: is now important to understand the practical meaning of the
248: matrices $K$ and $\Omega^{diag}$. The matrix $K$ consists of
249: eigenvectors which are keyword combinations:
250: \begin{equation}
251: K=\left(
252: \begin{array}{c}
253: \vec{e}^1 \\
254: \vec{e}^2 \\
255: \vec{e}^3 \\
256: ...
257: \end{array}
258: \right) \label{Kmatrix}
259: \end{equation}
260: where each eigenvector has the coordinates
261: \begin{equation}
262: \vec{e}^j=(c_1q_1,c_2q_2, ...)^j
263: \label{qvector}
264: \end{equation}
Similar to the definition (\ref{keyworddef}), the $q_i$ are the
keywords and the coefficients $c_i^j$ are positive numbers, giving
each keyword a ``weight'' compared to the other ones (how
frequently do the users ask for this keyword?). The coefficients
determine the relative importance of a keyword within an
eigenvector. A typical eigenvector (or better ``eigenquery'') has
the form (based on the data of \cite{keyworddatenbank}, Aug. 2003):
272: \begin{equation}
\vec{e}^j=(\mbox{``mp3''},\;0.73\cdot \mbox{``downloads''}, \; 0.43 \cdot \mbox{``free''},
\ldots)
275: \label{qvectordemo}
276: \end{equation}
This query shows how the {\it average} user phrases a
search for mp3 downloads at no cost. The reduced $(N=3)$
keyword matrix of the example above has the form
\cite{keyworddatenbank}:
281: \begin{equation}
282: \Omega=
283: \begin{array}{lccc}
& \mbox{mp3} & \mbox{download} & \mbox{free} \\
\mbox{mp3} & 37.2 \% & 8.8 \% & 2.7\% \\
\mbox{download} & 8.8 \% & 19.2 \% & 3.6\% \\
\mbox{free} & 2.7 \% & 3.6 \% & 13.4\%
288: \end{array}
289: \label{mp3matrix}
290: \end{equation}
291: The difference between the typical keyword search at present and
292: our approach is that the words here have different weights,
293: determining their relative importance for the users.
294: \\ \\
Further important information about the significance of keyword
combinations is contained in the matrix $\Omega^{diag}$:
297: \begin{equation}
298: \Omega^{diag}=\left(
299: \begin{array}{ccccc}
300: \lambda_1 & 0 & 0 & ... & 0 \\
301: 0 & \lambda_2 & 0 & ... & 0 \\
302: 0 & 0 & ... & ... & ... \\
303: 0 & 0 & ... & \lambda_{N-1} & 0 \\
304: 0 & 0 & ... & 0 & \lambda_N
305: \end{array}
306: \right)
307: \label{Omegadiag}
308: \end{equation}
309: Each eigenvalue $\lambda_i$ corresponds to an eigenvector in
310: (\ref{qvector}). The eigenvalue can be interpreted as the
311: importance of the corresponding eigenvector - it defines the
312: importance of an eigenquery for the users.
313: \\ \\
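For a small matrix the most important eigenquery can be found
without a full diagonalization. The following Python sketch applies
power iteration to the three keyword matrix (\ref{mp3matrix}); power
iteration is our choice for this illustration, and any standard
eigenvalue routine would serve equally well. The eigenvector is
scaled so that its largest component is one, as in
(\ref{qvectordemo}). The weights obtained from this small block need
not reproduce those in (\ref{qvectordemo}).
\begin{verbatim}
```python
def dominant_eigenpair(matrix, iterations=200):
    """Power iteration for a square matrix given as a list of rows.

    Returns (eigenvalue, eigenvector); the eigenvector is scaled so
    that its largest component equals 1.
    """
    n = len(matrix)
    vec = [1.0] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(n))
                   for i in range(n)]
        eigenvalue = max(product, key=abs)
        vec = [x / eigenvalue for x in product]
    return eigenvalue, vec


# Keyword matrix for (mp3, download, free), values in percent.
omega = [
    [37.2, 8.8, 2.7],
    [8.8, 19.2, 3.6],
    [2.7, 3.6, 13.4],
]
lam, weights = dominant_eigenpair(omega)
```
\end{verbatim}
The returned eigenvalue plays the role of $\lambda_1$, and the
components of \verb|weights| are the coefficients $c_i$ of the
corresponding eigenquery for the keywords mp3, download and free.
\\ \\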
We have now developed the tools for defining how a search
engine can use the information from the users to determine which
content should be enhanced or reduced in the index. Based on the
described algorithm it is possible to define which content is the
``most wanted'' content and which sites deliver this type of
content:
320: $$
321: (c_1q_1 + c_2q_2 + ...) \rightarrow \;\;\mbox{search engine}
322: \rightarrow \;\;\mbox{list of ranked domains}
323: $$
324: Crawling the Internet, each domain is given certain resources by
325: the search engine, such as CPU time and memory in the index
326: (alternatively also the number of crawled documents or other
327: parameters, depending on the settings of the search engine).
328: \\ \\
329: The practical realization of the VPA as an extension of an
330: existing Internet search could be performed using the following
331: procedure:
332: \begin{enumerate}
\item Generate a ranking of domains by addressing the eigenqueries
(\ref{qvector}) to the existing (old) search index. The priority
of those domains is defined by the size of the eigenvalues
(\ref{Omegadiag}).
\item Modify the existing resource ranking
list with respect to these eigenvalues.
\item Use the newly determined ranking of the domains for crawling the Internet
according to the modified resource
distribution.
\item Repeat the cycle.
343: \end{enumerate}
In order to determine which sites best fit the eigenqueries, it is
useful to calculate a dynamic rank for a whole domain, not just
for a single document. A simple method would be to sum the
scores of all documents in one domain:
348: \begin{equation}
349: {\bf R_D}(e_i) \propto \sum_{k=1}^{N_D} R_d^k(e_i)
350: \label{domainrank}
351: \end{equation}
Let us assume that the amount of resources (CPU time, number of
documents, data volume etc.) given to each domain when crawling
it can be expressed by a function $M$, with
355: \begin{equation}
356: M=M(D_k,R_s, ...)
357: \label{M1}
358: \end{equation}
359: In order to apply the VPA one can modify (\ref{M1}) in the
360: following way:
361: \begin{equation}
362: M\rightarrow \hat{M}=M\cdot R_{VPA}
363: \label{M2}
364: \end{equation}
The function $R_{VPA}$ defines the VPA correction with regard to
the old crawling algorithm and can be chosen in different ways.
The basic requirement for the
function is that it is monotonically increasing in the parameters
$\lambda_i$, which define quantitatively how relevant a query is
for the users. Following Occam's principle of simplicity
(Pluralitas non est ponenda sine necessitate - Entities should not
be multiplied unnecessarily) this function should use only a
minimum set of free parameters, which allow the adaptation (or
``fine tuning'') of the algorithm to the local requirements:
375: \begin{equation}
376: R_{VPA}(D_k)=\big( 1+ \alpha \cdot \lambda_k^\beta \big) \;\;\;
377: \alpha,\beta >0
378: \label{RVPA}
379: \end{equation}
The parameters $\alpha$ and $\beta$ can be chosen freely. In the
limit $\lambda \rightarrow 0$ the correction factor tends to one,
so that (\ref{M2}) reproduces the existing resource distribution:
383: \begin{equation}
384: \lim_{\lambda \rightarrow 0}\hat{M} = M
385: \label{Mlimit}
386: \end{equation}
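The correction (\ref{M2}) with the factor (\ref{RVPA}) can be
sketched in Python as follows. The sketch assumes that every domain
has already been assigned the eigenvalue of the eigenquery it serves
best (domains without such an assignment get $\lambda = 0$); the
domain names, resource budgets, eigenvalue and parameter values are
hypothetical.
\begin{verbatim}
```python
def r_vpa(lam, alpha=1.0, beta=1.0):
    """VPA correction factor 1 + alpha * lam**beta, with alpha, beta > 0."""
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    return 1.0 + alpha * lam ** beta


def reallocate(resources, eigenvalues, alpha=1.0, beta=1.0):
    """Return M_hat = M * R_VPA for every domain.

    Domains without an assigned eigenvalue keep their old budget.
    """
    return {domain: budget * r_vpa(eigenvalues.get(domain, 0.0), alpha, beta)
            for domain, budget in resources.items()}


# Hypothetical crawl budgets (e.g. number of documents) and eigenvalue.
resources = {"music.example": 100.0, "news.example": 100.0}
eigenvalues = {"music.example": 0.4}
new_resources = reallocate(resources, eigenvalues, alpha=2.0, beta=1.0)
```
\end{verbatim}
In this example the domain serving a frequently asked eigenquery
receives a larger crawl budget, while the domain without an assigned
eigenvalue keeps its old budget, in agreement with the limit
(\ref{Mlimit}). Note that the total budget grows in this simple
form; a renormalization to a fixed total would be an additional
design choice.
\\ \\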
In this paper we have shown how the analysis of queries can be
used to enhance the relevant and ``most wanted'' content in a
search index. In this way the relevancy experienced by the users
of the search should grow - the users will find more of what they
are interested in. The existing system of relevancy ranking of
documents or domains can remain unchanged. The VPA does not
replace existing crawling and ranking algorithms, but
extends them by a qualitatively new component.
395: \newpage
396: {\bf Appendix}
397: \\ \\
The static rank algorithm has also become the target of spamming
(for example, ``Google bombing'' \cite{googlebombing}). This means
that webmasters are creating clusters of domains, which consist of
very similar sites, referring to a single domain or a document.
This kind of spam cluster can consist of many domains, which do
not contain any valuable content at all. Because of this the
static rank is increasingly becoming a measure of
the marketing budget or the cleverness of the webmaster of a
domain, rather than a measure of ``real'' reputation or content
quality. As a result of this development, the importance of the
static rank as a tool for determining the quality or the relevancy
of a site is decreasing.
410: \\ \\
We want to propose an algorithm which identifies this kind of
spamming. The basic idea of the static rank is reasonable - the
more important sites refer (link) to a site, the more important is
the site. There is a way to discriminate between ``naturally grown''
link clusters and ``artificial'' ones (spam).
416: \\ \\
417: In order to find a quantitative method which can discriminate
418: between these two types of link clusters, one can introduce the
419: function which describes the statistical distribution of the
420: relevancy $R^j_s$ of the links, pointing to the document $d_i$:
421: \begin{equation}
422: \phi(R^j_s) = e^{-\frac{(R^j_s-R_0)^2}{\sigma^2}}
423: \label{spamcluster}
424: \end{equation}
Here $R_0$ is the average static rank of all sites linking to the
center of this cluster $d_i$. The parameter $\sigma$ defines the
width of the distribution.
428: \\ \\
The above mentioned types of clusters can be discriminated using
the distribution $\phi$. Naturally grown clusters contain links
from an inhomogeneous set of sites. For example, the links to a
site of a well known university will come from very small
(amateur) sites of students, employees and alumni (with a low
page rank), via semi professional institutional sites (spin offs,
research partners, ...) up to sites of other high ranked
universities or institutes. An artificial link cluster consists
of automatically generated sites, each of them usually optimized
for different keywords, but having approximately the same static
rank. As a result it is possible to introduce a ``cut off''
criterion based on formula (\ref{spamcluster}). A cluster is most
likely spam, if the condition
442: \begin{equation}
443: \sigma_{spam} < \sigma_{critical}
444: \label{spamcut}
445: \end{equation}
is fulfilled. Here $\sigma_{critical}$ is an empirical parameter,
which can be determined from the analysis of known natural and
artificial clusters (or from the software generating the sites of
the spam cluster). Estimates have shown that one can expect a
result like $\sigma_{natural} \gg \sigma_{artificial}$. A short
test example can demonstrate this: the distribution of the page
ranks of sites linking to the homepage of Stephen Hawking
\cite{hawking internet}, analyzed on the basis of formula
(\ref{spamcluster}), has a width of $\sigma^2 = 1.1$, while the
sites belonging to a typical spam cluster have a page rank
distribution with $\sigma^2 = 0.5 \ldots 0.7$ \footnote{The data of
this example are based on the page rank indicator of Google's
toolbar \cite{googlebar}.}. The parameter $\sigma$ can therefore be used for
separating these two types of link clusters.
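\\ \\
The cut off criterion (\ref{spamcut}) can be sketched in Python as
follows. The width is estimated from the sample variance of the
static ranks: for a distribution of the form (\ref{spamcluster}) the
variance equals $\sigma^2/2$, so $\sigma^2$ is taken as twice the
variance. This estimator, the threshold value and both lists of
ranks are assumptions made for illustration; they are not the data
of the test example above.
\begin{verbatim}
```python
def cluster_width_squared(ranks):
    """Estimate sigma^2 for phi = exp(-(R - R0)^2 / sigma^2).

    R0 is the mean static rank of the linking sites; for this form
    of phi the variance equals sigma^2 / 2.
    """
    r0 = sum(ranks) / len(ranks)
    variance = sum((r - r0) ** 2 for r in ranks) / len(ranks)
    return 2.0 * variance


def looks_like_spam(ranks, sigma_critical_squared):
    """Apply the cut off: a narrow rank distribution suggests spam."""
    return cluster_width_squared(ranks) < sigma_critical_squared


# Hypothetical static ranks of the sites linking to a document.
natural_cluster = [1, 2, 3, 4, 5, 6, 7]
artificial_cluster = [3, 3, 4, 3, 3, 4, 3]
SIGMA_CRITICAL_SQUARED = 1.0  # assumed empirical threshold

natural_is_spam = looks_like_spam(natural_cluster, SIGMA_CRITICAL_SQUARED)
artificial_is_spam = looks_like_spam(artificial_cluster,
                                     SIGMA_CRITICAL_SQUARED)
```
\end{verbatim}
Comparing the squared widths is equivalent to comparing the widths
themselves, since both are positive. With the assumed data the
broad, inhomogeneous cluster passes the test, while the cluster of
nearly identical ranks is flagged.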
462: \newpage
463: \begin{thebibliography}{99}
\bibitem{patent} Patent pending, Reg. No. 103-19-427.7, April 29, 2003
\bibitem{denic} DENIC, www.denic.de, 24.4.2003
\bibitem{brin1} S.Brin and L.Page, The anatomy of a large-scale hypertextual Web search engine,
Computer Networks and ISDN Systems 30(1-7), p.107-117, 1998
468: \bibitem{gloeggler} M.Gl\"oggler, Suchmaschinen im Internet,
469: Springer 2003
470: \bibitem{kamvar1} S.D.Kamvar et al., Exploiting the block
471: structure of the web for computing pagerank, Stanford University
472: \bibitem{kamvar2} T.H.Haveliwala and S.D.Kamvar, The second
473: eigenvalue of the Google matrix, Stanford University
474: \bibitem{keyworddatenbank} www.keyword-datenbank.de
\bibitem{bronstein} I.N.Bronstein and K.A.Semendjajew, Handbook of
Mathematics, 1981
477: \bibitem{menczer1} F.Menczer, Links tell us about the lexical and
478: semantic Web content, cs.IR/0108004v1, 8. Aug. 2001
479: \bibitem{hawking internet} www.hawking.org.uk
480: \bibitem{googlebombing} news.bbc.co.uk/1/hi/sci/tech/1868395.stm
481: \bibitem{googlebar} toolbar.google.com/intl/de/
482: \end{thebibliography}
483: \end{document}
484: