
1: \documentstyle[a4]{article}

2: \textheight = 0.9 \textheight

\title{A new approach to relevancy in Internet searching - the ``Vox Populi Algorithm''}

4:  \author{

5: Andreas Schaale ${}^1$, Carsten Wulf-Mathies ${}^2$, S\"onke

6: Lieberam-Schmidt${}^3$

7: \\

8: \\

9: \small \it

10:  ${}^1$ Contraco Consulting and Software Ltd., Diepenseer

11: Str. 10, 15732 Waltersdorf, Germany

12: \hfill\\

13: \small \it

14:  ${}^2$ T-Online International AG, Waldstr. 3, 64331 Weiterstadt, Germany

15: \hfill\\

16: \small \it ${}^3$  Universit\"at Siegen, FB 5, H\"olderlinstr. 3,

17: 57068 Siegen, Germany}

18: \begin{document}

19: \begin{titlepage}

20: \maketitle

21: \begin{abstract}

In this paper we derive a new algorithm for Internet searching.
The main idea is to extend the existing algorithms by a component
that reflects the interests of the users more directly than
existing methods do. The ``Vox Populi Algorithm'' (VPA)
\cite{patent} creates a feedback loop from the users to the
content of the search index. The information derived from the
analysis of the users' queries is used to modify the existing
crawling algorithms: the VPA controls how the resources of the
crawler are distributed. Finally, we discuss methods of
suppressing unwanted content (spam), which is necessary for the
VPA to perform efficiently.

33: \end{abstract}

34: \end{titlepage}

The retrieval of relevant information from data sources with a
very complex structure has become a challenging task, since the
number of documents on the Internet has reached several billion.
Only a small part of them is visible in search engines. The
problem of organizing and structuring these data into catalogues
or searchable databases is of theoretical and significant
practical (commercial) interest.

42: \\ \\

43: Let us define the basic components for the mathematical

44: description of the interests of the users, the relevancy of the

45: search results and the crawling process. The users of search

46: engines express their needs for information through the queries

47: which they address to a searchable database (index) $I$. Each of

48: the $k$ queries consists of one or more keywords $q$ addressed to

49: this index. It will be presented as:

50: \begin{equation}

51: {\vec{q}_k} = (q_1, ..., q_n)_k

52:  \label{keyworddef}

53: \end{equation}

Here $n$ is the length of the query $k$. The average number of
keywords per query is $n \approx 2$ (as of 2003). The users are
searching for documents $d_j$ (HTML pages, tables, text processing
documents, pictures, multimedia files, ...) containing
information. These documents are grouped (organized) in domains
$D_k$, each representing a set of documents under a common
editorial responsibility and address (URL):

61: \begin{equation}

D_k = \bigcup_{j=1}^{n_k} d_j^{(k)} \;\;\; \mbox{$n_k$ = number of
documents in }D_k

64:  \label{domaindef}

65: \end{equation}

The number of domains is about 6.4 million in Germany \cite{denic},
and the number of documents per domain $n_k$ lies in the range
$10^{0} \ldots 10^{8}$.

69: \\ \\

Each document $d$ contains searchable information, today limited
to text information. Content that is hidden from {\it today's}
search technology in non-indexable formats (bitmaps, scripts etc.)
will be neglected here and in the following. A document is
characterized by the keywords $q$ it contains and by the position
of each keyword in certain format elements $e_i$ (metatags,
headers, tables, link text etc.):

77: \begin{equation}

78: d^{(k)} = f(q_1, q_2, ...,e_1, e_2, ...)

79:  \label{docdef}

80: \end{equation}

After the crawling and indexing process, the image $\hat{d}$ of
the document in the searchable index $I$ contains a reduced set of
information - the keywords and their positions in the format
elements $e$ of the document. When a query is addressed to the
index $I$, a ranking algorithm generates a set of documents
(links) ordered by their relevancy. In order to describe the
document ranking process, which generates the result set for each
query, one has to introduce the density $\rho$ of keywords within
the documents:

90: \begin{equation}

91: \rho^j_i = \frac{n_{q_i}}{n_{e_j}} \label{densitydef}

92: \end{equation}

93: where $n_{q_i}$ is the number of the occurrences of the keyword

94: $q_i$ in the format element $e_j$ and $n_{e_j}$ is the total

95: number of words in this format element.
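\\ \\
As a brief illustration with invented numbers (they are
assumptions for this example and are not taken from any measured
data), suppose the keyword $q_i$ occurs $n_{q_i}=3$ times in a
format element $e_j$ that contains $n_{e_j}=12$ words in total.
Definition (\ref{densitydef}) then gives
$$
\rho^j_i = \frac{3}{12} = 0.25 .
$$
The same three occurrences in a body text of 600 words would give
only $\rho = 3/600 = 0.005$, so a short element such as a title
reaches a high density much more easily than the running text.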

96: \\ \\

Today there exist two basic types of ranking algorithms - the
dynamic and the static ranking algorithms. The dynamic rank of a
document depends on two factors only - the keywords $q$ of the
query and the information content of the document. Expressed as a
``rule of thumb'': the higher the keyword density in the document,
the higher the dynamic rank of this document. The relevancy
function $R_d$, defining the dynamic rank of a document, can be
written as:

105: \begin{equation}

106: R_d(q_1) \propto  \sum_{k=1}^{N} \mu_k \; \rho^k(q_1)

107: \;\;\;\mbox{N - number of format elements} \; e

108:  \label{dynamicrank}

109: \end{equation}

for a single keyword query. The coefficients $\mu_k$ are free
parameters defining the importance or weight of each format
element. For example, the occurrence of a keyword in a URL is
usually much more important than in the text itself
($\mu_{URL}>\mu_{text}$). The rank for a query with multiple
keywords can be written as a product of the single keyword ranks:

116: \begin{equation}

R^{n}_d(q_1,q_2,...,q_n) = R^{1}_d(q_1) \cdot R^{1}_d(q_2)\cdot ... \cdot
R^{1}_d(q_n)

119:  \label{multi-dynamicrank}

120: \end{equation}

Usually these functions are modified for different purposes,
such as the suppression of unwanted information (spam). Other
modifications can take into account the freshness of the document,
the type of the format or other technical parameters.
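\\ \\
A short numerical sketch may illustrate (\ref{dynamicrank}) and
(\ref{multi-dynamicrank}). All numbers are invented for this
example. Consider only two format elements, the URL and the body
text, with the assumed weights $\mu_{URL} = 3$ and $\mu_{text} =
1$, and a document with the densities $\rho^{URL}(q_1) = 0.5$,
$\rho^{text}(q_1) = 0.02$ for the first keyword and
$\rho^{URL}(q_2) = 0$, $\rho^{text}(q_2) = 0.05$ for the second.
Then
$$
R_d(q_1) \propto 3 \cdot 0.5 + 1 \cdot 0.02 = 1.52, \;\;\;
R_d(q_2) \propto 3 \cdot 0 + 1 \cdot 0.05 = 0.05
$$
and the rank for the two keyword query is proportional to the
product $1.52 \cdot 0.05 = 0.076$. Because of the product form, a
document in which one of the keywords does not occur at all
receives the rank zero for the combined query.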

125: \\ \\

The practical work on search engines has shown that a purely
document-related, dynamic ranking algorithm is insufficient. In
order to also include the importance or the popularity of a domain
(popularity among webmasters, not necessarily among Internet
users), a new type of algorithm was invented - the static ranking
\cite{brin1}. The static rank $R_s$ of a document $d_i$ is related
to the importance of the domain in which it is located. The idea
of the static rank of a domain $D$ can be expressed symbolically
in the following form:

135: \begin{equation}

136: R_s(D) \propto \sum_{j=1}^{N_j} R_s^j

137: \label{staticrank}

138: \end{equation}

where $R_s^j$ is the static rank of the $j$-th site linking to the
domain $D$ and $N_j$ is the total number of external links
pointing to the domain. A more detailed definition of the page
rank formula is given in \cite{gloeggler}:

143: \begin{equation}

144: R_s(D) = (1-d) + d \sum_{j=1}^{N_j} R_s^j M_j^{-1}

145: \label{staticrankdetailed}

146: \end{equation}

where $d$ is a free parameter (usually in the region $d \approx 0.85$
\cite{gloeggler}) and $M_j$ is the total number of outgoing links
of the referring site. A detailed discussion of the page rank
algorithm used by Google can also be found in \cite{kamvar1} and
\cite{kamvar2}.
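\\ \\
As a numerical sketch of (\ref{staticrankdetailed}) with invented
values, take $d = 0.85$ and a domain $D$ with $N_j = 2$ linking
sites: the first with $R_s^1 = 2$ and $M_1 = 4$ outgoing links, the
second with $R_s^2 = 1$ and $M_2 = 2$. Then
$$
R_s(D) = 0.15 + 0.85 \cdot \left( \frac{2}{4} + \frac{1}{2} \right)
= 0.15 + 0.85 = 1.0
$$
If the first site instead distributed its rank over $M_1 = 40$
outgoing links, its contribution would drop from 0.5 to 0.05 and
the result would be $R_s(D) = 0.15 + 0.85 \cdot 0.55 = 0.6175$. A
link from a site with many outgoing links therefore counts for
less.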

152: \\ \\

The resulting rank of a document is a function of the dynamic
rank (\ref{dynamicrank}) and the static rank (\ref{staticrank}).
There is no unique or even optimal way of constructing this
function. A reasonable way is to choose the resulting relevancy
$R_{ds}$ as a product of the dynamic and static rank:

158: \begin{equation}

159: R_{ds} = R_d(q) \cdot R_s(d_i)  \label{ds-rank}

160: \end{equation}

In (\ref{ds-rank}), a common simplification would be to use the
domain rank $R_s(D_i)$ instead of $R_s(d_i)$. In practice,
however, the static rank of a document depends not only on the
static rank of the domain $D$ containing $d_i$, but also on its
position in the domain (link topology of the domain). At present
this kind of search algorithm is in use in every major Internet
search engine.

167: \\ \\

The algorithms described above do indeed meet the needs of the
users. This approach is reasonable from an academic point of view
and it has produced remarkable results in the past. Today it has
become more difficult to make use of the link topology - very
often the links are not set according to the content relevancy,
but for other (economic) reasons. As search engines have become
the most important information retrieval tool, they have also
become a target of spamming (site owners try to deceive the search
engines by pretending to offer more important content than they
really do). An effective method of detecting a certain type of
spam is described in the appendix. By applying filter mechanisms
and modifying the parameters of the dynamic and the static
relevancy algorithms, one can ``fine tune'' the quality of
Internet search engines.

182: \\ \\

The two methods described above do not explicitly take into
account the most important factor: the interest of the users
searching for information. The dynamic and the static relevancy of
a document are determined by the content of the site and by its
``citation'' by other sites. There is no methodical component that
reflects the voice of the people who are searching. This gap is
filled by the ``Vox Populi Algorithm'' (people's voice).

190: \\ \\

The main idea of the VPA is to use the information that can be
extracted from the analysis of user queries to enhance the quality
of the search. This can be done in two different ways, by
modifying either the ranking or the crawling algorithm. In this
paper the focus is not on the ranking, but on the crawling
algorithm. The crawling algorithm defines which domains, and how
much of their content, will be included in the search index.
Sites that are not included cannot be found even by the best
ranking algorithm. At present only a small fraction ($< 10\%$) of
Internet sites is indexed by the search engines. The much bigger
part of the Internet (``Deep Web'') is not visible in any of the
search engines.

203: \\ \\

The source of information is the analysis of the queries
$\vec{q}$, which reflect the users' interests and needs. The query
set $Q$ may contain all single and multiple keyword queries of the
users (\ref{keyworddef}). Based on these queries, a
multidimensional tensor $\Omega$ of dimension $N_{max}$ can be
defined, containing the information on the multiple keyword
correlations:

211: \begin{equation}

\dim[\Omega(Q)] = N_{max}

213:  \label{Otensor}

214: \end{equation}

Here $N_{max}$ is the maximum length of a query; in theory it is
unbounded. In practice, fewer than $1\%$ of all queries have more
than 6 keywords, while the average query consists of about $N=2$
keywords. In order to simplify the further calculations, one can
reduce the dimension of (\ref{Otensor}) in the following way:

220: \begin{equation}

221: \Omega^{N_{max}}(Q) \rightarrow \Omega^{N=2}(Q) \equiv \Omega

222:  \label{reducedOtensor}

223: \end{equation}

In this reduction, each query with more than two keywords is
replaced by the set of two keyword queries formed from all
possible pairs of its keywords. For example, a three keyword query
is treated as equivalent to three two keyword queries, and so on.
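\\ \\
As a small worked illustration of the reduction
(\ref{reducedOtensor}), using a hypothetical three keyword query
chosen only for this example, the replacement reads
$$
(\mbox{mp3}, \mbox{download}, \mbox{free}) \rightarrow
(\mbox{mp3}, \mbox{download}),\;
(\mbox{mp3}, \mbox{free}),\;
(\mbox{download}, \mbox{free})
$$
In general a query of length $n$ yields
$$
{n \choose 2} = \frac{n(n-1)}{2}
$$
pairs, that is 3 pairs for $n=3$, 6 pairs for $n=4$ and 15 pairs
for $n=6$. One plausible reading, which is an assumption of this
illustration and is not spelled out in the text, is that each such
pair contributes one count to the corresponding pair of keywords
in $\Omega$.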

228: \\ \\

The matrix $\Omega$ is a correlation matrix of all keywords of the
analyzed query set $Q$. $\Omega$ is a positive and symmetric
matrix\footnote{The analysis of the order of the keywords shows a
statistical asymmetry, $N(1,2) \neq N(2,1)$. Users interested in
the explicit order of the keywords can use the option called
``Exact Phrase'', which is available on any modern search engine.
Therefore it is reasonable to assume that the order of the
keywords is not important for the users when they make simple
queries (more than 90\% of all queries are of this type). We will
use here the approximation $N(1,2) = N(2,1)$.}. One can calculate
the eigenvectors and eigenvalues of $\Omega$, transforming it into
the diagonal form:

241: \begin{equation}

242: K^{-1} \Omega K = \Omega^{diag}

243:  \label{Otensordiag}

244: \end{equation}

245: The details of the diagonalization procedure are well known, see

246: \cite{bronstein} or any other standard textbook on mathematics. It

247: is now important to understand the practical meaning of the

248: matrices $K$ and $\Omega^{diag}$. The matrix $K$ consists of

249: eigenvectors which are keyword combinations:

250: \begin{equation}

251: K=\left(

252: \begin{array}{c}

253: \vec{e}^1 \\

254: \vec{e}^2 \\

255: \vec{e}^3 \\

256: ...

257: \end{array}

258: \right) \label{Kmatrix}

259: \end{equation}

260: where each eigenvector has the coordinates

261: \begin{equation}

262: \vec{e}^j=(c_1q_1,c_2q_2, ...)^j

263:  \label{qvector}

264: \end{equation}

similar to the definition (\ref{keyworddef}). The $q_i$ are the
keywords and the coefficients $c_i^j$ are positive numbers, giving
each keyword a ``weight'' compared to the other ones (how
frequently do the users ask for this keyword?). The coefficients
determine the relative importance of a keyword within an
eigenvector. A typical eigenvector (or better, ``eigenquery'') has
the form (based on the data of \cite{keyworddatenbank}, Aug. 2003):

272: \begin{equation}

\vec{e}^j=(\mbox{``mp3''},\;0.73\cdot \mbox{``downloads''}, \; 0.43 \cdot \mbox{``free''},
...)

275:  \label{qvectordemo}

276: \end{equation}

This eigenquery shows how the {\it average} user phrases a search
for mp3 downloads at no cost. The reduced $(N=3)$ keyword matrix
of the example above has the form \cite{keyworddatenbank}:

281: \begin{equation}

282: \Omega=

283: \begin{array}{lccc}

284:           & mp3       & download  & free     \\

285: mp3       & 37.2 \%   & 8.8 \%    & 2.7\%    \\

286: download  & 8.8 \%    & 19.2 \%   & 3.6\%    \\

287: free      & 2.7 \%    & 3.6 \%    & 13.4\%

288: \end{array}

289: \label{mp3matrix}

290: \end{equation}

291: The difference between the typical keyword search at present and

292: our approach is that the words here have different weights,

293: determining their relative importance for the users.

294: \\ \\

Further important information about the significance of keyword
combinations is contained in the matrix $\Omega^{diag}$:

297: \begin{equation}

298: \Omega^{diag}=\left(

299: \begin{array}{ccccc}

300: \lambda_1 & 0         & 0   & ...           & 0 \\

301: 0         & \lambda_2 & 0   & ...           & 0  \\

302: 0         & 0         & ... &  ...          & ...  \\

303: 0         & 0         & ... & \lambda_{N-1} & 0 \\

304:  0        & 0         & ... & 0             & \lambda_N

305: \end{array}

306: \right)

307:  \label{Omegadiag}

308: \end{equation}

309: Each eigenvalue $\lambda_i$ corresponds to an eigenvector in

310: (\ref{qvector}). The eigenvalue can be interpreted as the

311: importance of the corresponding eigenvector - it defines the

312: importance of an eigenquery for the users.
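\\ \\
To make the interpretation of (\ref{Otensordiag}) concrete, the
following hand calculation diagonalizes only the upper left $2
\times 2$ block (``mp3'', ``download'') of the example matrix
(\ref{mp3matrix}), with the entries taken in percent. Restricting
the matrix to two keywords is an assumption made here to keep the
arithmetic short; the resulting numbers are therefore not expected
to reproduce the coefficients quoted in (\ref{qvectordemo}), which
rest on the full data set.
$$
\Omega_{2} = \left(
\begin{array}{cc}
37.2 & 8.8 \\
8.8 & 19.2
\end{array}
\right), \;\;\;
\lambda^2 - 56.4\,\lambda + 636.8 = 0
$$
$$
\lambda_{1,2} = 28.2 \pm \sqrt{158.44} \approx 28.2 \pm 12.6,
\;\;\; \lambda_1 \approx 40.8, \;\;\; \lambda_2 \approx 15.6
$$
The eigenvector belonging to $\lambda_1$ follows from $(37.2 -
\lambda_1)\,x + 8.8\,y = 0$, i.e. $y/x = (\lambda_1-37.2)/8.8
\approx 0.41$:
$$
\vec{e}^{\,1} \approx (\mbox{``mp3''},\; 0.41 \cdot \mbox{``download''})
$$
In this sketch the leading eigenquery is dominated by ``mp3'',
with ``download'' entering at roughly $40\%$ of its weight, and
$\lambda_1$ is about 2.6 times larger than $\lambda_2$, so this
keyword combination would receive the higher priority.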

313: \\ \\

We now have the tools to define how a search engine can use the
information from its users to determine which content should be
enhanced or reduced in the index. Based on the described
algorithm, it is possible to identify which content is the ``most
wanted'' content and which sites deliver this type of content:

320: $$

321: (c_1q_1 + c_2q_2 + ...) \rightarrow \;\;\mbox{search engine}

322: \rightarrow \;\;\mbox{list of ranked domains}

323: $$

When crawling the Internet, the search engine gives each domain
certain resources, such as CPU time and memory in the index
(alternatively also the number of crawled documents or other
parameters, depending on the settings of the search engine).

328: \\ \\

329: The practical realization of the VPA as an extension of an

330: existing Internet search could be performed using the following

331: procedure:

332: \begin{enumerate}

\item Generate a ranking of domains by addressing the eigenqueries
(\ref{qvector}) to the existing (old) search index. The priority
of those domains is defined by the size of the corresponding
eigenvalues (\ref{Omegadiag}).
 \item Modify the existing resource ranking
list with respect to these eigenvalues.
 \item Use the newly determined ranking of the domains for crawling the Internet
 according to the modified resource
 distribution.

342:  \item Repeat the cycle.

343: \end{enumerate}

In order to determine which sites best fit the eigenqueries, it is
useful to calculate a dynamic rank for a whole domain, not just
for a single document. A simple method would be to sum the scores
of all documents in one domain:

348: \begin{equation}

349: {\bf R_D}(e_i) \propto  \sum_{k=1}^{N_D} R_d^k(e_i)

350:  \label{domainrank}

351: \end{equation}

Let us assume that the amount of resources (CPU time, number of
documents, data volume etc.) given to each domain when it is
crawled can be expressed by a function $M$, with

355: \begin{equation}

356: M=M(D_k,R_s, ...)

357:  \label{M1}

358: \end{equation}

359: In order to apply the VPA one can modify (\ref{M1}) in the

360: following way:

361: \begin{equation}

362: M\rightarrow \hat{M}=M\cdot R_{VPA}

363:  \label{M2}

364: \end{equation}

The function $R_{VPA}$ defines the VPA correction to the old
crawling algorithm. It can be constructed in different ways. The
basic requirement is that it is monotonic in the parameters
$\lambda_i$, which define quantitatively how relevant a query is
for the users. Following Occam's principle of simplicity
(Pluralitas non est ponenda sine necessitate - entities should not
be multiplied unnecessarily), this function should use only a
minimal set of free parameters, which allow the adaptation (or
``fine tuning'') of the algorithm to the local requirements:

375: \begin{equation}

376: R_{VPA}(D_k)=\big( 1+ \alpha \cdot \lambda_k^\beta \big) \;\;\;

377: \alpha,\beta >0

378:  \label{RVPA}

379: \end{equation}

The parameters $\alpha$ and $\beta$ can be chosen freely. In the
limit $\lambda \rightarrow 0$, the modified allocation (\ref{M2})
reproduces the existing one:

383: \begin{equation}

384: \lim_{\lambda \rightarrow 0}\hat{M} = M

385:  \label{Mlimit}

386: \end{equation}
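As a numerical sketch of (\ref{RVPA}), assume for illustration the
values $\alpha = 1$ and $\beta = 1/2$ and eigenvalues normalized
to the interval $0 \le \lambda_k \le 1$ (both the parameter values
and the normalization are assumptions of this example, not
recommendations). For an eigenquery with $\lambda_k = 0.25$ one
obtains
$$
R_{VPA} = 1 + 1 \cdot 0.25^{1/2} = 1.5, \;\;\; \hat{M} = 1.5 \cdot M
$$
so that a domain which was previously granted, say, 1000 crawled
documents would now be granted 1500. For a much less popular
eigenquery with $\lambda_k = 0.01$ the correction is only $R_{VPA}
= 1 + 0.1 = 1.1$, and for $\lambda_k \rightarrow 0$ it approaches
1, in agreement with (\ref{Mlimit}).
\\ \\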

In this paper we have shown how the analysis of queries can be
used to enhance the relevant and ``most wanted'' content in a
search index. In this way the relevancy experienced by the users
of the search should grow - the users will find more of what they
are interested in. The existing system of relevancy ranking of
documents or domains can remain unchanged. The VPA does not
replace existing crawling and ranking algorithms, but extends them
by a qualitatively new component.

395: \newpage

396: {\bf Appendix}

397: \\ \\

The static rank algorithm has also become the target of spamming
(for example, ``Google bombing'' \cite{googlebombing}). This means
that webmasters are creating clusters of domains, which consist of
very similar sites referring to a single domain or document.
Such a spam cluster can consist of many domains which do not
contain any valuable content at all. Because of this, the static
rank is becoming more and more a measure of the marketing budget
or the cleverness of the webmaster of a domain, rather than a
measure of ``real'' reputation or content quality. As a result of
this development, the importance of the static rank as a tool for
determining the quality or the relevancy of a site is decreasing.

410: \\ \\

We propose an algorithm that identifies this kind of spamming.
The basic idea of the static rank is reasonable - the more
important sites refer (link) to a site, the more important the
site is. There is, however, a way to discriminate between
``naturally grown'' link clusters and ``artificial'' ones (spam).

416: \\ \\

417: In order to find a quantitative method which can discriminate

418: between these two types of link clusters, one can introduce the

419: function which describes the statistical distribution of the

420: relevancy $R^j_s$ of the links, pointing to the document $d_i$:

421: \begin{equation}

422: \phi(R^j_s) = e^{-\frac{(R^j_s-R_0)^2}{\sigma^2}}

423: \label{spamcluster}

424: \end{equation}

Here $R_0$ is the average static rank of all sites linking to the
center $d_i$ of this cluster, and the parameter $\sigma$ defines
the width of the distribution.

428: \\ \\

The two types of clusters mentioned above can be discriminated
using the distribution $\phi$. Naturally grown clusters contain
links from an inhomogeneous set of sites. For example, the links
to the site of a well-known university will come from very small
(amateur) sites of students, employees and alumni (with a low
page rank), via semi-professional institutional sites (spin-offs,
research partners, ...), up to sites of other highly ranked
universities or institutes. An artificial link cluster consists
of automatically generated sites, each of them usually optimized
for different keywords, but all having approximately the same
static rank. It is therefore possible to introduce a ``cut-off''
criterion based on formula (\ref{spamcluster}). A cluster is most
likely spam if the condition

442: \begin{equation}

443: \sigma_{spam} < \sigma_{critical}

444: \label{spamcut}

445: \end{equation}

is fulfilled. Here $\sigma_{critical}$ is an empirical parameter,
which can be determined from the analysis of known natural and
artificial clusters (or from the software generating the sites of
the spam cluster). Estimates have shown that one can expect a
result like $\sigma_{natural} \gg \sigma_{artificial}$. A short
test example can demonstrate this: the distribution of the page
ranks of sites linking to the homepage of Stephen Hawking
\cite{hawking internet}, analyzed based on formula
(\ref{spamcluster}), has a width of $\sigma^2 = 1.1$, while the
sites belonging to a typical spam cluster have a page rank
distribution with $\sigma^2 = 0.5 \ldots 0.7$\footnote{The data of
this example are based on the page rank indicator of Google's
toolbar \cite{googlebar}.}. The parameter $\sigma$ can therefore
be used to separate these two types of link clusters.
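\\ \\
The width in (\ref{spamcluster}) can be estimated from a sample of
static ranks in several ways. The following sketch assumes, purely
for illustration, that (\ref{spamcluster}) is matched to a normal
distribution, so that $\sigma^2$ equals twice the sample variance:
$$
R_0 = \frac{1}{N}\sum_{j=1}^{N} R_s^j, \;\;\;
\sigma^2 = \frac{2}{N}\sum_{j=1}^{N} (R_s^j - R_0)^2
$$
For an invented set of linking sites with ranks $(3,4,4,5)$ this
gives $R_0 = 4$ and $\sigma^2 = 2\cdot(1+0+0+1)/4 = 1.0$, whereas
an invented, more uniform set $(4,4,5,5)$ gives $R_0 = 4.5$ and
$\sigma^2 = 2 \cdot (4 \cdot 0.25)/4 = 0.5$. With a hypothetical
threshold of $\sigma^2_{critical} = 0.8$, only the second set
would satisfy (\ref{spamcut}) and be flagged as a likely spam
cluster.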

462: \newpage

463: \begin{thebibliography}{99}

\bibitem{patent} Patent pending, Reg. Nb. 103-19-427.7, April 29, 2003
\bibitem{denic} DENIC, www.denic.de, 24.4.2003
\bibitem{brin1} S. Brin and L. Page, The anatomy of a large-scale hypertextual Web search engine,
Computer Networks 30(1-7), p. 107-117, 1998
\bibitem{gloeggler} M. Gl\"oggler, Suchmaschinen im Internet,
Springer, 2003
\bibitem{kamvar1} S. D. Kamvar et al., Exploiting the block
structure of the web for computing pagerank, Stanford University
\bibitem{kamvar2} T. H. Haveliwala and S. D. Kamvar, The second
eigenvalue of the Google matrix, Stanford University
\bibitem{keyworddatenbank} www.keyword-datenbank.de
\bibitem{bronstein} I. N. Bronstein and K. A. Semedjajew, Handbook of
Mathematics, 1981
\bibitem{menczer1} F. Menczer, Links tell us about the lexical and
semantic Web content, cs.IR/0108004v1, 8 Aug. 2001
\bibitem{hawking internet} www.hawking.org.uk
\bibitem{googlebombing} news.bbc.co.uk/1/hi/sci/tech/1868395.stm
\bibitem{googlebar} toolbar.google.com/intl/de/

482: \end{thebibliography}

483: \end{document}
