
1: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

2: % INTRODUCTION

3: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

4: \section{Introduction}

5: \label{intro}

6:

7: There are an estimated 1 billion pages accessible on the world wide web with

8: 1.5 million pages being added daily.

9: Describing and organizing this vast amount of content is essential

10: for realizing the web's full potential as an information resource.

11: Accomplishing this in a meaningful way will require consistent

12: use of metadata and other descriptive data structures such as

13: semantic linking\cite{bernerslee}.

14: Categorization is an important ingredient, as is

15: evident from the popularity of web

16: directories such as Yahoo!\cite{yahoo}, Looksmart\cite{looksmart}, and the

17: Open Directory Project\cite{dmoz}.  However, these resources have been

18: created by large teams of human editors

19: and represent only one type of classification scheme that, while widely

20: useful, can never be suitable to all applications.  Classification is a

21: fundamental intellectual task, and we take it as an

22: axiom that it is important and indeed essential for

23: organizing and understanding web content.

24:

25: Automated classification is needed for at least two important reasons.

26: The first is the sheer scale of resources available on the web and their

27: ever-changing nature.  It is simply not feasible to keep up with

28: the fast pace of growth and change on the web

29: through a manual classification effort

30: without expending immense time and effort.

31: The second reason is that classification itself is a subjective

32: activity.  Different classification schemes are needed for different

33: applications.  No single classification scheme is suitable for

34: all applications.  Therefore different types of classification schemes,

35: representing different facets of knowledge, may need to be applied

36: in an ongoing fashion as new applications demand them.

37: Domain specific classification

38: schemes, which can be quickly applied to large amounts of content using

39: automated methods, hold great

40: promise for generating effective metadata.

41:

42: Classification should be considered within the larger context of

43: subject-based metadata.  Specific fields in metadata records often

44: correspond to different classification schemes.

45: The effective use of rich metadata will be important for establishing

46: and leveraging the power of the semantic web.  If web content shifts

47: from primarily text-based to primarily multimedia oriented,

48: metadata will become even more important.  Structured metadata can

49: serve as a driver for many applications such as knowledge based

50: search and retrieval, reasoning engines, intelligent agents,

51: and multi-faceted organization of information.  However metadata

52: creation can be tedious and time consuming.  Automated methods, such

53: as the one described in this paper, can be useful for facilitating

54: metadata creation.

55:

56: In this paper we discuss some practical issues for applying methods of

57: automated classification to web content.  Rather than take a

58: one-size-fits-all approach, we advocate the use of targeted

59: classification tasks relevant to solving specific problems.

60: In section \ref{theweb} we discuss the nature of web content

61: and its implications for automated categorization.

62: Extracting good features that can accurately discriminate between

63: different categories is an important part of any text categorization system.

64: While it is possible and desirable to exploit metadata in the

65: current web environment, we find that its use is far from widespread.

66: In section \ref{setup} we describe a specialized system for

67: automatically classifying

68: web sites into industry categories.   This system

69: can serve as a generalized framework for efficient automated

70: categorization of web content that includes targeted spidering,

71: domain specific classification, and a trainable general purpose

72: text categorization engine.

73: In section \ref{results} we present the results of our controlled

74: experiments.  We show how text features extracted from different parts of

75: web pages affect classification accuracy, and demonstrate that metatags

76: provide the best results.  We also compare the use of training data

77: obtained from a different domain versus training data drawn from the

78: target domain.  We find that training examples taken from the

79: content to be classified give better results, but using training data

80: from a different domain can suffice in cases where assembling new

81: data from scratch is not feasible.

82: Related work is discussed in section \ref{relatedwork}.

83: In section \ref{conclusions} we state our conclusions and make

84: suggestions for further research.

85:

86: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

87: % THE WEB

88: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

89: \section{Text Categorization of Web Content}

90: \label{theweb}

91:

92: The current state of the web differs markedly from the vision of the

93: semantic web as outlined by Tim Berners-Lee\cite{bernerslee}.

94: While web content is machine

95: readable for the most part\footnote{The trend toward multimedia assets

96: puts the future of this assumption in some doubt, but dealing with

97: the problem of non-text information is beyond the scope of this paper.},

98: it is far from machine understandable.

99: Furthermore, the ability of computers to understand written human

100: language is still quite limited.  Therefore,

101: in this work we have adopted a text categorization approach that

102: relies heavily on word-based indexing and statistical classification,

103: rather than

104: sophisticated natural language processing and knowledge-based inferencing.

105: This approach is capable of giving very good results in a way that is

106: robust and makes few assumptions about the content to be analyzed. This

107: is an important consideration given the heterogeneous nature of web content.

108:

109: One of the main challenges with classifying web pages is the

110: wide variation in their content and quality.

111: Most text categorization

112: methods rely on the existence of good quality texts, especially for

113: training\cite{lewis92}.

114: Unlike many of the well-known collections typically studied in

115: automated text classification experiments (e.g., TREC, Reuters-21578, OHSUMED),

116: the web lacks homogeneity and regularity.

117: To make matters worse,

118: much of the existing web page content is based in images,

119: plug-in applications, or other non-text media.  The usage of metadata

120: is inconsistent or non-existent.  In this section we survey

121: the landscape of web content, and its relation to the

122: requirements of text categorization systems.

123:

124: \subsection{Analysis of Web Content}

125:

126: In an attempt to characterize the nature of the content to be

127: classified, we performed a rudimentary quantitative analysis.

128: Our results were obtained by analyzing a collection of 29,998

129: web domains obtained from a random dump of the database

130: of a well-known domain name registration company.

131: Of course, these results

132: reflect the biases of our small samples and do not necessarily generalize to

133: the web as a whole; however, they should be reflective of the issues

134: at hand.  Since our classification method is text based, it is important

135: to know the amount and quality of the text based features that typically

136: appear in web sites.  Existing standards for web content tend to be

137: \textit{de facto} and loosely enforced if at all.

138: One convention that holds for the vast majority of web sites is that

139: the top level entry point is an HTML web page, so we take this to be our

140: primary source of text features.

141: Besides the body text

142: which is generally free form in a typical HTML page, it is

143: common to include a title and possibly a set of keywords and description

144: metatags.   Web page metadata should be one of the more promising

145: sources of text features.

146:

147: In Table \ref{metawords} we show the percentage of web sites with a certain

148: number of words for each type of metatag.

149: We analyzed a sample of 19,195 domains with live web sites and counted

150: the number of words used in the content attribute of the

151: \texttt{<META name=``keywords''>} and \texttt{<META name=``description''>}

152: tags as well as \texttt{<TITLE>} tags.  We also counted free text

153: found within the \texttt{<BODY>} tag, excluding all other HTML tags.

154:

155: \begin{table*}[!hp]

156: \caption{Percentage of Web Pages with Words in HTML Tags}

157: \label{metawords}

158: \begin{center}

159: \begin{tabular}{crrrr}

160: \hline

161: Tag Type & 0 words & 1-10 words & 11-50 words & 51+ words \\

162: \hline

163: Title & 4\% & 89\% & 6\% & 1\% \\

164: Meta-Description & 68\% & 8\% & 21\% & 3\% \\

165: Meta-Keywords & 66\% & 5\% & 19\% & 10\% \\

166: Body Text & 17\% & 5\% & 21\% & 57\% \\

167: \hline

168: \end{tabular}

169: \end{center}

170: \end{table*}

171:

172: The most obvious source of text is within the body of the web page.

173: We noticed that about 17\% of top level web pages had no usable body

174: text.  These cases include pages that only contain frame sets,

175: images, or plug-ins (our user agent followed redirects whenever

176: possible).  About a fifth of web pages (21\%) contained 11-50 words,

177: and the majority (57\%) contained over 50 words.

178:

179: Though title tags are common, the amount of text is relatively small, with

180: 89\% of the titles containing only 1-10 words.

181: Also, the titles often contain only names or terms such as

182: ``home page'', which are not particularly helpful for subject classification.

183:

184: Metatags for keywords and descriptions are used by several major search

185: engines, where they play an important role in the ranking and

186: display of search results.  Despite this, only about a third of

187: web sites were found to contain these tags.

188: As it turns out, metatags can be useful when they exist

189: because they contain text specifically intended to aid in the

190: identification of a web site's subject areas\footnote{The possibilities

191: for misuse/abuse of these tags to improve search engine rankings are well

192: known; however, we found these practices to be not very widespread in our

193: sample and of little consequence.}.  Most of the time these metatags

194: contained between 11 and 50 words, with a smaller percentage containing

195: more than 50 words (in contrast to the number of words in the body

196: text which tended to contain more than 50 words).
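
The figures quoted above can be read off Table \ref{metawords}.  As a rough check (simple arithmetic on the rounded table entries, so the results are approximate), the share of sites with a non-empty tag is
\[
100\% - 68\% = 32\% \ \mbox{(description)}, \qquad
100\% - 66\% = 34\% \ \mbox{(keywords)},
\]
which corresponds to the ``about a third'' cited in the text.  Restricting attention to the sites that actually use each tag, the fraction whose tag holds 11-50 words is approximately
\[
\frac{21}{32} \approx 0.66 \ \mbox{(description)}, \qquad
\frac{19}{34} \approx 0.56 \ \mbox{(keywords)},
\]
while the fraction with more than 50 words is roughly $3/32 \approx 0.09$ and $10/34 \approx 0.29$, respectively.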

197:

198: The lack of widespread use of metatags, despite the apparent incentive

199: to improve search engine rankings, is instructive.  Since metadata

200: is usually not part of the presentation of the content and its

201: benefit is somewhat intangible, it tends to be neglected.  Creating metadata

202: can be a tedious and unwelcome task.  Therefore methods to facilitate the

203: creation of quality metadata, especially automated methods, are greatly

204: needed.

205:

206: \subsection{Good Text Features}

207: \label{goodfeatures}

208:

209: Feature selection is an important part of building an automated

210: classification system.  Without a proper set of features, the

211: classifier will not be able to accurately

212: discriminate between different categories.

213: The feature set must be sufficiently broad to accommodate the wide

214: variations that can occur even within instances of the same class.  On the

215: other hand the number of features needs to be constrained to reduce noise

216: and to limit the burden on system resources.

217:

218: In reference\cite{lewis92} it is argued that for the purposes of automated

219: text categorization, features should be:

220: \begin{enumerate}

221: \item Relatively few in number

222: \item Moderate in frequency of assignment

223: \item Low in redundancy

224: \item Low in noise

225: \item Related in semantic scope to the classes to be assigned

226: \item Relatively unambiguous in meaning

227: \end{enumerate}

228:

229: Due to the wide variety of purpose and scope of current web content,

230: items 4 and 5 are difficult requirements to meet for most

231: classification tasks.  For subject

232: classification, metatags seem to meet those requirements better

233: than other sources of text such as titles and body text.  However

234: the lack of widespread use of metatags is a problem if

235: coverage of the majority of web content is desired.  In the long term,

236: automated categorization would benefit if greater

237: attention were paid to the creation and usage of rich metadata and

238: explicit semantic structures,

239: especially if the above requirements are taken into consideration.

240: In the short term, one must implement a strategy for obtaining

241: good text features from the existing HTML and natural language

242: cues that takes the above requirements as well as the goals

243: of the classification task into consideration.  Techniques for shallow

244: parsing and information extraction are useful in this regard.

245:

246: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

247: % SETUP

248: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

249: \section{Experimental Setup}

250: \label{setup}

251:

252: We constructed a full scale automated classification system and

253: performed several experiments using real world data in order to

254: gauge system performance and test ideas.

255: The goal of our targeted domain specific task was to

256: rapidly classify web sites (domain names)

257: into broad industry categories. In this section we describe the

258: main ingredients of our classification experiments including the data,

259: architecture, and evaluation measures.

260:

261: \subsection{Classification Scheme}

262:

263: The categorization scheme used was the

264: top level of the 1997 North American Industry Classification System

265: (NAICS) \cite{naics}, which consists of 21 broad industry categories

266: shown in Table \ref{tnaics}.

267:

268: \begin{table*}[!htbp]

269: \caption{Top level NAICS Categories}

270: \label{tnaics}

271: \begin{center}

272: \begin{tabular}{cl}

273: \hline

274: NAICS code & NAICS Description \\

275: \hline

276: 11 & Agriculture, Forestry, Fishing, and Hunting \\

277: 21 & Mining \\

278: 22 & Utilities \\

279: 23 & Construction \\

280: 31-33 & Manufacturing \\

281: 42 &  Wholesale Trade \\

282: 44-45 &  Retail Trade \\

283: 48-49 &  Transportation and Warehousing \\

284: 51 &  Information \\

285: 52 &  Finance and Insurance \\

286: 53 &  Real Estate and Rental and Leasing \\

287: 54 &  Professional, Scientific and Technical Services \\

288: 55 & Management of Companies and Enterprises \\

289: 56 &  Administrative and Support, \\

290:    &  Waste Management and Remediation Services \\

291: 61 & Educational Services \\

292: 62 & Health Care and Social Assistance \\

293: 71 & Arts, Entertainment and Recreation \\

294: 72 & Accommodation and Food Services \\

295: 81 & Other Services (except Public Administration) \\

296: 92 & Public Administration \\

297: 99 & Unclassified Establishments \\

298: \hline

299: \end{tabular}

300: \end{center}

301: \end{table*}

302:

303: Some of our resources had been previously classified using the older

304: 1987 Standard Industrial Classification (SIC) system.  In these cases

305: we used the published mappings\cite{naics} to convert all

306: assigned SIC categories to their NAICS equivalents.  The full

307: NAICS has six levels of hierarchy and contains

308: several thousand subcategories.  For our experiments all lower level

309: NAICS subcategories were generalized up to the appropriate

310: top level category (though the entire classification scheme could

311: have been utilized by our system if a finer grained categorization

312: was desired).

313:

314: NAICS and SIC are examples of authoritative controlled vocabularies.

315: Using a published, standardized classification scheme

316: takes advantage of the many person-hours invested in

317: constructing it.  In addition, it may be possible to

318: take advantage of existing content already classified by the scheme as

319: a source of training data.

320:

321: \subsection{Targeted Spidering}

322: \label{spider}

323:

324: Based on the results of section \ref{theweb}, it is obvious that selection

325: of adequate text features is an important issue and certainly

326: not to be taken for granted.  To balance the

327: needs of our text-based classifier against the speed and storage limitations of

328: a large-scale crawling effort, we took an approach for spidering

329: web sites and gathering text that was targeted to the classification task

330: at hand.

331:

332: In some preliminary tests we found the best classifier accuracy

333: was obtained by using only the contents of the keywords and

334: description metatags as the source of text features.  Adding

335: body text decreased classification accuracy.  However, due to

336: the lack of widespread usage of metatags, limiting ourselves

337: to these features was not practical, and other sources of

338: text such as titles and body text were needed to provide

339: adequate coverage of web sites.  Therefore our targeted spidering approach

340: attempted to gather the higher quality text features from metatags

341: and only resorted to lower quality texts if needed.

342:

343: Our opportunistic spider began at the top level page of the web site

344: and attempted to extract useful text from metatags and titles

345: if they existed, and then followed links for frame sets if present.

346: It also followed any hyperlinks

347: that contained key substrings in their anchor text

348: such as \emph{product}, \emph{services},

349: \emph{about}, \emph{info}, \emph{press}, and \emph{news}, and again

350: looked for metatag content in those pages.

351: These substrings were chosen based on

352: an \emph{ad hoc} frequency analysis and the assumption that they tend to

353: point to content that is useful for deducing an industry classification.

354: Only if no metatag content was found did the spider

355: gather the actual body text of the web page.  All extracted text was

356: concatenated into a single representative document for the site

357: that was submitted to the classification engine.

358: For efficiency we ran several spiders in parallel, each working

359: on different lists of individual domain names.

360:

361: By following a restricted set of hyperlinks,

362: we were attempting to take advantage of the

363: current web's \emph{implicit} semantic structure.

364: One of the advantages of moving towards an \emph{explicit} semantic

365: structure for hypertext documents\cite{bernerslee} is that an

366: opportunistic spidering

367: approach could really benefit from a formalized description of the

368: semantic relationships between linked web pages.  This would allow

369: spiders to more easily find the most relevant resources without having to

370: crawl the entire network of the web.

371:

372: \subsection{Test Data}

373:

374: From our initial list of 29,998 domain names we used our targeted spider

375: to determine which sites were live and extracted

376: text features using the approach outlined in section \ref{spider}.

377: Of those, 13,557 domain names had usable text content and were pre-classified

378: according to one or more industry categories\footnote{Industry classifications

379: for domain names were provided by InfoUSA and Dun \& Bradstreet.}.  From

380: this set of data we drew samples for training, testing and validation.

381:

382: \subsection{Training Data}

383: \label{ts}

384:

385: We took two approaches to constructing training sets for our

386: classifiers.  In the

387: first approach we used a combination of 426 NAICS category labels

388: (including subcategories) and 1504 U.S. Securities and Exchange Commission

389: (SEC) 10-K filings\footnote{SEC 10-K filings are annual reports

390: required of all U.S. public companies that describe business

391: activities for the year.  Each public company is also

392: assigned an SIC category.}

393: for public companies\cite{dolin99} as training examples.

394: In the second approach we used a set of 3618 pre-classified

395: domain names along with text for each domain obtained using our spider.

396:

397: The first approach can be considered as using ``prior knowledge''

398: obtained in a different domain.  It is interesting to see how knowledge from

399: a different domain generalizes to the problem of classifying web sites.

400: Furthermore it is

401: often the case that training examples can be difficult to obtain (thus

402: the need for an automated solution in the first place).  The

403: second approach is the more conventional classification by example.

404: In our case it was made possible by the fact that our database

405: of domain names was pre-classified according to one or more industry categories.

406:

407:

408: \subsection{Classifier Architecture}

409:

410: Our text classifier consisted of three modules: the targeted spider for

411: extracting text features associated with a web site,

412: an information retrieval engine for comparing queries to

413: training examples, and a decision algorithm for assigning categories.

414:

415: Our spider was designed to quickly process a large database of

416: top level web domain names (e.g. domain.com, domain.net, etc.).

417: As described in section \ref{spider} we implemented an opportunistic

418: spider targeted to finding high quality text from pages that described

419: the business area, products, or services of a commercial web site.

420: After accumulating text features, a query was submitted to the

421: text classifier.  The domain name and any automatically

422: assigned categories were logged in a central database.

423: Several spiders could be run in parallel for efficient use of system

424: resources.

425:

426: Our information retrieval engine was based on Latent Semantic Indexing

427: (LSI)\cite{lsi}.  LSI is a variation of the vector space model of

428: information retrieval that uses the technique of singular value

429: decomposition (SVD) to reduce the dimensionality of the vector space.

430: Words that tend to co-occur in the same document share large projections

431: along directions in the reduced space.  Theoretically this reduces

432: noise due to redundant or spurious word usage, and automatically

433: derives relationships

434: between words and the inherent concepts.  Cosine similarity is computed

435: in the reduced vector space, which amounts to concept based matching rather

436: than word based.  For example queries containing the word ``car'' will

437: match documents containing only the word ``automobile'' provided the

438: relationship between the words and concept has been established in the corpus.
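
For concreteness, the standard LSI formulation can be sketched as follows.  The notation ($X$, $U_k$, $\Sigma_k$, $V_k$, $k$, $\hat{q}$, $\hat{d}_j$) is introduced here for illustration only and is not a description of the specific configuration of the engine; scaling conventions in particular vary between implementations.  Let $X$ be the term-by-document matrix of the training corpus.  Keeping only the $k$ largest singular values of its SVD gives the rank-$k$ approximation
\[
X \approx X_k = U_k \Sigma_k V_k^{T} .
\]
A query with term vector $q$ is mapped into the reduced space by
\[
\hat{q} = \Sigma_k^{-1} U_k^{T} q ,
\]
and training document $j$ is represented by $\hat{d}_j$, the $j$-th row of $V_k$ written as a column vector.  The match score is then the cosine
\[
\cos(\hat{q}, \hat{d}_j) = \frac{\hat{q} \cdot \hat{d}_j}{\|\hat{q}\| \, \|\hat{d}_j\|} .
\]
Because terms that co-occur load on the same reduced dimensions, $\hat{q}$ and $\hat{d}_j$ can have a high cosine even when the original term vectors share no words.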

439:

440: In a previous work\cite{dolin99} it was shown that LSI provided better

441: accuracy with fewer training set documents per category than standard

442: TF-IDF weighting.  Queries were compared to training

443: set documents based on their cosine similarity, and a ranked list of

444: matching documents and scores was forwarded to the decision module.

445:

446: In the decision module,

447: we used a K-nearest neighbor algorithm for ranking categories and

448: assigned the top ranking category to the web site.  This type of classifier

449: tends to perform well compared to other methods\cite{yang}, is robust,

450: and tolerant of noisy data (all are important qualities when dealing with

451: web content).  In addition the algorithm is capable of producing good

452: results even when the amount of training data is limited.  The decision

453: module also is responsible for thresholding and presenting the final

454: set of automatically assigned categories.
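
One common way to realize a decision rule of this kind is similarity-weighted voting.  The following sketch is an assumption made for illustration, since the exact scoring and thresholding rules are not spelled out above; the symbols $N_K(q)$, $y(d,c)$, $\mathrm{sim}$, and $\tau$ are introduced here.  Let $N_K(q)$ be the $K$ training documents most similar to the query $q$, and let $y(d,c) = 1$ if training document $d$ carries category $c$ and $0$ otherwise.  Each category is scored by
\[
\mathrm{score}(c \mid q) = \sum_{d \in N_K(q)} \mathrm{sim}(q, d) \, y(d, c) ,
\]
and the top ranking category is
\[
\hat{c} = \arg\max_{c} \, \mathrm{score}(c \mid q) .
\]
Under this sketch, thresholding amounts to reporting a category only when $\mathrm{score}(c \mid q) \geq \tau$ for some cutoff $\tau$, which trades recall for precision as $\tau$ is raised.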

455:

456: \subsection{Evaluation Measures}

457:

458: System evaluation was carried out using the standard precision,

459: recall, and F1 measures\cite{rijsbergen}\cite{lewis91}.

460: Precision is the number of correct categories assigned divided

461: by the total number of categories assigned, and serves as a measure

462: of classification accuracy.  The higher the precision, the fewer the

463: false positives.  Recall is the number of correct categories

464: assigned divided by the total number of known correct categories.

465: Higher recall means fewer missed categories.  In theory,

466: scores of 1 are desirable for both precision and recall.  In practice

467: even human assigned classifications may only achieve scores between

468: 0.7 and 0.9, depending on the classification task.  This is because

469: to some extent classification is a subjective task and there are

470: usually ``grey areas'' in a classification scheme.
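
In symbols, these verbal definitions can be written as follows.  The counts $\mathrm{TP}$, $\mathrm{FP}$, and $\mathrm{FN}$ are notation introduced here for illustration: $\mathrm{TP}$ is the number of assigned categories that are correct, $\mathrm{FP}$ the number of assigned categories that are incorrect, and $\mathrm{FN}$ the number of known correct categories that were not assigned.  Then
\[
P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} , \qquad
R = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} ,
\]
so that $P = 1$ exactly when there are no false positives and $R = 1$ exactly when no correct category is missed.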

471:

472: The F1 measure combines precision and recall with equal importance

473: into a single parameter for optimization and is defined as

474: \begin{equation}

475: F1 = \frac{2 P R}{P + R}

476: \end{equation}

477: where $P$ is precision and $R$ is recall.
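
As a quick numerical check of this definition, take the metatag-only row of Table \ref{ptf}, with $P = 0.64$ and $R = 0.39$:
\[
F1 = \frac{2 \times 0.64 \times 0.39}{0.64 + 0.39} = \frac{0.4992}{1.03} \approx 0.48 ,
\]
which agrees with the tabulated value.  Because F1 is the harmonic mean of $P$ and $R$, it is pulled toward the smaller of the two, here recall.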

478:

479: We computed global estimates

480: of system performance using both micro-averaging (results are computed

481: based on global sums over all decisions) and

482: macro-averaging (results are computed on a per-category basis,

483: then averaged over categories).  Micro-averaged

484: scores tend to be dominated by the most commonly used categories,

485: while macro-averaged scores tend to be dominated by the performance

486: in rarely used categories.  This distinction was relevant to our problem,

487: because it turned out that the vast majority of commercial web sites

488: are associated with the Manufacturing (31-33) category.
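
The two averaging schemes can be written explicitly for precision; recall is analogous, with false negatives in place of false positives.  The notation is introduced here for illustration: $\mathcal{C}$ is the set of categories, and $\mathrm{TP}_c$ and $\mathrm{FP}_c$ are the numbers of correct and incorrect assignments of category $c$.
\[
P_{\mathrm{micro}} = \frac{\sum_{c \in \mathcal{C}} \mathrm{TP}_c}{\sum_{c \in \mathcal{C}} (\mathrm{TP}_c + \mathrm{FP}_c)} , \qquad
P_{\mathrm{macro}} = \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c} .
\]
A hypothetical two-category example (the counts are invented for illustration and are not experimental data) shows how far the two can diverge.  Suppose a common category has $\mathrm{TP} = 90$, $\mathrm{FP} = 10$ and a rare category has $\mathrm{TP} = 1$, $\mathrm{FP} = 9$.  Then
\[
P_{\mathrm{micro}} = \frac{90 + 1}{100 + 10} \approx 0.83 , \qquad
P_{\mathrm{macro}} = \frac{0.90 + 0.10}{2} = 0.50 .
\]
The common category dominates the micro average, while the rare category counts equally in the macro average.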

489:

490: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

491: % RESULTS

492: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

493: \section{Results}

494: \label{results}

495:

496: In our first experiment we varied the sources of text features

497: for 1125 pre-classified web domains.  We

498: constructed separate test sets

499: based on text extracted from the body text, metatags

500: (keywords and descriptions),

501: and a combination of both.  The training set consisted of SEC documents

502: and NAICS category descriptions.

503: Results are shown in Table \ref{ptf}.

504:

505: \begin{table}[!htbp]

506: \caption{Performance vs. Text Features}

507: \label{ptf}

508: \begin{center}

509: \begin{tabular}{cccc}

510: \hline

511: Sources of Text & micro P & micro R & micro F1 \\

512: \hline

513: Body & 0.47 & 0.34 & 0.39 \\

514: Body + Metatags & 0.55 & 0.34 & 0.42 \\

515: Metatags & 0.64 & 0.39 & 0.48 \\

516: \hline

517: \end{tabular}

518: \end{center}

519: \end{table}

520:

521: Using metatags as the only source of text features resulted in

522: the most accurate classifications.  Precision decreased noticeably

523: when only the body text was used.  It is interesting that including

524: the body text along with the metatags also resulted in less accurate

525: classifications.  These results influenced the design of our spider

526: which extracted metatags first and foremost, while only grabbing

527: body text as a last resort.

528: The usefulness of metadata as a source of high quality

529: text features should not be surprising since it meets most of the

530: criteria listed in section \ref{goodfeatures}.

531:

532: In our second experiment we compared classifiers constructed from

533: the two different training sets described in section \ref{ts}.

534: The results are shown in Table \ref{pts}.

535:

536: \begin{table*}[!htbp]

537: \caption{Performance vs. Training Set}

538: \label{pts}

539: \begin{center}

540: \begin{tabular}{ccccccc}

541: \hline

542: Classifier & micro P & micro R & micro F1 & macro P & macro R & macro F1\\

543: \hline

544: SEC-NAICS & 0.66 & 0.35 & 0.45 & 0.23 & 0.18 & 0.09 \\

545: Web Pages & 0.71 & 0.75 & 0.73 & 0.70 & 0.37 & 0.40 \\

546: \hline

547: \end{tabular}

548: \end{center}

549: \end{table*}

550:

551: The SEC-NAICS training set achieved respectable micro-averaged scores,

552: but the macro-averaged scores were low.  One reason for this is that

553: this classifier generalizes well in categories that are

554: common to the business and web domains (31-33, 23, 51),

555: but has trouble with recall in

556: categories that are not well represented in the business domain

557: (71, 92) and poor precision in categories that are not as common in the web

558: domain (54, 52, 56).

559:

560: The training set constructed from web site text performed better

561: overall.  Macro-averaged recall was much lower than micro-averaged

562: recall.  This can be partially explained by the following example.

563: The categories Wholesale Trade (42) and Retail Trade (44-45) have

564: a subtle difference especially when it comes to web page text

565: which tends to focus on products and services delivered rather

566: than the Retail vs. Wholesale distinction.  In our training set, category

567: 42 was much more common than 44-45, and the former tended to be assigned

568: in place of the latter, resulting in low recall for 44-45.  Other

569: rare categories also tended to have low recall (e.g. 23, 56, 81).

570:

571: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

572: % RELATED WORK

573: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

574: \section{Related Work}

575: \label{relatedwork}

576:

577: Some automatically-constructed, large-scale web directories have

578: been deployed as commercial services, such as

579: Northern Light\cite{northernlight},

580: Inktomi Directory Engine\cite{inktomi}, and Thunderstone

581: Web Site Catalog\cite{thunderstone}.  Details about these

582: systems are generally unavailable because of their proprietary

583: nature.  It is interesting that these directories tend not to

584: be as popular as their manually constructed counterparts.

585:

586: A system for automated discovery and classification of domain

587: specific web resources is described as part of the DESIRE II

588: project\cite{desire1}\cite{desire2}.  Their classification

589: algorithm weights terms from metatags higher than titles and

590: headings, which are weighted higher than plain body text.

591: They also describe the use of classification software as a

592: topic filter for harvesting a subject specific web index.

593: Another system, Pharos (part of the Alexandria Digital

594: Library Project), is a scalable

595: architecture for searching heterogeneous information sources

596: that leverages the use of metadata\cite{dolin96} and

597: automated classification\cite{dolin98}.

598:

The hyperlink structure of the web can be exploited for automated
classification by using the anchor text and other context
from linking documents as a source of text features\cite{attardi}.
Approaches to efficient web spidering\cite{cho}\cite{rennie} have
been investigated and are especially important for very large-scale
crawling efforts.

A complete system for automatically building searchable databases of
domain-specific web resources, using a combination of
techniques such as automated classification, targeted spidering, and
information extraction, is described by McCallum et al.\cite{mccallum}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% CONCLUSIONS
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Conclusions}
\label{conclusions}

Automated methods of knowledge discovery, including classification,
will be important for establishing the semantic web.
Classification is a basic intellectual task and is challenging to
automate because of its somewhat subjective nature.  However, it is possible
to achieve results with automated methods that match or exceed manual results.

A single classification
scheme can never be adequate for all applications.
We advocate a pragmatic approach in which targeted techniques and
specialized domain knowledge are
applied to specific classification tasks.  The result is an
efficient and optimized system for the task at hand.
In this paper we described a practical system for automatically
classifying web sites into industry
categories that gives good results.  This type of system can
be applied to any domain-specific classification scheme.  All that is needed
is to define the categories, assemble the training data,
and configure the spider to extract the appropriate features.  The spider
may be constructed to follow specific types of links, or to extract the sections
of web page content that are most useful for a given domain.

From the results in Table~\ref{ptf}
we concluded that metatags were the best source
of quality text features, at least compared to the body text.  However,
by limiting ourselves to metatags we would not be able to classify the
large majority of web sites.  Therefore we opted for a targeted spider
that extracted metatag text first, looked for pages that
described business activities, and
then fell back to other text only
if necessary.  It seems clear that text contained in structured
metadata fields results in better automated categorization.  If the
web moves toward a more formal semantic structure as outlined by
Tim Berners-Lee\cite{bernerslee}, then automated methods can benefit.
If more and different kinds of automated
classification tasks can be accomplished more accurately, the
web can be made more useful as well as more usable.
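
The fallback order of the targeted spider described above can be
summarized schematically.  The symbols below are introduced here only
for illustration and are not part of the system description, and the
strict fallback shown is one possible reading; the sources could also
be combined.  For a web site $s$, let $M(s)$ denote the text extracted
from metatags, $B(s)$ the text of pages that describe business
activities, and $O(s)$ any other available text.  The text $F(s)$
passed to the classifier is then
\[
F(s) = \left\{
\begin{array}{ll}
M(s) & \mbox{if } M(s) \neq \emptyset, \\
B(s) & \mbox{if } M(s) = \emptyset \mbox{ and } B(s) \neq \emptyset, \\
O(s) & \mbox{otherwise.}
\end{array}
\right.
\]
Under this reading, the lower-quality sources are consulted only when
the higher-quality ones are missing.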

Rich metadata for web content is a key to better searching,
better organization and management of content, and improved
intelligent agents capable of discovering and acting
upon the knowledge embedded in the vast online resources.
However, as we have shown,
creation of metadata remains a bottleneck despite strong
incentives such as better rankings in search engine results.  It
seems that the only way to ensure widespread use of quality metadata
is to make the process of metadata creation as painless as possible.
Automated methods that can reliably and accurately generate metadata
from existing content hold much promise in this regard.  Furthermore,
metadata needs to be multi-faceted, current, and extensible.  Only
automated systems can keep pace with the rate of generation of new
web content that we see today.

We outline our basic approach
for building a targeted automated categorization solution for web
content:
\begin{itemize}
\item \textbf{Knowledge Gathering} - It is important to have a
clear understanding
of the domain to be classified and the quality of the content involved.  The
web is a heterogeneous environment, but within given domains patterns and
commonalities can emerge.  Taking advantage of specialized knowledge can
improve classification results.
\item \textbf{Targeted Spidering} - For each classification task
different features
will be important.  However, due to the lack of homogeneity in web content,
the existence of key features can be quite inconsistent.  A targeted spidering
approach tries to gather as many key features as possible with as
little effort as possible.  In the future this type of approach can
benefit greatly from a web structure that encourages the use of
metadata and semantically typed links.  It would be interesting to
do a more detailed analysis of semantic spidering and its effect on
system performance.
\item \textbf{Training} - The best training data comes from the
domain to be classified, since that gives the best chance
of identifying the key features.  In cases where it is
not feasible to assemble enough training data in the target domain,
it may be possible to achieve acceptable results using training data
gathered from a different domain.  This can be true for web content,
which can be unstructured, uncontrolled, and immense, making it difficult
to assemble quality training data.  However, controlled
collections of pre-classified electronic documents can be obtained
in many important domains (financial, legal, medical, etc.) and
applied to automated categorization of web content.
\item \textbf{Classification} - In addition to being
as accurate as possible, the classification method needs to
be efficient, scalable, robust, and tolerant of noisy data.  Classification
algorithms that utilize the link structure of the web, including
formalized semantic linking structures, should be further investigated.
\end{itemize}

Non-text content such as images, applets, plugins, music, and video
is becoming more and more prevalent on the web.  Devising automated
methods that can deal with this kind of content is an important area
for further investigation.  Again, effective use of metadata can be a
good way to help manage these types of non-text assets.

Better acceptance of metadata is one key to the future of the semantic web.
However, creation of quality metadata is tedious and is itself a
prime candidate for automated methods.  A preliminary method such
as the one outlined in this paper can serve as the basis for
bootstrapping\cite{boot} a more sophisticated classifier that takes full
advantage of the semantic web, and so on.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% ACKNOWLEDGE
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Acknowledgements}
\label{acknowledgements}
I would like to thank Bill Wohler for
collaboration on system design and software implementation, and
Roger Avedon, Mark Butler, and Ron Daniel for useful discussions.
Special thanks to Network Solutions Inc.\ for providing classified domain names.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% BIBLIOGRAPHY
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{thebibliography}{99}

\bibitem{bernerslee} T. Berners-Lee. Semantic Web Road Map. \\
http://www.w3.org/DesignIssues/Semantic.html, 1998.

\bibitem{yahoo} Yahoo!, http://www.yahoo.com/

\bibitem{looksmart} Looksmart, http://www.looksmart.com/

\bibitem{dmoz} Open Directory Project, http://www.dmoz.org/

\bibitem{lewis92} D. Lewis. Text Representation for Intelligent Text
Retrieval: A Classification-Oriented View. In P. Jacobs, editor,
\emph{Text-Based Intelligent Systems}, Chapter 9.  Lawrence Erlbaum, 1992.

\bibitem{naics} North American Industry Classification System (NAICS) -
United States, 1997. \\
http://www.census.gov/epcd/www/naics.html

\bibitem{dolin99} R. Dolin, J. Pierre, M. Butler, and R. Avedon.  Practical
Evaluation of IR within Automated Classification Systems. \emph{Eighth
International Conference on Information and Knowledge Management}, 1999.

\bibitem{lsi} S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and
R. Harshman. Indexing by latent semantic analysis.  \emph{Journal
of the American Society for Information Science}, 41(6):391-407, 1990.

\bibitem{rijsbergen} C.J. van Rijsbergen. \emph{Information Retrieval}.
Butterworths, London, 1979.

\bibitem{lewis91} D. Lewis. Evaluating Text Categorization. In
\emph{Proceedings of the Speech and Natural Language Workshop},
312-318, Morgan Kaufmann, 1991.

\bibitem{yang} Y. Yang and X. Liu.  A re-examination of text
categorization methods.  In \emph{Proceedings of the 22nd Annual
ACM SIGIR Conference on Research and Development in Information
Retrieval}, 42-49, 1999.

\bibitem{northernlight} Northern Light, http://www.northernlight.com/

\bibitem{inktomi} Inktomi Directory Engine, \\
http://www.inktomi.com/products/portal/directory/

\bibitem{thunderstone} Thunderstone Web Site Catalog, \\
http://search.thunderstone.com/texis/websearch/about.html

\bibitem{desire1} A. Ardo, T. Koch, and L. Nooden. The construction
of a robot-generated subject index. \emph{EU Project
DESIRE II D3.6a, Working Paper 1}, 1999. \\
http://www.lub.lu.se/desire/DESIRE36a-WP1.html

\bibitem{desire2} T. Koch and A. Ardo.  Automatic classification of
full-text HTML-documents from one specific subject area. \emph{EU Project
DESIRE II D3.6a, Working Paper 2}, 2000. \\
http://www.lub.lu.se/desire/DESIRE36a-WP2.html

\bibitem{dolin96} R. Dolin, D. Agrawal, L. Dillon, and A. El Abbadi.
Pharos: A Scalable Distributed Architecture for Locating Heterogeneous
Information Sources Version. In \emph{Proceedings of the 6th International
Conference on Information and Knowledge Management}, 1997.

\bibitem{dolin98} R. Dolin, D. Agrawal, A. El Abbadi, and J. Pearlman.
Using Automated Classification for Summarizing and Selecting Heterogeneous
Information Sources. \emph{D-Lib Magazine}, January 1998.

\bibitem{attardi} G. Attardi, A. Gulli, and F. Sebastiani.
Automatic Web Page Categorization by Link and Context Analysis.
In Chris Hutchison and Gaetano Lanzarone (eds.),
\emph{Proceedings of THAI'99, European Symposium on Telematics,
Hypermedia and Artificial Intelligence}, 105-119, 1999.

\bibitem{cho} J. Cho, H. Garcia-Molina, and L. Page. Efficient
crawling through URL ordering.  In \emph{Computer Networks and
ISDN Systems (WWW7)}, Vol. 30, 1998.

\bibitem{rennie} J. Rennie and A. McCallum. Using Reinforcement
Learning to Spider the Web Efficiently. \emph{Proceedings of the
Sixteenth International Conference on Machine Learning}, 1999.

\bibitem{mccallum} A. McCallum, K. Nigam, J. Rennie, and K. Seymore.
A Machine Learning Approach to Building Domain-Specific Search Engines.
\emph{The Sixteenth International Joint Conference on
Artificial Intelligence}, 1999.

\bibitem{boot} R. Jones, A. McCallum, K. Nigam, and E. Riloff.
Bootstrapping for Text Learning Tasks.  In \emph{IJCAI-99
Workshop on Text Mining: Foundations, Techniques and Applications},
52-63, 1999.

\end{thebibliography}

832:

833: