1: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2: % INTRODUCTION
3: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
4: \section{Introduction}
5: \label{intro}
6: 
7: There are an estimated 1 billion pages accessible on the world wide web with
8: 1.5 million pages being added daily.  
9: Describing and organizing this vast amount of content is essential
10: for realizing the web's full potential as an information resource.
11: Accomplishing this in a meaningful way will require consistent
12: use of metadata and other descriptive data structures such as
13: semantic linking\cite{bernerslee}.
14: Categorization is an important ingredient as is 
15: evident from the popularity of web
16: directories such as Yahoo!\cite{yahoo}, Looksmart\cite{looksmart}, and the 
Open Directory Project\cite{dmoz}.  However, these resources have been
18: created by large teams of human editors
19: and represent only one type of classification scheme that, while widely
20: useful, can never be suitable to all applications.  Classification is a
21: fundamental intellectual task, and we take it as an
22: axiom that it is important and indeed essential for
23: organizing and understanding web content.
24: 
25: Automated classification is needed for at least two important reasons.
26: The first is the sheer scale of resources available on the web and their
27: ever-changing nature.  It is simply not feasible to keep up with
28: the fast pace of growth and change on the web
29: through a manual classification effort
30: without expending immense time and effort.
31: The second reason is that classification itself is a subjective
32: activity.  Different classification schemes are needed for different
33: applications.  No single classification scheme is suitable for
34: all applications.  Therefore different types of classification schemes,
35: representing different facets of knowledge, may need to be applied
36: in an ongoing fashion as new applications demand them. 
37: Domain specific classification
38: schemes, which can be quickly applied to large amounts of content using
39: automated methods, hold great
40: promise for generating effective metadata.
41: 
42: Classification should be considered within the larger context of
43: subject-based metadata.  Specific fields in metadata records often
44: correspond to different classification schemes.
45: The effective use of rich metadata will be important for establishing
46: and leveraging the power of the semantic web.  If web content shifts
47: from primarily text-based to primarily multimedia oriented,
48: metadata will become even more important.  Structured metadata can
49: serve as a driver for many applications such as knowledge based
50: search and retrieval, reasoning engines, intelligent agents,
51: and multi-faceted organization of information.  However metadata
52: creation can be tedious and time consuming.  Automated methods, such
53: as the one described in this paper, can be useful for facilitating
54: metadata creation.
55: 
56: In this paper we discuss some practical issues for applying methods of
57: automated classification to web content.  Rather than take a
one-size-fits-all approach, we advocate the use of targeted
classification tasks that are relevant to solving specific problems.
60: In section \ref{theweb} we discuss the nature of web content
61: and its implications for automated categorization.  
Extracting good features that can accurately discriminate between
63: different categories is an important part of any text categorization system.
64: While it is possible and desirable to exploit metadata in the 
65: current web environment, we find that its use is far from widespread.
66: In section \ref{setup} we describe a specialized system for 
67: automatically classifying
68: web sites into industry categories.   This system
69: can serve as a generalized framework for efficient automated 
70: categorization of web content that includes targeted spidering,
71: domain specific classification, and a trainable general purpose
72: text categorization engine.
73: In section \ref{results} we present the results of our controlled
74: experiments.  We show how text features extracted from different parts of
web pages affect classification accuracy, and demonstrate that metatags
76: provide the best results.  We also compare the use of training data
77: obtained from a different domain versus training data drawn from the
78: target domain.  We find that training examples taken from the
79: content to be classified give better results, but using training data
80: from a different domain can suffice in cases where assembling new
81: data from scratch is not feasible.
82: Related work is discussed in section \ref{relatedwork}. 
83: In section \ref{conclusions} we state our conclusions and make 
84: suggestions for further research.
85: 
86: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
87: % THE WEB
88: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
89: \section{Text Categorization of Web Content}
90: \label{theweb}
91: 
92: The current state of the web differs markedly from the vision of the
93: semantic web as outlined by Tim Berners-Lee\cite{bernerslee}.  
94: While web content is machine
95: readable for the most part\footnote{The trend toward multimedia assets
96: puts the future of this assumption in some doubt, but dealing with
97: the problem of non-text information is beyond the scope of this paper.}, 
98: it is far from machine understandable.
99: Furthermore the ability for computers to understand written human
100: language is still quite limited at this point in time.  Therefore,
101: in this work we have adopted a text categorization approach that
102: relies heavily on word-based indexing and statistical classification, 
103: rather than
104: sophisticated natural language processing and knowledge-based inferencing.
105: This approach is capable of giving very good results in a way that is
106: robust and makes few assumptions about the content to be analyzed. This
is an important consideration given the heterogeneous nature of web content.
108: 
One of the main challenges with classifying web pages is the
110: wide variation in their content and quality.  
111: Most text categorization 
112: methods rely on the existence of good quality texts, especially for 
113: training\cite{lewis92}.
Unlike many of the well-known collections typically studied in
automated text classification experiments (e.g., TREC, Reuters-21578, OHSUMED),
the web lacks homogeneity and regularity.
117: To make matters worse,
118: much of the existing web page content is based in images, 
119: plug-in applications, or other non-text media.  The usage of metadata
120: is inconsistent or non-existent.  In this section we survey
121: the landscape of web content, and its relation to the 
122: requirements of text categorization systems.
123: 
124: \subsection{Analysis of Web Content}
125: 
126: In an attempt to characterize the nature of the content to be
127: classified, we performed a rudimentary quantitative analysis.
128: Our results were obtained by analyzing a collection of 29,998
129: web domains obtained from a random dump of the database
130: of a well-known domain name registration company.  
Of course, these results
reflect the biases of our small sample and do not necessarily generalize to
the web as a whole; however, they should be indicative of the issues
at hand.  Since our classification method is text based, it is important
135: to know the amount and quality of the text based features that typically
136: appear in web sites.  Existing standards for web content tend to be
137: \textit{de facto} and loosely enforced if at all.  
138: One convention that holds for the vast majority of web sites is that
139: the top level entry point is an HTML web page, so we take this to be our
140: primary source of text features.
141: Besides the body text
142: which is generally free form in a typical HTML page, it is
143: common to include a title and possibly a set of keywords and description
144: metatags.   One of the more promising sources 
145: of text features should be found in web page metadata.
146: 
147: In Table \ref{metawords} we show the percentage of web sites with a certain
148: number of words for each type of metatag.
We analyzed a sample of 19,195 domains with live web sites and counted
150: the number of words used in the content attribute of the
151: \texttt{<META name=``keywords''>} and \texttt{<META name=``description''>} 
152: tags as well as \texttt{<TITLE>} tags.  We also counted free text
153: found within the \texttt{<BODY>} tag, excluding all other HTML tags.
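
As a rough illustration of how such counts can be obtained, the following Python sketch tallies words per tag type for a single page using only the standard library.  It is an illustrative sketch rather than the tool used to produce Table \ref{metawords}: the names \texttt{TagTextCounter}, \texttt{word\_counts}, and \texttt{bucket} are hypothetical, and treating commas in keyword lists as word separators is an assumption.

```python
from html.parser import HTMLParser


class TagTextCounter(HTMLParser):
    """Collect text from title, keywords/description metatags, and body."""

    def __init__(self):
        super().__init__()
        self.text = {"title": [], "keywords": [], "description": [], "body": []}
        self._open = []  # stack of the open tags we track

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "meta":
            attrs = dict(attrs)
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.text[name].append(attrs.get("content") or "")
        elif tag in ("title", "body", "script", "style"):
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            # Pop back to (and including) the matching open tag.
            while self._open.pop() != tag:
                pass

    def handle_data(self, data):
        # Keep text only when the innermost tracked tag is title or body,
        # which skips the contents of script and style elements.
        if self._open and self._open[-1] in ("title", "body"):
            self.text[self._open[-1]].append(data)


def word_counts(html):
    """Return the number of words found for each tag type in one page."""
    parser = TagTextCounter()
    parser.feed(html)
    parser.close()
    return {
        kind: len(" ".join(chunks).replace(",", " ").split())
        for kind, chunks in parser.text.items()
    }


def bucket(n_words):
    """Map a word count to the column headings of the table."""
    if n_words == 0:
        return "0"
    if n_words <= 10:
        return "1-10"
    if n_words <= 50:
        return "11-50"
    return "51+"
```

Applying \texttt{word\_counts} to every top level page and tallying \texttt{bucket} of each count per tag type would give percentages of the kind reported in the table.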
154: 
155: \begin{table*}[!hp]
156: \caption{Percentage of Web Pages with Words in HTML Tags}
157: \label{metawords}
158: \begin{center}
159: \begin{tabular}{crrrr}
160: \hline
161: Tag Type & 0 words & 1-10 words & 11-50 words & 51+ words \\
162: \hline
163: Title & 4\% & 89\% & 6\% & 1\% \\
164: Meta-Description & 68\% & 8\% & 21\% & 3\% \\
165: Meta-Keywords & 66\% & 5\% & 19\% & 10\% \\
166: Body Text & 17\% & 5\% & 21\% & 57\% \\
167: \hline
168: \end{tabular}
169: \end{center}
170: \end{table*}
171: 
172: The most obvious source of text is within the body of the web page.
173: We noticed that about 17\% of top level web pages had no usable body
174: text.  These cases include pages that only contain frame sets,
175: images, or plug-ins (our user agent followed redirects whenever
possible).  About a fifth (21\%) of web pages contained 11-50 words,
and the majority (57\%) contained over 50 words.
178: 
Though title tags are common, the amount of text they contain is relatively
small, with 89\% of the titles containing only 1-10 words.
181: Also, the titles often contain only names or terms such as 
182: ``home page'', which are not particularly helpful for subject classification.
183: 
184: Metatags for keywords and descriptions are used by several major search
185: engines, where they play an important role in the ranking and
186: display of search results.  Despite this, only about a third of
187: web sites were found to contain these tags.
188: As it turns out, metatags can be useful when they exist
189: because they contain text specifically intended to aid in the
190: identification of a web site's subject areas\footnote{The possibilities 
191: for misuse/abuse of these tags to improve search engine rankings are well 
192: known; however, we found these practices to be not very widespread in our 
193: sample and of little consequence.}.  Most of the time these metatags
194: contained between 11 and 50 words, with a smaller percentage containing
195: more than 50 words (in contrast to the number of words in the body
196: text which tended to contain more than 50 words).
197: 
198: The lack of widespread use of metatags, despite the apparent incentive
199: to improve search engine rankings, is instructive.  Since metadata 
200: is usually not part of the presentation of the content and its
201: benefit is somewhat intangible, it tends to be neglected.  Creating metadata
202: can be a tedious and unwelcome task.  Therefore methods to facilitate the
203: creation of quality metadata, especially automated methods, are greatly
204: needed.
205: 
206: \subsection{Good Text Features}
207: \label{goodfeatures}
208: 
209: Feature selection is an important part of building an automated
210: classification system.  Without a proper set of features, the
211: classifier will not be able to accurately 
212: discriminate between different categories.
The feature set must be sufficiently broad to accommodate the wide
214: variations that can occur even within instances of the same class.  On the
215: other hand the number of features needs to be constrained to reduce noise 
216: and to limit the burden on system resources.
217: 
218: In reference\cite{lewis92} it is argued that for the purposes of automated
219: text categorization, features should be:
220: \begin{enumerate}
221: \item Relatively few in number
222: \item Moderate in frequency of assignment
223: \item Low in redundancy
224: \item Low in noise
225: \item Related in semantic scope to the classes to be assigned
226: \item Relatively unambiguous in meaning
227: \end{enumerate}
228: 
229: Due to the wide variety of purpose and scope of current web content,
230: items 4 and 5 are difficult requirements to meet for most
231: classification tasks.  For subject
232: classification, metatags seem to meet those requirements better
233: than other sources of text such as titles and body text.  However
234: the lack of widespread use of metatags is a problem if 
235: coverage of the majority of web content is desired.  In the long term, 
236: automated categorization could really benefit if greater
237: attention is paid to the creation and usage of rich metadata and
238: explicit semantic structures,
239: especially if the above requirements are taken into consideration.
240: In the short term, one must implement a strategy for obtaining
241: good text features from the existing HTML and natural language
242: cues that takes the above requirements as well as the goals
243: of the classification task into consideration.  Techniques for shallow
244: parsing and information extraction are useful in this regard.
245: 
246: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
247: % SETUP
248: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
249: \section{Experimental Setup}
250: \label{setup}
251: 
252: We constructed a full scale automated classification system and
253: performed several experiments using real world data in order to
254: gauge system performance and test ideas.
255: The goal of our targeted domain specific task was to 
256: rapidly classify web sites (domain names)
257: into broad industry categories. In this section we describe the
258: main ingredients of our classification experiments including the data,
259: architecture, and evaluation measures.
260: 
261: \subsection{Classification Scheme}
262: 
263: The categorization scheme used was the
top level of the 1997 North American Industry Classification System
265: (NAICS) \cite{naics}, which consists of 21 broad industry categories
266: shown in Table \ref{tnaics}.
267: 
268: \begin{table*}[!htbp]
269: \caption{Top level NAICS Categories}
270: \label{tnaics}
271: \begin{center}
272: \begin{tabular}{cl}
273: \hline
274: NAICS code & NAICS Description \\
275: \hline
276: 11 & Agriculture, Forestry, Fishing, and Hunting \\
277: 21 & Mining \\
278: 22 & Utilities \\
279: 23 & Construction \\
280: 31-33 & Manufacturing \\
281: 42 &  Wholesale Trade \\
282: 44-45 &  Retail Trade \\
283: 48-49 &  Transportation and Warehousing \\
284: 51 &  Information \\
285: 52 &  Finance and Insurance \\
286: 53 &  Real Estate and Rental and Leasing \\
287: 54 &  Professional, Scientific and Technical Services \\
288: 55 & Management of Companies and Enterprises \\
289: 56 &  Administrative and Support, \\
290:    &  Waste Management and Remediation Services \\
291: 61 & Educational Services \\
292: 62 & Health Care and Social Assistance \\
293: 71 & Arts, Entertainment and Recreation \\
294: 72 & Accommodation and Food Services \\
295: 81 & Other Services (except Public Administration) \\
296: 92 & Public Administration \\
297: 99 & Unclassified Establishments \\
298: \hline
299: \end{tabular}
300: \end{center}
301: \end{table*}
302: 
303: Some of our resources had been previously classified using the older
304: 1987 Standard Industrial Classification (SIC) system.  In these cases
305: we used the published mappings\cite{naics} to convert all
306: assigned SIC categories to their NAICS equivalents.  The full
307: NAICS has six levels of hierarchy and contains
308: several thousand subcategories.  For our experiments all lower level
309: NAICS subcategories were generalized up to the appropriate
310: top level category (though the entire classification scheme could
311: have been utilized by our system if a finer grained categorization
312: was desired).
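
The generalization step relies on the fact that the first two digits of a NAICS code identify its top level sector.  A minimal Python sketch of this mapping is given below; the function name \texttt{top\_level\_sector} and the table \texttt{SECTOR\_RANGES} are hypothetical, and the sketch does not check that the two-digit prefix is a sector that actually exists in Table \ref{tnaics}.

```python
# Top level sectors in the table that span several two-digit prefixes.
SECTOR_RANGES = {
    "31": "31-33", "32": "31-33", "33": "31-33",  # Manufacturing
    "44": "44-45", "45": "44-45",                  # Retail Trade
    "48": "48-49", "49": "48-49",                  # Transportation and Warehousing
}


def top_level_sector(naics_code):
    """Generalize a NAICS code of any depth (2 to 6 digits) to its sector."""
    code = str(naics_code).strip()
    if len(code) < 2 or not code[:2].isdigit():
        raise ValueError("not a NAICS code: %r" % (naics_code,))
    prefix = code[:2]
    # Sectors with a single two-digit code map to themselves.
    return SECTOR_RANGES.get(prefix, prefix)
```

For example, \texttt{top\_level\_sector("334111")} yields \texttt{"31-33"} and \texttt{top\_level\_sector("5112")} yields \texttt{"51"}.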
313: 
314: NAICS and SIC are examples of authoritative controlled vocabularies.  
Using a published, standardized classification scheme makes it possible
to benefit from the many person-hours that went into
constructing it.  In addition, it may be possible to
318: take advantage of existing content already classified by the scheme as
319: a source of training data.
320: 
321: \subsection{Targeted Spidering}
322: \label{spider}
323: 
324: Based on the results of section \ref{theweb}, it is obvious that selection
325: of adequate text features is an important issue and certainly
326: not to be taken for granted.  To balance the
327: needs of our text-based classifier against the speed and storage limitations of
328: a large-scale crawling effort, we took an approach for spidering
329: web sites and gathering text that was targeted to the classification task
330: at hand.  
331: 
332: In some preliminary tests we found the best classifier accuracy
333: was obtained by using only the contents of the keywords and
334: description metatags as the source of text features.  Adding
335: body text decreased classification accuracy.  However, due to
the lack of widespread usage of metatags, limiting ourselves
337: to these features was not practical, and other sources of
338: text such as titles and body text were needed to provide
339: adequate coverage of web sites.  Therefore our targeted spidering approach
340: attempted to gather the higher quality text features from metatags
341: and only resorted to lower quality texts if needed.
342: 
343: Our opportunistic spider began at the top level page of the web site
and attempted to extract useful text from metatags and titles
if they existed, and then followed links for frame sets if present.
346: It also followed any hyperlinks
347: that contained key substrings in their anchor text
348: such as \emph{product}, \emph{services},
349: \emph{about}, \emph{info}, \emph{press}, and \emph{news}, and again
350: looked for metatag content in those pages.  
351: These substrings were chosen based on
352: an \emph{ad hoc} frequency analysis and the assumption that they tend to
353: point to content that is useful for deducing an industry classification.
354: Only if no metatag content was found did the spider
355: gather the actual body text of the web page.  All extracted text was
356: concatenated into a single representative document for the site
that was submitted to the classification engine.
358: For efficiency we ran several spiders in parallel, each working
359: on different lists of individual domain names.
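
The two decisions described above, which links to follow and when to fall back to body text, can be sketched in Python as follows.  This is a simplified illustration under stated assumptions, not the spider itself: the names \texttt{AnchorCollector}, \texttt{links\_to\_follow}, and \texttt{site\_document} are hypothetical, the substring list contains only the examples named in the text, and fetching pages and following frame sets are omitted.

```python
from html.parser import HTMLParser

# Only the example substrings mentioned in the text; the full list is not given.
KEY_SUBSTRINGS = ("product", "services", "about", "info", "press", "news")


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently open, if any
        self._parts = []    # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._parts = []

    def handle_data(self, data):
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._parts).strip()))
            self._href = None


def links_to_follow(html):
    """Return hrefs whose anchor text contains one of the key substrings."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return [href for href, text in collector.links
            if any(s in text.lower() for s in KEY_SUBSTRINGS)]


def site_document(title_text, metatag_text, body_text):
    """Build the single document submitted for classification.

    Title and metatag text are always used; body text is a fallback
    that is added only when no metatag text was found.
    """
    parts = [title_text, metatag_text]
    if not metatag_text.strip():
        parts.append(body_text)
    return " ".join(p.strip() for p in parts if p.strip())
```

Because matching is by substring, an anchor such as ``Our Products'' or ``Company Information'' is followed, while ``Contact'' is not.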
360: 
By following a restricted set of hyperlinks,
we were attempting to take advantage of the
current web's \emph{implicit} semantic structure.
One of the advantages of moving towards an \emph{explicit} semantic
365: structure for hypertext documents\cite{bernerslee} is that an
366: opportunistic spidering
367: approach could really benefit from a formalized description of the
368: semantic relationships between linked web pages.  This would allow
369: spiders to more easily find the most relevant resources without having to
370: crawl the entire network of the web.
371: 
372: \subsection{Test Data}
373: 
374: From our initial list of 29,998 domain names we used our targeted spider
375: to determine which sites were live and extracted
376: text features using the approach outlined in section \ref{spider}.
377: Of those, 13,557 domain names had usable text content and were pre-classified
378: according to one or more industry categories\footnote{Industry classifications
for domain names were provided by InfoUSA and Dun \& Bradstreet.}.  From
380: this set of data we drew samples for training, testing and validation.
381: 
382: \subsection{Training Data}
383: \label{ts}
384: 
385: We took two approaches to constructing training sets for our
386: classifiers.  In the
387: first approach we used a combination of 426 NAICS category labels 
388: (including subcategories) and 1504 U.S. Securities and Exchange Commission
389: (SEC) 10-K filings\footnote{SEC 10-K filings are annual reports
390: required of all U.S. public companies that describe business
391: activities for the year.  Each public company is also
392: assigned an SIC category.} 
393: for public companies\cite{dolin99} as training examples.  
394: In the second approach we used a set of 3618 pre-classified
395: domain names along with text for each domain obtained using our spider.
396: 
397: The first approach can be considered as using ``prior knowledge''
398: obtained in a different domain.  It is interesting to see how knowledge from
399: a different domain generalizes to the problem of classifying web sites.  
400: Furthermore it is
401: often the case that training examples can be difficult to obtain (thus
402: the need for an automated solution in the first place).  The
403: second approach is the more conventional classification by example.
404: In our case it was made possible by the fact that our database
of domain names was pre-classified according to one or more industry categories.
406: 
407: 
408: \subsection{Classifier Architecture}
409: 
410: Our text classifier consisted of three modules: the targeted spider for 
411: extracting text features associated with a web site, 
412: an information retrieval engine for comparing queries to
413: training examples, and a decision algorithm for assigning categories.
414: 
415: Our spider was designed to quickly process a large database of
416: top level web domain names (e.g. domain.com, domain.net, etc.).
417: As described in section \ref{spider} we implemented an opportunistic
418: spider targeted to finding high quality text from pages that described
419: the business area, products, or services of a commercial web site.  
420: After accumulating text features, a query was submitted to the
421: text classifier.  The domain name and any automatically
422: assigned categories were logged in a central database.
423: Several spiders could be run in parallel for efficient use of system
424: resources.
425: 
Our information retrieval engine was based on Latent Semantic Indexing
427: (LSI)\cite{lsi}.  LSI is a variation of the vector space model of
428: information retrieval that uses the technique of singular value
429: decomposition (SVD) to reduce the dimensionality of the vector space.
430: Words that tend to co-occur in the same document share large projections
431: along directions in the reduced space.  Theoretically this reduces
432: noise due to redundant or spurious word usage, and automatically 
433: derives relationships
434: between words and the inherent concepts.  Cosine similarity is computed
435: in the reduced vector space, which amounts to concept based matching rather
436: than word based.  For example queries containing the word ``car'' will
437: match documents containing only the word ``automobile'' provided the
438: relationship between the words and concept has been established in the corpus.
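
In the notation commonly used for LSI (the symbols here are introduced for illustration only), let $X$ be the term-by-document matrix of the training set.  The rank-$k$ truncated SVD gives
\[
X \approx X_k = U_k \Sigma_k V_k^{T},
\]
where the columns of $U_k$ and $V_k$ are the leading $k$ left and right singular vectors and $\Sigma_k$ is the diagonal matrix of the $k$ largest singular values.  A query with term vector $q$ is folded into the reduced space as
\[
\hat{q} = \Sigma_k^{-1} U_k^{T} q,
\]
and compared with the reduced representation $\hat{d}_j$ of training document $j$ (the $j$-th row of $V_k$) using
\[
\cos(\hat{q}, \hat{d}_j) = \frac{\hat{q} \cdot \hat{d}_j}{\|\hat{q}\| \, \|\hat{d}_j\|}.
\]
Implementations differ in whether the reduced vectors are additionally scaled by $\Sigma_k$ before the cosine is taken; the form above is one common convention and should be read as a sketch rather than a description of our engine.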
439: 
440: In a previous work\cite{dolin99} it was shown that LSI provided better
441: accuracy with fewer training set documents per category than standard
442: TF-IDF weighting.  Queries were compared to training
443: set documents based on their cosine similarity, and a ranked list of
444: matching documents and scores was forwarded to the decision module.
445: 
446: In the decision module,
447: we used a K-nearest neighbor algorithm for ranking categories and
448: assigned the top ranking category to the web site.  This type of classifier
449: tends to perform well compared to other methods\cite{yang}, is robust,
450: and tolerant of noisy data (all are important qualities when dealing with
451: web content).  In addition the algorithm is capable of producing good
452: results even when the amount of training data is limited.  The decision
453: module also is responsible for thresholding and presenting the final
454: set of automatically assigned categories.
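
One common way to realize such a decision module is similarity-weighted voting over the $k$ best matches, sketched below in Python.  The voting rule, the default value of \texttt{k}, the threshold semantics, and the names \texttt{rank\_categories} and \texttt{assign\_category} are all assumptions made for illustration; they are not taken from the system described here.

```python
from collections import defaultdict


def rank_categories(matches, k=10):
    """Rank categories by summed similarity over the k nearest neighbors.

    matches: list of (similarity, categories) pairs, one per training
    document returned by the retrieval engine, in any order.
    """
    nearest = sorted(matches, key=lambda m: m[0], reverse=True)[:k]
    scores = defaultdict(float)
    for similarity, categories in nearest:
        for category in categories:
            scores[category] += similarity
    # Highest score first; ties are broken by category label.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def assign_category(matches, k=10, threshold=0.0):
    """Return the top ranking category, or None if it scores below threshold."""
    ranked = rank_categories(matches, k)
    if ranked and ranked[0][1] >= threshold:
        return ranked[0][0]
    return None
```

With this rule a category supported by several moderately similar neighbors can outrank a category supported by a single closer neighbor, which is one reason the method tolerates noisy training examples.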
455: 
456: \subsection{Evaluation Measures}
457: 
458: System evaluation was carried out using the standard precision,
459: recall, and F1 measures\cite{rijsbergen}\cite{lewis91}.  
460: Precision is the number of correct categories assigned divided
461: by the total number of categories assigned, and serves as a measure
of classification accuracy.  The higher the precision, the smaller the number
of false positives.  Recall is the number of correct categories
assigned divided by the total number of known correct categories.
Higher recall means fewer missed categories.  In theory,
466: scores of 1 are desirable for both precision and recall.  In practice
467: even human assigned classifications may only achieve scores between
468: 0.7 and 0.9, depending on the classification task.  This is because
469: to some extent classification is a subjective task and there are
470: usually ``grey areas'' in a classification scheme.
471: 
472: The F1 measure combines precision and recall with equal importance
473: into a single parameter for optimization and is defined as
474: \begin{equation}
475: F1 = \frac{2 P R}{P + R}
476: \end{equation}
477: where P is precision and R is recall.
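
As a worked example, taking the values $P = 0.64$ and $R = 0.39$ reported for metatags in Table \ref{ptf} gives
\[
F1 = \frac{2 (0.64)(0.39)}{0.64 + 0.39} = \frac{0.4992}{1.03} \approx 0.48 .
\]
Since F1 is the harmonic mean of $P$ and $R$, it lies closer to the smaller of the two values than their arithmetic mean would.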
478: 
479: We computed global estimates
480: of system performance using both micro-averaging (results are computed
481: based on global sums over all decisions) and 
482: macro-averaging (results are computed on a per-category basis,
483: then averaged over categories).  Micro-averaged
484: scores tend to be dominated by the most commonly used categories,
485: while macro-averaged scores tend to be dominated by the performance
486: in rarely used categories.  This distinction was relevant to our problem,
487: because it turned out that the vast majority of commercial web sites
488: are associated with the Manufacturing (31-33) category.
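
The difference between the two averaging schemes can be made concrete with a short Python sketch.  The function names \texttt{prf} and \texttt{micro\_macro} are hypothetical, and the per-category counts in the usage note are invented for illustration; they are not experimental data.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def micro_macro(counts):
    """counts maps each category to a (tp, fp, fn) tuple.

    Micro-averaging sums the counts over all categories first;
    macro-averaging computes P, R, and F1 per category and then averages.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = prf(tp, fp, fn)

    per_category = [prf(*c) for c in counts.values()]
    n = len(per_category)
    macro = tuple(sum(scores[i] for scores in per_category) / n for i in range(3))
    return micro, macro
```

For the invented counts \texttt{\{"31-33": (90, 10, 10), "71": (1, 1, 9)\}}, micro-averaged recall is about 0.83 because the common category dominates, while macro-averaged recall is 0.50 because the rare category counts equally.  Because macro-averaged F1 is the mean of the per-category F1 values rather than the harmonic mean of macro-averaged precision and recall, it can fall below both of them.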
489: 
490: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
491: % RESULTS
492: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
493: \section{Results}
494: \label{results}
495: 
496: In our first experiment we varied the sources of text features
497: for 1125 pre-classified web domains.  We
498: constructed separate test sets
499: based on text extracted from the body text, metatags 
500: (keywords and descriptions),
501: and a combination of both.  The training set consisted of SEC documents
502: and NAICS category descriptions.
503: Results are shown in Table \ref{ptf}.
504: 
505: \begin{table}[!htbp]
506: \caption{Performance vs. Text Features}
507: \label{ptf}
508: \begin{center}
509: \begin{tabular}{cccc}
510: \hline
511: Sources of Text & micro P & micro R & micro F1 \\
512: \hline
513: Body & 0.47 & 0.34 & 0.39 \\
514: Body + Metatags & 0.55 & 0.34 & 0.42 \\
515: Metatags & 0.64 & 0.39 & 0.48 \\ 
516: \hline
517: \end{tabular}
518: \end{center}
519: \end{table}
520: 
521: Using metatags as the only source of text features resulted in
the most accurate classifications.  Precision decreased noticeably
523: when only the body text was used.  It is interesting that including
524: the body text along with the metatags also resulted in less accurate
525: classifications.  These results influenced the design of our spider
526: which extracted metatags first and foremost, while only grabbing
527: body text as a last resort.
528: The usefulness of metadata as a source of high quality
text features should not be surprising since it meets most of the
criteria listed in section \ref{goodfeatures}.
531: 
532: In our second experiment we compared classifiers constructed from
533: the two different training sets described in section \ref{ts}.  
534: The results are shown in Table \ref{pts}.
535: 
536: \begin{table*}[!htbp]
537: \caption{Performance vs. Training Set}
538: \label{pts}
539: \begin{center}
540: \begin{tabular}{ccccccc}
541: \hline
542: Classifier & micro P & micro R & micro F1 & macro P & macro R & macro F1\\
543: \hline
544: SEC-NAICS & 0.66 & 0.35 & 0.45 & 0.23 & 0.18 & 0.09 \\
545: Web Pages & 0.71 & 0.75 & 0.73 & 0.70 & 0.37 & 0.40 \\
546: \hline
547: \end{tabular}
548: \end{center}
549: \end{table*}
550: 
551: The SEC-NAICS training set achieved respectable micro-averaged scores,
552: but the macro-averaged scores were low.  One reason for this is that
553: this classifier generalizes well in categories that are
554: common to the business and web domains (31-33, 23, 51),
555: but has trouble with recall in
556: categories that are not well represented in the business domain
557: (71, 92) and poor precision in categories that are not as common in the web
558: domain (54, 52, 56).
559: 
560: The training set constructed from web site text performed better
561: overall.  Macro-averaged recall was much lower than micro-averaged
562: recall.  This can be partially explained by the following example.
563: The categories Wholesale Trade (42) and Retail Trade (44-45) have
564: a subtle difference especially when it comes to web page text
565: which tends to focus on products and services delivered rather
566: than the Retail vs. Wholesale distinction.  In our training set, category
567: 42 was much more common than 44-45, and the former tended to be assigned
568: in place of the latter, resulting in low recall for 44-45.  Other
569: rare categories also tended to have low recall (e.g. 23, 56, 81).
570: 
571: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
572: % RELATED WORK
573: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
574: \section{Related Work}
575: \label{relatedwork}
576: 
577: Some automatically-constructed, large-scale web directories have
578: been deployed as commercial services such as 
579: Northern Light\cite{northernlight},
Inktomi Directory Engine\cite{inktomi}, and Thunderstone
581: Web Site Catalog\cite{thunderstone}.  Details about these
582: systems are generally unavailable because of their proprietary
583: nature.  It is interesting that these directories tend not to
584: be as popular as their manually constructed counterparts.
585: 
586: A system for automated discovery and classification of domain
587: specific web resources is described as part of the DESIRE II 
588: project\cite{desire1}\cite{desire2}.  Their classification
589: algorithm weights terms from metatags higher than titles and
590: headings, which are weighted higher than plain body text.
591: They also describe the use of classification software as a
592: topic filter for harvesting a subject specific web index.
593: Another system, Pharos (part of the Alexandria Digital
594: Library Project), is a scalable
595: architecture for searching heterogeneous information sources
596: that leverages the use of metadata\cite{dolin96} and 
597: automated classification\cite{dolin98}.
598: 
599: The hyperlink structure of the web can be exploited for automated
600: classification by using the anchor text and other context
601: from linking documents as a source of text features\cite{attardi}.
602: Approaches to efficient web spidering\cite{cho}\cite{rennie} have
603: been investigated and are especially important for very large-scale
604: crawling efforts.
605: 
606: A complete system for automatically building searchable databases of
607: domain specific web resources using a combination of
608: techniques such as automated classification, targeted spidering, and
609: information extraction is described in reference\cite{mccallum}.
610: 
611: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
612: % CONCLUSIONS
613: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
614: \section{Conclusions}
615: \label{conclusions}
616: 
617: Automated methods of knowledge discovery, including classification,
618: will be important for establishing the semantic web.
619: Classification is a basic intellectual task and is challenging to
automate due to its somewhat subjective nature.  However, it is possible
to achieve results with automated methods that meet or exceed manual results.
622: 
623: A single classification 
624: scheme can never be adequate for all applications.
625: We advocate a pragmatic approach including targeted techniques and
626: specialized domain knowledge to
627: be applied to specific classification tasks.  The result is an
628: efficient and optimized system for the task at hand.
629: In this paper we described a practical system for automatically
630: classifying web sites into industry
631: categories that gives good results.  This type of system can
632: be applied to any domain specific classification scheme.  All that is needed
633: is to define the categories, assemble the training data,
634: and configure the spider to extract the appropriate features.  The spider
635: may be constructed to follow specific types of links, or extract sections
636: of web page content that are most useful for a given domain.
637: 
638: From the results in Table \ref{ptf} 
639: we concluded that metatags were the best source
of quality text features, at least compared to the body text.  However,
by limiting ourselves to metatags we would not be able to classify the
large majority of web sites.  Therefore we opted for a targeted spider
643: that extracted metatag text first, looked for pages that
644: described business activities, and
645: then degraded to other text only
646: if necessary.  It seems clear that text contained in structured 
647: metadata fields results in better automated categorization.  If the
648: web moves toward a more formal semantic structure as outlined by
649: Tim Berners-Lee\cite{bernerslee}, then automated methods can benefit.  
650: If more and different kinds of automated 
651: classification tasks can be accomplished more accurately, the
652: web can be made to be more useful as well as more usable.
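
The ``metatags first, other text only if necessary'' strategy described
above can be sketched as follows.  This is a simplified illustration
under stated assumptions, not the system's actual implementation: it
handles a single page, omits the intermediate step of looking for pages
that describe business activities, and the threshold
\texttt{min\_meta\_words} is a hypothetical parameter.

```python
from html.parser import HTMLParser


class PageText(HTMLParser):
    """Collect metatag text (description/keywords) and other page text."""

    def __init__(self):
        super().__init__()
        self.meta = []
        self.body = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            name = (attrs.get("name") or "").lower()
            content = attrs.get("content")
            if name in ("description", "keywords") and content:
                self.meta.append(content)
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.body.append(data.strip())


def select_features(html, min_meta_words=3):
    """Return (source, text): metatag text if sufficient, else other page text.

    `min_meta_words` is a hypothetical threshold for "enough" metatag text.
    """
    parser = PageText()
    parser.feed(html)
    parser.close()
    meta_text = " ".join(parser.meta)
    if len(meta_text.split()) >= min_meta_words:
        return "meta", meta_text
    return "body", " ".join(parser.body)


# Usage: one page with a description metatag, one without.
with_meta = (
    '<html><head><meta name="description" '
    'content="Industrial pump manufacturer and distributor">'
    '<title>Acme</title></head><body><p>Welcome</p></body></html>'
)
without_meta = (
    '<html><head><title>Acme</title><script>var x = 1;</script></head>'
    '<body><p>We repair pumps.</p></body></html>'
)
result_meta = select_features(with_meta)
result_body = select_features(without_meta)
```

Returning the source label alongside the text lets later stages record
which kind of feature each classification was based on.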
653: 
654: Rich metadata for web content is a key to better searching,
better organization and management of content, and improved
656: intelligent agents capable of discovering and acting
657: upon the knowledge embedded in the vast online resources.
658: However, as we have shown,
659:  creation of metadata remains a bottleneck despite strong
660: incentives such as better rankings in search engine results.  It
661: seems that the only way to ensure widespread use of quality metadata
662: is to make the process of metadata creation as painless as possible.
663: Automated methods that can reliably and accurately generate metadata
from existing content hold much promise in this regard.  Furthermore,
665: metadata needs to be multi-faceted, current, and extensible.  Only
666: automated systems can keep pace with the rate of generation of new
667: web content that we see today.
668: 
669: We outline our basic approach
670: for building a targeted automated categorization solution for web
671: content:
672: \begin{itemize}
673: \item \textbf{Knowledge Gathering} - It is important to have a 
674: clear understanding
675: of the domain to be classified and the quality of the content involved.  The
web is a heterogeneous environment, but within given domains patterns and
677: commonalities can emerge.  Taking advantage of specialized knowledge can
678: improve classification results.
679: \item \textbf{Targeted Spidering} - For each classification task 
680: different features
681: will be important.  However, due to the lack of homogeneity in web content,
682: the existence of key features can be quite inconsistent.  A targeted spidering
683: approach tries to gather as many key features as possible with as
684: little effort as possible.  In the future this type of approach can
685: benefit greatly from a web structure that encourages the use of 
686: metadata and semantically-typed links.  It would be interesting to
687: do a more detailed analysis of semantic spidering and its effect on
688: system performance.
689: \item \textbf{Training} - The best training data comes from the
690: domain to be classified, since that gives the best chance
691: for identifying the key features.  In cases where it's
692: not feasible to assemble enough training data in the target domain,
693: it may be possible to achieve acceptable results using training data
gathered from a different domain.  This can be the case for web content,
which is unstructured, uncontrolled, and immense, making it difficult
to assemble quality training data.  However, controlled
collections of pre-classified electronic documents can be obtained
in many important domains (financial, legal, medical, etc.) and
applied to automated categorization of web content.
700: \item \textbf{Classification} - In addition to being
701: as accurate as possible, the classification method needs to
702: be efficient, scalable, robust, and tolerant of noisy data.  Classification 
algorithms that utilize the link structure of the web, including
formalized semantic linking structures, should be further investigated.
705: \end{itemize}
706: 
Non-text content such as images, applets, plugins, music, and video
is becoming more and more prevalent on the web.  Devising automated
709: methods that can deal with this kind of content is an important area
710: for further investigation.  Again, effective use of metadata can be a
711: good way to help manage these types of non-text assets.
712: 
713: Better acceptance of metadata is one key to the future of the semantic web.
714: However, creation of quality metadata is tedious and is itself a
prime candidate for automated methods.  A preliminary method such
as the one outlined in this paper can serve as the basis for
bootstrapping\cite{boot} a more sophisticated classifier that takes full
advantage of the semantic web, and the process can then be repeated.
719: 
720: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
721: % ACKNOWLEDGE
722: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
723: \section{Acknowledgements}
724: \label{acknowledgements}
I would like to thank Bill Wohler for
726: collaboration on system design and software implementation, and
727: Roger Avedon, Mark Butler, and Ron Daniel for useful discussions.  
728: Special thanks to Network Solutions Inc. for providing classified domain names.
729: 
730: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
731: % BIBLIOGRAPHY
732: %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
733: \begin{thebibliography}{99}
734: 
735: \bibitem{bernerslee} T. Berners-Lee. Semantic Web Road Map. \\
736: http://www.w3.org/DesignIssues/Semantic.html, 1998.
737: 
738: \bibitem{yahoo} Yahoo!, http://www.yahoo.com/
739: 
740: \bibitem{looksmart} Looksmart, http://www.looksmart.com/
741: 
742: \bibitem{dmoz} Open Directory Project, http://www.dmoz.org/
743: 
744: \bibitem{lewis92} D. Lewis. Text Representation for Intelligent Text
745: Retrieval: A Classification-Oriented View. In P. Jacobs, editor,
746: \emph{Text-Based Intelligent Systems}, Chapter 9.  Lawrence Erlbaum, 1992.
747: 
\bibitem{naics} North American Industry Classification System (NAICS) -
749: United States, 1997. \\
750: http://www.census.gov/epcd/www/naics.html
751: 
752: \bibitem{dolin99} R. Dolin, J. Pierre, M. Butler, and R. Avedon.  Practical
753: Evaluation of IR within Automated Classification Systems. \emph{Eighth
754: International Conference of Information and Knowledge Management}, 1999.
755: 
756: \bibitem{lsi} S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and
757: R. Harshman. Indexing by latent semantic analysis.  \emph{Journal
758: of the American Society for Information Science}, 41 (6):391-407, 1990.
759: 
760: \bibitem{rijsbergen} C.J. van Rijsbergen. \emph{Information Retrieval}.
761: Butterworths, London, 1979.
762: 
763: \bibitem{lewis91} D. Lewis. Evaluating Text Categorization. In 
764: \emph{Proceedings of the Speech and Natural Language Workshop},
765: 312-318, Morgan Kaufmann 1991.
766: 
767: \bibitem{yang} Y. Yang and X. Liu.  A re-examination of text
768: categorization methods.  In \emph{Proceedings of the 22nd Annual
769: ACM SIGIR Conference on Research and Development in Information
770: Retrieval}, 42-49, 1999.
771: 
772: \bibitem{northernlight} Northern Light, http://www.northernlight.com/
773: 
774: \bibitem{inktomi} Inktomi Directory Engine, \\
775: http://www.inktomi.com/products/portal/directory/
776: 
777: \bibitem{thunderstone} Thunderstone Web Site Catalog, \\
778: http://search.thunderstone.com/texis/websearch/about.html
779: 
780: 
781: \bibitem{desire1} A. Ardo, T. Koch, and L. Nooden. The construction 
782: of a robot-generated subject index. \emph{EU Project
783: DESIRE II D3.6a, Working Paper 1} 1999. \\
784: http://www.lub.lu.se/desire/DESIRE36a-WP1.html
785: 
\bibitem{desire2} T. Koch and A. Ardo.  Automatic classification of 
787: full-text HTML-documents from one specific subject area. \emph{EU Project
788: DESIRE II D3.6a, Working Paper 2} 2000. \\
789: http://www.lub.lu.se/desire/DESIRE36a-WP2.html
790: 
\bibitem{dolin96} R. Dolin, D. Agrawal, L. Dillon, and A. El Abbadi.
Pharos: A Scalable Distributed Architecture for Locating Heterogeneous 
Information Sources. In \emph{Proceedings of the 6th International 
Conference on Information and Knowledge Management}, 1997.
795: 
796: \bibitem{dolin98} R. Dolin, D. Agrawal, A. El Abbadi, and J. Pearlman.
797: Using Automated Classification for Summarizing and Selecting Heterogeneous 
798: Information Sources. In \emph{D-Lib Magazine}, January, 1998.
799: 
800: \bibitem{attardi} G. Attardi, A. Gulli, and F. Sebastiani.  
801: Automatic Web Page Categorization by Link and Context Analysis.  
802: In Chris Hutchison and Gaetano Lanzarone (eds.),
803: \emph{Proceedings of THAI'99, European Symposium on Telematics, Hypermedia and Artificial Intelligence}, 105-119, 1999.
804: 
805: 
806: \bibitem{cho} J. Cho, H. Garcia-Molina, and L. Page. Efficient 
crawling through URL ordering.  In \emph{Computer Networks and 
ISDN Systems (WWW7)}, Vol. 30, 1998.
809: 
810: \bibitem{rennie} J. Rennie and A. McCallum. Using Reinforcement 
811: Learning to Spider the Web Efficiently. \emph{Proceedings of the 
812: Sixteenth International Conference on Machine Learning}, 1999.
813: 
814: \bibitem{mccallum} A. McCallum, K. Nigam, J. Rennie, and K. Seymore.
815: A Machine Learning Approach to Building Domain-Specific Search Engines.
816: \emph{The Sixteenth International Joint Conference on 
817: Artificial Intelligence}, 1999.
818: 
819: \bibitem{boot} R. Jones, A. McCallum, K. Nigam, and E. Riloff.
820: Bootstrapping for Text Learning Tasks.  In \emph{IJCAI-99
821: Workshop on Text Mining: Foundations, Techniques and Applications}, 
822: 52-63, 1999.
823: 
824: \end{thebibliography}
825: 
826: 
827: 
828: 
829: 
830: 
831: 
832: 
833: