Web-Based Content Extraction

Modern web documents contain far more than their main content. Navigation menus, advertisements, and functional or design elements are typical examples of additional content that extends, enriches, or simply accompanies the main content. Content Extraction (CE) is the process of determining the main content of an HTML document. In recent years, several CE heuristics have been formulated.

Content Extraction and Template Detection

The algorithms for CE can be subdivided into two categories: single-document CE approaches and multi-document Template Detection (TD) approaches.
Template Detection algorithms use collections of web documents to determine the structure of a common underlying template. Content Extraction algorithms instead operate on a single document and use heuristics to determine the main content.
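
The difference between the two categories can be illustrated with a minimal sketch. It is not an implementation of any of the published algorithms listed under Further Reading; it only assumes, for illustration, that a document can be split into line-level text blocks. The function names (split_blocks, detect_template, extract_by_template, extract_single), the thresholds min_share and min_ratio, and the sample pages are all hypothetical. The template detection part treats blocks that recur across most documents of a collection as template; the single-document part keeps lines with many words per tag.

```python
import re
from collections import Counter

TAG = re.compile(r"<[^>]+>")  # crude tag matcher, good enough for a sketch


def split_blocks(html):
    """Split HTML into non-empty text blocks, one per source line, tags removed."""
    blocks = []
    for line in html.splitlines():
        text = " ".join(TAG.sub(" ", line).split())
        if text:
            blocks.append(text)
    return blocks


def detect_template(documents, min_share=0.8):
    """Multi-document TD (assumed heuristic): a block is template if it
    occurs in at least min_share of the documents in the collection."""
    counts = Counter()
    for html in documents:
        counts.update(set(split_blocks(html)))  # count each block once per document
    threshold = min_share * len(documents)
    return {block for block, n in counts.items() if n >= threshold}


def extract_by_template(html, template):
    """Keep the blocks of one document that are not part of the template."""
    return [block for block in split_blocks(html) if block not in template]


def extract_single(html, min_ratio=5.0):
    """Single-document CE (assumed heuristic): keep lines whose
    words-per-tag ratio is at least min_ratio."""
    kept = []
    for line in html.splitlines():
        tags = len(TAG.findall(line))
        words = TAG.sub(" ", line).split()
        if words and len(words) / max(tags, 1) >= min_ratio:
            kept.append(" ".join(words))
    return kept


# Hypothetical sample pages sharing one navigation bar and one footer.
NAV = '<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>'
FOOTER = '<div id="footer">Copyright Example Site</div>'
ARTICLES = [
    "The city council approved the new budget after a long debate on Tuesday.",
    "Local researchers published a study of river water quality and its seasonal changes.",
    "A storm closed several mountain roads overnight, and crews expect repairs to take days.",
]
PAGES = [f"{NAV}\n<p>{text}</p>\n{FOOTER}" for text in ARTICLES]

template = detect_template(PAGES)
print(extract_by_template(PAGES[0], template))  # uses the whole collection
print(extract_single(PAGES[0]))                 # uses only this one page
```

On these sample pages both routes keep only the article sentence, but they rely on different evidence: the template route needs the other pages of the collection, while the single-document route depends entirely on the chosen ratio threshold.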

Further Reading

  • Ziv Bar-Yossef and Sridhar Rajagopalan. Template detection via data mining and its applications. In WWW ’02: Proceedings of the 11th International Conference on World Wide Web, pages 580–591, New York, NY, USA, 2002. ACM Press.
  • Deepayan Chakrabarti, Ravi Kumar, and Kunal Punera. Page-level template detection via isotonic smoothing. In WWW ’07: Proceedings of the 16th International Conference on World Wide Web, pages 61–70, New York, NY, USA, 2007. ACM Press.
  • Sandip Debnath, Prasenjit Mitra, and C. Lee Giles. Automatic extraction of informative blocks from webpages. In SAC ’05: Proceedings of the 2005 ACM Symposium on Applied Computing, pages 1722–1726, New York, NY, USA, 2005. ACM Press.
  • Sandip Debnath, Prasenjit Mitra, and C. Lee Giles. Identifying content blocks from web documents. In Foundations of Intelligent Systems, Lecture Notes in Computer Science, pages 285–293, 2005.
  • Aidan Finn, Nicholas Kushmerick, and Barry Smyth. Fact or fiction: Content classification for digital libraries. In DELOS Workshop: Personalisation and Recommender Systems in Digital Libraries, 2001.
  • David Gibson, Kunal Punera, and Andrew Tomkins. The volume and evolution of web page templates. In WWW ’05: Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, pages 830–839, New York, NY, USA, 2005. ACM Press.
  • Thomas Gottron. Evaluating content extraction on HTML documents. In ITA ’07: Proceedings of the 2nd International Conference on Internet Technologies and Applications, pages 123–132, September 2007.
  • Thomas Gottron. Bridging the gap: From multi document template detection to single document content extraction. In EuroIMSA ’08: Proceedings of the IASTED Conference on Internet and Multimedia Systems and Applications 2008, pages 66–71. ACTA Press, Calgary, March 2008.
  • Thomas Gottron. Clustering template based web documents. In ECIR ’08: Proceedings of the 30th European Conference on Information Retrieval, pages 40–51. Springer, March 2008.
  • Thomas Gottron. Content code blurring: A new approach to content extraction. In TIR ’08: Proceedings of the 5th International Workshop on Text Information Retrieval, pages 29–33. IEEE Computer Society, September 2008.
  • Suhit Gupta, Hila Becker, Gail Kaiser, and Salvatore Stolfo. Verifying genre-based clustering approach to content extraction. In WWW ’06: Proceedings of the 15th International Conference on World Wide Web, pages 875–876, New York, NY, USA, 2006. ACM Press.
  • Suhit Gupta, Gail Kaiser, David Neistadt, and Peter Grimm. DOM-based content extraction of HTML documents. In WWW ’03: Proceedings of the 12th International Conference on World Wide Web, pages 207–214, New York, NY, USA, 2003. ACM Press.
  • Suhit Gupta, Gail Kaiser, and Salvatore Stolfo. Extracting context to improve accuracy for HTML content extraction. In WWW ’05: Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, pages 1114–1115, New York, NY, USA, 2005. ACM Press.
  • Suhit Gupta, Gail E. Kaiser, Peter Grimm, Michael F. Chiang, and Justin Starren. Automating Content Extraction of HTML Documents. World Wide Web, 8(2):179–224, 2005.
  • Hung-Yu Kao, Ming-Syan Chen, Shian-Hua Lin, and Jan-Ming Ho. Entropy-based link analysis for mining web informative structures. In CIKM ’02: Proceedings of the eleventh International Conference on Information and Knowledge Management, pages 574–581, New York, NY, USA, 2002. ACM Press.
  • Hung-Yu Kao, Jan-Ming Ho, and Ming-Syan Chen. WISDOM: Web Intrapage Informative Structure Mining Based on Document Object Model. IEEE Transactions on Knowledge and Data Engineering, 17(5):614–627, 2005.
  • Hung-Yu Kao, Shian-Hua Lin, Jan-Ming Ho, and Ming-Syan Chen. Mining Web Informative Structures and Contents Based on Entropy Analysis. IEEE Transactions on Knowledge and Data Engineering, 16(1):41–55, 2004.
  • Shian-Hua Lin and Jan-Ming Ho. Discovering informative content blocks from web documents. In KDD ’02: Proceedings of the eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 588–593, New York, NY, USA, 2002. ACM Press.
  • Ling Ma, Nazli Goharian, Abdur Chowdhury, and Misun Chung. Extracting unstructured data from template generated web documents. In CIKM ’03: Proceedings of the twelfth International Conference on Information and Knowledge Management, pages 512–515, New York, NY, USA, 2003. ACM Press.
  • Constantine Mantratzis, Mehmet Orgun, and Steve Cassidy. Separating XHTML content from navigation clutter using DOM-structure block analysis. In HYPERTEXT ’05: Proceedings of the sixteenth ACM Conference on Hypertext and Hypermedia, pages 145–147, New York, NY, USA, 2005. ACM Press.
  • David Pinto, Michael Branstein, Ryan Coleman, W. Bruce Croft, Matthew King, Wei Li, and Xing Wei. QuASM: a system for question answering using semi-structured data. In JCDL ’02: Proceedings of the 2nd ACM/IEEE-CS Joint Conference on Digital Libraries, pages 46–55, New York, NY, USA, 2002. ACM Press.
  • A. F. R. Rahman, H. Alam, and R. Hartono. Content extraction from HTML documents. In WDA 2001: Proceedings of the First International Workshop on Web Document Analysis, pages 7–10, 2001.
  • D. C. Reis, P. B. Golgher, A. S. da Silva, and A. F. Laender. Automatic web news extraction using tree edit distance. In WWW ’04: Proceedings of the 13th International Conference on World Wide Web, pages 502–511, New York, NY, USA, 2004. ACM Press.
  • Karane Vieira, Altigran S. da Silva, Nick Pinto, Edleno S. de Moura, João M. B. Cavalcanti, and Juliana Freire. A fast and robust method for web page template detection and removal. In CIKM ’06: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, pages 258–267, New York, NY, USA, 2006. ACM Press.
  • Fabio Vitali, Angelo di Iorio, and Elisa Ventura Campori. Rule-based structural analysis of web pages. In DAS 2004: Proceedings of the 6th International Workshop on Document Analysis Systems, volume 3163 of Lecture Notes in Computer Science, pages 425–437. Springer, July 2004.
  • Tim Weninger and William H. Hsu. Text extraction from the web via text-tag-ratio. In TIR ’08: Proceedings of the 5th International Workshop on Text Information Retrieval, pages 23–28. IEEE Computer Society, September 2008.
  • Guizhen Yang, I. V. Ramakrishnan, and Michael Kifer. On the complexity of schema inference from web pages in the presence of nullable data attributes. In CIKM ’03: Proceedings of the twelfth International Conference on Information and Knowledge Management, pages 224–231, New York, NY, USA, 2003. ACM Press.
  • Lan Yi, Bing Liu, and Xiaoli Li. Eliminating noisy information in web pages for data mining. In KDD ’03: Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 296–305, New York, NY, USA, 2003. ACM Press.

Contact

Thomas Gottron
Mail: gottron-AT-uni-mainz.de