Web Page Archiving and Content Analysis


Mark Finlayson

Product Owner(s):

Andres Cremesini


Masoud Sadjadi


Web Page Archiving and Content Analysis 1.0. Before this project, no solution existed for batch-downloading large sets of web pages into a format that allows easy programmatic access to the pages’ component parts. The tool built during this project takes a provided set of URLs and, for each news article or blog post:

- downloads a faithful snapshot, with all multimedia in its original formats and with original file names;
- encapsulates the snapshot in a single file from which images, videos, and other multimedia can be easily extracted;
- allows the archive to be opened in a browser for viewing and browsing, with links to external sites preserved and links to local multimedia resources pointing to the in-archive artifacts;
- identifies the main textual content of the article and extracts it into a separate file with appropriate encodings;
- identifies the posting, publishing, and/or correction date of the article.
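Two of the capabilities described above, pointing multimedia links at in-archive copies and finding candidate dates, can be illustrated with a minimal Python sketch. This is not the project's implementation, which is not described here. The `media` folder name, the `SnapshotScanner` class, the helper functions, and the list of date-bearing meta tags are all assumptions made for illustration, and the sketch only scans and rewrites HTML without downloading anything.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical in-archive folder for media; the real archive layout is not stated.
MEDIA_DIR = "media"

# Meta tag keys that often carry dates (an assumption; sites vary widely).
DATE_KEYS = ("article:published_time", "article:modified_time", "date")


class SnapshotScanner(HTMLParser):
    """Collects media URLs and candidate date strings from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.media = {}  # original src attribute -> absolute URL
        self.dates = {}  # meta key -> content string

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "video", "audio", "source") and attrs.get("src"):
            self.media[attrs["src"]] = urljoin(self.base_url, attrs["src"])
        elif tag == "meta":
            key = attrs.get("property") or attrs.get("name")
            if key in DATE_KEYS and attrs.get("content"):
                self.dates[key] = attrs["content"]


def local_name(url):
    """Build an in-archive path that keeps the original file name."""
    return posixpath.join(MEDIA_DIR, posixpath.basename(urlparse(url).path))


def rewrite_media_links(html, base_url):
    """Point media src attributes at in-archive paths; leave <a href> links alone.

    Simplification: only double-quoted src attributes without character
    references are rewritten.
    """
    scanner = SnapshotScanner(base_url)
    scanner.feed(html)
    for src, absolute in scanner.media.items():
        html = html.replace(f'src="{src}"', f'src="{local_name(absolute)}"')
    return html, scanner.media, scanner.dates
```

Calling `rewrite_media_links(page_html, page_url)` returns the rewritten HTML, a mapping from each original `src` value to the absolute URL that would need to be downloaded, and any date strings found in the page's meta tags. A real tool would also need to handle file-name collisions, CSS-referenced resources, and pages that state their dates only in visible text.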

Team Members

Mark Fajet

Juan Alvarado