Extracting information from a hierarchy of web pages, how does one go about doing it quickly?

Vincent G

New member
I need to access and collect information that is presently available in a bunch of web pages. From a root web page, several HTML links point to second-level pages, which in turn point to third-level pages. Those third-level pages contain data that I need to extract; in addition, they also refer to one or several fourth-level pages that contain the final information that needs extracting.
It is actually a pity: the nice, user-friendly query format of the existing pages is intended to make accessing the data easier, but the intent there is to let someone get to one specific bit of data, whereas I need to get all of the information, filter out what is not relevant, and organize the rest into a database that an off-line program can access and use.

My need here is basically to avoid doing thousands of copy-and-paste operations by hand. I would like a tool to follow the path of links, access the page information, and append it to a file that could then be edited or otherwise filtered.
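
To make the idea concrete, the elementary operation I picture is just "fetch a page and append its source to a running dump file". A rough, untested Python sketch (the URL and the file name here are placeholders, not the real ones):

```python
import urllib.request

def append_page(url, dump_path="harvest.txt"):
    # Download the page source and tack it onto the end of the dump file,
    # with a marker line so the blocks stay separable later.
    html = urllib.request.urlopen(url).read().decode("utf-8", errors="replace")
    with open(dump_path, "a", encoding="utf-8") as out:
        out.write("=== " + url + " ===\n")
        out.write(html + "\n")

append_page("http://example.com/")   # placeholder URL
```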
I understand that there are several free web-harvesting tools available that could do something along those lines. Since this task is unlikely to have to be repeated in the future, the investment in learning all the bells and whistles of a very powerful and sophisticated system is not considered worthwhile at this point; quick and dirty is good enough. Once the data is in a single file, the proper formatting and filtering (and removal of the HTML tags, if the data is acquired as page source) is comparatively a small issue, easily addressed.
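
For that later clean-up step, even something as crude as the following throwaway sketch would probably do for my purposes (it assumes the harvested source was dumped into a file named harvest.txt, which is an arbitrary name on my part):

```python
import re

with open("harvest.txt", encoding="utf-8") as f:
    raw = f.read()

# Drop script blocks first, then any remaining tags, then squeeze whitespace.
text = re.sub(r"(?is)<script.*?</script>", " ", raw)
text = re.sub(r"<[^>]+>", " ", text)
text = re.sub(r"[ \t]+", " ", text)

with open("harvest_clean.txt", "w", encoding="utf-8") as f:
    f.write(text)
```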

Can anyone offer a suggested approach / tool / package that could perform that task?
The other restriction is that I will be doing the deed from a Windows XP environment (I lost LILO when I had to reinstall the Windows OS and have not had time to fix that), although I do have Cygwin.

To summarize the architecture to crawl and harvest, by analogy with geographic data:

root with list of countries -> country pages, with list of cities -> city pages, with list of boroughs -> borough pages

The information that needs extracting is that of the "city pages" and that of the "borough pages"; preferably each block will carry a URL tag so that the boundaries between data blocks can be tracked in the resulting file.
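
Put together, what I imagine the tool (or a quick script) doing is roughly the following, again an untested sketch with a placeholder root URL and a deliberately crude link pattern; it also has no protection against loops or stray links, which a real run would need:

```python
import re
import urllib.request
from urllib.parse import urljoin

ROOT = "http://example.com/countries.html"      # placeholder for the real root page
LINK_RE = re.compile(r'href="([^"]+)"', re.I)   # crude; the real pages may need a tighter pattern

def fetch(url):
    return urllib.request.urlopen(url).read().decode("utf-8", errors="replace")

def crawl(url, level, out, max_level=4):
    html = fetch(url)
    if level >= 3:                               # city (3rd-level) and borough (4th-level) pages hold the data
        out.write("=== %s ===\n%s\n" % (url, html))
    if level < max_level:                        # keep walking down: root -> country -> city -> borough
        for href in LINK_RE.findall(html):
            crawl(urljoin(url, href), level + 1, out)

with open("harvest.txt", "w", encoding="utf-8") as out:
    crawl(ROOT, 1, out)
```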
 