Retrieving data or source code from linked webpages
I'm working with many webpages that are basically lists of names, with each name being a link to that person's own webpage. Is there a simple way to retrieve the data or the source code of every one of these linked pages, starting from the name-list page, in a single operation? Either the actual data, or the source code of all of the webpages in a single text file that I can parse myself, would be very useful. Thanks.
Re: Retrieving data or source code from linked webpages
If by "source code" you mean the HTML source, you can do this with wget.
Re: Retrieving data or source code from linked webpages
Even though I'm replying a little late to this, I believe others could still benefit from it.
I had very good results with WebCopier Pro 5.4, available here: http://www.maximumsoft.com/products/.../overview.html
You can download entire websites for offline browsing. There are even options to disguise the client as an old IE or Firefox front end, and to avoid suspicion you can randomize the download speed and the number of links fetched per hour or day. I think full-version downloads are also available on popular download portals.
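If you end up scripting the downloads yourself rather than using a packaged tool, those same two tricks (a browser-looking User-Agent string and a randomized pause between requests) are easy to reproduce. A minimal Python sketch, with placeholder URLs and an arbitrary Firefox User-Agent string:

    import random
    import time
    import urllib.request

    urls = ["http://example.com/page1.html",    # placeholders: the pages you want
            "http://example.com/page2.html"]

    for url in urls:
        req = urllib.request.Request(url, headers={
            # present the script as an ordinary browser instead of Python's default agent
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0",
        })
        html = urllib.request.urlopen(req).read().decode("utf-8", "replace")
        print(url, len(html), "bytes")
        # pause a random few seconds so the requests don't arrive at machine-gun speed
        time.sleep(random.uniform(2.0, 10.0))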
Another free utility is HTTrack, though the results weren't very encouraging for me personally.
Re: Retrieving data or source code from linked webpages
Not sure if I've understood you correctly, but if you're looking to gather anything from a website, IDM has a Site Grabber feature which builds the task to your specifications and runs it very efficiently.
Re: Retrieving data or source code from linked webpages
Oh no, Site Grabber just gathers the content. If you want a truly working offline site, with all the Java and API scripting intact, that's what WebCopier Pro and its ilk do.
Re: Retrieving data or source code from linked webpages
Quote:
Originally Posted by
anon
If by "source code" you mean the HTML source, you can do this with wget.
I tried this with wget, and while it did recursively grab every page, it gave me a text file in which everything ran on as a single line. Is there a way to make it store the HTML source in a formatted form so that a program could read and filter it line by line?
Re: Retrieving data or source code from linked webpages
Quote:
Originally Posted by
leonmyerson
I tried this with wget, and while it did recursively grab every page, it gave me a text file in which everything ran on as a single line.
That might be because the HTML is being delivered with Unix-style newlines (0x0A) instead of Windows-style (0x0D 0x0A). Make your parser recognize both as line breaks if that's the case, or preprocess the documents with a simple search-and-replace script.
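For what it's worth, if the follow-up parsing happens in Python, its universal-newline handling already treats LF, CR LF and CR as line breaks, so something like this sidesteps the issue (the filename is a placeholder):

    # reading in text mode with newline=None enables universal newlines,
    # so \n, \r\n and \r are all recognized as line endings
    with open("page.html", encoding="utf-8", newline=None) as f:
        for line in f:
            print(line.rstrip())

    # or, if the whole document is already in a string:
    # lines = raw_html.splitlines()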