Retrieving data or source code from linked webpages
I'm working with many webpages that are basically lists of names, with each name being a link to that person's own webpage. Is there a simple way to retrieve the data or the source code of every one of these linked pages, starting from the name-list page, in a single operation? Either the actual data, or the source code of all of the webpages in a single text file that I can parse myself, would be very useful. Thanks.
Re: Retrieving data or source code from linked webpages
If by "source code" you mean the HTML source, you can do this with wget.
Re: Retrieving data or source code from linked webpages
Even though I'm replying a little late to this, I believe others could still benefit from it.
I had very good results with WebCopier Pro 5.4, available here: http://www.maximumsoft.com/products/.../overview.html
You can download entire websites for offline browsing. There are even options to disguise the client as an old IE or Firefox front end, and to avoid suspicion you can randomize the download speed and the number of links fetched per hour or day. I think full-version downloads are also available on popular download portals.
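If you end up scripting the downloads yourself rather than using a packaged tool, those same two tricks (a browser-looking User-Agent string and a randomized pause between requests) are easy to reproduce. A minimal Python sketch, with placeholder URLs and an arbitrary Firefox User-Agent string:

    import random
    import time
    import urllib.request

    urls = ["http://example.com/page1.html",    # placeholders: the pages you want
            "http://example.com/page2.html"]

    for url in urls:
        req = urllib.request.Request(url, headers={
            # present the script as an ordinary browser instead of Python's default agent
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0",
        })
        html = urllib.request.urlopen(req).read().decode("utf-8", "replace")
        print(url, len(html), "bytes")
        # pause a random few seconds so the requests don't arrive at machine-gun speed
        time.sleep(random.uniform(2.0, 10.0))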
Another free utility is HTTrack, though the results weren't very encouraging for me personally.
Re: Retrieving data or source code from linked webpages
Not sure if I've understood you correctly, but if you're looking to gather anything from a website, IDM has a Site Grabber feature which builds the task to your specifications and runs it very efficiently.
Re: Retrieving data or source code from linked webpages
Oh no, Site Grabber just gathers the content. If you want a truly working offline site, with all the Java and API scripting intact, that's what WebCopier Pro and its ilk do.
Re: Retrieving data or source code from linked webpages
Quote:
Originally Posted by
anon
If by "source code" you mean the HTML source, you can do this with wget.
I tried this with wget, and while it did recursively grab every page, it gave me a text file in which everything ran on as a single line. Is there a way to make it store the HTML source in a formatted form so that a program could read and filter it line by line?
Re: Retrieving data or source code from linked webpages
Quote:
Originally Posted by
leonmyerson
I tried this with wget, and while it did recursively grab every page, it gave me a text file in which everything ran on as a single line.
That might be because the HTML is being delivered with Unix-style newlines (0x0A) instead of Windows-style (0x0D 0x0A). Make your parser recognize both as line breaks if that's the case, or preprocess the documents with a simple search-and-replace script.
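For what it's worth, if the follow-up parsing happens in Python, its universal-newline handling already treats LF, CR LF and CR as line breaks, so something like this sidesteps the issue (the filename is a placeholder):

    # reading in text mode with newline=None enables universal newlines,
    # so \n, \r\n and \r are all recognized as line endings
    with open("page.html", encoding="utf-8", newline=None) as f:
        for line in f:
            print(line.rstrip())

    # or, if the whole document is already in a string:
    # lines = raw_html.splitlines()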