
Retrieving data or source code from linked webpages



leonmyerson
08-28-2014, 02:03 PM
I'm working with many webpages that are basically lists of names, with each name being a link to that person's own webpage. Is there a simple way to retrieve the data or the source code of every one of these linked pages, starting from the name-list page, in a single operation? Either the actual data, or the source code of all of the pages in a single text file that I can parse myself, would be very useful. Thanks.

anon
08-29-2014, 01:41 AM
If by "source code" you mean the HTML source, you can do this with wget.
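
For example (just a sketch; the exact flags depend on your wget version and how the site is laid out, and the URL here is made up):

wget -r -l 1 -np -E -k -w 2 http://example.com/namelist.html

-r with -l 1 recurses one level deep (the list page plus every page it links to), -np stops it from wandering up the directory tree, -E and -k fix up file extensions and links for local use, and -w 2 waits a couple of seconds between requests so you don't hammer the server. Afterwards you can concatenate the saved .html files into a single text file and parse that.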

scdas141
09-16-2014, 08:47 PM
Sorry for replying a little late to this, but I believe others could still benefit from it.

I had very good results with WebCopier Pro 5.4, available here: http://www.maximumsoft.com/products/wc_pro/overview.html



You can download entire websites for offline browsing. There are options to disguise the client as an old IE or Firefox front end, and to avoid suspicion you can randomize the download speed and the number of links fetched per hour or day. I think full-version downloads are also available on popular portals.




Another free utility is HTTrack (www.httrack.com), though it didn't give very encouraging results for me personally.

whatcdfan
09-17-2014, 04:58 AM
Not sure if I've understood you correctly, but if you're looking to gather anything from a website, IDM has a site grabber that builds the task to your specifications and runs it very efficiently.

scdas141
09-17-2014, 06:50 PM
Oh no, the site grabber just gathers the content. If you want a truly working offline site, with all the JavaScript and API scripting intact, that's what WebCopier Pro and its ilk do.

leonmyerson
09-18-2014, 03:21 AM
If by "source code" you mean the HTML source, you can do this with wget.

I tried this with wget, and while it did recursively grab every page, it gave me a text file in which everything ran together on a single line. Is there a way to make it store the HTML source in a formatted form so that a program can read and filter it line by line?

anon
09-18-2014, 02:34 PM
I tried this with wget, and while it did recursively grab every page, it gave me a text file in which everything ran on as a single line.

Might be because the HTML is being delivered with Unix-style newlines instead of Windows-style. Make your parser recognize both 0x0A (LF) and 0x0D 0x0A (CRLF) as line breaks if that's the case, or preprocess the documents with a simple search-and-replace script.
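
If your parser happens to be in Python, for instance, you don't even need the preprocessing step: text-mode file reading treats both LF and CRLF as line breaks. A minimal sketch (the downloaded_pages folder name and the href filter are just placeholders):

import glob

# Walk every saved page and filter it line by line.
# Python's text mode handles \n, \r\n and \r uniformly,
# so the Unix-vs-Windows newline difference goes away.
for path in glob.glob("downloaded_pages/*.html"):
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if "href=" in line:  # example filter: keep lines containing links
                print(path, line.strip())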