
Thread: Retrieving data or source code from linked webpages

  1. #1
I'm working with many webpages that are basically lists of names, with each name being a link to that person's own webpage. Is there a simple way to retrieve the data or the source code of every one of these linked pages, starting from the name-list page, in a single operation? Either the actual data, or the source code of all of the pages in a single text file that I could parse myself, would be very useful. Thanks.

  2. #2
    If by "source code" you mean the HTML source, you can do this with wget.
    "I just remembered something that happened a long time ago."

  3. #3
    Sorry for replying a little late to this, but I believe others could still benefit from it.

    I had very good results with WebCopier Pro 5.4, available here: http://www.maximumsoft.com/products/.../overview.html

    You can download entire websites for offline browsing. There are also options to disguise the client as an old IE or Firefox front end, and to avoid suspicion you can randomize the download speed and the number of links fetched per hour or day. I think full-version downloads are also available on popular portals.

    Another free utility is HTTrack, though it didn't give very encouraging results for me personally.

  4. #4
    Not sure if I've understood you correctly, but if you're looking to gather anything from a website, IDM has a Site Grabber feature that builds a task from your specifications and runs it very efficiently.

  5. #5
    Oh no, Site Grabber just gathers the content. If you want a truly working offline site, with all the JavaScript and API scripting intact, that's what WebCopier Pro and its ilk are for.

  6. #6
    Quote Originally Posted by anon View Post
    If by "source code" you mean the HTML source, you can do this with wget.
    I tried this with wget, and while it did recursively grab every page, it gave me a text file in which everything ran on as a single line. Is there a way to make it store the HTML source in a formatted form so that a program could read and filter it line by line?

  7. #7
    Quote Originally Posted by leonmyerson View Post
    I tried this with wget, and while it did recursively grab every page, it gave me a text file in which everything ran on as a single line.
    Might be because the HTML is being delivered using Unix-style newlines instead of Windows-style. Make your parser recognize both 0x0A and 0x0D0A as line breaks if that's the case (or preprocess the documents with a simple search-and-replace script).
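    If preprocessing sounds easier than changing the parser, a simple script along these lines (a rough sketch; the file paths come from the command line) converts bare 0x0A line endings to 0x0D0A so a Windows-style line reader is happy:

    Code:
    # Rough sketch of the "preprocess" option: rewrite a saved HTML file so every
    # bare LF (0x0A) becomes CRLF (0x0D 0x0A).
    import sys

    def normalize(in_path, out_path):
        with open(in_path, "rb") as f:
            data = f.read()
        # Strip existing CRs first so lines that are already CRLF aren't doubled,
        # then turn every LF into CRLF.
        data = data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
        with open(out_path, "wb") as f:
            f.write(data)

    if __name__ == "__main__":
        # Usage: python normalize.py input.html output.html
        normalize(sys.argv[1], sys.argv[2])

    (If the parser is itself written in Python, there's no need for any of this: opening the file in text mode already treats both 0x0A and 0x0D0A as line breaks.)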
    "I just remembered something that happened a long time ago."
