Thread: Wanted: Coder to capture/parse web page data - will pay $

  1. #1
    Join Date
    Mar 2003
    Location
    Arizona
    Posts
    519

    Wanted: Coder to capture/parse web page data - will pay $

    I need a good scripter who can capture data from a web page. The data seems to be coming down in XML format, and I cannot work out how to access it.

    Here's the 30 second rundown on how I get to the data now...

    Log in to a web page (I have a membership)
    Select the search
    Enter the date range and click the search button
    Output is displayed in a table with links to the actual data
    I must then click on each link, which brings up a screen with all the data I need.

    I must do this for each dataset... there are hundreds.

    I'd like to automate this process and capture the data into a database.

    If you can accomplish this task in short order, please contact me and we can discuss the terms of the agreement.

    thanks!

    livin @ cox.net
    Aaron
    ----------
    My Setup:
    XBMC Media Center, Whole-House Audio, Paradigm-Onkyo-Parasound-Velodyne, 65" Mits 1080p DLP, EventGhost, Homeseer

  2. #2
    Join Date
    Dec 2001
    Posts
    11,560

    Can you post a file with the xml data in it?

  3. #3
    Join Date
    Mar 2003
    Location
    Arizona
    Posts
    519

    Hi Mike,
    If I knew where the XML info was, I wouldn't need a coder.

    Seriously though... I cannot find it. Everything is done via HTTPS and ASP, and I have no idea how to get to the data.

    One problem is that the web page does not allow you to view the source. So I cannot see any links per se.


    The attached Search.jpg is a screenshot of what I see after I run the search. I then need to click on the number to get the specific data for each person. That individual data is what I'd like to place in the database.
    Once I click on the number, the existing window is reused and its contents are replaced by the page shown in Person.jpg (the screenshot shows only part of the data).

    I'm a decent hack and could pull the data if it were obvious where it gets pulled to (a temp file, etc.). I used WebCopier to look at the ASP pages and have attached a ZIP containing them.


    thanks!
    Aaron

  4. #4
    Join Date
    Dec 2001
    Posts
    11,560

    Hm, this is out of my league.

  5. #5
    Join Date
    May 2001
    Location
    Chestnut Hill, MA, USA
    Posts
    527

    I believe the WebCopier files you attached just show that WebCopier did not do a good enough job of impersonating IE for the site to accept it and send back data. They contain what you would see if you browsed there with an unsupported browser.

    So, one approach would be to find a better trace tool. If your supposition is correct that the data is ultimately XML with client-side JavaScript formatting, then finding some internal URL that returns the right data may be all there is to it. Perhaps using Ethereal to look at the packets would help. Perhaps there is a Firefox plug-in that logs every URL the browser visits, and maybe even some of the data; that's the sort of odd thing that people develop for it.
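
    If such a URL does turn up, getting its XML into a database should be the easy part. Here is a minimal Python sketch using only the standard library. Nobody here has seen the real feed, so the element names (results, person, id, name), the sample document, and the table layout are all made up for illustration:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample of what the site's XML might look like.
SAMPLE_XML = """
<results>
  <person><id>1001</id><name>Jane Doe</name></person>
  <person><id>1002</id><name>John Roe</name></person>
</results>
"""


def parse_people(xml_text):
    """Return a list of (id, name) tuples, one per <person> element."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for person in root.iter("person"):
        rows.append((int(person.findtext("id")), person.findtext("name")))
    return rows


def store_people(conn, rows):
    """Insert rows into a 'people' table, creating it if needed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS people (id INTEGER PRIMARY KEY, name TEXT)"
    )
    # INSERT OR REPLACE lets the script be re-run without duplicate-key errors.
    conn.executemany("INSERT OR REPLACE INTO people VALUES (?, ?)", rows)
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # pass a file path instead to keep the data
    store_people(conn, parse_people(SAMPLE_XML))
    for row in conn.execute("SELECT id, name FROM people ORDER BY id"):
        print(row)
```

    The real work is still finding the URL and getting past the login; this only covers what happens once the XML text is in hand.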

    Alternatively, ignore what goes over the wire and concentrate on the ultimate goal of getting the data. Specifically, get a top-notch HTML screen scraper. A good one should (1) do a very good job of looking like a real browser to the site, (2) have a way of automating the navigation that a non-coder can orchestrate, and (3) extract the data into a structured format (so, HTML tables become Access databases). I am no longer qualified to say what the state of the art is here. But a quick web search shows some that are commercial products and you said you were willing to pay.
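
    To give a feel for what the extraction step (3) amounts to, here is a rough Python sketch that pulls the rows and link targets out of a results table using the standard-library html.parser module. The sample HTML, the column names, and the person.asp?id=... link format are guesses for illustration only, not taken from the actual site, and a commercial scraper would also handle the login and navigation that this leaves out:

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the search results page.
SAMPLE_HTML = """
<table>
  <tr><th>Number</th><th>Name</th></tr>
  <tr><td><a href="person.asp?id=1001">1001</a></td><td>Jane Doe</td></tr>
  <tr><td><a href="person.asp?id=1002">1002</a></td><td>John Roe</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the cell text of every table row and every link inside a cell."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cell strings per <tr>
        self.links = []     # href of every <a> found inside a cell
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
        elif tag == "a" and self._cell is not None:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Text outside a cell (whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html_text):
    """Return (rows, links) extracted from the given HTML text."""
    scraper = TableScraper()
    scraper.feed(html_text)
    return scraper.rows, scraper.links


if __name__ == "__main__":
    rows, links = scrape_table(SAMPLE_HTML)
    print(rows)
    print(links)
```

    The links list is what an automated run would then visit one by one to fetch each person's detail page.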

    Finally, without wanting to seem alarmist, I should remind you to consider what the provider of this data thinks of what you are doing. Perhaps you already know that they don't care, so long as you don't impose a burden on their tech support. On the other hand, the draconian DMCA gives them a good deal of leeway to claim that you have circumvented some kind of protection on copyrighted data. It even allows prosecution of someone who helps you figure out how to get around it, although there is no actual precedent for that.

  6. #6
    Join Date
    Mar 2003
    Location
    Arizona
    Posts
    519

    MMcM,
    That is good info, thank you.

    As for the data provider, I'm sure they do not want to make it easy to get more than a few records at a time... but I pay them a subscription fee, and thus hold the license (right) to get the data and use it for my business.
    Aaron
