
Wanted: Coder to capture/parse web page data - will pay $



Aaron
September 2nd, 2005, 11:14 AM
I need a good scripter who can capture data from a web page. The data seems to be coming down in XML format and I cannot figure out how to access it.

Here's the 30-second rundown of how I get to the data now...

Login to a web page (I have a membership)
Select the search
Input the date range and select search button
Output is displayed in a table with links to the actual data
I must then click on each link, which brings up a screen with all the data I need.

I must do this for each dataset... there are hundreds.

I'd like to automate this process and capture the data into a database.
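
To give an idea of what I'm picturing, here's a rough Python sketch of the whole loop. Every URL, form field, and link detail below is a placeholder, since I don't know the real ones:

    # Rough sketch of the automation I'm after. All URLs, form fields,
    # and the link pattern are placeholders -- not the site's real ones.
    import sqlite3
    from urllib.parse import urljoin
    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()

    # 1. Log in with my membership so the session carries the cookies.
    session.post("https://example.com/login.asp",
                 data={"username": "me", "password": "secret"})

    # 2. Run the search for a date range.
    results = session.post("https://example.com/search.asp",
                           data={"from": "2005-08-01", "to": "2005-09-01"})

    # 3. Follow each link in the results table and store the detail page.
    db = sqlite3.connect("capture.db")
    db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, html TEXT)")
    soup = BeautifulSoup(results.text, "html.parser")
    for link in soup.find_all("a", href=True):
        detail = session.get(urljoin("https://example.com/search.asp",
                                     link["href"]))
        db.execute("INSERT INTO pages VALUES (?, ?)",
                   (link["href"], detail.text))
    db.commit()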

If you can accomplish this task in short order, please contact me and we can discuss the terms of the agreement.

thanks!

livin @ cox.net

Promixis
September 2nd, 2005, 12:28 PM
Can you post a file with the xml data in it?

Aaron
September 2nd, 2005, 12:32 PM
Hi Mike,
If I knew where the XML info was, I wouldn't need a coder ;)

Seriously though... I cannot find it. Everything is done via HTTPS and ASP, and I have no idea how to get at the data.

One problem is that the web page does not allow you to view the source, so I cannot see any links per se.


Attached Search.jpg is a screenshot of what I see after I do the search. I then need to click on the number to get the specific data for each person; that individual data is what I'd like to place in the database. Once I click on the number, the existing window is replaced by the Person.jpg info (only partial data in the screenshot).

I'm a decent hack and could pull the data myself if it were obvious where it gets pulled to... a temp file, etc. I used WebCopier to grab the ASP pages and have attached a ZIP with them in it.


thanks!
Aaron

Promixis
September 2nd, 2005, 01:11 PM
Hm, this is out of my league :(

MMcM
September 3rd, 2005, 02:48 PM
I believe the WebCopier files you attached just show that it did not do a good enough job of impersonating IE for the site to accept it and send back data. It's just showing what you would see if you browsed there with an unsupported browser.

So, one approach would be to find a better trace tool. If your supposition is correct that the data is ultimately XML with client-side JavaScript formatting, then finding the internal URL that returns the right data may be all there is to it. Perhaps using Ethereal to look at the packets would help. Perhaps there is a Firefox plug-in that logs every URL the browser visits, and maybe even some of the data; that's the sort of odd thing people develop for it.
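
For instance, once a trace turns up the internal URL, pulling the XML down might be as simple as the following rough Python sketch. The URL, login fields, and parameters here are pure guesses:

    # Rough sketch: fetch a hypothetical internal XML endpoint while
    # pretending to be IE, reusing the logged-in session's cookies.
    # The URLs, form fields, and parameters are all invented.
    import requests

    session = requests.Session()
    session.headers["User-Agent"] = (
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)")  # look like IE

    # Log in first so the session carries the membership cookies.
    session.post("https://example.com/login.asp",
                 data={"user": "me", "pass": "secret"})

    # Then hit the internal URL spotted in the packet trace.
    resp = session.get("https://example.com/data.asp",
                       params={"start": "2005-08-01", "end": "2005-09-01"})
    print(resp.text)  # hopefully the raw XML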

Alternatively, ignore what goes over the wire and concentrate on the ultimate goal of getting the data. Specifically, get a top-notch HTML screen scraper. A good one should (1) do a very good job of looking like a real browser to the site, (2) have a way of automating the navigation that a non-coder can orchestrate, and (3) extract the data into a structured format (so, HTML tables become Access databases). I am no longer qualified to say what the state of the art is here. But a quick web search shows some that are commercial products and you said you were willing to pay.
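
And to show what the extraction end of that looks like, here is a rough Python sketch that walks an HTML results table into SQLite. The table id and column layout are invented for illustration:

    # Rough sketch: parse the rows of a results table out of saved HTML
    # and insert them into SQLite. Table id and columns are made up.
    import sqlite3
    from bs4 import BeautifulSoup

    html = open("person.html").read()
    soup = BeautifulSoup(html, "html.parser")

    db = sqlite3.connect("results.db")
    db.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, value TEXT)")

    # Skip the header row, then take the first two cells of each row.
    for row in soup.find("table", id="results").find_all("tr")[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) >= 2:
            db.execute("INSERT INTO people VALUES (?, ?)",
                       (cells[0], cells[1]))
    db.commit()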

Finally, without wanting to seem alarmist, I should remind you to consider what the provider of this data thinks of what you are doing. Perhaps you already know that they don't care, so long as you don't impose a burden on their tech support. On the other hand, the draconian DMCA allows them a good deal of leeway in claiming that you have circumvented some kind of copyrighted data protection. It even allows prosecution of someone who helps you figure out how to get around it, although there is no actual precedent for that.

Aaron
September 3rd, 2005, 03:54 PM
MMcM,
That is good info, thank you.

As for the data provider, I'm sure they do not want to make it easy to get the data more than a few records at a time... but I pay them a subscription fee, and thus have a license (a right) to get it and use it for my business.