
Using Scraper with rubbish HTML

I'm trying to update my Spore action stream to cope now that the full game is out. EA have done more with the web services, but they've not done a great job of it, so I need to fill in the gaps somehow!

Anyway, the problem I'm trying to resolve at the moment is that, when using the scraper, I want to grab the second td and not the first. Here is an example of a Spore achievement:

        <b>Civilization stage unlocked</b>

        Play enough of the Tribe stage to unlock the Civilization stage

         Tue September 23, 2008

I am getting the image and the title fine. I just want the description as well, and using

- tr td
- TEXT

is grabbing the first td, not the second. It would help if EA had written clean HTML without layout tables, but there's not a lot we can do about that!

Any ideas about how to grab the contents of the second td?

Thanks

5 Replies

  • I had expected that, with no preview available, so here is the first HTML block as it should have appeared above (with no changes, in fact; it looks like Markdown processing differs between post entries and comments):

    <tr>
        <td>
            <img height="60" align="middle" width="60" alt="Achievement" src="/static/war/images/achievements/0x64d3560.png"/>
        </td>
        <td>
            <b>Civilization stage unlocked</b>
            <br/>
            Play enough of the Tribe stage to unlock the Civilization stage
            <br/>
            <span style="font-size: 9px; font-weight: bold;"> Tue September 23, 2008</span>
        </td>
    </tr>
    
  • If that's the way that they structured the HTML, I think you'll have to use a custom collector.

    • Dang, I was hoping to avoid that. I know very little Perl indeed!

      Would it be considered bad form to parse the information via my website (so that I can do the coding in ASP) and return that to the stream?

      Or is there anyone here who could do that in Perl easily?

  • Further to this, it seems it is possible with XPath; however, the scraper is changing what I type in. It is converting this:

    td[position()=2]
    

    to this:

    td[@position()=2]
    

    It looks like this is happening in the actual Web::Scraper module and not in the Action Streams plugin, so I'm not sure how to fix it.

    That said, if Web::Scraper is just running an XPath query on the HTML, would it be possible to run that XPath query myself in the action stream?

  • I'd give the XPath option a shot. I had forgotten about that as an option. Is there any way that you can get them to deliver this information as a feed?
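
For anyone wanting to check the XPath idea outside the plugin, here is a minimal sketch in Python (not Perl, and not Web::Scraper itself) that runs a positional XPath against the achievement row quoted earlier in the thread. It uses `td[2]`, which in XPath 1.0 is the abbreviated form of `td[position()=2]`. Whether Web::Scraper passes the short form through unchanged is not confirmed here. The function name `parse_achievement` is made up for illustration, and the standard-library parser only accepts well-formed markup, so real pages from EA may need a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# The achievement row as posted in the thread (it happens to be well-formed XML).
ROW = """
<tr>
    <td>
        <img height="60" align="middle" width="60" alt="Achievement" src="/static/war/images/achievements/0x64d3560.png"/>
    </td>
    <td>
        <b>Civilization stage unlocked</b>
        <br/>
        Play enough of the Tribe stage to unlock the Civilization stage
        <br/>
        <span style="font-size: 9px; font-weight: bold;"> Tue September 23, 2008</span>
    </td>
</tr>
"""


def parse_achievement(row_markup):
    """Hypothetical helper: pull title, description and date from one row."""
    row = ET.fromstring(row_markup.strip())

    # "td[2]" selects the second td child of the row (XPath positions start at 1).
    cell = row.find("td[2]")

    title = cell.find("b").text.strip()
    # The description is loose text sitting after the first <br/>,
    # which ElementTree exposes as that element's "tail".
    description = (cell.find("br").tail or "").strip()
    date = cell.find("span").text.strip()

    return {"title": title, "description": description, "date": date}


if __name__ == "__main__":
    print(parse_achievement(ROW))
```

The description is not wrapped in its own element, so selecting the second td alone returns the title, description and date together. Here the loose text after the first `<br/>` is read separately to isolate the description.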
