fbc

Webscraping help with Goutte.

I have installed Goutte. The docs show extracting data like so:

// Get the latest post in this category and display the titles
$crawler->filter('h2 > a')->each(function ($node) {
    print $node->text()."\n";
});

However, the webpage I'm trying to scrape for one value has multiple tables, with several values on one line, like so:

        <TD VALIGN="TOP" WIDTH="40%">
            <TABLE BORDER="1" WIDTH="100%">
                <TR>
                    <TH COLSPAN="4"><CENTER><B>GENERATION</B></CENTER></TH>
                </TR>
                <TR>
                    <TD BGCOLOR="#336699"> <P ALIGN=RIGHT><FONT SIZE="-2" FACE="Arial,Helvetica" COLOR="White">GROUP</FONT></TD>
                    <TD BGCOLOR="#336699"> <P ALIGN=RIGHT><FONT SIZE="-2" FACE="Arial,Helvetica" COLOR="White">MC</FONT></TD>
                    <TD BGCOLOR="#336699"> <P ALIGN=RIGHT><FONT SIZE="-2" FACE="Arial,Helvetica" COLOR="White">TNG</FONT></TD>
                    <TD BGCOLOR="#336699"> <P ALIGN=RIGHT><FONT SIZE="-2" FACE="Arial,Helvetica" COLOR="White">DCR</FONT></TD>
                </TR>
                <TR><TD>COAL</TD><TD>5723</TD><TD>3514</TD><TD>70</TD></TR>
<TR><TD>GAS</TD><TD>7657</TD><TD>5406</TD><TD>77</TD></TR>
<TR><TD>HYDRO</TD><TD>894</TD><TD>160</TD><TD>220</TD></TR>
<TR><TD>OTHER</TD><TD>438</TD><TD>228</TD><TD>0</TD></TR>
<TR><TD>WIND</TD><TD>1445</TD><TD>212</TD><TD>0</TD></TR>
<TR><TD>TOTAL</TD><TD>16157</TD><TD>9520</TD><TD>367</TD></TR>

            </TABLE>
        </TD>

So I need to extract the SECOND value on each of these lines:

                <TR><TD>COAL</TD><TD>5723</TD><TD>3514</TD><TD>70</TD></TR>
<TR><TD>GAS</TD><TD>7657</TD><TD>5406</TD><TD>77</TD></TR>
<TR><TD>HYDRO</TD><TD>894</TD><TD>160</TD><TD>220</TD></TR>
<TR><TD>OTHER</TD><TD>438</TD><TD>228</TD><TD>0</TD></TR>
<TR><TD>WIND</TD><TD>1445</TD><TD>212</TD><TD>0</TD></TR>

I assume I may need something like this? But this will give me every value, and I just need the second value on the COAL line.

$coal->filter('TABLE > TR > TD')->each(function ($node) {
    $coalvalue = $node->text();
});

UPDATE: I read that Goutte uses Symfony's DomCrawler component, and that it is what controls the filtering. https://symfony.com/doc/current/components/dom_crawler.html

DevMaster

Hello @fbc, did you solve this problem in the end?

I've used Goutte a lot and would love to help, though I might need more details.

I suspect you'll need to get all TD elements in each row and simply access the 2nd TD element.
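
A minimal sketch of that idea, assuming Goutte's dependencies (symfony/dom-crawler and symfony/css-selector) are installed through Composer. The inline HTML is a trimmed copy of the rows from the question and stands in for the crawler a real Goutte request would return; the `$secondValues` name is only illustrative.

```php
<?php

require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Sample rows copied from the question. In a real run, use the crawler
// returned by the Goutte request instead of this inline string.
$html = <<<'HTML'
<table>
<tr><th colspan="4">GENERATION</th></tr>
<tr><td>COAL</td><td>5723</td><td>3514</td><td>70</td></tr>
<tr><td>GAS</td><td>7657</td><td>5406</td><td>77</td></tr>
</table>
HTML;

$crawler = new Crawler($html);

// Map each row label (first cell) to the text of its second cell.
$secondValues = [];
$crawler->filter('tr')->each(function (Crawler $row) use (&$secondValues) {
    $cells = $row->filter('td');
    if ($cells->count() < 2) {
        return; // header rows have no second <td>
    }
    // eq() is zero-based, so eq(1) is the second cell.
    $secondValues[trim($cells->eq(0)->text())] = trim($cells->eq(1)->text());
});

print_r($secondValues);
```

On the real page the tables are nested, so an outer row would also match `tr`; narrowing the crawler to the table of interest first would probably be needed there.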

fbc

@DEVMASTER - I'm on my way to solving it; I've already gathered all the elements I think I need.

I'll be testing a sample controller in a moment.

I really appreciate your willingness to help. If I run into an error somewhere in my code, I will definitely post it here and tag you on it. Thank you.

fbc

@DEVMASTER - I can't seem to extract the proper values with:

Route::get('hdtuto', function() {

    $crawler = Goutte::request('GET', 'http://ets.aeso.ca/ets_web/ip/Market/Reports/CSDReportServlet');

    $aeso_data = $crawler->filter('TABLE > TR > TD');

    dd($aeso_data);

});

DevMaster

I think you might be looking for something like this (which is plain command-line PHP):

<?php

require __DIR__ . '/vendor/autoload.php';

use Goutte\Client;
use Symfony\Component\DomCrawler\Crawler;

$client = new Client();

// request() returns a DomCrawler instance for the fetched page.
$crawler = $client->request('GET', 'http://ets.aeso.ca/ets_web/ip/Market/Reports/CSDReportServlet');

$crawler->filter('table')->each(function (Crawler $table) {

    // Print the heading cell(s) found inside this table.
    $table->filter('th')->each(function (Crawler $heading) {
        echo 'Table: ' . $heading->text() . PHP_EOL;
    });

    // Print every cell that is the second child of its row.
    $table->filter('td:nth-child(2)')->each(function (Crawler $cell) {
        echo $cell->text() . PHP_EOL;
    });

});
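
If only the second value on the COAL line is needed, an XPath expression can select that single cell directly. The following is a rough sketch under the same assumption that symfony/dom-crawler is available through Composer; `secondCellFor` is a hypothetical helper name, and the inline HTML again stands in for the fetched page.

```php
<?php

require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Hypothetical helper: returns the text of the second <td> in the first row
// whose first <td> equals $label, or null when no such row exists.
// $label must not contain a double quote, since it is embedded in the XPath.
function secondCellFor(Crawler $crawler, string $label): ?string
{
    $xpath = sprintf('//tr[td[1][normalize-space(.) = "%s"]]/td[2]', $label);
    $cell = $crawler->filterXPath($xpath);

    return $cell->count() > 0 ? trim($cell->first()->text()) : null;
}

// Sample rows copied from the question.
$html = <<<'HTML'
<table>
<tr><td>COAL</td><td>5723</td><td>3514</td><td>70</td></tr>
<tr><td>GAS</td><td>7657</td><td>5406</td><td>77</td></tr>
</table>
HTML;

$crawler = new Crawler($html);

var_dump(secondCellFor($crawler, 'COAL'));
```

Matching on the row label rather than on a row position should keep working if the report adds or reorders rows, as long as the label text itself does not change.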
