Webscraping help with Goutte.

Posted 4 months ago by fbc

I have installed Goutte. The docs show extracting data like so:

// Get the latest post in this category and display the titles
$crawler->filter('h2 > a')->each(function ($node) {
    print $node->text()."\n";
});

However, the webpage I'm trying to scrape for one value has multiple tables, with several values on one line, like so:

        <TD VALIGN="TOP" WIDTH="40%">
            <TABLE BORDER="1" WIDTH="100%">
                <TR>
                    <TH COLSPAN="4"><CENTER><B>GENERATION</B></CENTER></TH>
                </TR>
                <TR>
                    <TD BGCOLOR="#336699"> <P ALIGN=RIGHT><FONT SIZE="-2" FACE="Arial,Helvetica" COLOR="White">GROUP</FONT></TD>
                    <TD BGCOLOR="#336699"> <P ALIGN=RIGHT><FONT SIZE="-2" FACE="Arial,Helvetica" COLOR="White">MC</FONT></TD>
                    <TD BGCOLOR="#336699"> <P ALIGN=RIGHT><FONT SIZE="-2" FACE="Arial,Helvetica" COLOR="White">TNG</FONT></TD>
                    <TD BGCOLOR="#336699"> <P ALIGN=RIGHT><FONT SIZE="-2" FACE="Arial,Helvetica" COLOR="White">DCR</FONT></TD>
                </TR>
                <TR><TD>COAL</TD><TD>5723</TD><TD>3514</TD><TD>70</TD></TR>
<TR><TD>GAS</TD><TD>7657</TD><TD>5406</TD><TD>77</TD></TR>
<TR><TD>HYDRO</TD><TD>894</TD><TD>160</TD><TD>220</TD></TR>
<TR><TD>OTHER</TD><TD>438</TD><TD>228</TD><TD>0</TD></TR>
<TR><TD>WIND</TD><TD>1445</TD><TD>212</TD><TD>0</TD></TR>
<TR><TD>TOTAL</TD><TD>16157</TD><TD>9520</TD><TD>367</TD></TR>

            </TABLE>
        </TD>

So I need to extract the second value on each of these lines:

                <TR><TD>COAL</TD><TD>5723</TD><TD>3514</TD><TD>70</TD></TR>
<TR><TD>GAS</TD><TD>7657</TD><TD>5406</TD><TD>77</TD></TR>
<TR><TD>HYDRO</TD><TD>894</TD><TD>160</TD><TD>220</TD></TR>
<TR><TD>OTHER</TD><TD>438</TD><TD>228</TD><TD>0</TD></TR>
<TR><TD>WIND</TD><TD>1445</TD><TD>212</TD><TD>0</TD></TR>

I assume I may need something like the following? But it will give me every value, and I just need the second value on the COAL line.

$coal->filter('TABLE > TR > TD')->each(function ($node) {
    $coalvalue = $node->text();
});
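
For reference, here is a minimal sketch of narrowing the match down to a single cell. It uses PHP's built-in DOM extension instead of Goutte so that it runs on its own. It assumes "second value" means the second cell of the COAL row (5723, the MC column); `$column` is an illustrative variable, and setting it to 3 would give the TNG figure instead. The table is a trimmed copy of the markup above with simplified header cells.

```php
<?php
// Trimmed copy of the table; on the real page this HTML would be fetched first.
$html = <<<'HTML'
<TABLE BORDER="1" WIDTH="100%">
<TR><TH COLSPAN="4"><CENTER><B>GENERATION</B></CENTER></TH></TR>
<TR><TD>GROUP</TD><TD>MC</TD><TD>TNG</TD><TD>DCR</TD></TR>
<TR><TD>COAL</TD><TD>5723</TD><TD>3514</TD><TD>70</TD></TR>
<TR><TD>GAS</TD><TD>7657</TD><TD>5406</TD><TD>77</TD></TR>
</TABLE>
HTML;

// Old-style markup triggers libxml warnings, so collect them quietly.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Assumption: the wanted cell is the 2nd <td> of the row (1-based, as in XPath).
$column = 2;

$xpath = new DOMXPath($doc);
// loadHTML() lowercases tag names, so the query uses tr/td rather than TR/TD.
// Pick the row whose first cell reads "COAL", then take cell number $column.
$query = sprintf('//tr[normalize-space(td[1]) = "COAL"]/td[%d]', $column);
$cells = $xpath->query($query);

$coalvalue = $cells->length > 0 ? trim($cells->item(0)->textContent) : null;
echo $coalvalue, "\n"; // prints 5723
```

DomCrawler also has a `filterXPath()` method, so the same expression could probably be passed to `$crawler->filterXPath(...)`, though I have not verified that against this page.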

UPDATE: I read that Goutte uses Symfony's DomCrawler component, which controls the filtering: https://symfony.com/doc/current/components/dom_crawler.html
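
Since the page has several tables, here is a rough sketch of collecting the second cell from every fuel row of the GENERATION table only. It again uses PHP's built-in `DOMDocument` and `DOMXPath` so that it is self-contained. The `$wanted` list and the `$values` array are names I made up for illustration, and the header cells are simplified from the real markup. It assumes the GENERATION table contains no nested tables of its own.

```php
<?php
$html = <<<'HTML'
<TABLE BORDER="1" WIDTH="100%">
<TR><TH COLSPAN="4"><CENTER><B>GENERATION</B></CENTER></TH></TR>
<TR><TD>GROUP</TD><TD>MC</TD><TD>TNG</TD><TD>DCR</TD></TR>
<TR><TD>COAL</TD><TD>5723</TD><TD>3514</TD><TD>70</TD></TR>
<TR><TD>GAS</TD><TD>7657</TD><TD>5406</TD><TD>77</TD></TR>
<TR><TD>HYDRO</TD><TD>894</TD><TD>160</TD><TD>220</TD></TR>
<TR><TD>OTHER</TD><TD>438</TD><TD>228</TD><TD>0</TD></TR>
<TR><TD>WIND</TD><TD>1445</TD><TD>212</TD><TD>0</TD></TR>
<TR><TD>TOTAL</TD><TD>16157</TD><TD>9520</TD><TD>367</TD></TR>
</TABLE>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Row labels to keep; GROUP (column headers) and TOTAL are left out.
$wanted = ['COAL', 'GAS', 'HYDRO', 'OTHER', 'WIND'];
$values = [];

// Start from the <th> reading "GENERATION", climb to its nearest enclosing
// table, then take every row in it that has exactly four <td> cells.
$rows = $xpath->query(
    '//th[normalize-space(.) = "GENERATION"]/ancestor::table[1]//tr[count(td) = 4]'
);

foreach ($rows as $row) {
    $cells = $xpath->query('td', $row);
    $label = trim($cells->item(0)->textContent);
    if (in_array($label, $wanted, true)) {
        // item(1) is the second cell (zero-based index), i.e. the MC column.
        $values[$label] = (int) trim($cells->item(1)->textContent);
    }
}

print_r($values);
```

Anchoring on the GENERATION heading is what keeps rows from the other tables on the page out of the result. With DomCrawler, the same XPath could presumably go through `filterXPath()` followed by `each()`.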