Jan 4, 2023

Level 13

How to use Http pool in Laravel?

This is a follow-up question to the earlier discussion: https://laracasts.com/discuss/channels/laravel/how-to-speed-up-this-script-to-verify-90k-ulrs-for-their-http-status-code , which I'm still working on. I've decided to make use of the concurrent requests feature offered by Laravel's Http client. However, I'm not able to put the code together. I would appreciate your help.

public function handle()
    {
        DB::table('internal_links')->distinct('href')->orderBy('id')->chunk(50, function($urls) {
            // How do I rewrite the following block so that each $url in $urls goes into $pool->head($url)?
            $responses = Http::pool(fn (Pool $pool) => [
                $pool->head('http://url1'),
                $pool->head('http://url2')
            ]);
        });
        return Command::SUCCESS;
    }

webrobert

3 years ago

Level 51

this code needs a bit of a refactor, but it works and it was close at hand...

public function getEventAndInvitees(string $calendlyId)
{
    $responses = Http::pool(fn (Pool $pool) => [
        $pool->withToken(config('services.calendly.token'))
             ->get("https://api.calendly.com/scheduled_events/{$calendlyId}"),
        $pool->withToken(config('services.calendly.token'))
             ->get("https://api.calendly.com/scheduled_events/{$calendlyId}/invitees?status=active")
    ]);

    if ( ! $responses[0]->ok() || ! $responses[1]->ok() ) {
        abort(503, 'Houston we have a problem');
    }

    return $responses;
}

and then in my controller

  $responses = (new Calendly())->getEventAndInvitees($calendlyId);
  $event     = $responses[0]->collect()['resource'];
  $invitee   = $responses[1]->collect()['collection'][0];

webrobert

3 years ago

Level 51

just saw your comment, something like this....

$responses = Http::pool(function(Pool $pool) use($urls) {
   foreach ($urls as $url) {
     $pool->head($url);
   }
});

tisuchi

3 years ago

Level 70

@webrobert I just simplified your method a bit.


public function getEventAndInvitees(string $calendlyId)
{
    $eventUrl = "https://api.calendly.com/scheduled_events/{$calendlyId}";
    $inviteesUrl = "https://api.calendly.com/scheduled_events/{$calendlyId}/invitees?status=active";
    $token = config('services.calendly.token');

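    // pool() expects a callback; the token is set on each pooled request.
    $responses = Http::pool(fn (Pool $pool) => [
        $pool->withToken($token)->get($eventUrl),
        $pool->withToken($token)->get($inviteesUrl),
    ]);

    if ( ! $responses[0]->ok() || ! $responses[1]->ok() ) {
        abort(503, 'Houston we have a problem');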
    }

    return $responses;
}


webrobert

3 years ago

Level 51

@tisuchi, oh nice! I didn't realize you could use pool that way.

thebigk

3 years ago

Level 13

I'm getting the following error -

Call to undefined method GuzzleHttp\Exception\ConnectException::status()

My code is -

public function handle()
    {
        DB::table('internal_links')->distinct('href')->orderBy('id')->chunk(2, function($urls) {

            $responses = Http::pool(function (Pool $pool) use ($urls) {
                foreach($urls as $url) {
                    $pool->get($url->href);
                }
            });

            foreach($responses as $response) {
                dd($response->status());
            }

        });
        return Command::SUCCESS;
    }

webrobert

3 years ago

Level 51

@thebigk have you tried the other code example for get requests...

DB::table('internal_links')
  ->distinct('href')
  ->orderBy('id')
  ->chunk(2, function($urls) {

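      // pool() needs a callback rather than a list of URLs
      $responses = Http::pool(fn (Pool $pool) => $urls->map(fn ($url) => $pool->get($url->href)));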

      foreach($responses as $response) {
          dd($response->status());
      }

  });
return Command::SUCCESS;

webrobert

3 years ago

Level 51

I also wonder: a ConnectException is thrown in the event of a networking error. Do you get it straight away, or only after some number of requests? Maybe too many, too fast?

thebigk

3 years ago

Level 13

@webrobert The exception is thrown if the URL is invalid or the domain doesn't exist. That's the reason I need to find out a way to handle the exceptions. It's not about the speed of requests.

webrobert

3 years ago

Level 51

The exception is thrown if the URL is invalid or the domain doesn't exist. That's the reason I need to find out a way to handle the exceptions.

@thebigk, how would we know?

You originally wrote

I've decided to make use of the concurrent requests feature offered by Laravel's Http client. However, I'm not able to put the code together. I would appreciate your help.

Not how do I handle the exceptions. 🤷🏽‍♂️

thebigk

3 years ago

Level 13

@webrobert - I tried running that code for the suspicious URLs. :-/

webrobert

3 years ago

Level 51

https://laravel.com/docs/9.x/http-client#error-handling

$response->onError(callable $yourURLwasBogus);

thebigk

3 years ago

Level 13

@webrobert Yep, already read that. However, this doesn't seem to work with concurrent requests.

Where do I handle it in the following code?

DB::table('internal_links')
  ->distinct('href')
  ->orderBy('id')
  ->chunk(2, function($urls) {

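      // pool() needs a callback rather than a list of URLs
      $responses = Http::pool(fn (Pool $pool) => $urls->map(fn ($url) => $pool->get($url->href)));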

      foreach($responses as $response) {
          dd($response->status()); // This throws error. 
      }

  });
return Command::SUCCESS;

webrobert

3 years ago

Level 51

@thebigk, I see. Okay, how about this...

$responses = Http::pool(function (\Illuminate\Http\Client\Pool $pool) use($urls) {
    foreach ($urls as $url) { $pool->head($url); }
});

$goodUrls = collect($responses)
    ->map( fn($response) => $response instanceof \Illuminate\Http\Client\Response
        ? (string) $response->effectiveUri()
        : null 
    )
    ->filter();


dd($goodUrls);

Then you can just compare against the collection and deal with the bad URLs.

EDIT:

Here is a bit of a refactor. I think this is my final pass for now...

$urls = collect([
    'http://google.com',
    'http://googsdfsdle.com',
    'http://yelp.com',
    'http://testwwerer.com',
]);

$responses = Http::pool(function (\Illuminate\Http\Client\Pool $pool) use($urls) {
    $urls->each( fn($url) => $pool->head($url) );
});

$goodKeys= collect($responses)
    ->map( fn($response) => 
        $response instanceof \Illuminate\Http\Client\Response 
        && $response->ok() 
    )
    ->filter();

$badUrls = $urls->diffKeys($goodKeys)->all();

webrobert

3 years ago

Level 51

@thebigk how did it go?

thebigk

3 years ago

Level 13

@webrobert - I didn't give this a try, but I found some code on StackOverflow a few hours ago and tweaked it to save the 'status' to the database. I can't figure out 90% of the code, though -

public function validate_urls(array $urls, int $max_connections, int $timeout_ms, bool $consider_http_300_redirect_as_error, bool $return_fault_reason) : array
    {
        $consider_http_300_redirect_as_error = true;
        $urls = array_unique($urls); // remove duplicates.
        $ret = array();
        $mh = curl_multi_init();
        $workers = array();
        $work = function () use (&$ret, &$workers, &$mh, &$return_fault_reason, $consider_http_300_redirect_as_error) {
            // > If an added handle fails very quickly, it may never be counted as a running_handle
            while (1) {
                curl_multi_exec($mh, $still_running);
                if ($still_running < count($workers)) {
                    break;
                }
                $cms=curl_multi_select($mh, 10);
                //var_dump('sr: ' . $still_running . " c: " . count($workers)." cms: ".$cms);
            }
            while (false !== ($info = curl_multi_info_read($mh))) {
                //echo "NOT FALSE!";
                //var_dump($info);
                {
                    if ($info['msg'] !== CURLMSG_DONE) {
                        continue;
                    }
                    if ($info['result'] !== CURLM_OK) {
                        if ($return_fault_reason) {
                            $ret[$workers[(int)$info['handle']]] = array(false, $info['result'], "curl_exec error " . $info['result'] . ": " . curl_strerror($info['result']));
                        }
                    } elseif (CURLE_OK !== ($err = curl_errno($info['handle']))) {
                        if ($return_fault_reason) {
                            $ret[$workers[(int)$info['handle']]] = array(false, $err, "curl error " . $err . ": " . curl_strerror($err));
                        }
                    } else {
                        $code = (string)curl_getinfo($info['handle'], CURLINFO_HTTP_CODE);
                        if ($code[0] === "3") {
                            if ($consider_http_300_redirect_as_error == true) {
                                if ($return_fault_reason) {
                                    $ret[$workers[(int)$info['handle']]] = array(false, -1, "got a http " . $code . " redirect, which is considered an error");
                                }
                            } else {
                                if ($return_fault_reason) {
                                    $ret[$workers[(int)$info['handle']]] = array(true, 0, "got a http " . $code . " redirect, which is considered a success");
                                } else {
                                    $ret[] = $workers[(int)$info['handle']];
                                }
                            }
                        } elseif ($code[0] === "2") {
                            if ($return_fault_reason) {
                                $ret[$workers[(int)$info['handle']]] = array(true, 0, "got a http " . $code . " code, which is considered a success");
                            } else {
                                $ret[] = $workers[(int)$info['handle']];
                            }
                        } else {
                            // all non-2xx and non-3xx are always considered errors (500 internal server error, 400 client error, 404 not found, etcetc)
                            if ($return_fault_reason) {
                                $ret[$workers[(int)$info['handle']]] = array(false, -1, "got a http " . $code . " code, which is considered an error");
                            }
                        }
                    }
                    curl_multi_remove_handle($mh, $info['handle']);
                    assert(isset($workers[(int)$info['handle']]));
                    unset($workers[(int)$info['handle']]);
                    curl_close($info['handle']);
                }
            }
            //echo "NO MORE INFO!";
        };
        foreach ($urls as $url) {
            while (count($workers) >= $max_connections) {
                //echo "TOO MANY WORKERS!\n";
                $work();
            }
            $neww = curl_init($url);
            if (!$neww) {
                trigger_error("curl_init() failed! probably means that max_connections is too high and you ran out of resources", E_USER_WARNING);
                if ($return_fault_reason) {
                    $ret[$url] = array(false, -1, "curl_init() failed");
                }
                continue;
            }
            $workers[(int)$neww] = $url;
            curl_setopt_array($neww, array(
                CURLOPT_NOBODY => 1,
                CURLOPT_SSL_VERIFYHOST => 0,
                CURLOPT_SSL_VERIFYPEER => 0,
                CURLOPT_TIMEOUT_MS => $timeout_ms
            ));
            curl_multi_add_handle($mh, $neww);
            //curl_multi_exec($mh, $unused_here); LIKELY TO BE MUCH SLOWER IF DONE IN THIS LOOP: TOO MANY SYSCALLS
        }
        while (count($workers) > 0) {
            //echo "WAITING FOR WORKERS TO BECOME 0!";
            //var_dump(count($workers));
            $work();
        }
        curl_multi_close($mh);
        return $ret;
    }

webrobert

3 years ago

Level 51

@thebigk ahh, well, run my code. It's tested and works. Plus it's way shorter than that.

thebigk

3 years ago

Level 13

@webrobert - In this code, I think we aren't checking for valid URLs and URLs that don't have domains associated with them (dead urls).

From what I understand, it only checks for ok() responses. I'm also wondering - is there a possibility of adding rate limiting to this?

Also, where exactly would I make entry into the database for each URL?

$urls = collect([
    'http://google.com',
    'http://googsdfsdle.com',
    'http://yelp.com',
    'http://testwwerer.com',
]);

$responses = Http::pool(function (\Illuminate\Http\Client\Pool $pool) use($urls) {
    $urls->each( fn($url) => $pool->head($url) );
});

$goodKeys= collect($responses)
    ->map( fn($response) => 
        $response instanceof \Illuminate\Http\Client\Response 
        && $response->ok() 
    )
    ->filter();

$badUrls = $urls->diffKeys($goodKeys)->all();

webrobert

3 years ago

Level 51

@thebigk the issue was that when a bad URL is entered, head doesn't throw the error. Instead, an exception sits in the results where the response would be, and you can't check ok() on an exception. So all we do is check for a Response instance first; if there is one, it was a valid URL. Otherwise it was BS.

Then you can compare the keys to get the good or bad URLs and act on them.

What are you wanting to do next? Save urls or remove them from the database?

thebigk

3 years ago

Level 13

@webrobert I used your code as follows -

public function handle()
    {
        // Delete this after it does what it does.
        DB::table('internal_links')->distinct('href')->chunkById(10, function($urls) {
            $urls_collection = collect();
            foreach($urls as $url) {
                $urls_collection->push($url->href);
            }
            $urls = $urls_collection;
            $responses = Http::pool(function (\Illuminate\Http\Client\Pool $pool) use($urls) {
                $urls->each( fn($url) => $pool->head($url) );
            });

            $goodKeys = collect($responses)
                ->map( fn($response) =>
                    $response instanceof \Illuminate\Http\Client\Response
                    && $response->ok()
                )
                ->filter();

            dump($goodKeys);

            $badUrls = $urls->diffKeys($goodKeys)->all();
            dd($badUrls);
        });
        return Command::SUCCESS;
    }

This gives me goodKeys as [true, true...] and an array of the bad URLs. I think this should do the job. I just checked and I'm going to use it on ~50K external URLs (internals will be dealt with separately).

How can I speed this up without affecting performance?
My plan is to discard all the bad URLs (anything that is not 200). But how do I get the URLs of the good responses so that I can update the database accordingly?

webrobert

3 years ago

Level 51

@thebigk

DB::table('internal_links')->distinct('href')->chunkById(10, function($urls) {

    $responses = Http::pool(function (\Illuminate\Http\Client\Pool $pool) use($urls) {
        $urls->each( fn($url) => $pool->head($url->href) );
    });

    $goodKeys= collect($responses)
        ->map( fn($response) =>
            $response instanceof \Illuminate\Http\Client\Response
            && $response->ok()
        )
        ->filter();

    // good internal_links
    $urls->intersectByKeys($goodKeys)->each(function ($url) {
        // do stuff
    });

    // bad internal_links
    $urls->diffKeys($goodKeys)->each(function ($url) {
        // do stuff
    });
});

return Command::SUCCESS;

DB::table('internal_links')->distinct('href')->chunkById(10, function($urls) {

    $responses = Http::pool(function (\Illuminate\Http\Client\Pool $pool) use($urls) {
        $urls->each( fn($url) => $pool->head($url->href) );
    });

    $goodKeys = collect($responses)
        ->map( fn($response) =>
            $response instanceof \Illuminate\Http\Client\Response
            && $response->ok()
        );

    $urls->each(function ($url, $key) use($goodKeys) {
        $goodKeys[$key]
            ? dump("save {$url->href} its good")
            : dump("kill {$url->href} its bad");
    });
});

webrobert

3 years ago

Level 51

Hmmm, in terms of speed... are you running this on the console and waiting? If you used queue workers, you could make a series of jobs and run multiple workers at once, so in theory you could process multiple batches at the same time. I don't know how often you have to run this process. I seem to remember having issues before with overloading the network with too many requests, but I don't actually know the threshold for that; it's probably environment specific. Perhaps it's best as a new question. But if you don't run this often, I'm not sure I'd mess with it.
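If too many requests at once turn out to be a problem, one low-tech option is to pause between batches. The following is only a sketch in plain PHP: checkInBatches() and its parameters are hypothetical names, and the $check callback stands in for whatever does the real work (for example the Http::pool() call shown earlier in this thread). It is assumed to return an array keyed by URL.

```php
<?php
declare(strict_types=1);

/**
 * Run $check on batches of $batchSize URLs, sleeping $pauseMs between batches.
 * Hypothetical helper; $check must return an array keyed by URL.
 */
function checkInBatches(array $urls, int $batchSize, int $pauseMs, callable $check): array
{
    $results = [];

    foreach (array_chunk($urls, max(1, $batchSize)) as $i => $batch) {
        if ($i > 0) {
            usleep($pauseMs * 1000); // pause before every batch except the first
        }
        // Union keeps the first result recorded for each URL.
        $results += $check($batch);
    }

    return $results;
}

// Stand-in checker so the sketch runs without any network access.
$fakeCheck = fn (array $batch): array => array_fill_keys($batch, 'checked');

$out = checkInBatches(['http://a.test', 'http://b.test', 'http://c.test'], 2, 100, $fakeCheck);
print_r($out);
```

The batch size and pause are knobs to tune by experiment; nothing in this thread establishes what values are safe.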

thebigk

3 years ago

Level 13

The code runs on localhost - and I'm not sure if that can have any issues with the network. The code will only ping the external URLs; because I can manage all the internal ones with Excel. That takes the total count of URLs to handle to ~50K.

Of course, this is a one-time job and I'm not bothered about the performance. I've had the code running for several hours now, and a few thousand URLs have been processed.

In future, I plan to write a crawler that will run every hour and do the following -

Visit a page.
Extract all the URLs on that page.
Ping the URL and check for HTTP status.
If the status is 200, leave that URL. For others, delete the URL from the text.

The ultimate goal will be to have a cleaner linking structure. I'm guessing that fewer but working links would be rewarded over a larger set that includes several broken internal and external links.
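The "extract all the URLs on that page" step could be sketched with PHP's built-in DOM extension. This is a minimal example, not a full crawler: extractLinks() is a hypothetical helper, and it assumes only absolute http/https links are of interest (relative paths, fragments and mailto: links are skipped).

```php
<?php
declare(strict_types=1);

/**
 * Collect the distinct absolute http(s) links found in an HTML string.
 * Hypothetical helper, not part of Laravel.
 */
function extractLinks(string $html): array
{
    if (trim($html) === '') {
        return []; // DOMDocument::loadHTML() rejects an empty string
    }

    $dom = new DOMDocument();
    // Real-world markup is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if (preg_match('#^https?://#i', $href) === 1) {
            $links[$href] = true; // array keys de-duplicate
        }
    }

    return array_keys($links);
}

$html = '<p><a href="http://google.com">a</a> <a href="/about">b</a> '
      . '<a href="http://google.com">c</a> <a href="mailto:x@example.com">d</a></p>';

print_r(extractLinks($html));
```

The resulting list can then be fed to the pooled HEAD requests discussed above.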

webrobert

3 years ago

Level 51

@thebigk, yeah, I'd let it run. Unless there is anything else, I think the last piece of code closes this thread, if you want to mark a best answer.

My only concern is that if there is an issue, you won't know about it. Because we only capture URLs that return 200, we can't be certain there wasn't some other issue. So until you know better how the code behaves, I might actually open it up a bit more.

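use GuzzleHttp\Exception\ConnectException;
use Illuminate\Http\Client\Pool;
use Illuminate\Http\Client\Response;
use Illuminate\Support\Facades\Http;
use Illuminate\Support\Str;

// make requests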
$responses = Http::pool( fn(Pool $pool) =>
    $urls->each( fn($url) => $pool->head($url))
);

// check responses
$keyedResponses = collect($responses)->map( fn($response) => match (true) {

    $response instanceof Response
    && $response->ok()
        => 'good url',

    $response instanceof ConnectException
    && Str::contains($response->getMessage(), 'cURL error 6: Could not resolve host:')
        => 'Could not resolve',

    default
    // maybe double check this url. 
        => 'something else happened'

});

// process results
$urls->each(fn ($item, $key) => match($keyedResponses[$key]) {
    'good url' => dump("save $item its good"), // InternalUrls::update([ ... ]),
    'Could not resolve' => dump("kill $item its bad"), // mark as bad
    'something else happened' => dump("something else happened with $item") // mark for recheck
});

I assume you store the bad links and mark them as such. I might create another mark for the ones to double check. Anyway, just something to consider. When I do this kind of piping, I have a process/stage column that tracks the item so I know where it is in the process and can call methods on it.
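As a rough illustration of the stage idea, the decision can be pulled into a small pure function whose result is what gets written to such a column. Everything here is an assumption made for the sketch: the stageFor() name, the four labels, and the rule that any non-200 status counts as bad (taken from the plan stated earlier in the thread). $status is null when the request never produced a response, in which case $error carries the exception message.

```php
<?php
declare(strict_types=1);

/**
 * Map the outcome of one request to a stage label (hypothetical labels).
 */
function stageFor(?int $status, ?string $error = null): string
{
    return match (true) {
        $status === 200 => 'good',
        $status !== null => 'bad',       // got a response, but not a 200
        $error !== null && str_contains($error, 'cURL error 6') => 'dead', // host did not resolve
        default => 'recheck',            // timeouts and anything unexpected
    };
}

echo stageFor(200), PHP_EOL;
echo stageFor(404), PHP_EOL;
echo stageFor(null, 'cURL error 6: Could not resolve host: example.invalid'), PHP_EOL;
echo stageFor(null, 'cURL error 28: Operation timed out'), PHP_EOL;
```

Run as-is, this prints good, bad, dead and recheck, one per line. Keeping the rule in one place makes it easy to loosen later, for instance by treating redirects differently.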

Nielson

1 year ago

Level 14

It kinda saddens me that he didn't even thank you for your time or mark the reply as best answer... :/
