Jun 24, 2024

Level 10

Job timeouts whilst streaming large file from S3 to Cloudflare R2 despite increasing timeout to 40mins

Hi,

I'm attempting to stream PDF data from S3 to Cloudflare R2 using the Laravel Filesystem along with league/flysystem-aws-s3-v3. The transfer runs in a queued job. Most of the time it works perfectly, but very rarely a transfer fails. On failure, I release the job back to the queue in the hope that it will complete successfully next time.

The file link from S3 is returned by an API. I've debugged as far as I can to see what might be causing the failure and narrowed it down to the save to R2, $disk->put(). Is there a way I can get more information from this call? It may not necessarily provide a fix, but it would be good to be able to give feedback to the client.

Thank you!

P.S. I had to space out the URL scheme as the forum still thinks this is my first day/post. I've been here for a while now; does anyone else get this problem?

$apiResponse = Http::get('h t t p s://url.com?', [
    'key' => 'keystring',
    'fileId' => $orderProduct->file_name,
]);

if ($apiResponse->failed()) {
    // clientError()/serverError() only return booleans, so log the status code instead
    if ($apiResponse->clientError()) {
        Log::error('API client error, status '.$apiResponse->status());
    }
    if ($apiResponse->serverError()) {
        Log::error('API server error, status '.$apiResponse->status());
    }
    Log::error('Released back to jobs and retry at '.now()->addMinutes(360));
    $this->release(now()->addMinutes(360));
}

if ($apiResponse->successful()) {
    Log::info('200 status range response for order');

    // The API response body is the S3 file link
    $fileLink = $apiResponse->body();
    $targetFile = $this->createFilename();

    // Stream the remote file straight into R2
    $disk = Storage::disk('r2');
    $result = $disk->put($targetFile, fopen($fileLink, 'r'));

    if ($result) {
        Log::info('Move to Cloudflare successful');
    } else {
        Log::error('Move to Cloudflare FAILED');
        Log::error('Released back to jobs and retry at '.now()->addMinutes(360));
        $this->release(now()->addMinutes(360));
    }
}

UPDATE: I set throw to true in the filesystem definition and placed the download portion of the code in a try/catch but still no error returned. The job still fails after 3 attempts.

try {
    $disk = Storage::disk('r2');
    $result = $disk->put($targetFile, fopen($fileLink, 'r'));
} catch (Throwable $e) {
    Log::error('Could not Upload file '.$e->getMessage());
}

This has led me to think about timeouts and whether the file being streamed and uploaded is particularly large. I examined the particular PDF and it's 1.6GB in size. I've looked at the failed_jobs table and can see in the exception that the job does indeed time out. So I increased the timeouts by 10 minutes each time, and the current value is 40 minutes (Supervisor is also set to a 40 minute timeout).

//app/Jobs/MyJob.php
public $timeout = 2400;
//config/queue.php
 'retry_after' => 2400,

It's still failing! I'm about to increase it to an hour and see what happens, but surely 40 minutes should be enough?! The site is on its own *nix VM, and there's nothing particularly taxing going on with the machine. Should I be approaching the stream/upload in a different way? I'm considering changing the code to download the file locally before uploading, but streaming is more efficient on memory.
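A download-first step does not have to cost much memory if the file is copied in chunks rather than read into a string. The following is only a minimal sketch in plain PHP, not the code used in this job: the function name downloadToTempFile, the xfer_ prefix and the 60 second read timeout are all hypothetical, and the timeout option only affects the http/https stream wrappers.

```php
<?php
declare(strict_types=1);

/**
 * Copy $source (a URL or path readable by fopen) into a temporary local file.
 * stream_copy_to_stream() works in chunks, so memory use stays flat
 * regardless of file size. Returns the temporary file's path.
 */
function downloadToTempFile(string $source, float $readTimeout = 60.0): string
{
    // Assumed value: give up if the remote side sends nothing for this long.
    $context = stream_context_create(['http' => ['timeout' => $readTimeout]]);

    $in = fopen($source, 'rb', false, $context);
    if ($in === false) {
        throw new RuntimeException("Could not open source: {$source}");
    }

    $tmpPath = tempnam(sys_get_temp_dir(), 'xfer_');
    $out = $tmpPath === false ? false : fopen($tmpPath, 'wb');
    if ($out === false) {
        fclose($in);
        throw new RuntimeException('Could not create temporary file');
    }

    $bytes = stream_copy_to_stream($in, $out);
    $timedOut = !empty(stream_get_meta_data($in)['timed_out']);
    fclose($in);
    fclose($out);

    if ($bytes === false || $timedOut) {
        // Do not leave a partial file behind.
        unlink($tmpPath);
        throw new RuntimeException('Source stream stalled or failed mid-copy');
    }

    return $tmpPath;
}
```

The returned path could then be opened with fopen($tmpPath, 'r') and passed to $disk->put() exactly as the remote link is now, which separates "the download stalled" from "the upload stalled" in the logs.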

Is there a way when a job runs to see how many attempts it has made previously? For example if the job is being attempted for the 3rd time (3rd before failure) then I could insert a condition which downloads and uploads instead of streaming (I want to make streaming the default option as it works the majority of the time and only revert to traditional download/upload as last resort).

PeteBatin

2 years ago

Level 10

Just increased the timeout to 60mins, wish me luck!

For reference if it helps with diagnosis/advice,

Illuminate\Queue\TimeoutExceededException: App\Jobs\MyJob has timed out. in /var/www/vhosts/domain.co.uk/subdomain.domain.co.uk/vendor/laravel/framework/src/Illuminate/Queue/TimeoutExceededException.php:15

PeteBatin

2 years ago

Level 10

So after setting it to 90 minutes it still fails (Supervisor set to 91 minutes). As the site is in production and I need to get the order pushed through, for now I'm going to bypass half the process: download the PDF to local disk and then do a straight upload to see if that works.

It would be great to have a solution for the future, so if anyone can give some advice on the following I'd appreciate it.

Should I be handling the transport of large files differently? Is there a better way?
Is there a way during a run of a job (for example its 3rd attempt) to get a count of the number of times the job has previously run, so that I can use that to switch the process to an alternative method?

martinbean

2 years ago

Level 80

@petebatin It sounds like something else is the issue. It doesn’t take ~90 minutes to download a PDF file unless the PDF file is the largest PDF file in history. It sounds more like a network connection is hanging instead.

PeteBatin

2 years ago

Level 10

@martinbean yeah I'm starting to look into this too. The PDF file is only 1.6GB, large but not huge.

I've inserted an IF which detects the particular file when running the job, and code that cuts out the stream-from-AWS-S3 part. I've manually downloaded the file from S3 and uploaded it to the server's local disk and attempted the same code (Laravel Filesystem/Flysystem put), but it's already on its 2nd attempt. So I've now changed that part of the code to use putObject from local disk using aws/aws-sdk-php-laravel, which is in use on another version of this service.

I'm making one last-ditch attempt at this. I've increased all the timeouts to 120 minutes (PHP execution and input time, Supervisor and the queue timeout) and added the following to my Apache directives:

MaxKeepAliveRequests 7200
KeepAlive On
KeepAliveTimeout 7200

Then finally in my nginx directives:

keepalive_timeout 7200s;

I'm about to restart Supervisor and see what happens. After this I'm just going to AWS CLI it and hope it doesn't happen again on a different file too soon!

martinbean

2 years ago

Level 80

@PeteBatin Upload a small plain text file to S3, and then create a job to try and download that file:

public function handle(): void
{
    $contents = Storage::get('test.txt');
}

If that job hangs then you know it’s a networking issue, as a job shouldn’t time out trying to fetch a tiny text file.

PeteBatin

2 years ago

Level 10

@martinbean Good idea, but it's an existing process and there are many files being streamed from S3 to Cloudflare R2 hourly, so it does work but fails on occasion with large files. There were two files originally. The first was 1.2GB, which was failing but worked when I increased the timeout. That left me with the 1.6GB file, so I applied the same logic of incrementing the timeout until I got to the ridiculous amount of 120 minutes!!

My change of tactic seems to have worked in the end: pre-downloading the file and uploading it to local disk so it didn't have to stream from S3, and then using aws/aws-sdk-php-laravel instead of Laravel Storage/Flysystem.

At the moment I had to hardcode the file name in the job so when it was detected it switched to alternative method. I'd like to make this an automated feature, do you know how I might be able to obtain and utilise the attempts value of the job? Ideally I want it to try the standard version twice and then on the 3rd attempt switch to the alternative.
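On the attempts value: jobs that use Laravel's InteractsWithQueue trait have an attempts() method, which returns how many times the job has been attempted, counting the current run (so 1 on the first run). Assuming the job's allowed tries is at least 3, the switch could be sketched as below. The helper chooseTransferMethod and the method labels are hypothetical names for illustration, written as plain PHP so the decision can be checked outside the queue.

```php
<?php
declare(strict_types=1);

/**
 * Pick a transfer method from the current attempt number.
 * Attempts 1..$streamAttempts stream directly; later attempts fall back
 * to downloading to local disk and then uploading.
 */
function chooseTransferMethod(int $attempt, int $streamAttempts = 2): string
{
    return $attempt <= $streamAttempts ? 'stream' : 'download_then_upload';
}

// Inside a job's handle() the argument would be $this->attempts().
foreach ([1, 2, 3] as $attempt) {
    echo $attempt, ' => ', chooseTransferMethod($attempt), PHP_EOL;
}
```

Because release() puts the job back on the queue and the next run counts as a further attempt, the existing release-and-retry flow should reach the fallback branch without any extra bookkeeping.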

P.S. I really appreciate your responses and time so far, I was starting to feel a bit lost lol. I put the question to SO also but it hasn't had any responses so far, the desperation was starting to set in!

martinbean

2 years ago

Level 80

My change of tactic seems to have worked in the end: pre-downloading the file and uploading it to local disk so it didn't have to stream from S3, and then using aws/aws-sdk-php-laravel instead of Laravel Storage/Flysystem.

@PeteBatin That doesn’t really make sense, though. You can’t download the file… so you download the file?

PeteBatin

2 years ago

Level 10

@martinbean It's downloadable, always has been. Don't think I said I couldn't download it.

The issue has been with streaming the file from S3 to R2, which is said to be more efficient than downloading and then uploading.

The streaming was, for reasons still unknown to me, taking too long, long enough to time out.

So I switched from streaming (downloading/uploading simultaneously without storing on local disk), to downloading to local disk and uploading as separate actions. I manually downloaded the file from S3 and uploaded to the server's local disk, negating the need for the script to spend time downloading and instead concentrate on uploading.
