mrosenblatt's avatar

Handling piped email cleanly

Hey there,

I've got a script that accepts incoming email via pipe using php-mime-mail-parser. It works great overall, but if someone clicks "Reply", it always includes the entire thread in the email content. Which is totally reasonable and understandable.

Does anyone know of an existing library that can trim this down to the most recent message only, for both HTML and text emails? I don't want to save the entire thread history with each response from a customer. I'm not sure what this sort of function would be called, quite frankly, so my searches haven't been very fruitful.

LaryAI's avatar

Extracting only the most recent message from an email thread is tricky, because there is no universal standard for how email clients quote the original message in a reply. You can, however, use heuristics to separate the new reply from the quoted history beneath it.

One approach is to look for common patterns used by email clients to separate the quoted message from the reply, such as lines starting with ">" or patterns like "On [date], [name] wrote:". You can then use regular expressions to split the email content at these points and extract the most recent message.

Here's a basic example of how you might implement this in PHP using the php-mime-mail-parser library:

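require_once 'vendor/autoload.php';

use PhpMimeMailParser\Parser;

// Read the raw message that the mail server pipes to this script on STDIN
$emailContent = file_get_contents('php://stdin');

$parser = new Parser();
$parser->setText($emailContent);

// Get the plain-text and HTML bodies
$text = $parser->getMessageBody('text');
$html = $parser->getMessageBody('html');

// Return the part of a body that precedes the earliest quote marker
function cleanReply($body) {
    // Normalise line endings so the ^ and $ anchors work on every line
    $body = str_replace(["\r\n", "\r"], "\n", (string) $body);

    // Every pattern is anchored to the start of a line (the "m" flag), so an
    // ordinary sentence that merely contains "On" or ">" does not trigger a cut.
    // The attribution line may wrap onto one continuation line.
    $patterns = [
        '/^On\b[^\n]*(?:\n[^\n]+)?\bwrote:[ \t]*$/m',    // English style
        '/^Le\b[^\n]*(?:\n[^\n]+)?écrit\s*:[ \t]*$/mu',  // French style
        '/^>/m',                                         // First line starting with '>'
        // Add more patterns as needed
    ];

    // Use the earliest match across all patterns, not the first pattern that happens to match
    $cut = strlen($body);
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $body, $matches, PREG_OFFSET_CAPTURE)) {
            $cut = min($cut, $matches[0][1]);
        }
    }

    // Everything before the cut is the new message; with no match this is the whole body
    return rtrim(substr($body, 0, $cut));
}

// Clean the plain-text part
$cleanText = cleanReply($text);

// Best effort only for HTML: these line-anchored patterns target plain text,
// so they will often not match markup, and any cut can leave unclosed tags
$cleanHtml = cleanReply($html);

// Now you can store $cleanText and $cleanHtml as the most recent message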

This is a very simplistic approach and may not work for all email clients or languages. You might need to add more patterns to the $patterns array to match the quoting styles of different email clients.
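As a rough sketch of what a longer pattern list might look like for plain-text bodies, the helper below adds a few markers commonly seen in Outlook-style replies. The function name `stripQuotedText` is hypothetical, and the patterns are assumptions based on typical client output rather than an exhaustive or authoritative list, so check them against the mail you actually receive.

```php
<?php
// Hypothetical helper: returns the text that precedes the earliest
// recognised quote marker. The pattern list is illustrative only.
function stripQuotedText(string $body): string
{
    // Normalise line endings so the ^ and $ anchors behave consistently.
    $body = str_replace(["\r\n", "\r"], "\n", $body);

    $patterns = [
        // "On <date>, <name> wrote:" (may wrap onto one continuation line)
        '/^On\b[^\n]*(?:\n[^\n]+)?\bwrote:[ \t]*$/m',
        // "-----Original Message-----" divider
        '/^-{2,}\s*Original Message\s*-{2,}\s*$/mi',
        // Header block such as "From: ..." followed by "Sent: ..." or "Date: ..."
        '/^From:\s.+\n(?:Sent|Date):\s.+$/mi',
        // A long rule made of underscores
        '/^_{10,}\s*$/m',
        // First line that starts with a quote character
        '/^>/m',
    ];

    // Cut at the earliest marker found by any pattern.
    $cut = strlen($body);
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $body, $m, PREG_OFFSET_CAPTURE)) {
            $cut = min($cut, $m[0][1]);
        }
    }

    return rtrim(substr($body, 0, $cut));
}
```

Taking the minimum offset across all patterns means the order of the list does not matter: whichever marker appears first in the message decides where the reply ends.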

Also, be aware that this method can be error-prone, especially with complex email threads or when users manually edit the quoted text. There is no perfect solution, and you may need to refine the regular expressions based on the actual emails you receive.
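For the HTML part, cutting the string at a regex offset can leave unclosed tags, so one alternative is to parse the markup and remove the elements that hold the quoted thread. The sketch below uses PHP's bundled DOM extension. The function name `stripQuotedHtml` is hypothetical, and the two selectors (any `blockquote`, and elements with the class `gmail_quote`) are assumptions about how common clients wrap quoted content. A customer who uses a blockquote for their own text would lose it, so treat this as a starting point.

```php
<?php
// Hypothetical helper: removes quoted-reply containers from an HTML body
// and returns the remaining markup from inside <body>.
function stripQuotedHtml(string $html): string
{
    if (trim($html) === '') {
        return $html;
    }

    // Email HTML is rarely valid, so collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // The XML prologue is a widely used hint that the input is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Assumed quote containers: blockquote elements and Gmail's quote wrapper.
    $xpath = new DOMXPath($doc);
    $quoted = $xpath->query(
        '//blockquote'
        . '|//*[contains(concat(" ", normalize-space(@class), " "), " gmail_quote ")]'
    );
    foreach ($quoted as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Serialise what is left inside <body>.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $html;
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }

    return trim($out);
}
```

Because the parser rebuilds the tree, the returned markup stays balanced even when the incoming HTML is malformed, although its whitespace and attribute quoting may differ from the original.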

If you're looking for a more robust solution, consider a library built specifically for this job. The best-known one is GitHub's email_reply_parser, written in Ruby, and as far as I know it has a PHP port available through Composer as willdurand/email-reply-parser. It works on plain-text bodies, so the HTML part would still need separate handling.

