Mar 19, 2019

Level 5

PHP - Designing a rule based parsing engine

Dynamically apply methods/"rules" to documents

I hope someone here can help or guide me in the right direction. I am currently creating a web application where users can import a text file and then programmatically apply different methods to the text.

Example: Imagine that a user has imported a text document that looks like the one below.

As explained, I wish to allow my users to apply a range of different methods to the above text. They should be able to apply all rules, in any given order. Consider the example below, where I have applied 4 rules to the original text document:

As you can see, the text is transformed by each rule: each method is applied to the current text and outputs the new text string.

The users should be able to save these rules, so that the next time a user uploads a document to this specific stream, the saved rules are applied to the document automatically.

My question is, what would be the best approach to allowing my users to do this dynamically?

I will define the methods/rules that a user can perform on the text document, but what's the best approach to:

Save the rules to the database
Programmatically apply the rules to each incoming document (parse each document, based on the rules)

My progress so far

So I am a bit lost on where to begin here, but I was thinking something like below.

Streams: A stream is kind of like a "stack" for all documents. I can upload multiple different documents to a stream. I can create multiple streams, each of which holds multiple rules.

streams table

id | name

Name: the name of the stream. For example "Documents from Acme Inc"

documents table:

id | stream_id | path | content

Stream Id: A stream can have many documents. Each document uploaded to a specific stream will be parsed by the rules defined on that stream.
Path: the server path to the document
Content: The text content of the document

parsing_rules table

id | stream_id | method | arguments

Stream Id: Parsing rules will belong to a stream. So all documents imported into the stream, will be parsed by the rules associated with the specific stream.
Method: the name of the rule applied by the user. This will also refer to the method name in my PHP code.
Arguments: Optional. The arguments that will be applied to each rule/method.

An example of the rules from the 2nd screenshot above, would then look like in the parsing_rules table:

1 | 5 | remove_empty_lines | null
2 | 5 | text_replace | "a:2:{s:6:"Search";s:9:"Laracasts";s:7:"Replace";s:6:"Google";}"
3 | 5 | regex_text_replace | "a:2:{s:7:"Pattern";s:9:"/Google/i";s:11:"Replacement";s:6:"Amazon";}"
4 | 5 | start_position_no_lines | "a:1:{s:4:"Line";s:1:"2";}"

Here, method holds the name of the actual method that should be called, and arguments holds the arguments that the specific method accepts/requires, in serialized form.
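
As a minimal sketch of how the arguments column could be written and read back, assuming PHP 7.0 or later: the helper names encodeArguments and decodeArguments are hypothetical, only serialize() and unserialize() are PHP built-ins. Passing allowed_classes => false is a precaution so that stored data can never instantiate objects.

```php
<?php
// Hypothetical helpers; not part of Laravel or of the original post.

// Encode the arguments of a rule before saving the row.
function encodeArguments(?array $arguments): ?string
{
    return $arguments === null ? null : serialize($arguments);
}

// Decode the stored value; rules without arguments yield an empty array.
function decodeArguments(?string $stored): array
{
    if ($stored === null || $stored === '') {
        return [];
    }
    // allowed_classes => false stops unserialize() from creating objects.
    $decoded = unserialize($stored, ['allowed_classes' => false]);

    return is_array($decoded) ? $decoded : [];
}

// Produces the same string as the arguments value in row 2 of the table above.
$stored = encodeArguments(['Search' => 'Laracasts', 'Replace' => 'Google']);
$arguments = decodeArguments($stored);
```

Storing the arguments as JSON (json_encode / json_decode) would work the same way and keeps the column readable from outside PHP.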

How to apply these rules?

I was thinking that each time a new document is uploaded/imported into a stream, I will apply the rules associated with the stream. Something like:

$content = $document->content;
$parsing_rules = $stream->parsingRules()->get();
foreach($parsing_rules as $rule)
{
    $arguments = unserialize($rule->arguments);
    
    return $this->{$rule->method}($arguments, $content);
    
}

The above is nowhere near perfect, and it will return $content after the very first iteration.

Any feedback is highly appreciated. The above is only my own idea of how to do this project, and I am not sure whether there is a better approach.

bobbybouwmann

7 years ago

Best Answer

Level 88

I think your initial setup is pretty good! You have a great starting point. As for your bit of code and how to apply it, I can hopefully help you with that!

The current way of parsing is fine, but you can apply all rules in one pass. You can also build in some checks to make things a little more robust.

public function parse($stream, $document)
{
    $content = $document->content;
    $rules = $stream->parsingRules()->get();

    foreach($rules as $rule) {
        // Convert the method to a different format (regex_text_replace = regexTextReplace)
        // Requires "use Illuminate\Support\Str;" at the top of the class file
        $method = Str::camel($rule->method);

        if (!method_exists($this, $method)) {
            throw new Exception($method . ' rule does not exist');
        }

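        // Rules without arguments store null, so fall back to an empty array
        $arguments = $rule->arguments
            ? unserialize($rule->arguments, ['allowed_classes' => false])
            : [];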

        $content = $this->{$method}($arguments, $content);
    }

    return $content;
}

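// No default value on $arguments: an optional parameter before a required one is deprecated in PHP 8
public function regexTextReplace(array $arguments, $content)
{
    // Uses the Pattern and Replacement keys stored in the arguments column
    $content = preg_replace($arguments['Pattern'], $arguments['Replacement'], $content);

    return $content;
}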
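
For reference, the same loop can be tried without Laravel. The following is only a sketch under assumptions: it needs PHP 7.4 or later, the class name RuleParser and the ALLOWED whitelist are hypothetical, rules are passed in as plain arrays instead of Eloquent models, and the snake_case to camelCase conversion is done by hand in place of Str::camel. A whitelist is used on top of the existence check so that a value in the method column can only ever reach a rule method.

```php
<?php
// Plain-PHP sketch; RuleParser and ALLOWED are hypothetical names.
final class RuleParser
{
    // Only names listed here may be called, whatever the database contains.
    private const ALLOWED = ['remove_empty_lines', 'text_replace', 'regex_text_replace'];

    public function parse(string $content, array $rules): string
    {
        foreach ($rules as $rule) {
            if (!in_array($rule['method'], self::ALLOWED, true)) {
                throw new InvalidArgumentException($rule['method'] . ' rule does not exist');
            }
            // remove_empty_lines => removeEmptyLines
            $method = lcfirst(str_replace('_', '', ucwords($rule['method'], '_')));
            // Each rule receives the output of the previous one.
            $content = $this->{$method}($rule['arguments'], $content);
        }

        return $content;
    }

    private function removeEmptyLines(array $arguments, string $content): string
    {
        $lines = preg_split('/\R/', $content);
        $kept = array_filter($lines, fn (string $line): bool => trim($line) !== '');

        return implode("\n", $kept);
    }

    private function textReplace(array $arguments, string $content): string
    {
        return str_replace($arguments['Search'], $arguments['Replace'], $content);
    }

    private function regexTextReplace(array $arguments, string $content): string
    {
        $result = preg_replace($arguments['Pattern'], $arguments['Replacement'], $content);
        if ($result === null) {
            throw new RuntimeException('Invalid pattern: ' . $arguments['Pattern']);
        }

        return $result;
    }
}

// The first three rules from the parsing_rules example, already unserialized.
$rules = [
    ['method' => 'remove_empty_lines', 'arguments' => []],
    ['method' => 'text_replace', 'arguments' => ['Search' => 'Laracasts', 'Replace' => 'Google']],
    ['method' => 'regex_text_replace', 'arguments' => ['Pattern' => '/Google/i', 'Replacement' => 'Amazon']],
];

$parsed = (new RuleParser())->parse("Hello Laracasts\n\nBye google\n", $rules);
```

The start_position_no_lines rule is left out because its exact behavior is not described in the thread.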

Let me know if this makes any sense to you! If not I can give you more explanation per line ;)


oliverbusk

7 years ago

Level 5

@bobbybouwmann The above makes perfect sense. Especially the catch where you set $content = $this->..., as this will ultimately return the string after all rules have been applied.

I have tried to apply it in my code and it works beautifully!

Thanks a lot for your help and reassurance! I am still quite new to Laravel and only programming as a hobby.

I have two follow up questions:

Would it make sense to save the final $content in the database? Maybe on the documents table, in a column called parsed_content. This way, when the user navigates away from the page and back (or comes back later), the final string is already saved and the server does not need to parse it again. I could then compare parsing_rules.updated_at with documents.updated_at to see if any changes were made to the parsing rules (if there were, all documents associated with the stream will have to be parsed again). Which leads me to the other question:
Would it make sense to add the actual parsing of the document - parse() - to a job queue? This way I won't flood my server with parsing requests.

bobbybouwmann

7 years ago

Level 88

Yeah, I would definitely keep a copy of the original content and also of the modified content. You can store them both in the same table row. Another option could be to keep the diff for each rule in the database as well, in a separate table.
Yeah, parsing the content can be done on a separate queue, a perfect example of a queue ;) However, you have to take into account that the queued job won't finish in the same second as the request. So you have to build something into your view to show that the document is still processing, for example. You need to build around that small delay.
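
The updated_at comparison from the question could look roughly like the sketch below, in plain PHP. The function name needsReparse is hypothetical, and $lastParsedAt stands for whatever timestamp records the last parse (documents.updated_at in the question). One caveat: a deleted rule leaves no updated_at behind, so deleting a rule would need to be tracked some other way, for example by touching a timestamp on the stream.

```php
<?php
// Hypothetical helper: decides whether a document has to be parsed again.
// $ruleTimestamps holds the updated_at values of the stream's parsing rules.
function needsReparse(?DateTimeImmutable $lastParsedAt, array $ruleTimestamps): bool
{
    if ($lastParsedAt === null) {
        return true; // the document was never parsed
    }
    foreach ($ruleTimestamps as $updatedAt) {
        if ($updatedAt > $lastParsedAt) {
            return true; // a rule changed after the last parse
        }
    }

    return false;
}

$lastParsedAt = new DateTimeImmutable('2019-03-19 10:00:00');
$stale = needsReparse($lastParsedAt, [new DateTimeImmutable('2019-03-19 11:30:00')]);
```

A queued job could run this check first and skip documents whose parsed content is still current.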

oliverbusk

7 years ago

Level 5

Hi @bobbybouwmann

I've come to the realization that the content now is not necessarily a string ($content = $document->content;).

I have changed my database setup to store the value of content as JSON. Now, the content can be either just a string of text, or multiple columns/rows.

Text:

{"text": "Just a regular string.\n Yep!\n\f"}

Columns/rows: (table data)

{"1": [{"1": "The first line of column 1!\n"}, {"2": "The second..\n"}], "2": [{"1": "Second column\f"}]}

So for the text content, I would just serve the content to the parsing rule like: $document->content['text']

However, I am a bit unsure of how I should serve the column data to the parsing rule method?

For table data, a parsing rule could be:

Text Replace $foo with $bar for all columns (loop through all rows)

Text Replace $foo with $bar for column 1 (loop through all rows)

I am unsure how to make the parsing rule method accept both string data and table data. I imagine I would have to do a nested loop through the columns and then the rows? Any help or guidance would be highly appreciated!
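
One possible shape for this, as a sketch only: keep every rule as a function from string to string, and let a small wrapper decide whether it is dealing with text or table content. The name applyRule and the $onlyColumn parameter are hypothetical, the layout is taken from the two JSON samples above, and PHP 7.4 or later is assumed.

```php
<?php
// Hypothetical wrapper: applies $rule (string => string) to text content, or to
// every cell of table content. $onlyColumn limits the rule to a single column.
function applyRule(array $content, callable $rule, ?string $onlyColumn = null): array
{
    // Text content: {"text": "..."}
    if (array_key_exists('text', $content)) {
        $content['text'] = $rule($content['text']);

        return $content;
    }

    // Table content: {"<column>": [{"<row>": "<cell>"}, ...], ...}
    foreach ($content as $column => $rows) {
        if ($onlyColumn !== null && (string) $column !== $onlyColumn) {
            continue;
        }
        foreach ($rows as $index => $row) {
            foreach ($row as $rowNumber => $cell) {
                $content[$column][$index][$rowNumber] = $rule($cell);
            }
        }
    }

    return $content;
}

$replace = fn (string $cell): string => str_replace('column', 'col', $cell);

$table = json_decode(
    '{"1": [{"1": "The first line of column 1!\n"}, {"2": "The second..\n"}], "2": [{"1": "Second column\f"}]}',
    true
);

$all = applyRule($table, $replace);        // every column
$first = applyRule($table, $replace, '1'); // column 1 only
```

With this split, the individual rule methods never need to know which kind of content they are handling.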
