PHP - Designing a rule based parsing engine

Posted 3 months ago by oliverbusk

Dynamically apply methods/"rules" to documents

I hope someone here can help me/guide me in the right direction. I am currently creating a web application, where users can import a text file, and then programmaticaly apply different methods on the text.

Example Imagine that an user have imported a text document, that looks like below.

Raw string

Now as explained, I wish to allow my users to perform/apply a range of different methods to above text. They should be able to apply all rules, in any given order. Consider below example where I have perfomed 4 rules to the original text document:


As you can see, the text is transformed during each rule, as the method should be applied to the text and output the new text string.

Now the users should be able to save these rules, so the next time the user uploads a document to this specific stream. The thought is, that the next time the user uploads a document, these rules should automatically be applied for the document.

My question is, what would be the best approach to allowing my users to do this dynamically?

I will define the methods/rules that a user can perform on the text document - but what's the best approach:

  1. Save the rules to the database
  2. Programmatically apply the rules to each incoming document (parse each document, based on the rules)

My progress so far

So I am a bit lost on where to begin here, but I was thinking something like below.

Streams: A stream is kind of like a "stack" for all documents. I can upload multiple different documents to a stream. I can create multiple streams, which holds multiple rules.

streams table

id | name
  1. Name: the name of the stream. For example "Documents from Acme Inc"

documents table:

id | stream_id | path | content
  1. Stream Id: A Stream will be can have many documents. So each document uploaded to a specific stream, will be parsed by the rules defined on the stream.
  2. Path: the server path to the document
  3. Content: The text content of the document

parsing_rules table

id | stream_id | method | arguments
  1. Stream Id: Parsing rules will belong to a stream. So all documents imported into the stream, will be parsed by the rules associated with the specific stream.
  2. Rule: the name of the rule applied by the user. This will also refer to the method name in my PHP code.
  3. Arguments: Optional. The arguments that will be applied to each rule/method.

An example of the rules from the 2nd screenshot above, would then look like in the parsing_rules table:

1 | 5 | remove_empty_lines | null
2 | 5 | text_replace | "a:2:{s:6:"Search";s:9:"Laracasts";s:7:"Replace";s:6:"Google";}"
3 | 5 | regex_text_replace | "a:2:{s:7:"Pattern";s:9:"/Google/i";s:11:"Replacement";s:6:"Amazon";}"
4 | 5 | start_position_no_lines | a:1:{s:4:"Line";s:1:"2";}"

So here, method accepts the name of the actual method that should be called, and arguments is the arguments the specific method accepts/requires - but serialized.

How to apply these rules?

I was thinking that each time a new document is uploaded/imported into a stream, I will apply the rules associated with the stream. Something like:

$content = $document->content;
$parsing_rules = $stream->parsingRules()->get();
foreach($parsing_rules as $rule)
    $arguments = unserialize($rule->argments);
    return $this->{$rule->method}($arguments, $content);

Now above is no where near perfect, and it will return the $content already after the first iteration.

Any feedback is highly appreciated. Above is only my thoughts on how to do this project, but I am not sure if there is a better approach to solve this.

Please sign in or create an account to participate in this conversation.