connecteev's avatar

Need to order posts based on a composite score

I am building a content platform and have a Laravel API for my news feed (infinite-scrolling feed like the Linkedin feed, Facebook feed, etc). I need to order the posts in my news feed based on the algorithm below (each parameter is going to have a weight, and I want to use these to calculate and order the results / Posts by the composite score).

                Parameters for algorithm:
                Num Comments
                Num Likes
                Date Created (Recency of Post)
                Date Updated (Recency of Post)
                Date last comment (Recency of Comments)
                Date last like / reaction (Recency of Likes / Reactions)
                Length of Post?
                Fast Rising: Number of likes / comments in past X <hours / days>

The DB structure looks like this: Post hasMany Comments Post HasMany Reactions ('likes')

The schema is here: https://gist.github.com/connecteev/0e0d6751a7056af00ead267189a5864e

How can you do an OrderBy in Laravel based on

  1. a composite score or combined metric from multiple columns
  2. values from related tables (comments, reactions) (like what I am trying to do above)?
0 likes
22 replies
TerrePorter's avatar

@connecteev I would write out a sql statement that does what you want then use that to make a laravel equivalent and in code. Its hard to make that without more about the algorithm and how the weighting is determined.

connecteev's avatar

@terreporter here is a simplified SQL statement I would like to use for the algorithm.

SELECT id, title, created_at, num_comments, num_likes, CHAR_LENGTH(body),
SUM( (IFNULL(num_comments, 0)*0.5) + (IFNULL(num_likes, 0)*0.4) + (CHAR_LENGTH(body)*0.1) ) / (UNIX_TIMESTAMP(now()) - UNIX_TIMESTAMP(created_at)) AS hotness_score 
from posts 
where created_at > '2019-09-21 22:14:55'
GROUP BY id 
order by hotness_score desc 
limit 100;

This doesn't do a join with the comments / reactions tables, and so this algorithm doesn't consider

* Date of last comment (Recency of Comments, from comments table, which contains post_id)
* Date of last like / reaction (Recency of Likes / Reactions, from reactions table, which contains post_id))

Is there a way to convert the above SQL to a Laravel Eloquent? And maybe an eloquent way to include date of last comment and reaction?

willvincent's avatar

Given the complexity of the sql, probably easier (and arguably better) to just leave it as sql than trying to force it to conform to eloquent query builder syntax.

1 like
connecteev's avatar

@jlrdw @willvincent I marked this as "not resolved" because there's one thing I'm unsure about.

This is the query for retrieving all posts.

        $returnedPosts = Post::where([
            ['status', '=', 'published'],
            ['is_removed_by_admin', '=', '0'],
        ]);
        $returnedPosts = $returnedPosts->orderBy('created_at', 'DESC');

And this is the RAW SQL query for sorting posts by hotness score (my algoritm).

SELECT id, title, created_at, num_comments, num_likes, CHAR_LENGTH(body),
SUM( (IFNULL(num_comments, 0)*0.5) + (IFNULL(num_likes, 0)*0.4) + (CHAR_LENGTH(body)*0.1) ) / (UNIX_TIMESTAMP(now()) - UNIX_TIMESTAMP(created_at)) AS hotness_score 
from posts 
where created_at > '2019-09-21 22:14:55'
GROUP BY id 
order by hotness_score desc 
limit 100;

How would you replace:

$returnedPosts = $returnedPosts->orderBy('created_at', 'DESC');

with the algorithm? Is there a clean way to do that?

jlrdw's avatar
// after this
GROUP BY id 
order by hotness_score desc, created_at DESC   // just add the extra order by.
limit 100;

Or reverse if hotness_score takes priority. But am I seeing wrong, is that two separate queries. I am still at study.

Edit: Something to look over, may need the first query as regular sql, if you go that route see my answer here on query a query, usually it takes some trial and eror:

https://laracasts.com/discuss/channels/eloquent/calculate-average-of-multiple-fields-eloquent

connecteev's avatar

@jlrdw I just discovered orderByRaw() and groupBy() and tried this, but it doesn't quite work.

$returnedPosts = Post::where([
            ['status', '=', 'published'],
            ['is_removed_by_admin', '=', '0'],
        ]);

$returnedPosts = $returnedPosts->orderByRaw("SUM( (IFNULL(num_comments, 0)*0.5) + (IFNULL(num_likes, 0)*0.4) + (CHAR_LENGTH(body)*0.1) ) / (UNIX_TIMESTAMP(now()) - UNIX_TIMESTAMP(created_at)) DESC")->groupBy('id');

If I run the same raw SQL query in Sequel Pro against the same table, I get a different order of results:

SELECT id, title, created_at, num_comments, num_likes, CHAR_LENGTH(body),
SUM( (IFNULL(num_comments, 0)*0.5) + (IFNULL(num_likes, 0)*0.4) + (CHAR_LENGTH(body)*0.1) ) / (UNIX_TIMESTAMP(now()) - UNIX_TIMESTAMP(created_at)) AS hotness_score 
from posts 
-- where created_at > '2019-09-21 22:14:55'
GROUP BY id 
order by hotness_score desc 
-- limit 100
;

which leads me to believe that the orderByRaw above isn't correct.

whoisthisstud's avatar

I'm still a newbie, so my take on this may be crap lol.

What if you queried the data and then used collection methods to set the score for each record and sort the records before passing the result?

$posts = Post::all()->where([
            ['status', '=', 'published'],
            ['is_removed_by_admin', '=', '0'],
         ])
         ->withCount('comments')
         ->withCount('reactions');

$posts->each(function ($item, $key) {
    // Not tested, but you get the point...
    $comment_score = ( $item['comments_count'] == null ? 0 : $item['comments_count'] * 0.5 );
    $reaction_score = ( $item['reactions_count'] == null ? 0 : $item['reactions_count'] * 0.4 );
    $body_score = strlen( $item['body'] ) * 0.1;

    $item['hotness_score'] = ($comment_score + $reaction_score + $body_score) / (\Carbon::now() - $item['created_at'];
});

$sorted = $posts->sortBy('hotness_score');

$sorted->values()->all();

*Updated to add the where() statement to the query and add calc within the each() collection method.

connecteev's avatar

@whoisthisstud thanks but I'd prefer not to loop through the results in a for loop, this would slow down the API and I'd prefer to use something "native" to Laravel if that makes sense.

jlrdw's avatar

Your where parts:

where([
            ['status', '=', 'published'],
            ['is_removed_by_admin', '=', '0']

Perhaps convert to regular sql also and blend them in to the other query, so you will have just one query.

And Right off hand I'd say you could probably work out one eloquent query to do all. But either way it does take some trial and error to work out these queries.

I usually use a sql designer, comes in very handy.

Here is another example of an eloquent query that took a while:

https://laracasts.com/discuss/channels/eloquent/orderby-computed-related-attribute

And you saw my average example, right.

1 like
willvincent's avatar

@whoisthisstud It's generally best to do sorting within the DB whenever possible, especially if you want to paginate (which you almost always should want to do)...

The other option not yet discussed that would be reasonable would be to precalculate the sorting score via a cron job, so every 5/10/15/30 minutes, or whatever, re-run a job that calculates the score for all the records, and stores that value so it can be easily incorporated into a query.

There would be some lag time between sort ordering getting updated, but the drastic ease of querying may be a worthwhile tradeoff to some of the realtimeyness

3 likes
connecteev's avatar
connecteev
OP
Best Answer
Level 11

@jlrdw @willvincent @whoisthisstud There's no way I can mark all of your answers "Best Answer", but I'd like to say I'm appreciative of the help all of ya'll have been. I'm going to mark mine the "best" because it's closest to what works and I hope it will help someone else in the future. The thoughts shared here have been eye-opening, though.

Turns out that my prior SQL query was incomplete (I was missing the where condition in my Raw SQL - and this one below actually works:)

$returnedPosts = Post::where([
            ['status', '=', 'published'],
            ['is_removed_by_admin', '=', '0'],
        ]);

$returnedPosts = $returnedPosts->orderByRaw("SUM( (IFNULL(num_comments, 0)*0.5) + (IFNULL(num_likes, 0)*0.4) + (CHAR_LENGTH(body)*0.1) ) / (UNIX_TIMESTAMP(now()) - UNIX_TIMESTAMP(created_at)) DESC")->groupBy('id');

Unlike what I mentioned in my last comment above, it matches with the SQL query below:

select `id`, `title`, `slug`, `author_id`, `featured_image_caption`, `reading_time_minutes`, `num_comments`, `num_likes`, `created_at`, `updated_at` 
from `posts` 
where (`status` = 'published' and `is_removed_by_admin` = '0') 
and (`created_at` >= "2000-01-01T00:00:00.000000Z") 
and `posts`.`deleted_at` is null 
group by `id` 
order by SUM( (IFNULL(num_comments, 0)*0.5) + (IFNULL(num_likes, 0)*0.4) + (CHAR_LENGTH(body)*0.1) ) / (UNIX_TIMESTAMP(now()) - UNIX_TIMESTAMP(created_at)) DESC 
limit 15 offset 0;

;

I was missing the where (`status` = 'published' and `is_removed_by_admin` = '0') conditions and so the results were inconsistent previously.

@jlrdw I did read most of your comments and examples, though I have to admit some of it was a bit complicated. @willvincent I like the cron approach as well, I might give that a go in the future if this API becomes slow. I agree with you here: There would be some lag time between sort ordering getting updated, but the drastic ease of querying may be a worthwhile tradeoff to some of the realtimeyness

Another approach that others could consider (which won't work in this specific case, unfortunately): a DB trigger to calculate the hotness_score every time a row is inserted or updated. This would have worked except for the fact that my algorithm relies on NOW() and calculates the time difference, and that would quickly become stale if we relied on updating the hotness_score on create/update only.

jlrdw's avatar

I will definitely bookmark this one also to refer others to.

But that's what it's all about asking for some help and people trying to help.

2 likes
willvincent's avatar

Another reason (besides ease of use) the cron option is good to keep in mind, as your dataset grows, the various calculations involved in your sort will occupy more time, so hiding that calculation latency from users, by introducing some overall 'freshness' latency could be a very necessary speed enhancement too...

3 likes
michalk's avatar

Another option would be to calculate SUM( (IFNULL(num_comments, 0)*0.5) + (IFNULL(num_likes, 0)*0.4) + (CHAR_LENGTH(body)*0.1) ) in PHP and store in activity_score on update on insert. Then you would just use activity_score/ (UNIX_TIMESTAMP(now()) - UNIX_TIMESTAMP(created_at)) DESC. Because the SUM and CHAR_LENGTH and all the calculations seem to be heavy on the DB.

1 like
connecteev's avatar

@willvincent that makes sense to me. Thank you and I will keep this in mind - may move this logic to a cron later to speed things up.

@michalk I actually really like the optimization you did there. Even when I move to a cron, I don't have to move EVERYTHING to the cron. Thanks for the tip.

Please or to participate in this conversation.