What database architecture for an analytics system?

Hi, I'm going to create an app where users can create pages that can be visited. I want to store each visit so that my users can have a visitor dashboard (date, country, browser, device...).

I'm using MariaDB and I was going to make a simple "visits" table with a page_id and user_id to filter but, eventually when this table grow too much, queries are going to be very slow.

It's mostly a read-only table, but with heavy aggregated data.

Because the data can be coupled during queries, I can't even split or have copies of this table so as to reduce it. So what's the best thing here?

I was thinking of making an SQLite DB for each user, but for that I would need to also be able to dynamically change the connection based on the loggedin user. That would also create N SQLite files on my server which could be problematique as the nb of user grows.

LaryAI

2 years ago

Level 58

Creating a scalable database architecture for an analytics system can be challenging, especially when dealing with large volumes of data. Here's a solution that balances performance and complexity using MariaDB:

Use Indexes: Ensure that your visits table has appropriate indexes. For example, indexing page_id will speed up queries that filter by page. You should also consider indexing other frequently queried fields like date, country, browser, and device.
Partitioning: If the visits table grows very large, you can use partitioning to divide it into smaller, more manageable pieces. Partitioning can be done by range (e.g., date), by list (e.g., country), or even by hash (e.g., hashing page_id).
Summary Tables: Create summary tables that aggregate data for common queries. For example, you could have a daily summary table that stores the number of visits per page per day. This can significantly reduce the load on the main visits table for aggregate queries.
Archiving Old Data: Implement a data retention policy where old visit data is archived to a separate table or database. This keeps the main visits table smaller and more performant.
Asynchronous Processing: Consider writing visit data to a queue first (like RabbitMQ or Kafka) and then have a background process that batch inserts data into the database. This can help in handling large bursts of traffic without slowing down your application.
Database Sharding: If a single database instance becomes a bottleneck, you might need to look into sharding your database. This means splitting your data across multiple databases based on a shard key (like user_id).
Using a NoSQL Database: For analytics data, sometimes a NoSQL database like MongoDB or a time-series database like InfluxDB might be more appropriate due to their ability to handle large volumes of write-heavy operations.
Caching: Use caching for read operations wherever possible. This can be done at the application level or by using a dedicated caching system like Redis or Memcached.

Here's an example of how you might structure your visits table with indexes:

CREATE TABLE visits (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    page_id BIGINT NOT NULL,
    visit_date DATE NOT NULL,
    country VARCHAR(2),
    browser VARCHAR(255),
    device VARCHAR(255),
    -- Other fields...
    INDEX idx_page_id (page_id),
    INDEX idx_visit_date (visit_date),
    INDEX idx_country (country),
    -- Additional indexes as needed...
);

And here's an example of a summary table:

CREATE TABLE daily_visits_summary (
    page_id BIGINT NOT NULL,
    summary_date DATE NOT NULL,
    visit_count BIGINT DEFAULT 0,
    PRIMARY KEY (page_id, summary_date)
);

Regarding the idea of using an SQLite database for each user, it's generally not recommended due to the complexity of managing many files and the overhead of dynamically switching connections. It's better to use a robust database system designed for concurrency and large datasets, like MariaDB, and optimize it using the strategies mentioned above.

Remember, the right solution will depend on the specific requirements and constraints of your application, as well as the expected load and growth. It's important to monitor performance and adjust your strategy as needed.

Snapey

2 years ago

Level 122

databases have indexing techniques for such situations. Just make sure you use them

darkylmnx

2 years ago

Level 1

@Snapey I'm going to use indexes for sur but, when the table hits 700k to 1M rows, won't I have a lot of issues there?

Snapey

2 years ago

Level 122

@darkylmnx mysql is perfectly capable.

Snapey

2 years ago

Level 122

@darkylmnx you will find this interesting for the work you have coming ahead

https://youtu.be/ZdLd5Ijlv1o?si=HJ0MI9Ttqtco35J6

darkylmnx

2 years ago

Level 1

@Snapey thanks, I've been looking into partitionning too, seems like something interesting, I barely do database work, usually we had a specialist for that in previous companies I worked for.

martinbean

2 years ago

Level 80

@darkylmnx Use an event store, since that’s what analytics are: tracking of events over time.

If you use an event store, you can then “decorate” your events using projectors to enrich incoming events. For example, if you capture the user’s IP address as part of an ingested event, you could then use a service to look up the location of that IP address, which will enable reporting on where, geographically, visits are coming from.

That being said, it’s a bit odd to embark on basically recreating Google Analytics if you “barely do database work”, as this is going to be very database-focused.

Sairahcaz

1 year ago

Level 6

Hi @darkylmnx, if you like you can use https://simplestats.io, an analytics tool tailored for Laravel apps, which exactly does what you described.

Please or to participate in this conversation.