tafulini's avatar

Model/Database Design - best way to save a bunch of similar properties (genes)?

I try to modernize an animal genetics hobby project of mine (>15 years old).

I have entries with title, description, gene code and images. The gene code consists of gene pairs currently saved in table columns. This genes are similar, currently there are 17, but they will be more in the future.

While upgrading i think about extending the project so it can be used for more than one species. All species have in fact the same genes, but not each gene is relevant for each species.

Maybe i am going to explain a little too much, but i am unsure how much you need to know to understand the question.

  • Each animal has 2 genes of each sort (on from each parent, you know).
  • Genes may have dominant variants, recessive ones or variants with partial dominance.
  • Often only the dominant one matters, but not always.
  • For the partially dominant ones it's a sort of addition of both.

I use numbers to save the different gene variants. This allows me to sort them by dominance and to use what notation i want (scientific names, common names, ...) So for each gene i concatenate the numbers and save them in a char(2) column. 0 means not set (f.e. in search), 9 means unknown. Example: 19 = one is the dominant one, the other is unknown 22 = both are the same - the recessive variant.

From genes relevant for sorting a number like 112142001013 is created as sort_order to get entries lists ordered by dominance and gene order.

There are genes which have very common mutations in most species up to genes only relevant for one species. Nonrelevant genes may be saved as NULL or with there wildtype as default value.

I need to enlarge the gene number pool to > 10. Maybe 1090 and 2020 rather than 19 and 22, which causes problems with 00, which works as string, but not as a real number.

As a first step of refactoring i created a bunch of similar Enums for the gene variants and a bunch of classes which holds pairs of them for each gene.

I made Enums like this:

\App\Enums\Species1\gene1, gene2, ...
\App\Enums\Species2\gene1, gene2, ...

All Enums use the same trait to get the number, name, symbol, ...

The classes are like this:

\App\Models\Species1\Gene1, Gene2, ...
\App\Models\Species2\Gene1, Gene2, ...
\App\Models\BaseGene, Gene1, Gene2, ...

final class Species/Gene1 inherits from class Gene1 which inherits from abstract class BaseGene. So i can use common methods for common genes and may overwrite them on each level.

Question

Now i need to put this genes back in my entry model. But what's the best/most efficiently way for this?

Several groups of genes are used to generate name parts of the generic name for the genetic variant of the animal. Like this:

Genes ABC  ->	part1		
Gene DEF ->		part2		
Genes GHI ->	part3

generic Name: __('part1') . __('part2') . __('part3')

I am not sure if it's useful to use gene groups. These groups may be different between species. I could use BaseGenecode and GenecodeSpecies1, 2, ...

  • Put all the genes in a json column? Probably not. I need to search entries with certain genes.
  • Let them stay in the entries table and add more columns for new genes?
  • Made a separate table gene_codes with all the genes (mostly to split it up someway)?
  • Made separate gene_code tables per species? (probably not)
  • Made a hasMany genes table with all the single gene pairs? But then i need to search all the genes for all the entries for listing them with gene codes...
  • ???

Thank you for reading down here! I would be glad if someone has time to help me think around this problem.

0 likes
8 replies
martinbean's avatar

@tafulini This sounds like a very specific problem and unfortunately, the best person to decide on the schema is yourself, since you have this specific knowledge.

I don’t really have an understanding of genes and dominant/recessive traits other than secondary school education, so don’t think I’d be able to offer any further insight to how to model and store data that you wouldn’t have already considered yourself.

Given this is a highly specialised area, are you sure that aren’t already projects that solve this problem? As a database holding gene data seems like something someone else in academia or gene research would have already needed and built at least once.

tafulini's avatar

No, as far as i know, there is no such database for users to add new entries and look up, how animals with certain genes may look. You don't want to know how scientists databases look: unstyled, no images, no descriptions, only some strange words and numbers with links to strange data and DNA sequences no nonprofessional understands...

I know about the genes, that's not the problem, but i don't know enough about mysql to decide how to save them.

To simplify it - i need to save a lot of similar properties which comes in pairs of 2 and there type/amount may differ between species (some have 7 relevant genes, others 20 or so). Important data about the genes like symbols, htmlSymbols, names and so on, i had in large arrays previous, now are coming from the Enum classes, so the only things i need to add to the entry model are the numbers (1090, 2020). Is it a good idea to use Enums for this? I think i could use a sort of Enum cast to convert the numbers into gene classes with 2 Enums each. The other thing i need genes for is to create search links for animals with a distinct genetic. The old app creates a dummy entry to use the gene code part of it.

What's about the table with one gene/row? Is accessing a list of entries with all there related genes a sql / performance nightmare or a good idea?

martinbean's avatar

What's about the table with one gene/row? Is accessing a list of entries with all there related genes a sql / performance nightmare or a good idea?

@tafulini I think this is possibly where the solution would head. From what you’ve described, given different animals/species may have different numbers of gene pairs, then it sounds like you need to create a “has many” relation between the animal, and then multiple rows—one per gene pair, that the animal has.

But again, this is where my lack of understanding shows, as I don‘t know if that will have implications. Is it possible for animals to have genes made up of more than two alleles? Or one allele (i.e. asexual organisms)?

tafulini's avatar

Is it possible for animals to have genes made up of more than two alleles?

No, in this case not :-)

tafulini's avatar

@martinbean What's the disadvantage of saving the genes in a table with columns for each gene using NULL for not needed ones?

martinbean's avatar

What's the disadvantage of saving the genes in a table with columns for each gene using NULL for not needed ones?

@tafulini Redundancy.

tafulini's avatar

@martinbean Not needed NULL values are much worse than getting all this genes for each entry for list views, searching, ...? We speak about maybe 5 or 10 species with some more or fewer relevant genes, not hundreds of them.

tafulini's avatar

i am really interested in an answer.

Please or to participate in this conversation.