Hi all!
I am building a computational model to predict the biological effect of mutant proteins, aiming to improve on the existing ones. In simplified terms, two variables determine the effect of a mutation: the affinity of the mutant protein and its normalized expression (normalized quantity). Current models consider only the first. For a protein (a), the effect of the mutation (F) is defined as:
[math]F(a) = \ln\left(\frac{K_{d,\text{wildtype}}}{K_{d,\text{mutant}}}\right)[/math]
Therefore, it depends only on the ratio between the affinity of the wild-type protein and that of the mutant one.
One of my contributions is to combine this ratio with the normalized expression (tpm), as follows:
[math]F(a) = \ln\left(\frac{K_{d,\text{wildtype}}}{K_{d,\text{mutant}}} \times \log_2(\text{tpm})\right)[/math]
We always normalize the expression as tpm (transcripts per million) and scale it with log10 or log2. I am working with different cohorts, and I have the corresponding metadata, which contains the expected effect of each mutation. In some cohorts, multiplying the ratio by the tpm works better; in others, summing the two terms works better; and in others it works better to take log2(tpm) out of the ln, keep the two as separate logarithms, and multiply them. It feels like I am just trying every combination without a sound mathematical basis.
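To make the trial-and-error concrete, this is roughly how the variants could be compared side by side against the expected effects. It is only a minimal sketch: the names `CANDIDATES` and `score_candidates` are made up for illustration, the numbers are toy values rather than my data, Pearson correlation is just one possible score, and I am assuming "summing both terms" means a sum inside the ln.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# ratio = Kd_wildtype / Kd_mutant. The ln(...) forms need a positive argument,
# so this sketch assumes ratio > 0 and tpm > 1 (so that log2(tpm) > 0).
CANDIDATES = {
    "ratio_only": lambda ratio, tpm: math.log(ratio),
    "product_inside_ln": lambda ratio, tpm: math.log(ratio * math.log2(tpm)),
    # Assumption: "summing both terms" is read as a sum inside the ln.
    "sum_inside_ln": lambda ratio, tpm: math.log(ratio + math.log2(tpm)),
    "product_of_logs": lambda ratio, tpm: math.log(ratio) * math.log2(tpm),
}

def score_candidates(ratios, tpms, expected):
    """Correlation of each candidate score with the expected effect."""
    scores = {}
    for name, combine in CANDIDATES.items():
        predicted = [combine(r, t) for r, t in zip(ratios, tpms)]
        scores[name] = pearson(predicted, expected)
    return scores

# Toy numbers for illustration only, not real measurements.
toy_ratios = [0.5, 1.0, 2.0, 4.0, 8.0]
toy_tpms = [2.0, 16.0, 4.0, 64.0, 8.0]
toy_expected = [0.1, 1.2, 0.9, 2.5, 1.8]

for name, r in score_candidates(toy_ratios, toy_tpms, toy_expected).items():
    print(f"{name}: r = {r:.3f}")
```

Scoring every variant on the same cohort at least makes the comparison systematic, although picking the winner per cohort would still risk overfitting to that cohort.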
Therefore, my question is: how can I find the best way of combining two terms when I am 100% sure that both contribute to the overall effect and that both are positively correlated with it (the higher the ratio and the higher the tpm, the bigger the effect)?
Is there a way of using probabilistic models to find an optimal parameter that weights both terms, trained on a gold-standard dataset (which I already have)? For example, say I try to optimize a hyperparameter 'G' between the two variables:
[math]F(a) = \ln\left(\frac{K_{d,\text{wildtype}}}{K_{d,\text{mutant}}} \times G \log_2(\text{tpm})\right)[/math]
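If I expand the logarithm, ln(ratio x G x log2(tpm)) = ln(ratio) + ln(G) + ln(log2(tpm)), so a constant G inside the ln seems to only shift F by ln(G) rather than change the relative weight of the two terms. What I probably mean is something closer to separate weights on each term. Below is a minimal sketch of that idea, assuming a plain linear form b0 + w_aff * ln(ratio) + w_exp * log2(tpm) fitted by ordinary least squares on the gold standard. The function name `fit_two_term_model` and the weight names are hypothetical, and I am not claiming this is the right model, only one way the weighting could be learned from data.

```python
import math

def fit_two_term_model(ratios, tpms, expected):
    """Least-squares fit of expected ~ b0 + w_aff*ln(ratio) + w_exp*log2(tpm).

    ratios   : Kd_wildtype / Kd_mutant per mutation (must be > 0)
    tpms     : expression in tpm per mutation (must be > 0)
    expected : gold-standard effect per mutation
    """
    x1 = [math.log(r) for r in ratios]
    x2 = [math.log2(t) for t in tpms]
    n = len(expected)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(expected) / n

    # Centered sums of squares and cross-products (the normal equations).
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (y - my) for a, y in zip(x1, expected))
    s2y = sum((b - m2) * (y - my) for b, y in zip(x2, expected))

    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("the two terms are collinear; weights are not identifiable")

    # Solve the 2x2 system for the weights, then recover the intercept.
    w_aff = (s22 * s1y - s12 * s2y) / det
    w_exp = (s11 * s2y - s12 * s1y) / det
    b0 = my - w_aff * m1 - w_exp * m2
    return b0, w_aff, w_exp

def predict(b0, w_aff, w_exp, ratio, tpm):
    """Predicted effect for one mutation under the fitted weights."""
    return b0 + w_aff * math.log(ratio) + w_exp * math.log2(tpm)
```

With this form the fitted w_aff and w_exp would tell me how much each term matters, and I could fit on one cohort and check the predictions on another.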
Is there any other approach, apart from a probabilistic model, that would suit my purpose better?
By the way, the distributions of the ratio values and of the tpm both look like normal distributions. I am attaching a histogram of the distributions of the metrics I am using, in case it helps. KdWt - Wild-type affinity, KdMt - Mutant affinity.
Thank you very much for all the help provided. It's so nice to have forums like this one.
Have a great day everyone!
AP