To select significant predictors (feature variables), we developed an original
software package that includes:
a genetic algorithm,
a stepwise regression method,
regression trees and model trees built with the M5 method.
3. Building regression trees and model trees using the M5 method
(an M5 model tree is a decision tree learner for regression tasks: a binary decision tree with linear regression functions at its terminal (leaf) nodes, used to predict the values of a continuous numerical response variable), as well as building
ensembles of
M5 trees using Bagging, Random Forests, and Extremely Randomized Trees.
The built trees can also be linearized into decision rules, either directly or using the M5 method. The program accepts continuous, binary, and categorical input variables and handles missing values. Model trees combine a conventional regression tree with linear regression functions at the leaves. This representation usually provides higher accuracy than plain regression trees while preserving a clear, easy-to-interpret structure.
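The idea of a model tree can be illustrated with a minimal sketch. It is not the toolbox implementation: it assumes a single numeric predictor, makes only one split (chosen by standard deviation reduction, the splitting criterion associated with M5), and omits the pruning and smoothing steps of the full method. The function names and the toy data are hypothetical.

```python
import statistics

def fit_line(xs, ys):
    """Least-squares line y = a + b*x for one predictor (b = 0 if x is constant)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx > 0 else 0.0
    return my - b * mx, b

def best_split(xs, ys, min_leaf=2):
    """Return the threshold that maximises the standard deviation reduction (SDR)."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    sd_all = statistics.pstdev(ys)
    best_sdr, best_thr = -1.0, None
    for i in range(min_leaf, n - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between two equal x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        sdr = (sd_all
               - len(left) / n * statistics.pstdev(left)
               - len(right) / n * statistics.pstdev(right))
        if sdr > best_sdr:
            best_sdr, best_thr = sdr, (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_thr

def fit_model_stump(xs, ys):
    """One-split model tree: a threshold plus a linear model in each leaf."""
    thr = best_split(xs, ys)
    if thr is None:
        raise ValueError("no valid split found")
    left = [(x, y) for x, y in zip(xs, ys) if x <= thr]
    right = [(x, y) for x, y in zip(xs, ys) if x > thr]
    return thr, fit_line(*zip(*left)), fit_line(*zip(*right))

def predict(model, x):
    thr, (a_l, b_l), (a_r, b_r) = model
    return a_l + b_l * x if x <= thr else a_r + b_r * x

# Synthetic piecewise-linear data (illustrative only):
# y = x for x < 5 and y = 100 + 3x for x >= 5.
xs = list(range(10))
ys = [x if x < 5 else 100 + 3 * x for x in xs]
model = fit_model_stump(xs, ys)
```

On this toy data a plain regression tree would predict a single constant in each leaf, whereas the leaf regressions here also capture the slope within each region, which is the accuracy advantage described above.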
Thus, by combining the methods GA-PLS, GA-OLS, GA-KNN, GA-RR, GA-PC, FS-PLS, FS-OLS, FS-KNN, FS-RR, and FS-PC with regression trees and model trees built with the M5 method, we obtained a set of significant predictors for the prediction model. The described methods were implemented in the original toolbox.
At the same time, we consistently used a genetic algorithm and a decision tree to select the optimal set of features.
1. Genetic algorithms (GA) are adaptive heuristic search algorithms that belong to the larger class of evolutionary algorithms. They are based on the ideas of natural selection and genetics: a random search is exploited intelligently, guided by historical data, to direct the search toward regions of better performance in the solution space.
They are commonly used to generate high-quality solutions to optimization and search problems.
To select significant predictors, we used the following combined approaches:
GA-PLS (partial least squares),
GA-OLS (ordinary least squares),
GA-PC (principal component),
GA-RR (ridge regression).
Also, to improve the classification, we used
GA-KNN (k-nearest neighbors, KNN). In this case, instead of considering all training samples and taking the k neighbors, we used GA, which immediately takes the k neighbors and then calculates the distances to classify the test samples.
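A GA-based wrapper of this kind can be sketched as follows. This is a minimal illustration, not the toolbox code: it assumes each chromosome is a 0/1 mask over the features, that fitness is the leave-one-out RMSE of a k-nearest-neighbor regressor restricted to the masked features, and that the GA uses truncation selection, one-point crossover, and bit-flip mutation. All names, parameter values, and the synthetic data are hypothetical.

```python
import math
import random

def loo_rmse_knn(X, y, mask, k=3):
    """Leave-one-out RMSE of k-NN regression using only features where mask is 1."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return float("inf")  # an empty feature subset cannot predict anything
    sq_err = 0.0
    for i, xi in enumerate(X):
        dists = sorted(
            (sum((xi[j] - xo[j]) ** 2 for j in cols), yo)
            for o, (xo, yo) in enumerate(zip(X, y)) if o != i
        )
        pred = sum(yo for _, yo in dists[:k]) / k
        sq_err += (pred - y[i]) ** 2
    return math.sqrt(sq_err / len(X))

def ga_select(X, y, pop_size=20, generations=30, p_mut=0.1, seed=0):
    """Search for a feature mask with low leave-one-out RMSE."""
    rng = random.Random(seed)
    n_feat = len(X[0])

    def fitness(mask):
        return loo_rmse_knn(X, y, mask)

    pop = [[rng.randint(0, 1) for _ in range(n_feat)] for _ in range(pop_size)]
    pop[0] = [1] * n_feat  # seed the population with the full feature set
    for _ in range(generations):
        pop.sort(key=fitness)
        parents = pop[: pop_size // 2]        # truncation selection; the best survive
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n_feat - 1)  # one-point crossover
            child = a[:cut] + b[cut:]
            # bit-flip mutation
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = parents + children
    return min(pop, key=fitness)

# Synthetic data (illustrative only): the response depends on features 0 and 2.
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(5)] for _ in range(30)]
y = [2 * r[0] - 3 * r[2] + data_rng.gauss(0, 0.05) for r in X]
best = ga_select(X, y)
```

Because the best masks are carried over unchanged into the next generation, the returned mask can never score worse than the full feature set that seeds the population. Swapping `loo_rmse_knn` for a PLS, OLS, ridge, or principal component fit would give the other GA variants under the same assumptions.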
An example program for the developed Machine Learning and Deep Learning Method for biological systems. The analysis is carried out for charged amino acid residues that are replaced (mutated) in the Spike-ACE2 dimer.
Correlation after clustering represents the dependence between calculated and experimental data
2. Stepwise regression is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or removal from the set of explanatory variables based on a prespecified criterion, usually a forward, backward, or combined sequence of F-tests. The approach used here is forward selection (FS): start with no variables in the model, test the addition of each variable using a chosen model fit criterion, add the variable (if any) whose inclusion gives the most statistically significant improvement of the fit, and repeat until no addition improves the model to a statistically significant extent. The following FS combinations were used:
FS-PLS (partial least squares),
FS-OLS (ordinary least squares),
FS-PC (principal component),
FS-RR (ridge regression),
FS-KNN (k-nearest neighbors).
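The forward selection loop can be sketched for the OLS case as follows. This is an illustration under stated assumptions rather than the toolbox code: the entry criterion is a partial F-statistic compared with a fixed F-to-enter threshold of 4.0 (an assumed conventional default), the OLS fit solves the normal equations directly and so assumes the candidate columns are not collinear, and all names and data are hypothetical.

```python
import random

def ols_rss(X, y, cols):
    """Residual sum of squares of an OLS fit with intercept on the given columns."""
    rows = [[1.0] + [r[j] for j in cols] for r in X]
    p = len(cols) + 1
    # Normal equations (A^T A) beta = A^T y, stored as an augmented matrix.
    M = [[sum(r[a] * r[b] for r in rows) for b in range(p)]
         + [sum(r[a] * t for r, t in zip(rows, y))] for a in range(p)]
    for c in range(p):  # Gauss-Jordan elimination with partial pivoting
        piv = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    beta = [M[a][p] / M[a][a] for a in range(p)]
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, y))

def forward_select(X, y, f_enter=4.0):
    """Add one variable at a time while the partial F-statistic exceeds f_enter."""
    n, n_feat = len(X), len(X[0])
    selected = []
    rss = ols_rss(X, y, selected)  # intercept-only model
    while len(selected) < n_feat:
        candidates = [(ols_rss(X, y, selected + [j]), j)
                      for j in range(n_feat) if j not in selected]
        new_rss, j = min(candidates)   # candidate giving the largest RSS drop
        dof = n - len(selected) - 2    # residual degrees of freedom after adding j
        if dof <= 0 or new_rss <= 0:
            break
        f_stat = (rss - new_rss) / (new_rss / dof)
        if f_stat < f_enter:
            break                      # no candidate improves the fit enough
        selected.append(j)
        rss = new_rss
    return selected

# Synthetic data (illustrative only): the response depends on features 0 and 2.
rng = random.Random(0)
X = [[rng.random() for _ in range(5)] for _ in range(40)]
y = [2 * r[0] - 3 * r[2] + rng.gauss(0, 0.05) for r in X]
selected = forward_select(X, y)
```

On this synthetic data the two informative features enter the model; whether any noise feature also passes the threshold depends on the sample.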
To select significant predictors, we choose the method with the smallest RMSE (root mean square error) among GA-PLS, GA-OLS, GA-KNN, GA-RR, GA-PC, FS-PLS, FS-OLS, FS-KNN,
FS-RR, and FS-PC.
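The selection rule amounts to computing the RMSE of each method against the experimental values and keeping the minimum. The short sketch below illustrates this; the numbers are made-up placeholders, not results from this work, and only three of the ten methods are listed for brevity.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between experimental and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Placeholder values for illustration only (not data from the study).
y_exp = [1.0, 2.0, 3.0, 4.0]
predictions = {
    "GA-PLS": [1.1, 1.9, 3.2, 3.9],
    "GA-KNN": [1.5, 2.5, 2.5, 4.5],
    "FS-OLS": [0.8, 2.3, 3.1, 4.4],
}

scores = {name: rmse(y_exp, pred) for name, pred in predictions.items()}
best_method = min(scores, key=scores.get)  # method with the smallest RMSE
```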
An example program for the developed Machine Learning and Deep Learning Method for biological systems. The analysis is carried out for hydrophobic amino acid residues that are replaced (mutated) in the Spike-ACE2 dimer.
Correlation after clustering represents the dependence between calculated and experimental data
The main buttons in the Machine Learning program.
Details of the experimental studies are described in [Deep mutational scanning of an antibody against epidermal growth factor receptor using mammalian cell display and massively parallel pyrosequencing].