Feature selection for marine species origin prediction

The framework of generalized linear models (GLM) is typically used to model a response variable as a function of a set of p associated covariates, based on a random sample of n elements, whenever the parent population is described by a distribution within the exponential family. The Fisher information matrix (FIM) arises in this context because its inverse yields the Cramér-Rao lower bound for the covariance matrix of the estimated parameters. However, when dealing with high-dimensional data (p > n), the specification (design) matrix does not have full column rank, the FIM is not invertible and, consequently, the estimation process fails to converge. In this talk, I'll discuss some methods that make it possible to overcome this problem by selecting explanatory variables (features), thereby enabling model fitting despite the high dimensionality of the original data. An applied example using a marine biology dataset will be discussed.
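
As a rough illustration of the rank argument above, the following sketch builds the FIM of a logistic GLM, X^T W X, on simulated data with p > n and checks that its rank cannot exceed n. Everything here is an assumption made for illustration: the data are random rather than the marine biology dataset, the logistic link is one choice of exponential-family model, and keeping the first k columns is an arbitrary stand-in for a real feature selection method, not one of the methods covered in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                      # high-dimensional setting: p > n
X = rng.normal(size=(n, p))        # simulated design matrix (assumption)

# Logistic GLM evaluated at beta = 0, so every fitted mean is 0.5
beta = np.zeros(p)
mu = 1.0 / (1.0 + np.exp(-X @ beta))
W = np.diag(mu * (1.0 - mu))       # GLM working weights for the logit link

# Fisher information matrix: p x p, but its rank is at most n
fim = X.T @ W @ X
rank_full = np.linalg.matrix_rank(fim)

# Hypothetical selection step: keep k < n columns (here simply the first k)
k = 5
X_sel = X[:, :k]
fim_sel = X_sel.T @ W @ X_sel
rank_sel = np.linalg.matrix_rank(fim_sel)

# With full rank restored, the Cramer-Rao bound can be computed
crlb = np.linalg.inv(fim_sel)

print("full FIM:", fim.shape, "rank", rank_full)
print("reduced FIM:", fim_sel.shape, "rank", rank_sel)
```

The full FIM is singular because its rank is bounded by n, while the reduced FIM on k < n selected covariates is invertible, which is what makes model fitting feasible after selection.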