

Most of the approaches in use also hinge on the data conforming to a priori assumed distributions, which requires extensive data preparation and careful application of the methods for identifying differential behavior.


When features are partitioned into two sets by such thresholds, the first set, S, is further investigated for its phenotypic relevance, whereas the other is excluded from further analysis and regarded as the baseline distribution with noise, either biological (coexisting, causally unrelated processes) or technical. This approach as a selection philosophy is currently being debated. Specifically, for p-value thresholding, critics raise the issues of incomplete information, misrepresentation, misinterpretation and bias, whereas for fold change thresholding, the issues cited include the adoption of arbitrary thresholds with no theoretical underpinning for their values and, consequently, the potential for strong bias. Ideally, selection thresholds should take into account the form of the data distributions, the presence of confounders, and the complexity of the phenomenon under investigation.
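The fixed-threshold partition criticized here can be sketched as follows. This is a minimal illustration, not code from the study: the function name `threshold_partition` and the default cutoffs (`alpha=0.05`, `min_abs_log2_fc=1.0`) are assumptions, chosen only because they are commonly quoted conventions.

```python
import numpy as np

def threshold_partition(p_values, log2_fc, alpha=0.05, min_abs_log2_fc=1.0):
    """Split features into a selected set S and a discarded remainder.

    Both cutoffs are arbitrary conventions (hypothetical defaults),
    which is exactly the weakness discussed in the text.
    """
    p_values = np.asarray(p_values, dtype=float)
    log2_fc = np.asarray(log2_fc, dtype=float)
    # A feature enters S only if it passes both fixed cutoffs.
    selected = (p_values < alpha) & (np.abs(log2_fc) >= min_abs_log2_fc)
    return selected, ~selected

# Toy values: tightening alpha changes S although the data are unchanged.
p = [0.001, 0.04, 0.04, 0.2]
fc = [2.0, 1.2, 0.8, 3.0]
s_default, _ = threshold_partition(p, fc)
s_strict, _ = threshold_partition(p, fc, alpha=0.01)
print(s_default.sum(), s_strict.sum())
```

The membership of S depends entirely on the chosen cutoffs, so two analysts with different conventions would report different feature lists from the same data.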
Methods: Our work extends the rank product (RP) methodology with a neutral selection method of high information-extraction capacity. We introduce the calculation of the RP entropy of the distribution to isolate the features of interest by their contribution to its information content. The goal is a methodology for threshold-free identification of the differentially expressed features that are highly informative about the phenomenon under study. Conclusions: Applying the proposed method to microarray (transcriptomic and DNA methylation) and RNA-seq count data of varying size and noise level, we observe robust convergence of the different parameterizations to stable cutoff points. Functional analysis through BioInfoMiner and EnrichR was used to evaluate the information potency of the resulting feature lists. Overall, the derived functional terms provide a systemic description highly compatible with the results of traditional statistical hypothesis testing techniques. The methodology behaves consistently across different data types. The feature lists are compact and rich in information, indicating phenotypic aspects specific to the tissue and biological phenomenon investigated. Selection by information content measures efficiently addresses the problems arising from arbitrary thresholding, thus facilitating the full automation of the analysis.

Data analysis of high-throughput technologies (microarrays, next-generation sequencing) is commonly predicated on the adoption of arbitrary p-value and fold change thresholds to define the reliability and relevance of a set of features, in order to partition the initial distribution into two sets.
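To make the idea of a rank product combined with an entropy measure more concrete, a minimal sketch is given below. It is not the authors' implementation. The rank product is computed in its usual form, as the geometric mean of a feature's ranks across replicates. The mapping from rank products to a probability distribution (`1 / rp`, normalized) and the selection rule (keep features whose entropy term exceeds the average term) are assumptions made purely for illustration, and the names `rank_product` and `entropy_selection` are hypothetical.

```python
import numpy as np

def rank_product(values):
    """Geometric mean of per-replicate ranks (rank 1 = largest value).

    `values` is a features x replicates array, e.g. log fold changes.
    """
    values = np.asarray(values, dtype=float)
    # A double argsort turns each column into 0-based ranks; negating first
    # gives rank 1 to the largest value. Ties are broken arbitrarily.
    ranks = (-values).argsort(axis=0).argsort(axis=0) + 1
    return np.exp(np.log(ranks).mean(axis=1))

def entropy_selection(rp):
    """Hypothetical selection rule based on Shannon entropy terms."""
    rp = np.asarray(rp, dtype=float)
    # Assumption: small rank products (consistently top-ranked features)
    # receive more probability mass.
    p = (1.0 / rp) / np.sum(1.0 / rp)
    contributions = -p * np.log2(p)      # per-feature entropy terms
    total_entropy = contributions.sum()  # Shannon entropy in bits
    # Assumption: keep features whose term exceeds the average term.
    selected = contributions > total_entropy / len(rp)
    return selected, total_entropy

# Synthetic data: 200 features, 4 replicates, the first 10 shifted upward.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 4))
data[:10] += 3.0
selected, h = entropy_selection(rank_product(data))
print(int(selected.sum()), "features selected; entropy =", round(h, 3), "bits")
```

The cutoff here is derived from the data rather than fixed in advance, which is the property the text emphasizes. Note that the per-feature term `-p * log2(p)` increases with `p` only while `p < 1/e`, so this toy rule is meaningful only when there are many features and no single feature dominates.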
This work proposes a methodology that automates and standardizes statistical selection by using established measures such as entropy, which is already used in information retrieval from large biomedical datasets. It thus departs from classical fixed-threshold methods, which rely on arbitrary p-value and fold change values as selection criteria and whose efficacy also depends on the degree of conformity to parametric distributions.

Background: Here, we propose a threshold-free selection method for the identification of differentially expressed features based on robust, non-parametric statistics, ensuring independence from the statistical distribution properties and broad applicability. Such methods could adapt to different initial data distributions, in contrast to statistical techniques based on fixed thresholds.
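One reason rank-based statistics are insensitive to the shape of the data distribution is that ranks do not change under any strictly increasing transformation of the values. The short sketch below illustrates this with made-up numbers; the helper `ranks` is a hypothetical name, not part of any published tool.

```python
import numpy as np

def ranks(x):
    """Return 1-based ranks (1 = smallest value); assumes no ties."""
    x = np.asarray(x, dtype=float)
    r = np.empty(len(x), dtype=int)
    r[np.argsort(x)] = np.arange(1, len(x) + 1)
    return r

# Made-up, strongly skewed intensities and their log2 transform.
raw = np.array([0.5, 8.0, 2.0, 120.0, 1.0])
logged = np.log2(raw)

# The two scales have very different distributions but identical ranks.
print(ranks(raw))
print(ranks(logged))
```

Any selection built only on ranks therefore gives the same result on the raw and the log-transformed scale, whereas a fixed cutoff on the values themselves does not.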
