Search results
Fabrice Coutier and Giovanni Sebastiani
Abstract
Purpose
The purpose of this paper is to describe a fast and easy method for both clustering samples and identifying active genes in cDNA microarray data.
Design/methodology/approach
The method alternates between identifying the active genes with a mixture model and clustering the samples with Ward hierarchical clustering. The starting point of the procedure is obtained by means of a χ² test. The method attempts to locally minimize the sum of the within-cluster sample variances under a suitable Gaussian assumption on the distribution of the data.
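The clustering half of this procedure can be illustrated with a minimal sketch of Ward's criterion, which merges at each step the pair of clusters whose union least increases the total within-cluster sum of squares. This is not the authors' implementation: the mixture-model gene identification and the χ² initialization are omitted, and the function names and toy samples are hypothetical.

```python
from itertools import combinations

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def within_ss(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum((x - m) ** 2 for p in points for x, m in zip(p, c))

def ward_cost(a, b):
    """Increase in total within-cluster sum of squares if a and b merge."""
    ca, cb = centroid(a), centroid(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * d2

def ward_clusters(samples, k):
    """Greedy agglomeration: start from singletons, stop at k clusters."""
    clusters = [[s] for s in samples]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

# Hypothetical toy samples: two well-separated groups of expression profiles.
samples = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
clusters = ward_clusters(samples, 2)
print(clusters)
```

In an alternating scheme of the kind described, a step like `ward_clusters` would be rerun on the genes currently flagged as active, so the sketch covers only one side of the loop.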
Findings
This paper illustrates the proposed methodology and its success by means of results from both simulated and real cDNA microarray data. The comparison of the results with those from a related known method demonstrates the superiority of the proposed approach.
Research limitations/implications
Only empirical evidence of algorithm convergence is provided. Theoretical proof of algorithm convergence is an open issue.
Practical implications
The proposed methodology can be applied to perform cDNA microarray data analysis.
Originality/value
This paper provides a contribution to the development of successful statistical methods for cDNA microarray data analysis.
Richard S. Segall, Gauri S. Guha and Sarath A. Nonis
Abstract
Purpose
This paper seeks to present a complete set of graphical and numerical outputs of data mining performed for microarray databases of plant data as described in earlier research by the authors. A brief description of data mining is also presented, as well as a brief background of previous research.
Design/methodology/approach
The paper applies data mining with SAS Enterprise Miner Version 4 to plant data from the Osmotic Stress Microarray Information Database (OSMID), which is available on the web as both normalized and log2-transformed data.
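As a brief illustration of why microarray intensities are often log2-transformed, the sketch below computes a log2 ratio of treated to control intensity, so that a twofold increase and a twofold decrease become symmetric values of +1 and -1. The function name and intensity values are hypothetical and are not taken from OSMID.

```python
import math

def log2_ratio(treated, control):
    """Log2 fold change of a treated intensity relative to its control."""
    return math.log2(treated / control)

# Hypothetical intensities: doubled, halved, and unchanged expression.
print(log2_ratio(200.0, 100.0))  # 1.0
print(log2_ratio(50.0, 100.0))   # -1.0
print(log2_ratio(100.0, 100.0))  # 0.0
```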
Findings
This paper illustrates that useful information about the effects of environmental stress tolerances (ESTs) on plants can be obtained by using data mining.
Research limitations/implications
Use of SAS Enterprise Miner was very effective for performing data mining of microarray databases with its modules of cluster analysis, decision trees, and descriptive and visual statistics.
Practical implications
The data used from the OSMID database are considered to be representative of those that could be used for biotech application such as the manufacture of plant‐made‐pharmaceuticals and genetically modified foods.
Originality/value
This paper contributes to the discussion on the use of data mining for microarray databases and specifically for studying the effects of ESTs on plants.
Richard S. Segall and Qingyu Zhang
Abstract
Purpose
To present research in the area of the applications of modern heuristics and data mining techniques in knowledge discovery.
Design/methodology/approach
Applications of data mining for neural networks using NeuralWare Predict® software, genetic algorithms using Biodiscovery GeneSight® (2005) software, and regression and discriminant analysis using SPSS® were selected for bioscience data sets of continuous numerical‐valued Abalone fish data and discrete nominal‐valued mushroom data.
Findings
This paper illustrates the useful information that can be obtained using data mining for evolutionary algorithms, specifically those for neural networks, genetic algorithms, regression analysis, and discriminant analysis.
Research limitations/implications
The use of NeuralWare Predict® was a very effective method of implementing training rules for neural networks to identify the important attributes of numerical and nominal valued data.
Practical implications
The software and algorithms discussed in the paper can be used to visualize and mine microarray data.
Originality/value
The paper contributes to the discussion on the data visualization and data mining of microarray databases for bioinformatics and emphasizes the new applicability of modern heuristics and software.
Nageswara Rao Eluri, Gangadhara Rao Kancharla, Suresh Dara and Venkatesulu Dondeti
Abstract
Purpose
Gene selection is considered a fundamental process in the bioinformatics field. Existing methodologies for cancer classification are mostly clinically based, and their diagnostic capability is limited. Nowadays, significant problems in cancer diagnosis are addressed using gene expression data, and researchers have introduced many approaches to diagnose cancer appropriately and effectively. This paper aims to develop a cancer data classification model using gene expression data.
Design/methodology/approach
The proposed classification model involves three main phases: (1) feature extraction, (2) optimal feature selection and (3) classification. First, five benchmark gene expression datasets are collected, and features are extracted from them. To reduce the length of the feature vectors, optimal feature selection is performed using a new meta-heuristic algorithm termed the quantum-inspired immune clone optimization algorithm (QICO). Once the relevant features are selected, classification is performed by a deep learning model, a recurrent neural network (RNN). The experimental analysis reveals that the proposed QICO-based feature selection model outperforms other heuristic-based feature selection methods and that the optimized RNN outperforms other machine learning methods.
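The select-then-classify structure of such a pipeline can be sketched as a wrapper search over binary feature masks, each scored by the accuracy of a classifier trained on the selected features. The sketch uses deliberately simple stand-ins, which are assumptions and not the paper's methods: a seeded random search replaces QICO, and a nearest-centroid classifier with leave-one-out evaluation replaces the RNN. All names and data are hypothetical.

```python
import random

def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label whose class centroid is closest to x."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loo_accuracy(X, y, mask):
    """Leave-one-out accuracy using only the features where mask is True."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0
    Xs = [[row[j] for j in cols] for row in X]
    hits = 0
    for i in range(len(Xs)):
        train_X = Xs[:i] + Xs[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += nearest_centroid_predict(train_X, train_y, Xs[i]) == y[i]
    return hits / len(Xs)

def select_features(X, y, n_trials=50, seed=0):
    """Stand-in for the optimizer: keep the best of n_trials random masks."""
    rng = random.Random(seed)
    n = len(X[0])
    best_mask = [True] * n  # baseline: use every feature
    best_score = loo_accuracy(X, y, best_mask)
    for _ in range(n_trials):
        mask = [rng.random() < 0.5 for _ in range(n)]
        score = loo_accuracy(X, y, mask)
        if score > best_score:
            best_mask, best_score = mask, score
    return best_mask, best_score

# Hypothetical data: feature 0 separates the classes, the others are noise.
X = [[1.0, 5.0, 0.3], [1.2, 0.1, 0.9], [0.9, 9.0, 0.5],
     [3.0, 4.8, 0.4], [3.1, 0.2, 0.8], [2.9, 9.1, 0.6]]
y = [0, 0, 0, 1, 1, 1]
mask, score = select_features(X, y)
print(mask, score)
```

Because the search starts from the all-features mask, the returned score is never worse than the no-selection baseline. A real optimizer would explore the mask space far more efficiently than random sampling.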
Findings
The proposed QICO-RNN achieves the best outcomes at every learning percentage. At a learning percentage of 85, the accuracy of the proposed QICO-RNN was 3.2% higher than RNN, 4.3% higher than RF, 3.8% higher than NB and 2.1% higher than KNN for Dataset 1. For Dataset 2, at a learning percentage of 35, the accuracy of the proposed QICO-RNN was 13.3% higher than RNN, 8.9% higher than RF and 14.8% higher than NB and KNN. Hence, the developed QICO algorithm performs well in accurately classifying cancer data using gene expression data.
Originality/value
This paper introduces a new optimal feature selection model using QICO, together with a QICO-based RNN, for effective classification of cancer data using gene expression data. It is the first work to use these models for this purpose.
Irina Farquhar, Michael Kane, Alan Sorkin and Kent H. Summers
Abstract
This chapter proposes an optimized innovative information technology as a means for achieving operational functionalities of real-time portable electronic health records, system interoperability, longitudinal health-risks research cohort and surveillance of adverse events infrastructure, and clinical, genome regions – disease and interventional prevention infrastructure. In application to the DoD-VA (Department of Defense and Veteran's Administration) health information systems, the proposed modernization can be carried out as an “add-on” expansion (estimated at $288 million in constant dollars) or as a “stand-alone” innovative information technology system (estimated at $489.7 million). Either solution will prototype an infrastructure for nation-wide health information systems interoperability, portable real-time electronic health records (EHRs), adverse events surveillance, and interventional prevention based on targeted single nucleotide polymorphisms (SNPs) discovery.
Qingyu Zhang and Richard S. Segall
Abstract
Purpose
The purpose of this paper is to review and compare selected software for data mining, text mining (TM), and web mining that are not available as free open‐source software.
Design/methodology/approach
Selected software packages are compared in terms of their common and unique features. The software for data mining is SAS® Enterprise Miner™, Megaputer PolyAnalyst® 5.0, NeuralWare Predict®, and BioDiscovery GeneSight®. The software for TM is CompareSuite, SAS® Text Miner, TextAnalyst, VisualText, Megaputer PolyAnalyst® 5.0, and WordStat. The software for web mining is Megaputer PolyAnalyst®, SPSS Clementine®, ClickTracks, and QL2.
Findings
This paper discusses and compares the existing features, characteristics, and algorithms of selected software for data mining, TM, and web mining, respectively. These software packages are also applied to available data sets.
Research limitations/implications
The limitations are the inclusion of selected software and datasets rather than considering the entire realm of these. This review could be used as a framework for comparing other data, text, and web mining software.
Practical implications
This paper can be helpful for an organization or individual when choosing proper software to meet their mining needs.
Originality/value
Each software package selected for this research has its own unique characteristics, properties, and algorithms. No other paper compares these selected packages both visually and descriptively for all three types of mining: data, text, and web.
Charlie Mayor and Lyn Robinson
Abstract
Purpose
The purpose of this article is to evaluate the development and use of the gene ontology (GO), a scientific vocabulary widely used in molecular biology databases, with particular reference to the relation between the theoretical basis of the GO, and the pragmatics of its application.
Design/methodology/approach
The study uses a combination of bibliometric analysis, content analysis and discourse analysis. These analyses focus on details of the ways in which the terms of the ontology are amended and deleted, and in which they are applied by users.
Findings
Although the GO is explicitly based on an objective realist epistemology, a considerable extent of subjectivity and social factors are evident in its development and use. It is concluded that bio-ontologies could beneficially be extended to be pluralist, while remaining objective, taking a view of concepts closer to that of more traditional controlled vocabularies.
Originality/value
This is one of very few studies which evaluate the development of a formal ontology in relation to its conceptual foundations, and the first to consider the GO in this way.
Tai-Wei Chiang and Ta-Cheng Chen
Abstract
Purpose
Building categorization response models from gene expression patterns has become one of the most promising uses of microarray technology. The aim of this study is to propose a grid computing-based meta-evolutionary mining approach as a categorization response model for gene selection and cancer classification.
Design/methodology/approach
The proposed approach uses grid computing infrastructure to establish the best attribute set selected from large microarray data. A novel discriminant analysis based on the vector distant of median method serves as the evaluation function of the meta-evolutionary mining approach. The approach emphasizes finding the best attribute set for constructing a categorization response model with the highest categorization accuracy.
Findings
Several benchmark cancer microarray data sets were used to evaluate the proposed approach, and the results are compared with those of other approaches in the literature. Experimental results from four benchmark problems indicate that the proposed approach works effectively and efficiently, and that its results are superior to or as good as those of other existing methods in the literature.
Originality/value
A novel discriminant analysis based on the vector distant of median method serves as the evaluation function of the meta-evolutionary mining approach, which discovers the best feature subset automatically from the microarray tumor database. The approach emphasizes finding the best attribute set for constructing a categorization response model with the highest categorization accuracy.
Bruno Feres de Souza, Carlos Soares and André C.P.L.F. de Carvalho
Abstract
Purpose
The purpose of this paper is to investigate the applicability of meta‐learning to the problem of algorithm recommendation for gene expression data classification.
Design/methodology/approach
Meta-learning was used to provide a preference order of machine learning algorithms based on their expected performance. Two approaches were considered for this purpose: k-nearest neighbors and support vector machine-based ranking methods. They were applied to a set of 49 publicly available microarray datasets. The evaluation of the methods followed standard procedures suggested in the meta-learning literature.
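A k-nearest-neighbors ranking of this general kind can be sketched as follows: describe each past dataset by meta-features, find the k past datasets closest to the new one, and order the algorithms by their average rank on those neighbors. This is a minimal sketch under stated assumptions, not the paper's procedure. The meta-features (number of samples, number of genes), the algorithm names and the stored ranks are all hypothetical, and no feature scaling is applied.

```python
def knn_rank(meta_features, rankings, query, k=3):
    """Order algorithms by their mean rank over the k nearest past datasets."""
    by_distance = sorted(
        range(len(meta_features)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(meta_features[i], query)),
    )
    neighbours = by_distance[:k]
    algorithms = rankings[0].keys()
    mean_rank = {
        name: sum(rankings[i][name] for i in neighbours) / k
        for name in algorithms
    }
    # Lower mean rank is better; ties are broken alphabetically.
    return sorted(mean_rank, key=lambda name: (mean_rank[name], name))

# Hypothetical meta-features per past dataset: (samples, genes).
meta_features = [(100, 2000), (120, 2200), (90, 1900), (30, 500)]
# Hypothetical observed ranks (1 = best) of three classifiers per dataset.
rankings = [
    {"svm": 1, "knn": 2, "tree": 3},
    {"svm": 2, "knn": 1, "tree": 3},
    {"svm": 1, "knn": 3, "tree": 2},
    {"svm": 3, "knn": 2, "tree": 1},
]

recommended = knn_rank(meta_features, rankings, query=(110, 2100), k=3)
print(recommended)  # ['svm', 'knn', 'tree']
```

In practice the meta-features would usually be standardized before computing distances, since raw scales such as gene counts would otherwise dominate.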
Findings
Empirical evidence shows that both ranking methods produce more interesting suggestions for gene expression data classification than the baseline method. Although the rankings are more accurate, no significant difference in the performance of the top classifiers was observed.
Practical implications
As the experiments conducted in this paper suggest, the use of meta‐learning approaches can provide an efficient data driven way to select algorithms for gene expression data classification.
Originality/value
This paper reports contributions to the areas of meta‐learning and gene expression data analysis. Regarding the former, it supports the claim that meta‐learning can be suitably applied to problems of a specific domain, expanding its current practice. To the latter, it introduces a cost effective approach to better deal with classification tasks.
Beatriz Pontes, Federico Divina, Raúl Giráldez and Jesús S. Aguilar‐Ruiz
Abstract
Purpose
The purpose of this paper is to present a novel control mechanism for avoiding overlapping among biclusters in expression data.
Design/methodology/approach
Biclustering is a technique used in the analysis of microarray data. One of the most popular biclustering algorithms was introduced by Cheng and Church (2000) (Ch&Ch). Although this heuristic is successful at finding interesting biclusters, it has several drawbacks. The main shortcoming is that it introduces random values into the expression matrix to control overlapping. The overlapping control method presented in this paper is based on a matrix of weights, which is used to estimate the overlap of a bicluster with those already found. In this way, the algorithm always works on real data, so the biclusters it discovers contain only original data.
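To make the two ingredients concrete, the sketch below computes the mean squared residue that Cheng and Church use to score a bicluster, and pairs it with one possible weight-matrix overlap estimate. The residue formula is the standard one. The overlap scheme (count how many earlier biclusters cover each cell, then average over the candidate) is only an assumption for illustration, since the abstract does not specify the paper's weighting. All function names are hypothetical.

```python
def mean_squared_residue(matrix, rows, cols):
    """Cheng-Church score: mean of (a_ij - rowmean_i - colmean_j + mean)^2."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(c) / n_rows for c in zip(*sub)]
    overall = sum(row_means) / n_rows
    total = sum(
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )
    return total / (n_rows * n_cols)

def record_bicluster(weights, rows, cols):
    """Assumed scheme: add 1 to every cell covered by a found bicluster."""
    for i in rows:
        for j in cols:
            weights[i][j] += 1

def overlap_estimate(weights, rows, cols):
    """Assumed scheme: mean weight over the cells of a candidate bicluster."""
    return sum(weights[i][j] for i in rows for j in cols) / (len(rows) * len(cols))

# A perfectly additive matrix scores (numerically) zero residue.
additive = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]]
print(mean_squared_residue(additive, [0, 1, 2], [0, 1, 2]))

# The expression matrix itself is never modified; only the weights change.
weights = [[0] * 3 for _ in range(3)]
record_bicluster(weights, [0, 1], [0, 1])
print(overlap_estimate(weights, [1, 2], [1, 2]))  # 0.25
```

An overlap estimate of this sort could be used to penalize candidates during the search, which keeps the data free of the random substitutions made by the original algorithm.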
Findings
The paper shows that the original algorithm wrongly estimates the quality of the biclusters after some iterations, owing to the random values that it introduces. The empirical results show that the proposed approach is effective in improving the heuristic. It is also important to highlight that many interesting biclusters found using the proposed approach would not have been obtained using the original algorithm.
Originality/value
The original algorithm proposed by Ch&Ch is one of the most successful algorithms for discovering biclusters in microarray data. However, it presents some limitations, the most relevant being the substitution phase adopted in order to avoid overlapping among biclusters. The modified version of the algorithm proposed in this paper improves the original one, as proven in the experimentation.