The aim of this study was to identify a transcriptomic signature that distinguishes subjects with autism spectrum disorder (ASD) from controls on the basis of blood gene expression profiles. Such profiles could ultimately be used as diagnostic biomarkers for ASD.
We used published microarray data (GSE26415) from the Gene Expression Omnibus database, which included 21 young adults with ASD and 21 age- and sex-matched unaffected controls. Nineteen differentially expressed (DE) probes were identified from a training dataset (n=26; 13 ASD cases and 13 controls) using the limma package in R (adjusted
Hierarchical cluster analysis showed that subjects with ASD were relatively well discriminated from controls. With the support vector machine and K-nearest neighbors classifiers, validation of the 19 DE probes on a test dataset (n=16) yielded an overall class prediction accuracy of 93.8%, with a sensitivity of 100% and a specificity of 87.5%.
The results of our exploratory study suggest that the gene expression profiles identified from the peripheral blood samples of young adults with ASD can be used to identify a biological signature for ASD. Further study using a larger cohort and more homogeneous datasets is required to improve the diagnostic accuracy.
Autism spectrum disorders (ASDs) are devastating neurodevelopmental disorders characterized by deficits in social communication and interaction across multiple contexts as well as restricted, repetitive patterns of interests and behavior. The Centers for Disease Control recently reported that the prevalence of ASD has risen to approximately 1 in 68 and that, in the United States, most children are not diagnosed with ASD until after 4 years of age.1) Because early intensive behavioral and developmental interventions for toddlers and children with autism could improve outcomes,2) there is a scientific need for reliable diagnostic ASD biomarkers that are expressed early in life. Such markers could have a significant impact on diagnosis and treatment.
Although the complex etiologies of ASD are poorly understood, the high heritability of ASD is supported by high concordance rates (from 36% to 95%) in monozygotic twins and higher recurrence risks of 11% and 19% with single-sibling involvement.3–5) Rapid advances in clinical genetic testing technology have increased the diagnostic yield from about 10% a few years ago to about 30%.6) However, because many of these genetic variants show incomplete penetrance and variable phenotypic expression,7) the use of gene expression signature biomarkers may be informative and provide the best model for identifying ASD cases.
In particular, four studies have investigated blood-derived gene expression signatures to differentiate between individuals with ASD (toddlers and children) and unaffected controls.8–11) These studies focused on individuals with a mean age of 2.2 to 9.6 years who were at risk for ASD, and they reported relatively high predictive accuracies (between 68% and 91%). To date, no study has demonstrated diagnostic prediction using blood-derived gene expression signatures in adult subjects with ASD. Accordingly, whether the gene expression profiles of adult individuals offer information about the ASD risk remains a critical question.
The aim of this study is to apply a transcriptomic approach to identify a gene expression signature with promising performance in the diagnostic prediction of young adults with ASD. Here, we used a published ASD microarray dataset to test whether such a signature can be identified. Reusing public data in this way gives researchers the opportunity to test hypotheses without performing time-consuming, labor-intensive bench work.
A publicly available microarray dataset (GSE26415) was downloaded from the Gene Expression Omnibus (
The raw data in .CEL format were primarily processed using R language (
We applied machine learning to develop a prediction model that uses the DE probes extracted from the training set to differentiate between individuals with ASD and controls in the test set. Our strategy included two main types of machine learning: unsupervised and supervised learning.
For unsupervised learning, we adopted hierarchical cluster analysis using complete linkage and the Euclidean distance. Cluster analysis and visualization were performed using the “heatmap.2” function in the “gplots” package16) in R.
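The clustering rule described above (Euclidean distance between samples, complete linkage between clusters) can be illustrated with a minimal sketch. It is written in plain Python rather than with the “heatmap.2” function used in the study, the function names are hypothetical, and the toy expression values are invented for illustration; they are not taken from GSE26415.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_linkage(cluster_a, cluster_b, data):
    """Complete linkage: distance between the two farthest members."""
    return max(euclidean(data[i], data[j]) for i in cluster_a for j in cluster_b)

def hierarchical_clusters(data, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(data))]  # start with one cluster per sample
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = complete_linkage(clusters[a], clusters[b], data)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical toy data: 6 samples x 3 probes (not values from GSE26415).
toy_expression = [
    [0.9, 1.1, -0.8],
    [1.0, 0.9, -1.0],
    [1.2, 1.0, -0.9],
    [-1.0, -0.9, 1.1],
    [-0.8, -1.2, 0.9],
    [-1.1, -1.0, 1.0],
]
clusters = hierarchical_clusters(toy_expression, n_clusters=2)
print(clusters)
```

Cutting the tree at two clusters mirrors the ASD versus control comparison; a heatmap function additionally draws the full dendrogram instead of stopping at a fixed number of clusters.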
For supervised learning, we used three machine learning algorithms: the support vector machine (SVM),17) K-nearest neighbors (KNN),18) and linear discriminant analysis (LDA).19) We performed the prediction analysis in the following steps. After fixing the random seed with the “set.seed” function in R, we randomly divided our data (n=42) into a training dataset (13 ASD and 13 control subjects) and a test dataset (8 ASD and 8 control subjects). Each algorithm was trained on the 26 labeled samples of the training dataset, using the expression values of the DE probes as features. The trained classifiers were then validated on the 8 ASD and 8 control subjects in the test dataset. All supervised machine learning analyses were performed using the “MLInterfaces” package20) in R.
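The split-then-classify procedure can be sketched as follows. This is an illustrative Python stand-in, not the R code used in the study: `stratified_split` and `knn_predict` are hypothetical helpers, the seed value and `k=3` are arbitrary assumptions, and the two-feature toy data are invented so that the classes separate cleanly.

```python
import math
import random

def stratified_split(labels, n_train_per_class, seed):
    """Randomly assign n_train_per_class samples of each class to training."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        train_idx.extend(members[:n_train_per_class])
        test_idx.extend(members[n_train_per_class:])
    return sorted(train_idx), sorted(test_idx)

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training samples closest to the query."""
    order = sorted(range(len(train_x)), key=lambda i: euclidean(train_x[i], query))
    votes = [train_y[i] for i in order[:k]]
    return max(sorted(set(votes)), key=votes.count)

# Hypothetical toy data: 21 "ASD" and 21 "control" samples, 2 features each.
labels = ["ASD"] * 21 + ["control"] * 21
features = ([[1.0 + 0.05 * i, 0.5 + 0.02 * i] for i in range(21)]
            + [[-1.0 - 0.05 * i, -0.5 - 0.02 * i] for i in range(21)])

# 13 per class for training leaves 8 per class for testing, as in the text.
train_idx, test_idx = stratified_split(labels, n_train_per_class=13, seed=1)
train_x = [features[i] for i in train_idx]
train_y = [labels[i] for i in train_idx]

predictions = [knn_predict(train_x, train_y, features[i]) for i in test_idx]
n_correct = sum(p == labels[i] for p, i in zip(predictions, test_idx))
print(len(train_idx), len(test_idx), n_correct / len(test_idx))
```

The key design point is that the test samples play no part in training: they are only scored after the classifier has been fitted on the 26 training samples.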
In comparing microarray data for the subjects with ASD (n=13) with those of unaffected controls (n=13) in the training dataset, a total of 19 DE probes were identified (adjusted p value <0.05), including 6 up-regulated probes and 13 down-regulated probes (
Using the 19-probe expression signature, a hierarchical cluster analysis of all samples (n=42) showed that subjects with ASD were relatively well discriminated from controls, although three ASD cases were sorted into the control group. This suggests that these probes could be helpful for differentiating between subjects with ASD and controls. Detailed results from the hierarchical cluster analysis are presented in Figure 1.
For the supervised machine learning algorithms, we built a classifier using the 19-probe expression signature and assessed its predictive performance on the test dataset. The validation showed that the prediction model distinguished individuals with ASD from controls. Both SVM and KNN correctly identified all 8 individuals with ASD and 7 of the 8 controls; one control was misclassified as ASD, giving a predictive accuracy of 93.8% (sensitivity of 100% and specificity of 87.5%). In the LDA analysis, however, the prediction of ASD vs. control samples was only 68.8% accurate (Table 3).
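The reported SVM and KNN figures follow directly from the confusion-matrix counts given in the text (8 of 8 ASD subjects and 7 of 8 controls classified correctly). The short sketch below recomputes them; the function name is hypothetical, and the positive and negative predictive values it prints are derived arithmetically from those counts rather than quoted from Table 3.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return standard performance measures (in %) from confusion-matrix counts."""
    return {
        "accuracy": 100.0 * (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": 100.0 * tp / (tp + fn),  # true positive rate
        "specificity": 100.0 * tn / (tn + fp),  # true negative rate
        "ppv": 100.0 * tp / (tp + fp),          # positive predictive value
        "npv": 100.0 * tn / (tn + fn),          # negative predictive value
    }

# Counts implied by the text for SVM and KNN on the 16-sample test set:
# 8 ASD detected (tp), 0 ASD missed (fn), 7 controls correct (tn),
# 1 control called ASD (fp).
svm_knn = diagnostic_metrics(tp=8, fp=1, tn=7, fn=0)
for name, value in svm_knn.items():
    print(f"{name}: {value:.1f}%")
```

By the same arithmetic, an accuracy of 68.8% on 16 test samples corresponds to 11 correct predictions (11/16 = 68.75%); the text does not state how the LDA errors were distributed between the two groups.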
Our analyses were designed to validate a potential biological signature using peripheral blood microarray data obtained from young Asian adults with ASD in combination with machine learning algorithms. In this exploratory study using previously published microarray data,13) we identified a blood-based gene expression signature that reliably identified young adults with ASD.
These results are consistent with the findings of four previous studies that reported gene expression signatures with high diagnostic accuracy for toddlers and children with ASD.8–11) Taken together, the results of this study and the four previous studies suggest that gene expression profiles from peripheral blood samples contain a biological signature that could be used to predict the ASD risk in both children and young adults. According to several studies of healthy adults, the expression of most genes within individuals remains temporally stable, and only 1% to 2% of genes display significant changes over time periods of at least one month.22,23) In addition, previous studies observed that the cognitive, behavioral, and emotional symptoms of individuals with ASD generally persist over time.24,25) Therefore, the gene expression patterns underlying these long-standing phenotypes may remain constant during the transition from childhood to young adulthood.
Gene expression microarrays primarily measure messenger RNA for thousands of identified genes.26) In doing so, they capture which DNA sequences in the genome are being transcribed into RNA at a given time. Prediction models using multivariate gene expression have been widely adopted for screening, diagnosis, and prognosis.27,28) Several previous transcriptome-wide studies of gene expression in ASD subjects have used post mortem brain tissue29–31) or peripheral blood samples.8–11,13) Among them, the studies using peripheral blood have shown that RNA expression is disrupted across hundreds of genes in individuals with ASD. Blood-based analyses of gene expression profiles are encouraging because blood samples are easily obtainable from living individuals and are likely to contain ASD-relevant signatures.
Although the connection between blood and brain transcriptomic profiles is not well known, growing evidence suggests that measurements performed in tissues that are not primarily involved in the disease process may uncover disease signatures.10) Sullivan et al.32) reported a shared gene expression profile between whole blood and brain tissues, suggesting that the cautious and thoughtful use of peripheral gene expression may be a useful surrogate for gene expression in the brain. Further research will be required to determine whether the dysregulated signatures in peripheral blood are actual indicators of the brain pathophysiology in ASD. Our results could also provide further evidence for the emerging consensus that peripheral blood is a potential source of biological signatures that reflect dysregulation in the brain and other inaccessible tissues.33)
The gene list in our study partially overlaps with previously reported candidate genes and loci associated with ASD (Table 2). These transcriptomic changes may reflect genomic alterations in ASD. Blood-derived gene expression studies of subjects with ASD repeatedly demonstrate dysregulation of immune/inflammation genes.34) Expression of regulatory factor X1 (RFX1; a transcription factor regulating a wide variety of genes involved in immunity)35) was significantly increased in the ASD group in our study. Substantial percentages of patients with ASD show peripheral markers of mitochondrial energy metabolism dysfunction.36) We found that expression of mitochondrial ribosomal protein S31 (MRPS31) was significantly reduced in the ASD group. In particular, we identified a probe (A_23_P399501, pyruvate kinase muscle isozyme [PKM]) that had the best ability to detect whether a sample was collected from a patient with ASD (
Our study has several limitations, mostly stemming from the small sample size and the lack of phenotypic information in the original data. In particular, most ASD subjects in this study had normal intelligence quotients (IQ; mean full scale IQ, 91.9), which probably does not represent the broader ASD population. In addition, the connection between peripheral blood and brain transcriptomic profiles, and the influence of age on gene expression in subjects with ASD, are not well understood. The results of our study should therefore be interpreted cautiously. Further analysis on a more homogeneous dataset, validated in an independent, large cohort of cases and controls, is expected to yield higher accuracy. These strategies for class prediction analyses will help identify robust biomarkers for both the diagnosis of ASD and individualized treatment options for patients and their families.39)
In conclusion, this study reveals a blood-based gene expression signature that has promising accuracy in distinguishing between young adults with ASD and age- and sex-matched unaffected controls. The ability of the 19 DE probes to correctly predict ASD samples compares favorably with the results of four previous studies on ASD diagnosis in toddlers and children. This transcriptomics approach may shed light on an important aspect of clinical biomarker discovery, offering high predictive accuracy for detecting ASD.
We thank Dr. Yuki Kuwano for generously depositing the microarray data series (GSE26415) online.
Table 1. Demographic and clinical characteristics of study subjects
| Characteristic | ASD (n=21) | Control (n=21) |
|---|---|---|
| Age (yr) | 26.7 (5.5) | 27.0 (5.5) |
Values are presented as number only or mean (standard deviation).
ASD, autism spectrum disorder; AQ, autism spectrum quotient; WAIS, Wechsler Adult Intelligence Scale; IQ, intelligence quotients; VIQ, verbal IQ; PIQ, performance IQ; FIQ, full IQ; NA, not applied.
Table 2. Nineteen probes significantly dysregulated
| Probe ID | Gene symbol | Gene name | Location | logFC | p value | Adjusted p value | Evidence for association with ASD: type | Evidence for association with ASD: number of reports |
|---|---|---|---|---|---|---|---|---|
| A_32_P9963 | HSF2 | Heat shock transcription factor 2 | 6q22.31 | −0.5288 | 3.99E-08 | 0.0007 | Deletion-duplication of 6q22.31 | 20 |
| A_24_P391104 | RFX1 | Regulatory factor X, 1 (influences HLA class II expression) | 19p13.1 | 0.5379 | 1.16E-07 | 0.0021 | Deletion of 19p13.13-p13.11 | 2 |
| A_23_P214037 | NPM1 | Nucleophosmin (nucleolar phosphoprotein B23, numatrin) | 5q35.1 | −0.5139 | 4.34E-07 | 0.0079 | | |
| A_24_P832113 | NPM1 | Nucleophosmin (nucleolar phosphoprotein B23, numatrin) | 5q35.1 | −0.4808 | 4.79E-07 | 0.0088 | | |
| A_23_P119683 | MIER2 | Mesoderm induction early response 1, family member 2 | 19p13.3 | 0.3391 | 5.41E-07 | 0.0099 | | |
| A_23_P162807 | MRPS31 | Mitochondrial ribosomal protein S31 | 13q14.11 | −0.4422 | 1.13E-06 | 0.0206 | Deletion-duplication of 13q14.11 | 15 |
| A_23_P88439 | TC2N | Tandem C2 domains, nuclear | 14q32.12 | −0.4056 | 1.25E-06 | 0.0228 | Deletion of 14q32.11-q32.13 | 1 |
| A_32_P188674 | NPM1 | Nucleophosmin (nucleolar phosphoprotein B23, numatrin) | 5q35.1 | −0.4742 | 1.30E-06 | 0.0237 | | |
| A_23_P399501 | PKM | Pyruvate kinase, muscle | 15q22 | 0.4023 | 1.32E-06 | 0.0242 | | |
| A_32_P46765 | C12orf29 | Chromosome 12 open reading frame 29 | 12q21.32 | −0.4320 | 1.38E-06 | 0.0252 | Deletion of 12q21.31-q21.33 | 1 |
| A_24_P927883 | JADE2 | Jade family PHD finger 2 | 5q31.1 | −0.3695 | 1.52E-06 | 0.0278 | | |
| A_23_P131676 | ACKR3 | Atypical chemokine receptor 3 | 2q37.3 | −0.4413 | 1.66E-06 | 0.0303 | Deletion of 2q37.1-q37.3 | 3 |
| A_23_P84154 | ARHGAP15 | Rho GTPase activating protein 15 | 2q22.2-q22.3 | −0.3958 | 1.69E-06 | 0.0309 | Rare single gene variant | 6 |
| A_23_P250462 | ATP6AP1 | ATPase, H+ transporting, lysosomal accessory protein 1 | Xq28 | 0.3399 | 2.25E-06 | 0.0412 | Deletion-duplication of Xq27.1-q28 | 1 |
| A_23_P322593 | TAPT1-AS1 | TAPT1 antisense RNA 1 | 4p15.32 | −0.3650 | 2.28E-06 | 0.0417 | Deletion of 4p16.3-p15.32 | 2 |
| A_32_P173058 | TMEM41B | Transmembrane protein 41B | 11p15.4 | −0.3388 | 2.40E-06 | 0.0438 | Deletion-duplication of 11p15.4 | 28 |
| A_23_P117424 | DCAF11 | DDB1 and CUL4 associated factor 11 | 14q11.2 | 0.3174 | 2.54E-06 | 0.0464 | Duplication of 14q11.2-q21.1 | 1 |
ASD, autism spectrum disorder; ID, intellectual disability; NA, not applied; logFC, log2 of fold change.
Adjusted p values <0.05;
The Simons Foundation Autism Research Initiative (SFARI) Gene 2.0 database (available at
Table 3. Prediction performances of the 19-probe set on the test (validation) set, according to machine learning algorithms
| Accuracy (%) | Sensitivity (%) | Specificity (%) | Positive predictive value (%) | Negative predictive value (%) |
|---|---|---|---|---|
SVM, support vector machine; KNN, K-nearest neighbor; LDA, linear discriminant analysis.