Bioinformatics: Principal Component Analysis (PCA) (original post)

Posted 2015-11-04 · 43,000 views · IP: Guangdong
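Principal component analysis (PCA) is a statistical method for capturing the dominant structure in data. It extracts the main sources of variation from multivariate data and so simplifies a complex problem.
PCA tries to replace a large number of original variables with a smaller number of composite variables while losing as little information as possible, and the composite variables are mutually uncorrelated. The goal is to find r (r < n) new variables that capture the main features of the data and shrink the original data matrix. Each new variable is a linear combination of the original variables, so it reflects their combined effect and often has a practical interpretation. These r new variables are called "principal components". They account for much of the influence of the original n variables, they are uncorrelated with one another, and their loading vectors are orthogonal. With such components, a sample can be analysed with a handful of variables instead of thousands. The samples can then be plotted so that their similarities and differences are obvious at a glance, and it becomes clear whether different samples belong to the same group.
PCA is used to analyse complex, high-dimensional data sets, for example transcriptome sequencing data from different experimental conditions, expression microarray data, and proteomics and metabolomics data. When there are more variables than samples, PCA can reduce the dimensionality to at most the number of samples without losing information (strictly, to at most one fewer than the number of samples once the data are centered). It can be regarded as a preprocessing step for complex experimental data.
Below we walk through our analysis. In bioinformatics we usually use the free R software for this. If you would like to know more, you can contact us (see our DXY user name) or search Taobao for “生物信息学视频” (bioinformatics videos).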
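As a small illustration of these properties, the following sketch runs prcomp on simulated data and checks three things: the scores are linear combinations of the scaled variables, the scores are uncorrelated, and the loading vectors are orthonormal. The variables v1 to v5 and all values are invented for the example and are not taken from this post.

```r
# Simulated data: 20 samples x 5 partly correlated variables (illustrative only)
set.seed(1)
base <- rnorm(20)
x <- cbind(v1 = base + rnorm(20, sd = 0.3),
           v2 = base + rnorm(20, sd = 0.3),
           v3 = rnorm(20),
           v4 = -base + rnorm(20, sd = 0.5),
           v5 = rnorm(20))

pca <- prcomp(x, scale. = TRUE)

# Each score column is a linear combination of the centered, scaled variables
manual_scores <- scale(x) %*% pca$rotation
max_diff <- max(abs(manual_scores - pca$x))

# Scores are mutually uncorrelated: the correlation matrix is the identity
score_cor_dev <- max(abs(cor(pca$x) - diag(5)))

# Loading vectors are orthonormal: t(rotation) %*% rotation is the identity
loading_dev <- max(abs(crossprod(pca$rotation) - diag(5)))
```

All three check values should be zero up to floating-point error.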
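To make the statement about sample count concrete, here is a sketch on random data with 13 samples and 1000 variables (both sizes and all names are assumptions for the example). prcomp returns 13 components, the last one has essentially zero variance because centering removes one degree of freedom, and the distances between samples computed from the first 12 score columns match those computed from all 1000 scaled variables.

```r
set.seed(42)
n_samples <- 13
n_genes <- 1000
expr <- matrix(rnorm(n_samples * n_genes), nrow = n_samples,
               dimnames = list(paste0("s", 1:n_samples), paste0("g", 1:n_genes)))

pca <- prcomp(expr, scale. = TRUE)

n_pc <- ncol(pca$x)        # number of components returned: min(n_samples, n_genes)
last_sd <- pca$sdev[n_pc]  # practically zero after centering

# Sample-to-sample distances are the same in the full space and in 12 PCs
d_full <- dist(scale(expr))
d_pca <- dist(pca$x[, 1:(n_samples - 1)])
max_dist_diff <- max(abs(d_full - d_pca))
```

In this sense no information about the relationships between samples is lost, even though 1000 columns have been replaced by 12.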


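First, we run PCA on the input data with R's built-in function prcomp. The commands are:
#pca analysis
setwd("D:/workdir/pca") #set the working directory
data=read.table("input.txt",header=TRUE,sep="\t",row.names=1) #read the table (genes in rows, samples in columns)
data=t(as.matrix(data)) #transpose the matrix so that samples are rows and genes are columns
data.class <- rownames(data) #sample names
data.pca <- prcomp(data, scale. = TRUE) #PCA on centered, unit-variance genes; stops with an error if a gene is constant across all samples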
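One practical point with scale. = TRUE: prcomp stops with an error when a column has zero variance, which can happen with expression data when a gene has the same value in every sample. The sketch below shows one possible way to drop such genes first. The toy table, the gene names and the variable names are invented for the example, and this filtering step is not part of the original commands.

```r
# Toy expression table: genes in rows, samples in columns (values invented)
expr <- data.frame(
  con1 = c(8.6, 3.0, 5.0), con2 = c(8.5, 3.1, 5.0),
  A1   = c(8.4, 3.2, 5.0), A2   = c(8.7, 3.3, 5.0),
  row.names = c("geneA", "geneB", "geneConst"))

mat <- t(as.matrix(expr))             # samples in rows, genes in columns

# Without filtering, the constant gene makes prcomp fail
failed <- inherits(try(prcomp(mat, scale. = TRUE), silent = TRUE), "try-error")

# Keep only genes whose standard deviation across samples is non-zero
gene_sd <- apply(mat, 2, sd)
mat_filtered <- mat[, gene_sd > 0, drop = FALSE]

toy.pca <- prcomp(mat_filtered, scale. = TRUE)
```

Whether near-constant genes should also be removed (for example with a small positive cutoff instead of 0) depends on the data set.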
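Next, we write out the main result tables and draw the bar plot and the scree plot:
write.table(data.pca$rotation,file="PC.xls",quote=FALSE,sep="\t") #write the eigenvectors (loadings); the file is tab-delimited text despite the .xls extension
write.table(predict(data.pca),file="newTab.xls",quote=FALSE,sep="\t") #write the new table (sample scores, the same values as data.pca$x)
pca.sum=summary(data.pca)
write.table(pca.sum$importance,file="importance.xls",quote=FALSE,sep="\t") #write the proportion of variance of each PC
pdf(file="pcaBarplot.pdf",width=15) #bar plot
barplot(pca.sum$importance[2,1:10]*100,xlab="PC",ylab="percent",col="skyblue") #first 10 PCs; assumes at least 10 PCs exist
dev.off()
pdf(file="pcaPlot.pdf",width=15) #scree plot
plot(pca.sum$importance[2,1:10]*100,type="o",col="red",xlab="PC",ylab="percent")
dev.off()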
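The percentages drawn in the bar plot and the scree plot come straight from the component standard deviations: each proportion is a squared standard deviation divided by the sum of all of them. The sketch below recomputes them on random data and picks the number of PCs needed to reach a cumulative share of 80%. The data, the 80% cutoff and the variable names are assumptions for the example.

```r
set.seed(7)
x <- matrix(rnorm(13 * 200), nrow = 13)  # 13 samples x 200 variables, random
pca <- prcomp(x, scale. = TRUE)

# Proportion of variance per PC, computed by hand
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_explained <- cumsum(var_explained)

# The same numbers as reported by summary() (rounded to 5 decimals there)
imp <- summary(pca)$importance
max_dev <- max(abs(var_explained - imp[2, ]))

# Smallest number of PCs whose cumulative proportion reaches 80%
n_keep <- which(cum_explained >= 0.80)[1]

# Guard for plotting: never ask for more PCs than exist
n_show <- min(10, length(var_explained))
```

Using n_show in place of a fixed 1:10 avoids a subscript error on data sets with fewer than 10 components.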
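Finally, we draw the 2D and 3D PCA plots. The commands are:
library(ggplot2)
group <- c(rep("con",5),rep("A",5),rep("B",3)) #sample groups, in the column order of input.txt
PCA <- data.frame(PCA1=data.pca$x[,1], PCA2=data.pca$x[,2], group=group) #scores on the first two PCs
PCA.mean <- aggregate(cbind(PCA1,PCA2) ~ group, data=PCA, FUN=mean) #group centroids, used for the labels
#the post does not show how df_ell (the ellipse outline of each group) was built, so stat_ellipse() draws the ellipses here instead
p <- ggplot(data = PCA, aes(PCA1, PCA2)) + geom_point(aes(color = group)) +
stat_ellipse(aes(color = group), type = "norm", linewidth = 1, linetype = 2)+
annotate("text",x=PCA.mean$PCA1,y=PCA.mean$PCA2,label=PCA.mean$group)+
theme_bw()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
pdf(file="PCA2d.pdf")
print(p) #explicit print so the plot is also written when the script is run with source()
dev.off()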
#pca 3d plot
library(pca3d)
library(rgl)
pca3d(data.pca, components = 1:3, group = c(rep("con",5),rep("A",5),rep("B",3)),show.centroids=TRUE,new=TRUE) #draw the 3D plot
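A 2D plot with one dashed outline per group needs three data frames: the scores, the group centroids and the outline coordinates (called PCA, PCA.mean and df_ell here). The following is a sketch of how they could be computed in base R. It is one plausible construction, not necessarily the one used for the figure in this post: the scores are random stand-ins for data.pca$x, and the 95% normal-theory ellipse and the helper ellipse_points are assumptions made for the example.

```r
# Toy scores standing in for data.pca$x[, 1:2] (invented values)
set.seed(3)
group <- c(rep("con", 5), rep("A", 5), rep("B", 3))
scores <- cbind(PC1 = rnorm(13), PC2 = rnorm(13))

PCA <- data.frame(PCA1 = scores[, 1], PCA2 = scores[, 2], group = group)

# Group centroids, e.g. for text labels
PCA.mean <- aggregate(cbind(PCA1, PCA2) ~ group, data = PCA, FUN = mean)

# Hypothetical helper: points on a normal-theory ellipse for one group
ellipse_points <- function(d, level = 0.95, n = 100) {
  center <- colMeans(d[, c("PCA1", "PCA2")])
  sigma <- cov(d[, c("PCA1", "PCA2")])
  radius <- sqrt(qchisq(level, df = 2))
  theta <- seq(0, 2 * pi, length.out = n)
  circle <- cbind(cos(theta), sin(theta))
  # Map the unit circle through the Cholesky factor of the covariance
  pts <- sweep(radius * circle %*% chol(sigma), 2, center, "+")
  data.frame(PCA1 = pts[, 1], PCA2 = pts[, 2], group = d$group[1])
}

# One outline per group, stacked into a single data frame
df_ell <- do.call(rbind, lapply(split(PCA, PCA$group), ellipse_points))
```

With only three samples in group B the covariance estimate is very rough, so that group's ellipse should be read as a visual aid rather than a confidence region.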
PCA analysis steps:
(image not preserved)

Input data (input.txt):
names	con1	con2	con3	con4	con5	A1	A2	A3	A4	A5	B1	B2	B3
A1BG	8.66351008	8.534745729	8.385392786	8.545775266	8.578751729	8.463764531	8.645129816	8.534745729	8.51477788	8.569630626	8.35118718	8.237418827	8.461811104
A1BG-AS1	3.02176648	3.093129396	3.214278018	3.254257669	3.279017705	3.196073018	3.121306082	3.241784603	3.396623302	3.342003987	3.204601035	3.052147452	3.322340138
A1CF	3.285530705	3.327131347	3.225080098	3.398526451	3.247738699	3.3714748	3.406337577	3.200723094	3.270985203	3.492921485	3.229735974	3.397650849	3.285530705
A2M	4.069493092	4.186180579	4.198132991	4.198132991	4.312817669	4.253147349	4.198132991	4.39159079	4.125000783	4.013982191	4.414675166

Eigenvectors (PC.xls):
Gene	PC1	PC2	PC3	PC4	PC5	PC6	PC7	PC8	PC9	PC10	PC11	PC12
A1BG	0.005790589	0.001929067	-0.014203086	0.002922652	0.007140977	0.005159608	0.00493789	-0.000196172	0.001379072	0.002963374	-0.004287107	0.004048848
A1BG-AS1	0.005327245	-0.002268439	0.00300568	-0.000665231	-0.006191153	0.008058035	0.0024619	-0.012754526	0.009290563	0.000651841	-0.009359061	-0.010451533
A1CF	0.002194472	-0.004031856	-0.001876728	0.001124061	-0.00240031	0.009944522	-0.012907017	0.004923729	-0.007133585	-0.00824172	-0.006675964	0.012957584
A2M	-0.00095294	0.000179863	-0.001843934	0.009327208	0.0011801	-0.010029663	0.00143721	-0.014475623	0.006889211	0.008273488	0.004086961	-0.010433829
A2M-AS1	-0.002843101	-0.001382516	-0.013245983	-0.003007294	0.011691307	-0.002380244	-0.002668031	-0.007174397	0.003455748	0.006251865	0.004605682	-0.003355077
A2ML1	-0.003787138	-0.005496056	0.008581222	-0.004659077	0.00813128	0.005185915	0.003508166	-0.012114998	0.003060394	-0.003639957	-0.005443627	0.008152114
A2MP1	-0.010805096	-0.00102329	0.001686793	0.004286958	-0.005537513	0.005290747	0.006881987	0.003929337	-0.00058904	-0.007367809	0.001756044	-0.006071528
A3GALT2	-0.011361021	-0.003059231	-0.002510242	-0.001266238	0.001029822	0.003190896	-0.010756433	-0.001344292	0.000879984	0.001644095	-0.005150191	-0.000833097
A4GALT	0.010912023	-0.007615787	0.002138992	0.004194073	0.000543388	-0.000312153	0.00079834	0.004199552	-0.0026566	0.002034635	0.001100124	-0.004481925
A4GNT	-0.005256946	-0.000190154	-0.006494113	0.001426331	-0.005731	-0.010191831	0.011180415	0.0080
Proportion of variance per PC (importance.xls):
	PC1	PC2	PC3	PC4	PC5	PC6	PC7	PC8	PC9	PC10	PC11	PC12
Standard deviation	73.48989921	64.00076555	52.62341787	45.74916341	41.20746723	40.40400168	39.61488784	39.05777323	37.90353234	37.12893949	36.50771526	34.445461
Proportion of Variance	0.20678	0.1568	0.1060	0.0801	0.0650	0.0625	0.0600	0.0584	0.0550	0.0527	0.0510	0.
Cumulative Proportion	0.20678	0.3636	0.4696	0.5497	0.6147	0.6772	0.7373	0.7957	0.8507	0.9035
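The Proportion of Variance row can be recomputed from the Standard deviation row, since each proportion is a squared standard deviation divided by the sum of all squared standard deviations. The sketch below does this for the twelve standard deviations listed above. It assumes that these twelve components carry essentially all of the variance (with 13 samples, centering leaves at most 12 non-trivial components) and that the last value, which may be cut off in the post, is close enough to the true one.

```r
# Standard deviations of PC1..PC12 as listed in importance.xls above
sdev <- c(73.48989921, 64.00076555, 52.62341787, 45.74916341, 41.20746723,
          40.40400168, 39.61488784, 39.05777323, 37.90353234, 37.12893949,
          36.50771526, 34.445461)

# Proportion of variance and cumulative proportion, computed by hand
prop <- sdev^2 / sum(sdev^2)
cum <- cumsum(prop)

round(prop[1:3], 4)
round(cum[10], 4)
```

Under these assumptions prop[1] comes out at about 0.2068 and cum[10] at about 0.9035, in line with the table.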
Bar plot:
(image not preserved)

Scree plot:
(image not preserved)

New table (newTab.xls):
sample	PC1	PC2	PC3	PC4	PC5	PC6	PC7	PC8	PC9	PC10	PC11	PC12
con1	-43.11713568	21.8495655	-81.93759338	-58.21844577	64.77318904	-26.32429334	45.20987812	40.85062399	-34.1874754	-7.060072463	-23.39934277	4.895849213
con2	-14.45432845	75.73996466	-14.78203381	10.2243167	-24.38001419	29.7638999	-26.37283678	19.11590358	-41.18232654	88.76728355	30.94470277	19.44439628
con3	19.82766522	140.0952406	113.6871283	3.760503389	34.77706609	-13.28234716	20.67125684	-7.513223512	8.15760442	-21.45316491	-5.303057212	3.013275942
con4	-51.57692945	57.62138492	-56.05208882	59.40315525	-54.69710498	31.81892797	-2.803617473	-20.25433079	-32.8273714	-58.47135395	-32.40090472	-18.20853311
con5	8.746899688	59.53180268	-69.32816365	-46.33245491	-16.98742798	-11.27889777	-29.26179173	-21.33521484	91.05594049	-4.980561283	26.54200516	-5.6301674
A1	96.42224421	-25.7856219	4.368836104	11.1108611	-25.75103501	-92.68449676	-54.6430203	12.5323434	-20.25381735	5.359336443	-39.3544175	0.834339574
A2	48.22451092	-45.96824832	-23.44560811	73.00919766	27.12852568	2.907499878	20.92173109	-2.558536567	17.27908731	-18.71982938	28.41335305	81.48101907
A3	40.31835413	-36.87740449	-5.964700562	64.74070384	34.07994593	-0.531160763	17.042958	29.85394287	11.87586228	10.37301674	40.30767264	-81.36567388
A4	99.48737321	-39.53860846	18.83787978	-47.7641994	-70.42956654	18.49510116	83.05653016	-5.541506703	-1.334937518	10.00733607	-4.723884125	-2.92912412
A5	88.10326892	-44.61137548	12.2563032	-42.81230616	46.54030627	79.90346307	-57.84045222	-16.71111714	-15.79562485	-15.74245923	-21.62435982	-4.642372503
B1	-83.54127057	-54.39494981	17.98792221	-22.24593317	10.95116793	-34.80526128	0.876188501	-92.78568383	-36.31244275	2.564999506	41.24368002	-7.96835478
B2	-103.9946495	-54.96222537	56.90138017	-31.3003327	-34.82022535	4.175221039	-27.14216212	72.82911836	3.440963576	-39.04989766	33.10219405	10.45249908

PCA 2D plot:
(image not preserved)

PCA 3D plot:
(image not preserved)

Last edited 2022-10-09 · 43,000 views
