Principal Component Analysis (PCA) for Bioinformatics (Original)
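Principal component analysis (PCA) replaces a large number of original variables with a smaller number of composite variables while losing as little of the information in the data as possible, and the composite variables are mutually uncorrelated. The goal of PCA is to find r (r < n) new variables that capture the main features of the data and so compress the original data matrix. Each new variable is a linear combination of the original variables, reflects their combined effect, and carries some practical meaning. These r new variables are called "principal components". They reflect most of the influence of the original n variables, they are uncorrelated with one another, and their directions are orthogonal. With principal components, a sample can be analysed using a handful of variables instead of thousands. The samples can then be plotted on these variables, which makes the similarities and differences between samples easy to see and shows whether different samples can be grouped together.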
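The properties described above can be checked on a small example. The following is a minimal sketch on made-up random data (the matrix `x` and all variable names are our own assumptions, not part of the analysis in this post). It checks that the scores are linear combinations of the original variables, that the loading vectors are orthonormal, and that the scores of different components are uncorrelated.

```r
# Toy data (assumption): 20 samples (rows) x 5 variables (columns), random values.
set.seed(1)
x <- matrix(rnorm(100), nrow = 20, ncol = 5)
x[, 2] <- x[, 1] + 0.1 * x[, 2]   # make two variables strongly correlated

pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Each principal component is a linear combination of the original variables:
# scores = (centred and scaled data) %*% loadings
scores_manual <- scale(x) %*% pca$rotation
max_diff <- max(abs(scores_manual - pca$x))

# The loading vectors are orthonormal, so t(R) %*% R is the identity matrix.
orth_err <- max(abs(crossprod(pca$rotation) - diag(5)))

# The scores of different components are uncorrelated (off-diagonal values ~ 0).
score_cor <- cor(pca$x)
offdiag_max <- max(abs(score_cor[upper.tri(score_cor)]))

# Keep only the first r = 2 components as the reduced representation.
reduced <- pca$x[, 1:2]
```

Here `reduced` plays the role of the r new variables: 2 columns instead of the original 5.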
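PCA is used when analysing complex multidimensional data sets, for example transcriptome sequencing data from different experimental conditions, expression microarray data, and proteomics and metabolomics data. When the number of variables exceeds the number of samples, PCA can reduce the dimensionality of the samples to at most the number of samples without losing information. It can be regarded as a preprocessing step for complex experimental data.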
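To illustrate the case of more variables than samples, here is a small sketch on random data (13 samples and 1000 made-up "genes"; the sizes and names are assumptions chosen for illustration). `prcomp` returns no more components than there are samples, and the scaled data can be rebuilt from those components. Because the data are centred, the last component carries essentially no variance.

```r
# Toy data (assumption): 13 samples (rows) x 1000 variables (columns).
set.seed(2)
n_samples <- 13
n_genes <- 1000
expr <- matrix(rnorm(n_samples * n_genes), nrow = n_samples)

pca <- prcomp(expr, center = TRUE, scale. = TRUE)

# Number of components returned: min(n_samples, n_genes) = 13
n_pc <- ncol(pca$x)

# After centring, the last component has (numerically) zero standard deviation.
last_sd <- pca$sdev[n_pc]

# The centred and scaled data are recovered from the scores and loadings,
# so nothing is lost by moving from 1000 columns to 13.
recon <- pca$x %*% t(pca$rotation)
recon_err <- max(abs(recon - scale(expr)))
```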
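Below we walk through our analysis workflow. In bioinformatics we usually use the free R software for this kind of analysis. If you would like to learn more, you can contact us (see our DXY (丁香园) username for contact details) or search for “生物信息学视频” on Taobao.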
First, we run PCA on the input data using R's built-in functions. The commands are as follows:
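# PCA analysis
setwd("D:/workdir/pca") # set the working directory
data=read.table("input.txt",header=TRUE,sep="\t",row.names=1) # read the table (genes in rows, samples in columns)
data=t(as.matrix(data)) # transpose so that samples are rows and genes are columns
data.class <- rownames(data) # sample names
data.pca <- prcomp(data, scale. = TRUE) # run PCA with each gene scaled to unit variance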
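One thing the commands above do not guard against: with `scale. = TRUE`, `prcomp` stops with an error if any column (gene) is constant across all samples, because a zero-variance column cannot be rescaled to unit variance. The sketch below shows one possible way to handle this on a made-up matrix; the `toy` object and the filtering step are our own assumptions and are not part of the original workflow.

```r
# Toy matrix (assumption): 6 samples x 4 genes; gene g3 is constant.
set.seed(3)
toy <- matrix(rnorm(24), nrow = 6,
              dimnames = list(paste0("s", 1:6), paste0("g", 1:4)))
toy[, "g3"] <- 5

# prcomp(scale. = TRUE) raises an error on the constant column.
res <- try(prcomp(toy, scale. = TRUE), silent = TRUE)
failed <- inherits(res, "try-error")

# Drop zero-variance columns first, then run PCA on the remaining genes.
keep <- apply(toy, 2, var) > 0
toy.pca <- prcomp(toy[, keep], scale. = TRUE)
```

Whether dropping such genes is appropriate depends on the data set. For expression data, genes that do not vary carry no information for separating samples, so removing them is usually harmless.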
Next, we write out the main result tables and draw the bar plot and the scree plot:
write.table(data.pca$rotation,file="PC.xls",quote=FALSE,sep="\t") # write the eigenvectors (loadings)
write.table(predict(data.pca),file="newTab.xls",quote=FALSE,sep="\t") # write the new table (sample scores on each PC)
pca.sum=summary(data.pca)
write.table(pca.sum$importance,file="importance.xls",quote=FALSE,sep="\t") # write the importance of each PC
pdf(file="pcaBarplot.pdf",width=15) # bar plot of the variance explained by the first 10 PCs
barplot(pca.sum$importance[2,1:10]*100,xlab="PC",ylab="percent",col="skyblue")
dev.off()
pdf(file="pcaPlot.pdf",width=15) # scree plot of the first 10 PCs
plot(pca.sum$importance[2,1:10]*100,type="o",col="red",xlab="PC",ylab="percent")
dev.off()
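The second row of `pca.sum$importance` used in the two plots is the proportion of variance explained by each PC. As a rough illustration on made-up data (the `toy` matrix and the variable names are assumptions), it can be recomputed from the standard deviations stored in the `prcomp` result. The sketch also shows a guard for data sets with fewer than 10 components, where the index `1:10` would fail.

```r
# Toy data (assumption): 8 samples x 50 variables, so only 8 PCs are returned.
set.seed(4)
toy <- matrix(rnorm(8 * 50), nrow = 8)
toy.pca <- prcomp(toy, scale. = TRUE)
imp <- summary(toy.pca)$importance

# Proportion of variance = variance of each PC / total variance
prop_manual <- toy.pca$sdev^2 / sum(toy.pca$sdev^2)
cum_manual <- cumsum(prop_manual)

# summary() rounds these rows to 5 decimal places, so compare with a tolerance.
prop_diff <- max(abs(imp["Proportion of Variance", ] - prop_manual))

# Plot at most 10 PCs, but never more than are available.
n_show <- min(10, ncol(imp))
top_percent <- imp[2, 1:n_show] * 100
```

`top_percent` can then be passed to `barplot()` or `plot()` in the same way as in the commands above.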
Finally, we draw the 2D and 3D PCA plots. The commands are as follows:
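library(ggplot2)
# The objects PCA (sample scores in columns PCA1 and PCA2 plus a group column), df_ell (ellipse
# outline points per group) and PCA.mean (group centroids) must already exist; the commands above do not create them
pdf(file="PCA2d.pdf")
# print() makes sure the plot is drawn when the script is run with source()
print(ggplot(data = PCA, aes(PCA1, PCA2)) + geom_point(aes(color = group)) +
geom_path(data=df_ell, aes(x=PCA1, y=PCA2,colour=group), size=1, linetype=2)+
annotate("text",x=PCA.mean$PCA1,y=PCA.mean$PCA2,label=PCA.mean$group)+
theme_bw()+
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()))
dev.off()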
# PCA 3D plot
library(pca3d)
library(rgl)
pca3d(data.pca, components = 1:3, group = c(rep("con",5),rep("A",5),rep("B",3)),show.centroids=TRUE,new=TRUE) # draw the 3D plot
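The 2D plotting commands above use three objects, `PCA`, `df_ell` and `PCA.mean`, whose construction is not shown. The sketch below is one possible way to build them from a `prcomp` result, using base R only. It is an assumption on our part: the original objects may have been produced differently (for example with a dedicated ellipse package), and the helper `ellipse_outline`, the 95% level and the random `toy` data are all hypothetical.

```r
# Toy data (assumption): 13 samples x 200 variables, grouped as in the 3D plot call.
set.seed(5)
group <- c(rep("con", 5), rep("A", 5), rep("B", 3))
toy <- matrix(rnorm(13 * 200), nrow = 13)
toy.pca <- prcomp(toy, scale. = TRUE)

# Scores on the first two components plus the group label of each sample
PCA <- data.frame(PCA1 = toy.pca$x[, 1], PCA2 = toy.pca$x[, 2], group = group)

# Group centroids, used to place the text labels
PCA.mean <- aggregate(cbind(PCA1, PCA2) ~ group, data = PCA, FUN = mean)

# Hypothetical helper: outline of a normal-theory ellipse for one group,
# i.e. the points at a fixed Mahalanobis distance from the group centroid.
ellipse_outline <- function(d, level = 0.95, npoints = 100) {
  theta <- seq(0, 2 * pi, length.out = npoints)
  circle <- cbind(cos(theta), sin(theta))          # unit circle
  radius <- sqrt(qchisq(level, df = 2))            # Mahalanobis radius
  xy <- d[, c("PCA1", "PCA2")]
  shape <- chol(cov(xy))                           # t(shape) %*% shape = covariance
  pts <- sweep(radius * circle %*% shape, 2, colMeans(xy), "+")
  data.frame(PCA1 = pts[, 1], PCA2 = pts[, 2], group = d$group[1])
}

# One outline per group, stacked into a single data frame for geom_path()
df_ell <- do.call(rbind, lapply(split(PCA, PCA$group), ellipse_outline))
```

With real data, `toy.pca` would be replaced by `data.pca`, and `group` must list the samples in the same order as the rows of `data.pca$x`. Note that group B has only 3 samples, so its ellipse is based on very little data and should be read with caution.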
PCA analysis steps:

Input data (input.txt):
names | con1 | con2 | con3 | con4 | con5 | A1 | A2 | A3 | A4 | A5 | B1 | B2 | B3 |
A1BG | 8.66351008 | 8.534745729 | 8.385392786 | 8.545775266 | 8.578751729 | 8.463764531 | 8.645129816 | 8.534745729 | 8.51477788 | 8.569630626 | 8.35118718 | 8.237418827 | 8.461811104 |
A1BG-AS1 | 3.02176648 | 3.093129396 | 3.214278018 | 3.254257669 | 3.279017705 | 3.196073018 | 3.121306082 | 3.241784603 | 3.396623302 | 3.342003987 | 3.204601035 | 3.052147452 | 3.322340138 |
A1CF | 3.285530705 | 3.327131347 | 3.225080098 | 3.398526451 | 3.247738699 | 3.3714748 | 3.406337577 | 3.200723094 | 3.270985203 | 3.492921485 | 3.229735974 | 3.397650849 | 3.285530705 |
A2M | 4.069493092 | 4.186180579 | 4.198132991 | 4.198132991 | 4.312817669 | 4.253147349 | 4.198132991 | 4.39159079 | 4.125000783 | 4.013982191 | 4.414675166 |
Eigenvectors (PC.xls):
Gene | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 | PC12 |
A1BG | 0.005790589 | 0.001929067 | -0.014203086 | 0.002922652 | 0.007140977 | 0.005159608 | 0.00493789 | -0.000196172 | 0.001379072 | 0.002963374 | -0.004287107 | 0.004048848 |
A1BG-AS1 | 0.005327245 | -0.002268439 | 0.00300568 | -0.000665231 | -0.006191153 | 0.008058035 | 0.0024619 | -0.012754526 | 0.009290563 | 0.000651841 | -0.009359061 | -0.010451533 |
A1CF | 0.002194472 | -0.004031856 | -0.001876728 | 0.001124061 | -0.00240031 | 0.009944522 | -0.012907017 | 0.004923729 | -0.007133585 | -0.00824172 | -0.006675964 | 0.012957584 |
A2M | -0.00095294 | 0.000179863 | -0.001843934 | 0.009327208 | 0.0011801 | -0.010029663 | 0.00143721 | -0.014475623 | 0.006889211 | 0.008273488 | 0.004086961 | -0.010433829 |
A2M-AS1 | -0.002843101 | -0.001382516 | -0.013245983 | -0.003007294 | 0.011691307 | -0.002380244 | -0.002668031 | -0.007174397 | 0.003455748 | 0.006251865 | 0.004605682 | -0.003355077 |
A2ML1 | -0.003787138 | -0.005496056 | 0.008581222 | -0.004659077 | 0.00813128 | 0.005185915 | 0.003508166 | -0.012114998 | 0.003060394 | -0.003639957 | -0.005443627 | 0.008152114 |
A2MP1 | -0.010805096 | -0.00102329 | 0.001686793 | 0.004286958 | -0.005537513 | 0.005290747 | 0.006881987 | 0.003929337 | -0.00058904 | -0.007367809 | 0.001756044 | -0.006071528 |
A3GALT2 | -0.011361021 | -0.003059231 | -0.002510242 | -0.001266238 | 0.001029822 | 0.003190896 | -0.010756433 | -0.001344292 | 0.000879984 | 0.001644095 | -0.005150191 | -0.000833097 |
A4GALT | 0.010912023 | -0.007615787 | 0.002138992 | 0.004194073 | 0.000543388 | -0.000312153 | 0.00079834 | 0.004199552 | -0.0026566 | 0.002034635 | 0.001100124 | -0.004481925 |
A4GNT | -0.005256946 | -0.000190154 | -0.006494113 | 0.001426331 | -0.005731 | -0.010191831 | 0.011180415 | 0.0080 |
Importance of each PC (importance.xls):
PC | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 | PC12 |
Standard deviation | 73.48989921 | 64.00076555 | 52.62341787 | 45.74916341 | 41.20746723 | 40.40400168 | 39.61488784 | 39.05777323 | 37.90353234 | 37.12893949 | 36.50771526 | 34.445461 |
Proportion of Variance | 0.20678 | 0.1568 | 0.1060 | 0.0801 | 0.0650 | 0.0625 | 0.0600 | 0.0584 | 0.0550 | 0.0527 | 0.0510 | 0. |
Cumulative Proportion | 0.20678 | 0.3636 | 0.4696 | 0.5497 | 0.6147 | 0.6772 | 0.7373 | 0.7957 | 0.8507 | 0.9035 |

Scree plot:

New table (newTab.xls):
sample | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 | PC12 |
con1 | -43.11713568 | 21.8495655 | -81.93759338 | -58.21844577 | 64.77318904 | -26.32429334 | 45.20987812 | 40.85062399 | -34.1874754 | -7.060072463 | -23.39934277 | 4.895849213 |
con2 | -14.45432845 | 75.73996466 | -14.78203381 | 10.2243167 | -24.38001419 | 29.7638999 | -26.37283678 | 19.11590358 | -41.18232654 | 88.76728355 | 30.94470277 | 19.44439628 |
con3 | 19.82766522 | 140.0952406 | 113.6871283 | 3.760503389 | 34.77706609 | -13.28234716 | 20.67125684 | -7.513223512 | 8.15760442 | -21.45316491 | -5.303057212 | 3.013275942 |
con4 | -51.57692945 | 57.62138492 | -56.05208882 | 59.40315525 | -54.69710498 | 31.81892797 | -2.803617473 | -20.25433079 | -32.8273714 | -58.47135395 | -32.40090472 | -18.20853311 |
con5 | 8.746899688 | 59.53180268 | -69.32816365 | -46.33245491 | -16.98742798 | -11.27889777 | -29.26179173 | -21.33521484 | 91.05594049 | -4.980561283 | 26.54200516 | -5.6301674 |
A1 | 96.42224421 | -25.7856219 | 4.368836104 | 11.1108611 | -25.75103501 | -92.68449676 | -54.6430203 | 12.5323434 | -20.25381735 | 5.359336443 | -39.3544175 | 0.834339574 |
A2 | 48.22451092 | -45.96824832 | -23.44560811 | 73.00919766 | 27.12852568 | 2.907499878 | 20.92173109 | -2.558536567 | 17.27908731 | -18.71982938 | 28.41335305 | 81.48101907 |
A3 | 40.31835413 | -36.87740449 | -5.964700562 | 64.74070384 | 34.07994593 | -0.531160763 | 17.042958 | 29.85394287 | 11.87586228 | 10.37301674 | 40.30767264 | -81.36567388 |
A4 | 99.48737321 | -39.53860846 | 18.83787978 | -47.7641994 | -70.42956654 | 18.49510116 | 83.05653016 | -5.541506703 | -1.334937518 | 10.00733607 | -4.723884125 | -2.92912412 |
A5 | 88.10326892 | -44.61137548 | 12.2563032 | -42.81230616 | 46.54030627 | 79.90346307 | -57.84045222 | -16.71111714 | -15.79562485 | -15.74245923 | -21.62435982 | -4.642372503 |
B1 | -83.54127057 | -54.39494981 | 17.98792221 | -22.24593317 | 10.95116793 | -34.80526128 | 0.876188501 | -92.78568383 | -36.31244275 | 2.564999506 | 41.24368002 | -7.96835478 |
B2 | -103.9946495 | -54.96222537 | 56.90138017 | -31.3003327 | -34.82022535 | 4.175221039 | -27.14216212 | 72.82911836 | 3.440963576 | -39.04989766 | 33.10219405 | 10.45249908 |
PCA 2D plot:

PCA 3D plot:

Last edited on 2022-10-09