用bioperl从NCBI批量下载序列
拟定解决问题:已知有多个cDNA的accession number,要求获得这些序列的物种名词,简单描述,cDNA序列和/或蛋白序列。
第一步:把accession number拷贝到一个文本文件,通常可以从发表的文章的表格里面直接copy和past。附件2get_by_acc.txt是示例文件,内容如下:
NM_001048135
AY293929
AY623897
AM421526
AJ852018
XM_695386
AY850106
AB125660
AM850085
NM_002759
AF193339
EF467667
DQ645944
NM_011163
DQ115394
AM850088
XM_001166391
AM850089
NM_019335
ABA64562
NM_214319
AM850086
AM850087
AM421523
AM421524
AM421525
AM421528
AM850090
AM850091
AM850092
AB104655
然后下载附件里bioperl脚本文件,其内容如下:
#! /usr/bin/perl;
use Bio::Seq;
use Bio::SeqIO;
use Bio:: DB::GenBank;
use Bio::Species;
use Bio::SeqFeatureI;
use List::MoreUtils qw(uniq);
########## BIGIN: Cofiguration ##########################################################
$ToGet= "2get_by_acc.txt";
########## END: Configuration ###########################################################
########## BIGIN: processing seq by seq #################################################
$i=0;
open(FILE,$ToGet);
open(OUT1,">$ToGet.table");
open(OUT2,">$ToGet.fasta");
@list2get=<FILE>;
@list2get = uniq @list2get;
$total=@list2get;
foreach $query_id (@list2get){
$i++;
chomp($query_id);
print "$i/$total\t$query_id\t";
if(length($query_id) < 4){print "This is not likely an accession number. Retrieve is canceled\n";next;}
print "\t\tis under retrieving...\t";
$db_obj = Bio:: DB::GenBank->new;
$seq_obj = $db_obj->get_Seq_by_version($query_id);
$species=$seq_obj->species;
$bi = $species->binomial();
$common_name=$species->common_name();
@feature_array=$seq_obj->get_SeqFeatures;
$count=$seq_obj->feature_count;
my @protein_str=('There is no translation available');
if ($seq_obj->alphabet ne 'protein' ){
foreach $feat_object (@feature_array) {@protein_str=$feat_object->get_tag_values("translation") if ($feat_object->has_tag("translation"));}}
print OUT1 ">\t",$bi,"\t",$common_name,"\t", $seq_obj->display_id, "\t", $seq_obj->desc, "\t", $seq_obj->seq(),"\t", $protein_str[0], "\n";
print OUT2 ">",$seq_obj->display_id;
print OUT2 "|", $bi,"|",$common_name,"|", $seq_obj->desc;
print OUT2 "\n", $seq_obj->seq(), "\n";
print "done!\n";
}
close(FILE);
close(OUT1);
close(OUT2);
########## END: processing seq by seq ######################################################
学过bioperl的童鞋可以对文件略做修改,如果菜鸟只要把两个文件下载到同一个文件夹,运行get_seqs_by_acc.pl就可以了。如果要下载自己的序列,只要把2get_by_acc.txt文件的accession number替换了就可以。注意:该文件中不能有空格等不必要的符合,每一个accession number占一行,用回车隔开。通常可以从参考文献的表格里面直接copy。
本代码在ubuntu linux中通过测试。
最后编辑于 2010-10-24 · 浏览 8634