[资料]

matlab提取DNA序列小程序

2012-2-23 14:56:01 7616 matlab

0 从网上下载下来的网页内容如下： LOCUS NC_001133 230218 bp DNA linear PLN 14-JUL-2011 DEFINItiON Saccharomyces cerevisiae S288c chromosome I, complete sequence. ACCESSION NC_001133 VERSION NC_001133.9 GI:330443391 DBLINK Project: 128 KEYWORDS . SOURCE Saccharomyces cerevisiae S288c ORGANISM Saccharomyces cerevisiae S288c Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces. REFERENCE 1 (bases 1 to 230218) AUTHORS Goffeau,A., Barrell,B.G., Bussey,H., Davis,R.W., Dujon,B., Feldmann,H., Galibert,F., Hoheisel,J.D., Jacq,C., Johnston,M., Louis,E.J., Mewes,H.W., Murakami,Y., Philippsen,P., Tettelin,H. and Oliver,S.G. TITLE Life with 6000 genes JOURNAL Science 274 (5287), 546 (1996) PUBMED 8849441 REFERENCE 2 (bases 1 to 230218) AUTHORS Bussey,H., Kaback,D.B., Zhong,W., Vo,D.T., Clark,M.W., Fortin,N., Hall,J., Ouellette,B.F., Keng,T., Barton,A.B. et al. TITLE The nucleotide sequence of chromosome I from Saccharomyces cerevisiae JOURNAL Proc. Natl. Acad. Sci. U.S.A. 92 (9), 3809-3813 (1995) PUBMED 7731988 REFERENCE 3 (bases 1 to 230218) CONSRTM NCBI Genome Project TITLE Direct Submission JOURNAL Submitted (25-APR-2011) National Center for Biotechnology Information, NIH, Bethesda, MD 20894, USA REFERENCE 4 (bases 1 to 230218) CONSRTM Saccharomyces Genome Database TITLE Direct Submission JOURNAL Submitted (31-MAR-2011) Department of Genetics, Stanford University, Stanford, CA 94305-5120, USA REMARK Sequence update by submitter REFERENCE 5 (bases 1 to 230218) CONSRTM Saccharomyces Genome Database TITLE Direct Submission JOURNAL Submitted (11-DEC-2009) Department of Genetics, Stanford University, Stanford, CA 94305-5120, USA COMMENT PROVISIONAL REFSEQ: This record has not yet been subject to final NCBI review. The reference sequence is identical to BK006935. On Apr 26, 2011 this sequence version replaced gi:269970293. COMPLETENESS: full length. PRIMARY REFSEQ_SPAN PRIMARY_IDENTIFIER PRIMARY_SPAN COMP 1-2669 U73805.1 1-2669 2592-106271 U12980.3 1-103681 106140-135553 L05146.3 1-29415 133553-175545 L22015.2 1-41988 175399-230218 L28920.2 1-54812 FEATURES Location/Qualifiers source 1..230218 /organism="Saccharomyces cerevisiae S288c" /mol_type="genomic DNA" /strain="S288c" /db_xref="taxon:559292" /chromosome="I" repeat_region complement(1..801) /note="TEL01L; Telomeric region on the left arm of Chromosome I; composed of an X element core sequence, X element combinatorial repeats, and a short terminal stretch of telomeric repeats; Telomeric region on the left arm of Chromosome I; composed of an X element core sequence, X element combinatorial repeats, and a short terminal stretch of telomeric repeats" /rpt_family="Telomeric Region" /db_xref="SGD:S000028862" repeat_region complement(1..62) /note="TEL01L-TR; Terminal stretch of telomeric repeats on the left arm of Chromosome I; Terminal stretch of telomeric repeats on the left arm of Chromosome I" /rpt_family="Telomeric Repeat" /db_xref="SGD:S000028864" repeat_region complement(63..336) /note="TEL01L-XR; Telomeric X element combinatorial Repeat region on the left arm of Chromosome I; contains repeats of the D, C, B and A types, as well as Tbf1p binding sites; formerly called SubTelomeric Repeats; Telomeric X element combinatorial Repeat region on the left arm of Chromosome I; contains repeats of the D, C, B and A types, as well as Tbf1p binding sites; formerly called SubTelomeric Repeats" /rpt_family="X element Combinatorial Repeats" /db_xref="SGD:S000028866" repeat_region complement(337..801) /note="TEL01L-XC; Telomeric X element Core sequence on the left arm of Chromosome I; contains an ARS consensus sequence, an Abf1p binding site consensus sequence and two small overlapping ORFs (YAL068W-A and YAL069W); Telomeric X element Core sequence on the left arm of Chromosome I; ………… 这是开头的文件内容，在最后面是DNA序列，在这些E文中有个CDS字段表示的是DNA中外显子的位置，要把文件中所有标示的CDS段中的序列提取出来，当然还要把所有的DNA序列全部提取到一个单独的文件，CDS段中的数字标示也要全部提取到一个文件中。任务已明确，手工操作确实难度不大，但却很繁琐，我想先实现提取CDS段再说后面的。提取CDS段要看字符串是不是“CDS”，若是则提取后面的一串“数字”，后面的数字还是字符串形式，而且有的数字前面还有一串我不认识的英语，先做简单的。首先读取文件了。 fid=fopen(ORIGINAL_FILE_NAME,'r'); file_content=fscanf(fid,'%s'); fclose(fid); file_length=length(file_content); 将文件读取到file_content中，这里用的fscanf函数，只能读取字符，几乎所有不可见的字符都会忽略，回车换行也读不出来的。接下来要新建个文件来保存CDS，如下 fid_out=fopen(ALL_CDS,'w'); 然后就要开始遍历file_content的内容了，一个字符一个字符的遍历，好麻烦啊，不过在程序里就方便了， cds_count=0; xulie=[]; for i=1:file_length switch file_content(i) case {'O'} if(file_content(i:i+5)=='ORIGIN') xulie=file_content(i+6:file_length-1); end case {'C'} if(file_content(i:i+2)=='CDS') cds_count=cds_count+1; 变量xulie保存全部DNA序列，cds_count记录找到的CDS的个数，找到了CDS就要提取后面对应的数字了。 tmp1=i+3; tmp2=double(file_content(tmp1)); while(tmp2 ~= 47 && tmp2 ~= 41 ) if(isnumber(file_content(tmp1))==1) fprintf(fid_out,'%c',file_content(tmp1)); end if((file_content(tmp1))==',') cds_count=cds_count+1; fprintf(fid_out,'+'); end tmp1=tmp1+1; tmp2=double(file_content(tmp1)); end fprintf(fid_out,'+'); 其中fid_out就是所有cds对应的数保存的文件夹，提取完要关闭文件。 fclose(fid_out); fprintf('共有%d个外显子。n',cds_count); fprintf('任务1已完成。n'); 接下来要处理提取的所有DNA序列，因为源文件为了便于查找对dna进行了序列标注，这里要将里面的数字删除，并将小写的actg转换成大写，这个功能在上学期写过一个，当时做的就像上面那个，用个循环，一个一个的处理，这次我查了点资料，几句话就搞定了。 string_out=char(xulie); for tmp1=0:9 string_out=strrep(string_out,num2str(tmp1),''); end string_out=upper(string_out); string_out=char(string_out); fid_out=fopen(ALL_XULIE,'w'); string_out=char(string_out); fprintf(fid_out,'%s',string_out); fclose(fid_out); fprintf('任务2已完成。n'); 这里就是将0~9这10个数字全部替换成空，再用upper函数转换大写，几乎瞬间完成。接下来就是重头戏了，要将cds对应的序列单独提取并保存到文件，这里要用到前面生成的两个文件。 fid=fopen(ALL_CDS,'r'); file_content=fscanf(fid,'%s'); fclose(fid); file_length=length(file_content); fid=fopen(ALL_XULIE,'r'); xulie_content=fscanf(fid,'%s'); fclose(fid); xulie_length=length(file_content); 然后我又笨笨的一个大的for循环，一个一个的提取创建吧。 for tmp1=1 : file_length if(file_content(tmp1)~='.' && file_content(tmp1)~='+') tmp2=[tmp2(:)',file_content(tmp1)']; else if(file_content(tmp1)=='.' && file_content(tmp1-1)=='.') bg_num=str2num(tmp2); tmp2=''; elseif(file_content(tmp1)=='+') ed_num=str2num(tmp2); tmp2=''; cds_count=cds_count+1; tmp_str2=strcat(PRE_FIX_NAME,num2str(cds_count)); tmp_str2=strcat(tmp_str2,'.txt'); tmp_str1=xulie_content(bg_num:ed_num); fid_out=fopen(tmp_str2,'w'); fprintf(fid_out,'%s',tmp_str1); fclose(fid_out); end end end 这里面挺麻烦的，因为开始和结束处之间有两个分割符，所以前面那个就忽略掉了。新建的文件名要按顺序起名，并要有共同的前缀。这功能就在上面的程序中实现。功能实现了。可以放松下了。全部程序： ORIGINAL_FILE_NAME='dat/dat.txt'; ALL_XULIE='dat/new_dat.txt'; ALL_CDS='dat/CDS.txt'; PRE_FIX_NAME='dat/exon'; clc; fprintf('任务已启动。n'); %读取文件内容 fid=fopen(ORIGINAL_FILE_NAME,'r'); file_content=fscanf(fid,'%s'); fclose(fid); file_length=length(file_content); tmp1=0; tmp2=0; fid_out=fopen(ALL_CDS,'w'); cds_count=0; xulie=[]; for i=1:file_length switch file_content(i) case {'O'} if(file_content(i:i+5)=='ORIGIN') xulie=file_content(i+6:file_length-1); end case {'C'} if(file_content(i:i+2)=='CDS') cds_count=cds_count+1; tmp1=i+3; tmp2=double(file_content(tmp1)); while(tmp2 ~= 47 && tmp2 ~= 41 ) if(isnumber(file_content(tmp1))==1) fprintf(fid_out,'%c',file_content(tmp1)); end if((file_content(tmp1))==',') cds_count=cds_count+1; fprintf(fid_out,'+'); end tmp1=tmp1+1; tmp2=double(file_content(tmp1)); end fprintf(fid_out,'+'); end end % if(bitand(i,65535)==65535) % fprintf('任务1完成%f%%n',i/file_length*100); % end end fclose(fid_out); fprintf('共有%d个外显子。n',cds_count); fprintf('任务1已完成。n'); string_out=char(xulie); for tmp1=0:9 string_out=strrep(string_out,num2str(tmp1),''); end string_out=upper(string_out); string_out=char(string_out); fid_out=fopen(ALL_XULIE,'w'); string_out=char(string_out); fprintf(fid_out,'%s',string_out); fclose(fid_out); fprintf('任务2已完成。n'); fid=fopen(ALL_CDS,'r'); file_content=fscanf(fid,'%s'); fclose(fid); file_length=length(file_content); fid=fopen(ALL_XULIE,'r'); xulie_content=fscanf(fid,'%s'); fclose(fid); xulie_length=length(file_content); tmp2=''; bg_num=0; ed_num=0; cds_count=0; tmp_str1=''; tmp_str2=''; for tmp1=1 : file_length if(file_content(tmp1)~='.' && file_content(tmp1)~='+') tmp2=[tmp2(:)',file_content(tmp1)']; else if(file_content(tmp1)=='.' && file_content(tmp1-1)=='.') bg_num=str2num(tmp2); tmp2=''; elseif(file_content(tmp1)=='+') ed_num=str2num(tmp2); tmp2=''; cds_count=cds_count+1; tmp_str2=strcat(PRE_FIX_NAME,num2str(cds_count)); tmp_str2=strcat(tmp_str2,'.txt'); tmp_str1=xulie_content(bg_num:ed_num); fid_out=fopen(tmp_str2,'w'); fprintf(fid_out,'%s',tmp_str1); fclose(fid_out); end end end fprintf('任务3完成。n'); fprintf('共有%d个外显子。n',cds_count); fprintf('任务全部完成。nn'); 还有个小程序 function jg=isnumber(canshu) if((double(canshu)<=58 && double(canshu)>47) ... \|\| double(canshu)==46) jg=1; else jg=0; end 0
举报淘帖0 只看该作者相关推荐 • 【「时间序列与机器学习」阅读体验】时间序列的信息提取 431 • 【《时间序列与机器学习》阅读体验】+ 时间序列的信息提取 602 • 怎样去设计一种基于随机产生实时过滤算法的DNA编码序列 1572 • 利用MATLAB提取交点计算SPWM周期 590 • DNA编码的学习 1235 • 基于MATLAB人脸提取方法（人脸轮廓提取） 11071 • 如何提取脉冲序列的各个脉冲 4039 • 谁会基于MATLAB的GOLD序列的仿真这个，有GOLD序列的程序吗 3032 • 谁会用matlab画单位采样序列、实指数序列、矩形序列 16952 • MATLAB下的数字信号处理：理想采样信号序列/单位脉冲序列矩形序列/特定冲击串/卷积 23836 2条评论发表评论只看该作者