%SMF Symmetric matrix factorization % smf7_5 residues with few cliques are not clustered correctly and 9rubB's % N' two residues were joined to the C'terminal. We increase the % minNclq to 0.1% of the number of cliques. % smf7_3 is the same as smf7_2 except % the linker domain detection feature has been removed. % smf7_2 is smf7_1 with the following non-consequential changes: % 1. linker domain is identified a la smf5_9, i.e. % Diagonals weaker than cold*(max off-diagonal) are considered as % linkers. The value of the parameter cold = 1.2. % 2. produces clique map and computes mou as in smf6_5. % mou = mean over- and under-filling of domains by the cliques % 3. Q scores are redefined: % 1. Largest off-diagonal density relative to the smaller of the two diagonals', ... % 2. Minimum scaled diagonal % 3. 0.5*erfc(Q1) % 4. 1-0.5*erfc(Q2) % 5. Increase in Q1 upon increasing nd by one from the current value',... % 6. mou % 7. Q3*Q4 % 8. Q3*Q4*Q5 % 4. 1-domain case is included in the loop and not treated specially. % smf7_1 is smf447 with bug fix and cosmetic changes. % 1. a bug in Nc matrix output procedure has been fixed - % smf447 misses plotting 2D solution. % 2. pname (program name) variable has been introduced. % 3. coq1 value has been changed from 0.25 to 0.30. % This is a cosmetic change becasue its usage has also been changed % so that net effect is no change. % 4. The unused variable fmin has been moved and commented out. % 5. The low Nres clique and low Ncli res. handling code has been % modified cosmetically. % 6. Renamed H11 as H1 and old H1 as H2. Th eprocedure for obtaining H % from H1 (formerly H11) has been changed to avoid using a for loop. % 7. The deinition ZERO has been delelted. % 8. Although smf7_1 starts with Nd=12, output is restricted to ndmax=9. % smf447 is the same as smf446, % except that the penalty for 1D idagonal minimum was reduced. % smf446 is the same as smf444, % except that the ndmax parameter was increased to 12 from 9. % smf444 is the same as smf443, % except that the erfc is applied also to the minimum diagonal because, % otherwise, 1qbkB becomes 6D. % smf443 is the same as smf442, % except that the cutoff is made smooth around 0.25. % This was done because the cutoff value has to be between 0.2517 (for % 1atnA) and 0.2560 (for 1jjcB). % smf442 is the same as smf441, % exceept that the cutoff value was set to 0.30 in order to save 1atnA. % smf441 is derived from smf434, % This version does a hierachical clustering to find the initial seed for % the kmeans clustering. % smf434 is the same as smf433, % except that 'correlation' is used for distance. % smf433 is the same as smf432, % (except that single is used for linkage. : gave bad result for 1atnA) % except that 'weighted' is used for linkage. % smf432 is the same as smf431, % except that correlation is used for the distance function. % smf431 is the same as smf428, % except that hierarchical clustering is used and % the cutoff value was changed to 0.251 to resdue 1atnA. % smf428 is the same as smf427, % except that the best nd selection procedure was modified. % This version selects best nd as that % with maximum Q2*Q7 among those for which Q1<=cutoff, which is set to 0.25. % smf427 is the same as smf42, % except that the low-clique number screen feature (fmin) is suppressed % and the cluster with no clique is properly handles. % smf42 is the same as smf426, % with output resotered. % smf426 is the same as smf425, % except that 1-D non-selection criterion is Q(1,2)coq7. cold=1.2; % diagonals weaker than cold*(max. off-diagonal) are linkers. minClqCut=0.001; % number of minimum cliques to keep the resiude, 0.1% of the total number of cliques. %minNcli=0; % only the residues with more than this number of cliques are used minNres=30; % only the cliques that span more than this number of residues are used filename=qryName AAA=matali; [nn,mm]=size(AAA); % remove short cliques Nres=sum(AAA,2); AA=AAA(Nres>minNres,:); [nnn,mmm]=size(AA); % Sort the cliques in increasing order of mean residue number meanr=(AA*[1:mm]')./sum(AA,2); [meanrs,oofn]=sort(meanr); Af=AA(oofn,:); % remove residues few cliques span Ncli=sum(Af); minNcli=int8(minClqCut * nnn); A=Af(:,Ncli>minNcli); nonzeroR=find(Ncli>minNcli); [n,m]=size(A); nm=[nn mm; n m] % Write warning message if any resiudes and/or cliques are left out if (mm>m) [s,errmsg]=sprintf(... '%d residues were left out because less than %d cliques span them.',... mm-m,minNcli); disp(s) end if (nn>n) [s,errmsg]=sprintf(... '%d cliques were left out because they span less than %d residues.',... nn-n, minNres); disp(s) end % Hierarchically cluster the residues in the N matrix into ndmax clusters N=A'*A; H1=clusterdata(N,'distance','cosine','linkage','ward','maxclust',ndmax); % Calculate H matrix and obtain the median residue for each cluster H=zeros(m,ndmax); H(sub2ind([m ndmax],[1:m]',H1(:)))=1; vm=zeros(ndmax,1); cres=ones(ndmax,1); for j=1:ndmax vm=find(H(:,j)); cres(j)=vm(round((length(vm)+1)/2)); end % Now, cluster again using kmeans and seeds formed from the center residues % found above seeds=N(cres,:); H1=kmeans(N,ndmax,'start',seeds,'distance','cosine','emptyaction','drop'); % Obtain H matrix for ndmax and reset ndmax after eliminating empty columns H=zeros(m,ndmax); H(sub2ind([m ndmax],[1:m]',H1(:)))=1; vnd=sum(H); H(:,vnd==0)=[]; [m,ndmax]=size(H); % sort domains rofc=((1:m)*H)./sum(H); [vnd,cofd]=sort(rofc); H=H(:,cofd); % allocate memory space for some matrices H2=zeros(m,ndmax); Hf=zeros(mm,ndmax); Nf=zeros(ndmax,ndmax); D=zeros(ndmax,ndmax); Dh=zeros(ndmax,ndmax); Dtmp=zeros(ndmax,ndmax); Doff=zeros(ndmax,ndmax); DD=cell(ndmax,4); DB=cell(ndmax,1); DD(ndmax,:)={diag(Dh)',diag(Dh)',diag(Dh)',D}; S=zeros(n,ndmax); R=zeros(n,ndmax); T=zeros(n,ndmax); M=zeros(n,ndmax); Q=zeros(ndmax,8); Qsheets=char('1. Largest off-diagonal density relative to the smaller of the two diagonals', ... '2. Minimum scaled diagonal.',... '3. 0.5*erfc(Q1)',... '4. 1-0.5*erfc(Q3)',... '5. Increase in Q1 upon increasing nd by one from the current value',... '6. mou',... '7. Q3*Q4',... '8. Q3*Q4*Q5'); Qdg=zeros(ndmax); newds=[1:ndmax]; i=1; % For each number of domains, from large to small for nd=ndmax:-1:1 % fill-in H2 column H2(:,nd)=H*[1:nd]'; % compute D matrix Dh=H'*H; Dtmp=inv(Dh); D=Dtmp*H'*N*H*Dtmp; % identify linker domains Doff=D; Doff(eye(nd)==1)=0; % maxoffd=max(Doff,[],2); linkerd=zeros(nd,1); % linkerd(diag(D)j) i=i-1; end end % 1-domain case % m1=H'*H; % oad=H'*N*H/(m1^2); % H2(:,1)=H; % Q(1,1)=0; % Q(1,4:5)=oad; % Qdg(1,newds)=oad; % DD(1,:)={[m1],[oad]}; % modify scores and calculate other derived scores Q(:,2)=Q(:,2)/Q(1,2); Q(:,3)=0.5*erfc((Q(:,1)-coq1)/tolq1); Q(:,4)=1-0.5*erfc((Q(:,2)-coq2)/tolq2); Q(:,5)=circshift(Q(:,1),-1)-Q(:,1); Q(:,7)=Q(:,3).*Q(:,4); Q(:,8)=Q(:,3).*Q(:,4).*Q(:,5); Qdg=Qdg/Q(1,2); % define green zone and find the best number of domains hiq7=find(Q(:,7)>coq7); [maxQ,ndbest]=max(Q(:,8)); % Output Q vectors Qsheets, Q % disp('Diagonals'), Qdg % output domain boundaries sprintf('Domains with negative Nseg are linkers.'); db=DB{ndbest}; nseg=db{1}; mxnseg=length(find(sum(db{2},2))); b=db{2}(1:mxnseg,:); e=db{3}(1:mxnseg,:); Nsegments=nseg BegResNumbers=b EndResNumbers=e nl=length(find(Nsegments<0)); nnd=ndbest-nl; sprintf('nd, N of linker domains, N of domains = %d %d %d',ndbest,nl,nnd) smf2NDO; % Output various matrices % Define common colors black= [0 0 0]; blue= [0 0.5 1]; cyan= [0 1 1]; lime= [0 1 0.5]; green= [0 1 0]; ygreen= [0.5 1 0]; dyellow=[0.5 0.5 0]; yellow= [1 1 0]; brown= [1 0.5 0]; red= [1 0 0]; magenta=[1 0 1]; purple= [0.5 0 1]; white= [1 1 1]; gray= [0.5 0.5 0.5]; lgray= [0.75 0.75 0.75]; % lgray= [0.85 0.85 0.85]; % Define colormap color18=[white; lgray; black; red; green; blue; magenta; cyan; purple; brown;... dyellow; black; red; green; blue; magenta; cyan; purple]; pinkscale=[1-pink]; cmap=[color18;pinkscale]; ncol1=size(color18,1); ncol2=size(pinkscale,1); ncolall=size(cmap,1); % output clique maps figure(3) clf colormap(cmap) for nd=1:min(ndmax,9) H=zeros(m,nd); H(sub2ind([m nd],[1:m]',H2(:,nd)))=1; Hf=zeros(mm,nd); Hf(nonzeroR,:)=H(:,:); S=A*H; [maxS,ccn]=max(S,[],2); % ccn=clique cluster number M=zeros(n,nd); M(sub2ind([n nd],[1:n]',ccn))=1; ax(nd)=subplot(3,3,nd); C2=1+M*Hf'+Af.*(ccn*ones(1,mm)); image(1:mm,1:n,C2) s=sprintf('%s-%s-%d-%d %d', filename, pname, minNcli, minNres, nd); title(s) end pngformat='png'; pngName=strcat(filename,'-smf7_5-1.', pngformat); saveas(gcf, pngName, pngformat); % output Nc=Hf*D*Hf matrices figure(2) clf colormap(cmap) for nd=1:min(ndmax,9) H=zeros(m,nd); H(sub2ind([m nd],[1:m]',H2(:,nd)))=1; Hf=zeros(mm,nd); Hf(nonzeroR,:)=H(:,:); D=DD{nd,4}; Nf=Hf*diag(diag(D))*Hf'; cmin = min(Nf(:)); cmax = max(Nf(:)); C1 = min(ncolall,round((ncol2-1)*(Nf-cmin)/(cmax-cmin))+ncol1+1); ax(nd)=subplot(3,3,nd); imagesc(1:mm,1:mm,C1) s=sprintf('%s-%s-%d-%d %d', filename, pname, minNcli, minNres, nd); title(s) end Nf=Af'*Af; cmin = min(Nf(:)); cmax = max(Nf(:)); C1 = min(ncolall,round((ncol2-1)*(Nf-cmin)/(cmax-cmin))+ncol1+1); ax(1)=subplot(3,3,1); image(1:mm,1:mm,C1) s=sprintf('%s-%s-%d-%d N', filename, pname, minNcli, minNres); title(s) pngName=strcat(filename,'-smf7_5-2.', pngformat); saveas(gcf, pngName, pngformat); % output for best nd figure(1) clf colormap(cmap) % output observed N ax(1)=subplot(221); Nf=Af'*Af; cmin = min(Nf(:)); cmax = max(Nf(:)); C1 = min(ncolall,round((ncol2-1)*(Nf-cmin)/(cmax-cmin))+ncol1+1); image(1:mm,1:mm,C1) title([filename,' N=A''*A']) % Re-compute the Hf matrix, ccn vector, and M matrix H=zeros(m,ndbest); H(sub2ind([m ndbest],[1:m]',H2(:,ndbest)))=1; Hf=zeros(mm,ndbest); Hf(nonzeroR,:)=H(:,:); S=A*H; [maxS,ccn]=max(S,[],2); % ccn=clique cluster number M=zeros(n,ndbest); M(sub2ind([n ndbest],[1:n]',ccn))=1; % output calculated N ax(2)=subplot(222); D=DD{ndbest,4}; Nf=Hf*diag(diag(D))*Hf'; cmin = min(Nf(:)); cmax = max(Nf(:)); C1 = min(ncolall,round((ncol2-1)*(Nf-cmin)/(cmax-cmin))+ncol1+1); image(1:mm,1:mm,C1) title([filename,' H*B*H''']) % output clique map ax(3)=subplot(223); C2=1+M*Hf'+Af.*(ccn*ones(1,mm)); image(1:mm,1:n,C2) s=sprintf('%s-%s-%d-%d %d', filename, pname, minNcli, minNres, ndbest); title(s) % plot quality measures ax(4)=subplot(224); plot(Q(:,1),'-ko') hold on plot(hiq7,Q(hiq7,1),'ko','MarkerFaceColor','g') plot(ndbest,Q(ndbest,1),'ko','MarkerFaceColor','k') plot([0,ndmax+1],[coq1,coq1],'--k') Q6=Q(:,6); Q6(Q6>1.0)=1.0; plot(Q6,'-ks') plot(hiq7,Q6(hiq7),'ks','MarkerFaceColor','g') plot(ndbest,Q6(ndbest),'ks','MarkerFaceColor','k') hold off title([filename,' Q1, Q6']) pngName=strcat(filename,'-smf7_5-3.', pngformat); saveas(gcf, pngName, pngformat);