admin 管理员组

文章数量: 1184232

PSP

欢迎关注我的CSDN:/
本文地址:

MMseqs2HHblits 的算法比较:

  • 蛋白质序列搜索算法 MMseqs2 与 HHblits 的搜索结果差异
  • HHblits 算法搜索 BFD 与 UniRef30 的结果分析

在 AlphaFold2 中,使用 HHblits 算法搜索 BFD 与 UniRef30,输出 bfd_uniref_hits.a3m 文件。MMseqs2 优化搜索速度之后,比较使用 MMseqs2 算法与 HHblits 算法之间,BFD 与 UniRef30的搜索结果差异。

构建 MMseqs2 搜索库使用 ffindex2fasta 工具 (来源于 MMseqs v1),将 BFDUniRef30 转换成 fasta 文件,即:

$MMDIR/bin/ffindex2fasta resultDB resultDB.fasta

转换之后 BFD 是 1.5T,UniRef30 是 160G。

测试 Case 来源于 CASP15,即:T1104-D1_A117.fastaT1137s8-D1_A251.fastaT1188-D1_A573.fastaT1157s1_A1029.fasta,序列长度100、250、500、1000等。


1. AF2 MSA

bfd_uniref_hits.a3m中序列数量,一般而言,序列越长,可搜索出的序列数量越多,即:

  • T1104-D1_A117: 333 (包括自己)
  • T1137s8-D1_A251:1687
  • T1188-D1_A573:3648
  • T1157s1_A1029:3383

即,来自于 HHblits 搜索的日志 wc -l

666 bfd_uniref_hits.a3m			# T1104-D1_A117
3374 bfd_uniref_hits.a3m		# T1137s8-D1_A251
7296 bfd_uniref_hits.a3m		# T1188-D1_A573
6766 bfd_uniref_hits.a3m		# T1157s1_A1029

其中, T1104-D1_A117bfd_uniref_hits.a3m 数据如下,一部分来自BFD,一部分来自UniRef30,即:

>A
QLEDSEVEAVAKGLEEMYANGVTEDNFKNYVKNNFAQQEISSVEEELNVNISDSCVANKIKDEFFAMISISAIVKAAQKKAWKELAVTVLRFAKANGLKTNAIIVAGQLALWAVQCG
>tr|A0A256LDH3|A0A256LDH3_9LACO Streptococcin A-M57 OS=Lactobacillus taiwanensis GN=CBF70_06415 PE=4 SV=1
VDNDPEVAVVAQELEKIFANGVSQENLNRYVLKNFSNKELTVAEKELDVNYNPFslqskndnnslhsvsvygwnnlgqCMYNKIKDEFFAMVNIGVIVKYAKKKAWKELAKVVIRFAKGAGVRTNAAIVAAQLAVWAVQCG
>tr|F7SG70|F7SG70_LACJH Uncharacterized protein OS=Lactobacillus johnsonii pf01 GN=PF01_01547 PE=4 SV=1
VDNDPEVAIAAQELEKIFVDVVSQKNLNRYALKNFSNKELTVAEKELDVNYNPFslqlkndnnslhsvsvyg---------------------------------------------------------------
>tr|Q6DRR6|Q6DRR6_STRPY Streptococcin A-M57 OS=Streptococcus pyogenes GN=scnM57 PE=4 SV=1
IEEQRQIDEVAAVLEKMFADGVTEENLKQYAQANYSEEELIIADNELNTNLSQIqdenaimykvdwgalgnCMANKIKDELLAMISVGTIIKYAQKKAWKELAKIVIKYVAKAGVKTNAALIAGQLAIWGLQCG
>tr|A0A1C0USB4|A0A1C0USB4_STREE Streptococcin A-M57 OS=Streptococcus pneumoniae GN=A4260_03625 PE=4 SV=1
TQEQKQIDDVANVLEQMFRNGVNEKNFTEYVYKNFSQKDIALAENELETNINNPydrvpwdemggCIAGKIRDEFFAMINVSLIVKYAQKKAWSELAKVVLRFVKANGLKTNIYIIAGQLAIWAVQCG
>tr|A0A133S201|A0A133S201_STRMT Uncharacterized protein OS=Streptococcus mitis GN=HMPREF3228_00364 PE=4 SV=1
----------------MFRNGVNEKNFTECVYKNFSQKDIALAENKLETNINNLydrvpwdemggCIARKIREEFFAMTNVSLTVRYA----------------------------------------
>tr|R2N7C7|R2N7C7_ENTFC Uncharacterized protein OS=Enterococcus faecium EnGen0191 GN=SSI_02965 PE=4 SV=1
QIKNSEVDTVAQGLEQMFSNGVSEENFKNYVNANFSSEEITKSEKELDVNLSNTsspiqarvnwnglgqCMANKIKDEFFAMINVGAIVAAAQKKAWKELAMTVLIFAKANGLKTNALIVAGQLAVWAVQCG
>UniRef100_A0A2N8LAA9 Uncharacterized protein n=1 Tax=Streptococcus penaeicida TaxID=1765960 RepID=A0A2N8LAA9_9STRE
-ISQSEVDAVAIEFEKLFSNGIiiSGNNYSinyDYLNNNYTSEEIQAFINLMASSELSPissgrrkrsissfvvCMKDKAVSDIADMFKVSAFVSFVQRKAWKEAAKFAVSWLAKNGIKRNVAATAALLSWYGIQCA
>UniRef100_A0A2X3SLP5 Uncharacterized protein n=2 Tax=Streptococcus equi TaxID=1336 RepID=A0A2X3SLP5_STRSZ
-TNtsSQEVDQVAQALELMFDNNVSTSNFKKYVNNNFSDSEIAIAELELESRISNSrsefrvawnemggCIAGKIRDEFFAMISVGTIVKYAQKKAWKELALVVLKFVKANGLKTNAIIVAGQLALWAVQCG
>UniRef100_A0A2X4H7F6 Uncharacterized protein n=4 Tax=Streptococcus uberis TaxID=1349 RepID=A0A2X4H7F6_STRUB
-VSQTEIESVASEFEQLFTKGIiiSGNNYTfnhDYLTNNYTSDEIQAFIHLMDSTDLSPtfskrrkrsigsfavCMKDKAVSDIADMFKVGAFVSFVQRKAWKEAAKFAVSWLAKNGIKRNVAATAALLSWYGIQCA

其中,tr 开头的来自于 BFD,UniRef100 开头的来自于 UniRef30

具体而言:

  • >UniRef100 是 250 条,UniRef30
  • >tr| + >SRR 是 81 条,BFD
  • 其他 1 条,BFD
  • 合计 332 条

序列示例如下:

>UniRef100_A0A2W5AUB7 Uncharacterized protein n=1 Tax=Streptococcus pyogenes TaxID=1314 RepID=A0A2W5AUB7_STRPY
-----------------------------MIGDLLKSKEVLLKKTARQTSLSQiQdddvimyktdwnalgsCMANKIKDELLAMISIGTIITYAKRKAWKELATIVIKYVAKAGVRTHAAFIAGQLAIWGLQCG
>SoimicmetaTmtLPB_FD_contig_61_1212631_length_209_multi_1_in_0_out_0_1 # 2 # 76 # 1 # ID=3042713_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.693
---NEEIEQLAADLEFLMEEAAiydEKGKVVnfdfDLLEERFgYVLELEMLKEEIEAYnattegd--NDeiqlfswksCMISALKGHFGvALIEValtGGLWSYLEKKAYKEAAKLLLKI----GIEGNVIGLTAFLTWYSVDCI
>tr|Q4V1Z0|Q4V1Z0_BACCZ Uncharacterized protein OS=Bacillus cereus (strain ZK / E33L) GN=pE33L466_0102 PE=4 SV=1
-----LVQAVAAQLKFVVEEAAvkdKHGRVVdiDMIENKYgKTEELEQLRQEIQRVNTPPgyedpfkqeteavgnCIERKLIANYVEVLSVgflGSIIANITNKEYELAARKMIKL----GVKGNLISLAGQLAWYLGTCI
>SRR5690606_14355373 
-----QIEELAAQLEFLMEEALiiENGERTfdfEKIENEFgkeVKDEIKMLTVDAQVWqvqpgaitlaanqPWKDCMVGAIKDHFGvALVTAaleGGLWAYLEKKAYKEAAKLLVKF----AVGTNAVGIAGTLIYYGGKCT

结论:在T1104-D1_A117bfd_uniref_hits.a3m 数据中,UniRef30 的搜索数量略大于 BFD,与数据库大小相反。

BFD (Big Fantastic Database) 是由 Martin Steinegger 博士构建的大型蛋白质序列数据库,包含超过 65 亿个蛋白质家族和 220 亿个蛋白质序列,覆盖了多种来源的蛋白质数据,如 UniProt、Metaclust、SRC 和 MERC。BFD 蛋白质数据库的目的是为了提供一个高质量、高覆盖率、低冗余的蛋白质序列资源,用于蛋白质结构预测、功能注释、进化分析等领域。

BFD 蛋白质数据库的构建过程是这样的:

  1. 从三大序列库 (GenBank \ EMBL \ DDBJ) 中采集 24 亿条蛋白质序列,使用 MMSeqs2 / Linclust 工具进行聚类,根据序列相似度和排列覆盖率的标准,将相似的序列归为一个家族,并且选取一个代表序列。
  2. 从 Metaclust 序列库中,补充了一些较长的序列,与已有的代表序列进行比对,将满足条件的序列加入相应的家族,将不满足条件的序列单独聚类。
  3. 对每个家族进行多序列比对 (MSA) 和隐马尔可夫模型 (HMM) 的构建,得到 BFD 蛋白质数据库。

2. MMseqs2 MSA

测试 BFD 脚本:

bash mmseqs2_search.sh test_fasta/T1104-D1_A117.fasta test_fasta/T1104-D1_A117-bfd-i3s8.a3m msa_databases/af2_msa_mmseqs/bfd_mmseqs/bfd_db 3 8 T1104-D1_A117-bfd-i3s8

测试效果:

[Info] Time taken to execute commands is 1974 seconds. (32.9 min)
154 test_fasta/T1104-D1_A117-bfd-i3s8.a3m

测试 UniRef30 脚本:

bash mmseqs2_search.sh test_fasta/T1104-D1_A117.fasta test_fasta/T1104-D1_A117-uniref30-i3s8.a3m msa_databases/af2_msa_mmseqs/uniref30_mmseqs/uniref30_db 3 8 T1104-D1_A117-uniref30-i3s8

测试效果:

[Info] Time taken to execute commands is 1266 seconds.
194 test_fasta/T1104-D1_A117-uniref30-i3s8.a3m

测试脚本:

# params
#=========================================
mmseqs=mmseqs
tmp=tmp_my
query_fasta=$1		# fasta
a3m_db=$2			# MSA
target_db=$3		# DB Path
num_iterations=$4   # 迭代轮次
sensitivity=$5		# 敏感度
out_tag=$6          # 唯一标识, 避免多进程搜索干扰target_db_index=${target_db}.idxquery_db=$tmp/${out_tag}_queryDB
result_db=$tmp/${out_tag}_res  # 文件
result_db_realign=$tmp/${out_tag}_res_realign
result_db_realign_filter=$tmp/${out_tag}_res_realign_filter
tmp_db=$tmp/${out_tag}_tmp#=========================================
echo "[Info] query_fasta: $1"
echo "[Info] a3m_db: $2"
echo "[Info] target_db: $3"
echo "[Info] num_iterations: $4"
echo "[Info] sensitivity: $5"
echo "[Info] out_tag: $6"mkdir -p $tmptime_start=$(date +%s)
#=========================================$mmseqs createdb ${query_fasta} ${query_db}
#$mmseqs search ${query_db} ${target_db} ${result_db} ${tmp_db} --db-load-mode 2 --num-iterations ${num_iterations} -s ${sensitivity} --max-seqs 10000 -e 0.1 -a --threads 16 --mpi-runner "mpirun --allow-run-as-root -np 8"
$mmseqs search ${query_db} ${target_db} ${result_db} ${tmp_db} --db-load-mode 2 --num-iterations ${num_iterations} -s ${sensitivity} --max-seqs 10000 -e 0.1 -a --threads 16
$mmseqs align ${query_db} ${target_db_index} ${result_db} ${result_db_realign} --db-load-mode 2 -e 10 --max-accept 100000 --alt-ali 10 -a --threads 16
$mmseqs filterresult ${query_db} ${target_db_index} ${result_db_realign} ${result_db_realign_filter} --db-load-mode 2 --qid 0 --qsc 0.8 --diff 0 --max-seq-id 1.0 --filter-min-enable 100 --threads 16
$mmseqs result2msa ${query_db} ${target_db_index} ${result_db_realign_filter} ${a3m_db} --msa-format-mode 6 --db-load-mode 2 --filter-msa 1 --filter-min-enable 1000 --diff 3000 --qid 0.0,0.2,0.4,0.6,0.8,1.0 --qsc 0 --max-seq-id 0.95 --threads 16$mmseqs rmdb ${result_db_realign_filter}
$mmseqs rmdb ${result_db}
$mmseqs rmdb ${result_db_realign}#=========================================
time_end=$(date +%s)
time_take=$(( time_end - time_start ))echo "[Info] MMseqs2 path is ${mmseqs} ."
echo "[Info] Time taken to execute commands is ${time_take} seconds."

参数网格搜索的多进程脚本:

  • 支持断点运行,避免遇到 Bug 重新开始。
  • 多进程需要资源比较多,长序列建议单进程运行。

源码如下:

#!/usr/bin/env python
# -- coding: utf-8 --
"""
Copyright (c) 2022. All rights reserved.
Created by C. L. Wang on 2023/7/31
"""import os
import subprocess
import sys
from multiprocessing import Pool
from time import timefrom tqdm import tqdmp = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if p not in sys.path:sys.path.append(p)from myutils.project_utils import mkdir_if_not_exist, traverse_dir_files, time_elapsed, create_empty_file, read_file
from root_dir import ROOT_DIR, DATA_DIRdef process(params):"""单进程"""sh_file, fasta_path, output_path, db_path, i_idx, s_idx, out_name = params# 已经处理完成,支持断点重跑if os.path.isfile(output_path):data_lines = read_file(output_path)print(f"[Info] out_name: {out_name}, num: {len(data_lines)}")if len(data_lines) > 0:print(f"[Info] pass: {output_path}")returncreate_empty_file(output_path)# 输出文件s_time = time()res = subprocess.check_output(["bash", sh_file,fasta_path,output_path,db_path,str(i_idx),str(s_idx),out_name,  # 用于缓存临时文件], shell=False)# print(f"[Info] res: \n{res.decode()}")print(f"[Info] time: {time_elapsed(s_time, time())}, output_path: {output_path}")def main():"""参数网格搜索的脚本"""output_dir = os.path.join(DATA_DIR, "results")fasta_dir = os.path.join(ROOT_DIR, "data", "fasta")mkdir_if_not_exist(output_dir)print(f"[Info] 输出文件夹: {output_dir}")print(f"[Info] 输入 fasta 文件夹: {fasta_dir}")path_list = traverse_dir_files(fasta_dir, "fasta")print(f"[Info] fasta 数量: {len(path_list)}")bfd_db = "/pfs_beijing/chenlong/msa_databases/af2_msa_mmseqs/bfd_mmseqs/bfd_db"uniref30_db = "/pfs_beijing/chenlong/msa_databases/af2_msa_mmseqs/uniref30_mmseqs/uniref30_db"sh_file = os.path.join(ROOT_DIR, "search.sh")params_list = []# 遍历次数 4x2x3x8 = 192for fasta_path in path_list:base_name = os.path.basename(fasta_path).split(".")[0]# print(f"[Info] fasta_path: {fasta_path}")output_fasta_dir = os.path.join(output_dir, base_name)mkdir_if_not_exist(output_fasta_dir)max_i, max_s = 3, 8for db in ["bfd", "uniref30"]:if db == "bfd":     # 数据库 设置db_path = bfd_dbelif db == "uniref30":db_path = uniref30_dbelse:raise Exception(f"db name error: {db}")for i_idx in range(1, max_i+1):for s_idx in range(1, max_s+1):out_name = f"{base_name}-{db}-i{i_idx}s{s_idx}"# TODO: 搜索结果是 aln 格式output_path = os.path.join(output_fasta_dir, f"{out_name}.a3m")params_list.append((sh_file, fasta_path, output_path, db_path, i_idx, s_idx, out_name))print(f"[Info] 推理次数: {len(params_list)}")# 多进程# n_proc = min(len(params_list), 5)# pool = Pool(processes=n_proc)# list(tqdm(pool.imap(process, params_list), desc="[Info] run"))# pool.close()# pool.join()# 单进程测试for params in tqdm(params_list, desc="[Info] run"):process(params)print("[Info] 全部处理完成!")if __name__ == '__main__':main()

其他

PyCharm 其他版本下载

下载地址:PyCharm - 2022.3

参考

  • CSDN - python执行shell脚本的几种方法
  • GitHub - user guide error and unknown output format
  • GitHub - MMseqs

本文标签: PSP