Title:
CLEAN: Enzyme Function Prediction Using Contrastive Learning
Institution:
Huimin Zhao's group, University of Illinois Urbana-Champaign
Original Abstract
Enzyme function annotation is a fundamental challenge, and numerous computational tools have been developed. However, most of these tools cannot accurately predict functional annotations, such as enzyme commission (EC) number, for less-studied proteins or those with previously uncharacterized functions or multiple activities. We present a machine learning algorithm named CLEAN (contrastive learning–enabled enzyme annotation) to assign EC numbers to enzymes with better accuracy, reliability, and sensitivity compared with the state-of-the-art tool BLASTp. The contrastive learning framework empowers CLEAN to confidently (i) annotate understudied enzymes, (ii) correct mislabeled enzymes, and (iii) identify promiscuous enzymes with two or more EC numbers—functions that we demonstrate by systematic in silico and in vitro experiments. We anticipate that this tool will be widely used for predicting the functions of uncharacterized enzymes, thereby advancing many fields, such as genomics, synthetic biology, and biocatalysis.
Chinese version: Science Classic | AI Enzyme Function Prediction with CLEAN
Commentary
🌍 The Hidden Code of Life — And Why It’s So Hard to Read
Every living organism depends on enzymes, the molecular machines that speed up nearly every chemical reaction in biology. But even after decades of sequencing efforts, more than half of all known proteins still have unknown functions. Scientists often call this the “sequence–function gap.”
Traditional bioinformatics tools, such as BLAST or motif searches, can only guess an enzyme's role when similar sequences already exist in databases. Yet the explosion of metagenomic data has flooded researchers with novel proteins that look nothing like any known enzyme.
That's where AI comes in. Deep learning models can detect patterns beyond human perception. But even with tools like AlphaFold, identifying what a protein does remains far harder than predicting what it looks like.
Huimin Zhao's team at the University of Illinois, one of the pioneers in enzyme engineering, decided to tackle this challenge head-on. Their solution: CLEAN, a new AI model that learns enzyme functions through contrastive learning.
🧠 What Is Contrastive Learning, and Why Does It Matter?
Imagine showing an AI two images: one of a cat, one of a dog. The AI learns not only what cats look like, but also what distinguishes them from dogs.
That's contrastive learning: instead of memorizing, it compares.
In the case of CLEAN (contrastive learning–enabled enzyme annotation), the model is trained to pull together the representations of enzymes that share an EC number while pushing apart those with different functions.
This approach helps the AI focus on the essence of an enzyme's behavior: the subtle patterns in amino acid sequences that determine which reaction it catalyzes.
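To make this concrete, here is a minimal PyTorch sketch of a triplet-style contrastive loss of the kind CLEAN's objective builds on; the margin, tensor shapes, and names are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def triplet_contrastive_loss(anchor, positive, negative, margin=1.0):
    """Pull each anchor toward an enzyme with the same EC number (positive)
    and push it away from one with a different EC number (negative).
    The margin is an illustrative hyperparameter, not the paper's value."""
    d_pos = F.pairwise_distance(anchor, positive)  # distance to same-function enzyme
    d_neg = F.pairwise_distance(anchor, negative)  # distance to different-function enzyme
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()

# Toy usage: a batch of 8 embeddings, 128 dimensions each
anchor, positive, negative = (torch.randn(8, 128) for _ in range(3))
loss = triplet_contrastive_loss(anchor, positive, negative)
```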
👉 (see Fig. 1 in the paper for an overview of the CLEAN architecture and contrastive learning framework)
⚙️ Inside the CLEAN Model: Learning Biochemical Intuition
CLEAN builds upon the success of protein language models (PLMs), neural networks trained on millions of sequences; its input representations come from the pretrained ESM-1b model. But unlike typical PLMs that only learn sequence grammar, CLEAN goes one step further.
The model is trained with a supervised contrastive objective: each anchor sequence is paired with a positive (an enzyme sharing its EC number) and a negative (an enzyme with a different EC number), and hard negatives (those whose embeddings lie close to the anchor) are prioritized during sampling.
Through this process, CLEAN learns sequence embeddings that naturally cluster enzymes by biochemical function.
At prediction time there is no classifier head: each EC number is represented by the average embedding of its training sequences, and a query is assigned the EC numbers whose cluster centers lie closest in Euclidean distance.
👉 (see Fig. 1, A and B, for the training workflow and the cluster-center prediction scheme)
📊 How Well Does It Work?
The authors evaluated CLEAN on independent datasets excluded from model development: New-392 (392 Swiss-Prot enzymes released after training) and Price-149 (149 experimentally validated sequences that automated pipelines had mislabeled), plus an in-house set of 36 uncharacterized halogenases.
Compared with BLASTp and recent deep learning models (ProteInfer, DeepEC, DEEPre, CatFam, ECPred), CLEAN demonstrated substantial gains in accuracy and generalization:
An F1 score of 0.499 on New-392, versus 0.309 for ProteInfer and 0.230 for DeepEC
An F1 score of 0.495 on the challenging Price-149 set, about 3-fold higher than ProteInfer and 5.8-fold higher than DeepEC
Stable accuracy on rare EC numbers with five or fewer training examples, where classification models degrade
These results suggest CLEAN has learned embeddings that track catalytic function rather than mere sequence similarity.
👉 (see Fig. 2 for quantitative comparisons, including accuracy binned by how often each EC number appears in training)
🔬 Why CLEAN Is a Game Changer
Bridging the sequence–function gap: CLEAN provides a new way to infer enzyme activity from raw sequence data, even for uncharacterized proteins.
Boosting enzyme engineering: Its embeddings can guide mutation design or identify candidate enzymes for biocatalysis.
Accelerating synthetic biology: By linking sequence features to function, it enables rational pathway design — faster discovery of enzymes for green chemistry and pharmaceuticals.
👉 (Fig. 3 presents case studies where CLEAN corrects the misannotated SAM hydrolase MJ1651 and identifies the promiscuous enzyme SsFlA)
🧩 Beyond CLEAN: A Step Toward Autonomous Bio-AI Systems
Looking ahead, one can envision combining CLEAN with generative models such as ProteinMPNN or diffusion-based protein designers. In such a hybrid setup, one model designs enzyme variants while CLEAN evaluates their functional plausibility, forming a self-improving loop.
This could transform enzyme discovery from a months-long lab process into an automated, AI-driven pipeline.
💡 Takeaway
CLEAN demonstrates that contrastive learning is not just for images; it can teach machines to understand the language of enzymes. By learning how proteins differ in function, AI begins to grasp the chemistry of life itself.
🧾 Reference
Yu, T., Cui, H., Li, J. C., Luo, Y., Jiang, G., & Zhao, H. (2023). Enzyme function prediction using contrastive learning. Science, 379(6639), 1358–1363. https://doi.org/10.1126/science.adf2465
Original Text
The development of DNA sequencing technologies, and particularly genomics and metagenomics tools, has led to the discovery of numerous protein sequences from organisms across all branches of life. For example, UniProt Knowledgebase has cataloged ~190 million protein sequences. However, only <0.3% (approximately half a million) of these proteins were reviewed by human curators, out of which <19.4% are supported by clear experimental evidence (1). Consequently, protein function annotation is highly dependent on computational annotation methods. However, the study on large-scale, community-based critical assessment of protein function annotation (CAFA) found that ~40% of the automatically annotated enzymes using existing computational tools are incorrectly annotated (2). Therefore, functional annotation of proteins remains an overwhelming challenge in protein science. Particularly, the inequality in protein annotation of understudied and promiscuous proteins has impeded biomedical progress and drug discovery (3, 4).
Enzyme commission (EC) number is the most well-known numerical classification scheme of enzymes, which specifies the catalytic function of an enzyme by four digits. Because experimental characterization of the function of a target enzyme is often laborious and expensive, numerous computational tools for enzyme function annotation have been developed (1, 5, 6). They include but are not limited to sequence similarity–based (7–9), homology-based (10, 11), structure-based (12, 13), and machine learning (ML)–based (14, 15) approaches. Among them, sequence similarity–based Basic Local Alignment Search Tools for proteins (BLASTp) is the most widely used tool (7). However, BLASTp and other alignment tools annotate functions based solely on sequence similarity, making the prediction result less reliable when sequence similarity is low. On the other hand, almost all the existing ML models, such as DeepEC (14) and ProteInfer (15), are based on a multilabel classification framework and suffer from the limited and imbalanced training dataset that is common in biology. Therefore, a robust tool with better accuracy and EC coverage is required to unlock the potential of currently uncharacterized proteins and to understand the range of protein functions.
In this work, we report a ML model named CLEAN (contrastive learning–enabled enzyme annotation) for enzyme function prediction. CLEAN was trained on high-quality data from UniProt, taking amino acid sequence as input and outputting a list of enzyme functions (EC numbers as the example) ranked by the likelihood. To validate the accuracy and robustness of CLEAN, we performed extensive in silico experiments. Furthermore, we challenged CLEAN to annotate EC numbers for an in-house collected database of all uncharacterized halogenases (36 in total) followed by case studies as in vitro experimental validation. CLEAN outperformed other EC number annotation tools at these tasks, including BLASTp and state-of-the-art ML models.
Model development and evaluation
Unlike previously developed ML algorithms that frame EC number prediction tasks as a multilabel classification problem, CLEAN used a contrastive learning (16, 17) framework. Our training objective is to learn an embedding space of enzymes where the Euclidean distance reflects the functional similarities. The embedding refers to a numerical representation (vectors or matrices) of protein sequence that is readable by machine while still retaining the important features and information carried by the enzyme. In CLEAN’s task, the amino acid sequences with the same EC number have a small Euclidean distance, whereas sequences with different EC numbers have a large distance. Contrastive losses were used to train the model with supervision (16, 18). During the training process (Fig. 1A), each reference sequence (anchor) in the training dataset was sampled with a sequence with the same EC number (positive) and a sequence with a different EC number (negative). Aiming to facilitate training efficiency by providing the model with challenging negative samples—instead of drawing them randomly—negative sequences with embeddings that had a small Euclidean distance with the anchor were prioritized.
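A minimal sketch of this hard-negative prioritization, assuming precomputed candidate embeddings and EC labels (the function and variable names are illustrative, not CLEAN's released code):

```python
import torch

def mine_hard_negative(anchor_emb, candidate_embs, candidate_ecs, anchor_ec):
    """Return the index of the negative whose embedding lies closest to the
    anchor, restricted to sequences with a different EC number."""
    mask = torch.tensor([ec != anchor_ec for ec in candidate_ecs])
    dists = torch.cdist(anchor_emb.unsqueeze(0), candidate_embs).squeeze(0)
    dists[~mask] = float("inf")        # exclude same-EC (positive) sequences
    return torch.argmin(dists).item()  # hardest admissible negative
```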
In the training stage, the protein representation obtained from the language model ESM1b (19) was used as the input of a feedforward neural network, whose output layer produced a refined, function-aware embedding of the input protein. The learning objective is a contrastive loss function that minimizes the distance between the anchor and the positive while maximizing the distance between the anchor and the negative. When making predictions, the representation of an EC number cluster center was obtained by averaging the learned embeddings of all sequences in the training set belonging to that EC number (Fig. 1B). Subsequently, the pairwise distances between the query sequence and all EC number cluster centers were calculated. EC numbers of clusters that are significantly close to the query sequence are predicted as the EC numbers for the input protein (supplementary text, section 1).
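The cluster-center prediction step can be sketched in a few lines; the function names here are hypothetical, and the confidence calibration described later is omitted:

```python
import torch
from collections import defaultdict

def build_ec_centers(train_embs, train_ecs):
    """Average the learned embeddings of all training sequences sharing an
    EC number to obtain that EC number's cluster center (as in Fig. 1B)."""
    groups = defaultdict(list)
    for emb, ec in zip(train_embs, train_ecs):
        groups[ec].append(emb)
    return {ec: torch.stack(embs).mean(dim=0) for ec, embs in groups.items()}

def rank_ec_numbers(query_emb, ec_centers):
    """Rank EC numbers by Euclidean distance between the query embedding and
    each cluster center; smaller distance means a more likely function."""
    dists = {ec: torch.dist(query_emb, c).item() for ec, c in ec_centers.items()}
    return sorted(dists.items(), key=lambda kv: kv[1])
```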
The database used for model development and evaluation was a universal protein knowledgebase UniProt (1). Two EC selection methods were developed to predict confident EC numbers from the output ranking (Fig. 1C): (i) a greedy approach that selects EC numbers that have the maximum separation (stand out) from other EC numbers in terms of the pairwise distance to the query sequence and (ii) a P value–based method that identifies EC numbers with statistical significance compared with background (see materials and methods). On a train-test split in which none of the enzymes in the test set share >50% identity with any enzymes in the training set, using the maximum-separation selection method, CLEAN achieved a 0.865 F1 score—a commonly used accuracy metric indicating the harmonic mean of precision and recall. Even at 10% sequence identity clustering, CLEAN reached a 0.67 F1 score. Additionally, CLEAN achieved much higher performance compared with the baseline method using ESM-1b without contrastive learning (fig. S1).
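A sketch of the greedy maximum-separation idea, reusing the ranking produced by `rank_ec_numbers` above; the candidate cutoff is an assumption, and the paper's exact criterion (supplementary text, section 1) may differ in detail:

```python
def max_separation_select(ranked, n_candidates=10):
    """Among the closest EC numbers, find the widest gap between consecutive
    distances and return every EC number before that gap."""
    top = ranked[:n_candidates]
    if len(top) < 2:
        return [ec for ec, _ in top]
    gaps = [top[i + 1][1] - top[i][1] for i in range(len(top) - 1)]
    split = gaps.index(max(gaps))             # position of the widest gap
    return [ec for ec, _ in top[:split + 1]]  # ECs that "stand out"

# Hypothetical distances: the first EC stands out, so only it is returned.
# max_separation_select([("3.13.1.8", 0.9), ("2.5.1.94", 4.1), ("2.5.1.63", 4.3)])
# -> ["3.13.1.8"]
```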
Fig. 1. The contrastive learning–based framework of CLEAN. (A) During training, positives and negatives were sampled on the basis of EC numbers. The input sequences were embedded and passed through a neural network. The series of squares with warm colors stands for the representation of input sequence embedded by ESM-1b. Similarly, the sequence embeddings obtained from the supervised contrastive learning neural network are illustrated by cool colors. (B) The representations of an EC number are obtained by averaging the representations of enzymes under this EC number. When predicting the EC number, the query sequence embedding was compared with each EC number’s representation (shown as a parallelogram with cool colors) to obtain the pairwise Euclidean distance between the query sequence and each EC number. The distance reflects the similarity between EC numbers and the query sequence. (C) When used as a classification model, two methods, maximum separation (above) and P value (below), were implemented to prioritize confident predictions of EC numbers from the ranking order.
Benchmarking CLEAN with previous EC number annotation tools
After training, the prediction performance of CLEAN was systematically investigated by comparing it with six state-of-the-art EC number annotation tools [i.e., ProteInfer (15), DeepEC (14), BLASTp, DEEPre (20), CatFam (21), and ECPred (22)]. Two independent datasets not included in any model’s development were used to deliver a fair and rigorous benchmark study. The first dataset, New-392, consisted of 392 enzyme sequences covering 177 different EC numbers, containing data from Swiss-Prot released after CLEAN was trained (April 2022). The prediction scenario represented a practical situation, where the labeled knowledgebase was the Swiss-Prot database and functions of query sequences were unknown. Overall, CLEAN resulted in the highest value in various multilabel accuracy metrics, including precision (0.597) and recall (0.481), when compared with ProteInfer and DeepEC (Fig. 2A). Also, CLEAN achieved an F1 score of 0.499, whereas ProteInfer and DeepEC had scores of 0.309 and 0.230, respectively.
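For reference, one common micro-averaged convention for multilabel precision, recall, and F1 over per-enzyme EC sets is sketched below; the paper's evaluation script may differ in its averaging details:

```python
def multilabel_prf1(pred_sets, true_sets):
    """Micro-averaged precision/recall/F1 where each enzyme contributes a
    set of predicted and a set of true EC numbers."""
    tp = fp = fn = 0
    for pred, true in zip(pred_sets, true_sets):
        tp += len(pred & true)   # correctly predicted EC numbers
        fp += len(pred - true)   # spurious predictions
        fn += len(true - pred)   # missed annotations
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```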
The second independent dataset, denoted as Price-149, was a set of experimentally validated results described by Price et al. (23). The Price-149 dataset was first curated by ProteInfer (15) as a challenging dataset because the existing sequences were determined to be incorrectly or inconsistently labeled in databases like Kyoto Encyclopedia of Genes and Genomes (KEGG) by automated annotation methods. Again, CLEAN achieved the highest F1 score (0.495) compared with BLASTp, ProteInfer, and DeepEC (Fig. 2B). Notably, in this challenging task, CLEAN had a 3.0-fold higher F1 score than ProteInfer (0.166) and an almost 5.8-fold higher score than DeepEC (0.085). The evaluations on the New-392 and Price-149 datasets demonstrate that CLEAN is more precise and reliable than previously developed ML-based models for predicting functions for newly discovered proteins, especially the ones without known enzyme functions.
Understanding CLEAN's performance on annotating understudied EC numbers
Next, we investigated why CLEAN performs better than other ML models on understudied EC numbers. We curated a validation dataset with enzymes from rare EC numbers to test our hypothesis that, compared with the multilabel classification framework, contrastive learning could better handle the imbalanced nature of EC numbers, where some EC numbers have thousands of enzyme examples and some only have very few (less than five). In this validation dataset, each type of EC number had no more than five occurrences, and more than 3000 samples were included in this dataset covering more than 1000 different EC numbers. Note that ProteInfer and DeepEC were evaluated using their released pretrained models; thus, our curated validation set appeared during both models’ training process. In other words, both ProteInfer and DeepEC had an advantage that both models have seen the validation dataset used in Fig. 2C during training, resulting in the acceptable 0.625 to 0.782 F1 score. Despite this added advantage, CLEAN outperformed both methods, achieving a 0.817 F1 score (Fig. 2C).
We analyzed CLEAN’s performance based on the number of times that the EC number occurred in the training set. Even at 50% sequence identity clustering, where the test set and train set had a low similarity, CLEAN’s performance did not drop considerably when the number of training examples was scarce (Fig. 2D). With the given results, the two independent datasets (New-392 and Price-149) were combined and revisited. As shown in Fig. 2E, the accuracy performance was studied separately based on the number of times that EC numbers appeared in the training set. As expected, ProteInfer and DeepEC showed a bias toward popular EC numbers, limited by the classification framework. By contrast, CLEAN showed the most superiority in predicting understudied functions and maintained high accuracy regardless of the EC occurrences. The challenge posed by the biased dataset to the classification model was the lack of positive examples for understudied EC numbers. As a result, classification models can hardly learn from the limited positive examples. To further analyze the hypothesis that CLEAN can leverage not only positive examples but also negative examples through contrastive learning, Supcon-Hard loss (SupconH)—a loss function that samples more negatives compared with triplet loss—was implemented (materials and methods; supplementary text, section 2; and fig. S2).
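A minimal supervised-contrastive (SupCon-style) loss, illustrating how each anchor can learn from many positives and negatives at once; this is a generic formulation with an illustrative temperature, not the paper's exact SupconH implementation:

```python
import torch
import torch.nn.functional as F

def supcon_loss(embs, labels, temperature=0.1):
    """Every same-EC pair in the batch is a positive, all others negatives,
    so one anchor sees many negatives per update. `labels` is a 1D tensor
    of integer-encoded EC numbers."""
    embs = F.normalize(embs, dim=1)
    sim = embs @ embs.T / temperature                    # pairwise similarities
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    pos_mask.fill_diagonal_(0)                           # drop self-pairs
    logits = sim - sim.max(dim=1, keepdim=True).values.detach()  # numeric stability
    exp = torch.exp(logits) * (1 - torch.eye(len(embs)))
    log_prob = logits - torch.log(exp.sum(dim=1, keepdim=True))
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)        # avoid divide-by-zero
    return -((pos_mask * log_prob).sum(dim=1) / pos_counts).mean()
```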
Moreover, we implemented a method to quantify the prediction result confidence. We fitted a two-component Gaussian mixture model (GMM) on the distribution of the Euclidean distances between enzyme sequence embeddings and EC number embeddings (materials and methods). Knowing the prediction confidence, researchers can make quantitative interpretations of CLEAN’s prediction. The confidence quantification can also help CLEAN to avoid overprediction by reporting the third level of EC number when the confidence is low (figs. S11 to S14 and supplementary text, section 3).
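A sketch of this confidence estimate with scikit-learn's GaussianMixture; treating the lower-mean component as the "correct annotation" mode is an assumption of this sketch:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_confidence_model(distances):
    """Fit a two-component GMM on enzyme-to-EC-center Euclidean distances."""
    gmm = GaussianMixture(n_components=2, random_state=0)
    gmm.fit(np.asarray(distances).reshape(-1, 1))
    true_comp = int(np.argmin(gmm.means_.ravel()))  # smaller-mean component
    return gmm, true_comp

def prediction_confidence(gmm, true_comp, distance):
    """Posterior probability that a query-to-EC distance belongs to the
    small-distance ('true annotation') component."""
    return float(gmm.predict_proba(np.array([[distance]]))[0, true_comp])
```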
Experimental validation
Next, we sought to validate the prediction accuracy of CLEAN in assigning EC numbers using halogenases as a proof-of-concept study. Halogenases have been increasingly used for biocatalytic C-H functionalization because of their excellent catalyst-controlled selectivity (23, 24, 25). Generally, small molecules with halogen atoms produced by halogenases have promising bioactivity and physicochemical properties, thereby offering broad application in pharmaceutical and agrochemical fields (24, 26, 27). To date, 36 incompletely annotated halogenases have been identified from UniProt, covering all four types of halogenases [haloperoxidase, flavin-dependent, α-ketoglutarate (α-KG)–dependent, and S-adenosyl-L-methionine (SAM)–dependent halogenase] (Fig. 3A and table S2). These halogenases were either labeled with uncharacterized and/or hypothetical proteins in UniProt or had conflicting annotations in the literature. The halogenase dataset is particularly challenging because the halogenase family is understudied, and only a limited number of halogenases are available in the database. With expert curation and the experimental validations shown later, all 36 halogenases were confidently annotated with EC numbers. Overall, CLEAN achieved much better prediction accuracy (86.7 to 100%; Fig. 2F and Fig. 3A) compared with the six other commonly used computational tools (e.g., ~11.1% in DeepEC and 11.1 to 61.1% in ProteInfer). The latter range corresponds to the prediction accuracy at different digits of EC number (from digit 1 to digit 4). These results demonstrate that CLEAN can distinguish enzyme functions even within the regime of similar biocatalytic reactions.
Among these 36 halogenases, three enzymes named MJ1651, TTHA0338, and SsFlA showed conflicting functions according to the comparison between literature (28–30) and the description in UniProt. CLEAN predicted new EC numbers in these three cases, suggesting that other potential functions might occur. Therefore, we performed in vitro experiments to validate these predictions. High-performance liquid chromatography–mass spectrometry (HPLC-MS) analysis coupled with enzyme kinetic analysis confirmed that MJ1651 is a SAM hydrolase (EC 3.13.1.8), as CLEAN predicted, rather than chlorinase (EC 2.5.1.94) or fluorinase (EC 2.5.1.63), as mislabeled in UniProt and by the selected computational tools used in this work (Fig. 3, C, D, F, G, and M; fig. S3; fig. S4, A and B; fig. S5A; fig. S7; and table S3). CLEAN also correctly annotated TTHA0338, which belongs to the DUF62 Pfam family with no known function, as a SAM hydrolase (Fig. 3, C, D, H, and N; figs. S5B and S7; and table S3). With the exception of BLASTp, which successfully predicted TTHA0338, all six other commonly used computational tools failed to predict MJ1651 and TTHA0338. These results revealed that CLEAN is favorable for correcting mislabeled enzymes and accurately identifying understudied catalytic functions. CLEAN also confidently identified the promiscuous enzyme SsFlA with three EC numbers (EC 2.5.1.63, EC 2.5.1.94, and EC 3.13.1.8; Fig. 3, E, I to K, and O to Q). These observations confirmed that CLEAN could effectively recall defined biological activity and capture elements of enzyme promiscuity. The precision of CLEAN is impressive in distinguishing SAM-binding proteins with homologous structures (fig. S3C) and sequence identity ranging from 20.5 to 35.7% for everything but SsFlA versus ScFlA, which is 87.6% (Fig. 3B and fig. S6). Functions of proteins with sequence identities in this range are often challenging to predict. These results suggest that our sequence-based model CLEAN performed better than structure-based methods [e.g., COFACTOR (12, 13)] in dealing with enzymes with similar structures but different functions.
Discussion
Through systematic in silico and in vitro experimental validations, we have demonstrated that CLEAN achieves superior prediction performance relative to six state-of-the-art tools (i.e., ProteInfer, BLASTp, DeepEC, DEEPre, COFACTOR, and ECPred). A comprehensive analysis on an uncharacterized halogenase dataset indicated that CLEAN can characterize the hypothetical proteins and correct mislabeled proteins, where most sequence-, structure-, and ML-based annotation tools predict incorrectly or are unable to produce a prediction. Identifying enzyme promiscuity is essential for improving the performance of existing enzymes (3, 31), which can be effectively achieved by CLEAN (e.g., SsFlA with three functions). Unlike classification models, contrastive learning is more suitable for biological data, which is usually imbalanced or biased and scarce.
We believe that CLEAN will be a powerful tool for predicting the catalytic function of query enzymes, which can greatly facilitate studies in functional genomics (32), enzymology, enzyme engineering (33), synthetic biology (34), metabolic engineering (35, 36), and retrobiosynthesis (37, 38). Moreover, the general language model representation topped with the contrastive learning workflow used by CLEAN can readily be adapted to other prediction tasks not limited to enzymatic activities, such as functional catalogue (FunCat) and gene ontology (GO). The user-friendly feature of our framework allows CLEAN to be used as an independent tool in a high-throughput manner and as a software component integrated into other computational platforms. The superior performance of CLEAN in predicting understudied proteins should greatly expand the bioinformatics toolbox, thereby laying the cornerstone for future detailed mechanistic studies.
Fig. 2. Quantitative comparison of CLEAN with the state-of-the-art EC number prediction tools. (A) Evaluation of CLEAN's performance toward three multilabel accuracy metrics (precision, recall, and F1 score) examined on the New-392 database. Four top-ranked models, ProteInfer, DeepEC, CatFam, and ECPred, were used for comparison. (B) Comparison of CLEAN, BLASTp, ProteInfer, DeepEC, DEEPre, CatFam, and ECPred on the Price-149 database. (C) Comparison of CLEAN, ProteInfer, and DeepEC on a dataset of underrepresented EC numbers. (D) The accuracy binned plot of CLEAN using the test set with <50% identity to the training set evaluated with SupconH loss. Precision and recall values were binned by the number of times that the EC number appeared in the training set—i.e., the bin (0,5] means that the EC number occurs no more than five times in the training set. The box plots show the results of fivefold cross-validation. (E) Evaluation on the combined datasets of Price-149 and New-392 binned by the number of times that the EC number appeared in CLEAN's training dataset. (F) Prediction accuracy of CLEAN on an in-house-curated halogenase dataset compared with six commonly used tools (BLASTp, ProteInfer, DeepEC, DEEPre, ECPred, and COFACTOR). This dataset had good diversity covering 11 different EC numbers.
Fig. 3. Experimental validation of CLEAN on uncharacterized halogenases. (A) The accuracy degree heatmap of EC numerical ID was shown for the 36 identified halogenases. (B) Heatmap of sequence identity among the uncharacterized proteins and positive control (PC) enzymes. The color bar with the “viridis” color scale indicates percentage. (C) The SAM hydroxide adenosyltransferase MJ1651-TTHA0338 reaction. (D) Structural superposition of the three-dimensional (3D) structures of uncharacterized proteins MJ1651 [Protein Data Bank (PDB) ID: 2F4N (28)], TTHA0338 [PDB ID: 2CW5 (39)], and positive control enzyme PH0463 [PDB ID: 1WU8 (40)]. The same structural superposition was performed for SsFlA [PDB ID: 5B6I (30)], SalL [PDB ID: 2Q6O (41)], and ScFlA [PDB ID: 1RQR (42)]. The superposition shows that the 3D structures of these SAM-binding enzymes are very similar; yet, CLEAN can accurately distinguish their functions. Chain A in each crystal structure was used for structural superposition. (E) Nucleophilic substitution of SAM with halide ions or H2O toward SsFlA. (F to K) HPLC analysis of reaction mixtures of SAM and NaCl/NaF/H2O with blank (F), purified MJ1651 (G), purified TTHA0338 (H), and purified SsFlA [(I) to (K)]. The peaks of substrate SAM (1), product adenosine (2), 5′-fluoro-5′-deoxyadenosine (5′-FDA) (3), and 5′-chloro-5′-deoxyadenosine (5′-ClDA) (4) were labeled with light yellow, orange, green, and dark green, respectively, which were also aligned at the same retention time. UV, ultraviolet; mAU, milli–absorbance unit. (L to Q) Mass spectra of compounds obtained from the reaction mixtures: substrate 1 in the blank reaction system (L), adenosine (2) in MJ1651 catalyzed reaction (M), adenosine (2) in TTHA0338 catalyzed reaction (N), 5′-FDA (3) (O), 5′-ClDA (4) (P), and adenosine (2) (Q). m/z, mass/charge ratio.
Chronicling the people and stories of AI protein design behind the Nobel Prize