Amino acid frequencies of metal ion-binding sites
We calculated the frequencies of 20 types of amino acid residues in the metal ion-binding sites. The frequencies of each amino acid in the binding sites of each metal ion are represented in Supplementary material B.
Metal ions are commonly coordinated by nitrogen, oxygen, or sulfur centers belonging to side chains of the amino acid residues of the protein, where metal ions provide empty orbitals and amino acids provide electrons. The imidazole substituents of histidine residues, the thiolate substituents of cysteine residues, and the carboxylate groups of aspartic acid and glutamic acid can provide electrons as donor groups. This is consistent with the results of the present study, which showed that metal ions preferentially bind certain residues, namely cysteine, histidine, aspartic acid, and glutamic acid. For Ca2+ and Mg2+, cysteine and histidine do not form cross-links with amino acid residues for constructing a specific structure; therefore, the frequencies of cysteine and histidine are low compared with those for other ions. Negatively charged residues (aspartic acid and glutamic acid) have high frequencies because of electrostatic interaction with Ca2+ and Mg2+. These ions also bind to oxygen atoms of the backbone. Nonpolar residues, such as leucine, isoleucine, and valine, as well as less polar amino acids, such as proline and threonine, show no preference for metal coordination. However, they still appear at nonzero frequencies because we define a metal ion-binding site not as the interacting residues but as the neighboring residues within 3.5 Å of a metal ion center.
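The neighborhood-based definition of a binding site and the frequency calculation can be illustrated with a minimal sketch. It assumes that a residue counts as part of the site when any of its atoms lies within 3.5 Å of the metal ion center; the text does not specify which atoms are considered. The function names and the toy coordinates are hypothetical and are not taken from the dataset.

```python
import math
from collections import Counter

CUTOFF = 3.5  # distance cutoff in Å, as stated in the text


def binding_site_residues(metal_xyz, atoms, cutoff=CUTOFF):
    """Return the set of (residue_number, residue_name) pairs that have
    at least one atom within `cutoff` of the metal ion center.

    `atoms` is a list of (residue_number, residue_name, (x, y, z)) tuples.
    Using any atom of the residue is an assumption of this sketch.
    """
    site = set()
    for res_num, res_name, xyz in atoms:
        if math.dist(metal_xyz, xyz) <= cutoff:
            site.add((res_num, res_name))
    return site


def residue_frequencies(sites):
    """Relative frequency of each residue type over all binding-site residues."""
    counts = Counter(name for site in sites for _, name in site)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Toy atoms around a metal ion placed at the origin (hypothetical values).
atoms = [
    (10, "HIS", (1.0, 0.0, 2.0)),  # about 2.24 Å away: inside the cutoff
    (10, "HIS", (5.0, 5.0, 5.0)),  # second atom of the same residue, far away
    (14, "CYS", (0.0, 2.3, 0.0)),  # 2.3 Å away: inside the cutoff
    (20, "LEU", (3.0, 1.0, 0.0)),  # about 3.16 Å away: a non-coordinating neighbor
    (31, "ASP", (6.0, 0.0, 0.0)),  # 6.0 Å away: outside the cutoff
]
site = binding_site_residues((0.0, 0.0, 0.0), atoms)
freqs = residue_frequencies([site])
```

In this toy case the leucine residue is counted as part of the site purely because of proximity, which is the effect described above for nonpolar residues.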
Performance of the homology-based method
In the homology-based method, we used different E-value thresholds: 0.0001, 0.001, 0.01, and 0.1. The results are shown in Supplementary material C. The number of true positive cases increases and the number of true negative cases decreases as the E-value threshold increases. The accuracy depends on the number of true positives and true negatives, and for most metal ions (all except Ca2+, Mg2+, and Ni2+), the increase in true positives is larger than the decrease in true negatives; thus, the accuracy is best at the E-value threshold of 0.1. We adopted this rather high E-value threshold because the sequences of metalloproteins binding a given metal ion are often not similar enough to be detected by BLASTP, although some kinds of motifs may exist for binding sites.
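The effect of the E-value threshold on the number of usable hits can be sketched as follows. This is only an illustration of threshold filtering, not the pipeline used in this study: the hit tuples are hypothetical, and in practice the E-values would be read from the BLASTP output.

```python
# E-value thresholds examined in the text.
THRESHOLDS = [0.0001, 0.001, 0.01, 0.1]


def queries_with_hits(hits, threshold):
    """Return the set of query IDs with at least one hit whose
    E-value is less than or equal to `threshold`.

    `hits` is a list of (query_id, subject_id, evalue) tuples.
    """
    return {query for query, _subject, evalue in hits if evalue <= threshold}


# Hypothetical hits for four query chains (not real data).
hits = [
    ("q1", "s1", 1e-30),
    ("q2", "s7", 5e-4),
    ("q3", "s2", 0.03),
    ("q3", "s9", 0.5),
    ("q4", "s4", 2.0),
]

# Number of queries that have at least one accepted hit at each threshold.
covered = {t: len(queries_with_hits(hits, t)) for t in THRESHOLDS}
```

Relaxing the threshold can only add accepted hits, so the number of queries for which a prediction is possible never decreases as the threshold grows.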
Table 2 shows the results of homology-based prediction for each of 11 metal ions when the E-value threshold is 0.1. The last column, “All,” shows the performance for all 11 metal ions (not all metal ions in natural proteins). Metal ions bound to some specific protein families have high sensitivity. In terms of accuracy, the homology-based method performed excellently, with an overall accuracy of 0.9905. It should be noted, however, that the amount of negative data, which was easier to predict, was much larger than the amount of positive data. For instance, the sensitivity was 0.3544 and the specificity was 0.9961 when an E-value of 0.1 was used to predict Ca2+ binding sites, which means that only 35.44% of actual binding sites (positive data) were correctly predicted. Because the far more numerous nonbinding sites were mostly predicted correctly, the accuracy remained high despite the low sensitivity.
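The way class imbalance inflates accuracy can be made concrete with the standard definitions of the performance measures. The confusion-matrix counts in the sketch below are hypothetical; they were chosen only so that the sensitivity and specificity match the values reported for Ca2+ (0.3544 and 0.9961) under an assumed 100:1 ratio of negatives to positives, and they are not the counts of this study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard performance measures computed from a confusion matrix."""
    sensitivity = tp / (tp + fn)            # fraction of binding sites recovered
    specificity = tn / (tn + fp)            # fraction of nonbinding sites recovered
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "mcc": mcc,
    }


# Hypothetical counts: 10,000 binding residues and 1,000,000 nonbinding residues.
metrics = binary_metrics(tp=3544, fp=3900, tn=996100, fn=6456)
```

With such counts the accuracy stays close to the specificity because the negatives dominate the total, whereas the MCC, which uses all four cells of the confusion matrix, is considerably lower.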
Performance of the machine-learning method
An SVM classifier with an RBF kernel has at least two parameters that need to be tuned for good performance: the cost parameter C, which determines the misclassification penalty; and the gamma parameter γ, which is used in the RBF kernel function. We used grid search to obtain optimal values of γ and C. As a result, we set γ and C to 0.07 and 100, respectively, to train the SVM models.
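For reference, the RBF kernel has the standard form K(x, y) = exp(−γ‖x − y‖²), and a grid search evaluates every combination of candidate parameter values. The sketch below shows both in plain Python. The grid values and the `toy_score` function are hypothetical: `toy_score` merely stands in for the cross-validated performance of a trained SVM and is constructed to peak at the values reported above, so it does not reproduce the actual search.

```python
import math
from itertools import product


def rbf_kernel(x, y, gamma):
    """RBF kernel: K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def grid_search(score, c_values, gamma_values):
    """Evaluate score(C, gamma) at every grid point and return
    (best_score, best_C, best_gamma)."""
    best = None
    for c, gamma in product(c_values, gamma_values):
        s = score(c, gamma)
        if best is None or s > best[0]:
            best = (s, c, gamma)
    return best


def toy_score(c, gamma):
    # Hypothetical stand-in for a cross-validated score; by construction
    # it is maximal at C = 100 and gamma = 0.07.
    return -abs(math.log10(c) - 2) - abs(gamma - 0.07)


best_score, best_c, best_gamma = grid_search(
    toy_score, c_values=[1, 10, 100, 1000], gamma_values=[0.01, 0.07, 0.5]
)
```

In a real setting, `toy_score` would be replaced by a routine that trains the SVM with the given C and γ and returns its cross-validated accuracy or MCC.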
In order to determine an optimal window size, we used the PSSM features of metal ion-binding proteins with window sizes of 9 to 19 to train SVM models. The results are shown in Supplementary material D. As can be seen in the results, for Ca2+ and Mg2+, the accuracy increased from window size 9 to window size 15; for Co2+, Cu+, Fe3+, Hg2+, Mn2+, and Zn2+, the accuracy increased up to window size 13; for Cu2+ and Fe2+, the accuracy increased up to window size 11; for Ni2+, the best performance was shown at window size 9. More than half of the models performed best at window size 13, and the accuracy of the others did not change much in the range between their own optimal window size and window size 13. Therefore, we selected 13 residues as the optimal window size of all 11 types of metal ion-binding proteins.
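The window-based PSSM feature can be sketched as follows. For a window of 13 residues and 20 PSSM columns, each residue is described by 13 × 20 = 260 values taken from the residue itself and its six neighbors on each side. Padding with zeros at the chain termini is an assumption of this sketch, as is the function name; the text does not state how terminal residues were handled.

```python
def window_features(pssm, window=13):
    """Build one feature vector per residue from a PSSM.

    `pssm` is a list of L rows (one per residue), each with 20 scores.
    Each feature vector concatenates the rows of the `window` residues
    centered on the target residue. Positions beyond the chain termini
    are filled with zeros (an assumption of this sketch).
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd so the residue is centered")
    half = window // 2
    n_cols = len(pssm[0])
    pad = [0.0] * n_cols
    features = []
    for i in range(len(pssm)):
        vec = []
        for j in range(i - half, i + half + 1):
            row = pssm[j] if 0 <= j < len(pssm) else pad
            vec.extend(row)
        features.append(vec)
    return features


# Toy PSSM for a 5-residue chain: every score in row k equals k + 1.
toy_pssm = [[float(k + 1)] * 20 for k in range(5)]
features = window_features(toy_pssm, window=13)
```

Changing `window` to 9, 11, 15, 17, or 19 yields the other feature sizes compared in this section.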
Table 3 shows the performance of the machine-learning (SVM) method with the PSSM feature. The accuracy was 0.8017 and the MCC was 0.61 overall, and the performance was best for Cu+, with an accuracy of 0.8846 and an MCC of 0.77. It is interesting that the accuracy for Cu+ was the worst in the homology-based method. Table 4 shows the performance of the machine-learning method with the PSSM, the amino acid type, and the side chain type features. The accuracy increased to 0.8336 and the MCC increased to 0.67 overall. The performance was best for Zn2+, with an accuracy of 0.8901 and an MCC of 0.78.
Comparison of the two methods
We compared the results of the homology-based method and the machine-learning method. Figure 4 shows the number of chains predicted by the two methods for each metal ion, and Table 5 shows the number of unpredictable chains in the homology-based method. (In the homology-based method, we used the E-value threshold of 0.01.) For all 11 types of metal ion-binding protein, the numbers of chains predicted by the SVM method were larger than those predicted by the homology-based method, in which only 78% of chains hit homologous sequences. For instance, the Cu+ binding sites of three chains (PDB IDs: 4BZ4-A, 4MAI-A, 5FJE-B) were successfully predicted by the SVM method but not by the homology-based method. The SVM method predicts Ca2+-, Mg2+-, and Zn2+-binding chains better than the homology-based method. Since Ca2+- and Mg2+-binding proteins exist in various families, the sensitivity of the homology-based method is low. Zn2+-binding sites can often be represented as motifs, and their sequence features tend to be local. The SVM method can recognize these features, while the homology-based method cannot align these sequence patterns.
Figure 4. Number of chains predicted by each method for each metal ion. Blue bars represent the number of chains predicted by the homology-based method, and red bars represent the number of chains predicted by the SVM method. The SVM method shows higher sensitivity for all metal ions.
Table 5 summarizes the performances of the homology-based method and the machine-learning method. The high accuracy and low sensitivity of the homology-based method were caused by the imbalance between negative and positive data. By contrast, the SVM method, which achieved a balanced performance in accuracy, sensitivity, and specificity, was more effective.
We applied Student’s t-test to compare the two methods used in this study. The p-values for accuracy, sensitivity, specificity, precision, and MCC were 1.551 × 10^-5, 3.840 × 10^-8, 8.556 × 10^-6, 1.684 × 10^-1, and 1.765 × 10^-2, respectively. At a significance level of 0.05, all performance measures except precision differed significantly between the two methods.
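As a reminder of what the test computes, the sketch below evaluates the two-sample Student's t statistic with pooled variance. Whether the test in this study was paired or unpaired is not stated, so the unpaired equal-variance form is an assumption here, and the sample values are toy numbers rather than the per-ion performance measures. Converting the statistic to a p-value requires the t distribution, for which a statistics package such as SciPy would typically be used.

```python
import math
from statistics import mean, variance


def student_t(a, b):
    """Two-sample Student's t statistic with pooled variance.

    Returns (t, degrees_of_freedom). Assumes independent samples with
    equal variances, which is an assumption of this sketch.
    """
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


# Toy samples (hypothetical, not results from this study).
t_stat, dof = student_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

In the comparison above, each sample would hold one performance measure (for example, sensitivity) across the 11 metal ions for one of the two methods.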
Comparison with other work
We also compared our results with those of Kumar’s method (Kumar et al., 2017), which used simplified amino acid alphabets and a random forest model (Table 6). Our method, which uses an SVM model, covers three types of metal ion-binding protein (Cu+-, Fe2+-, and Hg2+-binding proteins) that are not available in Kumar’s method. For Hg2+, however, our dataset contains Hg2+ ions introduced by soaking for X-ray crystal structure analysis, and the results are therefore not clear. When the same types of metal ion-binding proteins were compared, a considerable increase in accuracy was observed for the Zn2+-, Mn2+-, and Fe3+-binding proteins. For the Ca2+-, Co2+-, Cu2+-, and Mg2+-binding proteins, no apparent difference in accuracy from Kumar’s method was observed. For Ni2+-binding proteins, the performance of our method was unsatisfactory.