50 years of amino acid hydrophobicity scales: revisiting the capacity for peptide classification

Background Physicochemical properties are frequently analyzed to characterize protein-sequences of known and unknown function. Especially the hydrophobicity of amino acids is often used for structural prediction or for the detection of membrane associated or embedded β-sheets and α-helices. For this purpose many scales classifying amino acids according to their physicochemical properties have been defined over the past decades. In parallel, several hydrophobicity parameters have been defined for calculation of peptide properties. We analyzed the performance of separating sequence pools using 98 hydrophobicity scales and five different hydrophobicity parameters, namely the overall hydrophobicity, the hydrophobic moment for detection of the α-helical and β-sheet membrane segments, the alternating hydrophobicity and the exact ß-strand score. Results Most of the scales are capable of discriminating between transmembrane α-helices and transmembrane β-sheets, but assignment of peptides to pools of soluble peptides of different secondary structures is not achieved at the same quality. The separation capacity as measure of the discrimination between different structural elements is best by using the five different hydrophobicity parameters, but addition of the alternating hydrophobicity does not provide a large benefit. An in silico evolutionary approach shows that scales have limitation in separation capacity with a maximal threshold of 0.6 in general. We observed that scales derived from the evolutionary approach performed best in separating the different peptide pools when values for arginine and tyrosine were largely distinct from the value of glutamate. Finally, the separation of secondary structure pools via hydrophobicity can be supported by specific detectable patterns of four amino acids. Conclusion It could be assumed that the quality of separation capacity of a certain scale depends on the spacing of the hydrophobicity value of certain amino acids. Irrespective of the wealth of hydrophobicity scales a scale separating all different kinds of secondary structures or between soluble and transmembrane peptides does not exist reflecting that properties other than hydrophobicity affect secondary structure formation as well. Nevertheless, application of hydrophobicity scales allows distinguishing between peptides with transmembrane α-helices and β-sheets. Furthermore, the overall separation capacity score of 0.6 using different hydrophobicity parameters could be assisted by pattern search on the protein sequence level for specific peptides with a length of four amino acids. Electronic supplementary material The online version of this article (doi:10.1186/s40659-016-0092-5) contains supplementary material, which is available to authorized users.


Background
Hydrophobicity as a physicochemical property is frequently used to characterize secondary structures of proteins. Early on it was noted that this property of amino acids dominates the initial interactions during protein folding [1,2]. In addition, the physicochemical properties of secondary structures depend on the properties of their amino acids and differ in relation to the native environment of the secondary structure, e.g., in solution or in membranes [3][4][5]. Considering this, it is not of surprise that the classification and characterization of amino acids according to their hydrophobicity attracted much attention.
In 1962 the first hydrophobicity scale of amino acids was formulated [6]. In addition, a first model to calculate the difference in free energy for the unfolded and native form of the protein catalase in solution was established [6]. Ever since many "hydrophobicity scales" were published. However, not all of these scales focus exclusively on hydrophobicity, but we will continue using this term. The information about hydrophobicity for the amino acids were extracted from biochemical experiments [7], distributions of amino acids in different protein classes [8,9], the capacity of amino acids to participate in hydrophobic or hydrophilic milieu [10,11] or from in silico calculations [12]. Today, about 98 "hydrophobicity scales" exist which contain a defined hydrophobicity value for each of the 20 amino acids. A high variance between these scales can be expected due to the variance of the underlying experimental approaches.
At the same time many hydrophobicity parameters for peptide classification have been developed for specific purposes. The overall hydrophobicity was introduced to globally classify peptides [6]. In addition, a hydrophobic moment for detection of the helical membrane segments [13], the alternating hydrophobicity for detection of membrane embedded ß-sheets [14,15] or exact ß-strand score (EBSS) considering the frequency of amino acids pointing inward or outward of a ß-barrel [16] has been defined.
In parallel many alternative algorithms and methods have been developed to predict protein properties based on hydrophobicity scales and classify them concerning environment (soluble, transmembrane) or function. Among them are routines for the prediction of transmembrane regions [17][18][19][20] or protein folding [21][22][23][24][25]. Even today, the hydrophobicity scales are often used to define properties of peptides within proteins [26][27][28][29]. However, the wealth of hydrophobicity scales complicates the process of scale selection and of the parameters to be calculated.
Thus, 50 years after formulation of the first scale we analyzed 98 different hydrophobicity scales present in the literature [22,30,31]. We used the overall hydrophobicity, the hydrophobic moment for detection of α-helical and β-sheet transmembrane elements, the alternating hydrophobicity and the EBSS as parameters to evaluate their influences on the separation on different secondary structure pools. For the analysis of the different scales and parameters we developed a five dimensional consensus approach to define the quality of the combinatory usage. Finally, we clustered the hydrophobicity scales to classify their performance for general separation capacity of secondary structures, environmental specifications or subsets thereof. We found that the overall performance of the hydrophobicity scales is rather comparable irrespective of the strategy of generation. However, the application of more than one hydrophobicity parameter enhances the capacity of the pool separation, but the alternating hydrophobicity has the lowest impact on the separation capacity when compared with the other four parameters. In general hydrophobicity is suitable to classify transmembrane α-helices and β-sheets better than peptides with other secondary structures. However, specific pattern of four or five amino acids were identified in the different peptide pools analyzed.

Sequence pools, hydrophobicity scales and parameter selection
Different sequence pools were generated to study the separation capacity of hydrophobicity scales and hydrophobicity parameters. To this end, sequences of proteins with known structure were extracted from the ASTRAL40 (http://scop.berkeley.edu/astral/) database [32] and dissected in sequences with exclusive α-helical, ß-sheet and random coil (random) content. The α-helical and ß-sheet sequences were further separated in pools representing transmembrane segments (tm-sheet, tm-helix) and soluble (s-sheet. s-helix; annotated as cytoplasmic). Subsequently, the two small individual transmembrane pools were expanded by one round of psi-blast using the sequences with structural information as bait. Psi-blast for the two small datasets using only sequences of known secondary structures was performed to reach an increase of highly similar sequences from the bait sequence pools. To prevent overfitting of the two pools, filtering for redundant and similar (>95 % sequence identity) sequences was performed. This approach was required to avoid artifacts by comparing drastically different volumes and peptide densities. Otherwise, a small volume could result in an unjustified good separation value. The other three pools (random, s-helix, s-sheet) were not expanded leading to the final sequence number for the five different secondary structure pools (tm-sheet, tm-helix, s-sheet, s-helix, random) ( Table 1; Additional file 1: Table S1, Additional file 2: Table S2).
Further, we implemented an in silico tryptic [K (Lysine)/R (Arginine)] digest of the whole ASTRAL40 [32] database. On the one hand, this approach yields peptides with mixed structural content. On the other hand, on the base of these peptides we wanted to test whether peptides identified by mass spectrometric approaches typically generated by tryptic digest can be used to define topologies of proteins. These sequences were classified according to their continuous (dc-sheet, dc-helix, dcrandom) or discontinuous (dd-sheet, dd-helix, dd-random) dominating secondary structure elements (SSE). Peptides without dominating SSE are clustered concerning their portion of different SSE (no-helix, no-sheet, no-random, all). In addition, sequences with transmembrane content were pooled individually for the SSE helix and sheet (krtm-helix, krtm-sheet, Table 2). Combining the pools generated by in silico tryptic digestion or based on known secondary structure approaches, we ended up with 17 different sequence pools. Each of the 17 different sequence pools contains a discrete number of sequences (Tables 1, 2).
After gathering the test pools different hydrophobicity scales were extracted from literature (Tables 3, 4; Additional file 1: Table S1). The 98 selected scales cover experimentally developed scales, calculated scales as well as scales which have been created by improving pre-existing scales. In addition, eight scales represent the reverse algebraic numeral scales to other scales to test if the algebraic numeral has an influence on the results.
To investigate the separation capacity of the selected hydrophobicity scales on the different defined sequence pools five measurement parameters have been defined ( Table 5). The EBSS [16], the alternating hydrophobicity [14,15] and the hydrophobic moment calculated for β-sheets with an angle of 180° [13] are hallmarks for hydrophobic content, while the hydrophobic moment calculated for α-helices [14] and the average hydrophobicity [33] are used to identify α-helical transmembrane regions (hydrophobic-moment α). Each of the parameters was calculated in a sliding window of ten amino acids. The largest and smallest value for each peptide was considered as outlined below.

The relation of the hydrophobicity scales
We clustered the hydrophobicity scales by comparing the hydrophobicity value for each amino acid to each other. The variance of the different hydrophobicity scales was analyzed by Pearson correlation. The Table 1 Sequence pools based on secondary structure dissection All peptides have a minimal length of ten amino acids required for EBBS based analysis. Soluble pools contain only peptides with the defined SSE, while the trans-membrane pools contain both, peptides with the given SSE only and with the given SSE and additional amino acids to reach a length of ten amino acids    [58] 96 1989 Fasman FASG890101 reverse [48] Shown are the category of the scale (column 1), the ID of the hydrophobicity scale (column 2) the year of the publishing (column 3), the name of the authors (column 4) and the name of the scale (column 5) 25 1985 Kidera KIDA850101 reverse [50] 98 1988 Roseman ROSM880102 reverse [59] 88 1988 Roseman ROSM880103 reverse [59] 97 1988 Roseman ROSM880101 reverse [59] 84 1981 Wolfenden WOLR810101 reverse [60] values obtained were used to calculate the dissimilarity (= 1 − correlation 2 ) to create an Unweighted Pair Group Method with Arithmetic mean (UPGMA) tree of the hydrophobicity scales via MEGA6 [34]. The tree was used to cluster those scales to groups of similar amino acid value behavior setting a threshold ( Fig. 1; Additional file 2: Table S2). The linearized UPGMA tree of the 98 hydrophobicity scales was inspected to split the scales in clusters using a threshold of a maximal dissimilarity of 0.05. The created clusters were named alphabetically. The UPGMA tree was circled to give an overview of all 98 hydrophobicity scales and their position at a glance ( Fig. 1). As expected, the hydrophobicity scales generated by inverting the amino acid values cluster with the original scales (Table 3). However, not all hydrophobicity scales that have been created by the same experimental approach or by the same author cluster together (  [37]; or (ii) the scales proposed by Zviling (scales 89-91; SET1-3; cluster I) [12] that have been based on the scales of Kyte and Doolittle (cluster B) [38] and Engelman (cluster A) [39].

Calculation of separation capacity of hydrophobicity scales
Next we analyzed the capacity of the 98 hydrophobicity scales to separate the 17 defined sequence pools. The initial analysis was based on all five hydrophobicity parameters. For each parameter the maximal and minimal value for each peptide was calculated (Table 5). However, we realized that the simultaneous application of the minimal and maximal value of the same parameter does not increase the separation performance. By this we limit the parameter selection too either the minimal or the maximum value. We calculated the 32 parameter combinations (five parameter and alternating minimal and maximal value) for each peptide for all 98 hydrophobicity scales. The resulting five dimensional vectors for each peptide and each hydrophobicity scale were used to define five dimensional clouds for each pool and each specific hydrophobicity scale.
For the analysis of the separation capacity (Fig. 2, right) of a scale between two clouds of sequence pools we calculated the overlap volume (Fig. 2, left) and the number of peptides within the overlap (Fig. 2, middle). The size of the overlap volume and the number of peptides within the overlap volume negatively correlates with the separation capacity for two sequence pools. Further, we defined the "convex envelope" as described in the "Methods" section. Next we removed all peptides, which are part of the convex envelope. We recalculated the volumes of the pools and repeated the last step of the routine with the new volumes. Removing the peptides on the convex envelope was performed because the presence of only few peptides positioned distantly from the other peptides could increase the volume drastically. In case that the peptides are positioned close to others the volume did not change significantly and thus, this step improved the reliability of assignment of the majority of peptides.
Next, we defined a separation capacity score (Formula 1) to rank all scenarios. S v is a score based on the volume of the overlap in relation to the volumes of the two clouds. For S p we counted all peptides in the overlap volume of two sequence pools and set them in relation to all peptides of the two clouds. The score S is scaled between zero (both clouds totally overlap) and one (no peptides in overlap volume) and gives the quality of a certain hydrophobicity scale for the separation of two defined pools.

Formula 1 Separation capacity score
Here, P 1 and P 2 are the total numbers of peptides of pool 1 and 2, P ov is the number of all sequences in the overlap volume, V 1 and V 2 are the volumes defined by the sequence pools 1 and 2, and V ov is the overlapping volume of both pools. The number of V i and P i was always i = 2 because two pools were analyzed in parallel.
The general S value was calculated for each scale for the sequence pools based on secondary structure  [42] using the eigenvalues of a normalized nearest neighbor matrix. In parallel, the average S value for the clusters of hydrophobicity scales (Fig. 1) defined according to the UPGMA-tree was calculated (Fig. 3b). The average S values observed for the peptide pools obtained by in silico tryptic digest (Fig. 3b, blue) or by the combination of peptide pools generated by the two strategies (Fig. 3b  green) do not show a dependence on the selected cluster. Only for the secondary structure peptide pools the observed average S values differ between 0.28 for the cluster B and 0.13 for cluster X. Moreover, after sorting the clusters according to the average S values for the secondary structure peptide pools the order of clusters does not follow the order in the UPGMA tree (Fig. 3b, orange).

Separation of specific structure pools via hydrophobicity
The moderate separation capacity of all hydrophobicity scales using all 17 sequence pools prompted us to inspect the separation of the individual sequence pool pairs. The results are exemplified for the separation value for the best performing scale 14 [40] (Fig. 4a) as well as the maximal separation capacity out of all 98 hydrophobicity scales for each pairwise sequence pool combination (Fig. 4b). Globally, the S values obtained for the pairwise pool separation by maximal separation capacity out of all 98 hydrophobicity scales (Fig. 4b) are in general larger than the S values obtained by using the best performing scale 14 only.
In detail, the three pools with transmembrane α-helix (krtm-helix), with transmembrane β-sheet (krtm-sheet) or without random coil content (no-random) generated by digestion have the largest S value while analyzing the overlap with other sequence pools, irrespective whether the best scale (Fig. 4a) or the best value (Fig. 4b) is considered. In contrast, the secondary structure transmembrane pools (tm-sheet, tm-helix) show low S values while analyzing the overlap with other pools. Nevertheless, the S values of the secondary structure transmembrane pools are larger than the S values found while analyzing the overlap of the remaining sequence pools (Fig. 4b).
Remarkably, high S values were found when the overlap between the two secondary structural transmembrane pools (tm-sheet, tm-helix) and the three pools with transmembrane α-helix (krtm-helix), with transmembrane β-sheet (krtm-sheet) or without random coil content (no-random) generated by digestion was calculated. This might suggest that the regions flanking the transmembrane domain present in the sequences of the peptide pools generated by digestion provide additional information. This information in combination with the hydrophobicity might give an additional signature for such domains. Hence, in silico digestion with subsequent analysis by the described parameters using e.g. the hydrophobicity scale 14 can be used to detect transmembrane helices and sheets.
With respect to the remaining pools we observed that the S value obtained while analyzing the overlap of Table 5 Hydrophobicity parameters Shown are the index (column 1), the name (column 2) and the description of the used hydrophobicity parameters (column 3). The description is equal for the minimum and maximum of the used hydrophobicity parameter

Index
Name Description 0 Max. exact β-strand score (EBSS) Parameter to score the probability of a sequence with ≥10 AA to be a TM β-sheet [16] 1 Min. exact β-strand score (EBSS) 2 Max. alternating hydrophobicity High alternating hydrophobicity probing for polar and unpolar AA alternation typical for transmembrane β-sheets [14, 15] 3 Min. alternating hydrophobicity 4 Max. hydrophobicity-moment α Analyzing the distribution of hydrophobicity considering the amino acid distribution in α-helices with an angle between amino acids of 100° to probe for potential to form a TM α-helix [13] 5 Min. hydrophobicity-moment α 6 Max. hydrophobicity-moment β Analyzing the distribution of hydrophobicity considering the amino acid distribution of β-sheets with an angle between amino acids of 180° to probe for potential to form a TM β-sheet [13] 7 Min. hydrophobicity-Moment β 8 Max. average hydrophobicity Average hydrophobicity of the peptide [33] the secondary structure pools (s-sheet, s-helix and random) is higher when compared to the pools containing sequences with mixed structures (Fig. 4a, b). This result is expected, as the chosen parameters detect the individual elements and a mixture thereof yields mixed information. Consequently, we analyzed the performance of the individual hydrophobicity parameters and the usage of the multi-dimensional approach. We realized that the S value is more dependent on the combination of the hydrophobicity parameter in a multi-dimensional vector than on a specific hydrophobicity scale (Additional file 3: Fig. S1).
Including the top 5 % of all scenarios for separating two pools from each other (Fig. 5a) the influence of the different parameter varies significantly (Fig. 5b). It becomes obvious that the average hydrophobicity and EBSS have the strongest impact on the separation quality, while the alternating hydrophobicity previously thought to specifically recognize transmembrane β-strands [14,15] has the lowest impact on separation. Only for the mixed situation we observed that the alternating hydrophobicity has no impact at all. Thus, the overall performance is dependent on all parameter although to a different extent. Fig. 1 Clustering of hydrophobicity scales. Shown is the UPGMA tree of the clustered hydrophobicity scales based on the normalized amino acid value distances (see "Methods" section). The hydrophobicity scales are clustered to groups (a to z) within similarity larger than 0.07

Benefit of amino acid pattern to separate specific structure pools
An amino acid based approach for the different structure pools was subsequently considered in addition to the hydrophobicity based separation. At first the amino acid composition of the different pools was analyzed, which did not yield a significant difference between the individual pools (Additional file 4: Table S3).
Thus, the occurrence of amino acid patterns of two to five amino acids was analyzed utilizing a markov chain approach ( Fig. 6; "Methods" section). Nearly all (~80 %) detected amino acid patterns up to the length of three occur in each of the different sequence pools. The number of globally occurring amino acid patterns of a length of four drops down to 60 %. An elongation of the pattern length to more than five amino acids results in coverage of less than 5 % and thus, is not of use for separation. Thus, the exclusive appearance or at least overrepresentation of amino acid patterns of four and five amino acids in the distinct sequence pools was analyzed.
We identified several peptides of four (Table 6; Additional file 5: Table S4) and five amino acids that are specific or at least enriched in peptides of a certain pool (Table 7; Additional file 6: Table S5). Analyzing the frequency of occurrence of these peptides revealed that patterns with four amino acids are only very moderately specific as only very few are at least 50-fold more frequent in a certain pool than in general (Table 6). In turn, patterns with five amino acids can serve as an additional discriminator because patterns with 500-fold higher frequency in a specific pool than in the overall sequence pool exist (Table 7). This holds true particularly when the sequence pools generated by the same strategy (Tables 1,  2) are compared. This information can only be taken in addition to the hydrophobicity parameters, because occurrence of specific amino acid patterns in one specific structure pool compared to all pools as reference did not yield an adjusted p value below 0.05 for any amino acid pattern of length five. Nevertheless, detection of ß-strands in peptides can be supported by the detection of penta-peptides YLVNM (dc-sheet), LTVTG, TLDGG (dd-sheet), CGGSL and YGGVT (s-sheet). Remarkably, the penta-peptides observed for the structurally derived pool is not overlapping with the penta-peptides observed for the pool derived by in silico tryptic digestion which might suggest that the latter contain specific regions at the end of the strand. Moreover, amino acid patterns specific for the transmembrane ß-strands are SIGA (krtmsheet, Table 6), LYGKV, PTLDL and SASAG (tm-sheet, Table 7). For peptides with mainly random content we found that S-GSSG-S, SGPSS or TILPL are enriched (random, dd-random; Table 7). In turn, for pools mainly consisting of helical structures we found only one pentapeptide specific pattern for the structural pools with α-helix (s-helix; EELKK) and for the pool of peptides with the transmembrane α-helix (krtm-helix; YVFFG; Table 7). Thus, a prescreening of sequence pools with these amino acid patterns might improve the classification quality.

Factors influencing pool separation
The variation of separation capacity of hydrophobicity scales (Fig. 3) prompted the analysis of the impact of individual amino acid values. However, we did not realize any correlation between specific distribution of values for individual amino acid within the individual scales and the scale performance based on the 98 known hydrophobicity scales. As an alternative approach we created random hydrophobicity scales based on the 98 already known ones (Tables 3, 4). At first, the maximum and minimum amino acid values of the 98 real scales were used as interval to create 200 random hydrophobicity scales by assignment of a random value to each individual amino acid. Subsequently, several rounds of in silico evolution were Fig. 2 Scoring scheme. Shown is the scoring scheme of the separation of two sequence pools. Both sequence pools (dark grey area, grey area) build a cloud. The overlap of both clouds (light grey area) as well as the sequences (bold points) in this overlap is used to calculate the separation capacity. The circle surrounding both clouds represents the volume of the outlayer performed to improve the separation capacity for the five different structural sequence pools (Fig. 7).
After six rounds of in silico evolution the created random hydrophobicity scales reached a separation threshold of 0.6, which is comparable to the separation potential of the best performing hydrophobicity scale. This suggests that a limit of the potential of amino acid scales for the separation of structural sequence pools exists by 0.6. Furthermore, we realized during the evolution of the hydrophobicity scales that the value of some amino acids had greater positive or negative influence on the separation capacity like others.
After establishing the evolutionary scale, we aimed at an understanding which property of a scale has an impact on its separation capacity. At first, we tested whether the general order of amino acids with respect to their hydrophobicity value is important. We realized that it is not the overall order of the amino acid hydrophobicity values that influences the performance of the hydrophobicity scale (Additional file 7: Fig. S2). At second we analyzed whether the value of specific amino acids dominate the separation capacity of a scale. We realized high S values for hydrophobicity scales sharing rather comparable hydrophobicity values for Gln, His, Gly, Ser or Arg to Fig. 3 Separation of pools by hydrophobicity scales. a Shown is the overall separation value for each hydrophobicity scale for the secondary structure (orange), in silico tryptic digest (blue) and mixed (green) sequence pools as area plot. The hydrophobicity scales are sorted from highest to lowest value. b The same as in a but the separation value is calculated for the cluster of hydrophobicity scales the evolved scale or for scales with hydrophobicity values for Cys, Met, Lys, Val or Ile distinct from the evolved scale (Additional file 8: Fig. S3). Thus, the hydrophobicity value of some amino acids like Gln, His, Gly, Ser or Arg might be more important for the separation capacity of the scales than others.
Thirdly, we asked whether cluster of amino acids with comparable or rather distinct values exist within one scale, which lead to high S values. To this end we analyzed the difference between hydrophobicity values of amino acids of individual scales, namely of the in silico evolved scale, the experimental hydrophobicity scale with highest (best) and the scale lowest (worst) S value, respectively. Each scale was normalized as such that the highest hydrophobicity value within the scale was set to one and the lowest hydrophobicity value to zero. Subsequently the difference of hydrophobicity values of the amino acids of one scale was calculated (Additional file 9: Table S6). Finally we analyzed whether a pair of amino acids shows a very small (<0.1, Fig. 8a green field) or very large (>0.9, Fig. 8a, red field) difference of the hydrophobicity value within each of the three scales. Finally, we  (Fig. 8a, orange frame). In addition, we selected amino acids pairs with very different hydrophobicity value at least in one of the two scales (Fig. 8a, blue frame) and such pairs where the difference was small in one and large in the other one of these two scales (Fig. 8a, yellow frame).
Inspecting the information we realized that a large difference of the hydrophobicity values for glutamate and arginine to each other exists. In addition, the hydrophobicity value of glutamate is most distant to the hydrophobicity values of tyrosine, tryptophan, leucine and isoleucine, respectively Fig. 8a, blue frame). The hydrophobicity value of arginine is distant to the value of phenylalanine and methionine (Fig. 8a, blue frame). In turn, three distinct clusters of amino acids with comparable hydrophobicity values become obvious (Fig. 8a,  orange frames). Considering all pairs one can draw relations of the hydrophobicity values within these clusters. Interestingly, the hydrophobicity values of cluster three are most distant form arginine (Fig. 8b), while the hydrophobicity values of cluster one are most distant to glutamate. However, these clusters do not correlate with the amino acid pattern detected for the specific sequence pools (Tables 6, 7) and moreover, they do not necessarily represent the physicochemical properties of the amino acids.

Conclusion
We demonstrate that most of the hydrophobicity scales reach the same level of peptide separation capacity (Figs. 3, 4) and thereby, the method by which the scale was generated has no direct influence on clustering or separation capacity (Figs. 1, 3). Nevertheless, if at all we realized that the scale 14 defined by Naderi-Manesh developed in 2001 [40] performs somewhat better than the other hydrophobicity scales. We propose a rule of thumb for experimentalists that aim to use a hydrophobicity scale for identification of peptides with transmembrane segments from a pool of peptides. The hydrophobicity value of arginine and tyrosine should be most distant from the value of glutamate, while the hydrophobicity values of Asn, Asp, His, Lys should be in the center of the scale (Fig. 8c). We further observed that separation of sequence pools defined by known secondary structures is more likely than separation of sequence pools with a combination of secondary structures derived from in silico digestion (Figs. 3, 4), but The dash-dotted line shows the best separated 5 % of all scenarios and serves as marginal value to detect the threshold for analyzing the influence of the different hydrophobicity parameter to the separation. b Shown is the influence on separation of the ten hydrophobicity parameters (Table 5) for the secondary structure based sequence pools (black), the sequence pools generated by digestion (white) and the combination of both (grey). The hydrophobicity parameters are paired (max., min.). The separation influence is calculated as absolute value of the difference between observed and expected frequency of the best 5 % of separated scenarios (Fig. 5a)   Fig. 6 Amino acid pattern distribution. Shown is the percentage of occurrence of all possible amino acid pattern of a specific length in the different sequence pools. The length of the pattern varies from 2 to 5. 2 AA black circle; 3 AA red circle; 4 AA green triangle down; 5 AA yellow triangle up the tryptic digested sequence pools with helical and strand content or transmembrane ß-strand or α-helix content are best separable from the other pools (Fig. 4).
Nevertheless, we realized a threshold of S = 0.6, irrespective of the nature of the scale, which is supported by an in silico approach to optimize the scale (Fig. 7). In turn, the Table 6 Patterns of four amino acids Given are the sequence pool name (column 1) and peptide sequence with the highest frequency of occurrence (column 2); the frequency of occurrence (FO) of this peptide in the according pool (column 3), the frequency of occurrence of this peptide in the pool containing all sequences except the one of the analyzed pool (column 4), the frequency of occurrence of this peptide in the pool containing all sequences generated by the same strategy (GBSS) as the analyzed pool excluding the sequences of the analyzed pool (column 5). Italic shows peptides that have an at least 50-fold higher frequency, with respect to the remaining sets (column 4) or the remaining peptides generated by the same strategy (column 5). Peptide sequences with p values below 0.05 were marked by an asterisk  separation capacity depends on the number of parameter calculated (Additional file 3: Fig. S1), although we realized that the alternating hydrophobicity has the lowest capacity for sequence pool separation (Fig. 5). Remarkably, we observed that detection of ß-strands in peptides can be supported by the detection of penta-peptides ( Fig. 6) because such peptides have been detected in the structural pool and in the pools generated by simulated tryptic digest (Table 7). Similarly, amino acid patterns specific for the transmembrane ß-strands (Tables 6, 7) or largely random content (Table 7) have been observed. In turn, for pools mainly consisting of helical structures only one specific penta-peptide for soluble (s-helix) and transmembrane (krtm-helix) α-helices could be detected (Table 7). Summarizing, the quality of separation of Given are the sequence pool name (column 1) and sequence with the highest frequency of occurrence (column 2); the frequency of occurrence (FO) of this peptide in the according pool (column 3), the FO in the pool containing all sequences except the one of the analyzed pool (column 4), the FO in the pool containing all sequences generated by the same strategy (GBSS) excluding sequences of the analyzed pool (column 5). Italic shows peptides with at least 50-fold higher frequency with respect to all (column 4) or peptides of the same strategy (column 5). Hashtag after the pattern indicate 500-fold higher frequency in at least column 4 or column 5 Green boxes mark distances below 0.1, dark green boxes below 0.01, red boxes distance above 0.9 and dark red boxes distance above 0.99. Combination framed in orange mark amino acids for which values should be rather similar as concluded from the low difference in the best performing and the evolutionary evolved scale, blue frames mark amino acid combinations for which values should be rather different as concluded from the low difference in the best performing and the evolutionary evolved scale and yellow frames mark amino acid combination for which the value difference is irrelevant. b The clusters with comparable (black lines) or distinct amino acids values (blue lines) are shown. c The arrow indicates the distance of amino acid values that should be present in a good performing scale sequence pools depends rather on the parameter used for calculation than on the scale used and can be supported by the search for specific amino acid pattern.

Hydrophobicity parameter
Five different hydrophobicity parameters (Table 5) were used to analyze their influence on the separation capacity. For all of those hydrophobicity parameters we used contrary pairs (max. and min.) of the parameters to look for potential differences. The EBBS [16] should be able to detect β-sheets, whereas the alternating hydrophobicity [14,15] should be more specific to detect transmembrane β-sheets. The hydrophobicity moment α and β [13] were used to identify α-helices and β-sheet in general. The last parameter was the average hydrophobicity, which had no preferentially detectable secondary structure so far.

Structure pools
The known secondary structure pools (Tables 1, 2) were extracted from the ASTRAL40 database [32] and differed in random coil (random), cytosolic β-sheets (s-sheet), cytosolic α-helix (s-helix), transmembrane β-sheet (tmsheet) and transmembrane α-helix (tm-helix). Further, we implemented an in silico tryptic digest approach to split sequences after Lysine (K) and Arginine (R) of the whole ASTRAL database and classified the peptide fragments concerning their secondary structures. These were divided in fragments containing a (i) continuous dominating SSE (dc), (ii) discontinuous dominating SSE (dd), (iii) no dominating SSE but only two different structures (no-), (iv) all three secondary structures (all) or (v) transmembrane sheet or helix fragments (krtm-).

Pool separation via hydrophobicity scales and parameter Cloud algorithm
The algorithm to calculate the cloud is a two-step approach. All single peptide sequences of a specific secondary structure dissection pool (Table 1) or in silico K/R-digestion pool ( Table 2) were used as input to calculate the cloud. Thereby, each peptide is represented by an n-dimensional vector (hyper volume; convex cover), where the values of the different hydrophobicity parameters (n ≤ 5 dimensions; represented by the minimum or maximum value using a specific hydrophobicity parameter from Table 5) calculated based on a specific hydrophobicity scale (Tables 3, 4) are the components of this vector.
(I) The initial cloud is calculated based on a randomly chosen as subset of ~30 points (peptides defined by vectors). Then, the cloud is expanded until each point is considered. In general, the algorithm calculates all distances and directions within the n-dimensional space between all given points (peptides) and tests if these sites are valid. A site is valid if all points of the entire cloud follow the direction of the hypersurface. By this it is determined if an added point lies within the so far calculated cloud.
The existing cloud is expanded point by point to determine the cloud by a set of sites between points. After each point the set of sites is updated by the procedure (i) and all remaining points are tested if these points are inner points or putative boundary points. (III) The final cloud volume is calculated based on the outer sites between the boundary points in the n dimensional space that form a convex envelope. All points are placed inside the cloud (inner points) or on the convex envelope (boundary points).

Separation calculation
All peptides of two structure pools in a given scenario were used to calculate the heuristic hypervolumes of each pool, respectively, defined by the hypersurface via a pipeline. A scenario is defined by the used number of dimensions represented by the selected hydrophobicity parameters, the selection of the hydrophobicity scale and which two structure pools are used for comparison. The number of the points (peptides) positioned by the according vector within the clouds was counted as well as the points within the overlap of both cloud volumes.

Convex envelope
We remove the boundary points (petides) of both structure pools building the convex envelope to avoid big volumes of the clouds based on outliers. We analyzed the volume and number of peptides per structure pool for all combinations with n = 5 dimensions and calculate the loss of peptides and loss of volume in percentage. Due to the high amount of combinations per structure pool (defined by number of hydrophobicity parameter and number of hydrophobicity scales) we calculate the minimum, maximum and average of volume and peptide reduction removing the boundary points (Additional file 11: Table S7). In average, this procedure leads to an elimination of 6.8 % peptide sequences of the structure pool, but a decrease of the according volume of 44.6 %.
By that, removal of putative outlier cause on average a sevenfold increase of the volume per peptide. For pools with low amount of peptides (krtm-sheet, krtm-helix, no-random) the increase of the volume per peptide is lower, namely in the order of twofold. Nevertheless, this increase of density justifies the procedure.

Hydrophobicity scale clustering
For the hydrophobicity scale clustering the dissimilarity of the different pairs of hydrophobicity values for each amino acid was calculated. This was done by using autocorrelation between all pairs of the 98 different hydrophobicity scales. Afterwards, the Pearson correlation values were normalized to get the dissimilarity and used by MEGA6 [34] to create an UPGMA tree of the dissimilarity. The clustering of the hydrophobicity scales was done by determining a threshold of 0.05 (5 %) for dissimilarity to split the tree in groups.

Amino acid pattern search
For the amino acid pattern search the different structure pools were used. First, the peptide fragments were analyzed for all occurring amino acid patterns of a specific length based on a Markov chain algorithm of the MEME and MAST suite package (fasta-get-markov) [43]. The algorithm estimates a Markov model from a FASTA file of sequences with previous filtering of ambiguous characters. For example a peptide of four amino acids in length has a conditional probability that one amino acid follows the other amino acid given a specific pool of peptide sequences. So the Markov chain allows the calculation of the transition probability from one state to another state and by this determines the probability of an amino acid occurring in an amino acid peptide of a certain length of a specific pool of peptides.
In this approach all possible patterns were detected in the peptides starting from a pattern length of one and incrementing by all different 20 possibilities for each amino acid. The occurrence of the different pattern was normalized to one and compared to the occurrence of the other structure pools to determine the pairwise difference between the pools to detect pool specific pattern of specific length. Furthermore, we performed multiple testing with our identified pattern of length four and five amino acids. We used the Fisher exact test to calculate p values examining the significance of the contingency between occurrences of a specific pattern in relation to a specific structure pool. As reference we pooled all 17 structure pools together. To overcome artificial errors using multiple times the fisher exact test we used as post hoc test Benjamini/Hochberg false discovery rate (fdr) multiple test correction to adjust our p values (Additional file 5: Table S4, Additional file 6: Table S5, p values). All amino acid pattern of length four (Table 6) and five (Table 7) with an adjusted p value below α = 0.05 were marked in bold.

In silico creation of random hydrophobicity scales
The generation of in silico hydrophobicity scales is based on the minimum and maximum hydrophobicity values extracted out of the 98 analyzed hydrophobicity scales, which were determined as borders for the interval. We used five structure pools to calculate the separation capacity score (dd-sheet, dd-helix, dd-random, krtmsheet, krtm-helix). Two hundred random hydrophobicity scales were created. Based on the best in silico random hydrophobicity scale of the previous steps 2000 scales were created; 100 per amino acid. Half of the hydrophobicity scales per amino acid changed the hydrophobicity value of the single amino acid in the positive [0.001:5] and negative [−0.001:−5] interval (evo1 and evo2). In the following in silico evolution steps (evo3 to evo5) the top 100 newly generated hydrophobicity scales with best performance were analyzed to filter for amino acids which have an influence on the separation capsacity. Only these amino acids were changed in the evo steps evo3 to evo5 to analyze their influence.