Alacat

Alacat𝕕𝕖𝕤𝕚𝕘𝕟𝕖𝕣 is a consensus tool which uses a variety of simulations and empirical data to identify 'quantotypic' peptides. From these peptides it can also help you to construct Qconcats for your chosen protein(s). A simplified "Consequence-like" interface is provided here.

Help

Basic configuration: In the boxes below you should specify your proteins of interest. You can tell the program to focus on specific peptides and Qconcats, or leave these boxes blank and the program will suggest them for you.

Note: This web GUI is interactive and will provide additional context if you click or hover over items with your mouse.

Advanced configuration: Individual analyses can be turned on and off depending on whether you are interested in their output. For more thorough customisation, it is recommended to use the Python library instead of the web interface.

Method details: See the workflow diagram here.
Alacat𝕕𝕖𝕤𝕚𝕘𝕟𝕖𝕣 is a consensus tool that uses a variety of "scoring methods" to determine the suitability of peptides, qbricks, etc. The list of individual scoring methods can be viewed in the advanced section, below, and links to the literature are provided for relevant methods on the about screen.

API: Alacat𝕕𝕖𝕤𝕚𝕘𝕟𝕖𝕣 is available as a web client, desktop application, command line utility and Python library. It can be installed on your local computer from pypi.org using `python -m pip install alacat`. The web GUI is designed for modern web browsers in a desktop environment. Old/mobile/esoteric browsers are not supported.

Examples

Try an example:

E1. Proteins from Uniprot
We are looking for several walnut (Juglandaceae) 7S globulin proteins, as described in Xiong 2019. The Uniprot IDs of these proteins are specified as the protein input. We leave all the other boxes blank, allowing the system to decide. Note that the task is ambiguous, since the proteins share much homology with each other and it is not clear whether we should favour unique peptides or detectable peptides. Nonetheless, this serves as a nice set for the user to explore the results and come to their own conclusions!
E2. Proteins from FASTA
This is the same as example E1. This time we use FASTA format instead of listing the Uniprot accessions of the proteins.
E3. Proteins from Uniprot, specific peptides
This is the same as example E1. This time we also denote the exact peptides we wish to use for the Qconcats. These go into the `peptides` input box.
E4. Peptide analysis only
Similar to example E3, but now we specify *only* the peptides - and *not* the proteins. This means we skip the analysis of any other peptides the proteins might have. For this we specify *Custom digestion* and enter the peptide sequences directly into the protein box.
E5. Titin
This is the longest human protein. It provides a stress test of the system and should produce plenty of peptides to browse through!
E6. Declan
Declan's Alacats
E6b. Declan x4
Declan's Alacats, first four only
E7. Shuford (thyroglobulin)
Peptides used in the paper: Shuford, C. M., Walters, J. J., Holland, P. M., Sreenivasan, U., Askari, N., Ray, K., & Grant, R. P. (2017). Absolute protein quantification by mass spectrometry: not as simple as advertised. Analytical chemistry, 89(14), 7406-7415.

Options

Protein

Constraints

Peptides

Qblocks

Qmenus

Advanced

Title

Num qblocks

Digestion

Organism(s)

9606 -- Homo sapiens, human
4932 -- Saccharomyces cerevisiae, yeast
9606, 4932 -- Human and yeast
942147 -- Spongiforma squarepantsii

Mandate

Reset

SQL

Dry run

Handlers

Help

The four fields correspond to the weight, assessment, parameters and column name.

WEIGHT

Python: help(alacat.EWeight)

The weight fields (Column.weight) control how important the various scoring metrics are.

When a subject (e.g. Peptide/Qbrick) is assessed, for every value (Score) that passes (EGrade.PASS) the rule (Assessor) of its column (Column), the total-points (EntitySummary.total_points) of the subject are increased by the weight of the Column. I.e.:

POINTS[subject] = sum(COLUMN.WEIGHT for COLUMN in COLUMNS if COLUMN[subject] is PASS)

Note

When comparing (or ranking) subjects, points are only counted from columns in which both subjects to be compared did not report an error. See EGrade.ERROR for more details.

The default weights are orders of magnitude apart, which means that, for instance, a subject with a pass in one HIGH weighted column will always have a higher score than a subject with multiple passes in LOW weighted columns.
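
The points rule can be sketched in a few lines of Python. This is an illustration only: the weight values (1, 10, 100) and the column names below are hypothetical and are not taken from alacat.EWeight.

```python
# Illustrative sketch only: these weights and column names are made up,
# they are NOT the real alacat.EWeight values.
LOW, NORMAL, HIGH = 1, 10, 100

# Column name -> weight (hypothetical columns).
weights = {"length": HIGH, "h_count": NORMAL, "sim_a": LOW, "sim_b": LOW}


def total_points(grades, weights):
    """Sum the weights of every column whose grade is PASS."""
    return sum(weights[column] for column, grade in grades.items() if grade == "PASS")


peptide_1 = {"length": "PASS", "h_count": "FAIL", "sim_a": "FAIL", "sim_b": "FAIL"}
peptide_2 = {"length": "FAIL", "h_count": "FAIL", "sim_a": "PASS", "sim_b": "PASS"}

print(total_points(peptide_1, weights))  # 100
print(total_points(peptide_2, weights))  # 2
```

With weights an order of magnitude apart, the single HIGH pass of the first peptide outweighs the two LOW passes of the second.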

If columns are not assessed (i.e. if they have their Column.assessor set to Assessors.indifferent) then the Column.priority is immaterial. However, setting the priority to EPriorities.DISABLED will still turn off the output for that column.
The descriptions below cover the default meanings of the weights. The user is free to set column weights however they choose, including to values outside this range. To avoid overflow and rounding errors, any custom weights should be integer-typed.

DISABLED:	Disabled. Setting the `Column.weight` to `DISABLED` will remove the `Column` from the output entirely. In the GUI, if all columns are `DISABLED` for a `Handler`, the `Handler` itself will be disabled. Note that disabling a `Column` only suppresses its output; side effects, such as generating peptides, still occur, provided the handler itself is not also disabled.
VERY_HIGH:	Highest priority. By default, this is not used internally.
HIGH:	High priority. By default, major constraints, e.g. peptide length, NG in peptide.
NORMAL:	Normal priority. "Red" priority. By default, unambiguous constraints, e.g. H in peptide.
LOW:	Low priority. "Amber" priority. By default, ambiguous constraints, e.g. scores derived from simulations or limited databases. Also, by default, unambiguous constraints that require human opinion, e.g. C in peptide.
VERY_LOW:	Lowest priority. By default, unreliable, irrelevant or redundant sources. e.g. PPA digestion scores for a different experiment type.

ASSESSMENT

Python: help(alacat.Assessor)

The Assessors determine rules to denote what signifies a "good value":

When a subject is assessed, for every metric (Score) that passes (EGrade.PASS) the rule (Assessor) of its column (Column), the points of the subject are increased by the weight of the Column. -- See EWeight for more details.

Assessments are mostly binary, meaning that they typically return PASS or FAIL, though ERROR and INFO flags are also supported internally (see EGrade for details).

The indifferent assessment (Assessors.indifferent) is a special value, which means an assessment is not applied.

Python: help(alacat.Assessors.Indifferent)

Rule: Indifferent

Overview

When applied to a column, this rule denotes that the column is for information only.

Usage notes

This rule does not pass or fail any cell values. All cell values are marked as EGrade.INFO ("Informational message only").

Python: help(alacat.Assessors.Equal)

Rule: Equal to

Xᵢ = θ₁

Overview

When applied to a column, only cell values that are equal to the parameter are passed. Cell values not equal to the parameter are failed.

Python: help(alacat.Assessors.MinMax)

Rule: In range

θ₁ ≤ Xᵢ ≤ θ₂

Overview

When applied to a column, only cell values that lie within the range are passed. Cell values outside of this range are failed.

Usage notes

The range is defined by the two parameters, and includes both endpoints.

Python: help(alacat.Assessors.Max)

Rule: Less than or equal to

Xᵢ ≤ θ₁

Overview

When applied to a column, only cell values that are less than or equal to the parameter are passed. Cell values greater than the parameter are failed.

Python: help(alacat.Assessors.Min)

Rule: Greater than or equal to

Xᵢ ≥ θ₁

Overview

When applied to a column, only cell values that are greater than or equal to the parameter are passed. Cell values less than the parameter are failed.

Python: help(alacat.Assessors.NotEqual)

Rule: Not equal to

Xᵢ ≠ θ₁

Overview

When applied to a column, only cell values that are not equal to the parameter are passed. Cell values equal to the parameter are failed.

Python: help(alacat.Assessors.Percentile)

Rule: First percentile rank

rank(Xᵢ) / |X| ≤ θ₁

Overview

When applied to a column, cell values are ranked. Ranks comprising the first n% are passed, while all others are failed.

Usage notes

n is set by the parameter, which should lie in the range [0..1].

Percentiles are based on the number of values below, i.e. there can be a score at the 0%ile, but not at the 100%ile.
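
The range and percentile rules can be sketched as plain functions. This is a minimal illustration and not the alacat.Assessors API: the function names are hypothetical, and rank(Xᵢ) is assumed to mean the number of values strictly below Xᵢ, as the usage note suggests.

```python
def in_range(x, low, high):
    """In-range rule: pass when low <= x <= high (both endpoints included)."""
    return low <= x <= high


def first_percentile(values, theta):
    """Percentile rule: pass when rank(x) / |X| <= theta.

    Assumption: rank(x) is the number of values strictly below x, so the
    smallest value sits at the 0th percentile and nothing reaches 100%.
    """
    n = len(values)
    return [sum(other < x for other in values) / n <= theta for x in values]


lengths = [4, 6, 18, 30, 31]
print([in_range(x, 6, 30) for x in lengths])  # [False, True, True, True, False]

scores = [0.9, 0.1, 0.5, 0.3]
print(first_percentile(scores, 0.25))  # [False, True, False, True]
```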

PARAMETERS

If the assessment requires parameters then they are specified here. Multiple parameters should be separated with a comma. Only numeric parameters and a small set of fixed values are supported in the GUI.

NAME

This field can be used to rename the column. If no name is specified the column is not renamed.

GRADES (output, for information only)

Python: help(alacat.Assessor)

After values have been assigned to each subject (e.g. Peptide, Qbrick) and Column, each value is assessed by the rule of the column (Column.assessor) and given a grade (EGrade).

Along with the weights of the columns (Column.weight) the grades (Score.grade) determine the final number of points (EntitySummary.total_points) each subject receives and how subjects are ranked against each other. See EWeight for calculation details.

PASS:	Subject passed evaluation (i.e. hint to use this `Peptide`/`Qbrick`). Subject receives points equal to the weight of the column (`Column.weight`).
FAIL:	Subject failed evaluation. Subject receives no points.
INFO:	Informational message only. Subject receives no points. Unlike `FAIL`, comparing `INFO` against `PASS`/`FAIL` is considered a programmatic error, i.e. `INFO` values should be consistent across all subjects for a given `Column`.
ERROR:	Subject could not be evaluated. Metrics with this grade are not comparable. This is because an error indicates that the value is unknown and thus subjects cannot be compared based on this data. Hence, when two individual subjects are ranked, if either subject reports an ERROR for a particular column then any points from the other subject, for the same column, are discounted. Note that, in rare circumstances, this caveat actually makes it possible for a subject with fewer total-points to rank above a subject with more total-points, when the points for only the intersection of non-error columns are greater for the first subject.
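
The ERROR caveat can be illustrated with a small pairwise comparison. This is a sketch under stated assumptions: the function name, column names and weights are hypothetical, and grades are represented as plain strings rather than EGrade members.

```python
def comparable_points(grades_a, grades_b, weights):
    """Return (points_a, points_b), counted only over columns in which
    neither subject reports ERROR."""
    points_a = points_b = 0
    for column, weight in weights.items():
        a, b = grades_a[column], grades_b[column]
        if "ERROR" in (a, b):
            continue  # value unknown for at least one subject: skip the column
        points_a += weight if a == "PASS" else 0
        points_b += weight if b == "PASS" else 0
    return points_a, points_b


weights = {"col_x": 100, "col_y": 10}            # hypothetical columns
first = {"col_x": "ERROR", "col_y": "PASS"}      # total points: 10
second = {"col_x": "PASS", "col_y": "FAIL"}      # total points: 100

print(comparable_points(first, second, weights))  # (10, 0)
```

Here the first subject has fewer total points, yet ranks above the second once the column with the ERROR is discounted.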

Protein providers

The protein providers translate the user input from the "protein" field of the parameters into actual proteins (their names and sequences).

Enabling or disabling the various protein providers will change the types of input supported. Generally, all providers should be left enabled unless they are causing problems (for instance to resolve ambiguities when it isn't clear which provider should parse the input).

BlockProvider

Overview

When this provider is enabled, the user may input proteins as a single solid block of text.

Each line of the text is assumed to represent a protein and is sent to the other protein providers for parsing.

If this provider is not enabled, proteins must be provided as a Python list ([]).

This provider mainly exists to support the web GUI, which is only capable of providing a single protein string. If disabled, the web GUI may not function correctly.

Scores

The score is a boolean value on a protein, indicating if the protein was produced by this provider. It is for information only and has no analytical purpose.

From text?

FileProvider

Overview

When this provider is enabled, proteins/peptides may be input by the user as either a solid block of text or the name of a text file.

File formats

The content is "sniffed" to try and determine the format, but hints can be given: either a file extension can be present (e.g. c:\proteins.FASTA), or a prefix "FILE_TYPE@" can be specified before the file name or content block.

Example:

FASTA@c:\proteins
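
As a rough illustration of the prefix hint, the sketch below splits a "FILE_TYPE@" prefix from the rest of the input. The helper name is hypothetical and the real FileProvider may parse hints differently; extension handling and content sniffing are deliberately left out.

```python
def split_type_hint(text):
    """Return (file_type, remainder) for input of the form 'FILE_TYPE@...'.

    Hypothetical helper: if no alphanumeric prefix precedes an '@', the
    file type is None and the input is returned unchanged.
    """
    prefix, separator, rest = text.partition("@")
    if separator and prefix.isalnum():
        return prefix.upper(), rest
    return None, text


print(split_type_hint(r"FASTA@c:\proteins"))
print(split_type_hint(">P1\nMKT"))
```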

Scores

The score is a boolean value on a protein, indicating if the protein was produced by this provider. It is for information only and has no analytical purpose.

From file?

SequenceProvider

Overview

When this provider is enabled, proteins can be provided in JSON format.

JSON format is used internally by the application and the user is not expected to craft such data manually.

JSON format is used in the web GUI. If this provider is disabled, the web GUI may not function correctly.

Scores

The score is a boolean value on a protein, indicating if the protein was produced by this provider. It is for information only and has no analytical purpose.

From sequence?

UniprotProvider

Overview

When enabled, the user may input proteins using database accessions:

Uniprot accession
Uniprot ID
Ensembl genome protein ID

The database accessions supported may be changed in the API by setting the constructor parameter. Any accession convertible by Uniprot is supported. In the web interface, no configuration is possible and only the default values are supported.

Scores

The score is a boolean value on a protein, indicating if the protein was produced by this provider. It is for information only and has no analytical purpose.

From uniprot?

Peptide providers

The peptide providers digest the Proteins into Peptides during the digestion stage of the workflow.

Where available, allowing multiple providers to perform the same digestion routine can be useful to compare various simulated digestion methods. However, this naturally means the workflow will take longer to process.

Some peptide providers may also provide peptide scores at the same time as performing the digest. These scores are carried through into the next stage of the workflow.

Providers can be disabled to speed up the workflow if their output is redundant or of no use. Note that providers themselves already ensure the workflow parameters are honoured. For instance, disabling the TRYPSIN digester in a PEPSINA workflow will have no effect, since the digester will be naturally disabled.

ConsequenceDigester

Overview

When enabled, Consequence online is used to digest the protein and obtain scores indicating the flyability of the digested peptides.

Scores

The score output is the "consensus" value from Consequence. This indicates the number of Consequence's predictors that indicate positive peptide flyability. The value lies in the range [0,4], with all 4 predictors being the ideal outcome for a flyable peptide.

Consequence score

#Consequence score

McPredDigester

Overview

When enabled, McPred online is used to digest the protein and score the peptides based on their simulated successful cleavage probability.

Scores

The output is a "mis-cleavage chance" in the range 0 to 1. Lower scores indicate less chance of mis-cleavage and are generally considered to be better.

McPred C score

#McPred C score

McPred N score

#McPred N score

OpenMsDigester

Overview

When enabled, OpenMS is used to provide a wide variety of digests from a protein.

The same enzyme specified in the model is assumed.

Scores

The score is a boolean value on a peptide, indicating if the peptide was produced by this provider. This has some analytical value: peptides not produced by this digester may be the result of non-standard digestion, such as miscleavage.

From OpenMs?

PpaDigester

Overview

When enabled, this peptide provider uses PPA online to digest the peptides and score them based on their simulated detectability.

Scores

The score for each peptide is a single numerical value, indicating the simulated detectability of the peptide. Higher values indicate that the peptide is more likely to be detected.

PPA score

#PPA score

PrespecifiedPeptideProvider

Overview

When enabled this provider allows peptides to be entered into the workflow before the actual digest stage.

Such peptides include peptides entered manually and proteins with known digestions.

Disabling this provider in either of these cases will result in an error.

The base class, PrespecifiedProvider, provides specific implementation details.

Scores

The score is a boolean value on a peptide, indicating if the peptide was produced by this provider. It is for information only and has no analytical purpose.

From spec?

Peptide scorers

The peptide scorers are responsible for assigning scores (metrics) to the peptides.

Scores are assigned into the columns of the scores matrix, with the peptides forming the rows.

The values of the scores may be of use in identifying whether the peptides are quantotypic, or may be for information only.

Disabling a scorer means that those scores will not be present in the output. Scorers can be disabled to speed up processing if their output is of no use.

AdvancedRobRulesScorer

Overview

When enabled, this scorer asserts a sequence of "Rob rules" designed to isolate a set of "good" peptides.

Only the rank of the score is assessed (not the actual value), hence we pick from the set of "best" peptides.

Scores

One score is output, which is assigned to each peptide by the following protocol:

Peptides containing none of the following are assigned SCORE 0:

H, M, NG, NV, n-term Q

Of the remainder, peptides containing none of the following are assigned SCORE 1:

H > 1, NG, NV

Of the remainder, peptides containing none of the following are assigned SCORE 3 (then 4, etc.):

H > 2, NG, NV

All remaining peptides are assigned SCORE 100.
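
The protocol above can be sketched as a chain of progressively relaxed checks. This is a hypothetical illustration, not the scorer's actual code: only the tiers listed in the text are implemented (with the score values as written), and motifs are matched as plain substrings.

```python
def elimination_score(peptide):
    """Tiered score following the protocol above (lower is better).

    Hypothetical sketch: the "etc." tiers are not implemented.
    """
    h_count = peptide.count("H")
    has_ng_or_nv = "NG" in peptide or "NV" in peptide
    has_m = "M" in peptide
    q_start = peptide.startswith("Q")

    # Strictest tier: no H, M, NG, NV or n-terminal Q.
    if h_count == 0 and not has_m and not has_ng_or_nv and not q_start:
        return 0
    # Relaxed: M and n-terminal Q allowed, at most one H.
    if h_count <= 1 and not has_ng_or_nv:
        return 1
    # Relaxed further: at most two H.
    if h_count <= 2 and not has_ng_or_nv:
        return 3
    return 100


for peptide in ("AAGLK", "MAHLK", "HAHLK", "ANGLK"):
    print(peptide, elimination_score(peptide))
```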

EliminationChain

#EliminationChain

BasicScorer

Overview

When enabled, basic information about the peptides is output into several columns.

This information includes peptide start/end positions, terminal type, and flanking sequences.

Scores

Several scores are produced:

Length

The length of the peptide, in amino-acids. Excessively small or large peptides may be problematic for different reasons. The default assessor flags up lengths outside of the 6-30 range.

Terminal type

Which terminal the peptide appears on, if any. While non-terminal peptides are ideal, this issue is usually caught by the missing-linkers assessment.

Start position

The starting position of the peptide within the protein. By default, this column is for information only.

End position

The end position of the peptide within the protein. By default, this column is for information only.

N-flanking sequence

The adjacent amino-acid sequence to the N-side. By default, this column is for information only.

C-flanking sequence

The adjacent amino-acid sequence to the C-side. By default, this column is for information only.

Organism of protein

This is the NCBI taxonomic ID of the organism for the protein, if known. By default, this column is for information only.

Accession of protein

This is the accession as provided in the input. By default, this column is for information only.

Title of protein

This is the human-readable title of the protein, if known. By default, this column is for information only.

Usage notes

Note this Handler only considers the originating (specified) protein, not any other proteins this peptide might occur in. Hence, viable output is not produced if no protein was specified (i.e. if the peptide was specified manually).

C flank

End position

Length

N flank

Protein accession

Protein title

Protein organism

Sequence

Start position

Terminal type

DeepmspeptideScorer

Overview

When enabled, this scorer uses the DeepMsPeptide model to score the peptides based on their estimated proteotypicality.

Scores

The score produced is a real-typed classification per peptide, with 0 indicating the class of non-proteotypic peptides, and 1 indicating the class of proteotypic peptides. The ideal value is above 0.5.

Usage notes

Keras must be installed if you wish to use this scorer. It is rather large, so it is not included with the program by default.

DeepMsPeptide

#DeepMsPeptide

DigestableLinkersScorer

Overview

When enabled, a column is produced that red-flags digestible linkers.

Scores

The value in the output columns is the number of cuts obtained by digesting the linker, which is 1 less than the number of fragments. Ideally this is 0.

If there is no linker, or if the linker is too short, the output is "SHORT_LINKER". This is also unfavourable.

Digestable N linker?

If this is non-zero, the N linker is digestible by the selected enzyme. This means that the linker will need to be modified in the Qbrick, making the environment of the peptide different to that in the original protein.

Digestable C linker?

If this is non-zero, the C linker is digestible by the selected enzyme. This means that the linker will need to be modified in the Qbrick, making the environment of the peptide different to that in the original protein.
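
A minimal sketch of the cut count follows. It assumes a trypsin-like rule (cleave after K or R unless followed by P) and a hypothetical minimum linker length of 3; the real scorer uses whichever enzyme is selected in the model, and its threshold for a short linker is not stated here.

```python
import re

MIN_LINKER_LENGTH = 3  # hypothetical threshold for "too short"


def linker_cuts(linker):
    """Count cleavage sites that fall strictly inside a linker.

    Assumption: trypsin-like cleavage after K or R, but not before P.
    """
    if len(linker) < MIN_LINKER_LENGTH:
        return "SHORT_LINKER"
    sites = [m.start() for m in re.finditer(r"(?<=[KR])(?!P)", linker)]
    # A site at the very end of the linker does not split the linker itself.
    return sum(1 for site in sites if 0 < site < len(linker))


for linker in ("AKG", "GAK", "KPG", "KRG", "AG"):
    print(linker, linker_cuts(linker))
```

For "KRG" the sketch reports 2 cuts, i.e. three fragments (K, R, G), matching the "one less than the number of fragments" relationship.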

Digestable N

Digestable C

DoesItFlyScorer

Overview

When enabled, a scores column is produced for the peptides based on a machine learning method that determines the quantotypicness of each peptide.

Scores

For the default model, the output is an indication of the peptide's variability, with higher values being more variable (less quantotypic).

The default model has been trained to assess the variability of the peptide in relation to the protein quantification, as assessed by SeaMassSigma's Bayesian MCMC. Variabilities of the training set have been scaled to lie in the range [0, 1]. Output is thus predicted variability, with lower values indicating more quantotypic peptides. Note that the median value of the training data is 0.14, so values above this suggest that at least half of all peptides in the training data may be more quantotypic.

Usage notes

A file_name may be provided to specify a custom ML model.

Example using the default ML algorithm:

scorer = HandlerFactory( DoesItFlyScorer )

Example using a custom ML algorithm which performs classification, where the "1" classification is more favourable:

scorer = HandlerFactory(
    klass=DoesItFlyScorer,
    kwargs={"file_name": "my_classification_model.doesitfly"},  # custom model file
    column_overrides=[
        ColumnOverride(
            column="doesitfly",
            weight=EWeight.NORMAL,
            assessor=Assessors.Equal(1),  # pass only the "1" classification
        )
    ],
)

DoesItFly score

#DoesItFly score

IsoelectricPointScorer

Overview

When enabled, an output column is produced that shows the isoelectric point estimate for each of the peptides.

Scores

Isoelec

An estimate of the isoelectric point for the peptide.

Isoelec

NextProtScorer

Overview

When enabled, output is produced for NextProt's uniqueness scorer.

Scores

NextProt idents
NextProt variants

The output is in two columns: the number of similar peptides including and not including variants. Ideally, peptides will have only 1 match, for the protein they originate from. Values above 1 indicate the peptide is non-proteotypic, while zero values indicate the data is not present in NextProt.

NextProt idents

NextProt variants

PeptideAtlasScorer

Overview

When enabled this scorer retrieves data from peptide-atlas, notably the "empirical_proteotypic_score".

Scores

A number of scores are output into columns for each peptide:

Isoelectric point
Molecular weight
Rel hydrophobicity
N observations
N genome locations
N protein_samples
Proteotypic score

Usage notes

This scorer requires prior setup:

Ensure the mysqlclient Python package is installed.
Ensure MySql is running on localhost.
Ensure the alacat@localhost user exists.

Ensure the alacat user has write-access to the alacat_* databases:

GRANT ALL PRIVILEGES ON `alacat\_%`.* TO 'alacat'@'localhost';

Ensure AlacatDesigner knows the database password by adding it to the keyring:
```
python -m mhelper.password_helper --set alacat dbpassword
```

Isoelectric point

Molecular weight

N genome locations

N observations

N protein_samples

Proteotypic score

#Proteotypic score

Rel hydrophobicity

PeptideMassScorer

Overview

When enabled, the monoisotopic and average masses are output for each peptide. Peptide formulae are also output as a byproduct.

Scores

Average mass
Monoisotopic mass
Molecular formula
AA composition

Very large or small values may be problematic, and are detected by the default assessor on the monoisotopic mass column.
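
For orientation, a monoisotopic peptide mass can be computed by summing residue masses and adding one water. The sketch below uses the standard monoisotopic residue masses for the 20 common amino acids; it ignores modifications, and the function name is hypothetical rather than part of the scorer.

```python
# Standard monoisotopic residue masses (Da) for the 20 common amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one H2O for the free N- and C-termini


def monoisotopic_mass(peptide):
    """Monoisotopic mass of an unmodified peptide, in daltons."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


for peptide in ("G", "GA", "ELVISK"):
    print(peptide, round(monoisotopic_mass(peptide), 4))
```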

Average mass

Monoisotopic mass

Molecular formula

AA composition

PeptideSieveScorer

Overview

When enabled this scorer invokes PeptideSieve.

Scores

Scores are produced indicating the likelihood each peptide is proteotypic for various mass-spec platforms:

ICAT ESI
MUDPIT ESI
PAGE ESI
PAGE MALDI

If PeptideSieve does not return a score, the peptide is assigned a score of 0, since this usually indicates that PeptideSieve considers the peptide to be non-proteotypic.

By default, all non-zero values are considered good. To avoid multiple similar scores affecting the output, only the MUDPIT ESI column has a priority other than VERY_LOW.

Sieve ICAT ESI

#Sieve ICAT ESI

Sieve MUDPIT ESI

#Sieve MUDPIT ESI

Sieve PAGE ESI

#Sieve PAGE ESI

Sieve PAGE MALDI

#Sieve PAGE MALDI

PrideClustersScorer

Overview

When enabled, output is produced that indicates if this peptide is present in the Pride Clusters human data dump.

Scores

The output is a single presence/absence (boolean) column that indicates whether this peptide has been seen before in the Pride Clusters database.

Exists in Pride

ProteomeScorer

Overview

When enabled, this scorer counts the number of times this peptide's sequence and monoisotopic mass appears in the proteome.

The "proteome" .

By default, the scorer applies the following filters to the proteome:

Forbid protein isoforms
Forbid historic (Uniparc) proteins
Forbid protein fragments
Permit unreviewed (TREMBL) proteins
Use the organisms of the state (the organism(s) mentioned in the model plus any organism(s) covered by the input protein(s)).

In code, these values may be changed via their respective constructor arguments, whilst in the web interface, the values are hardcoded to the defaults.

Scores

The output values are the sequence occurrence count, and the monoisotopic mass occurrence count, within the specified proteome(s). Two more columns are provided that give the actual IDs of the proteins in which the sequences and MMIs are repeated:

Rep count
MMI-rep
Rep IDs
MMI-rep IDs

By default, the rep-count and MMI count are assessed by the rule "n<=1". This allows peptides found in their own protein to pass, as well as peptides not found in any protein (e.g. if the organism differs or the reference proteome is missing the user's entry). A more ideal rule might be "n=1" or "n!=1" depending on whether the user expects to see their protein in the proteome or not.
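
The sequence repeat count and the "n<=1" rule can be illustrated as follows. This is a sketch under stated assumptions: the accessions and sequences are invented, the digest is a simple trypsin-like rule with no missed cleavages, and the function names are hypothetical.

```python
import re


def tryptic_digest(sequence):
    """Cleave after K or R unless followed by P; no missed cleavages."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def rep_count(peptide, proteome):
    """Count how often the peptide appears among all digested proteins."""
    return sum(tryptic_digest(seq).count(peptide) for seq in proteome.values())


proteome = {  # hypothetical accessions and sequences
    "P1": "MAGICKELVISR",
    "P2": "TTKELVISRGG",
    "P3": "AAAKPWR",
}

for peptide in ("ELVISR", "MAGICK", "WWWWK"):
    n = rep_count(peptide, proteome)
    print(peptide, n, "PASS" if n <= 1 else "FAIL")
```

"ELVISR" occurs in two proteins and fails, while "MAGICK" (one match) and "WWWWK" (no match) both pass under "n<=1"; with "n=1" the last peptide would fail instead.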

Known limitations

The scorer has some limitations, and manually checking the IDs of the proteins actually detected is recommended:

By default, duplicates are red-flagged, however in some cases, such as when trying to find protein isoforms, these might be exactly what the user is looking for.
Duplicate sequences may appear in the proteome due to multiple entries for the same protein under different names in the FASTA file.
The simulated digest does not include missed cleavages and the scorer is unable to offer accurate results on miscleaved peptides.

Usage notes

An initial index file is generated, whereby the digester must be invoked to produce the digests for the entire proteome. This process may take some time: proteomes require a download of many megabytes and a simulated digest must be performed on their contents.

MMI-rep count

MMI-rep IDs

Rep count

Rep IDs

RepeatScorer

Overview

When enabled, this scorer counts the number of times the peptide occurs within its protein and input set.

Scores

Repeats in protein

Indicates the within-protein sequence occurrence count.

Repeats in input

Indicates the within-input sequence occurrence count.

Repeats in input

Repeats in protein

RobRulesScorer

Overview

When enabled, this scorer asserts certain string-based no-go peptide rules, the "Rob rules".

Scores

For the amino-acid checks, the scores are counts of how many violations occur. The ideal score is 0. Non-zero scores indicate the peptide contains the specified subsequence and may not make a suitable Qbrick. Note that the "GLUTAMINE_START" rule only looks at one residue and so only two possible values are allowed (true or false).

C count

Number of C/cysteine/cys.

M count

Number of M/methionine/met.

H count

Number of H/histidine/his.

NG count

Number of NG (N/asparagine/asn, G/glycine/gly).

DG count

Number of DG (D/aspartic-acid/asp, G/glycine/gly).

Q start

Peptide starts with Q/glutamine/gln.

For the linker checks, the scores are strings, "N" indicating the N-terminal violates the rule, and "C" indicating the C-terminal violates the rule. The ideal value for such scores is blank (""). Non-blank scores indicate the amino acid will need to be substituted in the Qbrick, meaning the environment of the peptide in the Qconcat will be less similar to the environment in the original protein.

R in linker

R/arginine/arg in linker.

K in linker

K/lysine/lys in linker.

Linker missing

No linker (terminal peptide) or linker unknown.
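
The amino-acid checks amount to simple substring counts. The sketch below is an illustration only: the function name is hypothetical, the dictionary keys mirror the column titles listed above, and the linker checks are not covered.

```python
def rob_rule_counts(peptide):
    """Count the residues and motifs covered by the amino-acid checks."""
    return {
        "C count": peptide.count("C"),
        "M count": peptide.count("M"),
        "H count": peptide.count("H"),
        "NG count": peptide.count("NG"),
        "DG count": peptide.count("DG"),
        "Q start": peptide.startswith("Q"),
    }


print(rob_rule_counts("QHNGCHK"))
print(rob_rule_counts("AAGLK"))
```

A peptide such as "AAGLK" gives zero for every count and False for "Q start", which is the ideal outcome.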

Source

Email from Rob B:

The problem is one of charge state, and distribution of signal over multiple charge states.

So, a ‘normal’ peptide: doubly charged.. [M+2H]++
Add a histidine, we now have [M+2H]++ and [M+3H]+++ which will, at best, split the signal
Add more histidines and it gets worse

Ideally, we’d have rules that say:

Eliminate H
Of all [M+2H]++ peptides,
Eliminate M
Eliminate NG, NV
Eliminate n-term Q

Are there any left?

If not, then relax rules
Add back M
Add back n-term Q (note I do not want to add back NG…)

Allow one more H

Of all [M+3H]+++ peptides…. Etc

R in linker

NG count

NV count

DG count

C count

DP count

Linker missing

Q start

H count

#H count

KP count

K in linker

M count

RP count

YolandaScorer

Overview

This scorer contacts the Yolanda database to score the peptides, based on whether they are found in various experiments.

Scores

One output column is produced, which indicates the presence of the peptide within the database. Positive scores indicate evidence has been recorded for the peptide, while zero scores indicate a lack of evidence. -1 indicates the peptide is unknown to Yolanda.

Note that no attempt is made to filter out false positives, non-proteotypic evidence, non-intensity evidence, or zero-valued evidence.

Usage notes

If Yolanda is not installed, no output is produced.

Yolanda evidence

Qbrick providers

The Qblock providers are to Qbricks and Qblocks as the Peptide providers are to Peptides. Please see the PeptideProvider documentation for more details.

PermutationQblockProvider

Overview

When enabled, Qblocks are generated by trying different possible permutations.

Scores

The score is a boolean value on a qblock, indicating if the qblock was produced by this provider. It is for information only and has no analytical purpose.

Usage notes

Usually the number of peptides in a qblock is the same as the number of peptides in a brick, which is 2. There will thus only be 2 possible permutations, both of which will be attempted. However max_permutations may be used to provide a limit in case of more esoteric scenarios.
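
The effect of a permutation limit can be sketched with the standard library. This is an assumption-laden illustration: max_permutations is named in the text, but its exact semantics in the provider are not documented here, and the function name is hypothetical.

```python
from itertools import islice, permutations


def candidate_orders(peptides, max_permutations=None):
    """Return peptide orderings, stopping after max_permutations (if set)."""
    return list(islice(permutations(peptides), max_permutations))


# Two peptides give exactly two orderings, so both are attempted.
print(candidate_orders(["ELVISK", "LIVESR"]))

# Four peptides give 24 orderings; a limit keeps only the first few.
print(len(candidate_orders(["A", "B", "C", "D"], max_permutations=5)))  # 5
```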

From permutation?

PrespecifiedQblockProvider

Overview

When enabled this provider allows qblocks to be entered into the workflow before the actual qblock-producing stage.

Such qblocks include qblocks entered manually.

Disabling this provider will result in an error if qblocks are entered into the initial model.

The base class, PrespecifiedProvider, provides specific implementation details.

Scores

The score is a boolean value on a qblock, indicating if the qblock was produced by this provider. It is for information only and has no analytical purpose.

From spec?

Qblock scorers

The QblockScorers are to Qbricks and Qblocks as PeptideScorers are to Peptides. See PeptideScorer for more details.

NaturalQblockScorer

Overview

When enabled, a score is produced that indicates if qbricks occur naturally in the protein sequence.

Scores

If the score is zero then the Qbrick sequence does not occur in the original protein. This means the environment of the peptides in the Qbrick will be less like the environment of the peptides in the original protein. This is the case for almost all Qbricks, unless concurrent peptides were chosen and no linker substitution was performed.
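A minimal sketch of this kind of check, assuming it amounts to a substring test of the Qbrick sequence against the protein sequence (the function name `is_natural` and the example sequences are illustrative, not taken from the library):

```python
def is_natural(qbrick_sequence, protein_sequence):
    """Return 1 if the Qbrick sequence occurs verbatim in the protein, else 0."""
    return 1 if qbrick_sequence in protein_sequence else 0


protein = "MSTAGKLVEDRNQWK"

# Two consecutive tryptic peptides, joined in their native order with no linker change.
print(is_natural("LVEDR" + "NQWK", protein))

# The same peptides in the opposite order no longer match the protein.
print(is_natural("NQWK" + "LVEDR", protein))
```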

Is natural?

Qmenu providers

The Qmenu providers are to Qconcats and Qmenus as the Peptide providers are to Peptides. Please see the PeptideProvider documentation for more details.

PermutationQmenuProvider

Overview

When enabled, Qmenus are generated by trying different possible permutations.

Scores

The score is a boolean value on a Qmenu, indicating if the Qmenu was produced by this provider. It is for information only and has no analytical purpose.

From permutation?

Qmenu scorers

The QmenuScorers are to Qconcats and Qmenus as PeptideScorers are to Peptides. See PeptideScorer for more details.

BasicAlacatScorer

Overview

When enabled a meaningless test value is produced for each Qmenu.

Scores

The output value has no purpose, please ignore it.

Test value

Specify the proteins to query.

This field is mandatory. The formats accepted are [*]:

Uniprot or Ensembl accessions
FASTA protein sequences
FASTA peptide sequences
TSV format
Alacat-UID

Uniprot or Ensembl accessions

Provide a list of database accessions, one per line.

GUI: See examples E1 and E5. Example E3 also uses this input format, but then goes on to focus the peptide search.

FASTA protein sequences

Provide FASTA content with protein accessions and sequences.

GUI: See example E2.

FASTA peptide sequences

Sometimes we want to evaluate specific peptides without being concerned about other peptides in the protein. In this case a slash (/) can be used to denote the splits in the protein. Small (l=3) peptides should be included to correctly identify the flanking sequences for each peptide.

GUI: See example E4.
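The slash-delimited format can be sketched as follows. This is an illustration only: the function name `parse_slash_sequence` is hypothetical, and the rule that each peptide's flanks are taken from its neighbouring segments is an assumption based on the description above, not the library's confirmed behaviour.

```python
FLANK_LEN = 3  # segments of this length or shorter are treated as flank-only


def parse_slash_sequence(sequence):
    """Split a slash-delimited sequence into peptides with their flanks.

    Segments longer than FLANK_LEN are reported as peptides. Each flank is
    assumed to be the nearest FLANK_LEN residues of the neighbouring segment.
    """
    segments = [s for s in sequence.strip().split("/") if s]
    peptides = []
    for i, segment in enumerate(segments):
        if len(segment) <= FLANK_LEN:
            continue  # short segments only supply flanking context
        n_flank = segments[i - 1][-FLANK_LEN:] if i > 0 else ""
        c_flank = segments[i + 1][:FLANK_LEN] if i + 1 < len(segments) else ""
        peptides.append({"peptide": segment, "n_flank": n_flank, "c_flank": c_flank})
    return peptides


for entry in parse_slash_sequence("xxx/AAAAA/BBBBB/yyy/zzz/CCCCC"):
    print(entry)
```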

TSV format

Protein accessions and sequences can also be provided in tabular format, e.g. CSV or TSV.

Alacat-UID

This is a JSON format internal to the software. You won't generally type it manually, but might see it if you export and recover a previous model.

Notes

[*]	This list gives the default values. GUI: The exact protein formats supported depend on the providers enabled in the advanced section below. API: The exact protein formats supported depend on the `handlers` argument.

Specify the peptides to use. Peptides in the following formats are generally accepted [*].

Peptide list:
(PEPTIDE 1)
(PEPTIDE 2)
...
FASTA-like peptide list per protein:
>(PROTEIN 1)
(PEPTIDE 1)
(PEPTIDE 2)
...
Network edge-list:
(PROTEIN 1):(PEPTIDE 1)
(PROTEIN 1):(PEPTIDE 2)
...

Alacat-UIDs (obtained from a previous model) are also accepted.

If you leave this field blank the system will suggest the peptides for you.

[*]	This list gives the default values. GUI: The exact peptide formats supported depend on the providers enabled in the advanced section below. API: The exact peptide formats supported depend on the `handlers` argument.

Specify the qbrick sets (qblocks) to use. Qblocks are generally provided as an Alacat-UID (obtained from a previous model). The exact formats supported depend on the handlers selected. If you leave this field blank the system will suggest the qblocks for you.

Specify the lists of qconcats (qmenus) to favour. Qmenus are generally provided as an Alacat-UID (obtained from a previous model). The exact formats supported depend on the handlers selected. If you leave this field blank the system will suggest the qmenus for you.

Model title, for your reference only. Can be left blank.

GUI: If the model already exists on the system, this field will be ignored and the existing title will be used.

Minimum number of qbricks per protein.

Specify the digestion you are using. This will almost always be set to TRYPSIN.

Description of options:

Digestions that the pipeline is capable of simulating.

Warning

Most of the digestion methods are included to support OpenMS's digester, which is a fast, regular-expression-based digester. However, digestion and analysis are two different stages of the pipeline.

While we are easily able to simulate non-tryptic digests, the ability of many tools and databases to actually provide insight into such peptides is questionable. Proceed with caution when using these.

Regular expressions and descriptions are taken from OpenMS.
Preprogrammed:
Preprogrammed. Indicates a digestion method other than those listed in this enumeration. Typically this means the user has specified the peptides verbatim, or provided a custom Handler capable of simulating a unique digestion. Only for advanced API use, not useful from GUI.

Inherit model:
Inherit model. Inherits the digestion of the parent model. This is only valid for Handlers and does not make sense to use on the model itself. Only for advanced API use, not useful from GUI.
Custom digestion, with known flanks:
Custom digestion, with known flanks. Indicates the input protein sequences are the peptides.

Separate multiple peptides assigned to a single protein by a slash ('/') in the sequence.

Peptides of 3 amino acids or fewer will not be analysed, but should still be included to specify the flanks.

You do not need to include all peptides for the protein(s) - only those you wish to be analysed.

Important

Flanking sequences must be specified! If you do not know the flanks, use the "Custom digestion, with unknown flanks" option instead.

Example: Given peptides A, B and C, where A and B are adjacent and C is at the end of the protein, the sequence may read:
xxx/AAAAA/BBBBB/yyy/zzz/CCCCC
Where x, y and z are the flanks for A's N-terminus, B's C-terminus and C's N-terminus respectively. Note that all flanking sequences could (redundantly) be specified if this is easier.
Custom digestion, with unknown flanks:
Custom digestion, with unknown flanks.

This is identical to "Custom digestion, with known flanks", above, but indicates that the peptide linkers are unknown.

Dummy "AAA" linkers will be assumed.

Example: Given peptides A, B and C, where their positions and flanking sequences are unknown, the sequence may read:
AAAAA/BBBBB/CCCCC
TrypChymo:
TrypChymo CUT: TrypChymo cuts after F, Y, W, L(or J), K or R if not followed by P. RGX: '(?<=[FYWLJKRX])(?!P)'

Asp-N_ambic:
Asp-N_ambic CUT: Asp-N Ammonium bicarbonate cleaves before D(or B) or E(or Z). RGX: '(?=[DBEZX])'

Lys-C/P:
Lys-C/P CUT: Lys-C/P cuts after K. RGX: '(?<=[KX])'

V8_DE:
V8-DE CUT: V8-DE cuts after D(or B) or E(or Z) if not followed by P. RGX: '(?<=[DBEZX])(?!P)'

V8_E:
V8-E CUT: V8-E cuts after E(or Z) if not followed by P. RGX: '(?<=[EZX])(?!P)'

Formic acid:
Formic acid CUT: Formic_acid cuts after D(or B) and next residue is D (or B). RGX: '((?<=[DBX]))|((?=[DBX]))'

Chymotrypsin/P:
Chymotrypsin/P CUT: Chymotrypsin cleaves following F, Y, W or L(or J) residue. RGX: '(?<=[FYWLJX])'

Lys-C:
Lys-C AKA: lys_c CUT: Lys-C cuts after K if not followed by P. RGX: '(?<=[KX])(?!P)'

PepsinA:
PepsinA CUT: PepsinA cuts after F or L(or J). RGX: '(?<=[FLJX])'

Trypsin/P:
Trypsin/P CUT: Trypsin/P cuts after K or R. RGX: '(?<=[KRX])'

Arg-C:
Arg-C AKA: arg_c; Clostripain; argc CUT: Arg-C cleaves following R residue unless the next residue is P. RGX: '(?<=[RX])(?!P)'

Trypsin:
Trypsin CUT: Trypsin cleaves following a K or R residue unless the next residue is P. RGX: '(?<=[KRX])(?!P)'

Chymotrypsin:
Chymotrypsin CUT: Chymotrypsin cleaves following F, Y, W or L(or J) residue unless the next residue is P. RGX: '(?<=[FYWLJX])(?!P)'

Asp-N:
Asp-N AKA: asp_n CUT: Asp-N cleaves before D(or B). RGX: '(?=[DBX])'

CNBr:
CNBr CUT: CNBr cleaves following M. RGX: '(?<=[MX])'

Arg-C/P:
Arg-C/P CUT: Arg-C/P cleaves after R residues. RGX: '(?<=[RX])'

Asp-N/B:
Asp-N/B CUT: Asp-N/B cleaves before D(while B is ignored). RGX: '(?=[DX])'

Lys-N:
Lys-N AKA: lys_n CUT: Lys-N cuts before K. RGX: '(?=[KX])'

leukocyte elastase:
leukocyte elastase CUT: leukocyte elastase cuts after A or L or I(or J) or V if not followed by P. RGX: '(?<=[ALIJVX])(?!P)'

cyanogen-bromide:
cyanogen-bromide CUT: cyanogen-bromide cuts after M. RGX: '(?<=[MX])'

iodosobenzoate:
iodosobenzoate CUT: ? RGX: '(?<=W)'

staphylococcal protease/D:
staphylococcal protease/D AKA: staphylococcal protease/D; Glu-C/D CUT: staphylococcal protease/D cuts after E(or Z). RGX: '(?<=[EZX])'

PepsinA + P:
PepsinA + P CUT: PepsinA + P cuts after F or L(or J) unless followed by P. RGX: '(?<=[FLJX])(?!P)'

proline endopeptidase:
proline endopeptidase CUT: proline endopeptidase cuts after HP, KP or RP if not followed by P. RGX: '(?<=[HKRX][PX])(?!P)'

Clostripain/P:
Clostripain/P CUT: Clostripain/P cuts after R. RGX: '(?<=[RX])'

elastase-trypsin-chymotrypsin:
elastase-trypsin-chymotrypsin CUT: elastase-trypsin-chymotrypsin cuts after A,L,I(or J),V,K,R,W,F,Y unless followed by P. RGX: '(?<=[ALIVKRWFYX])(?!P)'

Alpha-lytic protease:
Alpha-lytic protease CUT: Alpha-lytic protease (aLP) cuts after T, A, S, or V. RGX: '(?<=[TASVX])'

2-iodobenzoate:
2-iodobenzoate CUT: 2-iodobenzoate cuts after W. RGX: '(?<=[WX])'

proline-endopeptidase/HKR:
proline-endopeptidase/HKR CUT: proline-endopeptidase/HKR cuts after P. RGX: '(?<=[PX])'

glutamyl endopeptidase:
glutamyl endopeptidase AKA: Glu-C; glu_c; staphylococcal protease CUT: glutamyl endopeptidase cuts after D(or B) or E(or Z). RGX: '(?<=[DBEZX])'

Glu-C+P:
Glu-C+P AKA: staphylococcal protease+P; Glu-C+P CUT: Glu-C+P cuts after D(or B) or E(or Z) unless followed by P. RGX: '(?<=[DBEZX])(?!P)'
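
The regular expressions listed above are zero-width patterns that mark cleavage positions. As a rough illustration of how such a pattern splits a sequence, the sketch below uses Python's own `re` module (Python 3.7 or later, where `re.split` accepts empty matches). The helper `digest` is hypothetical, does not call OpenMS, and does not model missed cleavages.

```python
import re

# Patterns copied from the Trypsin and Asp-N entries in the list above.
TRYPSIN = r"(?<=[KRX])(?!P)"
ASP_N = r"(?=[DBX])"


def digest(sequence, pattern):
    """Split a protein sequence at every position matched by the pattern."""
    # Zero-width matches at either end of the sequence yield empty strings; drop them.
    return [fragment for fragment in re.split(pattern, sequence) if fragment]


print(digest("AKRPLKM", TRYPSIN))  # the R followed by P is not cleaved
print(digest("AKDLLDM", ASP_N))    # cleavage occurs before each D
```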


Specify one or more organisms to run your query against. You should specify organisms by their NCBI taxonomy ID, or their scientific name. If you leave this field blank the field will be completed automatically based on the protein sequences.

GUI: Use a comma to delimit multiple organisms.

In mandated mode all of the peptides you specify will be used. This may result in multiple qbricks per protein. In non-mandated mode, if you specify more peptides than necessary, those that the system considers least quantotypic will be dropped. If you specify fewer peptides than a qbrick requires then, regardless of this selection, the system will always supplement your selection with those from the pool of remaining peptides that it considers most quantotypic.

The same logic is applied to qblock and qmenu selection.
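
The selection rule can be sketched as below. This is a simplified illustration, not the library's implementation: `select_peptides` is a hypothetical name, `needed` stands for the number of peptides required, and the scores are assumed to be "higher is more quantotypic".

```python
def select_peptides(user_peptides, pool, scores, needed=2, mandated=False):
    """Choose peptides following the mandated / non-mandated rule."""
    chosen = list(user_peptides)
    if not mandated and len(chosen) > needed:
        # Non-mandated: keep only the most quantotypic of the user's peptides.
        chosen = sorted(chosen, key=lambda p: scores[p], reverse=True)[:needed]
    if len(chosen) < needed:
        # Either mode: top up from the remaining pool, best first.
        remaining = [p for p in pool if p not in chosen]
        remaining.sort(key=lambda p: scores[p], reverse=True)
        chosen += remaining[: needed - len(chosen)]
    return chosen


scores = {"P1": 0.9, "P2": 0.2, "P3": 0.5, "P4": 0.7}
pool = list(scores)
print(select_peptides(["P1", "P2", "P3"], pool, scores, mandated=True))
print(select_peptides(["P1", "P2", "P3"], pool, scores, mandated=False))
print(select_peptides(["P2"], pool, scores, mandated=False))
```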

Regenerate the results, even if they already exist on the system.

SQL backing mode. Turning backing on means you might not get the latest version of the scores, but operation will be considerably faster. Old scores can be purged from the database manually using their date column, see the alacat.utilities.sql_backing module for details.

Description of options:

No database: No database. The database is not used.

Read-only: Read-only. When set, scores are restored from the database. Use this to use, but not change, the database.

Write-only: Write-only. New scores are stored in the database. Use this to update scores with new ones.

Read-write: Read-write. Use this to retrieve scores from the database where possible, and to remember newly acquired scores in the database.
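
The four modes can be thought of as two independent switches, one for reading and one for writing. The sketch below illustrates that idea with a plain dictionary standing in for the database; the names `Backing` and `get_score` are hypothetical and do not reflect the actual alacat.utilities.sql_backing API.

```python
from enum import Flag, auto


class Backing(Flag):
    NONE = 0
    READ = auto()
    WRITE = auto()
    READ_WRITE = READ | WRITE


def get_score(key, compute, store, mode):
    """Return a score, consulting and updating the store according to mode."""
    if Backing.READ in mode and key in store:
        return store[key]       # reuse the stored (possibly older) score
    value = compute(key)        # otherwise calculate a fresh score
    if Backing.WRITE in mode:
        store[key] = value      # remember it for later runs
    return value


store = {}
calls = []


def slow_score(key):
    calls.append(key)
    return len(key)


get_score("PEPTIDEK", slow_score, store, Backing.READ_WRITE)
get_score("PEPTIDEK", slow_score, store, Backing.READ_WRITE)
print(len(calls))  # the second request is served from the store
```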


Shows the parameters but does not execute the workflow. You can use this to review your parameters before proceeding, or to copy the resulting parameter set for use in Python.

This input sets the assessment and weight of the column.

This input sets the assessment and weight of the column. This is a rank column, scores produced are the ranks of the values in the non-ranked column. Values are ordered such that more quantotypic values receive lower ranks.
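
A minimal sketch of how such a rank column could be derived from a value column. The function name `rank_values` is hypothetical, the `higher_is_better` switch is an assumption (the direction presumably depends on the column's assessment), and ties are broken by input order here, which may differ from the real behaviour.

```python
def rank_values(values, higher_is_better=True):
    """Return 1-based ranks so that the most quantotypic value gets rank 1."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=higher_is_better)
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


print(rank_values([0.2, 0.9, 0.5]))
print(rank_values([0.2, 0.9, 0.5], higher_is_better=False))
```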

This input sets the assessment and weight of the column.