Update for TPP Users
December 1 2017
This update fixes an issue with validating PSMs that use 15N with PeptideProphet in TPP when using the browser front end. Many thanks for the user reports that found this edge case. Essentially, 15N-labeled proteins were being grouped with their 14N counterparts, rather than being treated as independent proteins. The solution has two parts: a quick fix to the Kojak pepXML output in version 1.6.1, and, when using TPP, adding ‘-nR’ to the “Enter additional options to pass directly to the command-line (expert use only!)” field of the PeptideProphet options on the Analyze Peptides page. That ‘-nR’ parameter instructs the TPP to also resist the urge to group 15N and 14N proteins into the same group based on sequence homology. Future versions of TPP will be able to figure this out on their own, but TPP is on a different release schedule.
To put this latest version of Kojak in the TPP, back up your existing Kojak.exe file in the C:\TPP\bin folder by renaming it. Then copy the new Kojak into that folder. It will then work natively in the TPP environment.
Next update to the page will include a series of new demos for using Kojak in the TPP.
Big upgrades in 1.6.0
November 27 2017
Today marks the official 1.6.0 release. There are some pretty hefty changes beyond the usual minor feature additions and bugfixes. As mentioned in the June alpha release, one new analysis type incorporates 15N for homomultimer cross-linking. Briefly, to perform 15N analysis, the labeled protein must be duplicated in the sequence database, with a unique identifier:
>protein1
MASTHAKEEILSVNAQWKADRGHLSELED...
>15N_protein1
MASTHAKEEILSVNAQWKADRGHLSELED...
Then add the following parameter to your Kojak configuration file:
15N_filter = 15N
More details are provided in the June news entry, but also note that the parameter name has changed slightly for clarity.
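For anyone who would rather script the database step, here is a minimal Python sketch that appends a prefixed copy of selected proteins to the text of a FASTA file. It is not part of Kojak: the function name add_labeled_copies and its arguments are made up for illustration, and the default prefix is simply chosen to match the 15N_filter value shown above.

```python
def add_labeled_copies(fasta_text, labeled_names, prefix="15N_"):
    """Return FASTA text with a prefixed duplicate of each listed protein appended.

    Hypothetical helper for illustration only; not a Kojak utility.
    """
    entries = []  # list of (header without '>', [sequence lines])
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            entries.append((line[1:], []))
        elif entries:
            entries[-1][1].append(line)

    out = []
    # Keep every original (14N) entry unchanged.
    for header, seq_lines in entries:
        out.append(">" + header)
        out.extend(seq_lines)
    # Append an identical sequence under a prefixed name for each labeled protein.
    for header, seq_lines in entries:
        name = header.split(maxsplit=1)[0] if header.strip() else ""
        if name in labeled_names:
            out.append(">" + prefix + header)
            out.extend(seq_lines)
    return "\n".join(out) + "\n"
```

Read your database into a string, call add_labeled_copies(text, {"protein1"}), and write the result to a new file. Proteins not named in the set are left as single, unlabeled entries.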
The second big change alters the search method for cross-linked peptides. Originally, all peptides were searched individually, and a list of the top hits (user defined with the top_count parameter) was kept to find pairs of peptides that explained the spectrum. This was a fast and smart way to reduce a large search space of all possible peptide combinations to just the most relevant combinations given an observed spectrum. It also had a caveat in that cross-links in which one of the peptides had little or no fragmentation could not be suggested as the best possible hit because that peptide did not make the top hits. Furthermore, there was no clear definition of how large top_count should be set (the answer is appropriately large for your data set; 250 is just a recommended starting point). Often users would set a small value, which is appropriate for targeting large peptides with few modifications, not small peptides with many modifications. So this particular design implementation was revisited and revised.
The new method searches the upper half of the peptides (precursor mass divided by two and larger) and keeps the top hits. Then among those top hits, all remaining peptides in the database that can be paired to this list are searched and scored. This has the added benefit of testing and scoring even the smallest of peptides that would not have made it onto the list under the previous method. In some ways this is equivalent to searching more peptide combinations. At the same time, it is no longer necessary to maintain a large top_count for each spectrum, which reduces the number of peptide combinations, but the reduction falls distinctly among the least likely candidates.
A few additional clarifications should be made. First, there is an assumption that the larger peptide in a cross-link will have the largest contribution to the final score for the PSM. If it does not, then hopefully its score is at least large enough to make the top hits anyway. Second, a lot more PSMs with one good scoring peptide will make it into the results, often paired with a very small counterpart peptide. These large-to-small PSMs are the most highly suspect. Keep this in mind when performing validation on your PSMs by your method of choice. Third, this method also means that top_count as a parameter has been repurposed. A more appropriate value might now be 25 instead of 500. If you use a large value such as 500, there may be performance issues on systems with insufficient memory. And likely there is no benefit on any system.
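The two-pass idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Kojak's actual code: the function name two_pass_search, the score callable, the mass tolerance, and the choice to subtract the linker mass before halving are all my own simplifications.

```python
def two_pass_search(peptides, precursor_mass, linker_mass, score,
                    top_count=25, tol=0.01):
    """Toy two-pass pairing: score large peptides first, then complete the pairs.

    peptides: list of (sequence, mass) tuples
    score:    hypothetical callable, sequence -> float (higher is better)
    """
    # Assumption: the larger peptide of a pair carries at least half of the
    # mass left over after removing the cross-linker.
    half = (precursor_mass - linker_mass) / 2.0

    # Pass 1: score only the upper half of the mass range and keep the best few.
    large = [(score(seq), seq, mass) for seq, mass in peptides
             if half <= mass < precursor_mass]
    large.sort(reverse=True)
    top_hits = large[:top_count]

    # Pass 2: for each top hit, score every peptide whose mass completes the
    # precursor, no matter how small it is or how poorly it scores alone.
    results = []
    for s1, seq1, m1 in top_hits:
        target = precursor_mass - linker_mass - m1
        for seq2, m2 in peptides:
            if abs(m2 - target) <= tol:
                results.append((s1 + score(seq2), seq1, seq2))
    results.sort(reverse=True)
    return results
```

Note that the small partner peptide is only ever scored in the second pass, so it never has to compete for a place in the top hits. That is the behavior described above, and also why large-to-small pairs show up more often in the results.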
Finally, there are a lot of new diagnostic features and reports. As these are intended for advanced users, I will spend more time explaining them in a future post.
Thanks everyone for your patience while this release was being prepared. I know the website documentation is now slightly out of date, but updating it is on my radar. Also, I cannot recommend Percolator for validation anymore. The latest Percolator version (3.1) has standard, single peptide assumptions that prevent it from working with Kojak (or perhaps any cross-linking) output. I recommend switching to PeptideProphet in the Trans-Proteomic Pipeline.
15N analysis with version 1.6.0 alpha
June 9 2017
Thank you to everyone who said hello during ASMS. It was great to talk to so many users and get feedback. I appreciate all the support for Kojak and look forward to implementing new features and suggestions from everyone’s comments. Particularly, there were a lot of requests for an early release of Kojak’s newest feature, 15N-labeled analysis for identification of protein homomultimers.
Cross-links between homo-multimers are difficult to decipher, because it may be impossible to tell if these are intra-protein or inter-protein cross-links. Labeling one of the multimers with 15N and linking it to its 14N counterpart can distinguish inter-protein cross-links. The newest feature in Kojak allows this type of analysis.
To perform 15N analysis, two steps must be taken. First, the labeled protein must be duplicated in the sequence database, with a unique identifier. Here is an example:
>protein1
MASTHAKEEILSVNAQWKADRGHLSELED...
>n15_protein1
MASTHAKEEILSVNAQWKADRGHLSELED...
Notice in the example above that the sequences are identical, but the labeled protein now has “n15” as a unique identifier. The second step is to add the following parameter to your Kojak configuration file:
n15_filter = n15
Here, the new n15_filter parameter indicates that any protein name prefaced with “n15” will have the mass of 15N incorporated into all its amino acids. Also note that if you have many other proteins that are not labeled, they will be analyzed using the normal 14N masses so long as they do not begin with “n15” in their protein name.
This newest feature has not been fully tested yet, so bugs may still exist. Despite this, there seemed to be overwhelming support from users to start using this new feature now, so I am providing my developer’s version of Kojak (1.6.0-dev). Please note that this is outside my usual release cycle, and should be considered alpha software that may change before the official release. Also, I can only provide it for Windows 64-bit at this time. I will be out of the lab for most of June, but please email me any feedback and I will respond as soon as I am able.
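To make the mass change concrete, here is a small Python sketch of the arithmetic: each nitrogen atom in a fully labeled peptide adds the 15N minus 14N mass difference. The function names are hypothetical and this is not how Kojak is implemented internally; it only shows what "the mass of 15N incorporated into all its amino acids" works out to, assuming complete labeling and no nitrogen-containing modifications.

```python
# Number of nitrogen atoms in each standard amino acid residue.
NITROGENS = {
    "G": 1, "A": 1, "S": 1, "P": 1, "V": 1, "T": 1, "C": 1, "L": 1, "I": 1,
    "N": 2, "D": 1, "Q": 2, "K": 2, "E": 1, "M": 1, "H": 3, "F": 1, "R": 4,
    "Y": 1, "W": 2,
}

# Monoisotopic mass difference between 15N and 14N, in Da.
N15_SHIFT = 15.0001088982 - 14.0030740048


def n15_mass_shift(sequence):
    """Mass added to a peptide when every nitrogen is 15N instead of 14N."""
    return sum(NITROGENS[aa] for aa in sequence) * N15_SHIFT


def is_labeled(protein_name, n15_filter="n15"):
    """True when the protein name begins with the filter text."""
    return protein_name.startswith(n15_filter)
```

For example, a peptide with ten nitrogen atoms is shifted by roughly 9.97 Da, which is what lets the labeled and unlabeled copies of the same sequence be told apart.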
MGF File Support
May 4 2017
I’ve always preferred file formats with richer meta information, but that doesn’t diminish the fact that there are a lot of MGF data files floating around. Simply converting them back to mzML doesn’t restore the lost meta information. This fact turned out to be problematic for a lab that had only MGF files for DDA scan events, and did not have the original raw files containing the precursor scans. Kojak did not work after converting the MGF files to mzML format, because it assumed the user could provide the precursor scans in the data file. As a result, it would appear MGF files were not supported.
The solution was to add a new parameter to version 1.5.5: precursor_refinement
This parameter toggles the Kojak precursor analysis routines. These routines must be disabled when there are no precursor scans in the data file. But it is not limited to MGF files. The parameter can also be used with mzML and mzXML files. When skipping precursor refinement, Kojak will use the instrument-predicted precursor mass to define peptide search boundaries. If a scan does not have a predicted precursor mass, the selected ion m/z and predicted charge states will be used. Optimal performance is most frequently achieved when using the most accurate precursor mass possible, and so it is recommended to keep precursor refinement turned ON. But in cases where the precursor scans are no longer available, this option just isn’t possible. By default, precursor refinement is turned on. Set precursor_refinement = 0 in your configuration file to disable it.
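As a rough illustration of the fallback case, the Python sketch below turns a selected ion m/z and a list of possible charge states into candidate neutral precursor masses. The function name is hypothetical, and exactly how Kojak derives its search boundaries from these values is not described here; this only shows the standard m/z to mass arithmetic.

```python
PROTON_MASS = 1.007276466  # mass of a proton, in Da


def neutral_masses(selected_mz, charge_states):
    """Candidate neutral precursor masses, one per possible charge state.

    Hypothetical helper: returns (charge, neutral mass) pairs.
    """
    return [(z, (selected_mz - PROTON_MASS) * z) for z in charge_states]
```

A selected ion at m/z 500.0 with possible charges 2 and 3 gives two quite different candidate masses (about 998 Da and 1497 Da), which is part of why an accurate, refined precursor mass is preferable whenever the precursor scans are available.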
A consequence of this update is better MGF support. Hopefully few, if any, MGF files will fail with Kojak. Please notify me if you have any such cases. Otherwise, search away on your MGF collections - no conversion to other formats necessary.
Improved Feature Customization
February 9 2017
It all started with a simple request on the code repo message board: could the variable modifications on peptide c- and n-termini be restored? To give this post a little context, early versions of Kojak allowed for specification of modifications on the c- and n-termini of peptides using $ and @, respectively, as amino acid wildcards. Admittedly, this was not well thought out. The issue of modifications is much broader. Fixed or differential modifications? To the peptide termini or protein termini? Single or multiple modifications to the same amino acid? At the time, there was conflict in the code on how to resolve some of these considerations, so the peptide termini modifications were quietly shelved in favor of the typically necessary (in XL experiments) protein termini modifications that were causing the conflict.
This latest update (1.5.4) restores the peptide termini modifications. More importantly to users, there have been some parameter and syntax changes designed to clarify and facilitate use of modifications in Kojak. The changes in the Kojak code were not trivial, and this is really a discussion for a different time. However, the user interface need not reflect that complexity. So here is a brief summary of the new user interface, which can be explored in more detail in the parameter documentation.
All differential modifications to peptides are specified using the modification parameter. The parameter accepts a single uppercase amino acid letter and the differential mass. A lowercase ‘c’ or ‘n’ is used to specify the modification is on the peptide c-terminus or peptide n-terminus, respectively. If more than one differential modification is required, specify a unique modification parameter line for each modification. As many as you want. You can even list the same amino acid in multiple lines with a different modification mass each time. This can be used, for example, to identify singly, doubly, and triply methylated lysines.
Differential modifications to protein termini are specified with their own modification_protC and modification_protN parameters. These parameters need only the differential mass as values. It is possible to specify more than one differential protein modification with multiple instances of these parameters.
Fixed modifications are changes to the mass values that are applied to all instances of the specified amino acids and termini. The syntax for fixed modifications to peptides is identical to the syntax for differential modifications, except the parameter is named fixed_modification. The amino acid is specified in upper case. The peptide c-terminus or n-terminus is specified with lowercase ‘c’ or ‘n’, respectively. Add multiple fixed_modification lines to the Kojak configuration file to indicate multiple mass differences in the analysis.
Like the differential modifications to protein termini, fixed modifications can also be specified to the protein c-terminus or n-terminus. Use fixed_modification_protC and fixed_modification_protN to specify these mass differences. These parameters need only the mass as values.
To summarize, parameters to indicate modifications have been expanded from two to six. Under the new rules, there are specific ways to indicate the mass differences be applied to the peptides and peptide termini, or the protein termini. All mass values are in addition to the existing default amino acid values. There are no special characters (e.g. $ and @) to specify termini. Use lowercase ‘c’ and ‘n’ to specify peptide termini modifications, or the appropriate protein termini modification parameters.
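The six parameters can be seen side by side with a short Python sketch that renders configuration lines from a list of modifications. The helper name modification_lines and the tuple layout are hypothetical, and the exact spacing of the output is an assumption modeled on the parameter examples elsewhere on this page; check the parameter documentation for the authoritative syntax.

```python
def modification_lines(mods):
    """Render Kojak-style modification lines (illustrative only).

    mods: iterable of (kind, site, mass)
      kind: 'differential' or 'fixed'
      site: uppercase residue, 'c' or 'n' for peptide termini,
            or 'protC' / 'protN' for protein termini
    """
    lines = []
    for kind, site, mass in mods:
        if kind not in ("differential", "fixed"):
            raise ValueError(f"unknown kind: {kind!r}")
        name = "modification" if kind == "differential" else "fixed_modification"
        if site in ("protC", "protN"):
            # Protein termini have their own parameters and take only a mass.
            lines.append(f"{name}_{site} = {mass}")
        else:
            # Residues and peptide termini share one parameter: site, then mass.
            lines.append(f"{name} = {site} {mass}")
    return lines
```

For instance, passing one fixed modification on C, two differential modifications on K with different masses, and a differential mass on protN yields one fixed_modification line, two modification lines for K, and one modification_protN line.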
Chemical cross-linkers pose a special case. They frequently target multiple sites at the protein level (multiple amino acids and the protein termini). A cross-linker that binds on only one side (hydrolyzing on the other side) can therefore create a diverse set of differential modifications to search for. Rather than list all possibilities as a large set of modification parameters, Kojak has a simple shortcut: the mono_link parameter. This parameter accepts a set of amino acid characters. It also accepts ‘c’ or ‘n’ to specify the protein C-terminus or N-terminus, respectively. The final value is the differential mass to apply. This is a very convenient shortcut. For example, by specifying “mono_link = cDE -0.9837153”, the necessary parameters to define an EDC mono-link have been reduced from three differential modification parameters to a single mono_link parameter.
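To show what the shortcut stands in for, here is a minimal Python sketch that expands a mono_link value into the equivalent differential modification lines. It is not Kojak's parser: the function name expand_mono_link is hypothetical, and the layout of the generated lines is an assumption based on the parameter descriptions on this page.

```python
def expand_mono_link(value):
    """Expand a mono_link value such as 'cDE -0.9837153' into equivalent lines."""
    sites, mass = value.split()
    lines = []
    for site in sites:
        if site == "c":
            # In mono_link, lowercase 'c' means the protein C-terminus.
            lines.append(f"modification_protC = {mass}")
        elif site == "n":
            # Lowercase 'n' means the protein N-terminus.
            lines.append(f"modification_protN = {mass}")
        elif site.isupper():
            lines.append(f"modification = {site} {mass}")
        else:
            raise ValueError(f"unexpected site character: {site!r}")
    return lines
```

Calling expand_mono_link("cDE -0.9837153") returns three lines: one modification_protC line and one modification line each for D and E, all with the same mass.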