With SIRIUS 6 we moved away from the file based project space to a Nitrite database. This was necessary to improve performance and enable new features. CLI, API and GUI store the computed results as SIRIUS project-space, which in turn can also be an input for the GUI or the CLI. This allows the user to review results in the GUI that have been computed with an automated workflow using the CLI or API.
Input
SIRIUS supports MS data in several formats: .ms
, .mgf
and Agilent’s .cef
contain already processed peak lists for each feature.
For .mzml
and .mzxml
feature detection and alignment will be performed by SIRIUS. However, all data must be centroided.
Peak list-based formats
MGF-Format
SIRIUS also supports the MGF (Mascot Generic Format). This file format was developed for peptide spectra for the mascot search engine. Each spectrum in a MGF file can contain many spectra each starting with and ending with . Peaks are again written as pairs of m/z and intensity values separated by whitespaces with one peak per line. Further meta information can be given as NAME=VALUE pairs. SIRIUS recognizes the following meta information:
-
PEPMASS: contains the measured mass of the ion (e.g. the parent peak)
-
CHARGE: contains the charge of the ion. As SIRIUS supports only single charged ions, this value can be either 1+ or 1-.
-
MSLEVEL: should be 1 for MS spectra and 2 for MS/MS spectra. SIRIUS will treat higher values automatically as MS/MS spectra, although, it might be that it supports MSn spectra in future versions.
This is an example for a MGF file:
BEGIN IONS
PEPMASS=438.32382
CHARGE=1+
MSLEVEL=2
185.041199 4034.674316
203.052597 12382.624023 245.063171 50792.085938 275.073975 124088.046875
305.084106 441539.125 335.094238 4754.061035 347.09494 13674.210938
365.105103 55487.472656
END IONS
See also the GNPS database for other examples of MGF files.
SIRIUS MS-Format
A disadvantage of these data formats is that
they do not contain all information necessary for SIRIUS to perform the
computation. Missing meta information have to be provided via the
commandline. Therefore, SIRIUS supports also an own file format very
similar to the MGF format above. The file ending of this format is .ms
.
Each file contains one measured feature (but arbitrary many spectra).
Each line may contain a peak (given as m/z and intensity separated by a
whitespace), meta information (starting with the > symbol followed
by the information type, a whitespace and the value) or comments
(starting with the # symbol). The following fields are recognized
by SIRIUS:
TODO: Is feature/compound already changed here?
-
>compound: The name of the measured feature (or any placeholder). This field is mandatory.
-
>parentmass: the mass of the parent peak
-
>formula: The molecular formula of the feature. This information is helpful if you already know the correct molecular formula and just want to compute a fragmentation tree or recalibrate the spectrum
-
>ion: the ionization mode. See for the format of ion modes.
-
>charge: is redundant if you already provided the ion mode. Otherwise, it gives the charge of the ion (1 or -1).
-
>ms1: All peaks after this line are interpreted as MS peaks
-
>ms2: All peaks after this line are interpreted as MS/MS peaks
-
>collision: The same as >ms2 with the difference that you can provide a collision energy
An example for a .ms file:
>compound Gentiobiose
>formula C12H22O11
>ionization \[M+Na\]+
>parentmass 365.10544
>ms1
365.10543 85.63 366.10887 11.69 367.11041 2.67
>collision 20
185.041199 4034.674316
203.052597 12382.624023
245.063171 50792.085938
275.073975 124088.046875
305.084106 441539.125
335.094238 4754.061035
347.09494 13674.210938
365.105103 55487.472656
Note: The .ms
file format is SIRIUS’ internal format and might change with as same speed as the SIRIUS
developend went forward. Nevertheless we try to provide backward compatibilty where possible.
A more detailed and commented but also WIP example of the an .ms
file can be found here
LCMS-Runs
SIRIUS can import full LCMS-Runs in (.mzML
/.mzXML
) format via the
prepossessing tool. In the GUI this is done implicitely when importing the data. In the CLI you can also run the LC-align subtool.
[todo: describe lcms-align]
Output
SIRIUS project-space
In SIRIUS 6, the project space is stored in a singular .sirius file that is no longer human or machine-readable. This step was necessary to ensure performance for many of the new features and is in no way intended to “close off” results. Summaries can be written as usual using the GUI or CLI. Advanced information or intermediate results (e.g. predicted fingerprints) can be accessed using the new API.
Summary files
The summaries written by CLI or GUI are in tsv (tab-separated-values) format and named “formula_identifications.tsv”, “canopus_formula_summary.tsv”, “canopus_structure_summary.tsv”, “structure_identifications.tsv” and “denovo_structure_identifications.tsv”. They provide easy access to the results for further downstream analysis, data sharing and data visualization. The summaries are not imported into SIRIUS but are (re-)created based on the actual results every time a project-space is exported. Summaries are created for molecular formula annotation, compound class prediction and structure annotation separately.
Molecular formula results summary
formula_identifications.tsv
contains the top-ranked formula result of each feature as
determined by the SIRIUS score, or the ZODIAC if available.
However, different adduct candidates with the same precursor ion molecular formula
will have identical score (e.g. [C20H14O6 + NH4]+
and [C20H19NO7 - H2O + H]+
).
In such cases, the top-ranked candidate in formula_identifications.tsv
is resolved
to [C20H17NO6 + H]+
only considering the ion type but ignoring adduct types.
formula_identifications_adducts.tsv
contains all top-ranked adducts (in this case
[C20H14O6 + NH4]+
and [C20H19NO7 - H2O + H]+
). This summary additionally contains all scores
shown in the GUI as well as potential lipid class annotations.
Molecular structure results summary
structure_identifications.tsv
contains the top-ranked structure result of each feature
as determined using the CSI:FingerID score; the molecular formula of the top structure
does not have to be the top-ranked molecular formula of this feature. The formula rank shows the
original rank of the molecular formula belonging to the top hit.
The summary contains confidence scores for exact and approximate mode, CSI:FingerID, ZODIAC and SIRIUS scores. Links to structure databases containing the structure hit can be found in the “links” column.
CANOPUS results summary
CANOPUS results are specific to a molecular formula. Since the top molecular formula annotated in the molecular formula annotation sub tool can potentially differ from the molecular formula of the top structure hit, both are reported separately.
canopus_formula_summary.tsv
contains compound classes predicted to be present by CANOPUS for
the top-ranked molecular formula of each feature. most specific class denotes the most specific compound class for this feature. The columns
level 5, subclass, class, and superclass refer to the ancestors of this most specific class.
canopus_structure_summary.tsv
contains compound classes predicted to be present by CANOPUS for
the molecular formula belonging to the top-ranked structure of each feature. most specific class denotes the most specific compound class for this feature. The columns
level 5, subclass, class, and superclass refer to the ancestors of this most specific class.
If there are multiple molecular formulas with
same score (which should happen only for adducts, see the molecular formula results summary) then the
canopus_summary.tsv
will decide for one molecular formula for each feature. We always choose the molecular formula
for which the CANOPUS probability of the most specific class is maximal.
Standardized project-space summary with mzTab-M
The project-space is a SIRIUS-specific format that allows the user to
access all results and analysis details, but may not be optimal for
sharing this data with third party tools or data archives. For this
purpose, SIRIUS provides an analysis report (analysis_report.mztab
) in
the standardized mzTab-M format. All results summarized in this report
are linked to the results in the corresponding SIRIUS project-space,
allowing the user to share summarized results using mzTab-M without
losing the connection to the detailed results provided in the
project-space. Furthermore, SIRIUS passes meta information (if available in the input)
such as scan numbers and ids of the input data into this analysis report. This allows
for an easy combination of the SIRIUS results with the results of other
analyses such as MS1-based quantification.
Parameter formats
Ion modes
Whenever SIRIUS requires the ion mode, it should be given in the following format:
[M+ADDUCT]+ for positive ions
[M+ADDUCT]- for negative ions
[M-ADDUCT]+/[M-ADDUCT]- for losses
[M]+/[M]- for instrinsically charged compounds
[M+?]+ for positive ions with unknown adduct
[M+?]- for negative ions with unknown adduct
ADDUCT is the molecular formula of the adduct. The most common
ionization modes are [M+H]+, [M+Na]+,[M-H]-, [M+Cl]-
.
Currently, SIRIUS supports only single-charged compounds, so [M+2H]2+
is not valid and will be skipped during import. Further dimers are also not yet
supported so that everything containing [2M]
will be skipped during import.
Molecular formulas
Molecular Formulas in SIRIUS must not contain brackets. Hence, 2(C2H2)
is not a
valid molecular formula; write C4H4
instead. Furthermore, all molecular
formulas in SIRIUS are always neutral, and there is no possibility to
add a charge on a molecular formula (instead, charges are given
separately). Hence, CH3+
is not a valid molecular formula. Write CH3
instead, and
provide the charge separately via commandline option.
Chemical alphabets
Whenever SIRIUS requires the chemical alphabet, you have to provide which elements should be considered and what is the maximum amount for each element. Chemical alphabets are written like molecular formulas. The maximum amount of an element is written in square brackets behind the element. If no square brackets are given, the element might occur arbitrary often. The standard alphabet is CHNOP[5]S, allowing the elements C, H, N O and S as well as up to five times the element P.