TogoMetabolomeDataFormat


TogoMD: the Togo Metabolome Data Format

The Togo Metabolome Data Format (TogoMD) is an easy-to-use data format designed to promote advanced use of metabolomics data. We aim to integrate domestic metabolome databases on the basis of this format.


Definition regarding Description Fields

XML Definition File (XSD)

The fields necessary for describing metabolome data, from metadata to peak data, have been carefully selected, and their names and descriptions have been defined. The definition is provided as the XML schema listed below.


URI http://metabolonote.kazusa.or.jp/TogoMetabolomeDbSchema.xsd
Version 1.2.0
Last modified Nov. 5, 2014

 

Correspondence of the XML Element/Attribute and Metabolonote Field Name

This section describes the XML elements and attributes. It also shows how each of them corresponds to the field name and the property name used on each Metabolonote page.

* Peak information (P) is not used in Metabolonote.


Columns, in order: ID, Label (Metabolonote page's field name), Property name (Metabolonote), Element name (XML schema), Attribute name or subelement name *1 (XML schema), Value format *2, Description
SE sample_set Sample set information.
Represents a set of experiments or a data acquisition project.
ID SE_ID id /SE\d+/ Sample set ID.
This ID is unique in the system. While the data is private, any alphanumeric characters can be used as a tentative ID.
Title SE_Title title STRING Short title
Description SE_Description description STRING Describes important concepts for interpreting data, such as experiment purposes and relevancy between samples.
Authors SE_Authors authors STRING Author
Reference SE_Reference reference STRING Related reference information
Comment SE_Comment comment TEXT *3 Comment
S sample Sample information.
Describes the preparation methods for each individual sample.
ID S_ID id /S\d+/ Sample ID.
This ID is unique within the sample set (SE).
Title S_Title title STRING Short sample name
Organism - Scientific Name S_Organism - Scientific Name organism_scientific_name STRING Scientific name.

This is required when biological samples are handled.

Organism - ID S_Organism - ID organism_id Database Name:ID[|Database Name:ID]... *4 Classification ID of the organism.
Compound - ID S_Compound - ID compound_id Database Name:ID[|Database Name:ID]... *4 Compound ID
Compound - Source S_Compound - Source compound_source STRING Information about the availability of the reagent: the company name and catalog id.
This is required when standard compounds are handled.
Preparation S_Preparation preparation STRING Growing methods, conditions, particular processing, sampling portions, sampling methods, and preparation methods for reagents
Sample Preparation Details ID S_Sample Preparation Details ID sample_preparation_details_id /SS\d+/ The ID of sample preparation details information (SS) applied.
Comment S_Comment comment TEXT *3 Comment
M analytical_method Analytical method information
Describes the instrumental analysis methods for individual samples.
ID M_ID id /M\d+/ Analytical method ID, unique within the sample (S).
Title M_Title title STRING Short title.
Method Set ID M_Method Set ID analytical_method_details_id /MS\d+/ Detailed analysis information ID (MS) applied.
Sample Amount M_Sample Amount sample_amount STRING Amount of sample used.
This information is needed to normalize quantitative data for comparison with other samples.
Comment M_Comment comment TEXT *3 Comment
D data_analysis Data analysis information.
Describes computational data analysis methods, such as peak extraction.
ID D_ID id /D\d+/ Data analysis method ID, unique within the analytical method (M).
Title D_Title title STRING Short Title.
Data Analysis Set ID D_Data Analysis Set ID data_analysis_details_id /DS\d+/ Detailed data analysis method information ID (DS) applied.
Recommended decimal places of m/z D_Recommended decimal places of m/z recommended_decimal_places_of_mass {default OR INT}{[|peak INT] OR [|Instrument X INT]}... *5 Number of significant figures.
Comment D_Comment comment TEXT *3 Comment
SS sample_preparation_details Detailed information about sample preparation.
Shared in the sample set.
ID SS_ID id /SS\d+/ Sample preparation details ID, unique within the sample set (SE).
Title SS_Title title STRING Short title
Description SS_Description description STRING Details about sample preparation.
In the case of biological samples, for example, details of growth conditions and drug treatments are described. Descriptions that depend on analytical methods should not be included here, and they should be included in the details of analytical methods (MS).
Comment_of_details SS_Comment of details comment_of_details TEXT *3 Comment
MS analytical_method_details Detailed analysis method information.

Shared within the sample set.

ID MS_ID id /MS\d+/ Detailed analysis information ID, unique within the sample set (SE).
Title MS_Title title STRING Short title
Instrument MS_Instrument instrument STRING Instrument name and vendor name
Instrument Type MS_Instrument Type instrument_type *6 Instrument type
Ionization MS_Ionization ionization_method *6 Ionization method
Ion Mode MS_Ion Mode ion_mode *6 Whether the analysis was run in positive or negative ion mode
Description MS_Description description STRING Details about methods of instrumental analysis.

Describes all details regarding analytical instruments and analysis conditions. Describes sample preparation methods too, other than information that depends on the sample. For example, homogenization and metabolite extraction method should be described here.

Comment_of_details MS_Comment of details comment_of_details TEXT *3 Comment
DS data_analysis_details Detailed information of data analysis methods.
Shared within the sample set.
ID DS_ID id /DS\d+/ Detailed data analysis method information ID, unique within the sample set (SE).
Title DS_Title title STRING Short title
Description DS_Description description STRING Describes all details regarding data analysis methods such as software programs used and the parameters adopted.
Comment_of_details DS_Comment of details comment_of_details TEXT *3 Comment
AM annotation_method_details Detailed information about annotation methods.
ID AM_ID id /AM\d+/ Annotation method ID, unique within the sample set (SE)
Title AM_Title title STRING Short title
Description AM_Description description STRING Describes details regarding annotation methods. Describes standards by which annotation has been assigned.
Comment_of_details AM_Comment of details comment_of_details TEXT *3 Comment
P *7 peak Peak information.
Detailed description of each individual peak obtained and its annotation.
Peak ID *7 @id /P\d+/ Peak ID, unique within the data analysis method information (D)
Intensity *7 intensity DOUBLE Peak intensity
Whether the value is relative or absolute is described in the data analysis method information (D).
Retention Time (min) *7 retention_time DOUBLE Retention time. The unit is minutes.
If CE-MS, this indicates Migration Time.
Retention Index *7 retention_index DOUBLE Retention time index.
If CE-MS, this indicates Migration Index.
Mass Detected *7 mass_detected DOUBLE m/z value of the parent ion that was detected.
If GC-MS, this indicates null.
Ion Species *7 ion_species STRING *6 If LC-MS, this indicates the type of ion detected.
[M+H]+, etc.
Isotope Peaks *7 isotope_peaks MI:MASS INT[|13C1:MASS INT[|13C2:MASS INT[|13C3:MASS INT...]]] *8 The m/z value of isotope peak and intensity information
EI MS spectrum *7 ei_mass_spectrum *9 *10 If GC-MS, this indicates MS spectrum information with EI.
MSn spectrum *7 msn_spectrum *9 *10 If LC-MS and CE-MS, this indicates the MSn spectrum.
UV absorption spectrum *7 uv_absorption_spectrum *9 *11 If LC-MS, this indicates the UV-Vis absorption spectrum.
NIR and IR will also be available in the future.
Annotation *7 annotation STRING Annotation information.
Describes information regarding the elemental formula, the compound name, the compound group name, and the degree of annotation confidence.
Annotation Method ID *7 annotation_method_details_id /AM\d+/ ID of the detailed information of annotation methods (AM)
Annotated Compound ID *7 annotated_compound_id Database Name:ID[|Database name:ID]... *4 Annotated compound ID
Comment *7 comment STRING Comment
  • *1 "@" indicates an attribute name; all other names are element names.
  • *2 "STRING" indicates a string without line breaks. "TEXT" indicates a string that may contain line breaks. "INT" indicates an integer. "DOUBLE" indicates a double-precision floating-point number. "MASS" indicates an m/z value. "ID" indicates a database ID. A string enclosed in "/" is a regular expression. A portion enclosed in "[" and "]" is an optional block. "..." indicates repetition of the preceding "[" "]" block or a similar pattern. The character "|" is a literal delimiter and does not mean "OR" as it does in regular expressions. A portion enclosed in "{" and "}" is a block of alternatives separated by "OR", where "OR" has the meaning it has in regular expressions. Other expressions are reserved words.
  • *3 When the line head is prefixed with "[", the portion up to the next character "]" is considered to be the subfield name. The portion up to the line end is considered to be the content of the subfield. This specification is prepared for future function enhancement.
  • *4 Only predefined strings may be used as the database name, although they are not always defined in the XSD.
  • *5 "default": A reserved word that means "just as described". Can still be used even though changed to an integer value. "peak": The number of digits of m/z detected within peak information. "Instrument X": The number of digits of mass within msn_spectrum.
  • *6 Only predefined strings may be used, although they are not always defined in the XSD.
  • *7 Peak information (P) is not used in Metabolonote.
  • *8 "MI": A reserved word that indicates the monoisotopic ion. MASS becomes identical with m/z detected. "Isotope (e.g. 13C1)" indicates the isotope of the isotopic peaks and the number of isotopic peaks within a molecule.
  • *9 Not written in the peak-table file. See Spectrum Data Format for details on how to describe this information.
  • *10 The xml definitions of MSn and EI MS. This value can have multiple ion elements with "mass" and "intensity" as attributes.
  • *11 The xml definitions of UV-Vis. This value can have multiple absorption elements with "wave_length" and "value" as attributes.
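The "Database Name:ID[|Database Name:ID]..." value format used by organism_id, compound_id, and annotated_compound_id can be split mechanically. The following is a minimal Python sketch, not part of the specification. The function name parse_db_ids is hypothetical, and the database names in the usage note are placeholders, since the set of permitted database names is not listed here. It also assumes that the database name itself contains no ":".

```python
def parse_db_ids(value):
    """Split a 'Database Name:ID|Database Name:ID...' string into pairs.

    Returns a list of (database_name, identifier) tuples.
    """
    pairs = []
    for item in value.split("|"):      # "|" is a literal delimiter (footnote *2)
        item = item.strip()
        if not item:
            continue                   # tolerate a trailing delimiter
        # Assumption: split on the first ":" only, so an ID may contain ":".
        database, sep, identifier = item.partition(":")
        database, identifier = database.strip(), identifier.strip()
        if not sep or not database or not identifier:
            raise ValueError(f"expected 'Database Name:ID', got {item!r}")
        pairs.append((database, identifier))
    return pairs
```

For example, parse_db_ids("DB1:0001|DB2:abc") yields two pairs, one for each database.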


Other rules

Omission of top-level ID

When a metadata ID is written with its top-level ID omitted, it is interpreted as belonging to the same top-level metadata. For example, the ID "DS2" written in the description of metadata "SE1_DS1" represents the ID "SE1_DS2".
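The expansion rule can be sketched in Python as follows. This is an illustration under one assumption: the "top-level ID" is taken to be the leading sample set part (SE followed by digits) of the metadata in which the short ID appears. The function name resolve_id is hypothetical.

```python
import re

# Assumption: the top-level ID is the leading "SE<digits>" part.
TOP_LEVEL = re.compile(r"SE\d+")

def resolve_id(written_id, context_id):
    """Expand an ID written without its top-level ID.

    written_id: the ID as written, e.g. "DS2" or "SE1_DS2"
    context_id: the full ID of the metadata in which it is written
    """
    if TOP_LEVEL.match(written_id):
        return written_id              # already carries a top-level ID
    top = TOP_LEVEL.match(context_id)
    if top is None:
        raise ValueError(f"context ID {context_id!r} has no top-level ID")
    return f"{top.group(0)}_{written_id}"
```

With this sketch, resolve_id("DS2", "SE1_DS1") gives "SE1_DS2", matching the example above, and an ID that already starts with a sample set ID is returned unchanged.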

"PSEUDO: " a blank node

A metadata whose Title starts with "PSEUDO: " represents a blank node, prepared for convenience as a place to attach lower-level metadata. Several processed data sets (D) can be used together in a further, integrated data analysis (D). In this case, the metadata for the integrated analysis should not be tied to a specific sample or to specific raw data. Describing such metadata requires a blank node to build the metadata hierarchy. The prefix "PSEUDO: " at the head of the Title of a sample (S) or analytical method (M) marks the metadata as such a blank node.

ID Assignment

See here for the rules for ID Assignment.

File Type and Extension

Data type Example of ID File descriptor (extension) Description File format
Metadata SE** .info.txt Files that contain the metadata of each class (SE, S, M, D, MS, DS, AM) The element name, the attribute name or subelement name of the XML schema, and their values are described in tab-delimited format.
Sample file is here.
SE**_S**
SE**_S**_M**
SE**_S**_M**_D**
SE**_S**_M**_D**_P**
Peak related data (for multiple peaks) SE**_S**_M**_D** .peak-table.txt Information on the detected peaks, described in a table. The attribute names or subelement names of the XML schema for peak information (P) (excluding the spectrum data) and their values are described in tab-delimited text format.
Sample file is here.
.msn-list.txt MSn spectrum data in list. See the section "Format of spectrum data file" for details.
A sample of msn-list file is here.
.uv-list.txt UV-Vis spectrum data in list.
.ei-list.txt EI mass spectrum data in list.
Peak related data (for a single peak) SE**_S**_M**_D**_P** .peak.txt Information on a detected peak. The format is the same as that of "peak-table.txt", although data for only one peak is included.
.msn.txt MSn spectrum data for a single peak Same as ".msn-list.txt" file
.uv.txt UV-Vis spectrum data for a single peak Same as ".uv-list.txt" file
.ei.txt EI mass spectrum data for a single peak Same as ".ei-list.txt" file
.peak-all.txt All information related to a single peak The data in .info.txt, .peak.txt, .msn.txt, .uv.txt, and .ei.txt (where they exist) are concatenated into one file.
Data type Example of ID File descriptor (extension) Description File format
Raw data (binary) SE**_S**_M** .bin.zip The binary raw data generated by the analytical instrument. A zip compressed file includes the binary raw file, .info.txt file and other additional files such as license information.
Raw data (text) SE**_S**_M**_D** .txt.zip Text files that contain unprocessed near-raw data extracted from the binary raw data. A zip compressed file includes the text files below, .info.txt file, and other additional files such as license information.
SE**_S**_M**_D** .raw-ms.txt Chromatogram data. The format will be discussed and defined according to requirements. If the full mass data and the MSn data are prepared in separate files, the raw-ms.txt files can be given branch numbers. At least one of raw-ms.txt or raw-ms-table.txt must be provided. If UV-Vis data exists, at least one of raw-uv.txt or raw-uv-table.txt must be provided.
SE**_S**_M**_D** .raw-uv.txt Raw UV-Vis spectrum data
SE**_S**_M**_D** .raw-ms-table.txt Mass chromatogram data in table format.
SE**_S**_M**_D** .raw-uv-table.txt UV-Vis spectrum data in table format.

Format of data file

Described in text files.

Common file header

Every file must contain the header line shown below as its first line.

  • " <tab> " means a tab (control character). The data values are shown in square brackets "[]".
# <tab> id <tab> [Database name]:[Metadata ID].[File descriptor]

(Example)

# <tab> id <tab> kazusa:SE01_S01_M01_D01.info.txt

Optional header

Other information can be attached after the first line.

# <tab> license <tab> [License information]

(Example)

# <tab> license <tab> CC BY-SA
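The header lines can be read with a few lines of Python. The sketch below is illustrative only. It assumes that each header line consists of exactly three tab-separated fields ("#", a key, and a value), that the spaces around " <tab> " in the examples are for display only, and that a metadata ID contains no "." character. The function names parse_headers and split_file_id are hypothetical.

```python
def parse_headers(lines):
    """Collect the leading '#<tab>key<tab>value' lines into a dict."""
    headers = {}
    for line in lines:
        fields = [f.strip() for f in line.rstrip("\r\n").split("\t")]
        if fields[0] != "#":
            break                      # first non-header line ends the header
        if len(fields) != 3:
            raise ValueError(f"malformed header line: {line!r}")
        headers[fields[1]] = fields[2]
    # The "id" line is mandatory and must come first.
    if next(iter(headers), None) != "id":
        raise ValueError("the first line must be the '# id' header")
    return headers

def split_file_id(value):
    """Split '[Database name]:[Metadata ID].[File descriptor]'."""
    database, colon, rest = value.partition(":")
    metadata_id, dot, descriptor = rest.partition(".")
    if not (colon and dot and database and metadata_id and descriptor):
        raise ValueError(f"malformed id value: {value!r}")
    return database, metadata_id, descriptor
```

Applied to the example above, split_file_id("kazusa:SE01_S01_M01_D01.info.txt") separates the database name "kazusa", the metadata ID "SE01_S01_M01_D01", and the file descriptor "info.txt".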

Peak table

Data file that contains information of multiple peaks in tab delimited table format.


A column header line follows the common header line.


The attribute names and subelement names of the XML schema for peak information (P), listed in the section "Correspondence of the XML Element/Attribute and Metabolonote Field Name", are written on this line, delimited by tabs.

  • The spectrum data (ei_mass_spectrum, msn_spectrum, and uv_absorption_spectrum) should not be included in this file.

The data values are written on the following lines, delimited by tabs.

(Example)

[Figure: example of a peak table file (Help TogoMD PeakTable.png)]

Sample file is here.
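A peak table laid out as described above can be read with Python's standard csv module. The sketch below rests on assumptions that the text does not confirm: header lines are the only lines starting with "#", the column header line is the first line after them, and the peak ID column is named "id". The function name read_peak_table is hypothetical.

```python
import csv

# Spectrum data must not appear in a peak table.
SPECTRUM_COLUMNS = {"ei_mass_spectrum", "msn_spectrum", "uv_absorption_spectrum"}

def read_peak_table(text):
    """Return the peaks of a peak-table file as a list of dicts."""
    lines = text.splitlines()
    start = 0
    # Assumption: common and optional header lines all start with "#".
    while start < len(lines) and lines[start].startswith("#"):
        start += 1
    reader = csv.DictReader(lines[start:], delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    unexpected = SPECTRUM_COLUMNS.intersection(reader.fieldnames or [])
    if unexpected:
        raise ValueError(f"spectrum columns not allowed: {sorted(unexpected)}")
    return list(reader)
```

All values are returned as strings. Converting DOUBLE fields such as retention_time with float() is left to the caller.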

Format of spectrum data file

This file format is used to describe the following kinds of data.

  • MSn spectrum data
  • EI mass spectrum data generated by GC-MS analysis
  • UV-Vis absorption spectrum data


One or more data blocks, as defined below, are described after the common header line.

Each block consists of a header line starting with ">" followed by one or more data lines, each containing a pair of values. Tabs are used as the delimiter.

(Example) In the case of msn-list data

[Figure: example of an msn-list file (Help TogoMD MsnList.png)]

A sample of msn-list file is here.

Header line

The columns of the header line contain the following.

Column Description Requirements Value format *1
1 Peak ID mandatory />P\d+/ (">" + Peak ID)
2 Descriptor for MSn and detector type mandatory STRING *2
3 Type of instrument mandatory STRING *3
4 Ion mode mandatory for MSn data /[+|-]/ (positive or negative)
5 Mass scan mode mandatory for MSn and EI data /[c|p]/ (centroid or profile)
6 Ionization method mandatory for MSn and EI data STRING *4
7 Collision energy mandatory for MSn and EI data STRING *5
8 m/z scan range mandatory for MSn and EI data /[\d\.]+-[\d\.]+/

*1 Same as *2 of the table in the section "Correspondence of the XML Element/Attribute and Metabolonote Field Name".

*2 Details of the descriptor are described in the next section.

*3 Predefined strings such as ITMS, FTMS, TOF-MS for EI, and PDA for UV-Vis analysis should be used.

*4 Predefined strings such as ESI and EI should be used.

*5 Different descriptions can be described according to the type of instruments. (Example) cid35.00, 70eV, etc.

Descriptor for MSn and detector type

Multi-stage MS (MSn): msn event descriptor [mass value of precursor ion @ msn event descriptor that generated the precursor ion]
Electron ionization: EI
UV-Vis absorption spectrum: PDA, etc.

The data for a single peak ID must not contain more than one msn event descriptor with the same name.

In the case of MS2, the part [mass value of precursor ion @...] can be omitted, because the precursor ion is evidently the same as the peak metabolite.

In the case of MS3 or a later stage of MSn, the part [mass value of precursor ion @...] must be described, because the origin of the precursor ion needs to be identified.

(Example)

ms3_1 [123.456@ms2_1]

The msn event descriptor

The descriptor is "ms" followed by the number of the stage. If multiple data exist for the same stage, they should be distinguished by branch numbers.

(Example)

Derived from the peak metabolite

ms2

Derived from the peak metabolite (for example, when multiple MS2 data are acquired at different retention times).

ms2_1, ms2_2, etc.

Derived from a product ion generated by an MS2 analysis.

ms3, ms3_1, etc.
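The descriptor rules above can be checked with a regular expression. The following Python sketch is one possible reading of the rules, not an official grammar. It assumes that the bracketed part is separated from the event descriptor by optional whitespace and that it is written exactly as in the example "ms3_1 [123.456@ms2_1]". The function name parse_msn_descriptor is hypothetical.

```python
import re

# event descriptor: "ms" + stage number, optionally "_" + branch number;
# optional "[precursor m/z @ parent event descriptor]" part.
DESCRIPTOR = re.compile(
    r"^(ms(\d+)(?:_(\d+))?)"
    r"(?:\s*\[([\d.]+)@(ms\d+(?:_\d+)?)\])?$"
)

def parse_msn_descriptor(text):
    """Parse an MSn descriptor such as 'ms2' or 'ms3_1 [123.456@ms2_1]'."""
    match = DESCRIPTOR.match(text.strip())
    if match is None:
        raise ValueError(f"not an MSn descriptor: {text!r}")
    event, stage, branch, precursor_mz, parent = match.groups()
    stage = int(stage)
    # For MS3 and later, the precursor part is mandatory.
    if stage >= 3 and parent is None:
        raise ValueError(f"{event}: precursor ion must be given for MS3 or later")
    return {
        "event": event,
        "stage": stage,
        "branch": int(branch) if branch else None,
        "precursor_mz": float(precursor_mz) if precursor_mz else None,
        "parent_event": parent,
    }
```

For "ms2" the precursor fields come back as None, which reflects the rule that the precursor part may be omitted for MS2.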

Data lines

Column Description Requirement Value format *1
1 m/z value or wave length (nm) mandatory DOUBLE
2 intensity mandatory DOUBLE
  • *1 Same as *2 of the table in the section "Correspondence of the XML Element/Attribute and Metabolonote Field Name".
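Putting the header line and data line definitions together, a spectrum data file can be read block by block. The Python sketch below is a minimal illustration. It assumes that lines starting with "#" are file headers, that blank lines can be ignored, and that the first three header columns are always present. The function name parse_spectrum_blocks and the dictionary keys are hypothetical.

```python
def parse_spectrum_blocks(lines):
    """Group the lines of a spectrum data file into blocks.

    Each block is a dict holding the header columns and a list of
    (m/z or wavelength, intensity) pairs.
    """
    blocks = []
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue                   # skip blank lines and file headers
        fields = line.split("\t")
        if line.startswith(">"):
            if len(fields) < 3:
                raise ValueError(f"header needs at least 3 columns: {line!r}")
            current = {
                "peak_id": fields[0][1:],      # drop the leading ">"
                "descriptor": fields[1],
                "instrument_type": fields[2],
                "other_columns": fields[3:],   # columns 4 to 8, if present
                "points": [],
            }
            blocks.append(current)
        else:
            if current is None:
                raise ValueError("data line found before any header line")
            current["points"].append((float(fields[0]), float(fields[1])))
    return blocks
```

Columns 4 to 8 are kept as raw strings because they are mandatory only for some data types. A stricter reader could validate them against the patterns in the header line table.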