Context
I'm testing schema validation of the mzML and mzXML output from msconvert and finding that the mzXML writer produces validation errors when multiple processing steps occur. This seems to be an issue with the mzXML 3.2 schema itself, which disallows multiple occurrences of the software element within a dataProcessing group. Based on the code in pwiz/data/msdata/Serializer_mzXML.cpp, dataProcessing seems to be an intentionally flexible field where either processingOperation elements or comments can document the processing done:
xmlWriter.startElement("dataProcessing", attributes);

BOOST_FOREACH(const ProcessingMethod& pm, dpPtr->processingMethods)
{
    CVParam fileFormatConversion = pm.cvParamChild(MS_file_format_conversion);

    string softwareType = fileFormatConversion.empty() ? "processing" : "conversion";

    if (pm.softwarePtr.get())
        writeSoftware(xmlWriter, pm.softwarePtr, msd, cvTranslator, softwareType);

    write_processingOperation(xmlWriter, pm, MS_file_format_conversion);
    write_processingOperation(xmlWriter, pm, MS_peak_picking);
    write_processingOperation(xmlWriter, pm, MS_deisotoping);
    write_processingOperation(xmlWriter, pm, MS_charge_deconvolution);
    write_processingOperation(xmlWriter, pm, MS_thresholding);

    xmlWriter.pushStyle(XMLWriter::StyleFlag_InlineInner);
    BOOST_FOREACH(const UserParam& param, pm.userParams)
    {
        xmlWriter.startElement("comment");
        xmlWriter.characters(param.name + (param.value.empty() ? string() : ": " + param.value));
        xmlWriter.endElement(); // comment
    }
    xmlWriter.popStyle();
}

xmlWriter.endElement(); // dataProcessing
However, the schema itself seems to enforce an unusually rigid structure on the contents of dataProcessing.
To reproduce, I converted a Thermo RAW file with the following msconvert config file:
mzXML=true
zlib=true
mz64=true
inten64=true
simAsSpectra=true
filter="peakPicking vendor msLevel=1-2"
filter="scanNumber 22289-22486"
This produced an mzXML containing the following lines:
<dataProcessing centroided="1">
<software type="conversion" name="ProteoWizard software" version="3.0.18342"/>
<processingOperation name="Conversion to mzML"/>
<software type="processing" name="ProteoWizard software" version="3.0.18342"/>
<comment>Thermo/Xcalibur peak picking</comment>
</dataProcessing>
Running a validator on the full mzXML with the appropriate schema gives the following validation error:
austin@austin-vm-ubuntu:~/gdrive/data/20181208_fix_demux_mzXML_schema$ xmllint --schema raw/mzXML_schema/schema_revision/mzXML_3.2/mzXML_idx_3.2.xsd data/01_trimmed/23aug2017_hela_serum_timecourse_4mz_narrow_1.mzXML --noout
data/01_trimmed/23aug2017_hela_serum_timecourse_4mz_narrow_1.mzXML:20: element software: Schemas validity error : Element '{http://sashimi.sourceforge.net/schema_revision/mzXML_3.2}software': This element is not expected. Expected is one of ( {http://sashimi.sourceforge.net/schema_revision/mzXML_3.2}processingOperation, {http://sashimi.sourceforge.net/schema_revision/mzXML_3.2}comment ).
data/01_trimmed/23aug2017_hela_serum_timecourse_4mz_narrow_1.mzXML fails to validate
Here's the xmllint validator version:
austin@austin-vm-ubuntu:~/gdrive/data/20181208_fix_demux_mzXML_schema$ xmllint -version
xmllint: using libxml version 20908
compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ICU ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib Lzma
Judging from the error, the problem seems to be that only one software element is allowed per dataProcessing element. Here is the visual XSD diagram from XMLSpy:
![image](https://user-images.githubusercontent.com/5126731/50126626-8715b080-0222-11e9-8244-326b2ea58fde.png)
Editing the XML manually to ensure that there was at least one processingOperation element per software element did not fix the issue. E.g., the following still produces the same validation error:
<dataProcessing centroided="1">
<software type="conversion" name="ProteoWizard software" version="3.0.18342"/>
<processingOperation name="Conversion to mzML"/>
<software type="processing" name="ProteoWizard software" version="3.0.18342"/>
<processingOperation name="Dummy processing op"/>
<comment>Thermo/Xcalibur peak picking</comment>
</dataProcessing>
Splitting the dataProcessing group into multiple groups, each with its own software element, still fails validation when a group contains a comment but no processingOperation. Here is the XML snippet and corresponding error:
<dataProcessing centroided="1">
<software type="conversion" name="ProteoWizard software" version="3.0.18342"/>
<processingOperation name="Conversion to mzML"/>
</dataProcessing>
<dataProcessing>
<software type="processing" name="ProteoWizard software" version="3.0.18342"/>
<comment>Thermo/Xcalibur peak picking</comment>
</dataProcessing>
austin@austin-vm-ubuntu:~/gdrive/data/20181208_fix_demux_mzXML_schema$ xmllint --schema raw/mzXML_schema/schema_revision/mzXML_3.2/mzXML_idx_3.2.xsd data/01_trimmed/23aug2017_hela_serum_timecourse_4mz_narrow_1_edited.mzXML --noout
data/01_trimmed/23aug2017_hela_serum_timecourse_4mz_narrow_1_edited.mzXML:23: element comment: Schemas validity error : Element '{http://sashimi.sourceforge.net/schema_revision/mzXML_3.2}comment': This element is not expected. Expected is ( {http://sashimi.sourceforge.net/schema_revision/mzXML_3.2}processingOperation ).
data/01_trimmed/23aug2017_hela_serum_timecourse_4mz_narrow_1_edited.mzXML fails to validate
However, once a dummy processingOperation element is added to the second group, the XML validates:
<dataProcessing centroided="1">
<software type="conversion" name="ProteoWizard software" version="3.0.18342"/>
<processingOperation name="Conversion to mzML"/>
</dataProcessing>
<dataProcessing>
<software type="processing" name="ProteoWizard software" version="3.0.18342"/>
<processingOperation name="User-defined"/>
<comment>Thermo/Xcalibur peak picking</comment>
</dataProcessing>
austin@austin-vm-ubuntu:~/gdrive/data/20181208_fix_demux_mzXML_schema$ xmllint --schema raw/mzXML_schema/schema_revision/mzXML_3.2/mzXML_idx_3.2.xsd data/01_trimmed/23aug2017_hela_serum_timecourse_4mz_narrow_1_edited.mzXML --noout
data/01_trimmed/23aug2017_hela_serum_timecourse_4mz_narrow_1_edited.mzXML validates
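Taken together, the xmllint results above are consistent with a content model of exactly one software element, followed by one or more processingOperation elements, followed by any number of comments. The following Python sketch encodes that inferred rule as a regular expression over child tag names and checks the four variants tried in this report. The rule is an assumption drawn from the error messages, not read from the XSD, and the function and variable names are made up for illustration.

```python
import re

# Assumed content model for dataProcessing, inferred only from the xmllint
# messages in this report (not taken from the XSD itself): one software
# element, then at least one processingOperation, then zero or more comments.
CONTENT_MODEL = re.compile(r"software( processingOperation)+( comment)*")


def children_fit_model(child_tags):
    """Return True if the ordered child tag names fit the assumed model."""
    return CONTENT_MODEL.fullmatch(" ".join(child_tags)) is not None


# Ordered child tags of the dataProcessing variants tried in this report.
variants = {
    "msconvert output": [
        "software", "processingOperation", "software", "comment"],
    "dummy op added, single group": [
        "software", "processingOperation",
        "software", "processingOperation", "comment"],
    "split, second group without op": ["software", "comment"],
    "split, second group with op": [
        "software", "processingOperation", "comment"],
}

for label, tags in variants.items():
    print(label, "->", children_fit_model(tags))
```

Only the last variant fits the assumed model, which agrees with the one case that xmllint accepted.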
Problem
Tools such as OpenSWATH that perform schema validation during mzXML import can fail on msconvert mzXML files. Does the 3.2 schema need to be updated to allow the flexibility that Serializer_mzXML currently assumes? Or can we constrain Serializer_mzXML instead?
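As a rough illustration of what constrained output could look like, here is a Python sketch that post-processes one dataProcessing element into the layout that validated above: one group per software element, with a placeholder processingOperation added to any group that lacks one. It is not a proposed serializer change. The function name is hypothetical, the tags are handled without the mzXML_3.2 namespace for brevity, and keeping the attributes only on the first group simply mirrors the manual edit in this report.

```python
import copy
import xml.etree.ElementTree as ET

# Placeholder name taken from the manual edit that validated in this report.
PLACEHOLDER_OP = "User-defined"


def split_data_processing(elem):
    """Split one dataProcessing element into one element per software child.

    Assumption: tags are un-namespaced. Real mzXML elements carry the
    mzXML_3.2 namespace, so the tag comparisons would need adjusting.
    """
    groups = []
    for child in elem:
        if child.tag == "software" or not groups:
            # Attributes are kept on the first group only (mirrors the manual edit).
            attrib = dict(elem.attrib) if not groups else {}
            groups.append(ET.Element("dataProcessing", attrib))
        clone = copy.deepcopy(child)
        clone.tail = None  # drop source whitespace between elements
        groups[-1].append(clone)

    for group in groups:
        if group.find("processingOperation") is None:
            # Place the placeholder right after software, ahead of any comment.
            index = 1 if group[0].tag == "software" else 0
            group.insert(index, ET.Element("processingOperation",
                                           {"name": PLACEHOLDER_OP}))
    return groups


source = ET.fromstring(
    '<dataProcessing centroided="1">'
    '<software type="conversion" name="ProteoWizard software" version="3.0.18342"/>'
    '<processingOperation name="Conversion to mzML"/>'
    '<software type="processing" name="ProteoWizard software" version="3.0.18342"/>'
    '<comment>Thermo/Xcalibur peak picking</comment>'
    '</dataProcessing>'
)

for group in split_data_processing(source):
    print(ET.tostring(group, encoding="unicode"))
```

Whether a placeholder operation is acceptable semantically, or whether the schema should instead make processingOperation optional, is exactly the open question here.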