Transforming an XML document according to an XSD using XSLT

18,471

Solution 1

You have two challenges here: (1) identifying the set of element names and attributes declared in the schema, with appropriate context information for local declarations, and (2) writing XSLT to retain elements and attributes which match those names or names-and-contexts.

There is also a third issue, namely specifying clearly what you mean by "elements and attributes that are (or are not) defined in the XSD schema". For purposes of discussion I'll assume you mean elements and attributes which could be bound to element or attribute declarations in the schema, in a validation episode (a) rooted at an arbitrary point in the input document tree and (b) starting with a top-level element declaration or attribute declaration. This assumption means several things:

  • Local element declarations will only match things in context: in your example, keptElement1 and keptElement2 will be retained only when they are children of parent, not otherwise.
  • There is no guarantee that the elements in the input would in fact be bound to the element declarations in question: if one of their ancestors is locally invalid, things get complicated fast both in XSD 1.0 and in 1.1.
  • We don't allow for starting validation from a named type definition; we could, but it doesn't sound as if that's what you're interested in.
  • We don't allow for starting validation from local element or attribute declarations.

With those assumptions explicit, we can turn to your problem.

The first task requires that you make a list of (a) all the elements and attributes with top-level declarations in your schema, and (b) all the elements and attributes reachable from them. For top-level declarations, all we need to record is the kind of object (element or attribute) and the expanded name. For local objects, we need the kind of object and the full path from a top-level element declaration. For your sample schema, list (a) consists of

  • element {}parent

(I am using the convention of writing expanded names with the namespace name in braces; some call this Clark notation, for James Clark.)

List (b) consists of

  • element {}parent/{}keptElement1
  • element {}parent/{}keptElement2
  • attribute {}parent/{}keptAttribute1
  • attribute {}parent/{}keptAttribute2

In more complicated schemas, there will be a certain amount of bookkeeping as you go through the process of generating this list.
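
For a schema as simple as your sample, that bookkeeping can itself be done in XSLT, since an XSD is just XML. What follows is a minimal sketch, not a general solution. It assumes the schema fragment is wrapped in an xs:schema element, has no target namespace, and uses only inline (anonymous) complex types. It does not follow type= or ref= references, named groups, or attribute groups. It also writes the paths in XPath-like form (parent/@keptAttribute1) rather than in the braces notation used above, so that they can be pasted into match patterns.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: lists declared elements and attributes with their context.
     Assumes inline complex types and no target namespace; type=, ref=,
     xs:group and xs:attributeGroup references are NOT resolved. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:output method="text"/>

  <!-- List (a): start only from top-level declarations. -->
  <xsl:template match="/xs:schema">
    <xsl:apply-templates select="xs:element | xs:attribute"/>
  </xsl:template>

  <!-- Record the element, then descend with its name appended to the path. -->
  <xsl:template match="xs:element[@name]">
    <xsl:param name="path" select="''"/>
    <xsl:value-of select="concat('element ', $path, @name, '&#10;')"/>
    <xsl:apply-templates select="xs:*">
      <xsl:with-param name="path" select="concat($path, @name, '/')"/>
    </xsl:apply-templates>
  </xsl:template>

  <!-- Record the attribute in the context of the enclosing element path. -->
  <xsl:template match="xs:attribute[@name]">
    <xsl:param name="path" select="''"/>
    <xsl:value-of select="concat('attribute ', $path, '@', @name, '&#10;')"/>
  </xsl:template>

  <!-- Any other schema component (xs:complexType, xs:sequence, ...):
       keep walking and pass the current path through unchanged. -->
  <xsl:template match="xs:*">
    <xsl:param name="path" select="''"/>
    <xsl:apply-templates select="xs:*">
      <xsl:with-param name="path" select="$path"/>
    </xsl:apply-templates>
  </xsl:template>

</xsl:stylesheet>
```

Run against the sample schema, this should print one line per entry of lists (a) and (b): element parent, element parent/keptElement1, element parent/keptElement2, attribute parent/@keptAttribute1 and attribute parent/@keptAttribute2. Named types, references and namespaces are exactly where the extra bookkeeping comes in.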

Your second task is to write an XSLT stylesheet that keeps the elements and attributes in the list and drops the rest. (I'm assuming here that when you drop an element, you drop all its contents, too; your question talks about elements, not tags.)

For each element in the list, write an appropriate identity transform, using the context given in the list:

<xsl:template match="parent">
  <xsl:copy>
    <xsl:apply-templates select="@* | node()"/>
  </xsl:copy>
</xsl:template>

You can write a separate template for each element, or you can write several elements into the match pattern:

<xsl:template match="parent
                    | parent/keptElement1 
                    | parent/keptElement2">
  <xsl:copy>
    <xsl:apply-templates select="@* | node()"/>
  </xsl:copy>
</xsl:template>

For each attribute in the list, do the same:

<xsl:template match="parent/@keptAttribute1">
  <xsl:copy/>
</xsl:template>

Override the default templates for elements and attributes, to suppress all other elements and attributes:

<xsl:template match="*|@*"/>
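
Assembled into one stylesheet for the sample schema, the fragments above might look like the following. This is a sketch under the assumptions already stated (no namespaces, dropped elements lose their content); the only thing added beyond the fragments is the second attribute from list (b).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Elements from the list: copy them and keep processing their content. -->
  <xsl:template match="parent
                      | parent/keptElement1
                      | parent/keptElement2">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Attributes from the list: copy them unchanged. -->
  <xsl:template match="parent/@keptAttribute1
                      | parent/@keptAttribute2">
    <xsl:copy/>
  </xsl:template>

  <!-- Every other element or attribute: drop it, content included. -->
  <xsl:template match="*|@*"/>

</xsl:stylesheet>
```

The catch-all template loses to the specific ones because a bare * or @* pattern has a lower default priority than a name test or a path pattern. Text nodes are still copied by the built-in template, so the whitespace that surrounded a dropped element stays in the output; comments and processing instructions are dropped by the built-in rules.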

[Alternatively, as suggested by DrMacro, you can write a function or named template in XSLT to consult the list you generated in task 1, instead of writing it out into repetitive templates with explicit match patterns. Depending on your background, you may find that that approach makes it easier, or harder, to understand what the stylesheet is doing.]

Solution 2

This cannot be done with generic XSLT processing because the XSLT engine has no knowledge of the XSD.

This leaves a couple of options:

  1. Process the XSD document directly with XSLT to determine what element types are and are not actually declared and then use that information in your transform. For example, if an element is in a namespace that isn't governed by your XSD schema, then you know it's not defined, or if the element's namespace is specified by an xs:any wildcard with processContents="lax", you know it's not declared.

  2. Use the commercial version of Saxon, which provides XSD parsing and validation and provides access to the additional properties added to elements by the XSD processing. See the Saxon documentation for details.

The Apache Xerces project includes an XSD parser in Java that can be used to process complex XSDs to do whatever you need, such as build a list of element types or namespaces that are or are not governed by a given schema. So if your schema is relatively static, it might be most efficient to preprocess the schema to build a simple data file your XSLT can then use when processing documents.

You didn't say if you can use XSLT 2, but if you can, the general solution would be to define a function that determines if a given element or attribute is declared and then use that function as part of a standard identity transform. With XSLT 1 you can get the same effect with a named template.

For example:

<xsl:function name="local:isGoverned" as="xs:boolean">
  <xsl:param name="context" as="node()"/>
  <xsl:variable name="isGoverned" as="xs:boolean">
    <!-- Do whatever you do to determine governedness,
         whether this is to look at your collected data
         or use Saxon-provided info or whatever.
    -->
  </xsl:variable>
  <xsl:sequence select="$isGoverned"/>
</xsl:function>
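
As one possible way to fill in that placeholder, here is a sketch that looks the node up in a list of collected paths. Everything specific in it is an assumption made for illustration: the urn:example:local namespace bound to the local prefix, the helper local:pathOf, the variable $governedPaths, and the hard-coded paths for the sample schema (in practice they would come from your preprocessed schema data). It also assumes the documents use no namespaces.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:local="urn:example:local"
    exclude-result-prefixes="xs local">

  <!-- Hypothetical hard-coded list; stands in for data collected
       from the schema. -->
  <xsl:variable name="governedPaths" as="xs:string*"
      select="('parent',
               'parent/keptElement1',
               'parent/keptElement2',
               'parent/@keptAttribute1',
               'parent/@keptAttribute2')"/>

  <!-- Slash-separated path of an element or attribute,
       for example /parent/@keptAttribute1 -->
  <xsl:function name="local:pathOf" as="xs:string">
    <xsl:param name="context" as="node()"/>
    <xsl:variable name="self" as="xs:string"
        select="if ($context instance of attribute())
                then concat('@', name($context))
                else name($context)"/>
    <xsl:sequence
        select="concat('/', string-join(($context/ancestor::*/name(), $self), '/'))"/>
  </xsl:function>

  <!-- True if the node's path ends with one of the collected paths,
       compared on whole path steps. -->
  <xsl:function name="local:isGoverned" as="xs:boolean">
    <xsl:param name="context" as="node()"/>
    <xsl:variable name="path" as="xs:string" select="local:pathOf($context)"/>
    <xsl:sequence
        select="some $p in $governedPaths
                satisfies ends-with($path, concat('/', $p))"/>
  </xsl:function>

</xsl:stylesheet>
```

Comparing against the end of the path, with a leading slash on both sides, gives the same "in context" behaviour as a match pattern such as parent/keptElement1: the element is accepted wherever a parent element occurs, but not under an element whose name merely ends in "parent".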

And then in your identity transform:

<xsl:template match="*">
  <xsl:copy>
    <xsl:apply-templates
      select="
         @*[local:isGoverned(.)],
         node()[not(self::*) or local:isGoverned(.)]"
    />
  </xsl:copy>
</xsl:template>

<xsl:template match="@* | text() | comment() | processing-instruction()">
  <xsl:sequence select="."/>
</xsl:template>

This will have the effect of only passing through those elements and attributes that are governed by the XSD, however you figure that out.

Eliot

Author: Stian Standahl

I work as a full-stack developer. Outside work I make small apps for fun and trawl Stack Overflow for problems that I can solve. The most fun on Stack Overflow is super-optimizing and benchmarking simple problems to make them more efficient.

Updated on June 27, 2022

Comments

  • Stian Standahl almost 2 years

    I would like to create an XSLT stylesheet that transforms an XML document so that all of the elements and attributes that are not defined in the XSD are excluded from the output XML.

    Let's say you have this XSD.

    <xs:element name="parent">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="keptElement1" />
                <xs:element name="keptElement2" />
            </xs:sequence>
    
            <xs:attribute name="keptAttribute1" />
            <xs:attribute name="keptAttribute2" />
        </xs:complexType>
    </xs:element>
    

    And you have this input XML

    <parent keptAttribute1="kept" 
        keptAttribute2="kept" 
        notKeptAttribute3="not kept" 
        notKeptAttribute4="not kept">
    
        <notKeptElement0>not kept</notKeptElement0>
        <keptElement1>kept</keptElement1>
        <keptElement2>kept</keptElement2>
        <notKeptElement3>not kept</notKeptElement3>
    </parent>
    

    Then I would like the output XML to look like this.

    <parent keptAttribute1="kept" 
        keptAttribute2="kept">
    
        <keptElement1>kept</keptElement1>
        <keptElement2>kept</keptElement2>
    </parent>
    

    I am able to do this by specifying the elements explicitly, but this is about as far as my XSLT skills reach. I have trouble doing this generally for all elements and all attributes.