How do I find duplicate data in xml document using XQuery?

10,853

Solution 1

Everything MarkLogic returns is just a sequence of nodes, so we can count the sequence size of the whole and compare it to the count of the sequence of distinct values. If they're not distinct, they're duplicate, and you have your subset.

for $c in doc()//colors
where fn:count($c/color) != fn:count(fn:distinct-values($c/color))
return $c

Solution 2

This should do the trick. I am not too familiar with MarkLogic, so the first line to get the set of documents may be wrong. This will return all documents which have 2 or more color elements with the same string value.

for $doc in doc()
let $colors = $doc//color/string(.)
where some $color in $colors
      satisfies count($colors[. = $color] > 1)
return doc()

Solution 3

Or you could do it completely out of indexes :)

for $c in doc()//colors is likely to create an EXPANDED TREE CACHE error on larger data sets.

Here is a slightly more complicated way to attack this when the data is huge, make sure the URI Lexicon is turned on and then add a element range index on the element color and compute the distinct color values that have duplication somewhere. Then loop over only the documents that have this color one by one and compute the item-frequency counts of the colors of interest in the documents. If you get a frequency over 1, this document needs de-duplication.

let $qn := xs:QName("color")
let $colorsWithItemFreq := cts:element-values($qn, (), ("ascending", "item-order", "item-frequency"))
let $colorsOfInterest := 
    for $color at $i in cts:element-values($qn, (), ("ascending", "item-order", "fragment-frequency"))
    let $fragFrequency := cts:frequency($color)
    let $itemFrequency := cts:frequency($colorsWithItemFreq[$i])
    where $itemFrequency gt $fragFrequency
    return 
        $color

for $uri in cts:uris( (), ("document"), cts:element-value-query($qn, $colorsOfInterest)
let $colorsWithDuplicationInThisDoc :=
    for $color in cts:element-values($qn, (), ("item-frequency"), cts:document-query($uri) )
    where $color = $colorsOfInterest and cts:frequency($color) gt 1
    return
        $color
where fn:count( $colorsWithDuplicationInThisDoc ) gt 1
return
    $uri

Hope that helps.

Solution 4

For this XML:

<?xml version="1.0"?>
<colors>
    <color>Red</color>
    <color>Red</color>
    <color>Blue</color>
</colors>

Using this XSD:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method = "text" />  
    <xsl:strip-space elements="*"/>

    <xsl:template match="colors">

        <xsl:for-each select="color">
            <xsl:variable name="node_color" select="text()"/>
            <xsl:variable name="numEntries" select="count(../color[text()=$node_color])"/>
            <xsl:if test="$numEntries &gt; 1">
                <xsl:text>Color value of </xsl:text><xsl:value-of select="."/><xsl:text> has multiple entries &#xa;</xsl:text>      
            </xsl:if>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>

I got this output:

Color value of Red has multiple entries 
Color value of Red has multiple entries 

So that will at least find them, but it will report each occurrence of a repeated color, not just every repeated color.

Share:
10,853
Sixty4Bit
Author by

Sixty4Bit

Knowledgeable in Java, Ruby, PHP, JavaScript, HTML, CSS, database design, GIS and BigData. http://osparna.com

Updated on June 05, 2022

Comments

  • Sixty4Bit
    Sixty4Bit almost 2 years

    I have a bunch of documents in a MarkLogic xml database. One document has:

    <colors>
      <color>red</color>
      <color>red</color>
    </colors>
    

    Having multiple colors is not a problem. Having multiple colors that are both red is a problem. How do I find the documents that have duplicate data?