Parse XML and find all instances of a string

14,517

Solution 1

Assuming the XML is stored in test.xml, the following XPath returns the Name attribute of every Task that contains the string "C:\":

//Task[contains(text(), "C:\") or .//*[contains(text(), "C:\")] or .//*[@*[contains(., "C:\")]]]/@Name

Explanation: the string "C:\" can appear in any of these places:

  • The text of the Task element itself
  • The text of any descendant element
  • Any attribute of any descendant element

PowerShell sample:

#read xml
$xml = [xml](gc -Encoding utf8 .\test.xml) 

#process it
$xml | 
   Select-Xml '//Task[contains(text(), "C:\") or .//*[contains(text(), "C:\")] or .//*[@*[contains(., "C:\")]]]/@Name' | 
   % { $_.Node."#text" }
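A minimal sketch for trying the XPath on a small inline document before running it against the real file. The two-task XML below is made up for illustration. The inner paths are written as `.//*` so that they are evaluated relative to each Task; an absolute `//*` inside the predicate would look at the whole document and match every Task as soon as any one of them mentions C:\.

```powershell
# Made-up sample: only the first task mentions C:\
$sample = [xml]@'
<Tasks>
  <Task Name="Copy files">
    <Source Path="C:\in" />
  </Task>
  <Task Name="Send mail">
    <Parameters><p>D:\out</p></Parameters>
  </Task>
</Tasks>
'@

# The inner paths start with ".//" so they stay inside the current Task
$xpath = '//Task[.//*[contains(text(), "C:\")] or .//*[@*[contains(., "C:\")]]]/@Name'

# Each result wraps an attribute node; Value is the attribute's text
$names = Select-Xml -Xml $sample -XPath $xpath | ForEach-Object { $_.Node.Value }
$names
```

With this sample only "Copy files" should be listed, because the second task contains no C:\ anywhere.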

Solution 2

When you cast to [xml], you can access everything using a really nice "property" syntax. Multiple nodes with the same tag will be exposed as arrays. Then you can use the InnerXml property to get at the raw XML string defining the current node. You then just need to do a simple "-like" match against your search string.
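As a rough illustration of the property syntax, here is a sketch against a made-up inline document (the task names and the Dir attribute are assumptions for the example). It also shows one caveat: InnerXml holds only the children of a node, so a path stored in an attribute of the Task element itself is only seen if you test OuterXml instead.

```powershell
# Made-up sample: task A has C:\ only in its own attribute, task B in a child
$doc = [xml]@'
<Tasks>
  <Task Name="A" Dir="C:\own"><Info>none</Info></Task>
  <Task Name="B"><Source Path="C:\in" /></Task>
</Tasks>
'@

# Two Task elements, so the property returns an array
$count = $doc.Tasks.Task.Count

# InnerXml: markup of the child nodes only
$inner = $doc.Tasks.Task | Where-Object { $_.InnerXml -like '*C:\*' } | ForEach-Object { $_.Name }

# OuterXml: the Task start tag (with its attributes) plus the children
$outer = $doc.Tasks.Task | Where-Object { $_.OuterXml -like '*C:\*' } | ForEach-Object { $_.Name }

"InnerXml matches: $inner"
"OuterXml matches: $outer"
```

Whether InnerXml or OuterXml is the better choice depends on whether the Task element's own attributes can hold a path in your data.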

Assuming you have multiple "Task" nodes under a single "Tasks" node in one file:

$tasks = [xml] (Get-Content .\Tasks.xml)
$tasks.Tasks.Task |?{ $_.InnerXml -like '*C:\*' } | select -expand Name

Or, if there is a single Task node in each of multiple files:

dir *.xml |%{ [xml] (Get-Content $_.FullName) } |?{ $_.Task.InnerXml -like '*C:\*' } |%{ $_.Task.Name }

These will get you the task names. Getting every line within the node which contains the search string is a bit trickier. Here's a hacky regex approach (I know I know, don't parse XML with regex...). Again, assuming a single Task node in each XML file:

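# load every XML file in the current directory (one Task per file)
$taskXmls = dir *.xml |%{ [xml](Get-Content $_.FullName) }

foreach($taskXml in $taskXmls)
{
   # cheap wildcard check first, so the regex only runs on matching tasks
   if($taskXml.Task.InnerXml -like '*C:\*')
   {
       # each hit is the tag (or element with its text) around one occurrence of C:\
       $hits = [Regex]::Matches($taskXml.Task.InnerXml, '<[^<]*C:\\[^>]*>')
       $hitList = $null
       if($hits.Count -gt 0)
       {
            $hitList = $hits | select -expand Value
       }
       # one object per task: its name plus the matching fragments
       new-object psobject -prop @{TaskName = $taskXml.Task.Name; Hits = $hitList}
   }
}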
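If you would rather avoid the regex, one possible alternative is to let XPath return the matching text nodes and attributes directly. The sketch below is not part of the original answer: Find-TaskString is a hypothetical helper name, the inline document is made up, and it assumes PowerShell 3.0 or later (for [pscustomobject]) and a search string that contains no double quote.

```powershell
# Hypothetical helper: lists each text node or attribute under a Task that contains $Needle
function Find-TaskString {
    param(
        [xml]$Document,
        [string]$Needle = 'C:\'
    )
    # Needle is embedded in an XPath string literal, so it must not contain a double quote
    $xpath = './/text()[contains(., "{0}")] | .//@*[contains(., "{0}")]' -f $Needle
    foreach ($task in $Document.SelectNodes('//Task')) {
        foreach ($node in $task.SelectNodes($xpath)) {
            # Attributes report their owning element; text nodes report their parent
            if ($node -is [System.Xml.XmlAttribute]) {
                $location = '{0}/@{1}' -f $node.OwnerElement.LocalName, $node.LocalName
            } else {
                $location = $node.ParentNode.LocalName
            }
            [pscustomobject]@{
                TaskName = $task.GetAttribute('Name')
                Location = $location
                Value    = $node.Value
            }
        }
    }
}

# Made-up sample with one hit in element text and one in an attribute
$demo = [xml]'<Tasks><Task Name="T1"><P>C:\a</P><Source Path="C:\b" /></Task></Tasks>'
$found = @(Find-TaskString -Document $demo)
$found | Format-Table -AutoSize
```

Because `.//@*` also covers the attributes of the Task element itself, this variant reports paths stored directly on the Task tag as well.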
Author by

mhopkins321

Updated on August 02, 2022

Comments

  • mhopkins321 over 1 year

    I'm working with an XML file that looks similar to the following, except that the Task element is repeated thousands of times. I will be using PowerShell to parse the XML.

    I need to find the names of all tasks in which the string "c:\" shows up. This would be easy if the string could appear in only one place, but it can show up almost anywhere in a task. In this particular task I have put C:\ in four different places.

    I'm hoping to get an output of the task name, and the places that the given path was referenced...

    <Task ID="00000000" Name="Task name goes here" Active="0" NextEID="22" CacheNames="random" AR="0" TT="COS">
            <Info>
                <Description>
                </Description>
                <Notes>
                </Notes>
            </Info>
            <Parameters>
                <moreParameters>C:\pathGoesHere</moreParameters>
            </Parameters>
            <Schedules/>
            <Source HostID="0" Type="FileSystem" Path="C:\path" FileMask="[Parm:parameter].txt" DeleteOrig="0" NewFilesOnly="0" SearchSubdirs="0" Unzip="0" RetryIfNoFiles="0" UseDefRetryCount="1" UseDefRetryTimeoutSecs="1" UseDefRescanSecs="1" UDMxFi="1" UDMxBy="1" ID="11"/>
            <For ID="13">
                <Destination HostID="000000" Type="siLock" FolderID="" FolderType="4" FolderName="Home/[Parm:parameter]/" Subject="" FileName="[OnlyName]_[YYYY][MM][DD].bai" UseOrigName="0" ForceDir="1" OverwriteOrig="1" UseRelativeSubdirs="1" Zip="0" UseDefRetryCount="1" UseDefRetryTimeoutSecs="1" UseDefUser="1" UseDefClientCert="1" ID="12"/>
                <If ID="14">
                    <When>
                        <Criteria>
                            <comp a="[ErrorCodeFile]" test="NEQ" b="0"/>
                        </Criteria>
                        <UpdOrig Action="d" ID="15"/>
                        <Destination HostID="0000000000" Type="Share" Path="C:\anotherCPath" FileName="[Parm:parameter]_[YYYY][MM][DD].bai" UseOrigName="0" ForceDir="1" OverwriteOrig="1" UseRelativeSubdirs="1" Zip="0" UseDefRetryCount="1" UseDefRetryTimeoutSecs="1" ID="17"/>
                    </When>
                </If>
            </For>
            <If ID="19">
                <When>
                    <Criteria>
                        <comp a="[ErrorCodeTask]" test="NNE" b="0"/>
                    </Criteria>
                    <Email HostID="385322183" Subject="[TaskStatus]-[TaskName]" Message="" AddressTo="[email protected]" Attachment = "C:\path\" UseDefRetryCount="1" UseDefRetryTimeoutSecs="1" ID="20"/>
                </When>
            </If>
        </Task>