How to parse a YAML file with multiple documents?


Solution 1

The error message is quite specific: a document needs to start with a document start marker. Your first document has no such marker, although it does have a document end marker. Once you explicitly end the first document with ..., PyYAML no longer accepts a following document without document boundary markers; you have to start it explicitly with ---.

The end of your file should look like:

    kind: UndergroundDistributionLineSegment
...
---
ingests:
  - timestamp: 1970-01-01T00:00:00.000Z
    id: OverheadDistributionLineSegment_31168454

You can leave out the explicit document start marker from the first document, but you need to include a start marker for every following document. Document end markers are optional.
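
If you cannot edit the file by hand, the rule above can be applied in a small pre-processing pass. The following is only a sketch using the standard library; the function name add_missing_start_markers is made up for illustration, and it assumes that a line consisting solely of ... is a document end marker:

```python
def add_missing_start_markers(text):
    """Insert '---' after each '...' line that is not followed by one."""
    lines = text.splitlines()
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        # Assumption: a line that is exactly '...' (ignoring trailing
        # whitespace) is a document end marker.
        if line.rstrip() != "...":
            continue
        # Look ahead to the next line that is neither blank nor a comment.
        following = next(
            (l for l in lines[i + 1:]
             if l.strip() and not l.lstrip().startswith("#")),
            None,
        )
        # Leave the stream alone if it ends here, or if a start marker,
        # a directive, or another end marker comes next.
        if following is not None and not following.startswith(("---", "%", "...")):
            out.append("---")
    return "\n".join(out) + "\n"
```

The result can then be handed to the YAML parser in place of the original text.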

If you don't have complete control over the input, using .load_all() is not safe. There normally is no reason to take that risk and you should be using .safe_load_all() and extend the SafeLoader to handle any specific tags that your YAML might contain.
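
As a rough sketch of what that could look like with PyYAML: the sample below is an abbreviated, corrected version of the file from the question, and the !vector tag is purely hypothetical (the original file stores these values as plain strings). VectorLoader, construct_vector and load_documents are illustrative names, not part of PyYAML.

```python
import yaml


class VectorLoader(yaml.SafeLoader):
    """SafeLoader subclass, so the extra tag is registered on this class only."""


def construct_vector(loader, node):
    # '!vector' is a hypothetical tag used for illustration only.
    return tuple(loader.construct_sequence(node))


VectorLoader.add_constructor("!vector", construct_vector)

STREAM = """\
ingests:
  - id: SwitchBank_35496721
    loc: !vector [43.05292, -76.1268, 0.0]
...
---
ingests:
  - id: OverheadDistributionLineSegment_31168454
"""


def load_documents(stream):
    # load_all returns a lazy generator, so parse errors only surface while
    # iterating; list() forces the parsing to happen inside this function.
    return list(yaml.load_all(stream, Loader=VectorLoader))


documents = load_documents(STREAM)
for doc in documents:
    print(doc["ingests"][0]["id"])
```

If your YAML has no custom tags, yaml.safe_load_all(stream) is enough. Because the generator is lazy, a try/except wrapped only around the load_all call does not catch a ParserError; the exception is raised during iteration.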

Apart from that you should start your YAML documents with an explicit version directive before the document start indicator (which you should also add to the first document):

%YAML 1.1
---

This is for the benefit of future editors of your YAML files, because you are using PyYAML, which only supports (most of) YAML 1.1 and not the YAML 1.2 specification (from 2009). The alternative is of course to upgrade your YAML parser to e.g. ruamel.yaml, which would also have warned you about your use of the unsafe load_all() (disclaimer: I am the author of that parser). ruamel.yaml doesn't allow you to have a bare document after an explicit end-of-document marker; since YAML 1.2 allows this, as @flyx pointed out, that is a bug in ruamel.yaml.

Solution 2

I think your YAML is invalid.

Look at the second document in the sample: it begins with ... instead of ---.

... 
ingests:
  - timestamp: 1970-01-01T00:00:00.000Z
    id: OverheadDistributionLineSegment_31168454
Author by

BigBoy1337


Updated on July 09, 2022

Comments

  • BigBoy1337
    BigBoy1337 almost 2 years

    Here is my parsing code:

    import yaml
    
    def yaml_as_python(val):
        """Convert YAML to dict"""
        try:
            return yaml.load_all(val)
        except yaml.YAMLError as exc:
            return exc
    
    with open('circuits-small.yaml','r') as input_file:
        results = yaml_as_python(input_file)
        print results
        for value in results:
             print value
    

    Here is a sample of the file:

    ingests:
      - timestamp: 1970-01-01T00:00:00.000Z
        id: SwitchBank_35496721
        attrs:
          Feeder: Line_928
          Switch.normalOpen: 'true'
          IdentifiedObject.description: SwitchBank
          IdentifiedObject.mRID: SwitchBank_35496721
          PowerSystemResource.circuit: '928'
          IdentifiedObject.name: SwitchBank_35496721
          IdentifiedObject.aliasName: SwitchBank_35496721
        loc: vector [43.05292, -76.126800000000003, 0.0]
        kind: SwitchBank
      - timestamp: 1970-01-01T00:00:00.000Z
        id: UndergroundDistributionLineSegment_34862802
        attrs:
          Feeder: Line_928
          status: de-energized
          IdentifiedObject.description: UndergroundDistributionLineSegment
          IdentifiedObject.mRID: UndergroundDistributionLineSegment_34862802
          PowerSystemResource.circuit: '928'
          IdentifiedObject.name: UndergroundDistributionLineSegment_34862802
        path:
        - vector [43.052942000000002, -76.126716000000002, 0.0]
        - vector [43.052585000000001, -76.126515999999995, 0.0]
        kind: UndergroundDistributionLineSegment
      - timestamp: 1970-01-01T00:00:00.000Z
        id: UndergroundDistributionLineSegment_34806014
        attrs:
          Feeder: Line_928
          status: de-energized
          IdentifiedObject.description: UndergroundDistributionLineSegment
          IdentifiedObject.mRID: UndergroundDistributionLineSegment_34806014
          PowerSystemResource.circuit: '928'
          IdentifiedObject.name: UndergroundDistributionLineSegment_34806014
        path:
        - vector [43.05292, -76.126800000000003, 0.0]
        - vector [43.052928999999999, -76.126766000000003, 0.0]
        - vector [43.052942000000002, -76.126716000000002, 0.0]
        kind: UndergroundDistributionLineSegment
    ... 
    ingests:
      - timestamp: 1970-01-01T00:00:00.000Z
        id: OverheadDistributionLineSegment_31168454
    

    In the traceback, note that it starts having a problem at the ...

    Traceback (most recent call last):
      File "convert.py", line 29, in <module>
        for value in results:
      File "/Users/conduce-laptop/anaconda2/lib/python2.7/site-packages/yaml/__init__.py", line 82, in load_all
        while loader.check_data():
      File "/Users/conduce-laptop/anaconda2/lib/python2.7/site-packages/yaml/constructor.py", line 28, in check_data
        return self.check_node()
      File "/Users/conduce-laptop/anaconda2/lib/python2.7/site-packages/yaml/composer.py", line 18, in check_node
        if self.check_event(StreamStartEvent):
      File "/Users/conduce-laptop/anaconda2/lib/python2.7/site-packages/yaml/parser.py", line 98, in check_event
        self.current_event = self.state()
      File "/Users/conduce-laptop/anaconda2/lib/python2.7/site-packages/yaml/parser.py", line 174, in parse_document_start
        self.peek_token().start_mark)
    yaml.parser.ParserError: expected '<document start>', but found '<block mapping start>'
      in "circuits-small.yaml", line 42, column 1
    

    What I would like is for it to parse each of these documents as a separate object, perhaps all of them in the same list, or pretty much anything else that would work with the PyYAML module. I believe the ... is actually valid YAML so I am surprised that it doesn't handle it automatically.

  • flyx
    flyx about 7 years
    ... ends the previous document. The scalar ingests then starts a new document implicitly. Using --- instead would also work, because that explicitly starts a new document, while it implicitly ends the previous document.
  • flyx
    flyx about 7 years
    Addendum: That's only valid for YAML 1.2. In YAML 1.1, you indeed need a ---.
  • flyx
    flyx about 7 years
    You should change your links to lead to the YAML 1.1 specification, because in YAML 1.2, it is perfectly valid to have an implicit document after a document suffix. And Example 9.3, which you linked, in the 1.2 spec directly shows that.
  • Anthon
    Anthon about 7 years
    @flyx thanks for pointing that out. I updated the answer; fixing ruamel.yaml to conform to that takes a bit more work. I think you can argue that this is not needed in YAML 1.1 either: 'A line beginning with "---" may be used to explicitly denote the beginning of a new YAML document' (emphasis mine).
  • flyx
    flyx about 7 years
    The relevant production in YAML 1.1 is l-yaml-stream, which captures all documents after the first one as l-next-document, which resolves to an l-explicit-document, and that must start with ---.