Parsing EDGAR filings

python parsing python-2.7 sgml

16,779

Solution 1

The pysec project looks promising. It's a basic Django app that downloads the EDGAR index and then lets you download specific filings and extract financial parameters from the XBRL.

Solution 2

Look at the OpenSP toolkit, which has programs to process SGML files. Your simplest option is probably to use the osx program to get an XML version of the input file, after which you can use XML processing tools.

There may be some setup to do first, as the OpenSP package doesn't come with the EDGAR DTD or its SGML declaration (the first part of the stuff in your reference at page 48, starting with <!SGML "ISO 8879-1986"). You will have to get these as text files and add them to the catalogs where the SP parser can find them.

UPDATE: This document seems to be a more up-to-date version. A casual Google search doesn't turn up any immediately machine-processable versions, though, so you may have to copy-paste from the PDF.

However, if you do so, there will be some extraneous formatting you'll have to remove: there seem to be page break indicators, labelled "C-1", "C-2", and so on. They are not part of SGML and need to be deleted.
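As a rough sketch of that cleanup in Python, assuming each marker sits alone on its own line and always has the form "C-" followed by digits (check this against your pasted text; the sample DTD lines and the helper name strip_page_markers are made up for illustration):

```python
import re

# Assumption: a page marker is a whole line such as "C-1" or "C-2",
# optionally surrounded by whitespace.
PAGE_MARKER = re.compile(r'^\s*C-\d+\s*$')

def strip_page_markers(text):
    """Return text with every line that is only a page marker removed."""
    kept = [line for line in text.splitlines() if not PAGE_MARKER.match(line)]
    return '\n'.join(kept)

# Made-up sample standing in for text copied out of the PDF.
sample = '<!ELEMENT a - O (#PCDATA)>\nC-1\n<!ELEMENT b - O (#PCDATA)>'
cleaned = strip_page_markers(sample)
```

Because the pattern is anchored to the whole line, text that merely contains something like "C-1" in the middle of a declaration is left alone.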

You can either add the SGML declaration and the EDGAR DTD to the catalog (in which case the DTD file should contain only the part between the [ after <!DOCTYPE submission and the matching ] at the end), or you can create a "prolog" file consisting of both parts together as is (i.e. including the <!DOCTYPE submission [ and ]>) and run any program in the toolkit on the prolog and your SGML file. That is, put both names on the command line, with the prolog file first, so that the parser reads the two files in the correct order.

To understand what's happening, you need to know that an SGML parser needs three pieces of information for a parse: an SGML declaration to set some environmental and processing parameters, then a DTD to describe the structural constraints on a document, and finally the document itself.
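For instance, the prolog-first invocation could be driven from Python roughly like this. It is only a sketch: it assumes the converter is on your PATH as osx (the program name is a parameter in case your installation uses a different one), that it writes the XML to standard output, and the file names and the helpers build_osx_command, xml_text, and sgml_to_text are hypothetical.

```python
import subprocess
import xml.etree.ElementTree as ET

def build_osx_command(prolog_path, sgml_path, program='osx'):
    """Argument list for the converter: prolog file first, then the document."""
    return [program, prolog_path, sgml_path]

def xml_text(xml_bytes):
    """Concatenate all character data in an XML document, dropping the tags."""
    root = ET.fromstring(xml_bytes)
    return ''.join(root.itertext())

def sgml_to_text(prolog_path, sgml_path):
    """Run the converter and return only the text content of its XML output.

    Assumes the prolog (SGML declaration plus DTD) was prepared beforehand.
    """
    xml_bytes = subprocess.check_output(build_osx_command(prolog_path, sgml_path))
    return xml_text(xml_bytes)

# Hypothetical usage, once prolog.sgml exists next to the downloaded filing:
# text = sgml_to_text('prolog.sgml', 'parseme.txt')
```

Note that subprocess.check_output raises CalledProcessError if the program exits with a non-zero status, so a filing that fails to parse will surface as an exception rather than as partial output.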

Author by

philq

Updated on July 26, 2022

Comments

philq almost 2 years
I would like to use Python 2.7 to remove everything that isn't document text from EDGAR filings (which are available online as .txt files). An example of what the files look like is here:

Example

EDGAR provides its Document Type Definitions starting on page 48 of this file:

DTD

The first part of my program gets the .txt file from the EDGAR online database into a local file that I've named "parseme.txt". What I would like to know is how to use the DTD to parse the .txt file. I would use a canned parsing module like BeautifulSoup for the job, but EDGAR's format appears unique, and I hope to avoid a large regex to get the job done.
```
filename = 'parseme.txt'  # local copy of the downloaded filing

# Read the whole filing into a list of lines
with open(filename) as f:
    lines = f.readlines()
```
My question is related to "Parse SGML with Open Arbitrary Tags in Python 3" and "Use lxml to parse text file with bad header in Python", but I believe it is distinct: it concerns Python 2.7, and I'm not concerned with the header, only with the text of the file.
mzjn over 11 years

I don't think the version of Python matters much here. Did you try any of the ideas that were provided in the answers to the linked questions? Where exactly are you stuck?
mzjn over 11 years

I posted a similar answer to one of the linked questions. But I haven't received any feedback.
arayq2 over 11 years

These PEM-encapsulated messages don't look like EDGAR filings. Rather they seem to be taken from the correspondence archive. The relevant DTD must be elsewhere.
chrislondon almost 11 years

An answer that is mostly a link is discouraged on SO for many reasons. Could you paraphrase the important aspects of the link to help other users?
prewett over 8 years

The link seems to require a password now
m3nda over 7 years

The link seems to return 404 not found now :-)