Portfolio Holdings Data: Filing Conversion and Document Database
- Claude Paugh
- Apr 9
- 3 min read
Updated: Apr 18
As I mentioned in the previous post, the next step after downloading the filings from the SEC is to convert them from XML to JSON. The previous script "acquirePortolfioFilings.py", and the script below uses the xmltodict package, and performs the conversion task on each downloaded file. It also stores the converted output in a target location.
Python conversion of XML to JSON
SQL for PostgreSQL download status
Data Returned by SQL
The converted files have been renamed to the use the CIK, filing date, and file name for each file. The files are then loaded into the warehouse. For this use case, I decided to use Couchbase Server because of its built-in SQL like support, analytics, and document search capabilities. I had used MongoDB previously, and I preferred the base features of Couchbase, plus the capability of horizontally scaling services, like deploying indexing, data and document search separately from one another over multiple hosts. I also wanted to try something new for document databases. You can find a white-paper from Couchbase here if you're interested in learning more about it's capabilities.
I created Couchbase buckets with various groupings to form subject areas, in a manner similar to data modeling subject areas. The below image contains the bucket breakdown of scope and collections for "us-fund-filings". Each loaded JSON file from the previous step has section(s) which map to each collection, with documents loaded for section as part of the load process.

Below is Fund GeneralInformation and SQL++ query to retrieve

Below is Fund Portfolio Collection with SQL++ to get details. The schema reflects the uploaded document JSON structure, with the fields in bold highlighted because they are indexed.
References to document elements are used similar to how tables and columns are referenced in SQL. Since JSON elements can have nested elements, you reference elements from the collection down, e.g. Investments.debtSec.couponKind, Investments.derivativeInfo.counterparties.counterpartyName

Couchbase provides a web ui after install to use for admin or user tasks. I use the query console for some tasks, which is built in the Couchbase web interface. Most of the time I use DataGrip from Jetbrains. It provides a rich and robust interface, with data extraction capabilities. Jetbrains also provides a "Couchbase Query" driver that provides access to buckets, scopes, and collections on the server. A couple of screenshots of DataGrip are below for those who are unfamiliar.
Couchbase Query
Couchbase Query Result
In the next post, I will elaborate on the document load process from JSON files to the Couchbase database.