ETF & Mutual Funds Portfolios: Infrastructure
- Claude Paugh
- Apr 19
- 12 min read
Updated: May 20
After I decided to load data from Couchbase into a graph database, I went through some short evaluations: first, which graph database to use, and second, what would be the most efficient method to synchronize the data. I ended up with Neo4j, despite trying Memgraph, TigerGraph, and others; my choice came down to Neo4j being the most mature and widely supported. This was four to five years ago, so the graph database market is obviously different than it is today. Neo4j was also stable on the limited infrastructure I was using. I did not try the managed graph offerings from AWS, Google, Microsoft, or Neo4j because of product maturity concerns and limitations on the trial resources; doing the simple math, I would have had to pay to host the proof of concept.
Hardware Config
My setup consisted of a refurbished PC with an 8-core processor, 128GB of RAM, and 2TB of storage for Couchbase, Docker-deployed Kafka, and Neo4j. Due to resource constraints, I later moved Neo4j to another PC with 8 cores and 16GB of RAM. The latter was also my "work" PC.
After deciding on Neo4j, I did some planning on what type of graph might be suitable for representing the holdings of investment portfolios. After researching it, I decided that a Knowledge Graph would be the best fit for the information I was presenting.
An essential concept in this knowledge graph would be to identify the types of investment portfolios that held various types of assets. I included ETFs, Mutual Funds, Family Offices, and Hedge Funds where I could find SEC filings.
I also decided to use different asset types to identify holdings, e.g. Common Stocks, Preferred Equity, Municipal Bonds, Corporate Bonds, Futures, Forwards, etc. The data I had been loading into the warehouse at that point was only portfolio filings from the SEC.
I went looking for additional sources for currencies, countries, regions of the world, and eventually Legal Entity Identifier (LEI) data, which tracks companies and their business relationships. This would add valuable content to the graph: the ability to associate asset types with their issuers, derivatives with the parties that participate in a transaction, municipality information for issued debt, and other drill-through data. I went through a brainstorming process, which I have represented below:

I created a new bucket and scope in Couchbase to store the asset types from the portfolio filings, and each type would become a collection. The results are below:

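For reference, here is a minimal sketch of how the scope and per-asset-type collections could be created with SQL++ DDL through the Couchbase Python SDK; the bucket, scope, and collection names are illustrative rather than the ones I actually used.

```python
# A minimal sketch, assuming Couchbase Server 7.x and the Couchbase Python SDK 4.x.
# The bucket, scope, and collection names below are illustrative, not the originals.
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

cluster = Cluster(
    "couchbase://localhost",
    ClusterOptions(PasswordAuthenticator("Administrator", "password")),
)

# One collection per asset type pulled out of the filings.
asset_collections = ["futures", "forwards", "municipal_bonds", "mortgage_backed_debt"]

ddl = ["CREATE SCOPE `fund-holdings`.`asset_types` IF NOT EXISTS"]
ddl += [
    f"CREATE COLLECTION `fund-holdings`.`asset_types`.`{name}` IF NOT EXISTS"
    for name in asset_collections
]

for statement in ddl:
    # Iterating the result forces each DDL statement to execute.
    list(cluster.query(statement).rows())
```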
It took some time analyzing and profiling the existing filings data I had loaded in order to come up with valid categorization logic. It ended up being based on the assetCat, issuerCat, and derivCat fields in the filing document. Some examples are below, followed by a sketch of how that categorization could be expressed:
- Futures
- Municipal Bonds
- Mortgage Backed Debt
- Forwards
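To make the categorization concrete, below is a minimal sketch of how documents could be routed into the per-asset-type collections based on those fields; the category codes, keyspace names, and routing table are illustrative assumptions, not the exact logic I derived from profiling.

```python
# A minimal sketch of the categorization step, assuming the raw filings sit in a
# hypothetical `fund-holdings`.`filings`.`holdings` keyspace and the per-asset-type
# collections from earlier exist. Category codes and the routing table are
# illustrative assumptions.
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions, QueryOptions

cluster = Cluster(
    "couchbase://localhost",
    ClusterOptions(PasswordAuthenticator("Administrator", "password")),
)

# field -> category code -> target collection (hypothetical values)
routing = {
    "derivCat": {"FUT": "futures", "FWD": "forwards"},
    "issuerCat": {"MUN": "municipal_bonds"},
    "assetCat": {"ABS-MBS": "mortgage_backed_debt"},
}

for field, codes in routing.items():
    for code, target in codes.items():
        statement = f"""
            INSERT INTO `fund-holdings`.`asset_types`.`{target}` (KEY k, VALUE v)
            SELECT META(f).id AS k, f AS v
            FROM `fund-holdings`.`filings`.`holdings` AS f
            WHERE f.{field} = $code
        """
        result = cluster.query(statement, QueryOptions(named_parameters={"code": code}))
        list(result.rows())  # force execution of the DML statement
```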
My intent was to separate by asset type, and I was able to accomplish that. The flow of the analysis looked like this:

The filings can come in late, so I wanted to make sure I captured changes. Initially, I decided to keep the integration simple by building SQL++ queries, running them in DataGrip, and extracting pipe-delimited results. I would then load and CREATE or MERGE the data into graph nodes that mirrored the asset types. Below are the nodes in the graph:


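To give a concrete picture of the load half of that process, here is a minimal sketch that reads a pipe-delimited export and MERGEs one node per row using the Neo4j Python driver; the file name, column layout, and the node label/key are assumptions for illustration.

```python
# A minimal sketch of the load half of extract -> load: read a pipe-delimited
# export (produced by a SQL++ query run in DataGrip) and MERGE one node per row
# with the Neo4j Python driver. File name, columns, and the Holding label/key
# are assumptions.
import csv
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

MERGE_HOLDING = """
MERGE (h:Holding {cusip: $cusip})
SET h.name = $name, h.balance = toFloat($balance)
"""

with driver.session() as session, open("futures_export.psv", newline="") as fh:
    reader = csv.DictReader(fh, delimiter="|")  # header row: cusip|name|balance
    for row in reader:
        session.run(MERGE_HOLDING, cusip=row["cusip"], name=row["name"],
                    balance=row["balance"])

driver.close()
```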
After getting off to a decent start with basic extract -> load processing, I thought about upgrading the infrastructure to a more hands-off pipeline, so I went looking for tools and extensions that could bridge the gaps more elegantly.
Apache Kafka was getting a lot of notice at that time for its scalability, so I thought it might be a good place to start. Couchbase also had a Kafka connector that could source collections, create topics in Kafka, and essentially publish document additions or changes to those topics. Kafka (through Confluent's REST Proxy) also exposes a REST API for subscribing to topics and consuming their messages. I created a Python client for the Kafka REST API to consume information from the topics and CREATE/MERGE the JSON from each topic into Neo4j. It ended up looking like this:

It worked reasonably well; most of my issues were operational ones with Docker, which is where I deployed Confluent's Apache Kafka cluster. My resources were challenged for memory and CPU, which was the root of most of the problems. I had also attempted to use the Neo4j connector for Kafka, but at the time (3-4 years ago) it was not ingesting the data correctly at all times, so I abandoned it.
Per the diagram above, I created Python scripts, one of which contains functions to make requests to topics using the REST API. I didn't optimize it, since I was adding one topic at a time. It would have been more efficient to construct the URLs generically, passing in the topic and parameters and receiving the final URL. The payload processing could be optimized as well, which would have reduced the code size significantly.
consumer_kafka_rest.py
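The original script is not reproduced here, but the sketch below shows roughly what a consumer like this could look like against Confluent's REST Proxy v2 endpoints; the proxy URL, consumer group, topic names, and payload handling are illustrative assumptions.

```python
# A hedged sketch of a consumer like this, assuming Confluent REST Proxy v2
# endpoints. The proxy URL, consumer group, topic names, and payload handling
# are illustrative assumptions, not the original script.
import requests

PROXY = "http://localhost:8082"
GROUP = "neo4j-loaders"
V2 = "application/vnd.kafka.v2+json"
V2_JSON = "application/vnd.kafka.json.v2+json"


def create_consumer(name: str) -> str:
    """Register a consumer instance in the group and return its base URI."""
    resp = requests.post(
        f"{PROXY}/consumers/{GROUP}",
        headers={"Content-Type": V2},
        json={"name": name, "format": "json", "auto.offset.reset": "earliest"},
    )
    resp.raise_for_status()
    return resp.json()["base_uri"]


def subscribe(base_uri: str, topics: list) -> None:
    """Subscribe the consumer instance to one or more topics."""
    resp = requests.post(
        f"{base_uri}/subscription",
        headers={"Content-Type": V2},
        json={"topics": topics},
    )
    resp.raise_for_status()


def poll(base_uri: str) -> list:
    """Fetch a batch of records; each record's 'value' is the Couchbase document."""
    resp = requests.get(f"{base_uri}/records", headers={"Accept": V2_JSON})
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    base = create_consumer("futures-consumer")
    subscribe(base, ["futures"])
    for record in poll(base):
        print(record["topic"], record["value"])  # hand the value off to the Neo4j loader
```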
This worked out well functionally, but the MERGE does slow down record processing on the Neo4j side (the putCE.py script). I had scheduled the REST "client" scripts to run every 5 minutes while loading data into Couchbase, and it went reasonably well, though due to my resource constraints it was slower than I had hoped. I was going to attempt streaming, but REST worked well enough that I decided not to go that route.
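For context on where that slowdown came from, here is a hedged sketch of the graph-side role a script like putCE.py would play; the node label, key, and UNWIND batching are my assumptions rather than the original code, with batching shown as one way to soften the per-record MERGE cost.

```python
# A hedged sketch of the graph-side role that putCE.py played: MERGE the JSON
# payloads pulled from a topic into Neo4j. The Holding label, the cusip key,
# and the UNWIND batching are assumptions, not the original implementation.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

MERGE_BATCH = """
UNWIND $rows AS row
MERGE (h:Holding {cusip: row.cusip})
SET h += row
"""


def merge_records(records: list, batch_size: int = 500) -> None:
    """MERGE consumed message payloads into the graph in batches."""
    with driver.session() as session:
        for start in range(0, len(records), batch_size):
            session.run(MERGE_BATCH, rows=records[start:start + batch_size])
```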
I eventually decided on a simple diff -> export -> import process using Python, since it was easy to maintain once built. I will cover that in a future post, as well as expand upon the knowledge graph development.