AI Investing Machine: Part I
- Claude Paugh

- 5 days ago
Several years ago I started with the idea of creating a knowledge graph around mutual funds, their holdings, and the legal entities backing those holdings. So I started to build a graph in Neo4j based on what I was storing for SEC filings in my Couchbase warehouse.
I had amassed several years of filings for Mutual Fund and ETF holdings, and their quality improved significantly starting in 2021. By around 2023, I had enough filings to build a knowledge graph, but my use case was not exactly knowledge-graph friendly.
Why? I had historical data that I wanted to maintain for ML cases in addition to my initial use case. I decided to go forward and store it all in Neo4j anyway, to follow KISS for my initial setup. Because Neo4j tries to keep much of its working data in memory, large node counts can be a challenge.

When I initially scoped loading the filings from the SEC, mostly N-PORT forms, I wanted to restructure them from XML to JSON to reduce memory and storage consumption in Couchbase. This meant taking an inventory of the keys I would use to search the JSON, which also flowed into the node structures in Neo4j. The filings have the dates the filing occurred, the backing CIK (Central Index Key from the SEC), and the typical security/instrument identifiers: ISIN, CUSIP, SEDOL, and internal identifiers from the trades. But the SEC added the most important new key, which was the LEI, or Legal Entity Identifier.
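A minimal sketch of that XML-to-JSON restructuring, using only the Python standard library. The element names, the sample record, and the `KEY_INVENTORY` tuple are hypothetical placeholders for illustration; real N-PORT XML uses namespaces and a much larger schema, so this shows the general shape of the step rather than the actual loader.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified holding record (placeholder values only).
# Real N-PORT XML is namespaced and has many more elements.
SAMPLE_XML = """
<holding>
  <name>Example Corp Bond</name>
  <lei>PLACEHOLDER</lei>
  <cusip>PLACEHOLDER</cusip>
  <isin>PLACEHOLDER</isin>
  <currency>EUR</currency>
  <valueUSD>1250000.00</valueUSD>
  <internalNote>not needed downstream</internalNote>
</holding>
"""

# The "inventory of keys": only these fields are carried into the JSON document.
KEY_INVENTORY = ("name", "lei", "cusip", "isin", "currency", "valueUSD")


def holding_to_json(xml_text: str) -> str:
    """Keep only inventoried keys and emit compact JSON."""
    root = ET.fromstring(xml_text.strip())
    doc = {}
    for key in KEY_INVENTORY:
        value = root.findtext(key)  # None when the element is absent
        if value is not None:
            doc[key] = value.strip()
    # Compact separators drop the whitespace that pretty-printed JSON would add.
    return json.dumps(doc, separators=(",", ":"))


compact = holding_to_json(SAMPLE_XML)
print(compact)
```

Keeping the key inventory in one place means the same list can drive both the JSON documents and the node properties later on.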
There was now a source to resolve relationships between different held instruments and the legal structures that were backing them. So I decided to ingest the LEI information from the only real public source: https://www.gleif.org/en.
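Since the LEI becomes the join key between filings and the GLEIF data, it can be worth sanity-checking identifiers before loading them. The sketch below assumes the standard LEI layout from ISO 17442 (20 alphanumeric characters, the last two being check digits computed with the ISO 7064 MOD 97-10 scheme). The function names and the sample prefix are hypothetical, and this is not necessarily a check the original pipeline performed.

```python
def _to_digits(text: str) -> str:
    # Digits stay as they are; letters map to A=10 ... Z=35.
    return "".join(str(int(ch, 36)) for ch in text)


def lei_check_digits(prefix18: str) -> str:
    """Compute the two check digits for an 18-character LEI prefix."""
    remainder = int(_to_digits(prefix18.upper() + "00")) % 97
    return f"{98 - remainder:02d}"


def is_valid_lei(lei: str) -> bool:
    """Return True when the string has LEI shape and a matching checksum."""
    lei = lei.strip().upper()
    if len(lei) != 20 or not lei.isascii() or not lei.isalnum():
        return False
    if not lei[18:].isdigit():
        return False
    # A well-formed LEI leaves a remainder of 1 when reduced modulo 97.
    return int(_to_digits(lei)) % 97 == 1


# Hypothetical 18-character prefix, used only to exercise the functions.
prefix = "HYPOTHETICALLEI001"
candidate = prefix + lei_check_digits(prefix)
print(candidate, is_valid_lei(candidate))
```

A failed check usually points to a transcription or extraction problem rather than a missing entity, so such records are better quarantined than silently dropped.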
Images displaying the nodes, relationships, and attributes of the knowledge graph
The knowledge graph was constructed with the following node types and relationships to answer specific questions. The lists below are not complete; they are meant to give an idea of scope.
Nodes
Common Equity | Floating Rate Debt | Futures | Funds |
Preferred Equity | Collateralized Debt | SWAPs | Fund Parent Company |
Corporate Debt | Asset Backed Debt | SWAPTIONS | Fund Borrowers |
Mortgage Backed Debt | Warrants | Country | Legal Entity |
Municipal Debt | Options | Currency | Legal Entity LEI ISIN Reference |
Convertible Debt | Forwards | Region | Legal Entity Relationships |
Relationships
Asset Type -> Fund Holdings | Asset Type -> Legal Entity | Asset Type -> Legal Jurisdiction | Fund Series -> Fund Series Classes |
Asset Type -> Country | Fund -> Legal Entity | Asset Type -> Headquarters Country | Fund -> Fund Flows |
Asset Type -> Currency | Fund Parent -> Legal Entity | Asset Type -> Location Country | Fund -> Fund Performance |
Country -> Region | Fund -> SubFunds | Asset Type -> Index Member | Common Equity -> Company Profile |
Other explicitly defined relationships resolve the many-to-many links between Legal Entities.
The initial use case objectives for combining this data were:
- risk measurement of currency exposures
- position risk to institutions and markets
- legal entity tracing for multi-national corporations
- tracing the impact chain for asset purchases
To demonstrate a portion of this, I started by expanding a fund in Neo4j Bloom. The fund has Loans and SWAPtions as asset holdings:


Next, the Legal Entity (LE) of the underlying SWAP asset wrapped by the SWAPtion, which was a Counter Party for the SWAP, traced to a company called Citigroup Global Markets:

That company was itself under another Citigroup LE, called Citigroup Financial Products:

Expanding the Citigroup company above revealed it as the Legal Entity and Not Centrally Cleared (NCC) Counter Party for several Repurchase agreements, and as the Legal Entity for Floating Rate Debt:

Looking into the Repurchase agreements a little further, they eventually link to the asset holders, which in this specific case are two mutual funds (the light green objects with bank icons). We also see the Repurchase agreement currency/collateral currency, which is the Euro.
This means that the Citigroup-linked company has exposure to the Euro, and could be impacted by large changes in exchange rates on these assets. It's the same situation with the two mutual funds regarding Euro currency exposure. Many mutual funds and ETFs offset their currency risks by purchasing forward currency contracts. Had I expanded the funds in this case, and if they were based in the United States, we would probably see those forwards as assets held by the fund(s).
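To make the exposure reasoning concrete, here is a toy sketch that nets non-base-currency positions per holder. The fund names, asset rows, and amounts are invented for illustration and do not come from any filing; the negative forward row stands in for a currency hedge of the kind described above.

```python
from collections import defaultdict

# Hypothetical rows: (holder, asset type, currency, value in base-currency terms).
HOLDINGS = [
    ("Fund A", "Repurchase Agreement", "EUR", 5_000_000),
    ("Fund A", "Common Equity", "USD", 12_000_000),
    ("Fund A", "Forward", "EUR", -4_000_000),  # short EUR forward acting as a hedge
    ("Fund B", "Repurchase Agreement", "EUR", 3_000_000),
]


def net_exposure(holdings, base_currency="USD"):
    """Sum positions per (holder, currency), ignoring the base currency."""
    totals = defaultdict(float)
    for holder, _asset_type, currency, value in holdings:
        if currency != base_currency:
            totals[(holder, currency)] += value
    return dict(totals)


for (holder, currency), amount in sorted(net_exposure(HOLDINGS).items()):
    print(f"{holder}: net {currency} exposure {amount:,.0f}")
```

In this made-up data the hedged fund ends up with a much smaller net Euro position than the unhedged one, which is the pattern the forwards would reveal in the graph.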

Following another path from the Citigroup Financial Products Legal Entity leads us to additional firms operating in other countries, within another holding company. The first orange circle is a fund company with various other funds connected to it (the four orange circles on the right).

Finally, we trace the Citigroup Financial Products LE, which is responsible for the above assets and funds, back to the Citigroup Inc. LE at the top of the hierarchy.
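The walk up the hierarchy can be sketched as a simple parent lookup. The mapping below uses only the entity names from the walkthrough and treats each link as a direct parent, which is a simplification; in the graph itself this would be a traversal over LEI relationship nodes, and the `PARENT_OF` dictionary and `trace_to_top` function are hypothetical names.

```python
# Child -> parent links, using the entities named in the walkthrough.
# Treating each link as a direct parent is a simplifying assumption.
PARENT_OF = {
    "Citigroup Global Markets": "Citigroup Financial Products",
    "Citigroup Financial Products": "Citigroup Inc.",
}


def trace_to_top(entity, parent_of):
    """Follow parent links until an entity with no recorded parent is reached."""
    path = [entity]
    seen = {entity}
    while path[-1] in parent_of:
        parent = parent_of[path[-1]]
        if parent in seen:
            # Guard against circular ownership data so the loop always ends.
            raise ValueError(f"cycle detected at {parent!r}")
        path.append(parent)
        seen.add(parent)
    return path


print(" -> ".join(trace_to_top("Citigroup Global Markets", PARENT_OF)))
```

The cycle guard matters in practice because relationship data can contain circular or mutually referencing records.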

As you can see, the ability to map Legal Entity Identifiers (LEI) between different organizations and their underlying structures can significantly improve our ability to navigate, assess, trace, and understand risks that are not easily discovered. I chose this as the foundation for building my AI application, since it offers powerful capabilities to quantify impacts across organizations and financial systems. Next, I will dive into the architecture of the application and broaden its scope to reflect more of the features I was looking for.












