top of page

Portfolio Holdings Data: Introduction

Updated: Jun 24

Several years ago, I started a side project that I thought would be fun: collecting and loading SEC filings for portfolio holdings of ETF and Mutual Fund Holdings on a monthly basis. I wanted to essentially automate the collection of the SEC filings by using the index files that are provided and updated by companies as they make their filings.

Directory listings for SEC filings
Directory listings for SEC filings

As you drill into each folder with a year label, you can then drill by quarter. Under the quarter, you will find the master index files.

Master index files for 2025 QTR1
Master index files for 2025 QTR1






Each master index file contains an inventory of all filings for the period, including the "form type," which tells you what the content of the filing contains. For example, the "NPORT-P" form contains the holdings filing of the ETF/Mutual Fund, which is what I started with. The SEC filings follow a filesystem storage pattern, so once you decipher that, it's easily scriptable.


I started by creating two tables: one for the filing reference data (metadata from filing) and a second table for the status of the file download.


Tables in PostgreSQL for metadata tracking of SEC filings
Tables in PostgreSQL for metadata tracking of SEC filings

The downloads occur in multiple steps:

  1. Download the master.idx file from the desired year/quarter. I used to have this automated, but changes to the SEC website blocked it.
  2. Run the "assembleSECFundFilings.py" script, which loops through the master.idx file and creates new records if they do not already exist in the table "filing_ref" on the diagram above. The script is below:
import pandas as pd
from datetime import datetime
from ref_data.connect_util import getconn
from sqlalchemy import text

year_target = "2024"
quarter_target = "QTR3"
# read the dataset
df = pd.read_csv(f"/tmp/{year_target}/{quarter_target}/master.idx",
                 delimiter="|",
                 skiprows=1,
                 header=4,
                 low_memory="false")

df.CIK = df.CIK.astype(str)
df["Date Filed"] = df["Date Filed"].apply(pd.to_datetime)
df["Company Name"] = df["Company Name"].astype(str)
df["Filename"] = df["Filename"].astype(str)

# get DB connection
conn, conn_r = getconn()

# Filter each frame for portfolio filings
form_list: list = ["NPORT-P", "10-Q", "10-K", "8-A12B"]
df2 = df[df["Form Type"].isin(form_list)]
print(df2.count())

try:
    for row in df2.itertuples():
        CIK = row[1]
        company = str(row[2]).replace("'", " ")
        form_type = row[3]
        dt_filed = row[4]
        filename = row[5]

        sql_count_str = (
            f"SELECT COUNT(cik) FROM inbd_raw.filing_ref WHERE cik = '{CIK}' "
            f"and date_filed = to_date('{dt_filed}', 'YYYY-MM-DD') "
            f"and form_type = '{form_type}' "
            f"and file_name = '{filename}'"
        )

        rec_count = conn.execute(text(sql_count_str))
        row_result = rec_count.fetchall()

        for rec in row_result:
            count = rec.count

            if count == 0:
                sql_stmt_str = (
                    f"INSERT INTO inbd_raw.filing_ref(cik,company_name,form_type,date_filed,file_name)"
                    f"values ('{CIK}', '{company}', '{form_type}', to_date('{dt_filed}', 'YYYY-MM-DD'), '{filename}')"
                )

                # print(sql_stmt_str)
                print(
                    f"Adding record for {CIK} and company {company} at: "
                    + str(datetime.now())
                )
                cur = conn_r.cursor()
                cur.execute(sql_stmt_str)
            else:
                print(f"{CIK} Record for {company} already exists, skipping...")

except Exception as e:
    print("Exeception occurred...." + str(e))
    exit(1)
finally:
    if conn:
        conn_r.commit()
        conn_r.cursor().close()
        conn_r.close()
        print("PostgreSQL connection is closed")
    else:
        exit(0)
    
3. The file_ref table is populated with data
Table data from filing_ref
Table data from filing_ref

Since the table data uses surrogate keys, i.e., "dumb keys," as the creator of primary keys, I now add all the values for "filing_ref_sid" from the "filing_ref" table into the "filing_ref_status" table via SQL:

insert into filing_ref_status(filing_ref_sid)
select distinct filing_ref.filing_ref_sid from filing_ref
      where filing_ref_sid not in
       (select distinct filing_ref_sid from filing_ref_status);

There are multiple columns defaulting on "filing_ref_status", so I only need the primary key from "filing_ref".

  1. Metadata and tracking data are ready, so I run the "acquirePortfolioFiling.py" script to download the data. Recently, I downloaded 13.5k of files over two+ days without interruption. The script is below:

acquirePortfolioFiling.py

import os
import requests
from ref_data.connect_util 
import getconn
from time import sleep
# Constants
BASE_URL = "https://www.sec.gov/Archives/"USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9 (Company: Perardua Consulting) (email: claude.paugh@<perarduaconsulting>.com>")
ACCEPT_ENCODING = "gzip, deflate, zstd"
SLEEP_INTERVAL = 12
FORM_TYPE = "NPORT-P"
DIRECTORY = "/Volumes/ExtShield/SecFundFilings/"

# Get DB connection
conn, conn_r = getconn()

def get_query_results():    
    sql_query_select = (f"select fr.filing_ref_sid, fr.cik, fr.company_name, to_char(fr.date_filed, 'YYYY-MM-DD'), fr.file_name  from inbd_raw.filing_ref fr inner join inbd_raw.filing_ref_status frs on frs.filing_ref_sid = fr.filing_ref_sid where (frs.response_received_ind = false or frs.response_received_ind is null) and fr.form_type = '{FORM_TYPE}'")    
    cur = conn_r.cursor()    
    cur.execute(sql_query_select)    
    return cur.fetchall()

def download_and_write_file(query_result, headers):    
    try:        
      ref_sid, cik, _, filed_date, source_filing = query_result        
      file_dir = source_filing.replace(".txt", "").translate({ord(i): None for i in "-"})
      url = f"{BASE_URL}{file_dir}/primary_doc.xml"        
      sleep(SLEEP_INTERVAL)        
      response = requests.get(url.strip(), headers=headers, timeout=15)        
      if not os.path.exists(f"{DIRECTORY}{filed_date}/{cik}/{file_dir}"):            
         os.makedirs(f"{DIRECTORY}{filed_date}/{cik}/{file_dir}")        
      with open( f"{DIRECTORY}{filed_date}/{cik}/{file_dir}/primary_doc.xml", "wb") as output_file:            
         output_file.write(response.content)            
         print("Downloading Completed for: " + url)  
         update_database(ref_sid) 
    except Exception as e:        
        print(e) 

def update_database(ref_sid):    
    sql_update = (f"update inbd_raw.filing_ref_status set request_status_ind = true, response_received_ind = true, record_ts = current_timestamp, data_persisted_ts = current_timestamp  where filing_ref_sid = {ref_sid}")    
    cur = conn_r.cursor()    
    upd_stat = cur.execute(sql_update)    
    conn_r.commit()    
    return upd_stat
    
headers = {"User-Agent": USER_AGENT, "Accept-Encoding": ACCEPT_ENCODING, "Host": "www.sec.gov"}

if __name__ == "__main__":    
  try:        
     for row in get_query_results():
        download_and_write_file(row, headers)    
   except Exception as e:        
     print(e)

The downloads look like this:

Running Job Output
Running Job Output

Files stored in a path that follows the SEC structure:

Downloaded SEC Filings
Downloaded SEC Filings

Each filing is in XML format as you can see above.


  1. I created a script labelled, "convertFilingsXMLtoJSON.py" which loops through all the directories looking for the XML files, and converts them to JSON format.

import datetime
import os
import shutil
import time
from DBSchema.convertXML import convertXML2JSON
from ref_data.connect_util import getconn

# Constants
FORM_TYPE = "NPORT-P"
DIRECTORY = "/Volumes/ExtShield/SecFundFilings/"

# Get DB connection
conn, conn_r = getconn()

def get_query_results(f_name: str):
    sql_query_select = (
        f"select fr.filing_ref_sid "
        f"from inbd_raw.filing_ref fr "
        f"where fr.form_type = '{FORM_TYPE}' and fr.file_name = '{f_name}'"
    )
    cur = conn_r.cursor()
    cur.execute(sql_query_select)
    return cur.fetchone()

def update_database_status(ref_sid: int):
    sql_update = (
        f"update inbd_raw.filing_ref_status "
        f"set json_converted_ind = true, record_ts = current_timestamp, "
        f"json_converted_ts = current_timestamp "
        f"where filing_ref_sid = {ref_sid}"
    )
    cur = conn_r.cursor()
    upd_stat = cur.execute(sql_update)
    conn_r.commit()
    return upd_stat

def getFundFileList(target_dir=None, target_size=None):
    if os.path.exists(target_dir):
        target_files: list = []
        for root, dirs, files in os.walk(target_dir):
            for file in files:
                # Set utime to current time
                file_path = root + "/" + file
                file_path = str(file_path).replace("\\", "/")
                target_files.append(file_path)
                if len(target_files) == target_size:
                    return target_files
    else:
        print("Path does not exists: " + target_dir)
        exit(1)
+
if __name__ == "__main__":
    t_size = 5000
    files = getFundFileList(target_dir="/Volumes/ExtShield/SecFundFilings/raw/", target_size=t_size)
    l_count: int = 0

    for file in files:
        split_file = file.split("/")
        file_end = str(split_file[-2])[:10]
        file_middle = str(split_file[-2])[10:12]
        file_start = str(split_file[-2])[12:]
        file_parts = file_end + '-' + file_middle + '-' + file_start + '.txt'
        file_name = split_file[-5] + "/" + split_file[-4] + "/" + split_file[-3] + "/" + file_parts
        filing_sid = str(get_query_results(file_name)).replace('(', '').replace(')', '').replace(',', '').replace('--.txt', '.txt')
        print(filing_sid)
        time.sleep(0.1)
        l_count = l_count + 1
        sid = get_query_results(file_name)
        print("File to be processed: " + file)
        print("Processing file " + str(l_count) + " of " + str(t_size))

        base_file = file.replace("/Volumes/ExtShield/SecFundFilings/raw/", "")
        complete_file = "/Volumes/ExtShield/SetupFilings/setup_complete/" + base_file
        target_path = complete_file.replace("primary_doc.xml", "")

        if not os.path.exists(target_path):
            os.makedirs(target_path)
        shutil.copy(file, complete_file)

        processing_dt = datetime.datetime.today()
        status = convertXML2JSON(inputXMLFile=file, processing_dt=processing_dt)
        print(str(status))

        if status == 0:
            print("Continue to next file...")
            update_database_status(filing_sid)
            os.remove(file)
            continue
        else:
            print("Fund did not complete successfully: " + str(status))
            exit(1)

I will end this initial post here and pick up after the JSON conversion of the files. As you can see in the script, the "target_size" of the conversion can be anything, and the original source files are moved to another location during processing.


So you can run the script multiple times with different target sizes, or rerun if there is a failure, and it picks up where it left off. The conversion of the files is updated in the "filing_ref_status" table, so it only works on files it can find and match with an entry in the metadata.



bottom of page