

ORC vs Parquet: Which File Format Flexes Harder in the Data Storage Showdown?
In the world of big data, choosing the right file format can significantly impact your project's success. Performance, storage efficiency, and usability are all key factors influenced by that choice. Two leading contenders in this arena are Apache ORC (Optimized Row Columnar) and Apache Parquet. This post explores both formats in detail, focusing on their structure, performance, and practical applications to help you decide which best suits your needs.
Claude Paugh
1 day ago · 4 min read
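As a quick taste of the comparison, here is a minimal PyArrow sketch that writes the same table in both formats (my illustration, not code from the post; it assumes a pyarrow build with ORC support, and the file and column names are made up):

```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.orc as orc

# The same in-memory table, persisted in both columnar formats.
table = pa.table({"id": [1, 2, 3], "score": [0.9, 0.7, 0.8]})

pq.write_table(table, "example.parquet")  # Parquet: Snappy compression by default
orc.write_table(table, "example.orc")     # ORC: stripes with lightweight built-in indexes

# Both round-trip back to the same data.
print(pq.read_table("example.parquet").to_pydict())
print(orc.read_table("example.orc").to_pydict())
```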


Data Lake and Lakehouse: Comparison of Apache Kylin and Trino for Business Intelligence Analytics
In today's dynamic business landscape, having the right tools for data analysis can make all the difference. With the vast amount of data available, businesses need efficient ways to process and analyze it for better decision-making. Two powerful platforms that stand out in this area are Apache Kylin and Trino (formerly PrestoSQL). While both serve important functions in analytics, understanding how they differ is key for data professionals looking to leverage these technologies.
Claude Paugh
2 days ago · 6 min read


Comparing Apache Hive, AWS Glue, and Google Data Catalog
Navigating the landscape of data processing and management tools can be a daunting task for software engineers. With so many options available, it is crucial to identify which solution aligns best with your specific workflow needs. In this post, we will compare three popular tools: Apache Hive, AWS Glue, and Google Data Catalog.
Claude Paugh
Jul 8 · 6 min read


Data Lake or Lakehouse: Distinctions in Modern Data Architecture
In today's data-driven world, organizations face challenges related to the sheer volume and complexity of data. Two major frameworks, data lakes and lakehouses, have emerged to help businesses manage and harness their data effectively. This post provides a clear comparison of both concepts, highlighting their unique features and practical applications within modern data architecture.
Claude Paugh
May 18 · 6 min read


Apache Iceberg and Pandas Analytics: Part I
I generally like to try new things, and technology is no different. So, I decided to do some more in-depth research on the mechanics under the covers of Apache Iceberg. I was specifically looking at some key items that are usually part of data management practices, regardless of the technology.
Claude Paugh
May 7 · 6 min read


Data Vault Modeling Design Uses
Data Vault is really a design paradigm, not a technology; it can be used on any relational database, or a data lake for that matter. It came about from a desire to move away from the star, star-cluster, constellation, and snowflake (not the DB company) schema designs frequently used in data marts and data warehouses.
Claude Paugh
May 2 · 9 min read


How I Optimized Apache Spark Jobs to Prevent Excessive Shuffling
When working with Apache Spark, I often found myself facing a common yet challenging performance issue: excessive shuffling. Shuffling can drastically slow down your application.
Claude Paugh
Apr 24 · 3 min read
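One classic shuffle-avoidance tactic in the territory this post covers is broadcasting a small table so a join happens map-side. The sketch below is my own illustration, not code from the article; the table and column names are made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

# A large fact table and a small dimension table (synthetic data).
facts = spark.range(1_000_000).withColumnRenamed("id", "user_id")
dims = spark.createDataFrame(
    [(i, f"segment-{i % 3}") for i in range(100)], ["user_id", "segment"]
)

# broadcast() ships the small table to every executor, so the large
# table is joined in place instead of being shuffled across the cluster.
joined = facts.join(broadcast(dims), "user_id")
joined.explain()  # the plan should show BroadcastHashJoin rather than SortMergeJoin
```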


How I Optimize Data Access for Apache Spark RDD
Optimizing data access in Apache Spark's Resilient Distributed Datasets (RDDs) can significantly boost the performance of big data applications. Using effective strategies can lead to faster processing times and improved resource utilization.
Claude Paugh
Apr 24 · 3 min read
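As one example of the kind of strategy involved, here is a hedged sketch of caching an RDD that is read more than once (my illustration; the input path and parsing are hypothetical):

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="rdd-access-demo")

# Hypothetical comma-separated log file, split into 8 partitions up front.
rdd = sc.textFile("events.log", minPartitions=8)
parsed = rdd.map(lambda line: line.split(",")).persist(StorageLevel.MEMORY_AND_DISK)

# Both actions below reuse the cached partitions instead of re-reading the file.
total = parsed.count()
distinct_keys = parsed.map(lambda fields: fields[0]).distinct().count()

parsed.unpersist()
```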


Understanding HDF5: The Versatile Data Format Explained with Examples
HDF5, or Hierarchical Data Format version 5, is an open-source file format that enables efficient storage and management of large data sets. Built by the HDF Group, it is extensively used across various fields such as science, engineering, and data analysis.
Claude Paugh
Apr 22 · 4 min read
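For a flavor of the examples, here is a small h5py sketch of HDF5's hierarchical layout (my illustration; the group and dataset names are made up):

```python
import h5py
import numpy as np

with h5py.File("experiment.h5", "w") as f:
    grp = f.create_group("sensors")              # groups act like folders
    dset = grp.create_dataset(
        "temperature", data=np.random.rand(1000),
        compression="gzip",                      # chunked, compressed storage
    )
    dset.attrs["units"] = "celsius"              # metadata travels with the data

with h5py.File("experiment.h5", "r") as f:
    temps = f["sensors/temperature"][:100]       # read a slice, not the whole dataset
    print(f["sensors/temperature"].attrs["units"], temps.mean())
```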


ETF & Mutual Funds Portfolios: Infrastructure
I ended up with Neo4j after trying Memgraph, TigerGraph, and others; it seemed to be the most mature and widely supported option. It was also stable on the limited infrastructure I was using.
Claude Paugh
Apr 19 · 12 min read


Apache Spark Best Practices: Optimize Your Data Processing
Apache Spark is a powerful open-source distributed computing system that excels in big data processing. It is lauded for its speed and ease of use, making it a favorite among software engineers and data scientists.
Claude Paugh
Apr 16 · 4 min read


Gathering Data Statistics Using PySpark: A Comparative Analysis with Scala
Data processing and statistics gathering are essential tasks in today's data-driven world. Engineers frequently find themselves choosing between tools like PySpark and Scala when embarking on these tasks.
Claude Paugh
Apr 15 · 5 min read
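As a minimal illustration of statistics gathering on the PySpark side (my sketch, not from the post; the column names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stats-demo").getOrCreate()

df = spark.createDataFrame([(1, 10.5), (2, 23.1), (3, 17.8)], ["id", "price"])

df.describe("price").show()             # count, mean, stddev, min, max
df.summary("25%", "50%", "75%").show()  # approximate quantiles as well
```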


Portfolio Holdings Data: Analytics Content Retrieval
The analytics console looks very much like the query console, except for the panels on the right, where you can map data structures from local or remote Couchbase collections as sources. The analytics service makes a copy of the original data and provides the ability to index it separately from the original source.
Claude Paugh
Apr 15 · 2 min read


Harnessing the Dask Python Library for Parallel Computing
Dask is a flexible library for parallel computing in Python. It is designed to scale from a single machine to a cluster of machines seamlessly. By using Dask, you can manage and manipulate large datasets that are too big to fit into memory on a single machine.
Claude Paugh
Apr 15 · 5 min read
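A small sketch of that larger-than-memory model (my illustration; the file pattern and column names are made up):

```python
import dask.dataframe as dd

# Each matching CSV becomes one or more partitions; nothing is loaded yet.
ddf = dd.read_csv("logs/2024-*.csv")

# Builds a lazy task graph over all partitions, still without reading data.
daily_mean = ddf.groupby("day")["latency_ms"].mean()

print(daily_mean.compute())  # executes the graph in parallel, partition by partition
```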


Benefits of Data Architecture and Its Impact on Company Costs
Data architecture refers to the design and organization of data structures and systems within an organization. It defines how data is collected, stored, and used, serving as a blueprint for managing data assets.
Claude Paugh
Apr 15 · 5 min read