

ORC vs Parquet: Which File Format Flexes Harder in the Data Storage Showdown?
In the world of big data, choosing the right file format can significantly impact your project's success. Performance, storage efficiency, and usability are all key factors influenced by your choice. Two leading contenders in this arena are Apache ORC (Optimized Row Columnar) and Apache Parquet. This post explores both formats in detail, focusing on their structure, performance, and practical applications to help you decide which suits your needs best.
Claude Paugh
1 day ago · 4 min read
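For a hands-on feel for the comparison, here is a minimal sketch using PyArrow (assuming a build with ORC support, which the standard wheels include); the file names and toy table are arbitrary:

```python
# Write the same Arrow table to both formats, then read both back.
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.orc as orc

table = pa.table({"id": [1, 2, 3], "value": [10.5, 20.1, 30.7]})

pq.write_table(table, "data.parquet")   # Parquet writer
orc.write_table(table, "data.orc")      # ORC writer

# Both files round-trip to the same data
print(pq.read_table("data.parquet").equals(orc.ORCFile("data.orc").read()))
```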


Comparing Apache Parquet, ORC, and JSON File Formats for Your Data Processing
In today's data-rich environment, selecting the right file format can make a world of difference. Whether you're handling big data projects, engaging in machine learning, or performing simple data tasks, knowing what each file format offers is key. In this blog post, we will explore the unique features, advantages, and limitations of three widely used formats: Apache Parquet, Apache ORC, and JSON.
Claude Paugh
Jul 8 · 4 min read


Apache Iceberg, Hadoop, & Hive: Open Your Datalake (Lakehouse), Part II
In this article I demonstrate user access to Hive metadata and the mechanisms used for creating result sets, showing how you can open up datalake or lakehouse data for your users.
Claude Paugh
Jun 24 · 7 min read
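As a preview, here is a minimal sketch of the kind of metadata access the article walks through, assuming a PySpark installation wired to a Hive metastore; the table name is hypothetical:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() attaches the session to the Hive metastore
spark = (
    SparkSession.builder
    .appName("hive-metadata-demo")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()          # databases known to the metastore
spark.sql("SHOW TABLES IN default").show()  # tables registered in a database

# Table-level metadata: location, input format, column types, statistics
spark.sql("DESCRIBE FORMATTED default.some_table").show(truncate=False)
```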


Apache Iceberg and Pandas Analytics: Part I
I generally like to try new things, and technology is no different. So, I decided to do some more in-depth research on the mechanics under the covers of Apache Iceberg.
I was specifically looking at some key items that are usually part of data management practices, regardless of the technology.
Claude Paugh
May 7 · 6 min read


Data Vault Modeling Design Uses
Data Vault is really a design paradigm, not a technology; it can be used on any relational database, or a datalake for that matter. It came about from a desire to get away from the star, star-cluster, constellation, and snowflake (not the DB company) schema designs frequently used in Data Marts and Data Warehouses.
Claude Paugh
May 2 · 9 min read
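To make the paradigm concrete, here is an illustrative sketch of its two core table types in SQLite via Python; the entity and column names are hypothetical, chosen only to show the pattern of hubs (business keys) and satellites (versioned descriptive attributes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,    -- hash of the business key
    customer_id   TEXT NOT NULL,       -- the business key itself
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE sat_customer_details (
    customer_hk   TEXT NOT NULL REFERENCES hub_customer (customer_hk),
    load_date     TEXT NOT NULL,
    name          TEXT,
    email         TEXT,
    PRIMARY KEY (customer_hk, load_date)  -- history is kept per load date
);
""")
```

A link table relating hub keys to one another would follow the same pattern.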


How to Leverage Python Dask for Scalable Data Processing and Analysis
In today’s data-driven world, processing and analyzing large datasets efficiently can be a major challenge for software engineers and data scientists. Traditional data processing libraries like Pandas, while user-friendly, may struggle with the vast volumes of data that many organizations face.
Claude Paugh
Apr 25 · 7 min read
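A minimal sketch of the Pandas-style API Dask layers over parallel execution; the CSV path is hypothetical, and nothing runs until .compute() is called:

```python
import dask.dataframe as dd

# Each matching CSV becomes one or more partitions processed in parallel
df = dd.read_csv("events-*.csv")

# Familiar Pandas-style operations only build a task graph
daily = df.groupby("date")["amount"].sum()

# compute() hands the graph to the scheduler and returns a Pandas result
print(daily.compute())
```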


Mastering Aggregations with Apache Spark DataFrames and Spark SQL in Scala, Python, and SQL
If you want to harness the power of big data, Apache Spark is your go-to framework. It offers robust APIs and a rich ecosystem, perfect for processing large datasets. In particular, Spark's ability to conduct aggregations using DataFrames and Spark SQL makes it an invaluable tool.
Claude Paugh
Apr 24 · 4 min read
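As a taste of what the post covers, here is a minimal sketch on a toy DataFrame, with the same aggregation expressed through the DataFrame API and through Spark SQL:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("agg-demo").getOrCreate()
sales = spark.createDataFrame(
    [("east", 100.0), ("east", 250.0), ("west", 80.0)],
    ["region", "amount"],
)

# DataFrame API
sales.groupBy("region").agg(
    F.sum("amount").alias("total"),
    F.avg("amount").alias("average"),
).show()

# The equivalent Spark SQL over a temporary view
sales.createOrReplaceTempView("sales")
spark.sql(
    "SELECT region, SUM(amount) AS total, AVG(amount) AS average "
    "FROM sales GROUP BY region"
).show()
```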


How I Optimized Apache Spark Jobs to Prevent Excessive Shuffling
When working with Apache Spark, I often found myself facing a common yet challenging performance issue: excessive shuffling. Shuffling can drastically slow down your application.
Claude Paugh
Apr 24 · 3 min read
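One common way to curb shuffling, sketched here on toy data: broadcasting a small dimension table so a join avoids moving the large side across the cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
facts = spark.range(1_000_000).withColumnRenamed("id", "country_id")
dims = spark.createDataFrame([(0, "US"), (1, "FR")], ["country_id", "name"])

# broadcast() hints Spark to ship the small table to every executor,
# turning a shuffle-heavy sort-merge join into a map-side hash join
joined = facts.join(broadcast(dims), "country_id")
joined.explain()  # the plan should show BroadcastHashJoin
```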


How I Optimize Data Access for Apache Spark RDD
Optimizing data access in Apache Spark's Resilient Distributed Datasets (RDDs) can significantly boost the performance of big data applications. Using effective strategies can lead to faster processing times and improved resource utilization.
Claude Paugh
Apr 24 · 3 min read
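One such strategy in a minimal sketch: persisting an RDD that is reused, so it is computed once rather than on every action (the log path is hypothetical):

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="rdd-cache-demo")
errors = sc.textFile("events.log").filter(lambda line: "ERROR" in line)

# Keep partitions in memory, spilling to disk if they do not fit
errors.persist(StorageLevel.MEMORY_AND_DISK)

print(errors.count())   # first action materializes and caches the RDD
print(errors.take(5))   # later actions reuse the cached partitions
```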


Understanding HDF5: The Versatile Data Format Explained with Examples
HDF5, or Hierarchical Data Format version 5, is an open-source file format that enables efficient storage and management of large data sets. Built by the HDF Group, it is extensively used across various fields such as science, engineering, and data analysis.
Claude Paugh
Apr 22 · 4 min read
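A minimal sketch with h5py, the common Python binding; the group, dataset, and attribute names are arbitrary:

```python
import h5py
import numpy as np

# Groups give the file its hierarchy; datasets hold the arrays
with h5py.File("experiment.h5", "w") as f:
    run = f.create_group("run_001")
    temps = run.create_dataset(
        "temperatures", data=np.random.rand(1000), compression="gzip"
    )
    temps.attrs["units"] = "celsius"   # metadata travels with the data

# Path-style access on read
with h5py.File("experiment.h5", "r") as f:
    print(f["run_001/temperatures"][:10])
```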


Exploring Apache Iceberg and HDF5 Use Cases in Modern Data Management
Choosing between HDF5 and Apache Iceberg can feel overwhelming due to their distinct features and advantages. Armed with the right knowledge, you are better equipped to make a decision tailored to your data science needs.
Claude Paugh
Apr 22 · 4 min read


Unlocking the Potential of Apache Iceberg in Cloud-Based Data Engineering Strategies
In today's fast-paced digital world, data is a powerful asset for organizations. With the increasing volume of data, companies need innovative solutions to handle this wealth of information efficiently. One such breakthrough technology is Apache Iceberg.
Claude Paugh
Apr 22 · 4 min read


Harnessing the Power of Dask for Scalable Data Science Workflows
In our data-driven world, organizations face a significant challenge: processing and analyzing vast amounts of data efficiently. As data volumes increase (projected to reach 175 zettabytes by 2025), that challenge only grows.
Claude Paugh
Apr 22 · 5 min read


ETF & Mutual Fund Portfolios: Infrastructure
I ended up with Neo4j despite trying Memgraph, TigerGraph, and others; it seemed the most mature and widely supported, and it was stable on the limited infrastructure I was using.
Claude Paugh
Apr 19 · 12 min read


Apache Spark Best Practices: Optimize Your Data Processing
Apache Spark is a powerful open-source distributed computing system that excels in big data processing. It is lauded for its speed and ease of use, making it a favorite among software engineers and data scientists.
Claude Paugh
Apr 16 · 4 min read


Gathering Data Statistics Using PySpark: A Comparative Analysis with Scala
Data processing and statistics gathering are essential tasks in today's data-driven world. Engineers frequently find themselves choosing between tools like PySpark and Scala when embarking on these tasks.
Claude Paugh
Apr 15 · 5 min read
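A minimal sketch of the statistics PySpark offers out of the box, on a toy DataFrame:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stats-demo").getOrCreate()
df = spark.createDataFrame([(1, 2.0), (2, 3.5), (3, 7.25)], ["id", "value"])

df.describe().show()                    # count, mean, stddev, min, max
df.summary("25%", "50%", "75%").show()  # configurable percentiles
print(df.stat.corr("id", "value"))      # pairwise correlation
```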


Portfolio Holdings Data: Analytics Content Retrieval
The analytics console looks very much like the query console, with the exception of the panels on the right, where you can map data structures from local or remote Couchbase collections as sources. The analytics service makes a copy of the original data and provides the ability to index it separately from the original source.
Claude Paugh
Apr 15 · 2 min read


Harnessing the Dask Python Library for Parallel Computing
Dask is a flexible library for parallel computing in Python. It is designed to scale from a single machine to a cluster of machines seamlessly. By using Dask, you can manage and manipulate large datasets that are too big to fit into memory on a single machine.
Claude Paugh
Apr 15 · 5 min read
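A minimal sketch of that scale-out story, assuming the distributed scheduler is installed; the same code drives a local process pool or, by pointing Client at a scheduler address, a full cluster:

```python
import dask.array as da
from dask.distributed import Client

client = Client()  # local cluster; Client("tcp://scheduler:8786") for remote

# ~3 GB of doubles held as many small NumPy chunks, never all in memory
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
result = (x + x.T).mean()

print(result.compute())  # the scheduler streams chunks through the workers
```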


Benefits of Data Architecture and Its Impact on Company Costs
Data architecture refers to the design and organization of data structures and systems within an organization. It defines how data is collected, stored, and used, serving as a blueprint for managing data assets.
Claude Paugh
Apr 15 · 5 min read


Spark Data Engineering: Best Practices and Use Cases
In today's data-driven world, organizations are generating vast amounts of data every second. This data can be a goldmine for insights when processed and analyzed effectively. One of the most powerful tools in this realm is Apache Spark.
Claude Paugh
Apr 15 · 4 min read