

ORC vs Parquet: Which File Format Flexes Harder in the Data Storage Showdown?
In the world of big data, choosing the right file format can significantly impact your project's success. Performance, storage efficiency, and usability are all key factors influenced by your choice. Two leading contenders in this arena are Apache ORC (Optimized Row Columnar) and Apache Parquet. This post explores both formats in detail, focusing on their structure, performance, and practical applications to help you decide which suits your needs best.
Claude Paugh
1 day ago · 4 min read
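For a hands-on feel for the comparison, here is a minimal sketch using PyArrow (assuming a build with ORC support, which the standard wheels include); the file names and toy table are arbitrary:

```python
# Write the same Arrow table to both formats, then read both back.
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.orc as orc

table = pa.table({"id": [1, 2, 3], "value": [10.5, 20.1, 30.7]})

pq.write_table(table, "data.parquet")   # Parquet writer
orc.write_table(table, "data.orc")      # ORC writer

# Both files round-trip to the same data
print(pq.read_table("data.parquet").equals(orc.ORCFile("data.orc").read()))
```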


Comparing Apache Parquet, ORC, and JSON File Formats for Your Data Processing
In today's data-rich environment, selecting the right file format can make a world of difference. Whether you're handling big data projects, engaging in machine learning, or performing simple data tasks, knowing what each file format offers is key. In this blog post, we will explore the unique features, advantages, and limitations of three widely used formats: Apache Parquet, Apache ORC, and JSON.
Claude Paugh
Jul 8 · 4 min read


Apache Iceberg, Hadoop, & Hive: Open Your Datalake (Lakehouse), Part II
In this article I demonstrate user access to Hive metadata and the mechanisms used for creating result sets, showing how you can open up datalake or lakehouse data for your users.
Claude Paugh
Jun 24 · 7 min read
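As a preview, here is a minimal sketch of the kind of metadata access the article walks through, assuming a PySpark installation wired to a Hive metastore; the table name is hypothetical:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() attaches the session to the Hive metastore
spark = (
    SparkSession.builder
    .appName("hive-metadata-demo")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()          # databases known to the metastore
spark.sql("SHOW TABLES IN default").show()  # tables registered in a database

# Table-level metadata: location, input format, column types, statistics
spark.sql("DESCRIBE FORMATTED default.some_table").show(truncate=False)
```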


Apache Iceberg and Pandas Analytics: Part I
I generally like to try new things, and technology is no different. So, I decided to do some more in-depth research on the mechanics under the covers of Apache Iceberg.
I was specifically looking at some key items that are usually part of data management practices, regardless of the technology.
Claude Paugh
May 7 · 6 min read


Data Vault Modeling Design Uses
Data Vault is really a design paradigm, not a technology; it can be used on any relational database, or a datalake for that matter. It came about from a desire to get away from the star, star-cluster, constellation, and snowflake (not the DB company) schema designs frequently used in Data Marts and Data Warehouses.
Claude Paugh
May 2 · 9 min read
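To make the paradigm concrete, here is an illustrative sketch of its two core table types in SQLite via Python; the entity and column names are hypothetical, chosen only to show the pattern of hubs (business keys) and satellites (versioned descriptive attributes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,    -- hash of the business key
    customer_id   TEXT NOT NULL,       -- the business key itself
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE sat_customer_details (
    customer_hk   TEXT NOT NULL REFERENCES hub_customer (customer_hk),
    load_date     TEXT NOT NULL,
    name          TEXT,
    email         TEXT,
    PRIMARY KEY (customer_hk, load_date)  -- history is kept per load date
);
""")
```

A link table relating hub keys to one another would follow the same pattern.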


How to Leverage Python Dask for Scalable Data Processing and Analysis
In today’s data-driven world, processing and analyzing large datasets efficiently can be a major challenge for software engineers and data scientists. Traditional data processing libraries like Pandas, while user-friendly, may struggle with the vast volumes of data that many organizations face.
Claude Paugh
Apr 25 · 7 min read
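A minimal sketch of the Pandas-style API Dask layers over parallel execution; the CSV path is hypothetical, and nothing runs until .compute() is called:

```python
import dask.dataframe as dd

# Each matching CSV becomes one or more partitions processed in parallel
df = dd.read_csv("events-*.csv")

# Familiar Pandas-style operations only build a task graph
daily = df.groupby("date")["amount"].sum()

# compute() hands the graph to the scheduler and returns a Pandas result
print(daily.compute())
```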


Mastering Aggregations with Apache Spark DataFrames and Spark SQL in Scala, Python, and SQL
If you want to harness the power of big data, Apache Spark is your go-to framework. It offers robust APIs and a rich ecosystem, perfect for processing large datasets. In particular, Spark's ability to conduct aggregations using DataFrames and Spark SQL makes it an invaluable tool.
Claude Paugh
Apr 24 · 4 min read
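As a taste of what the post covers, here is a minimal sketch on a toy DataFrame, with the same aggregation expressed through the DataFrame API and through Spark SQL:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("agg-demo").getOrCreate()
sales = spark.createDataFrame(
    [("east", 100.0), ("east", 250.0), ("west", 80.0)],
    ["region", "amount"],
)

# DataFrame API
sales.groupBy("region").agg(
    F.sum("amount").alias("total"),
    F.avg("amount").alias("average"),
).show()

# The equivalent Spark SQL over a temporary view
sales.createOrReplaceTempView("sales")
spark.sql(
    "SELECT region, SUM(amount) AS total, AVG(amount) AS average "
    "FROM sales GROUP BY region"
).show()
```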


How I Optimized Apache Spark Jobs to Prevent Excessive Shuffling
When working with Apache Spark, I often found myself facing a common yet challenging performance issue: excessive shuffling. Shuffling can drastically slow down your application.
Claude Paugh
Apr 24 · 3 min read
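One common way to curb shuffling, sketched here on toy data: broadcasting a small dimension table so a join avoids moving the large side across the cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
facts = spark.range(1_000_000).withColumnRenamed("id", "country_id")
dims = spark.createDataFrame([(0, "US"), (1, "FR")], ["country_id", "name"])

# broadcast() hints Spark to ship the small table to every executor,
# turning a shuffle-heavy sort-merge join into a map-side hash join
joined = facts.join(broadcast(dims), "country_id")
joined.explain()  # the plan should show BroadcastHashJoin
```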


How I Optimize Data Access for Apache Spark RDD
Optimizing data access in Apache Spark's Resilient Distributed Datasets (RDDs) can significantly boost the performance of big data applications. Using effective strategies can lead to faster processing times and improved resource utilization.
Claude Paugh
Apr 24 · 3 min read
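One such strategy in a minimal sketch: persisting an RDD that is reused, so it is computed once rather than on every action (the log path is hypothetical):

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="rdd-cache-demo")
errors = sc.textFile("events.log").filter(lambda line: "ERROR" in line)

# Keep partitions in memory, spilling to disk if they do not fit
errors.persist(StorageLevel.MEMORY_AND_DISK)

print(errors.count())   # first action materializes and caches the RDD
print(errors.take(5))   # later actions reuse the cached partitions
```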


Understanding HDF5: The Versatile Data Format Explained with Examples
HDF5, or Hierarchical Data Format version 5, is an open-source file format that enables efficient storage and management of large data sets. Built by the HDF Group, it is extensively used across various fields such as science, engineering, and data analysis.
Claude Paugh
Apr 22 · 4 min read
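A minimal sketch with h5py, the common Python binding; the group, dataset, and attribute names are arbitrary:

```python
import h5py
import numpy as np

# Groups give the file its hierarchy; datasets hold the arrays
with h5py.File("experiment.h5", "w") as f:
    run = f.create_group("run_001")
    temps = run.create_dataset(
        "temperatures", data=np.random.rand(1000), compression="gzip"
    )
    temps.attrs["units"] = "celsius"   # metadata travels with the data

# Path-style access on read
with h5py.File("experiment.h5", "r") as f:
    print(f["run_001/temperatures"][:10])
```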


Exploring Apache Iceberg and HDF5 Use Cases in Modern Data Management
Choosing between HDF5 and Apache Iceberg can feel overwhelming due to their distinct features and advantages. Armed with the right knowledge, you are better equipped to make a decision tailored to your data science needs.
Claude Paugh
Apr 22 · 4 min read


Unlocking the Potential of Apache Iceberg in Cloud-Based Data Engineering Strategies
In today's fast-paced digital world, data is a powerful asset for organizations. With the increasing volume of data, companies need innovative solutions to handle this wealth of information efficiently. One such breakthrough technology is Apache Iceberg.
Claude Paugh
Apr 22 · 4 min read


Harnessing the Power of Dask for Scalable Data Science Workflows
In our data-driven world, organizations face a significant challenge: processing and analyzing vast amounts of data efficiently. As data volumes increase (projected to reach 175 zettabytes by 2025), that challenge only grows.
Claude Paugh
Apr 22 · 5 min read


ETF & Mutual Fund Portfolios: Infrastructure
I ended up with Neo4j despite trying Memgraph, TigerGraph, and others; it seemed the most mature and widely supported, and it was stable on the limited infrastructure I was using.
Claude Paugh
Apr 19 · 12 min read


Apache Spark Best Practices: Optimize Your Data Processing
Apache Spark is a powerful open-source distributed computing system that excels in big data processing. It is lauded for its speed and ease of use, making it a favorite among software engineers and data scientists.
Claude Paugh
Apr 16 · 4 min read


Gathering Data Statistics Using PySpark: A Comparative Analysis with Scala
Data processing and statistics gathering are essential tasks in today's data-driven world. Engineers frequently find themselves choosing between tools like PySpark and Scala when embarking on these tasks.
Claude Paugh
Apr 15 · 5 min read
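A minimal sketch of the statistics PySpark offers out of the box, on a toy DataFrame:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stats-demo").getOrCreate()
df = spark.createDataFrame([(1, 2.0), (2, 3.5), (3, 7.25)], ["id", "value"])

df.describe().show()                    # count, mean, stddev, min, max
df.summary("25%", "50%", "75%").show()  # configurable percentiles
print(df.stat.corr("id", "value"))      # pairwise correlation
```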


Portfolio Holdings Data: Analytics Content Retrieval
The analytics console looks very much like the query console, with the exception of the panels on the right, where you can map data structures from local or remote Couchbase collections as sources. The analytics service makes a copy of the original data and provides the ability to index it separately from the original source.
Claude Paugh
Apr 15 · 2 min read


Harnessing the Dask Python Library for Parallel Computing
Dask is a flexible library for parallel computing in Python. It is designed to scale from a single machine to a cluster of machines seamlessly. By using Dask, you can manage and manipulate large datasets that are too big to fit into memory on a single machine.
Claude Paugh
Apr 15 · 5 min read
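A minimal sketch of that scale-out story, assuming the distributed scheduler is installed; the same code drives a local process pool or, by pointing Client at a scheduler address, a full cluster:

```python
import dask.array as da
from dask.distributed import Client

client = Client()  # local cluster; Client("tcp://scheduler:8786") for remote

# ~3 GB of doubles held as many small NumPy chunks, never all in memory
x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
result = (x + x.T).mean()

print(result.compute())  # the scheduler streams chunks through the workers
```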


Benefits of Data Architecture and Its Impact on Company Costs
Data architecture refers to the design and organization of data structures and systems within an organization. It defines how data is collected, stored, and used, serving as a blueprint for managing data assets.
Claude Paugh
Apr 15 · 5 min read


Spark Data Engineering: Best Practices and Use Cases
In today's data-driven world, organizations are generating vast amounts of data every second. This data can be a goldmine for insights when processed and analyzed effectively. One of the most powerful tools in this realm is Apache Spark.
Claude Paugh
Apr 15 · 4 min read