

ORC vs Parquet: Which File Format Flexes Harder in the Data Storage Showdown?
In the world of big data, choosing the right file format can significantly impact your project's success. Performance, storage efficiency, and usability are all key factors influenced by that choice. Two leading contenders in this arena are Apache ORC (Optimized Row Columnar) and Apache Parquet. This post explores both formats in detail, focusing on their structure, performance, and practical applications to help you decide which best suits your needs.
Claude Paugh
1 day ago · 4 min read
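As a quick taste of the comparison, here is a minimal PyArrow sketch that writes the same table in both formats (my illustration, not code from the post; it assumes a pyarrow build with ORC support, and the file and column names are made up):

```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.orc as orc

# The same in-memory table, persisted in both columnar formats.
table = pa.table({"id": [1, 2, 3], "score": [0.9, 0.7, 0.8]})

pq.write_table(table, "example.parquet")  # Parquet: Snappy compression by default
orc.write_table(table, "example.orc")     # ORC: stripes with lightweight built-in indexes

# Both round-trip back to the same data.
print(pq.read_table("example.parquet").to_pydict())
print(orc.read_table("example.orc").to_pydict())
```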


Data Lake and Lakehouse: Comparison of Apache Kylin and Trino for Business Intelligence Analytics
In today's dynamic business landscape, having the right tools for data analysis can make all the difference. With the vast amount of data available, businesses need efficient ways to process and analyze it for better decision-making. Two powerful platforms that stand out in this area are Apache Kylin and Trino (formerly PrestoSQL). While both serve important functions in analytics, understanding how they differ is key for data professionals looking to leverage these technologies.
Claude Paugh
2 days ago · 6 min read


Comparing Apache Hive, AWS Glue, and Google Data Catalog
Navigating the landscape of data processing and management tools can be a daunting task for software engineers. With so many options available, it is crucial to identify which solution aligns best with your specific workflow needs. In this post, we will compare three popular tools: Apache Hive, AWS Glue, and Google Data Catalog.
Claude Paugh
Jul 8 · 6 min read


Data Lake or Lakehouse: Distinctions in Modern Data Architecture
In today's data-driven world, organizations face challenges related to the sheer volume and complexity of data. Two major frameworks, data lakes and lakehouses, have emerged to help businesses manage and harness their data effectively. This post provides a clear comparison of both concepts, highlighting their unique features and practical applications within modern data architecture.
Claude Paugh
May 18 · 6 min read


Apache Iceberg and Pandas Analytics: Part I
I generally like to try new things, and technology is no different. So, I decided to do some more in-depth research on the mechanics under the covers of Apache Iceberg. I was specifically looking at some key items that are usually part of data management practices, regardless of the technology.
Claude Paugh
May 7 · 6 min read


Data Vault Modeling Design Uses
Data Vault is really a design paradigm, not a technology; it can be used on any relational database, or a data lake for that matter. It came about from a desire to move away from the star, star-cluster, constellation, and snowflake (not the DB company) schema designs frequently used in data marts and data warehouses.
Claude Paugh
May 2 · 9 min read


How I Optimized Apache Spark Jobs to Prevent Excessive Shuffling
When working with Apache Spark, I often found myself facing a common yet challenging performance issue: excessive shuffling. Shuffling can drastically slow down your application.
Claude Paugh
Apr 24 · 3 min read
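One classic shuffle-avoidance tactic in the territory this post covers is broadcasting a small table so a join happens map-side. The sketch below is my own illustration, not code from the article; the table and column names are made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

# A large fact table and a small dimension table (synthetic data).
facts = spark.range(1_000_000).withColumnRenamed("id", "user_id")
dims = spark.createDataFrame(
    [(i, f"segment-{i % 3}") for i in range(100)], ["user_id", "segment"]
)

# broadcast() ships the small table to every executor, so the large
# table is joined in place instead of being shuffled across the cluster.
joined = facts.join(broadcast(dims), "user_id")
joined.explain()  # the plan should show BroadcastHashJoin rather than SortMergeJoin
```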


How I Optimize Data Access for Apache Spark RDD
Optimizing data access in Apache Spark's Resilient Distributed Datasets (RDDs) can significantly boost the performance of big data applications. Using effective strategies can lead to faster processing times and improved resource utilization.
Claude Paugh
Apr 24 · 3 min read
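As one example of the kind of strategy involved, here is a hedged sketch of caching an RDD that is read more than once (my illustration; the input path and parsing are hypothetical):

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="rdd-access-demo")

# Hypothetical comma-separated log file, split into 8 partitions up front.
rdd = sc.textFile("events.log", minPartitions=8)
parsed = rdd.map(lambda line: line.split(",")).persist(StorageLevel.MEMORY_AND_DISK)

# Both actions below reuse the cached partitions instead of re-reading the file.
total = parsed.count()
distinct_keys = parsed.map(lambda fields: fields[0]).distinct().count()

parsed.unpersist()
```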


Understanding HDF5: The Versatile Data Format Explained with Examples
HDF5, or Hierarchical Data Format version 5, is an open-source file format that enables efficient storage and management of large data sets. Built by the HDF Group, it is extensively used across various fields such as science, engineering, and data analysis.
Claude Paugh
Apr 22 · 4 min read
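For a flavor of the examples, here is a small h5py sketch of HDF5's hierarchical layout (my illustration; the group and dataset names are made up):

```python
import h5py
import numpy as np

with h5py.File("experiment.h5", "w") as f:
    grp = f.create_group("sensors")              # groups act like folders
    dset = grp.create_dataset(
        "temperature", data=np.random.rand(1000),
        compression="gzip",                      # chunked, compressed storage
    )
    dset.attrs["units"] = "celsius"              # metadata travels with the data

with h5py.File("experiment.h5", "r") as f:
    temps = f["sensors/temperature"][:100]       # read a slice, not the whole dataset
    print(f["sensors/temperature"].attrs["units"], temps.mean())
```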


ETF & Mutual Funds Portfolios: Infrastructure
I ended up with Neo4j after trying Memgraph, TigerGraph, and others; it seemed to be the most mature and widely supported option. It was also stable on the limited infrastructure I was using.
Claude Paugh
Apr 19 · 12 min read


Apache Spark Best Practices: Optimize Your Data Processing
Apache Spark is a powerful open-source distributed computing system that excels in big data processing. It is lauded for its speed and ease of use, making it a favorite among software engineers and data scientists.
Claude Paugh
Apr 16 · 4 min read


Gathering Data Statistics Using PySpark: A Comparative Analysis with Scala
Data processing and statistics gathering are essential tasks in today's data-driven world. Engineers frequently find themselves choosing between tools like PySpark and Scala when embarking on these tasks.
Claude Paugh
Apr 15 · 5 min read
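As a minimal illustration of statistics gathering on the PySpark side (my sketch, not from the post; the column names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stats-demo").getOrCreate()

df = spark.createDataFrame([(1, 10.5), (2, 23.1), (3, 17.8)], ["id", "price"])

df.describe("price").show()             # count, mean, stddev, min, max
df.summary("25%", "50%", "75%").show()  # approximate quantiles as well
```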


Portfolio Holdings Data: Analytics Content Retrieval
The analytics console looks very much like the query console, except for the panels on the right, where you can map data structures from local or remote Couchbase collections as sources. The analytics service makes a copy of the original data and provides the ability to index it separately from the original source.
Claude Paugh
Apr 15 · 2 min read


Harnessing the Dask Python Library for Parallel Computing
Dask is a flexible library for parallel computing in Python. It is designed to scale from a single machine to a cluster of machines seamlessly. By using Dask, you can manage and manipulate large datasets that are too big to fit into memory on a single machine.
Claude Paugh
Apr 15 · 5 min read
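A small sketch of that larger-than-memory model (my illustration; the file pattern and column names are made up):

```python
import dask.dataframe as dd

# Each matching CSV becomes one or more partitions; nothing is loaded yet.
ddf = dd.read_csv("logs/2024-*.csv")

# Builds a lazy task graph over all partitions, still without reading data.
daily_mean = ddf.groupby("day")["latency_ms"].mean()

print(daily_mean.compute())  # executes the graph in parallel, partition by partition
```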


Benefits of Data Architecture and Its Impact on Company Costs
Data architecture refers to the design and organization of data structures and systems within an organization. It defines how data is collected, stored, and used, serving as a blueprint for managing data assets.
Claude Paugh
Apr 15 · 5 min read