How I processed ONE billion rows in PySpark without crashing (and You Can Too!)

Ever tried running a PySpark job on 1 billion rows, only to watch it crash and burn?
Yep, been there. My job kept failing, and I was 🤏 this close to throwing my laptop out the window.
Turns out, the culprits were:
- Memory leaks 🤯
- Slow I/O operations 🐌
- Joins so inefficient they made me question life choices 🤦‍♂️
But after some trial and error (and a few cups of coffee ☕), I cracked the code!
Here’s how I made PySpark run like a dream:
1. Ditch CSV, embrace Parquet
Using CSV for big data is like driving a bicycle on a highway — slow, painful, and not built for scale.
- CSVs take 5X more space and load at a snail's pace.
- Parquet? 10X faster reads and way better compression.
- Always enable snappy compression:
df.write.parquet("output_path", compression="snappy")
💡 Swapping CSV with Parquet was an instant speed boost!