
How I Processed ONE Billion Rows in PySpark Without Crashing (And You Can Too!)

Varsha C Bendre
2 min read · Feb 19, 2025

Ever tried running a PySpark job on 1 billion rows, only to watch it crash and burn?

Yep, been there. My job kept failing, and I was 🤏 this close to throwing my laptop out the window.

Turns out, the culprits were:

  • Memory leaks 🤯
  • Slow I/O operations 🐌
  • Joins so inefficient they made me question life choices 🤦‍♂️

But after some trial and error (and a few cups of coffee ☕), I cracked the code!

Here’s how I made PySpark run like a dream:

1. Ditch CSV, embrace Parquet

Using CSV for big data is like riding a bicycle on a highway — slow, painful, and not built for scale.

  • CSVs take up ~5X more space and load at a snail's pace.
  • Parquet? 10X faster reads and way better compression.
  • Always enable snappy compression:
df.write.parquet("output_path", compression="snappy")

💡 Swapping CSV with Parquet was an instant speed boost!
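For context, here's a minimal sketch of that CSV-to-Parquet swap. The file paths, app name, and schema-inference options are illustrative placeholders, not the exact setup from my job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv_to_parquet").getOrCreate()

# One-time conversion: read the raw CSV, then persist it as snappy-compressed Parquet.
df = spark.read.csv("raw_data.csv", header=True, inferSchema=True)
df.write.parquet("data_parquet", compression="snappy", mode="overwrite")

# Every downstream job now reads the columnar Parquet copy instead of re-parsing CSV.
df = spark.read.parquet("data_parquet")

Pay the CSV parsing cost once, then let every later read benefit from Parquet's column pruning and compression.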

2. Partition Like a Pro
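A minimal sketch of what partition tuning can look like — the column names and partition count below are illustrative assumptions, not the exact values from my pipeline:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning").getOrCreate()
df = spark.read.parquet("data_parquet")

# Repartition by the join/group key so related rows land in the same partition
# and shuffles stay balanced. Column name and count here are placeholders.
df = df.repartition(200, "user_id")

# When writing, partition the output by a low-cardinality column so later reads
# can skip irrelevant files (partition pruning).
df.write.partitionBy("event_date").parquet("data_by_date", compression="snappy", mode="overwrite")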
