
How I processed ONE billion rows in PySpark without crashing (and You Can Too!)

Varsha C Bendre
2 min read · Feb 19, 2025

Ever tried running a PySpark job on 1 billion rows, only to watch it crash and burn?

Yep, been there. My job kept failing, and I was 🤏 this close to throwing my laptop out the window.

Turns out, the culprits were:

  • Memory leaks 🤯
  • Slow I/O operations 🐌
  • Joins so inefficient they made me question life choices 🤦‍♂️

But after some trial and error (and a few cups of coffee ☕), I cracked the code!

Here’s how I made PySpark run like a dream:

1. Ditch CSV, embrace Parquet

Using CSV for big data is like riding a bicycle on a highway: slow, painful, and not built for scale.

  • CSVs take roughly 5X more space and load at a snail's pace.
  • Parquet? 10X faster reads and way better compression.
  • Always enable Snappy compression (there's a fuller sketch after this section):
df.write.parquet("output_path", compression="snappy")

💡 Swapping CSV with Parquet was an instant speed boost!
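Here's a minimal end-to-end sketch of that swap. The paths, app name, and schema settings are just placeholders for illustration, so adjust them for your own data:

from pyspark.sql import SparkSession

# Hypothetical input/output paths for illustration
CSV_PATH = "data/events.csv"
PARQUET_PATH = "data/events_parquet"

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the raw CSV. inferSchema is convenient, but on a billion rows
# you'll want an explicit schema to avoid an extra pass over the data.
df = spark.read.csv(CSV_PATH, header=True, inferSchema=True)

# Write Parquet with Snappy compression: columnar storage plus
# compression is where the smaller footprint and faster reads come from.
df.write.mode("overwrite").parquet(PARQUET_PATH, compression="snappy")

# Downstream jobs read the Parquet copy instead of the CSV.
df_fast = spark.read.parquet(PARQUET_PATH)

Do the conversion once, and every job after that reads the columnar copy.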

2. Partition Like a Pro
