
Efficiently Reading and Processing Large CSV Files in Python to Avoid Memory Issues

10-26-2023 03:32 AM
hectorsalamanca
New Contributor III

I'm trying to read a large CSV file, but I keep running into memory issues. How can I efficiently read and process a large CSV file in Python, ensuring that I don't run out of memory?

Please provide a solution or guidance on how to handle large CSV files in Python to avoid memory problems.

6 Replies
AngelaSchirck
Occasional Contributor III

What have you tried?  How large are your files? I've read some pretty large files without issues.

VinceAngelo
Esri Esteemed Contributor

Which Python are you using?  I've ingested tens of millions of rows in 64-bit Python 2.7 (64-bit Geoprocessing for ArcMap), and processed 40-60 million rows in 64-bit Python 3 (ArcGIS Pro). Both of those VMs had less than 32GiB RAM available.

- V

HaydenWelch
Occasional Contributor

In my experience the practical limit is around 100-200 million rows. I've had to work with large datasets, for example every address in a country, and without generators you run out of memory quickly.

EarlMedina
Esri Regular Contributor

Out of curiosity, what sorts of things do you need to do? In the past, I ran into similar issues working with traffic data and the way I got around it was actually switching to Julia for the pre-processing tasks. Julia has a DataFrames.jl library that is similar to pandas.
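
For reference, pandas itself can also process a CSV in chunks so the whole file never has to sit in memory at once. A rough sketch of that approach (the file name, chunk size, and column name are just placeholders):

import pandas as pd

# Read the file in fixed-size chunks instead of all at once
for chunk in pd.read_csv("large_dataset.csv", chunksize=100_000):
    # Each chunk is an ordinary DataFrame, so normal pandas operations apply
    filtered = chunk[chunk["some_column"].notna()]
    # ... aggregate or write out each partial result here instead of keeping it all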

 

HaydenWelch
Occasional Contributor

Use Python Generators:

 

from collections.abc import Iterator

# A generator expression reads the file lazily, one line at a time
with open("path/to/csv") as f:
    rows = (row for row in f)
    for row in rows:
        process(row)  # process() is a placeholder for your per-row logic

# Alternatively, a generator function that also splits each row into fields
def read_csv(path: str, sep: str = ',') -> Iterator[list[str]]:
    with open(path) as f:
        for row in f:
            yield row.strip().split(sep)

for row in read_csv("path/to/csv"):
    process(row)

 

 

This will be slower than loading the whole file into memory, but memory use stays effectively constant: the file is read line by line, and only the line needed for the current operation is held in memory at any time.
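
The same streaming pattern works with the standard-library csv module, which also handles quoted fields and embedded separators. A rough sketch along the same lines (process() is again a placeholder for your own per-row logic):

import csv

# csv.reader yields one parsed row at a time, so memory use stays flat
with open("path/to/csv", newline="") as f:
    for row in csv.reader(f):
        process(row)  # row is a list[str] with quoting already handled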

SamirSinghaMahapatra
New Contributor

You can read a large CSV using PySpark in Python:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("large_file_read").getOrCreate()
df = spark.read.csv('large_dataset.csv', header=True)

You can get more details on this, and on how to choose the best way to read a large CSV, here.
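
Spark evaluates transformations lazily and works through the file in partitions rather than pulling everything into memory, so follow-up operations along these lines stay scalable (the column name and output path below are just placeholders):

# Aggregate across partitions without collecting the whole file
df.groupBy("some_column").count().show()

# Filter and write results out without materializing everything at once
df.filter(df["some_column"].isNotNull()).write.mode("overwrite").csv("output_dir", header=True)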
