
Efficiently Reading and Processing Large CSV Files in Python to Avoid Memory Issues

10-26-2023 03:32 AM
hectorsalamanca
Deactivated User

I'm trying to read a large CSV file, but I keep running into memory issues. How can I efficiently read and process a large CSV file in Python, ensuring that I don't run out of memory?

Please provide a solution or guidance on how to handle large CSV files in Python to avoid memory problems.

6 Replies
AngelaSchirck
Frequent Contributor

What have you tried?  How large are your files? I've read some pretty large files without issues.

VinceAngelo
Esri Esteemed Contributor

Which Python are you using?  I've ingested tens of millions of rows in 64-bit Python 2.7 (64-bit Geoprocessing for ArcMap), and processed 40-60 million rows in 64-bit Python 3 (ArcGIS Pro). Both of those VMs had less than 32GiB RAM available.

- V

HaydenWelch
Frequent Contributor

In my experience, the limit is around 100-200M rows. I've had to deal with large datasets that include, say, all addresses in a country, and you tend to run out of memory quickly unless you use generators.

EarlMedina
Esri Regular Contributor

Out of curiosity, what sorts of things do you need to do? In the past, I ran into similar issues working with traffic data and the way I got around it was actually switching to Julia for the pre-processing tasks. Julia has a DataFrames.jl library that is similar to pandas.

 

HaydenWelch
Frequent Contributor

Use Python Generators:

 

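import csv
from typing import Iterator

# Generator expression: lazily yields one raw line at a time.
# process() stands for your own per-row logic.
with open("path/to/csv", newline="") as f:
    lines = (line for line in f)
    for line in lines:
        process(line)

# Alternatively, a generator function that yields parsed rows

def read_csv(path: str, sep: str = ",") -> Iterator[list[str]]:
    # csv.reader handles quoted fields that contain the separator;
    # the with block closes the file once iteration finishes.
    with open(path, newline="") as f:
        yield from csv.reader(f, delimiter=sep)

for row in read_csv("path/to/csv"):
    process(row)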

 

 

This can be slower than loading the whole file into memory, but memory use stays roughly constant: the file is iterated line by line, and only the line needed for the current operation is loaded.
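As a minimal sketch of the fixed-memory idea above, the following streams a CSV with the standard library's csv.DictReader and keeps only running totals in memory. The function name sum_column, the column name amount, and the sample data are hypothetical and chosen purely for illustration; adapt them to your own file.

```python
import csv
import os
import tempfile


def sum_column(path: str, column: str) -> tuple[int, float]:
    """Stream a CSV and return (row_count, total) for one numeric column.

    Only the current row and two running totals are held in memory.
    """
    count = 0
    total = 0.0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):  # yields one dict per row, lazily
            count += 1
            total += float(row[column])
    return count, total


# Hypothetical sample data, written to a temporary file for the demo.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline=""
) as tmp:
    tmp.write("id,amount\n1,2.5\n2,4.0\n3,3.5\n")
    sample_path = tmp.name

row_count, amount_total = sum_column(sample_path, "amount")
os.remove(sample_path)  # clean up the temporary file
```

Because the aggregation happens inside the loop, the same pattern should work for files far larger than RAM, as long as whatever you accumulate (counts, sums, a small dict of groups) stays small.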

SamirSinghaMahapatra
New Contributor

You can read a large CSV file using PySpark in Python:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("large_file_read").getOrCreate()
df = spark.read.csv('large_dataset.csv', header=True)

You can get more details on this, and on how to choose the best way to read a large CSV file, here
