I'm trying to read a large CSV file, but I keep running into memory issues. How can I efficiently read and process a large CSV file in Python, ensuring that I don't run out of memory?
Please provide a solution or guidance on how to handle large CSV files in Python to avoid memory problems.
What have you tried? How large are your files? I've read some pretty large files without issues.
Which Python are you using? I've ingested tens of millions of rows in 64-bit Python 2.7 (64-bit Geoprocessing for ArcMap), and processed 40-60 million rows in 64-bit Python 3 (ArcGIS Pro). Both of those VMs had less than 32GiB RAM available.
The limit is around 100-200M rows in my experience. I've had to deal with large datasets that include, say, all addresses in a country, and you quickly run out of memory without using generators.
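As a rough illustration of the generator point (a sketch, not from the original comment; the file path is a placeholder): a list materializes every line at once, while a generator keeps only the current line in memory.

    # Loads every line into a list first -- memory grows with file size.
    with open("big.csv") as f:
        row_count = len(f.readlines())

    # Streams one line at a time -- memory stays roughly constant.
    with open("big.csv") as f:
        row_count = sum(1 for _ in f)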
Out of curiosity, what sorts of things do you need to do? In the past, I ran into similar issues working with traffic data and the way I got around it was actually switching to Julia for the pre-processing tasks. Julia has a DataFrames.jl library that is similar to pandas.
Use Python Generators:
    # Generator expression: lines are read lazily, one at a time.
    rows = (row for row in open("path/to/csv"))
    for row in rows:
        process(row)

    # Alternatively, a generator function that also splits each line:
    from typing import Iterator

    def read_csv(path: str, sep: str = ',') -> Iterator[list[str]]:
        with open(path) as f:
            for row in f:
                yield row.strip().split(sep)

    for row in read_csv("path/to/csv"):
        process(row)
This will be slower than loading the whole file into memory, but it uses only a fixed amount of memory: the file is iterated line by line, and only the line needed for the current operation is ever loaded.
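One caveat: if the file contains quoted fields (commas inside values), a plain str.split will mis-parse them. The standard library's csv module streams rows in the same lazy way while handling quoting correctly. A minimal sketch, with process() and the path as placeholders just like the answer above:

    import csv

    def read_csv_rows(path: str, sep: str = ','):
        # csv.reader consumes the file object lazily, one row at a time,
        # and correctly handles quoted fields and embedded separators.
        with open(path, newline='') as f:
            yield from csv.reader(f, delimiter=sep)

    for row in read_csv_rows("path/to/csv"):
        process(row)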
You can read a large CSV using PySpark in Python:
from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("large_file_read").getOrCreate()

    # Spark evaluates lazily and reads the file in partitions, so the whole
    # file is never pulled into driver memory at once.
    df = spark.read.csv('large_dataset.csv', header=True)
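To process the data without pulling it onto the driver, chain transformations and write the result back out. A rough sketch; the column name and output path below are made up for illustration:

    # Hypothetical example: filter rows and write the result back to disk.
    # 'some_column' and the output path are placeholders, not from the answer.
    filtered = df.filter(df['some_column'].isNotNull())
    filtered.write.mode('overwrite').parquet('filtered_output.parquet')

    # Avoid df.collect() on a large dataset -- it pulls every row into driver memory.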
You can get more details on this, and on how to choose the best way to read a large CSV, here