<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Efficiently Reading and Processing Large CSV Files in Python to Avoid Memory Issues in Python Questions</title>
    <link>https://community.esri.com/t5/python-questions/efficiently-reading-and-processing-large-csv-files/m-p/1341976#M69044</link>
    <description>&lt;P&gt;Which Python are you using?&amp;nbsp; I've ingested tens of millions of rows in 64-bit Python 2.7 (64-bit Geoprocessing for ArcMap), and processed 40-60 million rows in 64-bit Python 3 (ArcGIS Pro). Both of those VMs had less than 32GiB RAM available.&lt;/P&gt;&lt;P&gt;- V&lt;/P&gt;</description>
    <pubDate>Thu, 26 Oct 2023 14:20:39 GMT</pubDate>
    <dc:creator>VinceAngelo</dc:creator>
    <dc:date>2023-10-26T14:20:39Z</dc:date>
    <item>
      <title>Efficiently Reading and Processing Large CSV Files in Python to Avoid Memory Issues</title>
      <link>https://community.esri.com/t5/python-questions/efficiently-reading-and-processing-large-csv-files/m-p/1341857#M69042</link>
      <description>&lt;P&gt;I'm trying to read a large CSV file, but I keep running into memory issues. How can I efficiently read and process a large CSV file in Python, ensuring that I don't run out of memory?&lt;/P&gt;&lt;P&gt;Please provide a solution or guidance on how to handle large CSV files in Python to avoid memory problems.&lt;/P&gt;</description>
      <pubDate>Thu, 26 Oct 2023 10:32:19 GMT</pubDate>
      <guid>https://community.esri.com/t5/python-questions/efficiently-reading-and-processing-large-csv-files/m-p/1341857#M69042</guid>
      <dc:creator>hectorsalamanca</dc:creator>
      <dc:date>2023-10-26T10:32:19Z</dc:date>
    </item>
    <item>
      <title>Re: Efficiently Reading and Processing Large CSV Files in Python to Avoid Memory Issues</title>
      <link>https://community.esri.com/t5/python-questions/efficiently-reading-and-processing-large-csv-files/m-p/1341863#M69043</link>
      <description>&lt;P&gt;What have you tried?&amp;nbsp; How large are your files? I've read some pretty large files without issues.&lt;/P&gt;</description>
      <pubDate>Thu, 26 Oct 2023 11:22:00 GMT</pubDate>
      <guid>https://community.esri.com/t5/python-questions/efficiently-reading-and-processing-large-csv-files/m-p/1341863#M69043</guid>
      <dc:creator>AngelaSchirck</dc:creator>
      <dc:date>2023-10-26T11:22:00Z</dc:date>
    </item>
    <item>
      <title>Re: Efficiently Reading and Processing Large CSV Files in Python to Avoid Memory Issues</title>
      <link>https://community.esri.com/t5/python-questions/efficiently-reading-and-processing-large-csv-files/m-p/1341976#M69044</link>
      <description>&lt;P&gt;Which Python are you using?&amp;nbsp; I've ingested tens of millions of rows in 64-bit Python 2.7 (64-bit Geoprocessing for ArcMap), and processed 40-60 million rows in 64-bit Python 3 (ArcGIS Pro). Both of those VMs had less than 32GiB RAM available.&lt;/P&gt;&lt;P&gt;- V&lt;/P&gt;</description>
      <pubDate>Thu, 26 Oct 2023 14:20:39 GMT</pubDate>
      <guid>https://community.esri.com/t5/python-questions/efficiently-reading-and-processing-large-csv-files/m-p/1341976#M69044</guid>
      <dc:creator>VinceAngelo</dc:creator>
      <dc:date>2023-10-26T14:20:39Z</dc:date>
    </item>
    <item>
      <title>Re: Efficiently Reading and Processing Large CSV Files in Python to Avoid Memory Issues</title>
      <link>https://community.esri.com/t5/python-questions/efficiently-reading-and-processing-large-csv-files/m-p/1342735#M69057</link>
      <description>&lt;P&gt;Out of curiosity, what sorts of things do you need to do? In the past, I ran into similar issues working with traffic data, and the way I got around it was actually switching to Julia for the pre-processing tasks. Julia has a &lt;A href="https://dataframes.juliadata.org/stable/" target="_blank"&gt;DataFrames.jl&lt;/A&gt;&amp;nbsp;library that is similar to pandas.&lt;/P&gt;
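&lt;P&gt;If you want to stay in Python, pandas can also stream a CSV in chunks instead of loading it whole. A minimal sketch (the file name, chunk size, and per-chunk work are placeholders):&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import pandas as pd

# chunksize makes read_csv return an iterator of DataFrames
# instead of one big frame, so memory use stays bounded
total_rows = 0
for chunk in pd.read_csv("large_dataset.csv", chunksize=100_000):
    total_rows += len(chunk)  # replace with your real per-chunk processing
print(total_rows)&lt;/LI-CODE&gt;</description>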
      <pubDate>Fri, 27 Oct 2023 15:42:51 GMT</pubDate>
      <guid>https://community.esri.com/t5/python-questions/efficiently-reading-and-processing-large-csv-files/m-p/1342735#M69057</guid>
      <dc:creator>EarlMedina</dc:creator>
      <dc:date>2023-10-27T15:42:51Z</dc:date>
    </item>
    <item>
      <title>Re: Efficiently Reading and Processing Large CSV Files in Python to Avoid Memory Issues</title>
      <link>https://community.esri.com/t5/python-questions/efficiently-reading-and-processing-large-csv-files/m-p/1342805#M69060</link>
      <description>&lt;P&gt;Use Python generators:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# A generator expression yields one line at a time instead of
# reading the whole file into memory
rows = (row for row in open("path/to/csv"))

for row in rows:
    process(row)

# Alternatively, as a generator function
from collections.abc import Iterator

def read_csv(path: str, sep: str = ',') -&amp;gt; Iterator[list[str]]:
    # with ensures the file is closed once the generator is exhausted
    with open(path) as f:
        for row in f:
            yield row.strip().split(sep)

for row in read_csv("path/to/csv"):
    process(row)&lt;/LI-CODE&gt;&lt;P&gt;This will be slower than loading the whole file into memory, but it uses a fixed amount of memory: it iterates through the file line by line, loading only the line needed for the current operation.&lt;/P&gt;
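&lt;P&gt;One caveat: a plain split(sep) will mis-parse quoted fields that contain the separator. If your data can have those, here is a minimal variant using the standard library csv module (same placeholder path and process() as above):&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import csv
from collections.abc import Iterator

def read_csv_rows(path: str) -&amp;gt; Iterator[list[str]]:
    # newline='' lets the reader handle newlines embedded in quoted fields
    with open(path, newline='') as f:
        yield from csv.reader(f)

for row in read_csv_rows("path/to/csv"):
    process(row)&lt;/LI-CODE&gt;</description>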
      <pubDate>Fri, 27 Oct 2023 16:57:23 GMT</pubDate>
      <guid>https://community.esri.com/t5/python-questions/efficiently-reading-and-processing-large-csv-files/m-p/1342805#M69060</guid>
      <dc:creator>HaydenWelch</dc:creator>
      <dc:date>2023-10-27T16:57:23Z</dc:date>
    </item>
    <item>
      <title>Re: Efficiently Reading and Processing Large CSV Files in Python to Avoid Memory Issues</title>
      <link>https://community.esri.com/t5/python-questions/efficiently-reading-and-processing-large-csv-files/m-p/1342808#M69061</link>
      <description>&lt;P&gt;In my experience the limit is around 100-200 million rows. I've had to deal with large datasets, such as all addresses in a country, and without generators you tend to run out of memory quickly.&lt;/P&gt;</description>
      <pubDate>Fri, 27 Oct 2023 17:01:17 GMT</pubDate>
      <guid>https://community.esri.com/t5/python-questions/efficiently-reading-and-processing-large-csv-files/m-p/1342808#M69061</guid>
      <dc:creator>HaydenWelch</dc:creator>
      <dc:date>2023-10-27T17:01:17Z</dc:date>
    </item>
    <item>
      <title>Re: Efficiently Reading and Processing Large CSV Files in Python to Avoid Memory Issues</title>
      <link>https://community.esri.com/t5/python-questions/efficiently-reading-and-processing-large-csv-files/m-p/1367784#M69582</link>
      <description>&lt;P&gt;You can read a large CSV using PySpark in Python:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("large_file_read").getOrCreate()
df = spark.read.csv('large_dataset.csv', header=True)&lt;/LI-CODE&gt;&lt;P&gt;You can get more details on this and how to choose the best way to &lt;A title="read large csv" href="https://enodeas.com/how-to-read-large-csv-file-in-python-best-approach/" target="_blank" rel="noopener"&gt;read large csv&lt;/A&gt; here.&lt;/P&gt;
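&lt;P&gt;Note that spark.read.csv is lazy: nothing is actually read until an action runs, and Spark then processes the file partition by partition rather than all at once. A quick usage sketch, assuming a placeholder column name and output path:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# Actions like count() trigger the actual, partitioned read
row_count = df.count()

# Filter and write out without ever materializing the whole CSV in memory
df.filter(df["some_column"].isNotNull()).write.parquet("output_dir")&lt;/LI-CODE&gt;</description>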
      <pubDate>Tue, 09 Jan 2024 12:16:29 GMT</pubDate>
      <guid>https://community.esri.com/t5/python-questions/efficiently-reading-and-processing-large-csv-files/m-p/1367784#M69582</guid>
      <dc:creator>SamirSinghaMahapatra</dc:creator>
      <dc:date>2024-01-09T12:16:29Z</dc:date>
    </item>
  </channel>
</rss>