Mastering Massive Files: Tips, Tricks, and Tools for Data Engineers

In the world of data engineering, handling huge files is a common challenge. These massive datasets can push the limits of your hardware and software, making even simple tasks like viewing or editing files a daunting prospect. In this article, we’ll explore some essential tips, tricks, and tools to help you efficiently manage and process huge files in your data engineering projects.

1. Viewing Large Files Without Freezing Your System

Windows: Lister

  • Tool: Lister (https://www.ghisler.com/lister/ or mirror)
  • Why it’s useful: Lets you view large files easily without loading the entire file into memory. Think of it as a simple, read-only Notepad built for huge files.
  • How to use:
    1. Download and install Lister
    2. Right-click on the large file and select “Lister” to open
    3. Navigate through the file using Lister’s efficient interface

Linux: Command-line Tools

  • Tools: less, head, tail
  • Why they’re useful: These tools allow you to view parts of a file without loading it entirely
  • How to use:
    • less filename: View file contents page by page
    • head -n 100 filename: View the first 100 lines
    • tail -n 100 filename: View the last 100 lines

2. Editing Large Files

Linux: Text Editors for Large Files

  • Tools: vim, emacs
  • Why they’re useful: These editors can handle very large files efficiently
  • How to use:
    • vim largefile.txt: Open the file in Vim
    • emacs -Q --eval '(find-file-literally "largefile.txt")': Open the file in Emacs literally, skipping encoding conversion and syntax highlighting so very large files open much faster

Linux: Bash scripts

  • Tools: sed, awk
  • Why they’re useful: These tools can handle very large files with minimal resources. TIP: LLM Chatbots like Claude & ChatGPT are great at helping you come up with the correct command.
  • How to use:
Use sed to replace the /" string with "" and save the result to a new file:
sed 's/\/"/""/g' "test.csv" > "test_edited.csv"
Process the file in batches with sed
file="test.csv" 
base_name="${file%.*}" 
extension="${file##*.}" 
temp_dir=$(mktemp -d) 

# Split the file into smaller chunks 
split -l 10000 "$file" "${temp_dir}/chunk_" 

# Process each chunk 
for chunk in "${temp_dir}"/chunk_*; do 
       sed 's/\/"/""/g' "$chunk" > "${chunk}_processed" 
done 

# Combine processed chunks
cat "${temp_dir}"/*_processed > "${base_name}_edited.${extension}"

# Clean up temporary files
 rm -rf "$temp_dir"


3. Processing Files in Batches

Python: Chunking with Pandas

  • Why it’s useful: Allows processing of large CSV files without loading the entire file into memory
  • How to use:
import pandas as pd

chunksize = 100000  # Adjust based on your system's memory
for chunk in pd.read_csv('large_file.csv', chunksize=chunksize):
    # Process each chunk
    process_data(chunk)

Hadoop Ecosystem: Distributed Processing

  • Tools: Hadoop, Spark
  • Why they’re useful: Distributes processing across multiple nodes
  • How to use:
    1. Set up a Hadoop or Spark cluster
    2. Use tools like Hive or Spark SQL to query and process large datasets (see the PySpark sketch below)
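
For example, here is a minimal PySpark sketch that reads a large CSV and queries it with Spark SQL. The file path, view name, and query are placeholders, and it assumes a working Spark installation or an existing cluster:

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session
spark = SparkSession.builder.appName('large-file-processing').getOrCreate()

# Read a large CSV; the path can also point to HDFS, S3, or GCS so the work is distributed
# Note: inferSchema scans the data; for very large files consider passing an explicit schema
df = spark.read.csv('large_file.csv', header=True, inferSchema=True)

# Register the DataFrame as a temporary view and query it with Spark SQL
df.createOrReplaceTempView('events')
spark.sql('SELECT COUNT(*) AS row_count FROM events').show()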

4. Compression Techniques

Gzip Compression

  • Why it’s useful: Reduces file size for storage and faster processing; many tools can also read compressed data directly (see the sketch below)
  • How to use:
    • Compress: gzip largefile.txt
    • Decompress: gunzip largefile.txt.gz
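
Many tools can read gzip-compressed files directly, so you often don’t need to decompress them at all. For example, pandas infers gzip compression from the .gz extension and can combine it with the chunking approach from section 3. A minimal sketch (file name and chunk size are placeholders):

import pandas as pd

# pandas detects the gzip compression from the .gz extension
for chunk in pd.read_csv('large_file.csv.gz', chunksize=100000):
    process_data(chunk)  # replace with your own processing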

5. Streaming Processing

Python: Iterating Through Large Files

  • Why it’s useful: Processes files line by line without loading entire file into memory
  • How to use:
with open('large_file.txt', 'r') as file:
    for line in file:
        # Process each line
        process_line(line)

6. Cloud-Based Solutions for Processing Large Files

Google Cloud’s Dataflow

Google Cloud’s Dataflow is a fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem. It’s particularly useful for processing large files and datasets.

Key features:

  • Serverless: No need to manage infrastructure
  • Auto-scaling: Adjusts resources based on workload
  • Batch and streaming: Handles both batch and real-time data processing
  • Built-in templates: Offers pre-built templates for common data processing tasks

Example use case:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()
with beam.Pipeline(options=options) as p:
    lines = p | 'ReadLargeFile' >> beam.io.ReadFromText('gs://your-bucket/large-file.txt')
    processed = lines | 'ProcessData' >> beam.Map(lambda x: x.upper())
    processed | 'WriteResults' >> beam.io.WriteToText('gs://your-bucket/output')

Other options include:

  • Amazon EMR (Elastic MapReduce): Amazon EMR is a cloud-based big data platform for processing vast amounts of data using open-source tools such as Apache Spark, Hive, HBase, Flink, and Presto.
  • Azure HDInsight: Azure HDInsight is a fully managed, open-source analytics service for enterprises. It makes it easy to process massive amounts of data with open-source frameworks.

7. Database Solutions for Large Datasets

If you can upload the data directly to a database (sometimes you will need to clean it up before you can), working with it becomes much easier.

PostgreSQL

PostgreSQL is a powerful, open-source object-relational database system that can handle large datasets efficiently.

Key features for large datasets:

  • Partitioning: Divide large tables into smaller, more manageable pieces
  • Parallel query: Utilize multiple CPU cores for faster query execution
  • JSONB data type: Efficiently store and query semi-structured data
  • Foreign data wrappers: Query external data sources as if they were PostgreSQL tables
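
As an illustration of the “upload directly to a DB” approach, here is a minimal bulk-load sketch using psycopg2 and PostgreSQL’s COPY command. The connection settings, file name, and my_table are hypothetical, and the target table is assumed to already exist with matching columns:

import psycopg2

# Hypothetical connection settings; adjust to your environment
conn = psycopg2.connect(host='localhost', dbname='analytics', user='postgres', password='secret')

with conn, conn.cursor() as cur, open('large_file.csv', 'r') as f:
    # COPY streams the file into the table without loading it into Python memory
    cur.copy_expert("COPY my_table FROM STDIN WITH (FORMAT csv, HEADER true)", f)

conn.close()

COPY is typically much faster than inserting rows one by one, which matters a lot for multi-gigabyte files.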

MongoDB

MongoDB is a popular NoSQL database that excels at handling large volumes of unstructured or semi-structured data.

Key features for large datasets:

  • Sharding: Distribute data across multiple machines
  • Indexing: Create indices to speed up queries on large collections
  • Aggregation pipeline: Perform complex data transformations and analyses
  • GridFS: Store and retrieve large files efficiently
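
For MongoDB, a common pattern is to insert documents in batches with pymongo so the whole file never sits in memory. A minimal sketch, where the connection string, database, collection, and CSV layout are placeholders:

import csv
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')
collection = client['analytics']['events']

batch, batch_size = [], 10000
with open('large_file.csv', newline='') as f:
    for row in csv.DictReader(f):
        batch.append(row)
        if len(batch) >= batch_size:
            collection.insert_many(batch)  # write one batch at a time
            batch = []
    if batch:
        collection.insert_many(batch)  # flush the remaining documents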

8. Efficient File Formats for Big Data

When working with large datasets, choosing the right file format can significantly impact processing speed and storage efficiency.

Apache Parquet

Parquet is a columnar storage file format designed for efficient data processing and storage.

Key benefits:

  • Columnar format: Allows for efficient querying of specific columns
  • Compression: Reduces storage space and improves I/O
  • Schema evolution: Supports adding or removing columns over time
  • Predicate pushdown: Improves query performance by filtering data at the storage level

Example of reading a Parquet file with Python:

import pandas as pd

df = pd.read_parquet('large_dataset.parquet')
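
To actually benefit from the columnar layout, you can read just the columns you need, which greatly reduces I/O on wide tables. A minimal sketch with hypothetical column names:

import pandas as pd

# Only the requested columns are read from disk
df = pd.read_parquet('large_dataset.parquet', columns=['id', 'value'])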

Apache Avro

Avro is a row-based storage format that offers schema evolution and efficient serialization.

Key benefits:

  • Schema evolution: Allows for changes to the data schema over time
  • Language-agnostic: Supports many programming languages
  • Compact: Efficient binary serialization
  • Splittable: Supports parallel processing of data files

Example of writing data to Avro format with Python:

import json

from avro.datafile import DataFileWriter
from avro.io import DatumWriter
import avro.schema

# Define the record schema and parse it into an Avro schema object
schema_dict = {
    'name': 'LargeDataset',
    'type': 'record',
    'fields': [
        {'name': 'id', 'type': 'int'},
        {'name': 'name', 'type': 'string'},
        {'name': 'value', 'type': 'float'}
    ]
}
parsed_schema = avro.schema.parse(json.dumps(schema_dict))

# Write records to an Avro data file
writer = DataFileWriter(open('large_dataset.avro', 'wb'), DatumWriter(), parsed_schema)
writer.append({'id': 1, 'name': 'Item 1', 'value': 10.5})
writer.append({'id': 2, 'name': 'Item 2', 'value': 20.0})
writer.close()

9. Inspecting File Snippets in Case of Errors

When trying to import files, it’s common to get errors referencing a specific byte offset. You can use command-line tools like head and tail (or sed and awk) to inspect the exact problematic part of the file.

For instance, check the following BigQuery import error:

Error while reading data, error message: CSV table references column position 23, but line contains only 20 columns.; byte_offset_to_start_of_line: 16955292992 column_index: 23 column_name: "count_domain_sess..." column_type: STRING File: gs://incoming-data/raw_data/data_01.csv

The following bash command will show you the last 5 lines before the problematic area, which might help identify the issue:

gsutil cat gs://incoming-data/raw_data/data_01.csv | head -c 16955292992 | tail -n 5

10. Use Cloud VMs

When working with large files, don’t strain your PC and your internet connection: spin up a cloud VM (e.g. a GCP Compute Engine instance) and get a hardware setup tailored to your exact needs (disk space, RAM, CPU) that only exists for as long as you are working on the task.

You will also benefit from much higher bandwidth when uploading or downloading files. For instance, when downloading files from S3 on a Compute Engine instance using aws s3 cp, I am able to reach average speeds of 150 MB/sec!

Conclusion

Handling huge files is an essential skill for data engineers. By employing these tools and techniques, you can efficiently view, edit, and process files larger than 10GB without overwhelming your system resources. Remember to always consider the specific requirements of your project and the characteristics of your data when choosing the best approach.

Panagiotis

Written By

Panagiotis (pronounced Panayotis) is a passionate G(r)eek with experience in digital analytics projects and website implementation. Fan of clear and effective processes, automation of tasks, and problem-solving technical hacks. Hands-on experience with projects ranging from small businesses to enterprise-level companies, from the first conversations with the customer to turning business requirements into the final deliverable.