Database Migration Options for DFS Price Analysis System

Executive Summary

The current CSV-based data processing system generates over 1.1 million rows (107MB) from a source dataset of ~4,800 price records. This explosive growth occurs because the system creates every possible combination of countries, transaction types, dates, and transaction amounts (1-1000 USD). The explosion factor is approximately 230x: each additional source row generates hundreds of output rows.
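A back-of-the-envelope sketch of this growth, using the observed 230x factor (the projected source-row counts are illustrative planning numbers, not measurements):

```python
# Observed explosion factor: ~1.1M exploded rows / 4,832 source rows ≈ 230.
observed_factor = 230

# Illustrative source-row projections (planning assumptions, not data).
source_rows = {
    "current": 4_832,
    "1_year": 10_000,
    "2_years": 20_000,
    "5_years": 50_000,
}

# Output size scales linearly with source rows at a fixed explosion factor.
exploded = {label: n * observed_factor for label, n in source_rows.items()}
print(exploded["current"])   # 1111360, i.e. ~1.1M exploded rows
```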

Critical Growth Projections:

  • Current: 4,832 source rows → 1.1M exploded rows (107MB)
  • 1 Year: ~10,000 source rows → 2.3M exploded rows (220MB)
  • 2 Years: ~20,000 source rows → 4.6M exploded rows (450MB)
  • 5 Years: ~50,000 source rows → 11.5M exploded rows (1.1GB)

The current CSV approach will become unsustainable as the dataset grows, creating performance bottlenecks, memory constraints, and Git repository management issues. Moving to a database solution is essential for long-term scalability.

Key Recommendations:

  • Primary: MotherDuck cloud database - cost-effective, scalable, minimal development effort
  • Cost-Effective Alternative: Apache Parquet files - 60-75% storage reduction, 10-1500x query performance boost, minimal costs
  • Self-Hosted: Local DuckDB with periodic cloud sync - lower ongoing costs, more control
  • Timeline: 12-16 hours for Parquet, 16-40 hours for databases
  • Monthly Cost: ~$0 for Parquet vs. $25-50 for MotherDuck vs. current CSV storage costs

Current System vs. Database Solutions

| Aspect | Current CSV System | Apache Parquet | MotherDuck Database | Local DuckDB |
|---|---|---|---|---|
| Storage | 107MB flat file | 25-40MB compressed | Compressed cloud storage | Local optimized files |
| Query Performance | Full file scan required | 10-1500x faster | Indexed, fast queries | Very fast local queries |
| Data Integrity | No validation | Schema enforcement | Schema enforcement | Schema enforcement |
| Concurrent Access | File locking issues | Read-only concurrent | Multi-user support | Single user |
| Versioning | Git-friendly | Git-friendly | Built-in versioning | Manual versioning |
| Scalability | BREAKS at ~500MB | Handles multi-GB datasets | Auto-scales to any size | Limited by local storage |
| Backup/Recovery | Manual Git process | Standard Git process | Automated | Manual process |
| Cost (Monthly) | ~$0 (Git LFS storage) | ~$0 | $25-50 | ~$0 |
| Setup Time | 0 (existing) | 12-16 hours | 16-24 hours | 20-32 hours |
| Maintenance | File management | Minimal | Minimal | File management |

Database Solution Options

Option 3: Local DuckDB with Cloud Sync

Description: Use local DuckDB files with periodic synchronization to cloud storage.

Benefits:

  • Lower ongoing costs
  • Very fast local queries
  • Full control over data
  • Can work offline

Drawbacks:

  • Growth Limitation: storage constrained by local hardware capacity

Code Changes Required:

  • Similar to MotherDuck but with local file connections
  • Add cloud sync mechanism (S3, Google Cloud, etc.)
  • Implement conflict resolution for concurrent updates
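The sync and conflict-resolution items can be prototyped with the standard library before wiring in a cloud SDK. The helper below is a minimal sketch with hypothetical names; a plain file copy stands in for an S3/GCS upload, and the conflict check is a crude last-writer-wins timestamp comparison:

```python
import os
import shutil
import tempfile

def sync_to_backup(local_path: str, backup_path: str) -> str:
    """Copy the local DuckDB file to a backup location unless the
    backup is newer (crude last-writer-wins conflict detection).
    A real deployment would upload to S3/GCS here instead."""
    if os.path.exists(backup_path) and \
            os.path.getmtime(backup_path) > os.path.getmtime(local_path):
        return "conflict: backup is newer, manual resolution needed"
    shutil.copy2(local_path, backup_path)
    return "synced"

# Demo with throwaway files standing in for the .duckdb database:
with tempfile.TemporaryDirectory() as d:
    local = os.path.join(d, "fees.duckdb")
    backup = os.path.join(d, "fees.duckdb.bak")
    with open(local, "wb") as f:
        f.write(b"demo")
    print(sync_to_backup(local, backup))   # synced
```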

Cost Breakdown:

  • Storage: Local disk space (minimal cost)
  • Cloud Sync: ~$2-5/month for backup storage
  • Total: ~$5/month

Implementation Time: 20-32 hours

Option 4: PostgreSQL on Cloud Provider

Description: Traditional relational database hosted on AWS RDS, Google Cloud SQL, or similar.

Benefits:

  • Industry standard solution
  • Extensive tooling and support
  • Strong consistency guarantees

Drawbacks:

  • Higher costs
  • More complex setup and maintenance
  • Overkill for analytics workload

Cost: $50-100/month minimum

Implementation Time: 24-40 hours

Implementation Requirements

Technical Changes Needed

1. Database Schema Design

CREATE TABLE transaction_fees (
    id BIGINT GENERATED ALWAYS AS IDENTITY,
    country VARCHAR(50) NOT NULL,
    transaction_type VARCHAR(100) NOT NULL,
    date_collection DATE NOT NULL,
    usd_transaction_amount INTEGER NOT NULL,
    provider VARCHAR(100) NOT NULL,
    tax_type VARCHAR(50),
    tax DOUBLE,
    tax_pct DOUBLE,
    notes TEXT,
    fee_technical_issue BOOLEAN NOT NULL DEFAULT FALSE,
    usd_fee DOUBLE NOT NULL,
    fee_pct_of_transaction DOUBLE NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id)
);

CREATE INDEX idx_country_type_date ON transaction_fees(country, transaction_type, date_collection);
CREATE INDEX idx_amount_range ON transaction_fees(usd_transaction_amount);
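Before provisioning anything, the DDL can be smoke-tested in an embedded database. The sketch below uses Python's built-in sqlite3 as a stand-in (schema trimmed, and the identity column adapted because SQLite's auto-increment syntax differs from DuckDB's GENERATED ... AS IDENTITY):

```python
import sqlite3

# sqlite3 stands in for DuckDB so the DDL can be validated without
# extra dependencies; column types rely on SQLite's loose type affinity.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transaction_fees (
    id INTEGER PRIMARY KEY,
    country VARCHAR(50) NOT NULL,
    transaction_type VARCHAR(100) NOT NULL,
    date_collection DATE NOT NULL,
    usd_transaction_amount INTEGER NOT NULL,
    provider VARCHAR(100) NOT NULL,
    usd_fee DOUBLE NOT NULL,
    fee_pct_of_transaction DOUBLE NOT NULL
);
CREATE INDEX idx_country_type_date
    ON transaction_fees(country, transaction_type, date_collection);
""")
conn.execute(
    "INSERT INTO transaction_fees (country, transaction_type, date_collection,"
    " usd_transaction_amount, provider, usd_fee, fee_pct_of_transaction)"
    " VALUES ('nigeria', 'send', '2024-01-01', 100, 'provider_a', 1.5, 1.5)"
)
print(conn.execute("SELECT COUNT(*) FROM transaction_fees").fetchone()[0])  # 1
```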

2. Python Code Modifications

  • Install the DuckDB client (MotherDuck connects through it via an md: database path): pip install duckdb
  • Replace pandas CSV operations with database connections
  • Modify the explosion logic to use SQL INSERT statements
  • Add connection configuration and error handling
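For the connection configuration and error handling item, a small retry wrapper keeps transient network failures out of the pipeline logic. `connect_with_retry` below is a hypothetical helper, generic over any connect callable (e.g. `lambda: duckdb.connect('md:your_database')`):

```python
import time

def connect_with_retry(connect, retries=3, delay=0.1):
    """Call connect() up to `retries` times with exponential backoff;
    re-raise the last error if every attempt fails."""
    last_err = None
    for attempt in range(retries):
        try:
            return connect()
        except Exception as err:          # narrow this in real code
            last_err = err
            time.sleep(delay * (2 ** attempt))
    raise last_err

# Demo with a fake connector that fails twice, then succeeds:
attempts = {"n": 0}
def flaky_connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "connection"

print(connect_with_retry(flaky_connect))   # connection
```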

3. Data Pipeline Updates

  • Batch insert operations instead of row-by-row processing
  • Add data validation before insertion
  • Implement upsert logic for data updates
  • Add progress monitoring for large inserts
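The batch-insert and progress-monitoring items combine naturally. A minimal sketch follows; sqlite3 and executemany stand in for the target database's bulk-load path, and the table name and batch size are illustrative:

```python
import sqlite3

def batch_insert(conn, rows, batch_size=10_000):
    """Insert rows in fixed-size batches instead of one at a time,
    reporting progress after each batch."""
    sql = "INSERT INTO fees (country, usd_fee) VALUES (?, ?)"
    for start in range(0, len(rows), batch_size):
        conn.executemany(sql, rows[start:start + batch_size])
        conn.commit()
        print(f"inserted {min(start + batch_size, len(rows))}/{len(rows)}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fees (country TEXT, usd_fee REAL)")
rows = [("nigeria", i * 0.01) for i in range(25_000)]
batch_insert(conn, rows)
```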

Resource Requirements

  • Developer Time: 1 experienced Python developer (see individual option time estimates)
  • Testing: Comprehensive testing of data pipeline (included in time estimates)
  • Documentation: Updated process documentation (2-4 hours)
  • Training: 1-2 hours for team members to learn new query interface

Cost Analysis

Scaling Cost Projections

Current Dataset (~107MB):

| Solution | Setup Cost | Developer Time (Hours) | Monthly Cost | Annual Cost |
|---|---|---|---|---|
| Current CSV | $0 | 0 | $0 | $0 |
| Apache Parquet | $0 | 12-16 | $0 | $0 |
| MotherDuck | $0 | 16-24 | $35 | $420 |
| Local DuckDB | $0 | 20-32 | $5 | $60 |
| PostgreSQL | $50-100* | 24-40 | $75 | $950-1,000 |

5-Year Projection (~1.1GB dataset):

| Solution | Storage Impact | Monthly Cost | Annual Cost | Scalability Issues |
|---|---|---|---|---|
| Current CSV | 1.1GB files | $0** | $0** | CRITICAL: Memory limits, Git LFS costs, performance collapse |
| Apache Parquet | ~300MB compressed | $0 | $0 | Some performance degradation on large queries |
| MotherDuck | ~100MB cloud storage | $40-50 | $480-600 | Scales automatically |
| Local DuckDB | ~300MB local | $5-10 | $60-120 | Hardware storage requirements |
| PostgreSQL | ~500MB cloud | $100-150 | $1,200-1,800 | Requires instance upgrades |

*PostgreSQL setup cost includes initial database provisioning and configuration fees
**CSV approach becomes impractical - hidden costs include developer productivity loss, system failures, Git repository bloat

Long-term Benefits (Quantified)

  • Query Performance: 10-100x faster queries (estimated 5-10 hours/month time savings)
  • Data Reliability: Eliminate file corruption risks (~$500/incident avoided)
  • Collaboration: Multiple users can access data simultaneously
  • Automation: Easier integration with scheduling tools and dashboards

Critical CSV Risks (Quantified Growth Impact)

  • Performance Collapse: CSV queries will take 10+ minutes on 500MB+ files (vs. seconds with database)
  • Memory Failures: Pandas will fail to load 1GB+ CSV files on standard machines
  • Git Repository Bloat: LFS storage costs grow to $100+/month for large datasets
  • Developer Productivity Loss: Estimated 20+ hours/month wasted on file management issues
  • System Failures: High risk of data corruption during large file operations

Recommendations

Primary Recommendation: MotherDuck (Future-Proof Solution)

Given the explosive growth trajectory of your dataset, MotherDuck is the most strategic choice. While Parquet offers immediate cost savings, MotherDuck provides the scalability essential for long-term success:

  1. Unlimited Scalability: Auto-scales seamlessly from 1GB to 100GB+ datasets
  2. Future-Proof Architecture: Designed for analytical workloads that grow exponentially
  3. Team Collaboration: Multi-user access becomes critical as dataset and team grow
  4. Performance Consistency: Query speeds remain fast regardless of dataset size
  5. Professional Features: SQL interface, web dashboard, and advanced analytics

Alternative: Apache Parquet (Short-Term Solution)

Parquet provides an excellent interim solution if immediate database migration isn’t feasible:

  1. Immediate Relief: Handles current dataset growth (2-5x larger) effectively
  2. Zero Infrastructure Cost: No monthly fees during budget planning period
  3. Migration Bridge: Can easily export Parquet data to database later
  4. Performance Gains: 10-1500x faster than CSV with minimal code changes
  5. Buys Time: Provides 2-3 years of scalability while planning database migration

Implementation Roadmap

Option 1: Apache Parquet Implementation

Total Time: 12-16 hours
  • Hours 1-4: Install pyarrow, update CSV read/write operations to Parquet
  • Hours 5-8: Test data pipeline and validate output integrity
  • Hours 9-12: Implement optional partitioning by country/date
  • Hours 13-14: Performance testing and optimization
  • Hours 15-16: Team training and documentation

Option 2: MotherDuck Implementation

Total Time: 16-24 hours
  • Hours 1-8: Set up MotherDuck account and test connection
  • Hours 9-12: Migrate schema and initial data load
  • Hours 13-16: Update Python pipeline code
  • Hours 17-20: Testing and validation
  • Hours 21-22: Deploy to production environment
  • Hours 23-24: Team training and documentation

Success Metrics

  • Query response time < 5 seconds for typical analyses
  • Data pipeline completes in < 30 minutes
  • Zero data corruption incidents
  • Team adoption rate > 80% within first month

Technical Appendix

Database Schema Details

The proposed schema normalizes the current CSV structure while maintaining query performance:

Primary Table: transaction_fees

  • Stores all exploded transaction fee combinations
  • Indexed on frequently queried columns
  • Includes audit fields (created_at, id)

Potential Optimization: Dimension Tables

If data grows significantly, consider normalizing:

  • countries table
  • transaction_types table
  • providers table

Code Migration Examples

Apache Parquet Migration

Current CSV approach:

# Current: Read entire CSV into memory
df = pd.read_csv("../data/lfs/transaction_fees_exploded.csv")
result = df[df['country'] == 'nigeria'].groupby('transaction_type')['usd_fee'].mean()

# Current: Save results
df.to_csv("../data/lfs/transaction_fees_exploded.csv", index=False)

Parquet approach:

# New: Read compressed Parquet (60-75% smaller, 10-1500x faster)
df = pd.read_parquet("../data/lfs/transaction_fees_exploded.parquet")
result = df[df['country'] == 'nigeria'].groupby('transaction_type')['usd_fee'].mean()

# New: Save with compression
df.to_parquet("../data/lfs/transaction_fees_exploded.parquet", compression='snappy')

# Optional: Partitioned storage for even better performance
df.to_parquet("../data/lfs/transaction_fees/", partition_cols=['country', 'date_collection'])

MotherDuck Database Migration

Database approach:

import duckdb

# Connect to MotherDuck
conn = duckdb.connect('md:your_database')

# Query directly with SQL
result = conn.execute("""
    SELECT transaction_type, AVG(usd_fee) as avg_fee 
    FROM transaction_fees 
    WHERE country = 'nigeria' 
    GROUP BY transaction_type
""").fetchall()

Data Migration Strategies

Parquet Migration Strategy

  1. Install Dependencies: pip install pyarrow fastparquet
  2. Convert Existing Data: One-time conversion from CSV to Parquet
  3. Update Code: Replace read_csv()/to_csv() with read_parquet()/to_parquet()
  4. Test Performance: Validate query speed improvements
  5. Optional Partitioning: Implement country/date partitioning for additional speed

Database Migration Strategy

  1. Initial Load: Batch insert current CSV data
  2. Validation: Compare aggregate statistics between CSV and database
  3. Incremental Updates: Modify pipeline to append new data
  4. Cutover: Switch applications to use database queries
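Step 2 (Validation) can be automated by comparing aggregates from both sides before cutover. A minimal sketch, with sqlite3 standing in for the target database and a tiny inline CSV standing in for the source file:

```python
import csv
import io
import sqlite3

# Inline CSV stands in for the real source file.
csv_data = io.StringIO("country,usd_fee\nnigeria,1.5\nnigeria,2.5\nkenya,3.0\n")
rows = [(r["country"], float(r["usd_fee"])) for r in csv.DictReader(csv_data)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transaction_fees (country TEXT, usd_fee REAL)")
conn.executemany("INSERT INTO transaction_fees VALUES (?, ?)", rows)

# Compare row count and fee sum between the CSV and the database.
csv_count, csv_sum = len(rows), sum(fee for _, fee in rows)
db_count, db_sum = conn.execute(
    "SELECT COUNT(*), SUM(usd_fee) FROM transaction_fees").fetchone()
assert (csv_count, csv_sum) == (db_count, db_sum), "migration mismatch"
print("validation passed:", db_count, "rows")
```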

Performance Considerations

Parquet Optimizations

  • Compression: Use Snappy compression for balance of speed and size
  • Partitioning: Partition by frequently filtered columns (country, date_collection)
  • Column Types: Ensure proper data types for optimal compression
  • Row Group Size: Default 128MB row groups work well for most analytics
  • Predicate Pushdown: Pass columns=/filters= to read_parquet (pyarrow engine) so only the needed columns and row groups are read from disk

Database Optimizations

  • Batch Size: Insert 10,000 rows at a time for optimal performance
  • Indexing Strategy: Create indexes after data load to improve insert speed
  • Query Optimization: Use column pruning and predicate pushdown
  • Compression: MotherDuck automatically optimizes storage compression

Security and Compliance

  • Access Control: Role-based permissions for different user types
  • Data Encryption: At-rest and in-transit encryption included
  • Audit Logging: Track all data access and modifications
  • Backup Strategy: Automated daily backups with point-in-time recovery