Leo Torres

Senior Data Engineer

LLMs & AI • Python • BigQuery • Leipzig/Berlin

About

I'm a Lead Data Scientist at FGS Global, building LLM-powered data pipelines that process 10M+ documents daily for Fortune 500 clients. Based in Leipzig, Germany (commuting to Berlin), I specialize in large-scale AI systems, distributed data processing, and production data infrastructure.

I architect data infrastructure that serves billion-row datasets with sub-second query performance, using BigQuery, Python, and GCP. As technical lead for a team of 6 engineers on our flagship internal product, I design the RAG and vector-search layer behind its domain-specific retrieval.

My approach combines rigorous computer science fundamentals with practical engineering solutions. I deliver measurable impact at scale through robust pipelines and APIs that serve real production workloads.

Technical Expertise

Core Competencies

Backend systems, data engineering, and ML infrastructure. Experienced in taking projects from prototype to production.

Python • JavaScript/TypeScript • Distributed Systems • Machine Learning • Data Pipelines • API Design • Cloud Architecture • Performance Optimization

Featured Projects

FGS Global — Flagship Intelligence Product

Lead Data Scientist and technical lead for a team of 6 engineers. Designed and built the news media analysis pipeline and own it end to end: an LLM-powered system that processes 10M+ documents/day into client-facing intelligence for Fortune 500 firms. Architected retrieval infrastructure over billion-row datasets in BigQuery with sub-second query performance.

Design: the RAG and vector-search layer makes domain-specific retrieval accurate enough for production client use; the FastAPI service layer serves multiple internal products from a shared data backend.

Python • FastAPI • BigQuery • GCP • LLMs • RAG • Vector Databases

Data Berlin Meetup Talk

hns — Headless Multi-Agent Orchestration (private)

Headless dispatcher that runs Claude Code workers in parallel through a state machine: PR triage, CI gating, stuck-detection with escalation, orphan recovery, autonomous worktree dispatch.

Design: Python; abstractions and control loop designed from first principles.

Python • State Machines • Concurrency • Git Worktrees

The Aris Program

Founder of a multi-project open-source organization building web-native infrastructure for academic publishing. Designed RSM (Readable Science Markup), a markup language for scientific documents, with a tree-sitter grammar.

Built Scroll Press: a preprint server (FastAPI, PostgreSQL, HTMX) accepting Typst, Quarto, MyST, and Jupyter; live in beta. Architecting a collaborative editor (Vue) as the reference RSM implementation.

Python • FastAPI • PostgreSQL • HTMX • Vue.js • tree-sitter

Website • GitHub

XGI: Complex Group Interactions

Co-Lead Developer of a Python library for analyzing higher-order networks and hypergraphs. A NumFOCUS-affiliated project with a growing academic user base. Implemented core algorithms, designed the API, and established a comprehensive testing framework.

Technical leadership: OOP design, CI/CD with GitHub Actions, performance optimization with NumPy and Numba.

Python • NumPy • pandas • Numba • pytest • GitHub Actions • OOP

Documentation • GitHub

Manim Community

Organization Owner and Core Developer of the community-maintained version of 3Blue1Brown's mathematical animation engine. Contributed to the open-source Python library, which creates precise, programmatic mathematical visualizations and educational content.

Recognition: Featured in GitHub's Popular Python Repositories. Contributions: Algorithm implementations, performance optimizations, documentation improvements, and community support for mathematical animation workflows.

Python • Mathematical Visualization • OpenGL • Cairo • Community Development

Website • GitHub

COVID-19 Mobility Data Pipeline

Engineered a data pipeline processing mobility data for 300+ US cities during the COVID-19 pandemic. Built ETL workflows using Apache Airflow, implemented data quality checks, and optimized geospatial queries with PostGIS.

Results: Enabled epidemiologists to analyze movement patterns in near real-time, contributing to public health policy decisions.

Python • Airflow • pandas • PostGIS

Industry Experience

Lead Data Scientist → Tech Lead, Data Platform

May 2023 - Present

FGS Global

• Designed and built the news media analysis pipeline at FGS and own it end to end: an LLM-powered system that processes 10M+ documents/day into client-facing intelligence for Fortune 500 firms

• Architected retrieval infrastructure over billion-row datasets in BigQuery with sub-second query performance

• Designed the RAG and vector-search layer that made domain-specific retrieval accurate enough for production client use

• Technical lead for a team of 6 engineers: set architecture, owned technical direction, drove hiring decisions

• Built the FastAPI service layer serving multiple internal products from a shared data backend

Research Intern

May 2019 - Jul 2019

Yahoo! Research

• Built graph representation learning models on Tumblr social network data

• Processed terabyte-scale datasets using PySpark and distributed computing

• Developed Python pipelines for large-scale network analysis

Research Programmer

2012 - 2014

Wolfram Research South America

• Developed data pipelines for the Wolfram|Alpha knowledge engine

• Owned specific data domains end-to-end, including ingestion and quality

• Worked in a remote, globally distributed team

Open-Source Maintainership

hns — Headless Multi-Agent Orchestration (private)

2025 - Present

Headless dispatcher for parallel Claude Code workers

• State machine for PR triage, CI gating, stuck-detection with escalation, orphan recovery, autonomous worktree dispatch

• Python; abstractions and control loop designed from first principles

Co-Lead Developer — XGI

Aug 2021 - Present

NumFOCUS-affiliated Python library for higher-order networks

• Designed public API, core algorithms, and CI/CD; performance work with NumPy and Numba

• Library adopted by researchers across academia and industry

Organization Owner & Core Developer — Manim Community

May 2020 - May 2021

Community-maintained fork of 3Blue1Brown's mathematical animation engine

• Featured in GitHub's Popular Python Repositories; grew project from fork to active community

• Algorithm implementations, performance work, release management, contributor onboarding

Co-Lead Developer — netrd

Jan 2019 - Jul 2019

Library for network reconstruction and comparison (JOSS-published)

• Implemented 40+ algorithms; co-led ~6 core developers; set coding standards

Reviewer — Journal of Open Source Software

Jul 2020 - Present

Peer review for scientific software submissions

Research Engineering & Academia

Postdoctoral Fellow — Mathematics

Aug 2021 - May 2023

Max Planck Institute for Mathematics in the Sciences

• Spectral graph theory research applied to complex networks

• Implemented high-performance graph mining tools in Python alongside published research

PhD, Network Science

2016 - 2021

Network Science Institute, Northeastern University

• Dissertation: Spectral Aspects of Mining Complex Networks

• Developed open-source Python libraries used by the research community

Technical Skills

Languages & Frameworks

Python: NumPy, pandas, SciPy, PyTorch, FastAPI, Django, Celery
JavaScript: Node.js, React, TypeScript, D3.js, Express
Systems: C++, Rust (learning), Go (basic)
Other: SQL, GraphQL, Shell scripting, LaTeX

Infrastructure & Tools

Cloud: AWS (EC2, S3, Lambda, SageMaker), GCP, Azure
Databases: PostgreSQL, MongoDB, Redis, Neo4j, TimescaleDB
DevOps: Docker, Kubernetes, Terraform, GitHub Actions, CircleCI
Monitoring: Prometheus, Grafana, ELK Stack, Datadog

Methodologies & Practices

Architecture: Microservices, Event-driven, REST/GraphQL APIs
Development: TDD, CI/CD, Code review, Pair programming
Data: ETL pipelines, Stream processing, Data modeling
ML Ops: Model versioning, A/B testing, Feature stores

Soft Skills

Technical leadership and mentoring
Cross-functional collaboration
Technical documentation and knowledge sharing
Remote team coordination (5+ years)