mattral/README.md

Mattral

ML Engineer · Distributed Training · LLM Systems · Computer Vision

I build things that work at scale -- and try to understand why they work at all.




What I actually do

I work in the gap between ML research and production engineering -- where the math is clean and the cluster is not.

Day-to-day: cloud-scale ML infrastructure at a hyperscaler, distributed training infrastructure, LLM safety systems, and the occasional Triton kernel when PyTorch decides it's done for the day. Most of my production work lives in private repos -- this is where the side projects land.

Things I care about technically

  • Large-scale pre-training infrastructure -- MoE routing, fault-tolerant checkpointing, tensor/pipeline parallelism
  • LLM safety and observability -- keeping models honest at inference time
  • The hardware-software boundary: SIMD, CUDA, kernel-level optimization
  • Novel architectures worth deploying, not just benchmarking

Things I care about less technically

  • Code that impresses interviewers but breaks on week two
  • Benchmarks that only win on synthetic data
  • Documentation that describes the happy path and nothing else

Selected work

Most projects here are built to solve a real problem, not to fill a portfolio. I'd rather have three things that work than ten that look good.

| Project | What it is | Status |
| --- | --- | --- |
| Composed-MoE-Engine | Sparse MoE training runtime -- Triton Top-K routing, DP+EP+TP distributed, async sharded checkpointing, TorchElastic fault recovery | Active |
| GuardRail Studio | LLM firewall -- sub-10ms p99 inline guardrails, DistilRoBERTa + ONNX + Triton, continuous drift detection and LoRA retraining | Active |
| KANX | Production KAN library -- TF + PyTorch + ONNX, Docker/K8s ready, published to PyPI | Active · `pip install kanx` |
| RLHF-PPO-DPO | Modular RLHF framework -- PPO and DPO, reward model training, policy optimization | Active |
| SIMD Microkernels | C++ AVX2 kernels for ML primitives -- tiled GEMM, vectorized GeLU, Python bindings | Experimental |
| ML from scratch | NumPy-only implementations of supervised, unsupervised, RL, and Bayesian methods | Reference |

Stack

Not a comprehensive list. Just what I actually reach for.

Training & inference: PyTorch, TensorFlow, Triton, ONNX, TensorRT, FSDP2, TorchElastic

LLM ecosystem: Transformers, PEFT / LoRA, vLLM, LangChain, FastAPI, Triton Inference Server

Distributed & infra: NCCL, Kubernetes, Helm, Terraform, Airflow, Ray

Observability: Prometheus, Grafana, OpenTelemetry, Weights & Biases

Low-level: C++, AVX2 / SIMD, CUDA, pybind11

Data: PostgreSQL, Qdrant, MongoDB, Spark, Dask


A few honest notes

Most of my interesting work happens in private repositories -- production systems at cloud scale where open-sourcing isn't an option. This GitHub is a public window, not the full picture.

That said: the repos here are held to the same standard as the private ones -- CI, tests, type checking, real benchmarks. If something is experimental, the README says so. I'd rather write documentation that admits its limitations than documentation that hides them.

I'm particularly interested in the fault-tolerance problems that only appear at real cluster scale, the latency-accuracy tradeoffs in LLM safety systems, and the open question of whether KAN-style architectures will find their niche or stay a curiosity.


Currently

  • Working on: fixing MoE engine chaos scenario A -- sudden node failure under expert resharding
  • Reading: the Megatron-LM codebase and the FlexAttention paper
  • Thinking about: whether MFU tracking gives you enough signal to catch silent training degradation early

Problem-solving

Algorithms are how I warm up. Systems are where I live.


On the equation that changed everything

$$\mathbf{h}_t = \sigma\!\left(\mathbf{W}_h\,\mathbf{h}_{t-1} + \mathbf{W}_x\,\mathbf{x}_t + \mathbf{b}\right)$$

The idea that a machine could hold memory across time -- that the past could shape the present through nothing more than a weight matrix -- was the moment I understood why this field is worth a lifetime.

The equation is simple. What it implies is not.
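
For anyone who wants to poke at it, here is a minimal sketch of that recurrence in plain Python. It assumes tanh for the nonlinearity (the equation only says sigma), and the function names, the tiny 2-unit hidden state, and the weights are all made up for illustration, not taken from any repo here.

```python
import math


def rnn_step(W_h, W_x, b, h_prev, x_t, sigma=math.tanh):
    """One step of h_t = sigma(W_h h_{t-1} + W_x x_t + b).

    W_h is hidden x hidden, W_x is hidden x input, b has length hidden.
    sigma defaults to tanh, which is an assumption for this sketch.
    """
    h_t = []
    for i in range(len(b)):
        pre = b[i]
        pre += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        pre += sum(W_x[i][k] * x_t[k] for k in range(len(x_t)))
        h_t.append(sigma(pre))
    return h_t


def run_rnn(W_h, W_x, b, xs, h0=None):
    """Apply rnn_step over a sequence and return every hidden state."""
    h = h0 if h0 is not None else [0.0] * len(b)
    states = []
    for x_t in xs:
        h = rnn_step(W_h, W_x, b, h, x_t)
        states.append(h)
    return states


# Hypothetical toy weights: 2 hidden units, 1 input feature.
W_h = [[0.5, 0.0], [0.0, 0.5]]
W_x = [[1.0], [-1.0]]
b = [0.0, 0.0]

if __name__ == "__main__":
    # A single pulse at t=0 followed by silence.
    for t, h in enumerate(run_rnn(W_h, W_x, b, [[1.0], [0.0], [0.0]])):
        print(t, h)
```

Feeding one pulse and then zeros, the hidden state shrinks step by step but stays nonzero: the only thing carrying the first input forward is the product with W_h, which is the whole point of the equation.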


Outside of work I'm usually reading something I don't fully understand yet, listening to music that has no business being that good, and occasionally wondering if the model actually converged or if I just got lucky. I like working with people who say "I don't know" without embarrassment and argue about architecture in good faith.

mattralminn@gmail.com


Open to interesting conversations about distributed training, LLM infrastructure, or any hard ML systems problem worth losing sleep over.

Pinned

  1. KANX (Public)

    One library, four surfaces. Production-grade Kolmogorov-Arnold Networks || TensorFlow + PyTorch + ONNX. || A small KAN beats a 10× larger MLP on smooth, separable target. One library. Two backends.…

    Python 25 8

  2. Composed-Mixture-of-Experts-Engine (Public)

    Production-grade sparse MoE training runtime. Designed to keep large-scale pre-training jobs alive end-to-end: sparse Top-K routing in custom Triton, DP+EP distributed training with TP support in c…

    Python 9 8

  3. Improving-LLM-Models-with-RLHF-PPO-DPO (Public)

    A modular, production-grade framework for Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO).

    Python 20 4

  4. RAG-Multimodal-Financial-Document-Analysis-and-Recall (Public)

    Enterprise-grade multimodal Retrieval-Augmented Generation (RAG) system for financial document analysis with async processing, fault tolerance, structured observability, and scalable architecture.

    Python 59 13

  5. ML-AI-Algorithms-from-scratch (Public)

    A structured, educational repository of from-scratch ML/AI/RL/Bayesian algorithms. This project is evolving from a collection of standalone scripts into a clean, pip-installable Python package unde…

    Python 35 5

  6. GuardRail-Studio (Public)

    Ultra-Low-Latency, High-Throughput LLM Firewall & Observability Platform Designed to defend planet-scale LLM systems against prompt injection, PII leakage, data poisoning, and model drift — with a …

    Python 9 7