Min Htet Myet Mattral

ML Engineer · Distributed Training · LLM Systems · Computer Vision

I build things that work at scale -- and try to understand why they work at all.

What I actually do

I work in the gap between ML research and production engineering -- where the math is clean and the cluster is not.

Day-to-day: cloud-scale ML infrastructure at a hyperscaler, distributed training infrastructure, LLM safety systems, and the occasional Triton kernel when PyTorch decides it's done for the day. Most of my production work lives in private repos -- this is where the side projects land.

Things I care about technically

Large-scale pre-training infrastructure -- MoE routing, fault-tolerant checkpointing, tensor/pipeline parallelism
LLM safety and observability -- keeping models honest at inference time
The hardware-software boundary: SIMD, CUDA, kernel-level optimization
Novel architectures worth deploying, not just benchmarking

Things I care about less technically

Code that impresses interviewers but breaks in week two
Benchmarks that only win on synthetic data
Documentation that describes the happy path and nothing else

Selected work

Most projects here are built to solve a real problem, not to fill a portfolio. I'd rather have three things that work than ten that look good.

| Project | What it is | Status |
| --- | --- | --- |
| Composed-MoE-Engine | Sparse MoE training runtime -- Triton Top-K routing, DP+EP+TP distributed, async sharded checkpointing, TorchElastic fault recovery | Active |
| GuardRail Studio | LLM firewall -- sub-10ms p99 inline guardrails, DistilRoBERTa + ONNX + Triton, continuous drift detection and LoRA retraining | Active |
| KANX | Production KAN library -- TF + PyTorch + ONNX, Docker/K8s ready, published to PyPI | Active · `pip install kanx` |
| RLHF-PPO-DPO | Modular RLHF framework -- PPO and DPO, reward model training, policy optimization | Active |
| SIMD Microkernels | C++ AVX2 kernels for ML primitives -- tiled GEMM, vectorized GeLU, Python bindings | Experimental |
| ML from scratch | NumPy-only implementations of supervised, unsupervised, RL, and Bayesian methods | Reference |
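
For readers who have not met Top-K routing before, here is a minimal NumPy sketch of the idea. It is not the Triton kernel from Composed-MoE-Engine; the function name `top_k_route` and the choice to softmax over only the selected logits are assumptions made for illustration (one common variant among several).

```python
import numpy as np

def top_k_route(logits, k):
    """Pick the k highest-scoring experts per token.

    logits: array of shape (tokens, experts) from a router layer.
    Returns (indices, weights), each of shape (tokens, k), with
    experts ordered from highest to lowest score and weights
    normalized to sum to 1 per token.
    """
    # argpartition puts the k largest entries in the last k slots (unordered).
    idx = np.argpartition(logits, -k, axis=-1)[:, -k:]
    top = np.take_along_axis(logits, idx, axis=-1)

    # Sort those k entries in descending order of score.
    order = np.argsort(-top, axis=-1)
    idx = np.take_along_axis(idx, order, axis=-1)
    top = np.take_along_axis(top, order, axis=-1)

    # Softmax over the selected logits only (assumed variant).
    top = top - top.max(axis=-1, keepdims=True)
    weights = np.exp(top)
    weights /= weights.sum(axis=-1, keepdims=True)
    return idx, weights

# Two tokens, four experts, route each token to its top 2 experts.
logits = np.array([[0.1, 2.0, -1.0, 1.0],
                   [3.0, 0.0, 0.5, 2.5]])
indices, weights = top_k_route(logits, k=2)
```

Each token's output would then be the weighted sum of its selected experts' outputs; load balancing and capacity limits are left out of this sketch.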

Stack

Not a comprehensive list. Just what I actually reach for.

Training & inference: PyTorch, TensorFlow, Triton, ONNX, TensorRT, FSDP2, TorchElastic

LLM ecosystem: Transformers, PEFT / LoRA, vLLM, LangChain, FastAPI, Triton Inference Server

Distributed & infra: NCCL, Kubernetes, Helm, Terraform, Airflow, Ray

Observability: Prometheus, Grafana, OpenTelemetry, Weights & Biases

Low-level: C++, AVX2 / SIMD, CUDA, pybind11

Data: PostgreSQL, Qdrant, MongoDB, Spark, Dask

A few honest notes

Most of my interesting work happens in private repositories -- production systems at cloud scale where open-sourcing isn't an option. This GitHub is a public window, not the full picture.

That said: the repos here are held to the same standard as the private ones -- CI, tests, type checking, real benchmarks. If something is experimental, the README says so. I'd rather write documentation that admits limitations than documentation that hides them.

I'm particularly interested in the fault-tolerance problems that only appear at real cluster scale, the latency-accuracy tradeoffs in LLM safety systems, and the open question of whether KAN-style architectures will find their niche or stay a curiosity.

Currently

Working on: fixing MoE engine chaos scenario A -- sudden node failure under expert resharding
Reading: the Megatron-LM codebase and the FlexAttention paper
Thinking about: whether MFU tracking gives you enough signal to catch silent training degradation early
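
To make the MFU question concrete, here is a rough sketch of how the number is usually estimated and how a naive drop check might look. It assumes the common approximation of about 6 FLOPs per parameter per token for a forward plus backward pass (attention FLOPs ignored); the function names, the 5% tolerance, and the example figures are all hypothetical.

```python
def model_flops_utilization(n_params, tokens_per_sec, peak_flops_per_sec, n_devices=1):
    """Estimate MFU as achieved FLOP/s divided by theoretical peak FLOP/s.

    Uses the ~6 * n_params FLOPs-per-token approximation for training,
    which is an assumption and ignores attention cost.
    """
    achieved = 6.0 * n_params * tokens_per_sec
    return achieved / (peak_flops_per_sec * n_devices)

def mfu_dropped(history, current, rel_tolerance=0.05):
    """Flag a drop when current MFU falls below the recent mean by more than rel_tolerance."""
    if not history:
        return False
    baseline = sum(history) / len(history)
    return current < baseline * (1.0 - rel_tolerance)

# Hypothetical run: 7B parameters, 8 devices, 312 TFLOP/s assumed peak per device.
mfu = model_flops_utilization(
    n_params=7e9,
    tokens_per_sec=25_000,
    peak_flops_per_sec=312e12,
    n_devices=8,
)

recent = [0.42, 0.41, 0.43]
alert = mfu_dropped(recent, current=0.38)
```

A check like this only watches throughput. A run whose loss quietly degrades while throughput stays flat would pass it, which is exactly why the question stays open.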

Problem-solving

Algorithms are how I warm up. Systems are where I live.

On the equation that changed everything

$$\mathbf{h}_t = \sigma\!\left(\mathbf{W}_h\,\mathbf{h}_{t-1} + \mathbf{W}_x\,\mathbf{x}_t + \mathbf{b}\right)$$

The idea that a machine could hold memory across time -- that the past could shape the present through nothing more than a weight matrix -- was the moment I understood why this field is worth a lifetime.

The equation is simple. What it implies is not.
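
For anyone who prefers code to symbols, the recurrence can be sketched in a few lines of NumPy. This assumes $\sigma$ is the logistic sigmoid (tanh is just as common) and starts from a zero hidden state; the helper names `rnn_step` and `run_rnn`, the dimensions, and the random weights are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One step of h_t = sigma(W_h h_{t-1} + W_x x_t + b)."""
    return sigmoid(W_h @ h_prev + W_x @ x_t + b)

def run_rnn(xs, W_h, W_x, b):
    """Unroll the recurrence over a sequence xs of shape (steps, input_dim)."""
    h = np.zeros(W_h.shape[0])  # assumed initial state h_0 = 0
    states = []
    for x_t in xs:
        # The only link to the past is h, carried through W_h.
        h = rnn_step(h, x_t, W_h, W_x, b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
hidden_dim, input_dim, steps = 4, 3, 5
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)
xs = rng.normal(size=(steps, input_dim))

states = run_rnn(xs, W_h, W_x, b)  # shape (steps, hidden_dim)
```

Every row of `states` depends on all earlier inputs, yet the loop never looks further back than the previous `h`.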

Outside of work I'm usually reading something I don't fully understand yet, listening to music that has no business being that good, and occasionally wondering if the model actually converged or if I just got lucky. I like working with people who say "I don't know" without embarrassment and argue about architecture in good faith.

mattralminn@gmail.com

Open to interesting conversations about distributed training, LLM infrastructure, or any hard ML systems problem worth losing sleep over.
