Uber AI Solutions is one of Uber's biggest bets, with the ambition to build one of the world's largest data foundries for AI applications and to evolve into a platform of choice for a variety of online tasks. The Moonshot AI team focuses on accelerating human-in-the-loop data annotation and collection through automation and on developing robust automated evaluation systems.
In this role, you will collaborate closely with research scientists, engineers, and cross-functional teams to deliver real-world impact through your research. You'll help grow Uber AI Solutions into a leader in the space.
What the candidate will do:
Drive research in areas such as LLM post-training (RLHF, GRPO, instruction tuning), data efficiency, and the design of benchmarks to evaluate LLM capabilities across safety, reasoning, and domain-specific performance.
Design and run experiments to validate hypotheses and iterate on research ideas.
Collaborate with research scientists and engineers to prototype and evaluate novel approaches.
Produce publication-ready research targeting top-tier AI/ML conferences.
Requirements:
Currently pursuing a Ph.D. in Computer Science, Machine Learning, Natural Language Processing, or a related field.
Published work at top-tier AI/ML conferences (e.g., NeurIPS, ICML, ICLR, ACL, EMNLP, CVPR, COLM).
Deep expertise in at least one of the following: LLM post-training (RLHF, instruction tuning), LLM evaluation, reasoning and agents, data efficiency, or alignment and safety.
Proficiency in Python and deep learning frameworks (e.g., PyTorch, JAX).
Hands-on experience training or fine-tuning large language models.
Preferred Qualifications:
First-author publications at top-tier AI/ML conferences.
Experience with distributed training frameworks (e.g., DeepSpeed, FSDP, Megatron).
Contributions to open-source LLM projects or frameworks.
Demonstrated ability to rapidly prototype and iterate on research ideas.
Current research interests:
Real-world LLM Benchmarking: Moving beyond standard metrics to create benchmarks that map model performance to real-world business impact and responsible usage.
Agentic Quality Evaluation: Developing agentic systems to automatically evaluate dataset quality and adherence to requirements.
Few-Shot Grounding: Using a small annotated subset (e.g., 10% of the data) to significantly boost ML assistance on the remaining 90%.
Human-in-the-Loop Optimization: Minimizing human intervention in annotation tasks by integrating robust automated checks and feedback loops.