Announcing Phi-3 fine-tuning, new generative AI models, and other Azure AI updates to empower organizations to customize and scale AI applications

AI is transforming every industry and creating new opportunities for innovation and growth. But developing and deploying AI applications at scale requires a robust and flexible platform that can handle the complex and diverse needs of modern enterprises and allow them to create solutions grounded in their organizational data. That’s why we are excited to announce several updates to help developers quickly create customized AI solutions with greater choice and flexibility, leveraging the Azure AI toolchain:

Serverless fine-tuning for Phi-3-mini and Phi-3-medium models enables developers to quickly and easily customize the models for cloud and edge scenarios without having to arrange for compute.
Updates to Phi-3-mini, including significant improvements in core quality, instruction following, and structured output, enable developers to build with a more performant model at no additional cost.
Same-day shipping earlier this month of the latest models from OpenAI (GPT-4o mini), Meta (Llama 3.1 405B), and Mistral (Large 2) to Azure AI gives customers greater choice and flexibility.

Unlocking value through model innovation and customization

In April, we introduced the Phi-3 family of small, open models developed by Microsoft. Phi-3 models are our most capable and cost-effective small language models (SLMs) available, outperforming models of the same size and next size up. As developers look to tailor AI solutions to meet specific business needs and improve quality of responses, fine-tuning a small model is a great alternative without sacrificing performance. Starting today, developers can fine-tune Phi-3-mini and Phi-3-medium with their data to build AI experiences that are more relevant to their users, safely, and economically.

Given their small compute footprint and their cloud and edge compatibility, Phi-3 models are well suited for fine-tuning to improve base model performance across a variety of scenarios, including learning a new skill or task (e.g. tutoring) or enhancing the consistency and quality of responses (e.g. tone or style of responses in chat/Q&A). We’re already seeing adaptations of Phi-3 for new use cases.

Microsoft and Khan Academy are working together to help improve solutions for teachers and students across the globe. As part of the collaboration, Khan Academy uses Azure OpenAI Service to power Khanmigo for Teachers, a pilot AI-powered teaching assistant for educators across 44 countries, and is experimenting with Phi-3 to improve math tutoring. Khan Academy recently published a research paper highlighting how different AI models perform when evaluating mathematical accuracy in tutoring scenarios, including benchmarks from a fine-tuned version of Phi-3. Initial data shows that when a student makes a mathematical error, Phi-3 outperformed most other leading generative AI models at identifying and correcting student mistakes.

And we’ve fine-tuned Phi-3 for the device too. In June, we introduced Phi Silica to empower developers with a powerful, trustworthy model for building apps with safe, secure AI experiences. Phi Silica builds on the Phi family of models and is designed specifically for the NPUs in Copilot+ PCs. Microsoft Windows is the first platform to have a state-of-the-art small language model (SLM) custom built for the Neural Processing Unit (NPU) and shipping inbox.

You can try fine-tuning for Phi-3 models today in Azure AI.

I am also excited to share that our Models-as-a-Service (serverless endpoint) capability in Azure AI is now generally available. Additionally, Phi-3-small is now available via a serverless endpoint so developers can quickly and easily get started with AI development without having to manage underlying infrastructure. Phi-3-vision, the multi-modal model in the Phi-3 family, was announced at Microsoft Build and is available through the Azure AI model catalog. It will soon be available via a serverless endpoint as well. Phi-3-small (7B parameters) is available in two context lengths, 128K and 8K, while Phi-3-vision (4.2B parameters) has been optimized for chart and diagram understanding and can be used to generate insights and answer questions.

We are seeing a great response from the community on Phi-3. Last month we released an update for Phi-3-mini: the model was retrained, leading to substantial improvements in core quality and instruction following and adding support for structured output. We also improved multi-turn conversation quality, introduced support for <|system|> prompts, and significantly improved reasoning capability.
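To make the <|system|> prompt support concrete, here is a minimal sketch of how a chat history might be rendered into a single prompt string. The <|user|>, <|assistant|>, and <|end|> tags follow the layout described on the Phi-3-mini model card and are an assumption here, since this post only mentions <|system|>. The helper name build_phi3_prompt is hypothetical; in practice a tokenizer's chat template would normally do this work.

```python
from typing import Dict, List

# Assumption: turn terminator as described on the Phi-3-mini model card.
END = "<|end|>"


def build_phi3_prompt(messages: List[Dict[str, str]]) -> str:
    """Render chat messages into one Phi-3 style prompt string (illustrative)."""
    allowed = {"system", "user", "assistant"}
    parts = []
    for msg in messages:
        role = msg["role"]
        if role not in allowed:
            raise ValueError(f"unsupported role: {role}")
        # Each turn is: role tag, newline, content, end tag, newline.
        parts.append(f"<|{role}|>\n{msg['content']}{END}\n")
    # Leave the assistant tag open so the model generates the reply.
    parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = build_phi3_prompt([
    {"role": "system", "content": "Reply only with valid JSON."},
    {"role": "user", "content": "List two primary colors."},
])
print(prompt)
```

A system turn like the one above is one way to ask for structured output; whether the reply parses as valid JSON should still be checked by the caller.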

The table below highlights improvements across instruction following, structured output, and reasoning.

Benchmark	Phi-3-mini-4k (Apr ’24 release)	Phi-3-mini-4k (Jun ’24 update)	Phi-3-mini-128k (Apr ’24 release)	Phi-3-mini-128k (Jun ’24 update)
Instruction Extra Hard	5.7	6.0	5.7	5.9
Instruction Hard	4.9	5.1	5	5.2
JSON Structure Output	11.5	52.3	1.9	60.1
XML Structure Output	14.4	49.8	47.8	52.9
GPQA	23.7	30.6	25.9	29.7
MMLU	68.8	70.9	68.1	69.7
Average	21.7	35.8	25.7	37.6
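As a quick way to read the table, the sketch below computes the per-benchmark change between the April release and the June update. The scores are copied from the table as published; the Average row is left out so that nothing is recomputed. The names scores and gains are illustrative only.

```python
from typing import Dict, Tuple

# (Apr '24 release, Jun '24 update) pairs copied from the table above.
scores: Dict[str, Dict[str, Tuple[float, float]]] = {
    "Phi-3-mini-4k": {
        "Instruction Extra Hard": (5.7, 6.0),
        "Instruction Hard": (4.9, 5.1),
        "JSON Structure Output": (11.5, 52.3),
        "XML Structure Output": (14.4, 49.8),
        "GPQA": (23.7, 30.6),
        "MMLU": (68.8, 70.9),
    },
    "Phi-3-mini-128k": {
        "Instruction Extra Hard": (5.7, 5.9),
        "Instruction Hard": (5.0, 5.2),
        "JSON Structure Output": (1.9, 60.1),
        "XML Structure Output": (47.8, 52.9),
        "GPQA": (25.9, 29.7),
        "MMLU": (68.1, 69.7),
    },
}


def gains(model_scores: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Return the point change (update minus release) for each benchmark."""
    return {
        name: round(after - before, 1)
        for name, (before, after) in model_scores.items()
    }


for model, table in scores.items():
    deltas = gains(table)
    largest = max(deltas, key=deltas.get)
    print(f"{model}: largest gain is {largest} (+{deltas[largest]} points)")
```

For both context lengths the largest change is in JSON structured output, which matches the emphasis on structured output in the update.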

We continue to make improvements to Phi-3 safety too. A recent research paper highlighted Microsoft’s iterative “break-fix” approach to improving the safety of the Phi-3 models, which involved multiple rounds of testing and refinement, red teaming, and vulnerability identification. This method reduced harmful content by 75% and enhanced the models’ performance on responsible AI benchmarks.

Expanding model choice, now with over 1,600 models available in Azure AI

With Azure AI, we’re committed to bringing the most comprehensive selection of open and frontier models and state-of-the-art tooling to help meet customers’ unique cost, latency, and design needs. Last year we launched the Azure AI model catalog, where we now have the broadest selection of models, with over 1,600 models from providers including AI21, Cohere, Databricks, Hugging Face, Meta, Mistral, Microsoft Research, OpenAI, Snowflake, Stability AI, and others. This month we added OpenAI’s GPT-4o mini through Azure OpenAI Service, Meta Llama 3.1 405B, and Mistral Large 2.

Continuing the momentum, today we are excited to share that Cohere Rerank is now available on Azure. Accessing Cohere’s enterprise-ready language models on Azure AI’s robust infrastructure enables businesses to seamlessly, reliably, and safely incorporate cutting-edge semantic search technology into their applications. This integration allows users to leverage the flexibility and scalability of Azure, combined with Cohere’s highly performant and efficient language models, to deliver superior search results in production.

TD Bank Group, one of the largest banks in North America, recently signed an agreement with Cohere to explore its full suite of large language models (LLMs), including Cohere Rerank.

“At TD, we’ve seen the transformative potential of AI to deliver more personalized and intuitive experiences for our customers, colleagues and communities, we’re excited to be working alongside Cohere to explore how its language models perform on Microsoft Azure to help support our innovation journey at the Bank.”

Kirsti Racine, VP, AI Technology Lead, TD.

Atomicwork, a digital workplace experience platform and longtime Azure customer, has significantly enhanced its IT service management platform with Cohere Rerank. By integrating the model into their AI digital assistant, Atom AI, Atomicwork has improved search accuracy and relevance, providing faster, more precise answers to complex IT support queries. This integration has streamlined IT operations and boosted productivity across the enterprise.

“The driving force behind Atomicwork’s digital workplace experience solution is Cohere’s Rerank model and Azure AI Studio, which empowers Atom AI, our digital assistant, with the precision and performance required to deliver real-world results. This strategic collaboration underscores our commitment to providing businesses with advanced, secure, and reliable enterprise AI capabilities.”

Vijay Rayapati, CEO of Atomicwork

Command R+, Cohere’s flagship generative model which is also available on Azure AI, is purpose-built to work well with Cohere Rerank within a Retrieval Augmented Generation (RAG) system. Together they are capable of serving some of the most demanding enterprise workloads in production.
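The way a reranker sits between retrieval and generation in a RAG system can be sketched as a two-stage pipeline. This is a toy illustration under stated assumptions, not Cohere's API: the first stage is a simple keyword-overlap recall, and toy_rerank_score (Jaccard similarity) is a stand-in for a real rerank model. All function names and the sample documents are hypothetical.

```python
from typing import Callable, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase whitespace tokenizer, good enough for a toy example."""
    return set(text.lower().split())


def retrieve(query: str, docs: List[str], k: int) -> List[str]:
    """Stage 1: cheap recall by counting shared keywords."""
    q = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def toy_rerank_score(query: str, doc: str) -> float:
    """Stand-in for a rerank model: Jaccard similarity of token sets."""
    q, d = tokenize(query), tokenize(doc)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def rerank(query: str, candidates: List[str],
           score: Callable[[str, str], float], top_n: int) -> List[str]:
    """Stage 2: reorder the candidates with a more precise relevance score."""
    return sorted(candidates, key=lambda d: score(query, d), reverse=True)[:top_n]


def build_prompt(query: str, passages: List[str]) -> str:
    """Ground the generative model's prompt in the reranked passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


docs = [
    "vpn password reset requires approval from your manager and the security team before the portal unlocks",
    "the cafeteria menu changes every monday",
    "reset your vpn password from the self service portal",
    "laptop battery replacement takes two days",
]
query = "vpn password reset"

candidates = retrieve(query, docs, k=3)
top = rerank(query, candidates, toy_rerank_score, top_n=2)
prompt = build_prompt(query, top)
print(prompt)
```

In this sketch the two VPN documents tie in the first stage, and the rerank step moves the shorter, more focused one to the top before the prompt is built. A production system would replace both scoring functions with real retrieval and rerank models.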

Earlier this week, we announced that Meta Llama 3.1 405B, along with the latest fine-tuned Llama 3.1 models, including 8B and 70B, is now available via a serverless endpoint in Azure AI. Llama 3.1 405B can be used for advanced synthetic data generation and distillation, with 405B-Instruct serving as a teacher model and 8B-Instruct/70B-Instruct models acting as student models.
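The teacher and student roles in distillation can be illustrated with a small data-generation sketch. Everything here is an assumption made for illustration: the teacher is passed in as a plain callable (in practice it would wrap a call to a deployed 405B-Instruct endpoint), fake_teacher is a placeholder so the example runs offline, and the "messages" JSONL layout is a common chat fine-tuning format rather than a format confirmed by this post.

```python
import json
from typing import Callable, Dict, List


def make_distillation_records(
    prompts: List[str], teacher: Callable[[str], str]
) -> List[Dict]:
    """Ask the teacher to answer each prompt and keep the pairs as training data."""
    records = []
    for prompt in prompts:
        answer = teacher(prompt)
        records.append({
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": answer},
            ]
        })
    return records


def to_jsonl(records: List[Dict]) -> str:
    """Serialize one JSON object per line, ready to save as a training file."""
    return "\n".join(json.dumps(r) for r in records)


def fake_teacher(prompt: str) -> str:
    """Hypothetical placeholder for a real teacher-model call."""
    return f"[teacher answer to: {prompt}]"


records = make_distillation_records(
    ["Explain what a serverless endpoint is.", "Summarize RAG in one sentence."],
    fake_teacher,
)
jsonl = to_jsonl(records)
print(jsonl)
```

The resulting file would then be used to fine-tune the smaller student model, so that it learns to reproduce the teacher's answers on the target task.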

Mistral Large 2 is now available on Azure, making Azure the first leading cloud provider to offer this next-gen model. Mistral Large 2 outperforms previous versions in coding, reasoning, and agentic behavior, standing on par with other leading models. Additionally, Mistral Nemo, developed in collaboration with NVIDIA, brings a powerful 12B model that pushes the boundaries of language understanding and generation.

And last week, we brought GPT-4o mini to Azure AI alongside other updates to Azure OpenAI Service, enabling customers to expand their range of AI applications at lower cost and latency with improved safety and data deployment options. We will announce more capabilities for GPT-4o mini in the coming weeks. We are also happy to introduce a new feature to deploy chatbots built with Azure OpenAI Service into Microsoft Teams.

Enabling AI innovation safely and responsibly

Building AI solutions responsibly is at the core of AI development at Microsoft. We have a robust set of capabilities to help organizations measure, mitigate, and manage AI risks across the AI development lifecycle for traditional machine learning and generative AI applications. Azure AI evaluations enable developers to iteratively assess the quality and safety of models and applications using built-in and custom metrics to inform mitigations. Additional Azure AI Content Safety features—including prompt shields and protected material detection—are now “on by default” in Azure OpenAI Service. These capabilities can be leveraged as content filters with any foundation model included in our model catalog, including Phi-3, Llama, and Mistral. Developers can also integrate these capabilities into their application easily through a single API. Once in production, developers can monitor their application for quality and safety, adversarial prompt attacks, and data integrity, making timely interventions with the help of real-time alerts.

Azure AI uses HiddenLayer Model Scanner to scan third-party and open models for emerging threats, such as cybersecurity vulnerabilities, malware, and other signs of tampering, before onboarding them to the Azure AI model catalog. The resulting verifications from Model Scanner, provided within each model card, can give developer teams greater confidence as they select, fine-tune, and deploy open models for their application.

We continue to invest across the Azure AI stack to bring state-of-the-art innovation to our customers so you can build, deploy, and scale your AI solutions safely and confidently. We cannot wait to see what you build next.