DeepSeek-V4 Pro now available on Together AI


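TL;DR · AI summary

Together AI now serves DeepSeek-V4 Pro, a 1.6T-parameter MoE reasoning model, with a 512K-token context window on Serverless Inference.

Key points

  • DeepSeek-V4 Pro uses a 1.6T-parameter Mixture-of-Experts architecture with 49B activated parameters.
  • Three reasoning modes (Non-Think, Think High, Think Max) trade response speed against reasoning depth.
  • Serverless pricing is $2.10 per 1M input tokens, $0.20 per 1M cached input tokens, and $4.40 per 1M output tokens.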



Published 4/29/2026

DeepSeek-V4 Pro now available on Together AI

1.6T-parameter MoE reasoning model with 512K context on Together AI, controllable reasoning modes, and cached-input pricing for long-context workloads.

  • Authors Sonny Khan

What's New

  • **DeepSeek V4 Pro on Together AI:** DeepSeek V4 Pro is now available on Together AI with a 512K-token context window for long-context reasoning workloads.
  • **Large-scale MoE architecture:** DeepSeek V4 Pro uses a 1.6T-parameter Mixture-of-Experts architecture with 49B activated parameters.
  • **Controllable reasoning modes:** Non-Think, Think High, and Think Max let teams choose between fast responses, deeper reasoning, and maximum reasoning effort.
  • **Transparent serverless pricing:** DeepSeek V4 Pro is available at $2.10 per 1M input tokens, $0.20 per 1M cached input tokens, and $4.40 per 1M output tokens.

Long-context reasoning changes what teams can ask a model to do. Entire repositories, large document sets, long agent traces, and tool outputs can fit into the model’s working context instead of being compressed into brittle summaries. But the models that can use that much context are also the hardest to serve: a 1.6T-parameter MoE with million-token context is not something most teams want to deploy, tune, and operate themselves.

DeepSeek-V4 Pro is now available on Together AI, the AI Native Cloud, so teams can start with Serverless Inference at 512K context and move to dedicated infrastructure for full 1M context, reserved capacity, and production control. DeepSeek-V4 Flash is coming soon, giving teams another V4 option for workloads where speed and cost matter more than maximum reasoning depth.

**At a glance**

| Spec | Value |
| --- | --- |
| Model | DeepSeek V4 Pro on Together AI |
| Endpoint | deepseek-ai/DeepSeek-V4-Pro |
| Architecture | 1.6T-parameter MoE |
| Activated parameters | 49B |
| Context on Together AI | 512K tokens |
| Model-level context | 1M tokens |
| Reasoning modes | Non-Think, Think High, Think Max |
| Deployment | Serverless, Monthly Reserved |
| Input price | $2.10 / 1M tokens |
| Cached input price | $0.20 / 1M tokens |
| Output price | $4.40 / 1M tokens |
| Best-fit workloads | Code agents, document intelligence, long-context agents, research synthesis |

**Built for long-context reasoning**

DeepSeek V4 Pro is built for workloads where the model needs to reason over more than a short prompt: large repositories, long technical documents, dense retrieval bundles, tool-call histories, and research corpora.

DeepSeek V4 Pro supports million-token context at the model level; on Together AI, it is currently available with a 512K-token context window. That distinction matters because model capability and deployed serving profile are not always the same thing. Together AI is launching DeepSeek V4 Pro with a context window designed for reliable production serving, while still giving teams enough room for serious long-context workloads.
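
The gap between the serverless window and the model-level window can be turned into a simple pre-flight check. The sketch below is illustrative only: it assumes "512K" and "1M" mean 512,000 and 1,000,000 tokens and that output tokens count against the same window, neither of which this post confirms. `choose_deployment` is a hypothetical helper, and the token counts would in practice come from the model's tokenizer.

```python
# Hypothetical pre-flight check: which deployment option can hold a request?
# Assumption: "512K" and "1M" are read as 512,000 and 1,000,000 tokens, and
# prompt plus output tokens share one window. The limits actually enforced
# by the service may differ.
SERVERLESS_CONTEXT_TOKENS = 512_000
MODEL_CONTEXT_TOKENS = 1_000_000


def choose_deployment(prompt_tokens: int, max_output_tokens: int = 0) -> str:
    """Return the smallest deployment option whose window fits the request."""
    if prompt_tokens < 0 or max_output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = prompt_tokens + max_output_tokens
    if total <= SERVERLESS_CONTEXT_TOKENS:
        return "serverless"
    if total <= MODEL_CONTEXT_TOKENS:
        return "dedicated"
    return "too_large"
```

A 300K-token repository snapshot fits the serverless window under these assumptions, while a request totalling more than 512,000 tokens would need the dedicated path.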

The architecture also matters because long context is not only a product spec. As context grows, serving cost, memory pressure, KV cache usage, latency, and concurrency all become part of the system design. DeepSeek V4 Pro uses hybrid attention, combining Compressed Sparse Attention and Heavily Compressed Attention, with DeepSeek reporting 27% of single-token inference FLOPs and 10% of KV cache compared to DeepSeek V3.2 at million-token context.

**Choose reasoning effort by workload**

DeepSeek V4 Pro supports three reasoning modes, so teams can match reasoning depth to task difficulty instead of treating every request the same.

| Mode | Use when | Tradeoff |
| --- | --- | --- |
| Non-Think | Extraction, classification, simple Q&A, routine responses | Fastest path for lower-complexity tasks |
| Think High | Code planning, document analysis, multi-step reasoning | More reasoning depth for complex work |
| Think Max | Hard debugging, deep research synthesis, agentic decision points | Maximum reasoning effort; expect higher latency and token usage |

A document assistant might use Non-Think for simple extraction, Think High for conflict analysis across policies, and Think Max only when the model needs to reason through a difficult decision. A code agent might use Think High for planning a migration and Think Max for debugging a subtle cross-service failure.
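
One way to apply this in an application is a small routing table that maps task types to a mode. The sketch below is a minimal illustration: the task labels, `ReasoningMode`, and `pick_mode` are hypothetical names, and the fallback to Think High is a design assumption. This post does not show how a mode is passed in an API request, so the sketch stops at choosing the mode name.

```python
from enum import Enum


class ReasoningMode(Enum):
    """The three mode names listed in the table above."""
    NON_THINK = "Non-Think"
    THINK_HIGH = "Think High"
    THINK_MAX = "Think Max"


# Hypothetical task labels, grouped the way the table groups them.
_MODE_BY_TASK = {
    "extraction": ReasoningMode.NON_THINK,
    "classification": ReasoningMode.NON_THINK,
    "simple_qa": ReasoningMode.NON_THINK,
    "code_planning": ReasoningMode.THINK_HIGH,
    "document_analysis": ReasoningMode.THINK_HIGH,
    "hard_debugging": ReasoningMode.THINK_MAX,
    "research_synthesis": ReasoningMode.THINK_MAX,
}


def pick_mode(task: str) -> ReasoningMode:
    """Map a task label to a mode; unknown tasks fall back to Think High."""
    return _MODE_BY_TASK.get(task, ReasoningMode.THINK_HIGH)
```

Defaulting unknown tasks to the middle mode is one reasonable choice; a latency-sensitive service might prefer to default to Non-Think instead.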

DeepSeek reports benchmark results across coding, reasoning, long-context, and agentic tasks, including 93.5% on LiveCodeBench, 90.1% on GPQA Diamond, 80.6% on SWE-bench Verified, 83.5% on MRCR 1M, and 62.0% on CorpusQA 1M.

**Make repeated long-context queries cheaper with cached input pricing**

Long-context systems often reuse the same large context across multiple questions: a repository snapshot, a document bundle, a policy archive, a retrieval payload, or a long agent trace. Cached input pricing makes those repeated workloads more practical.

DeepSeek V4 Pro is priced at $2.10 / 1M input tokens, with cached input at $0.20 / 1M tokens and output at $4.40 / 1M tokens. Cached input therefore costs roughly 90% less than uncached input ($0.20 versus $2.10 per 1M tokens), which matters when the expensive part of the request is a stable block of text that gets reused across follow-up analysis.

Example pattern:

  1. Load a large stable context, such as a 300K-token repo summary, contract set, or policy archive.
  2. Ask several follow-up questions over that same context.
  3. Use cached input pricing where applicable to drastically reduce the cost of repeated analysis.
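
The pattern above can be costed with a few lines of arithmetic. The sketch below uses the prices quoted in this post; everything else is an assumption for illustration. In particular, it assumes the shared context is billed at the full input price on the first request and at the cached price on every later request, and the question and output sizes are made up. `session_cost` is a hypothetical helper, not part of any SDK.

```python
# Prices quoted in this post, in USD per 1M tokens.
INPUT_PRICE = 2.10
CACHED_INPUT_PRICE = 0.20
OUTPUT_PRICE = 4.40
TOKENS_PER_MILLION = 1_000_000


def session_cost(context_tokens: int, question_tokens: int,
                 output_tokens: int, n_questions: int,
                 cached: bool = True) -> float:
    """Estimate the USD cost of asking n_questions over one shared context.

    Assumption: with caching, the shared context is billed at the full input
    price on the first request and at the cached price afterwards. Question
    tokens are always billed at the full input price.
    """
    if n_questions < 1:
        raise ValueError("n_questions must be at least 1")
    total = 0.0
    for i in range(n_questions):
        context_price = CACHED_INPUT_PRICE if (cached and i > 0) else INPUT_PRICE
        total += context_tokens * context_price / TOKENS_PER_MILLION
        total += question_tokens * INPUT_PRICE / TOKENS_PER_MILLION
        total += output_tokens * OUTPUT_PRICE / TOKENS_PER_MILLION
    return total


with_cache = session_cost(300_000, 200, 1_000, 10, cached=True)
without_cache = session_cost(300_000, 200, 1_000, 10, cached=False)
print(f"with cache: ${with_cache:.2f}, without cache: ${without_cache:.2f}")
```

Under these assumptions, ten questions over a 300K-token context cost about $1.22 with caching versus about $6.35 without it. Actual billing depends on when the service treats input as cached.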

**Workload patterns**

Code agents

Use DeepSeek V4 Pro when an agent needs to reason across repository slices, issue traces, internal documentation, prior tool calls, and proposed patches. Think High or Think Max is most useful for planning changes, debugging failures, or resolving cross-file dependencies.

Document intelligence

Use long context for contracts, policy sets, technical manuals, or research collections that need to be compared in one request. Non-Think can handle extraction and simple Q&A; Think High is better for conflict analysis, interpretation, and synthesis.

Long-context agent traces

Use DeepSeek V4 Pro to inspect long tool-call histories, intermediate results, and execution traces. Higher reasoning modes are most useful at decision points: when the agent needs to decide whether to continue, call another tool, revise a plan, or stop.

Research synthesis

Use DeepSeek V4 Pro for workflows that combine papers, notes, benchmark reports, retrieved documents, and prior analysis. Cached input pricing is especially useful when the same evidence set is reused across multiple questions.

**Start serverless, move to reserved capacity**

DeepSeek V4 Pro is available on Together AI Serverless Inference and Monthly Reserved infrastructure. Serverless is the right starting point for evaluation, development, and variable traffic. Monthly Reserved is better for steadier production demand where teams need more predictable capacity and cost control.

For long-context workloads, the deployment path matters. Teams are not only choosing a model; they are choosing how to manage throughput, concurrency, latency, KV cache pressure, and cost as context sizes grow. Together AI gives teams a path from evaluation to production without standing up the serving stack themselves.

Try it now

DeepSeek-V4 Pro is available today on Together AI Serverless Inference and Dedicated Endpoints.

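```python
from together import Together

# The client reads the API key from the TOGETHER_API_KEY environment variable.
client = Together()

# Request a streamed completion so tokens can be printed as they arrive.
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[
        {
            "role": "user",
            "content": "Prove that the square root of 2 is irrational.",
        }
    ],
    stream=True,
)

for chunk in stream:
    # Skip chunks that carry no choices.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta

    # Print reasoning tokens when present, then the answer tokens.
    if hasattr(delta, "reasoning") and delta.reasoning:
        print(delta.reasoning, end="", flush=True)
    if hasattr(delta, "content") and delta.content:
        print(delta.content, end="", flush=True)
```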

Start with Serverless Inference for development and evaluation. For production workloads that require full 1M context, reserved capacity, workload isolation, or more predictable throughput, contact sales to deploy DeepSeek-V4 Pro on Together AI Dedicated Inference.

Get started

→ Follow our DeepSeek-V4 quickstart to get up and running in minutes

→ View the DeepSeek-V4 Pro Model Page

→ Try DeepSeek-V4 Pro in the Playground

→ Contact Sales for Dedicated Inference deployment and volume pricing
