gavin/transformers

Fork 0

Files

陈赣 06f1fd69a6

Self-hosted runner (nightly-past-ci-caller) / Get number (push) Has been cancelled

Details

Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.11 (push) Has been cancelled

Details

Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.10 (push) Has been cancelled

Details

Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.9 (push) Has been cancelled

Details

Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.8 (push) Has been cancelled

Details

Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.7 (push) Has been cancelled

Details

Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.6 (push) Has been cancelled

Details

Self-hosted runner (nightly-past-ci-caller) / TensorFlow 2.5 (push) Has been cancelled

Details

Self-hosted runner (benchmark) / Benchmark (aws-g5-4xlarge-cache) (push) Has been cancelled

Details

Build documentation / build (push) Has been cancelled

Details

Build documentation / build_other_lang (push) Has been cancelled

Details

CodeQL Security Analysis / CodeQL Analysis (push) Has been cancelled

Details

New model PR merged notification / Notify new model (push) Has been cancelled

Details

PR CI / pr-ci (push) Has been cancelled

Details

Slow tests on important models (on Push - A10) / Get all modified files (push) Has been cancelled

Details

Secret Leaks / trufflehog (push) Has been cancelled

Details

Update Transformers metadata / build_and_package (push) Has been cancelled

Details

Slow tests on important models (on Push - A10) / Model CI (push) Has been cancelled

Details

Check Tiny Models / Check tiny models (push) Has been cancelled

Details

Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Model CI (push) Has been cancelled

Details

Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Pipeline CI (push) Has been cancelled

Details

Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Example CI (push) Has been cancelled

Details

Self-hosted runner (Intel Gaudi3 scheduled CI caller) / DeepSpeed CI (push) Has been cancelled

Details

Self-hosted runner (Intel Gaudi3 scheduled CI caller) / Trainer/FSDP CI (push) Has been cancelled

Details

Nvidia CI - Flash Attn / Setup (push) Has been cancelled

Details

Nvidia CI - Flash Attn / Model CI (push) Has been cancelled

Details

Nvidia CI / Setup (push) Has been cancelled

Details

Nvidia CI / Model CI (push) Has been cancelled

Details

Nvidia CI / Torch pipeline CI (push) Has been cancelled

Details

Nvidia CI / Example CI (push) Has been cancelled

Details

Nvidia CI / Trainer/FSDP CI (push) Has been cancelled

Details

Nvidia CI / DeepSpeed CI (push) Has been cancelled

Details

Nvidia CI / Quantization CI (push) Has been cancelled

Details

Nvidia CI / Kernels CI (push) Has been cancelled

Details

Doctests / Setup (push) Has been cancelled

Details

Doctests / Call doctest jobs (push) Has been cancelled

Details

Doctests / Send results to webhook (push) Has been cancelled

Details

Extras Smoke Test / Get supported Python versions (push) Has been cancelled

Details

Extras Smoke Test / Test extras on Python ${{ matrix.python-version }} (push) Has been cancelled

Details

Extras Smoke Test / Check Slack token availability (push) Has been cancelled

Details

Extras Smoke Test / Notify failures to Slack (push) Has been cancelled

Details

Self-hosted runner (AMD scheduled CI caller) / Trigger Scheduled AMD CI (push) Has been cancelled

Details

Stale Bot / Close Stale Issues (push) Has been cancelled

Details

first commit

2026-06-05 16:53:03 +08:00

7.2 KiB

Raw Permalink Blame History

Monocular depth estimation

Monocular depth estimation is a computer vision task that involves predicting the depth information of a scene from a single image. In other words, it is the process of estimating the distance of objects in a scene from a single camera viewpoint.

Monocular depth estimation has various applications, including 3D reconstruction, augmented reality, autonomous driving, and robotics. It is a challenging task as it requires the model to understand the complex relationships between objects in the scene and the corresponding depth information, which can be affected by factors such as lighting conditions, occlusion, and texture.

There are two main depth estimation categories:

Absolute depth estimation: This task variant aims to provide exact depth measurements from the camera. The term is used interchangeably with metric depth estimation, where depth is provided in precise measurements in meters or feet. Absolute depth estimation models output depth maps with numerical values that represent real-world distances.
Relative depth estimation: Relative depth estimation aims to predict the depth order of objects or points in a scene without providing the precise measurements. These models output a depth map that indicates which parts of the scene are closer or farther relative to each other without the actual distances to A and B.

In this guide, we will see how to infer with Depth Anything V2, a state-of-the-art zero-shot relative depth estimation model, and ZoeDepth, an absolute depth estimation model.

Check the Depth Estimation task page to view all compatible architectures and checkpoints.

Before we begin, we need to install the latest version of Transformers:

pip install -q -U transformers

Depth estimation pipeline

The simplest way to try out inference with a model supporting depth estimation is to use the corresponding [pipeline]. Instantiate a pipeline from a checkpoint on the Hugging Face Hub:

>>> from transformers import pipeline
from accelerate import Accelerator
>>> import torch
>>> device = Accelerator().device
>>> checkpoint = "depth-anything/Depth-Anything-V2-base-hf"
>>> pipe = pipeline("depth-estimation", model=checkpoint, device=device)

Next, choose an image to analyze:

>>> from PIL import Image
>>> import requests

>>> url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> image

Pass the image to the pipeline.

>>> predictions = pipe(image)

The pipeline returns a dictionary with two entries. The first one, called predicted_depth, is a tensor with the values being the depth expressed in meters for each pixel. The second one, depth, is a PIL image that visualizes the depth estimation result.

Let's take a look at the visualized result:

>>> predictions["depth"]

Depth estimation inference by hand

Now that you've seen how to use the depth estimation pipeline, let's see how we can replicate the same result by hand.

Start by loading the model and associated processor from a checkpoint on the Hugging Face Hub. Here we'll use the same checkpoint as before:

>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation

>>> checkpoint = "Intel/zoedepth-nyu-kitti"

>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint)
>>> model = AutoModelForDepthEstimation.from_pretrained(checkpoint).to(device)

Prepare the image input for the model using the image_processor that will take care of the necessary image transformations such as resizing and normalization:

>>> pixel_values = image_processor(image, return_tensors="pt").pixel_values.to(device)

Pass the prepared inputs through the model:

>>> import torch

>>> with torch.no_grad():
...     outputs = model(pixel_values)

Let's post-process the results to remove any padding and resize the depth map to match the original image size. The post_process_depth_estimation outputs a list of dicts containing the "predicted_depth".

>>> # ZoeDepth dynamically pads the input image. Thus we pass the original image size as argument
>>> # to `post_process_depth_estimation` to remove the padding and resize to original dimensions.
>>> post_processed_output = image_processor.post_process_depth_estimation(
...     outputs,
...     source_sizes=[(image.height, image.width)],
... )

>>> predicted_depth = post_processed_output[0]["predicted_depth"]
>>> depth = (predicted_depth - predicted_depth.min()) / (predicted_depth.max() - predicted_depth.min())
>>> depth = depth.detach().cpu().numpy() * 255
>>> depth = Image.fromarray(depth.astype("uint8"))

In the original implementation ZoeDepth model performs inference on both the original and flipped images and averages out the results. The post_process_depth_estimation function can handle this for us by passing the flipped outputs to the optional outputs_flipped argument:

>>> with torch.no_grad():
...     outputs = model(pixel_values)
...     outputs_flipped = model(pixel_values=torch.flip(inputs.pixel_values, dims=[3]))
>>> post_processed_output = image_processor.post_process_depth_estimation(
...     outputs,
...     source_sizes=[(image.height, image.width)],
...     outputs_flipped=outputs_flipped,
... )

7.2 KiB Raw Permalink Blame History

Monocular depth estimation

Depth estimation pipeline

Depth estimation inference by hand

7.2 KiB

Raw Permalink Blame History