May 12, 2025

Pretrained Models in ArcGIS: Comparing Task-Specific and Generalized Vision-Language Models

By Vinay Viswambharan and Rohit Singh

ArcGIS provides nearly 100 pre-trained AI models, making geospatial AI more accessible than ever before. These models are part of Esri’s ongoing effort to democratize AI, helping users—regardless of their technical background—extract high-quality insights from imagery with minimal investment in compute or skillset.

For some users, these models are a starting point into the world of AI. For others, they are a complete solution, solving domain-specific problems right out of the box or serving as a foundation for fine-tuning to specific geographies, datasets or features of interest.

ArcGIS is democratizing access to AI by providing pre-trained models that deliver authoritative geospatial insights—making AI approachable, scalable, and impactful for a wide range of users in the geospatial community.

Two Types of Pretrained Models

Task-Specific Pretrained Models are built to do one job extremely well—like detecting building footprints, identifying swimming pools, or classifying land cover. They are highly optimized for single tasks and perform well in specific domains and imagery types.

Examples include models for building footprint extraction, land cover classification, tree segmentation, and car detection.

Generalized Vision-Language Pretrained Models, on the other hand, are like Swiss Army knives. They can perform many visual tasks using natural language prompts—such as “find red cars,” “segment all lakes,” or “classify these images based on candidate labels”—all with the same model.

Examples include Text SAM, Prompt-Based Segmentation, Vision-Language Context-Based Classification, and Image Interrogation.

Key Differences Between the Two Model Types

The two model types can be differentiated based on seven factors:

Applications – Refers to the types of real-world use cases the model is designed to support, such as extracting features, classifying objects, or answering questions about imagery. This defines whether the model serves a narrow, fixed purpose or can adapt to diverse user needs.
Accuracy – Measures how precisely and reliably a model performs its intended task. This includes how well it detects, classifies, or segments features, and how often it makes correct predictions based on ground truth.
Supported Data Types – Describes the kinds of imagery or input formats the model can work with—such as satellite imagery, drone data, street-level photos, or point cloud datasets. It also reflects the model’s compatibility with different resolutions and spectral bands.
Tasks – Defines the range and nature of operations the model can perform—like object detection, pixel classification, image captioning, or point cloud classification. It helps distinguish between models that do one task well and those that can handle many.
Model Size – Refers to the computational footprint of the model, including the number of parameters, storage requirements, and memory usage. Larger models often support more complex tasks but require more resources to run.
Fine-tuning – Indicates how easily the model can be adapted to specific datasets, geographies, or new classes. Some models need retraining for different use cases, while others can generalize with minimal or no customization.
Domain – Represents the scope of subject matter the model is trained on, such as remote sensing imagery, or general photos and alt-text pairs from the Internet. A model’s domain influences its performance and relevance for geospatial applications.

1. Applications

Task-specific models are built for a single, well-defined task like detecting buildings or swimming pools. In contrast, generalized models can handle many tasks through flexible natural language prompts, such as detecting red cars or airplanes.

2. Accuracy

Task-specific models usually provide higher accuracy since they’re fine-tuned for a specific job. Generalized models offer broader capabilities but may compromise on precision.
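One way to check this trade-off on your own imagery is to score each model's output against labeled ground truth. The sketch below is a minimal illustration, not an ArcGIS tool: it assumes both the prediction and the ground truth are small binary masks stored as nested lists, and the function name score_mask and the sample masks are hypothetical.

```python
def score_mask(predicted, truth):
    """Score a predicted binary mask against a ground-truth mask.

    Both masks are lists of rows containing 0 (background) or 1 (feature)
    and are assumed to have the same shape.
    """
    tp = fp = fn = 0
    for pred_row, true_row in zip(predicted, truth):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1          # feature correctly predicted
            elif p and not t:
                fp += 1          # predicted feature that is not in the truth
            elif t and not p:
                fn += 1          # true feature the model missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return {"precision": precision, "recall": recall, "iou": iou}


# Hypothetical 3x3 masks, for illustration only.
truth = [[1, 1, 0],
         [1, 1, 0],
         [0, 0, 0]]
predicted = [[1, 1, 1],
             [1, 0, 0],
             [0, 0, 0]]

scores = score_mask(predicted, truth)
print(scores)
```

Running the same scoring on the output of a task-specific model and a prompted vision-language model over the same labeled area gives a like-for-like comparison.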

3. Supported Data Types

Task-specific models in ArcGIS support a wide variety of geospatial inputs, including low-resolution satellite imagery, NAIP, 3D point clouds, SAR, hyperspectral imagery, and video. Generalized models typically work only with RGB imagery.

4. Tasks

Task-specific models can handle a range of geospatial AI tasks: object detection, pixel classification, change detection, 3D point cloud classification, and more. Generalized models are limited to classification, segmentation, and object detection on RGB inputs.

5. Model Size

Task-specific models are smaller and more efficient, making them easier to deploy and run with lower compute requirements. Generalized models tend to be large and compute-intensive, requiring more powerful GPUs or cloud services.
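As a rough back-of-the-envelope check, the storage needed for a model's weights can be estimated from its parameter count and numeric precision. The sketch below assumes 4 bytes per parameter for float32 and 2 bytes for float16. The parameter counts are hypothetical and do not describe any particular ArcGIS model, and the estimate covers weights only, not the additional memory used during inference.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weights_size_mb(num_parameters, dtype="float32"):
    """Estimate the size of a model's weights in megabytes (10^6 bytes)."""
    return num_parameters * BYTES_PER_PARAM[dtype] / 1_000_000


# Hypothetical parameter counts, chosen only to show the scale difference.
small_model_params = 25_000_000
large_model_params = 1_000_000_000

print(weights_size_mb(small_model_params))             # float32 weights
print(weights_size_mb(large_model_params))             # float32 weights
print(weights_size_mb(large_model_params, "float16"))  # half precision
```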

6. Fine-tuning

Task-specific models can be fine-tuned using a relatively small amount of labeled data to improve accuracy. Generalized models are typically static: they cannot be fine-tuned or otherwise improved.

7. Domain

Task-specific models are purpose-built for geospatial analysis and closely aligned with real-world mapping needs. Generalized models are designed for broad, non-spatial image understanding tasks. By integrating them with ArcGIS, we can extend their capabilities to spatial domains, enabling powerful workflows that combine foundational visual understanding with location intelligence.

Wrapping Up

Both types of models have their strengths. Task-specific models are ideal for users who need high precision and domain-specific capabilities. Generalized models are great for exploring flexible, prompt-driven AI tasks. Our recommendation is to use a task-specific model if one exists for your features of interest and geography. If no such model exists, try a vision-language model and experiment with different prompts to see which ones give the best results. If you’re still not getting the desired results, look into fine-tuning a task-specific model on your own imagery, features of interest, and geography.
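The recommendation above can be read as a simple decision sequence. The sketch below only illustrates that logic; the function name recommend_model and its two inputs are hypothetical and are not part of any ArcGIS API.

```python
def recommend_model(task_specific_available, prompts_give_good_results=None):
    """Return the suggested next step for choosing a pretrained model.

    task_specific_available: True if a task-specific model exists for your
        features of interest and geography.
    prompts_give_good_results: None if vision-language prompts have not been
        tried yet, otherwise True or False.
    """
    if task_specific_available:
        return "use the task-specific model"
    if prompts_give_good_results is None:
        return "try a vision-language model with different prompts"
    if prompts_give_good_results:
        return "use the vision-language model"
    return "fine-tune a task-specific model on your own imagery"


print(recommend_model(True))
print(recommend_model(False))
print(recommend_model(False, prompts_give_good_results=False))
```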

At Esri, we’re committed to supporting both—so our users can get the most out of imagery AI, no matter their experience level.

Vinay Viswambharan

Principal Product manager on the Imagery team at Esri, with a zeal for remote sensing, AI and everything imagery.

Rohit Singh

Director of Esri R&D Center, New Delhi & development lead of ArcGIS AI technologies and ArcGIS API for Python. Applying deep learning to the Science of Where!

ArcGIS Blog