{"id":2727322,"date":"2025-03-11T05:55:55","date_gmt":"2025-03-11T12:55:55","guid":{"rendered":"https:\/\/www.esri.com\/arcgis-blog\/?post_type=blog&#038;p=2727322"},"modified":"2026-03-10T00:59:18","modified_gmt":"2026-03-10T07:59:18","slug":"use-vision-language-models-to-optimize-object-classification","status":"publish","type":"blog","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification","title":{"rendered":"Use vision-language models to optimize object classification"},"author":207622,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","format":"standard","meta":{"_acf_changed":false,"_searchwp_excluded":""},"categories":[770712],"tags":[],"industry":[],"product":[36561],"class_list":["post-2727322","blog","type-blog","status-publish","format-standard","hentry","category-geoai","product-arcgis-pro"],"acf":{"authors":[{"ID":207622,"user_firstname":"Aawaj","user_lastname":"Joshi","nickname":"Aawaj Joshi","user_nicename":"ajoshi","display_name":"Aawaj Joshi","user_email":"ajoshi@esri.com","user_url":"","user_registered":"2021-03-25 20:59:02","user_description":"Aawaj is a Product Engineer on the ArcGIS Enterprise team.","user_avatar":"<img data-del=\"avatar\" src='https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2026\/03\/IMG_6218-1-1-465x465.png' class='avatar pp-user-avatar avatar-96 photo ' height='96' width='96'\/>"},{"ID":6911,"user_firstname":"Rohit","user_lastname":"Singh","nickname":"Rohit Singh","user_nicename":"rsinghesri-com","display_name":"Rohit Singh","user_email":"rsingh@esri.com","user_url":"","user_registered":"2018-03-02 00:19:00","user_description":"Rohit Singh is Director of Esri\u2019s R&amp;D Center in New Delhi, leading the design and development of Geospatial AI capabilities across the ArcGIS platform. 
He has played a key role in the development of ArcGIS API for Python, ArcGIS Java Engine API, and the Linux enablement of ArcGIS. An alumnus of IIT Kharagpur, Rohit holds an MS in Computer Science with specialization in AI from Georgia Tech.","user_avatar":"<img data-del=\"avatar\" src='https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/08\/RohitSingh_AISummit2025-213x200.jpeg' class='avatar pp-user-avatar avatar-96 photo ' height='96' width='96'\/>"}],"short_description":"Rohit Singh demonstrates how to use vision-language models in scenarios where understanding both image and text content is crucial.","flexible_content":[{"acf_fc_layout":"content","content":"<p>Esri&#8217;s curated library of pretrained deep learning models, accessible through ArcGIS Living Atlas of the World, has an exciting new addition\u2014the Vision Language Context-Based Classification model. What sets vision-language models apart from traditional deep learning models is their ability to not only understand and process images but also interpret and generate human-like text.<\/p>\n"},{"acf_fc_layout":"content","content":"<p>At this year&#8217;s Developer and Technology Summit plenary, Rohit Singh puts the Vision Language Context-Based Classification model to the test by using it to identify buildings damaged in the Palisades fire that recently swept through western Los Angeles County.<\/p>\n"},{"acf_fc_layout":"kaltura","video_id":"1_frvs1b8u","time":true,"start":"344","stop":""},{"acf_fc_layout":"content","content":"<p>In his ArcGIS Pro project, Rohit has an imagery layer showing the fire&#8217;s perimeter and a layer showing the footprints of 13,000 buildings within 
it.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2730612,"id":2730612,"title":"Snag_2a7ff2f7","filename":"Snag_2a7ff2f7.png","filesize":1610887,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_2a7ff2f7.png","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification\/snag_2a7ff2f7","alt":"Rohit's ArcGIS Pro project containing the required layers.","author":"207622","description":"","caption":"Rohit's ArcGIS Pro project containing the required layers.","name":"snag_2a7ff2f7","status":"inherit","uploaded_to":2727322,"date":"2025-03-14 01:11:13","modified":"2025-03-14 01:11:22","menu_order":0,"mime_type":"image\/png","type":"image","subtype":"png","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":1920,"height":1040,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_2a7ff2f7-213x200.png","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_2a7ff2f7.png","medium-width":464,"medium-height":251,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_2a7ff2f7.png","medium_large-width":768,"medium_large-height":416,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_2a7ff2f7.png","large-width":1920,"large-height":1040,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_2a7ff2f7-1536x832.png","1536x1536-width":1536,"1536x1536-height":832,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_2a7ff2f7.png","2048x2048-width":1920,"2048x2048-height":1040,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_2a7ff2f7-826x447.png","card_image-width":826,"card_image-height":447,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_2a7ff2f7.png","wide_image-width":1920,"wide_image-height
":1040}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>To classify the buildings, he uses the <a href=\"https:\/\/pro.arcgis.com\/en\/pro-app\/latest\/tool-reference\/image-analyst\/classify-objects-using-deep-learning.htm\">Classify Objects Using Deep Learning tool<\/a>, which runs a deep learning model on an input raster and a feature class to assign a class or category label to each input feature.<\/p>\n<p>For the classification model, he selects the Vision Language Context-Based Classification model. The model leverages OpenAI&#8217;s GPT 4o model and takes prompts in natural language for additional context on the input imagery and the desired way of classifying objects; so, Rohit provides it with appropriate context and specifies the custom class labels\u2014damaged and undamaged\u2014he wants the model to employ to describe each identified building.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2730622,"id":2730622,"title":"Snag_2a8141be","filename":"Snag_2a8141be.png","filesize":1871222,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_2a8141be.png","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification\/snag_2a8141be","alt":"Rohit uses the Classify Objects Using Deep Learning tool to classify the buildings.","author":"207622","description":"","caption":"Rohit uses the Classify Objects Using Deep Learning tool to classify the buildings.","name":"snag_2a8141be","status":"inherit","uploaded_to":2727322,"date":"2025-03-14 01:12:39","modified":"2025-03-14 
01:12:54","menu_order":0,"mime_type":"image\/png","type":"image","subtype":"png","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":1920,"height":1040,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_2a8141be-213x200.png","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_2a8141be.png","medium-width":464,"medium-height":251,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_2a8141be.png","medium_large-width":768,"medium_large-height":416,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_2a8141be.png","large-width":1920,"large-height":1040,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_2a8141be-1536x832.png","1536x1536-width":1536,"1536x1536-height":832,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_2a8141be.png","2048x2048-width":1920,"2048x2048-height":1040,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_2a8141be-826x447.png","card_image-width":826,"card_image-height":447,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_2a8141be.png","wide_image-width":1920,"wide_image-height":1040}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>Since the classification takes a few hours, Rohit shows the result he obtained when he ran the tool before the plenary. 
The result shows that the model identified approximately 7,000 damaged buildings.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2729352,"id":2729352,"title":"Snag_25f35250","filename":"Snag_25f35250.png","filesize":2193313,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_25f35250.png","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification\/snag_25f35250","alt":"Red features represent damaged buildings.","author":"207622","description":"","caption":"Red features represent damaged buildings.","name":"snag_25f35250","status":"inherit","uploaded_to":2727322,"date":"2025-03-13 03:59:10","modified":"2025-03-13 03:59:20","menu_order":0,"mime_type":"image\/png","type":"image","subtype":"png","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":1920,"height":1040,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_25f35250-213x200.png","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_25f35250.png","medium-width":464,"medium-height":251,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_25f35250.png","medium_large-width":768,"medium_large-height":416,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_25f35250.png","large-width":1920,"large-height":1040,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_25f35250-1536x832.png","1536x1536-width":1536,"1536x1536-height":832,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_25f35250.png","2048x2048-width":1920,"2048x2048-height":1040,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_25f35250-826x447.png","card_image-width":826,"card_image-height":447,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_25f35250.png"
,"wide_image-width":1920,"wide_image-height":1040}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>Additionally, the model explains how it determined whether a building is damaged.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2727452,"id":2727452,"title":"The model explains how it determined whether a building is damaged.","filename":"Snag_1d838c56.png","filesize":1864942,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d838c56.png","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification\/snag_1d838c56","alt":"The model explains how it determined whether a building is damaged.","author":"207622","description":"","caption":"The model explains how it determined whether a building is damaged.","name":"snag_1d838c56","status":"inherit","uploaded_to":2727322,"date":"2025-03-11 12:40:08","modified":"2025-03-11 12:41:11","menu_order":0,"mime_type":"image\/png","type":"image","subtype":"png","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":1920,"height":1040,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d838c56-213x200.png","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d838c56.png","medium-width":464,"medium-height":251,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d838c56.png","medium_large-width":768,"medium_large-height":416,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d838c56.png","large-width":1920,"large-height":1040,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d838c56-1536x832.png","1536x1536-width":1536,"1536x1536-height":832,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d838c56.png","2048x2048-width":19
20,"2048x2048-height":1040,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d838c56-826x447.png","card_image-width":826,"card_image-height":447,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d838c56.png","wide_image-width":1920,"wide_image-height":1040}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>Next, he opens a 3D scene that better shows the extent of the damage.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2727462,"id":2727462,"title":"3D scene showing the extent of the damage.","filename":"Snag_1d8542b1.png","filesize":2470704,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d8542b1.png","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification\/snag_1d8542b1","alt":"3D scene showing the extent of the damage.","author":"207622","description":"","caption":"3D scene showing the extent of the damage.","name":"snag_1d8542b1","status":"inherit","uploaded_to":2727322,"date":"2025-03-11 12:42:00","modified":"2025-03-11 
12:42:39","menu_order":0,"mime_type":"image\/png","type":"image","subtype":"png","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":1920,"height":1040,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d8542b1-213x200.png","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d8542b1.png","medium-width":464,"medium-height":251,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d8542b1.png","medium_large-width":768,"medium_large-height":416,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d8542b1.png","large-width":1920,"large-height":1040,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d8542b1-1536x832.png","1536x1536-width":1536,"1536x1536-height":832,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d8542b1.png","2048x2048-width":1920,"2048x2048-height":1040,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d8542b1-826x447.png","card_image-width":826,"card_image-height":447,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d8542b1.png","wide_image-width":1920,"wide_image-height":1040}},"image_position":"center","orientation":"horizontal","hyperlink":""},{"acf_fc_layout":"content","content":"<p>Finally, he shows how the classification can be automated through <a href=\"https:\/\/pro.arcgis.com\/en\/pro-app\/latest\/arcpy\/get-started\/what-is-arcpy-.htm\">ArcPy<\/a> and <a href=\"https:\/\/www.esri.com\/en-us\/arcgis\/products\/arcgis-notebooks\/overview\">ArcGIS Notebooks<\/a>, mainly using natural language and very little code.<\/p>\n"},{"acf_fc_layout":"image","image":{"ID":2727472,"id":2727472,"title":"Automating the classification using ArcPy and ArcGIS 
Notebooks","filename":"Snag_1d86c3f2.png","filesize":944723,"url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d86c3f2.png","link":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification\/snag_1d86c3f2","alt":"Automating the classification using ArcPy and ArcGIS Notebooks","author":"207622","description":"","caption":"Automating the classification using ArcPy and ArcGIS Notebooks","name":"snag_1d86c3f2","status":"inherit","uploaded_to":2727322,"date":"2025-03-11 12:43:38","modified":"2025-03-11 12:44:30","menu_order":0,"mime_type":"image\/png","type":"image","subtype":"png","icon":"https:\/\/www.esri.com\/arcgis-blog\/wp-includes\/images\/media\/default.png","width":1855,"height":1025,"sizes":{"thumbnail":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d86c3f2-213x200.png","thumbnail-width":213,"thumbnail-height":200,"medium":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d86c3f2.png","medium-width":464,"medium-height":256,"medium_large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d86c3f2.png","medium_large-width":768,"medium_large-height":424,"large":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d86c3f2.png","large-width":1855,"large-height":1025,"1536x1536":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d86c3f2-1536x849.png","1536x1536-width":1536,"1536x1536-height":849,"2048x2048":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d86c3f2.png","2048x2048-width":1855,"2048x2048-height":1025,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d86c3f2-826x456.png","card_image-width":826,"card_image-height":456,"wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d86c3f2.png","wide_image-width":1855,"wide_image-height":1025}},"image_position":"center","orientation":"horizontal","hyperlink":""},{
"acf_fc_layout":"content","content":"<p>In a scenario where understanding both image and text prompts was crucial, Rohit employed the Vision Language Context-Based Classification model to extract insights that would have been difficult or impossible to find manually. To learn more about the model, see the <a href=\"https:\/\/doc.arcgis.com\/en\/pretrained-models\/latest\/imagery\/introduction-to-vision-language-context-based-classification.htm\">ArcGIS pretrained models documentation<\/a>.<\/p>\n"}],"related_articles":"","show_article_image":false,"card_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d981e82.png","wide_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d9460a0.png"},"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.9 (Yoast SEO v25.9) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Use vision-language models to optimize object classification<\/title>\n<meta name=\"description\" content=\"Rohit Singh demonstrates how to use vision-language models in scenarios where understanding both image and text content is crucial.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Use vision-language models to optimize object classification\" \/>\n<meta property=\"og:description\" content=\"Rohit Singh demonstrates how to use vision-language models in scenarios where understanding both image and text content is crucial.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification\" 
\/>\n<meta property=\"og:site_name\" content=\"ArcGIS Blog\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/esrigis\/\" \/>\n<meta property=\"article:modified_time\" content=\"2026-03-10T07:59:18+00:00\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:site\" content=\"@ESRI\" \/>\n<meta name=\"twitter:label1\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"4 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":[\"Article\",\"BlogPosting\"],\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification\"},\"author\":{\"name\":\"Aawaj Joshi\",\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/person\/1a05fe13e7f3a2c7350529406f1ad821\"},\"headline\":\"Use vision-language models to optimize object classification\",\"datePublished\":\"2025-03-11T12:55:55+00:00\",\"dateModified\":\"2026-03-10T07:59:18+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification\"},\"wordCount\":7,\"publisher\":{\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#organization\"},\"articleSection\":[\"AI\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification\",\"url\":\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification\",\"name\":\"Use vision-language models to optimize object 
classification\",\"isPartOf\":{\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#website\"},\"datePublished\":\"2025-03-11T12:55:55+00:00\",\"dateModified\":\"2026-03-10T07:59:18+00:00\",\"description\":\"Rohit Singh demonstrates how to use vision-language models in scenarios where understanding both image and text content is crucial.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.esri.com\/arcgis-blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Use vision-language models to optimize object classification\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#website\",\"url\":\"https:\/\/www.esri.com\/arcgis-blog\/\",\"name\":\"ArcGIS Blog\",\"description\":\"Get insider info from Esri product 
teams\",\"publisher\":{\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.esri.com\/arcgis-blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#organization\",\"name\":\"Esri\",\"url\":\"https:\/\/www.esri.com\/arcgis-blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2018\/04\/Esri.png\",\"contentUrl\":\"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2018\/04\/Esri.png\",\"width\":400,\"height\":400,\"caption\":\"Esri\"},\"image\":{\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/esrigis\/\",\"https:\/\/x.com\/ESRI\",\"https:\/\/www.linkedin.com\/company\/5311\/\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/person\/1a05fe13e7f3a2c7350529406f1ad821\",\"name\":\"Aawaj Joshi\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2026\/03\/IMG_6218-1-1-465x465.png\",\"contentUrl\":\"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2026\/03\/IMG_6218-1-1-465x465.png\",\"caption\":\"Aawaj Joshi\"},\"description\":\"Aawaj is a Product Engineer on the ArcGIS Enterprise team.\",\"url\":\"https:\/\/www.esri.com\/arcgis-blog\/author\/ajoshi\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"Use vision-language models to optimize object classification","description":"Rohit Singh demonstrates how to use vision-language models in scenarios where understanding both image and text content is crucial.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification","og_locale":"en_US","og_type":"article","og_title":"Use vision-language models to optimize object classification","og_description":"Rohit Singh demonstrates how to use vision-language models in scenarios where understanding both image and text content is crucial.","og_url":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification","og_site_name":"ArcGIS Blog","article_publisher":"https:\/\/www.facebook.com\/esrigis\/","article_modified_time":"2026-03-10T07:59:18+00:00","twitter_card":"summary_large_image","twitter_site":"@ESRI","twitter_misc":{"Est. 
reading time":"4 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":["Article","BlogPosting"],"@id":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification#article","isPartOf":{"@id":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification"},"author":{"name":"Aawaj Joshi","@id":"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/person\/1a05fe13e7f3a2c7350529406f1ad821"},"headline":"Use vision-language models to optimize object classification","datePublished":"2025-03-11T12:55:55+00:00","dateModified":"2026-03-10T07:59:18+00:00","mainEntityOfPage":{"@id":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification"},"wordCount":7,"publisher":{"@id":"https:\/\/www.esri.com\/arcgis-blog\/#organization"},"articleSection":["AI"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification","url":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification","name":"Use vision-language models to optimize object classification","isPartOf":{"@id":"https:\/\/www.esri.com\/arcgis-blog\/#website"},"datePublished":"2025-03-11T12:55:55+00:00","dateModified":"2026-03-10T07:59:18+00:00","description":"Rohit Singh demonstrates how to use vision-language models in scenarios where understanding both image and text content is 
crucial.","breadcrumb":{"@id":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.esri.com\/arcgis-blog\/"},{"@type":"ListItem","position":2,"name":"Use vision-language models to optimize object classification"}]},{"@type":"WebSite","@id":"https:\/\/www.esri.com\/arcgis-blog\/#website","url":"https:\/\/www.esri.com\/arcgis-blog\/","name":"ArcGIS Blog","description":"Get insider info from Esri product teams","publisher":{"@id":"https:\/\/www.esri.com\/arcgis-blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.esri.com\/arcgis-blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.esri.com\/arcgis-blog\/#organization","name":"Esri","url":"https:\/\/www.esri.com\/arcgis-blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2018\/04\/Esri.png","contentUrl":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2018\/04\/Esri.png","width":400,"height":400,"caption":"Esri"},"image":{"@id":"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/esrigis\/","https:\/\/x.com\/ESRI","https:\/\/www.linkedin.com\/company\/5311\/"]},{"@type
":"Person","@id":"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/person\/1a05fe13e7f3a2c7350529406f1ad821","name":"Aawaj Joshi","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.esri.com\/arcgis-blog\/#\/schema\/person\/image\/","url":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2026\/03\/IMG_6218-1-1-465x465.png","contentUrl":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2026\/03\/IMG_6218-1-1-465x465.png","caption":"Aawaj Joshi"},"description":"Aawaj is a Product Engineer on the ArcGIS Enterprise team.","url":"https:\/\/www.esri.com\/arcgis-blog\/author\/ajoshi"}]}},"text_date":"March 11, 2025","author_name":"Multiple Authors","author_page":"https:\/\/www.esri.com\/arcgis-blog\/products\/arcgis-pro\/geoai\/use-vision-language-models-to-optimize-object-classification","custom_image":"https:\/\/www.esri.com\/arcgis-blog\/app\/uploads\/2025\/03\/Snag_1d9460a0.png","primary_product":"ArcGIS Pro","tag_data":[],"category_data":[{"term_id":770712,"name":"AI","slug":"geoai","term_group":0,"term_taxonomy_id":770712,"taxonomy":"category","description":"","parent":0,"count":51,"filter":"raw"}],"product_data":[{"term_id":36561,"name":"ArcGIS 
Pro","slug":"arcgis-pro","term_group":0,"term_taxonomy_id":36561,"taxonomy":"product","description":"","parent":0,"count":2036,"filter":"raw"}],"primary_product_link":"https:\/\/www.esri.com\/arcgis-blog\/?s=#&products=arcgis-pro","_links":{"self":[{"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/blog\/2727322","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/blog"}],"about":[{"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/types\/blog"}],"author":[{"embeddable":true,"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/users\/207622"}],"replies":[{"embeddable":true,"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/comments?post=2727322"}],"version-history":[{"count":0,"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/blog\/2727322\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/media?parent=2727322"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/categories?post=2727322"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/tags?post=2727322"},{"taxonomy":"industry","embeddable":true,"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/industry?post=2727322"},{"taxonomy":"product","embeddable":true,"href":"https:\/\/www.esri.com\/arcgis-blog\/wp-json\/wp\/v2\/product?post=2727322"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}