Unveiling SAM 2: Meta’s New Open-Source Foundation Model for Real-Time Object Segmentation in Videos and Images
In recent years, AI has made remarkable progress in foundation models for language, with developments that have transformed industries from customer service to legal analytics. But when it comes to image processing, we’ve only scratched the surface. The complexity of visual data and the challenge of training models to accurately interpret and analyze images have presented significant obstacles. As researchers continue to explore foundation models for images and videos, the future of image processing in AI holds the potential for innovations in healthcare, autonomous vehicles, and beyond.
Object segmentation, which pinpoints the exact pixels in an image that correspond to an object of interest, is a crucial task in computer vision. Traditionally, it has required building specialized AI models, which demand extensive infrastructure and large amounts of annotated data. Last year, Meta introduced the Segment Anything Model (SAM), an AI foundation model that simplifies this process by allowing users to segment images with a simple prompt. This innovation reduced the need for specialized expertise and extensive computing resources, making image segmentation more accessible.
Now Meta goes a step further with SAM 2. This new iteration not only improves on SAM’s image segmentation capabilities but also extends them to video. SAM 2 can segment any object in both images and videos, even objects it has not encountered before. This advancement is a leap forward in the field of computer vision and image processing, providing a more versatile and powerful tool for analyzing visual content. In this article, we dive into the exciting developments of SAM 2 and explore its potential to redefine the field of computer vision.
Introducing the Segment Anything Model (SAM)
Traditional segmentation methods require either manual refinement, known as interactive segmentation, or extensive annotated data for automatic segmentation into predefined categories. SAM is a foundation model that supports interactive segmentation using versatile prompts such as clicks, boxes, or text input. It can also be fine-tuned with minimal data and compute for automatic segmentation. Trained on a dataset of more than 1 billion image annotations, SAM generalizes to new objects and images without custom data collection or fine-tuning.
SAM works with two main components: an image encoder that processes the image and a prompt encoder that handles input such as clicks or text. These components come together with a lightweight mask decoder to predict segmentation masks. Once the image is encoded, SAM can produce a mask in a web browser in about 50 milliseconds, making it a powerful tool for real-time, interactive tasks.

To build SAM, researchers developed a three-stage data collection process: model-assisted annotation, a combination of automatic and assisted annotation, and fully automatic mask creation. This process resulted in the SA-1B dataset, which contains more than 1.1 billion masks on 11 million licensed, privacy-preserving images, making it 400 times larger than any previously existing segmentation dataset. SAM’s impressive performance stems from this extensive and diverse dataset, which ensures better representation across geographic regions than earlier datasets.
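To make this concrete, here is a minimal sketch of promptable image segmentation with Meta’s open-source segment-anything package. The checkpoint filename and click coordinates are illustrative assumptions, not values from this article:

```python
import numpy as np
import cv2
from segment_anything import SamPredictor, sam_model_registry

# Load a pretrained SAM checkpoint (filename is illustrative; the weights
# are downloadable from the segment-anything repository).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# The heavy image encoder runs once per image...
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# ...after which the lightweight prompt encoder and mask decoder can answer
# each click almost instantly. Label 1 marks a foreground click.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),  # (x, y) pixel coordinates, assumed
    point_labels=np.array([1]),
    multimask_output=True,  # return several candidate masks with scores
)
best_mask = masks[np.argmax(scores)]  # keep the highest-scoring candidate
```

Splitting the pipeline this way, with one expensive image embedding reused across many cheap prompts, is what makes the browser-speed interactive experience possible.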
Unveiling SAM 2: A leap from image to video segmentation
Building on the foundation of SAM, SAM 2 is designed for real-time, promptable object segmentation in both images and videos. Unlike SAM, which focuses exclusively on static images, SAM 2 processes videos by treating each frame as part of a continuous sequence, which lets it handle dynamic scenes and changing content more effectively. For image segmentation, SAM 2 not only improves on SAM’s accuracy but also runs three times faster on interactive tasks.
SAM 2 retains the same architecture as SAM, but introduces a memory mechanism for video processing. This feature allows SAM 2 to maintain information from previous frames, ensuring consistent object segmentation despite changes in motion, lighting, or occlusion. By referencing previous frames, SAM 2 can refine its mask predictions throughout the video.
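As a rough sketch of how this memory-based video workflow looks in code, the snippet below uses the video predictor from Meta’s open-source sam2 package. The config, checkpoint, and frame-directory paths, the object id, and the click coordinates are all illustrative assumptions:

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Config and checkpoint names are illustrative; see the sam2 repository
# for the released model files.
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt"
)

with torch.inference_mode():
    # init_state encodes the frames (here, a directory of JPEGs) and sets up
    # the memory bank that carries object information across the video.
    state = predictor.init_state(video_path="./video_frames")

    # Prompt the target object once with a single foreground click on frame 0.
    predictor.add_new_points_or_box(
        inference_state=state, frame_idx=0, obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),  # 1 = foreground
    )

    # Propagate through the video: each frame's prediction is conditioned on
    # memories of earlier frames, keeping the masklet consistent over time.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()  # binary mask per object
```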
The model is trained on a newly developed dataset, the SA-V dataset, which contains more than 600,000 masklet annotations across 51,000 videos from 47 countries. This diverse dataset covers whole objects as well as their parts, improving SAM 2’s accuracy in real-world video segmentation.
SAM 2 is available as an open-source model under the Apache 2.0 license, making it accessible for a variety of purposes. Meta has also released the dataset used for SAM 2 under a CC BY 4.0 license. Moreover, a web-based demo allows users to explore the model and see how it performs.
Potential use cases
SAM 2’s capabilities in real-time, promptable object segmentation for images and videos open up a wide range of applications. Some examples:
- Healthcare diagnostics: SAM 2 can significantly improve real-time surgical assistance by segmenting anatomical structures and identifying abnormalities during live video feeds in the operating room. It can also improve medical imaging analysis by providing accurate segmentation of organs or tumors in medical scans.
- Autonomous vehicles: SAM 2 can enhance autonomous vehicle systems by improving the accuracy of object detection through continuous segmentation and tracking of pedestrians, vehicles and road signs across video frames. The ability to handle dynamic scenes also supports adaptive navigation and collision avoidance systems by recognizing and responding to changes in the environment in real time.
- Interactive media and entertainment: SAM 2 can enhance augmented reality (AR) applications by accurately segmenting objects in real time, making it easier for virtual elements to blend into the real world. It also benefits video editing by automating object segmentation in footage, simplifying processes such as background removal and object replacement.
- Environmental monitoring: SAM 2 can assist in wildlife tracking by segmenting and monitoring animals in video footage, supporting species research and habitat studies. In disaster response, it can evaluate damage and guide response efforts by accurately segmenting affected areas and objects in video feeds.
- Retail and e-commerce: SAM 2 can improve product visualization in e-commerce by enabling interactive segmentation of products in images and videos, allowing customers to view items from different angles and contexts. For inventory management, it helps retailers track and segment products on shelves in real time, streamlining stock counts and improving overall inventory management.
Overcoming the limitations of SAM 2: practical solutions and future improvements
Although SAM 2 performs well with images and short videos, it has limitations that should be taken into account in practical use. It may struggle to track objects through drastic viewpoint changes, long occlusions, or crowded scenes, especially in extended videos. Manual correction with interactive clicks can help recover a lost target.
In crowded environments with similar-looking objects, SAM 2 can occasionally misidentify targets, but additional refinement clicks in later frames can resolve this, as sketched below. And although SAM 2 can segment multiple objects simultaneously, its efficiency decreases because each object is processed separately; future updates may benefit from shared contextual information across objects.
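Continuing the hypothetical video-predictor sketch above, such a correction might look like the following; the frame index, object id, and coordinates are again assumptions:

```python
# After inspecting the propagated masks, add a corrective foreground click
# on a later frame where the tracker drifted to a look-alike object.
predictor.add_new_points_or_box(
    inference_state=state, frame_idx=120, obj_id=1,
    points=np.array([[340, 260]], dtype=np.float32),
    labels=np.array([1], dtype=np.int32),  # 1 = foreground, 0 = background
)

# Re-propagate so the new prompt, together with the memory bank,
# updates the masklet from this point onward.
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    ...
```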
SAM 2 can also miss fine details on fast-moving objects, and predictions can be unstable across frames. However, further training could overcome this limitation. Although automatic annotation generation has improved, human annotators are still needed for quality checks and frame selection, and further automation could increase efficiency.
The bottom line
SAM 2 represents a significant leap forward in real-time object segmentation for both images and videos, building on the foundation laid by its predecessor. By improving capabilities and extending functionality to dynamic video content, SAM 2 promises to transform a variety of fields, from healthcare and autonomous vehicles to interactive media and retail. While challenges remain, especially when dealing with complex and busy scenes, the open-source nature of SAM 2 encourages continuous improvement and adaptation. With its powerful performance and accessibility, SAM 2 is poised to drive innovation and expand capabilities in computer vision and beyond.