Implement illegal dumping detection using AI

Finally, I have found time to write blog posts again. This post is the first part of a series: Implement illegal dumping detection using AI.

Before moving to the details, there are some points I want to share:

  • This series does not focus on developing an AI model. It only focuses on leveraging an off-the-shelf AI model and building an application that can communicate with it.

  • Illegal dumping detection is just an idea I had when building this application; you can completely adjust the code for other purposes.

  • The application is designed to run on edge devices such as a Raspberry Pi or an Nvidia Jetson with a CCTV camera or even a webcam.

In this opening post, instead of covering technical details, I will start with the idea, then the workflow analysis and the technologies that can be used.

Idea

In 2024, we can all see the rise of AI technologies: many AI models have been developed, released, and are out-performing one another, and many AI companies are providing tools that improve our lives. I had the idea of leveraging AI for automatic detection when I was playing with my son on the weekend and saw a mattress dumped on the road, which the city council had to clean up later. That was when I thought, haha, I should build a system to detect this dumping, inform the house owner, and capture the license plate if possible. The license plate can be used as evidence when the house owner reports to the city council. Expanding on this idea, we can build many other applications with automatic detection, which I will share at the end of this series.

Concepts

Large Language Models (LLMs) are deep learning models that are trained on massive datasets; this enables them to recognize, translate, predict, or generate text. A transformer model is the most common architecture of an LLM, and it consists of an encoder and a decoder [1]. The encoder and decoder extract meaning from a sequence of text and understand the relationships between words and phrases [2].

What are modalities? A modality refers to how something happens or is experienced [3], and by incorporating additional modalities into LLMs, we create LMMs (Large Multimodal Models). Multimodal can mean one or more of the following [4]:

  • Input and output are of different modalities (e.g. text-to-image, image-to-text)
  • Inputs are multimodal (e.g. a system that can process both text and images)
  • Outputs are multimodal (e.g. a system that can generate both text and images)

Now, there are many problems that can be solved by utilizing multimodal models. Illegal dumping detection is just the example I want to use in this series. With multimodal input (text and video), we can build an application that detects when illegal dumping is happening and triggers further actions.

Workflow analysis

From the idea, we can break it into four parts:

  • Capture videos from an RTSP/webcam stream

  • Send the captured videos to the API

  • A Vision Language Model (VLM) detects suspicious activities in the captured videos

  • Trigger notifications for positive detections

Workflow
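The four steps above can be sketched in a few lines of Python. Everything here is illustrative: the API endpoint, the response fields (`dumping_detected`, `confidence`), and the confidence threshold are assumptions I am making for the sketch, not the final design.

```python
import json

API_URL = "http://localhost:8000/analyze"  # hypothetical VLM API endpoint


def capture_clip(source=0, seconds=10):
    """Step 1: capture a short clip from an RTSP URL or a webcam index.
    A real implementation would use e.g. OpenCV or ffmpeg (next post)."""
    raise NotImplementedError


def send_clip(path):
    """Step 2: upload the clip to the API, e.g. with requests.post(API_URL, ...)."""
    raise NotImplementedError


def is_positive(response_text, threshold=0.5):
    """Steps 3-4: parse the (assumed) VLM verdict and decide whether to notify."""
    data = json.loads(response_text)
    return bool(data.get("dumping_detected")) and data.get("confidence", 0.0) >= threshold


# A detection above the confidence threshold would trigger a notification.
sample = '{"dumping_detected": true, "confidence": 0.91}'
print(is_positive(sample))  # True
```

The threshold is there because a VLM verdict is probabilistic; tuning it trades false alarms against missed detections.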

Technologies

In this first post of the series, I will not go deep into the technologies that the project will use. However, based on the workflow analysis above, I am going to use:

  • A lightweight model to capture videos from an RTSP/webcam stream.

  • An API application that accepts the video input files.

  • A VLM for video analysis.

  • A simple notification system, in this case, email.
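As a taste of the last item, here is a minimal sketch of building the email alert with Python's standard library. The sender, recipient, and subject format are placeholders I chose for illustration; actually sending the message would use `smtplib.SMTP` with a real server.

```python
from email.message import EmailMessage


def build_alert(confidence, clip_name):
    """Compose the notification email for a positive detection."""
    msg = EmailMessage()
    msg["From"] = "detector@example.com"  # placeholder sender address
    msg["To"] = "owner@example.com"       # placeholder house-owner address
    msg["Subject"] = f"Possible illegal dumping detected ({confidence:.0%})"
    msg.set_content(
        f"A suspicious event was detected in clip {clip_name}. Please review."
    )
    return msg


msg = build_alert(0.91, "clip_2024-06-01.mp4")
print(msg["Subject"])  # Possible illegal dumping detected (91%)
```

Email keeps the notification system dependency-free on an edge device, but the same hook could post to a chat webhook instead.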

Thanks for reading! In the upcoming post, I will discuss the first part: using a lightweight model to capture videos from an RTSP/webcam stream.

References