Semantic Content Extraction, Storage and Querying of Visual, Audio, and Text Data in Videos (METU-MMDS)


We downloaded news videos (18,000 seconds in total) from the NTV news archives and categorized them into five categories: accident, military, natural disaster, sport, and politics. We also created a concept list that is a subset of the LSCOM concepts (following table). The shot boundaries, keyframes, visual objects/concepts, audio concepts, and subtitle texts were then annotated manually for all of the video clips. In brief, each video is split into shots such that each shot depicts roughly the same scene, and keyframes are extracted for each shot. For each keyframe, the concepts related to the scene are annotated using the previously created concept list, and for each audio segment an appropriate audio concept is assigned. Finally, the subtitles of the shots are extracted manually and converted to named entities, and some subtitle words are selected as important words. In addition, by integrating the various modules and analyzing these shots automatically, we create a dataset in which a concept score is computed automatically for each modality of each shot.
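The per-shot annotation described above can be sketched as a simple record; this is a minimal illustration, and the class and field names are assumptions for the sketch, not the project's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record for one annotated shot; field names are
# illustrative only, not the project's actual data model.
@dataclass
class ShotAnnotation:
    shot_id: int
    start_sec: float                                           # shot boundary (start)
    end_sec: float                                             # shot boundary (end)
    keyframes: List[str] = field(default_factory=list)         # keyframe image paths
    visual_concepts: List[str] = field(default_factory=list)   # e.g. "Car", "Road"
    audio_concept: str = ""                                    # one concept per audio segment
    named_entities: List[str] = field(default_factory=list)    # extracted from subtitles
    important_words: List[str] = field(default_factory=list)   # selected subtitle words

# Example: a shot annotated with several visual concepts, one audio
# concept, and subtitle-derived text annotations.
shot = ShotAnnotation(
    shot_id=1, start_sec=0.0, end_sec=4.2,
    keyframes=["shot1_kf1.jpg"],
    visual_concepts=["Car", "Road", "Smoke"],
    audio_concept="Car_Horn",
    named_entities=["Erdogan"],
    important_words=["Accident", "Traffic"],
)
print(shot.visual_concepts)
```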

Visual Concepts
Basketball_Ball Football_Ball Airplane Bus Fire Gun Motorcycle Tennis_Net
Basketball_Field Football_Field Ambulance Camera Fire_Truck Helicopter Mountain Tennis_Ball
Basketball_Hoop Football_Player Bicycle Car Flag Ice_Pist Person Tennis_Court
Basketball_Player Football_Referee Bridge Cloud Goalpost Ice_Skater Person_Front Tennis_Player
Basketball_Referee Armed_Person Building Desert Greenery Missile Person_Side Tennis_Racket
Race_Car Radar Road Sky Smoke Snow Tank Water
Audio Concepts
Emergency_Alarm Car_Horn Gun Bomb Automobile Motorcycle Helicopter Wind
Water Rain Applause Crowd Laughter Outdoor Nature Meeting
Text Concepts
Brazil UN Injury Voting Operation Impossible Traffic Basketball
USA Kaddafi Accident Erdogan Cease fire Casualty Disaster Football
11 September Bahrain Car International Conflict Terror Homeless Victory
Alcohol Japan Contest Intervene War Suicide Minister Derby
United Nations TSK (Turkish Armed Forces) Rocket Agreement Target Volcano Goal Valencia
Libya CHP (Party name) Enemy Confirmation Destroy Flood Tennis Arsenal
China AKP (Party name) Violence Flying Fire Earthquake Tournament Real Madrid
Germany Besiktas (Football team) Vehicle Forbidden Missile Person NBA Formula 1
Iran MHP (Party name) Death Aid Bomb Politic Smash Power
Italy Hidayet (Player name) Bus Precaution Defense Police Match Fly
Russia BDP (Party name) Attack Country Headquarter Selection Star America
France Fenerbahce (Football team) Army Civilian Champion Region League Barcelona
England Galatasaray (Football team) Final Parliament

Project Scope

In this project, the semantic content of videos is extracted automatically from their visual, audio, and text data (multi-modal), stored in an appropriate format, and a prototype system is developed that can answer queries efficiently. A new video uploaded to the system is first pre-processed to obtain its visual, audio, and text data. Three separate modules, one per modality, extract the semantic content of the visual, audio, and text data. The information obtained from these three modules is then analyzed and integrated: missing data are inferred and duplicate data are removed, which prepares the data for storage in the database. Finally, a fusion process is applied. The fused data obtained from the video are stored in the Intelligent Fuzzy Object-Oriented Database System, which was previously developed by the researchers in a TUBITAK 1001 project. The Intelligent Fuzzy Object-Oriented Database System mainly consists of a fuzzy knowledge base and a fuzzy object-oriented database. Within the scope of this project, large multimedia data are stored in the object-oriented database, and new semantic information is derived by applying domain-specific rules in the knowledge base to the stored data. Additionally, an index structure is developed to answer queries on both the semantic content and the low-level features. The proposed system can also process fuzzy and uncertain data.
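The fusion step described above can be illustrated with a score-level combination of per-modality concept scores. This is a minimal sketch assuming a weighted-average rule; the weights and the rule itself are assumptions for illustration, not the system's actual fusion method:

```python
# Illustrative score-level fusion of concept scores produced by the
# visual, audio, and text modules. The weights (0.5, 0.3, 0.2) and the
# weighted-average rule are assumptions for this sketch.
def fuse_scores(visual, audio, text, weights=(0.5, 0.3, 0.2)):
    """Combine three concept->score dicts into one fused dict."""
    wv, wa, wt = weights
    concepts = set(visual) | set(audio) | set(text)
    return {
        c: (wv * visual.get(c, 0.0)
            + wa * audio.get(c, 0.0)
            + wt * text.get(c, 0.0))
        for c in concepts
    }

# Example: a shot whose modules score some overlapping concepts.
visual = {"Fire": 0.8, "Smoke": 0.6}
audio  = {"Fire": 0.4}
text   = {"Fire": 0.9, "Disaster": 0.7}
fused = fuse_scores(visual, audio, text)
print(round(fused["Fire"], 2))  # 0.5*0.8 + 0.3*0.4 + 0.2*0.9 = 0.7
```

A concept missing from a modality simply contributes a score of zero, so evidence from the remaining modalities still carries through to the fused result.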

The main contribution of this project is fusing the different modalities (visual, audio, and text) obtained from a video and thereby creating a more complete semantic data structure that can be stored in a database and queried effectively.
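Querying such fused data with fuzzy criteria can be sketched as selecting shots whose score for a concept is "high" to at least some degree. The ramp-shaped membership function and the threshold below are assumptions for this sketch, not the system's actual query language or index:

```python
# Illustrative fuzzy selection over fused concept scores. The linear
# membership function (ramp from 0.5 to 0.8) and the 0.5 threshold are
# assumptions chosen only to demonstrate the idea of a fuzzy query.
def high(score):
    """Degree to which a score counts as 'high' (0 below 0.5, 1 above 0.8)."""
    if score <= 0.5:
        return 0.0
    if score >= 0.8:
        return 1.0
    return (score - 0.5) / 0.3

def fuzzy_select(shots, concept, threshold=0.5):
    """Return (shot_id, membership) pairs whose membership >= threshold."""
    results = []
    for shot_id, scores in shots.items():
        mu = high(scores.get(concept, 0.0))
        if mu >= threshold:
            results.append((shot_id, round(mu, 2)))
    return results

# Example: only shot1's "Fire" score is high enough to match.
shots = {
    "shot1": {"Fire": 0.9, "Smoke": 0.4},
    "shot2": {"Fire": 0.6},
    "shot3": {"Fire": 0.3},
}
print(fuzzy_select(shots, "Fire"))
```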

In addition, the results of the project are considered to fill a significant gap in the academic literature. During the project, 7 journal papers and 21 conference papers (19 international, 2 national), 28 publications in total, were published. The project also enabled 4 Ph.D. and 6 M.S. students, who took responsibility during different terms of the project, to work on and complete their theses.

This project was supported by TUBITAK under the Scientific and Technological Research Projects Support Program, grant number 109E014.

The above demo video shows how to use METU-MMDS for (i) extracting semantic content from videos, and (ii) querying multimedia data using various types of queries.