Semantic Content Extraction, Storage and Querying of Visual, Audio, and Text Data in Videos (METU-MMDS)
Dataset
We download news videos (18,000 seconds in total) from the NTV news archive and categorize them into five categories: accident, military, natural disaster, sport, and politics. We also create a concept list that is a subset of the LSCOM concepts (see the tables below). The shot boundaries, keyframes, visual objects/concepts, audio concepts, and subtitle texts are then annotated manually for all of the video clips. Briefly, each video is split into shots such that each shot depicts roughly the same scene, and one or more keyframes are extracted per shot. For each keyframe, the concepts related to the scene are annotated using the previously created concept list, and each audio segment is assigned an appropriate audio concept. Finally, the subtitles of the shots are extracted manually and converted into named entities, and a set of important words is selected from the subtitles. In addition to this manual ground truth, we create a second dataset by integrating the various modules and analyzing the same shots automatically; in this dataset, the concept scores of each shot are computed automatically for each modality (a sketch of one possible shot-record layout follows the tables).
Visual Concepts
Basketball_Ball | Football_Ball | Airplane | Bus | Fire | Gun | Motorcycle | Tennis_Net |
Basketball_Field | Football_Field | Ambulance | Camera | Fire_Truck | Helicopter | Mountain | Tennis_Ball |
Basketball_Hoop | Football_Player | Bicycle | Car | Flag | Ice_Pist | Person | Tennis_Court |
Basketball_Player | Football_Referee | Bridge | Cloud | Goalpost | Ice_Skater | Person_Front | Tennis_Player |
Basketball_Referee | Armed_Person | Building | Desert | Greenery | Missile | Person_Side | Tennis_Racket |
Race_Car | Radar | Road | Sky | Smoke | Snow | Tank | Water |
Tree | |||||||
Audio Concepts
Emergency_Alarm | Car_Horn | Gun | Bomb | Automobile | Motorcycle | Helicopter | Wind |
Water | Rain | Applause | Crowd | Laughter | Outdoor | Nature | Meeting |
Violence | |||||||
Text Concepts
Brazil | UN | Injury | Voting | Operation | Impossible | Traffic | Basketball |
USA | Kaddafi | Accident | Erdogan | Cease fire | Casualty | Disaster | Football |
11 September | Bahrain | Car | International | Conflict | Terror | Homeless | Victory |
Alcohol | Japan | Contest | Intervene | War | Suicide | Minister | Derby |
United Nations | TSK (Turkish Armed Forces) | Rocket | Agreement | Target | Volcano | Goal | Valencia |
Libya | CHP (Party name) | Enemy | Confirmation | Destroy | Flood | Tennis | Arsenal |
China | AKP (Party name) | Violence | Flying | Fire | Earthquake | Tournament | Real Madrid |
Germany | Besiktas (Football team) | Vehicle | Forbidden | Missile | Person | NBA | Formula 1 |
Iran | MHP (Party name) | Death | Aid | Bomb | Politic | Smash | Power |
Italy | Hidayet (Player name) | Bus | Precaution | Defense | Police | Match | Fly |
Russia | BDP (Party name) | Attack | Country | Headquarter | Selection | Star | America |
France | Fenerbahce (Football team) | Army | Civilian | Champion | Region | League | Barcelona |
England | Galatasaray (Football team) | Final | Parliament |
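To make the annotation layout concrete, the sketch below shows one plausible way to represent a single annotated shot, assuming one record per shot with its boundaries, keyframes, and per-modality labels. The class and field names (e.g., ShotAnnotation, visual_concepts, named_entities) are illustrative, not the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ShotAnnotation:
    """One manually annotated shot, mirroring the dataset description above.

    Field names are illustrative; the actual METU-MMDS schema is not given
    in this document.
    """
    video_id: str
    start_sec: float                                            # shot boundary (start)
    end_sec: float                                              # shot boundary (end)
    keyframes: List[str] = field(default_factory=list)          # paths to extracted keyframes
    visual_concepts: List[str] = field(default_factory=list)    # from the visual concept list
    audio_concept: str = ""                                     # one audio concept per segment
    named_entities: List[str] = field(default_factory=list)     # extracted from subtitles
    important_words: List[str] = field(default_factory=list)    # selected subtitle words

# Example record for a sports shot (values are made up for illustration):
shot = ShotAnnotation(
    video_id="ntv_sports_0042",
    start_sec=12.4,
    end_sec=18.9,
    keyframes=["ntv_sports_0042/kf_001.jpg"],
    visual_concepts=["Basketball_Player", "Basketball_Hoop", "Person"],
    audio_concept="Crowd",
    named_entities=["Hidayet", "NBA"],
    important_words=["match", "final"],
)
```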
Project scope
In this project, the semantic content of videos is extracted automatically from their visual, audio, and text data (i.e., from multiple modalities), stored in an appropriate format, and served by a prototype system that can answer queries efficiently. A new video uploaded to the system is first pre-processed to obtain its visual, audio, and text data. Three separate modules, one per modality, extract the semantic content of these data. The information obtained from the three modules is then analyzed and integrated: missing data are inferred and duplicate data are removed, which prepares the data for storage in the database. Finally, a fusion process is applied to the integrated data.

The fused data obtained from a video are stored in the Intelligent Fuzzy Object-Oriented Database System, which was developed earlier by the researchers in a TUBITAK 1001 project. This system consists mainly of a fuzzy knowledge base and a fuzzy object-oriented database. Within the scope of this project, large volumes of multimedia data are stored in the object-oriented database, and new semantic information is derived by applying domain-specific rules in the knowledge base to the stored data. Additionally, an index structure is developed to answer queries over both the semantic content and the low-level features. The proposed system can also process fuzzy and uncertain data.
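The document does not describe the fusion formula itself, so the sketch below assumes a simple weighted linear combination of per-modality concept scores. The MODALITY_WEIGHTS values and the fuse_scores helper are illustrative placeholders, not the project's actual fusion module.

```python
from typing import Dict

# Hypothetical per-modality weights; the actual fusion strategy used by
# METU-MMDS is not described in this document.
MODALITY_WEIGHTS = {"visual": 0.5, "audio": 0.2, "text": 0.3}

def fuse_scores(scores_per_modality: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Combine per-modality concept scores into one fused score per concept.

    `scores_per_modality` maps a modality name ("visual", "audio", "text")
    to a {concept: score-in-[0, 1]} dict. A concept missing from a modality
    simply contributes nothing from that modality.
    """
    fused: Dict[str, float] = {}
    for modality, scores in scores_per_modality.items():
        weight = MODALITY_WEIGHTS.get(modality, 0.0)
        for concept, score in scores.items():
            fused[concept] = fused.get(concept, 0.0) + weight * score
    return fused

# Example: a shot where the visual module detects a tank, the audio module
# hears gunfire-like sounds, and the subtitle module finds the word "war".
fused = fuse_scores({
    "visual": {"Tank": 0.9, "Smoke": 0.6},
    "audio": {"Gun": 0.8},
    "text": {"War": 1.0, "Tank": 0.7},
})
print(fused)  # {'Tank': 0.66, 'Smoke': 0.3, 'Gun': 0.16, 'War': 0.3}
```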
The main contribution of this project is fusing the different modalities (visual, audio, and text) obtained from a video and thereby creating a more complete semantic data structure that can be stored in a database and queried effectively.
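As one way to picture how such fuzzy scores can be queried, the sketch below filters shots by an alpha-cut over a membership function. The trapezoidal breakpoints and the query_shots helper are assumptions for illustration, not the system's actual fuzzy query semantics.

```python
from typing import Dict, List, Tuple

def high_membership(score: float, low: float = 0.4, high: float = 0.8) -> float:
    """Fuzzy membership for 'concept is clearly present': 0 below `low`,
    1 above `high`, linear in between. The breakpoints are illustrative."""
    if score <= low:
        return 0.0
    if score >= high:
        return 1.0
    return (score - low) / (high - low)

def query_shots(shots: List[Tuple[str, Dict[str, float]]],
                concept: str, alpha: float = 0.5) -> List[str]:
    """Return shot IDs whose fused score for `concept` has membership >= alpha
    (an alpha-cut, a standard way to turn a fuzzy query into a crisp result)."""
    return [shot_id for shot_id, fused in shots
            if high_membership(fused.get(concept, 0.0)) >= alpha]

# Example: find shots that clearly contain a tank.
shots = [("shot_01", {"Tank": 0.66, "War": 0.3}),
         ("shot_02", {"Tank": 0.35}),
         ("shot_03", {"Tank": 0.9})]
print(query_shots(shots, "Tank"))  # ['shot_01', 'shot_03']
```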
In addition, the results of the project are considered to fill a significant gap in the academic literature. During the project, 7 journal papers and 21 conference papers (19 international, 2 national), 28 publications in total, were published. The project also gave 4 Ph.D. and 6 M.S. students, who took responsibility during different terms of the project, the opportunity to work on and complete their theses.
This project was supported by TUBITAK under the Scientific and Technological Research Projects Support Program (grant number 109E014).
The above demo video shows how to use METU-MMDS for (i) extracting semantic content from videos, and (ii) querying multimedia data using various types of queries.