Keio University

Yoshimitsu Aoki Laboratory, Keio University

Research

Pioneering new domains in pattern recognition and image sensing

Deep learning has become indispensable in image pattern recognition. Building on existing deep learning models, our laboratory pursues research on new architectures and learning methods to further improve recognition accuracy and to realize recognition systems that are highly compatible with humans, and works to visualize and understand the internals of these models. In addition, we pursue new image sensing methods, including image measurement, recognition, and generation, with the aim of pioneering these domains.

Fast Soft Color Segmentation

※Accepted to CVPR 2020: arXiv, OSS
In this study, we address the problem of decomposing a single image into multiple RGBA layers, each containing only similar colors. Our proposed neural network-based method performs this decomposition 300,000 times faster than conventional optimization-based methods. This speed enables novel applications such as video recoloring and compositing.
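
As a rough illustration of why RGBA layers make recoloring simple, the sketch below recombines layers into an image and swaps the color of one layer. It assumes additive mixing (each pixel is the alpha-weighted sum of the layer colors, with alphas summing to 1 per pixel); the function names `composite_layers` and `recolor_layer` are hypothetical and are not taken from the released code.

```python
import numpy as np


def composite_layers(layers):
    """Recombine RGBA layers of shape (K, H, W, 4) into an (H, W, 3) RGB image."""
    layers = np.asarray(layers, dtype=float)
    rgb = layers[..., :3]
    alpha = layers[..., 3:4]
    # Assumed additive mixing: each pixel is the alpha-weighted sum of layer colors.
    return (alpha * rgb).sum(axis=0)


def recolor_layer(layers, index, new_rgb):
    """Return a copy of the layers with one layer's color replaced (alpha kept)."""
    out = np.array(layers, dtype=float)
    out[index, ..., :3] = new_rgb
    return out
```

Because only the color channels of one layer change while its alpha stays fixed, soft boundaries between colors are preserved after recoloring.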

Visual Attribute Manipulation Using Natural Language Commands

In this paper, we tackle a novel setting in which a neural network generates object images with transferred attributes, conditioned on natural language. Conventional methods for object image transformation are known to bridge the gap between visual features by using an intermediate space of visual attributes. This paper builds on that approach and develops an algorithm that precisely extracts information from natural language commands, completing the image translation model. We evaluate the effectiveness of our information extraction model experimentally, with additional tests to check whether the change in visual attributes appears correctly in the image.

Graph Convolutional Neural Networks on superpixels for segmentation

A disadvantage of image segmentation using CNNs is that spatial information is lost through down-sampling in the pooling layers, which reduces segmentation accuracy near object contours. We therefore proposed graph convolution on superpixels as a different approach that prevents the loss of information caused by pooling. In addition, we proposed Dilated Graph Convolution, an extension of graph convolution that enlarges the receptive field more effectively. In a segmentation task on the HKU-IS dataset, the proposed method outperformed a conventional CNN with the same configuration.
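
For readers unfamiliar with graph convolution, the following minimal sketch shows one commonly used formulation, in which each superpixel node averages features over itself and its neighbors before a learned linear map. This is a generic layer shown under our own assumptions (symmetric normalization, ReLU activation); it is not the laboratory's exact layer and does not implement Dilated Graph Convolution. The name `graph_conv` is hypothetical.

```python
import numpy as np


def graph_conv(features, adjacency, weight):
    """One generic graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 X W).

    features:  (N, F_in) one feature vector per superpixel node
    adjacency: (N, N) symmetric 0/1 matrix of neighboring superpixels
    weight:    (F_in, F_out) learned projection
    """
    features = np.asarray(features, dtype=float)
    adjacency = np.asarray(adjacency, dtype=float)
    weight = np.asarray(weight, dtype=float)
    # Add self-loops so each node keeps its own features.
    a_hat = adjacency + np.eye(adjacency.shape[0])
    deg = a_hat.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    norm_adj = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm_adj @ features @ weight, 0.0)
```

Since the node count stays fixed from layer to layer, no pooling step discards spatial detail, which is the motivation stated above.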

Superpixel pooling for segmentation

Saliency map generation in image discrimination using a CNN classifier

In general, when an image is input to a CNN and a specific output is obtained, it is difficult to explain why that output was obtained. In this study, we propose a saliency map generation method that applies the Generative Adversarial Network (GAN) framework, in which two neural networks learn while competing with each other. The first network learns to classify images. The second network takes an image that the first network classifies correctly and learns to create a similar image that the first network classifies incorrectly. To generate such an image efficiently, it suffices for the second network to significantly change the image regions that are important to the first network's classification. The result of this learning can be regarded as a saliency map, since it explicitly outputs the image regions that are important for classification.

Saliency map generation for image classification task by using GAN

Simultaneous execution of color adjustment and image completion by GAN

In this study, we address natural paste synthesis, which requires both color adjustment and image completion, and propose a method that performs image completion and context-aware color adjustment at the same time. To make the inserted object image appear explicitly in the completed area, we use a CNN and Generative Adversarial Networks (GAN) for context-aware completion, extracting context features from the entire background image. These context features are used not only for image completion but also for color adjustment. In this way, we realize a network that solves the problems of color adjustment and image completion simultaneously.

Results of color adjustment and image completion by GAN

Image Completion of 360-Degree Images by cGAN

This work proposes a novel problem setting in which a known area of a 360-degree image is used as input and the remainder of the image is completed with GANs. To this end, we propose a two-stage generation approach using a network architecture with series-parallel dilated convolution layers. Moreover, we present how to rearrange images for data augmentation, simplify the problem, and create inputs for training the second-stage generator. Our experiments show that these methods reproduce the distortion seen in 360-degree images in the outlines of buildings and roads, and that their boundaries are clearer than those of baseline methods. Furthermore, we discuss and clarify the difficulty of the proposed problem. Our work is a first step towards GANs predicting an unseen area within a 360-degree space.

Image control after imaging by epsilon photography reconstruction using compressed sensing

Conventionally, the photographer must select many camera parameters at the time of shooting. Light field imaging has enabled control of focus position and viewpoint after shooting, but its resolution is low, it requires specialist hardware, and it does not allow completely flexible restoration of focus position and aperture size. This study develops technology that restores images for various parameter settings from ten images shot in succession with a conventional camera while varying parameters such as focus position, aperture size, exposure time, and ISO. For example, we completely reconstruct the high dynamic range focus-aperture stack using consecutively shot images with pre-set parameters as the input.

Idea of Epsilon Photography

Human movement analysis / behavior recognition technology

Our laboratory has developed representations for modeling human form and motion with high accuracy and efficiency, and has advanced research on human recognition by machine learning. We are promoting research on robust human detection and tracking from images, posture estimation, motion analysis and prediction technology, and wide-ranging applications of these.

Daily activity recognition using human-object probability maps

Daily activities that last from several minutes to hours often include several primitive actions, which hinders action recognition from video input. In this research, we observed “what”, “where”, “when”, and “how” humans performed daily activities and used these features to better recognize the activities. We introduced Human-Object Maps (HOMs), which represent probability maps of where humans and objects were, and used them as features for activity recognition. We evaluated the effectiveness of these maps using a dataset we created, which consists of several daily activities performed throughout the lab.
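
The page does not describe how the maps are computed. Purely as an illustration of what a probability map of positions can look like, the sketch below accumulates detected center points into a coarse grid and normalizes the counts; the grid resolution, the use of center points, and the name `occupancy_map` are all assumptions.

```python
import numpy as np


def occupancy_map(positions, grid_shape, frame_shape):
    """Build a normalized map of where detections occurred.

    positions:   iterable of (x, y) pixel centers of a human or an object
    grid_shape:  (rows, cols) of the output map
    frame_shape: (height, width) of the video frame in pixels
    """
    rows, cols = grid_shape
    height, width = frame_shape
    counts = np.zeros(grid_shape, dtype=float)
    for x, y in positions:
        # Clamp so that points on the far border fall into the last cell.
        r = min(int(y / height * rows), rows - 1)
        c = min(int(x / width * cols), cols - 1)
        counts[r, c] += 1.0
    total = counts.sum()
    return counts / total if total > 0 else counts
```

One such map per human or object class, computed over the duration of an activity, could then be stacked and passed to a classifier.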

Fine-grained action segmentation for industrial production line

In this research, we tackle the fine-grained action recognition task using videos of workers taken on production lines. Unlike general datasets for action recognition, our dataset consists of fine-grained actions in which the worker uses only the hands and arms. This makes it difficult to detect frame-wise actions using the RGB image of each frame as input. To tackle this problem, we focused on combining pose features and hand features. Pose features capture the movements of the arms, and hand features capture the movements of the hands and also provide information about which tool the worker is using. Using our original dataset, we showed that our model achieves a high recognition rate and is robust to variations in environments and workers.

Human re-identification by distance metric learning using CNN

We proposed a new method to re-identify people by learning the similarity of people in videos with a convolutional neural network. The convolutional neural network extracts features from each human image in the video, and the embedding vectors are mapped to Euclidean space so that the distance between them directly corresponds to the distance measure between people. Parameters are updated after taking into account all triplets that can be formed within the mini-batch, using an improved parameter learning method called Entire Triplet Loss. This simple change to the parameter update method greatly improved the generalization performance of the network and made the embedding vectors easier to separate for each person. In evaluation experiments, the method achieved a state-of-the-art re-identification rate on an international dataset.
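
To make the idea of using every triplet in a mini-batch concrete, here is a minimal sketch of a triplet loss evaluated over all valid (anchor, positive, negative) combinations. The margin value, the hinge form, and averaging over valid triplets are assumptions; the exact definition of Entire Triplet Loss is not given on this page, and `batch_all_triplet_loss` is a hypothetical name.

```python
import numpy as np


def batch_all_triplet_loss(embeddings, labels, margin=0.2):
    """Mean hinge triplet loss over every valid triplet in the mini-batch."""
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all embeddings, shape (N, N).
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    same = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)
    pos_mask = same & not_self   # (anchor, positive): same person, different sample
    neg_mask = ~same             # (anchor, negative): different person
    # loss[a, p, n] = max(d(a, p) - d(a, n) + margin, 0)
    loss = np.maximum(dist[:, :, None] - dist[:, None, :] + margin, 0.0)
    valid = pos_mask[:, :, None] & neg_mask[:, None, :]
    n_valid = valid.sum()
    return float((loss * valid).sum() / n_valid) if n_valid else 0.0
```

Compared with sampling one triplet per anchor, this uses every comparison available in the batch for a single parameter update.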

Architecture of proposed CNN

On-line multiple object tracking using re-identification of tracking trajectory

Many existing methods for online multiple object tracking adopt a tracking-by-detection approach that associates, in temporal order, the object bounding boxes obtained by running object detection on every frame of a video. However, existing methods cannot track objects that the object detector misses because of occlusion or similar causes. Therefore, we propose a method that returns a lost object to the tracking state through re-identification of tracking trajectories. Embedding vectors expressing the high-dimensional appearance features of an object are obtained with a convolutional neural network, and trajectories are re-identified according to the distance between the embedding vectors of the trajectories. Using the mask image of the object obtained by segmentation as the network input makes the re-identification decision robust to changes in background. Furthermore, since the re-identification decision for a pair of trajectories is based on the distance between low-dimensional vectors, the increase in computational cost due to this decision is small.
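
The re-identification decision described above amounts to a nearest-neighbor test on embedding distance. The sketch below shows one simple way this could look; the threshold value, the use of one representative embedding per lost trajectory, and the name `reidentify` are assumptions rather than details of the proposed system.

```python
import numpy as np


def reidentify(lost_tracks, new_embedding, threshold=0.5):
    """Match a new trajectory to a lost one by embedding distance.

    lost_tracks:   dict mapping track id -> representative embedding vector
    new_embedding: embedding vector of the newly started trajectory
    Returns the id of the closest lost track within the threshold, or None.
    """
    new_embedding = np.asarray(new_embedding, dtype=float)
    best_id, best_dist = None, threshold
    for track_id, emb in lost_tracks.items():
        d = float(np.linalg.norm(np.asarray(emb, dtype=float) - new_embedding))
        if d < best_dist:
            best_id, best_dist = track_id, d
    return best_id
```

Each comparison is a single vector norm, which is consistent with the small added computational cost mentioned above.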

Process flow of online tracking system

Time sequence behavior recognition in a behavior transition video

In this study, we addressed behavior recognition in videos in which multiple actions transition continuously, and proposed various time series analyses using a hierarchical LSTM. In addition, centering on posture information, which is robust to environmental change, we proposed the effective use of peripheral features through incremental learning of peripheral information and filtering of peripheral features by posture characteristics. We obtained an improvement over the conventional method on behavior recognition tasks conducted on a dataset of behavior transition videos.

Action recognition system by using hierarchical LSTM

Calibration free gaze estimation

Many existing gaze estimation methods use special equipment such as infrared LEDs and distance sensors, or require prior calibration work. In this study, we propose a calibration-free gaze point estimation method for cameras that can be used with a wide range of head positions, in order to realize a gaze estimation method suitable for practical use in society. Based on a robust iris tracking method independent of resolution, we demonstrate the possibility of realizing calibration-free gaze estimation in an extensive space using a gaze estimation method consisting of facial feature point detection, iris tracking, and gazing point estimation. We aim to apply this to various fields.

Calibration-free gaze estimation system

Video-Text Retrieval for understanding more complex activities in videos

We aim to develop a more advanced ‘video-text retrieval/detection’ system that learns matching and alignment between videos/moments and their textual descriptions.
Our research goals are mainly twofold:
1. Video Moment Retrieval (Natural Language Moment Localization): Given a textual query, we search for the corresponding temporal moment within a video.
2. Video-text retrieval with a self-supervised word-like 3D unit discovery in a video.

Retrieving and Highlighting Action with Spatiotemporal Reference

In this paper, we present a framework that jointly retrieves and spatiotemporally highlights actions in videos by enhancing current deep cross-modal retrieval methods. Our work takes on the novel task of action highlighting, which visualizes where and when actions occur in an untrimmed video setting. Leveraging weak supervision from annotated captions, our framework acquires spatiotemporal relevance maps and generates local embeddings which relate to the nouns and verbs in captions.

Moment-Sentence Grounding from Temporal Action Proposal

Deep learning in Vision and Language, one of the challenging tasks in multi-modal learning, has been gaining more attention recently.
In this paper, we tackle temporal moment retrieval. Given an untrimmed video and a description query, temporal moment retrieval aims to localize the temporal segment within the video that best matches the textual query. Our approach is based on a two-stage model. First, temporal proposals are obtained using an existing temporal action proposal method. Second, the best proposal is predicted from the similarity score between visual features and linguistic features.
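
The second stage can be pictured as ranking proposals by a similarity score. The sketch below uses cosine similarity between pre-computed feature vectors in a shared space; the choice of cosine similarity, the shared embedding space, and the name `best_proposal` are assumptions for illustration only.

```python
import numpy as np


def best_proposal(proposals, visual_feats, text_feat):
    """Pick the temporal proposal whose visual feature best matches the query.

    proposals:    list of (start_sec, end_sec) segments from the proposal stage
    visual_feats: (P, D) one feature vector per proposal
    text_feat:    (D,) feature vector of the textual query
    """
    visual_feats = np.asarray(visual_feats, dtype=float)
    text_feat = np.asarray(text_feat, dtype=float)
    # Cosine similarity between each proposal and the query.
    v = visual_feats / np.linalg.norm(visual_feats, axis=1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    scores = v @ t
    idx = int(np.argmax(scores))
    return proposals[idx], float(scores[idx])
```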

Sports video analysis

In sports, quantitative play analysis from images is important in improving the level of competition and motivation of athletes, supporting coaching, and providing new broadcast video content. Our laboratory is conducting research on methods and systems for sports image analysis that can be put to practical use in various sports.

Shot Detection in Tennis Games

In this study, we propose a video-based shot detection method for tennis games. Conventional methods depend on detecting the ball, whereas our proposed method recognizes the ball, rackets, and players, achieving higher precision. Shot detectors for other racquet sports, as well as further analytics providing features such as shot classification, rally analysis, and recommendations, can easily be built on our proposed solution. Moreover, by adapting it to interactions between humans and objects, such as touching and tapping, our method can also be applied to user interface devices.

Result of shot detection
(Lower left graph describes shots of the front player and lower right graph describes shots of the back player)

Rugby video analysis system

In this study, we developed technology that accurately maps the movement trajectories of the ball and players from a single camera image onto a 2D field, using hybrid image analysis that combines ball detection and tracking based on hand-designed features with player detection and tracking based on deep learning. In addition, we classified plays automatically by deep learning and studied automation of the tagging work of the main players, which was conventionally done manually. This technology is applicable not only to rugby but also to various other sports, and it is expected to be adopted in applications beyond sports, such as industrial fields.
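
Mapping image positions onto a 2D field is commonly done with a planar homography when the field can be treated as flat. The page does not state which mapping the system uses, so the sketch below is only an illustration under that assumption; the 3x3 matrix would have to be estimated separately (for example from field line correspondences), and `image_to_field` is a hypothetical name.

```python
import numpy as np


def image_to_field(points, homography):
    """Project image pixel coordinates onto field coordinates.

    points:     (N, 2) pixel positions, e.g. the feet of tracked players
    homography: (3, 3) matrix mapping image plane to field plane (assumed known)
    """
    pts = np.asarray(points, dtype=float)
    h = np.asarray(homography, dtype=float)
    # Lift to homogeneous coordinates, transform, then divide by the scale term.
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ h.T
    return mapped[:, :2] / mapped[:, 2:3]
```

Applying this to the tracked position in every frame yields a trajectory in field coordinates.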

Automatic player tracking and play classification for Rugby

American football video analysis system

Occlusion of players by other players is particularly common in American football and other team sports, and players move in many different patterns. We determine play time from videos of American football using a Global Motion Feature that captures information such as player positions and motion over the entire field. Furthermore, after calculating two distinctive positions, the play start and end positions, and classifying plays into pass, run, and kick, which are the play patterns of American football, we realize a method of estimating the ball trajectory, which carries important information for game analysis, without detecting the ball itself. This enables the acquisition of play time, play classification, and ball trajectory information, and the automatic creation of a game analysis database.

Play scene analysis for American football videos

Swimmer tracking system

We propose an athlete tracking and stroke estimation method for videos of swimming events, which is robust to splashing and other noise and does not depend on the shooting environment. Athletes are detected in the video and tracked. Feature values are obtained by inputting the detected athlete images into a CNN. The stroke is then estimated by forming a temporal sequence from the obtained feature values and inputting it into a Multi-LSTM. Based on the resulting athlete position and stroke information, we visualize the athlete's speed and stroke and superimpose them on the video, with the aim of heightening the sense of presence in live broadcasts.

Swimmer tracking and stroke estimation by using Multi-LSTM
(Left: Estimation of stroke signal, Right: Visualization of velocity and stroke information)

Intelligent robotics

To date, robots have performed various services based on careful instructions. In our laboratory, we are conducting R&D on an intelligent robot that behaves appropriately by observing the situation and people, using real-time human behavior recognition, object/environment recognition technology and past action logs to obtain various forms of “awareness”.

Object Manipulation Task Generation by Self-supervised Learning of Stationary Layouts

We propose a method for generating the operations needed to restore the stationary layout of objects from an input image that shows a non-stationary layout. The proposed method is an encoder-decoder type network with special layers for estimating a list of operations, each of which includes the operation type, object class, and position. The network can be trained in a self-supervised manner. An operation generation experiment using real images confirmed that our method can generate, in real time, the operations that change the object layout in an input scene to a stationary layout.

Task-oriented Function Detection based on an operational task

We propose Task-oriented Function, a novel representation for describing the functions of an object, which takes the place of Affordance in the field of Robotics Vision. We also propose a CNN-based network to detect Task-oriented Function. This network takes an operational task as well as an RGB image as input and assigns each pixel an appropriate label for each task. Because the outputs of the network differ depending on the task, Task-oriented Function makes it possible to describe a variety of ways to use an object. We introduce a new dataset for Task-oriented Function, which contains about 1200 RGB images and 6000 annotations assuming five tasks. Our proposed method reached a mean IoU of 0.80 on our dataset.
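
For reference, mean IoU (intersection over union) for a pixel-wise labeling can be computed as in the sketch below. How the reported 0.80 was averaged (per image, per task, or over the whole dataset) is not stated here, so this per-class average over a single label map is an assumption, as is the name `mean_iou`.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Average intersection-over-union across the classes that occur."""
    pred = np.asarray(pred)
    target = np.asarray(target)
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            # Class absent from both maps: skip it rather than count it as 0 or 1.
            continue
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```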

Tactile Logging: A method of describing the operation history on an object surface based on human motion analysis

We propose a method to analyze a demonstration of a human operating a tool, recorded as an RGB-D video. The proposed method estimates the interaction that occurs with the object while tracking the human pose and the three-dimensional position and attitude of the object being operated. The result is recorded as a time-series usage history (tactile log) on the surface of the 3D model of the object. The tactile log is a new data representation that expresses the ideal way of using the object, and it can be used to generate “natural” tool gripping and handling operations for robot arms.

Examples of generated “Tactile Log”

6-degree-of-freedom attitude estimation of similar-shaped objects focusing on the spatial arrangement of functional attributes

We propose a 6-degree-of-freedom attitude estimation method that works even when no 3D model identical to the object is available. Even if tools in the same category differ in design, the arrangement of the roles (function attributes) of their parts is considered to be common. The proposed method uses this arrangement as a cue for attitude estimation. We confirmed that the reliability of attitude estimation improves when consistency between arrangements of function attributes and consistency between shapes are optimized simultaneously. In actual use, this has the advantage that if just one 3D model of an object in the target category is associated with function attributes or a grasping method, an actual object can be handled as it is, eliminating the need to prepare model data for each object.

Input/Output of proposed method
(Input: PointCloud with function attributes, Output: Posture transformation parameters)

Real world sensing

Image sensing technology is expected to be utilized in various situations in the real world. Our laboratory aims to utilize new image sensing technology in various fields such as automobiles and medical care.

Optical flow estimation by in-vehicle event camera

We propose a regularization specialized for in-vehicle camera scenes, which uses the characteristics of vehicle motion and the focus of expansion (FOE) for optical flow estimation with event cameras. The FOE is defined as the intersection of the translation axis of the camera and the image plane. When the component due to rotation is removed from the optical flow of the surrounding environment induced by the motion of the vehicle itself, the optical flow becomes radial from the FOE. The proposed regularization restricts the direction of the optical flow using this property. The usefulness of this regularization was demonstrated by evaluating the rotation parameters estimated in the course of the method.
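
The radial property can be turned into a penalty on the part of each flow vector that is perpendicular to the line through the FOE. The sketch below shows one such penalty for rotation-compensated flow; the squared-perpendicular form and the names `radial_direction` and `radial_penalty` are assumptions and may differ from the regularizer actually used.

```python
import numpy as np


def radial_direction(points, foe):
    """Unit vectors pointing from the FOE to each pixel position."""
    d = np.asarray(points, dtype=float) - np.asarray(foe, dtype=float)
    norm = np.linalg.norm(d, axis=1, keepdims=True)
    # A pixel exactly at the FOE has no defined direction; it maps to zero.
    return d / np.maximum(norm, 1e-12)


def radial_penalty(points, flow, foe):
    """Mean squared flow component perpendicular to the radial direction.

    The flow is assumed to already have its rotational component removed.
    This penalizes deviation from the radial line but not the sign along it.
    """
    r = radial_direction(points, foe)
    flow = np.asarray(flow, dtype=float)
    along = (flow * r).sum(axis=1, keepdims=True) * r
    perp = flow - along
    return float((perp ** 2).sum(axis=1).mean())
```

A purely radial flow field gives a penalty of zero, and any tangential component increases it.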

Left: Output signal of event camera, Right: Result of optical flow estimation

Robust QR-code recognition by using event camera

QR codes are widely used on production lines in factory automation. However, blurring occurs depending on the lighting conditions and the speed of the belt conveyor. Event cameras, which asynchronously capture changes in luminance at each pixel, offer advantages such as high temporal resolution and high dynamic range that address this problem. In this research, we proposed a method of robustly estimating the QR code from event data by optimizing in QR code space, which is more restricted than image space.

Scoliosis screening by estimation of spinal column alignment from moire topographic images of the back

In this study, we propose a method to calculate the Cobb angle and VR angle needed for fully automatic scoliosis screening from moire topographic images of the subject's back, without X-ray exposure. Using moire images and X-ray images, we propose a method of estimating spinal column alignment coordinates with high precision, and a method of automatically calculating the Cobb angle and VR angle from the spinal column alignment information obtained from just a moire image. The CNN is trained with the feature point coordinates of the spinal column, extracted from X-ray images by a physician, as teacher data. We demonstrated the effectiveness of the proposed method on an independently constructed dataset. Currently, we are investigating a method to estimate 3D spinal column alignment from 3D scan data of the back.
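
As a simplified geometric illustration of how an angle can be derived from spinal alignment coordinates, the sketch below takes the tilt of each segment of an estimated spine midline and reports the largest difference in tilt. This is an approximation under our own assumptions (a 2D midline, segment tilt standing in for vertebral endplate tilt); it is not the method described above and is not a clinical Cobb angle measurement. The name `cobb_angle_from_midline` is hypothetical.

```python
import numpy as np


def cobb_angle_from_midline(points):
    """Approximate a Cobb-like angle in degrees from a 2D spine midline.

    points: (N, 2) array of (x, y) coordinates ordered from top to bottom,
            with y increasing downward along the back.
    """
    pts = np.asarray(points, dtype=float)
    dx = np.diff(pts[:, 0])
    dy = np.diff(pts[:, 1])
    # Tilt of each segment relative to the vertical axis, in degrees.
    tilt = np.degrees(np.arctan2(dx, dy))
    # Largest difference between the most oppositely tilted segments.
    return float(tilt.max() - tilt.min())
```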

Scoliosis screening by spinal alignment estimation

Application of Change Detection via Convolutional Neural Networks in Remote Sensing

Analysis of remote sensing imagery plays an increasingly vital role in environment and climate monitoring, especially in detecting and managing changes in the environment. Since satellite and aerial imagery has become easier to obtain in recent years, detection of landscape changes caused by disasters is in high demand. In this paper, we propose automatic landscape change detection, especially landslide and flood detection, using a convolutional neural network (CNN) to extract features more effectively. The CNN is robust to shadows, captures the characteristics of a disaster adequately, and, most importantly, can overcome misdetection or misjudgment by operators, which would affect the effectiveness of disaster relief. The work with the neural network consists of two phases: a training phase and a testing phase. We created our own training patches of pre-disaster and post-disaster scenes from Google Earth aerial imagery, currently focusing on two countries: Japan and Thailand. The training dataset for each disaster type consists of 50000 patches, and the CNN is trained on all patches to extract the regions where a disaster occurred without delay. The results show that our system achieves an accuracy of around 80%-90% for both types of disaster detection. Based on these promising results, the proposed method may assist in our understanding of the role of deep learning in disaster detection.

Result of disaster detection

Aoki Media Sensing Lab.

Keio University
Dept. of Electrical Engineering, Faculty of Science and Technology

3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Kanagawa

223-8522, Japan

Copyright © 2018 Aoki Media Sensing Lab. All Rights Reserved.