The Computer Vision and Machine Perception lab at Saarland University focuses on 3D computer vision and lifelong learning, with an emphasis on building machine perception algorithms that are not rigid but able to adapt to their environment and evolve. The lab connects to surrounding interdisciplinary fields, such as foundational research, reinforcement learning, natural language processing and privacy. The lab is headed by Eddy Ilg, who is known for his contributions to computer vision on optical flow and his work on AR technology at Meta.

Saarland University is surrounded by the Saarland Informatics Campus, one of the leading locations for computer science in Germany and Europe, with leading international research institutions such as the Computer Science Department, the Language Technology Department, the Max Planck Institute for Informatics (MPI-INF), the German Research Center for AI (DFKI) and the CISPA Helmholtz Center for Information Security. The Saarland Informatics Campus holds 28 ERC Grants and counts 7 Gottfried Wilhelm Leibniz Prize winners. The lab is English-speaking and has collaborations with many other renowned researchers, both inside the Saarland Informatics Campus and internationally.


Suraj Sudhakar and Kevin Raj are starting their Master's theses, co-supervised by Eddy Ilg and Raza Yunus.
Raza Yunus joins the lab as a PhD student.
Eddy Ilg became ELLIS Unit Faculty, allowing him to be a first supervisor and recruit through ELLIS.
Tomas Amado, Cameron Braunstein and Devikalyan Das are joining the lab as Master's students.
Eddy Ilg became area chair for BMVC and CVPR.
Suraj Sudhakar and Kevin Raj are joining the lab as Hiwis.
The lab will offer a seminar on "3D Representations for Deep Learning in Computer Vision" in the upcoming winter semester 2022/23 - see the guest lecture this coming Friday.
Eddy Ilg was awarded the professorship at Saarland University.
Eddy Ilg became an ELLIS member.
The lab has an open Hiwi Position.
Tom Fischer and Nikhil Paliwal are joining the lab for their Master's thesis.
The lab became a Continual AI Unit.
NinjaDesc has been accepted to CVPR 2022.
Our paper on explicit radiance field reconstruction without deep learning is available on arXiv.
Our paper on tight learned inertial odometry was accepted to RAL/IROS.
Our paper on deep local shapes was accepted to ECCV.


3D Representations
for Deep Learning in Computer Vision

Title: 3D Representations for Deep Learning in Computer Vision
Presenter: Prof. Dr.-Ing. Eddy Ilg
Location: E 1.4, Room 024
Time: Friday, August 19 at 2pm
Zoom Link:
Meeting ID: 970 4487 3513
Passcode: 033867
In the recent history of computer vision, methods leveraging deep learning and 3D representations have shown great success. The lecture will give an introduction to 3D computer vision, starting with the fundamentals of 3D reconstruction and providing an overview of state-of-the-art 3D scene and object representations with point clouds and surface- and density-based approaches. The lecture will conclude with an outlook on the future direction of the field and an overview of the research of the new CVMP lab at Saarland University.
Brief Bio:
Eddy Ilg is a new professor at Saarland University and leads the Computer Vision and Machine Perception (CVMP) lab. He received his PhD from the University of Freiburg and is the author of FlowNet, FlowNet 2.0 and FlowNetH, which were among the first deep learning approaches to optical flow estimation. After his PhD, Eddy spent three years at Facebook (now Meta) Reality Labs, focusing on 3D object reconstruction in the wild. He is now starting a new research group at Saarland University that will focus on 3D reconstruction in combination with continual learning, with the goal of building future machine perception algorithms that are not rigid but able to adapt to their environment and evolve.


Prof. Eddy Ilg

Eddy Ilg

I became a professor at Saarland University in June 2022 and lead the CVMP lab.

Previously, I worked in the US at Meta Reality Labs on project LiveMaps, which aims to build a complete machine perception stack for augmented reality, ranging from hardware to mapping and localization, 3D scene understanding and photorealistic 3D reconstruction. This work gave me a very broad perspective on AI and an understanding of the challenges involved in building AI systems.

I obtained my PhD from the University of Freiburg and am well known for my work on optical flow, which led to a paradigm shift and first established deep learning approaches in the field. Furthermore, I am known for my work in related areas on the estimation of disparities, motion and occlusion boundaries, occlusions, and uncertainties. Please have a look at my thesis.

During high school I wrote an operating system and won the prize for the best work in software engineering at the German federal young researchers' contest. I am generally a very curious and creative person. During my studies I was supported by a scholarship from the German Academic Scholarship Foundation for extraordinarily talented students with broad horizons.

My current research interests lie in bringing AI systems to the next level and using deep learning in combination with 3D computer vision, computer graphics and NLP. My vision is to combine research from these areas to build AI systems that are not rigid and can evolve over time.

During my PhD I had the chance to lead a small team, and a strong team atmosphere is very important to me. Research is fundamentally about moving forward into the unknown, and I personally find it invaluable to have many and diverse perspectives on one's work. Therefore I support working in small teams with an intensive discussion culture and a healthy team atmosphere. To my students I serve as a mentor in aspects of their career ranging from theory to engineering and from academia to industry.

Raza Yunus

Raza Yunus

I have been a Ph.D. student since December 2022, jointly at the CVMP lab and the Max Planck Institute for Informatics (MPI).

I did my Bachelor's in Software Engineering at NUST, Pakistan, and my Master's in Informatics at TU Munich, with a focus on Computer Vision and AI. For my thesis, I developed ManhattanSLAM, a robust SLAM system that utilizes structure in indoor scenes for accurate camera pose tracking.

My research interests lie in 3D Computer Vision and Simultaneous Localization and Mapping (SLAM) in general. Currently, I'm focusing on non-rigid 3D reconstruction, where the goal is to eventually track and reconstruct complete dynamic scenes in real-time, using a single camera.

I like to play music, read or play a game of chess in my free time.

Tom Fischer

Tom Fischer

I am writing my Master's thesis with Professor Ilg in the newly founded lab for Computer Vision and Machine Perception.

I did my Bachelor's degree in Cybersecurity from 2016-2020 at Saarland University with a focus on cryptography and secure machine learning. Since 2020, I have been doing my Master's in computer science. My research interests lie in the intersection between computer vision and deep learning, where I want to design transparent and robust deep learning solutions for computer vision problems.

For my thesis, I am investigating if and how it is possible to extend state-of-the-art deep learning approaches for optical flow with explicit diffusion processes. We hope that using the well-understood diffusion modelling theory lets us design more explainable and stable networks to solve the optical flow problem efficiently.

When I am not staring at a computer screen, I like to cook and experiment with different recipes and cuisines. I love eating out with friends and I am always looking for restaurant recommendations!

Nikhil Paliwal

Nikhil Paliwal

I am a Master's student at Saarland University, studying in the Data Science and Artificial Intelligence program. I have a background in self-supervision, model compression techniques and deep learning. I did my Bachelor's in Communication and Computer Engineering at the LNM Institute of Information Technology, India, from 2015-19. I also received the Chairman's gold medal for the best overall performance in my graduating batch.

I aim to make deep learning more accessible by reducing the need for large amounts of supervised labels, lowering computational resource requirements and mitigating catastrophic forgetting in continual systems. In my Master's thesis I will focus on the topic of knowledge transfer with continual learning.

In my leisure time, I enjoy activities such as football and swimming. I also enjoy science fiction, reading novels and watching movies.

Suraj Sudhakar

I am doing my Master's in Visual Computing at Saarland University and did my Bachelor's in Electronics & Communication at the BMS College of Engineering, Bangalore, from 2014-2018. My research interests lie in the area of Computer Vision & Deep Learning. More specifically, I am interested in explainable models that have a well-rooted theory behind them.

I have been working as a Hiwi at CVMP since October 2022 and now started my Master's thesis working with Eddy Ilg and Raza Yunus on non-rigid reconstruction.

I enjoy watching anime, and playing games. I am also learning to cook!

Kevin Raj

I started my Master's in Visual Computing at Saarland University in March 2021. My broad interests lie in the intersection of Computer Vision, Computer Graphics and Machine Learning.

I have been working as a Hiwi at CVMP since October 2022 and now started my Master's thesis working with Eddy Ilg and Raza Yunus on non-rigid reconstruction.

Before coming to Saarland, I worked as a Research Assistant at the Indian Institute of Science, Bangalore, under the supervision of Dr. Chandra Sekhar. My work was related to biomedical image processing using deep learning and signal/image processing techniques. I did my Bachelor's in Electrical and Electronics at Manipal Institute of Technology, Manipal, from 2015-2019.

Generally, I spend my free time watching anime, traveling, and cooking :)

Tomas Amado

Tomas Amado

I am a Data Science and Artificial Intelligence Master’s student at Saarland University. My interests have been in the areas of computer vision, machine learning and recently 3D computer vision.

I will be writing my Master’s thesis with Prof. Ilg, focusing on using Slot Attention mechanisms to learn object-centric representations from 3D point cloud data.

I did my Bachelor’s in Computer Engineering at Rafael Urdaneta University from 2016-2019 in my home city of Maracaibo, Venezuela, focusing mainly on software engineering. Since 2020 I have been living in Germany and doing my Master’s at Saarland University.

In my free time I mostly do sports like volleyball, running and bouldering. I also love playing the guitar and exploring new music every time I can.

Cameron Braunstein

Cameron Braunstein

I am a Master’s student at Saarland University studying Data Science and Artificial Intelligence.

I graduated from Brandeis University in 2019 with a Bachelor’s degree in Computer Science and Mathematics. During the course of my Master’s studies I have found a passion for Computer Vision, which I am pursuing jointly with CVMP and MPI, co-supervised by Prof. Eddy Ilg and Vladislav Golyanik.

For my thesis, I am investigating quantum computing approaches to optical flow. This work lies on the exciting new frontiers of computing technology.

In my spare time I enjoy cooking, watching movies, and playing the piano works of Claude Debussy.

Devikalyan Das

Devikalyan Das

I am doing my Master’s in Visual Computing at Saarland University. My interests lie in Computer Graphics, Computer Vision and Machine Learning, with the aim of better understanding the world around me through the eyes of a computer.

I am currently doing my Master’s thesis under the supervision of Professor Eddy Ilg, focusing on efficient 3D reconstruction of deformable objects in video.

I enjoy cooking, playing the flute, and swimming or jogging in my free time.

Mona Linn

I am supporting the CVMP lab as an administrative assistant with procurement, accounting and recruiting. Please contact me if you would like to get in touch with the lab or have general inquiries.

Job Postings

Fully-Funded Open PhD and PostDoc Positions
in 3D Computer Vision and Continual Learning

Saarland University is one of the leading locations for computer science in Germany and Europe. The Computer Vision and Machine Perception research lab, headed by Eddy Ilg, has close ties to MPI Saarbrücken and to Meta and has open PhD and PostDoc positions in 3D computer vision and continual learning.

Although deep learning has conquered the field of perception algorithms and is omnipresent today, conventional systems have the strong limitations that they operate in 2D and use a fixed training dataset.

The goal of your project will be to advance state-of-the-art machine perception algorithms by lifting the representation to 3D and enabling the systems to increase their knowledge from continuously incoming unlabeled data. To this end, you will be developing deep learning-based models that can reason about their own knowledge and combining those models with state-of-the-art 3D reconstruction techniques. Your work will involve 3D reconstruction, physical light transport models, uncertainty estimation and continual learning and connect to natural language processing, reinforcement learning and fundamental research.

The positions target publications at CVPR, ECCV, ICCV and NeurIPS and will set you up for an outstanding career in academia or industry. The research is highly relevant for the future of AI, and practical applications of your work include 3D scene understanding, contextual AI, virtual shopping and visual search. You will be working in an interdisciplinary setting embedded in the environment of the Saarland Informatics Campus, with leading researchers in the field, and building international relationships.

Requirements for the PhD:

Please include the following information with your application:

Requirements for the PostDoc:

Please include the following information with your application:

Please send your application to .

You can find more information and news at cvmp.cs.uni-saarland.de. Eddy Ilg is known for his contributions to computer vision on optical flow and his work on AR technology at Meta. Saarland University is surrounded by the Saarland Informatics Campus with five strong research institutes and three collaborating university departments. It offers a dynamic and stimulating research environment and holds 28 ERC Grants and 7 Gottfried Wilhelm Leibniz Prize winners. You will be working in a diverse, English-speaking environment with international students and collaborating with many other renowned professors, both inside the Saarland Informatics Campus and internationally. Industry collaborations with US tech companies are also planned.

#phd #phdposition #phdstudent #postdoc #machinelearning #deeplearning #computervision #machineperception #artificialintelligence #computergraphics #3dreconstruction #3dcomputervision #3dvision #saarlanduniversity #saarlandinformaticscampus #hiring #research #ai #technology #university #augmentedreality #continuallearning #uncertaintyestimation

Hiwi Position

All positions for 2022 are currently filled. If you are interested in working with the lab in 2023, please send your application including your CV and recent grade transcripts to .

Thesis Offerings

The lab is offering Master's theses in the following areas:

If you are interested in doing your thesis with the lab, please send your application including your CV and recent grade transcripts to .



ERF: Explicit Radiance Field Reconstruction From Scratch

Samir Aroudj, Steven Lovegrove, Eddy Ilg, Tanner Schmidt, Michael Goesele and Richard Newcombe

arXiv Paper 2022

We propose a novel explicit dense 3D reconstruction approach that processes a set of images of a scene with sensor poses and calibrations and estimates a photo-real digital model. One of the key innovations is that the underlying volumetric representation is completely explicit in contrast to neural network-based (implicit) alternatives. We encode scenes explicitly using clear and understandable mappings of optimization variables to scene geometry and their outgoing surface radiance. We represent them using hierarchical volumetric fields stored in a sparse voxel octree. Robustly reconstructing such a volumetric scene model with millions of unknown variables from registered scene images only is a highly non-convex and complex optimization problem. To this end, we employ stochastic gradient descent (Adam) which is steered by an inverse differentiable renderer.

We demonstrate that our method can reconstruct models of high quality that are comparable to state-of-the-art implicit methods. Importantly, we do not use a sequential reconstruction pipeline where individual steps suffer from incomplete or unreliable information from previous stages, but start our optimizations from uninformed initial solutions with scene geometry and radiance that are far off from the ground truth. We show that our method is general and practical. It does not require a highly controlled lab setup for capturing, but allows for reconstructing scenes with a vast variety of objects, including challenging ones, such as outdoor plants or furry toys. Finally, our reconstructed scene models are versatile thanks to their explicit design. They can be edited interactively, which is computationally too costly for implicit alternatives.
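The core optimization can be illustrated with a much-simplified toy analogue (all sizes and names are illustrative, not from the paper): the "scene" is a vector of explicit voxel radiances, the differentiable renderer is reduced to a fixed linear ray-integration matrix, and Adam recovers the variables from the rendered pixels alone, starting from an uninformed all-zero initialization.

```python
import numpy as np

# Toy analogue of explicit reconstruction: voxel radiances x are explicit
# optimization variables, the "renderer" is a linear projection A (each
# row integrates voxels along one ray), and Adam fits x to rendered pixels b.
rng = np.random.default_rng(0)
n_voxels, n_rays = 16, 64
x_true = rng.uniform(0.0, 1.0, n_voxels)        # ground-truth radiances
A = rng.uniform(0.0, 1.0, (n_rays, n_voxels))   # per-ray integration weights
b = A @ x_true                                  # observed pixel values

x = np.zeros(n_voxels)                          # uninformed initial solution
m, v = np.zeros(n_voxels), np.zeros(n_voxels)   # Adam moment estimates
lr, b1, b2, eps = 0.02, 0.9, 0.999, 1e-8
for t in range(1, 5001):
    grad = 2.0 * A.T @ (A @ x - b) / n_rays     # gradient of the render loss
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)
    x -= lr * m_hat / (np.sqrt(v_hat) + eps)

print("max voxel radiance error:", np.abs(x - x_true).max())
```

In the paper the renderer is a full inverse differentiable renderer over a sparse voxel octree; the toy keeps only the structure of the problem: explicit variables, a rendering operator, and first-order optimization.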

NinjaDesc: Content-Concealing Visual Descriptors via Adversarial Learning

Tony Ng, Hyo Jin Kim, Vincent Lee, Daniel DeTone, Tsun-Yi Yang, Tianwei Shen, Eddy Ilg, Vassileios Balntas, Krystian Mikolajczyk and Chris Sweeney

Conference on Computer Vision and Pattern Recognition (CVPR) 2022

In the light of recent analyses on privacy-concerning scene revelation from visual descriptors, we develop descriptors that conceal the input image content. In particular, we propose an adversarial learning framework for training visual descriptors that prevent image reconstruction, while maintaining the matching accuracy. We let a feature encoding network and image reconstruction network compete with each other, such that the feature encoder tries to impede the image reconstruction with its generated descriptors, while the reconstructor tries to recover the input image from the descriptors. The experimental results demonstrate that the visual descriptors obtained with our method significantly deteriorate the image reconstruction quality with minimal impact on correspondence matching and camera localization performance.


Mitigating Reverse Engineering Attacks on Local Feature Descriptors

Deeksha Dangwal, Vincent T. Lee, Hyo Jin Kim, Tianwei Shen, Meghan Cowan, Rajvi Shah, Caroline Trippel, Brandon Reagen, Timothy Sherwood, Vasileios Balntas, Armin Alaghi and Eddy Ilg

British Machine Vision Conference (BMVC) 2021

As autonomous driving and augmented reality evolve, a practical concern is data privacy. In particular, these applications rely on localization based on user images. The widely adopted technology uses local feature descriptors, which are derived from the images and it was long thought that they could not be reverted back. However, recent work has demonstrated that under certain conditions reverse engineering attacks are possible and allow an adversary to reconstruct RGB images. This poses a potential risk to user privacy. We take this a step further and model potential adversaries using a privacy threat model. Subsequently, we show under controlled conditions a reverse engineering attack on sparse feature maps and analyze the vulnerability of popular descriptors including FREAK, SIFT and SOSNet. Finally, we evaluate potential mitigation techniques that select a subset of descriptors to carefully balance privacy reconstruction risk while preserving image matching accuracy; our results show that similar accuracy can be obtained when revealing less information.


Deep Local Shapes: Learning Local SDF Priors for Detailed 3D Reconstruction

Rohan Chabra, Jan Eric Lenssen, Eddy Ilg, Tanner Schmidt, Julian Straub, Steven Lovegrove and Richard Newcombe

European Conference on Computer Vision (ECCV) 2020

Efficiently reconstructing complex and intricate surfaces at scale is a long-standing goal in machine perception. To address this problem we introduce Deep Local Shapes (DeepLS), a deep shape representation that enables encoding and reconstruction of high-quality 3D shapes without prohibitive memory requirements. DeepLS replaces the dense volumetric signed distance function (SDF) representation used in traditional surface reconstruction systems with a set of locally learned continuous SDFs defined by a neural network, inspired by recent work such as DeepSDF. Unlike DeepSDF, which represents an object-level SDF with a neural network and a single latent code, we store a grid of independent latent codes, each responsible for storing information about surfaces in a small local neighborhood. This decomposition of scenes into local shapes simplifies the prior distribution that the network must learn, and also enables efficient inference. We demonstrate the effectiveness and generalization power of DeepLS by showing object shape encoding and reconstructions of full scenes, where DeepLS delivers high compression, accuracy, and local shape completion.
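The decomposition into local codes with a shared decoder can be sketched in a much-simplified form (an illustration under assumptions, not the paper's network): here the neural decoder is replaced by a fixed quadratic feature map whose coefficients play the role of per-cell latent codes, fit by least squares to the true SDF of a circle.

```python
import numpy as np

# DeepLS-style sketch: space is split into a grid of cells, each cell
# stores a small latent code, and one "decoder" shared by all cells maps
# (code, local coordinates) to an SDF value. Stand-in decoder: a fixed
# quadratic basis; the per-cell codes are its coefficients.
def sdf_circle(p):
    return np.linalg.norm(p, axis=-1) - 0.6     # true SDF of a circle

def features(local):                            # shared decoder basis
    x, y = local[:, 0], local[:, 1]
    return np.stack([np.ones_like(x), x, y, x * x, x * y, y * y], axis=1)

G = 8
cell = 2.0 / G                                  # grid covers [-1, 1]^2
codes = np.zeros((G, G, 6))                     # one small code per cell
rng = np.random.default_rng(1)
for i in range(G):
    for j in range(G):
        lo = np.array([-1 + i * cell, -1 + j * cell])
        pts = lo + rng.uniform(0, cell, (200, 2))   # samples inside the cell
        local = (pts - lo) / cell                   # normalized local coords
        codes[i, j], *_ = np.linalg.lstsq(features(local), sdf_circle(pts), rcond=None)

def decode(p):
    # query: look up the containing cell, decode with its local code
    idx = np.clip(((p + 1) / cell).astype(int), 0, G - 1)
    lo = -1 + idx * cell
    local = (p - lo) / cell
    return np.einsum('nk,nk->n', features(local), codes[idx[:, 0], idx[:, 1]])

query = rng.uniform(-1, 1, (500, 2))
err = np.abs(decode(query) - sdf_circle(query)).max()
print("max SDF error over 500 queries:", err)
```

Each cell only has to represent a simple local patch of the surface, which is why small codes suffice; the same intuition carries over to the learned decoder in DeepLS.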

TLIO: Tight Learned Inertial Odometry

Wenxin Liu, David Caruso, Eddy Ilg, Jing Dong, Anastasios I. Mourikis, Kostas Daniilidis, Vijay Kumar and Jakob Engel

IEEE Robotics and Automation Letters 2020

In this work we propose a tightly-coupled Extended Kalman Filter framework for IMU-only state estimation. Strap-down IMU measurements provide relative state estimates based on IMU kinematic motion model. However the integration of measurements is sensitive to sensor bias and noise, causing significant drift within seconds. Recent research by Yan et al. (RoNIN) and Chen et al. (IONet) showed the capability of using trained neural networks to obtain accurate 2D displacement estimates from segments of IMU data and obtained good position estimates from concatenating them. This paper demonstrates a network that regresses 3D displacement estimates and its uncertainty, giving us the ability to tightly fuse the relative state measurement into a stochastic cloning EKF to solve for pose, velocity and sensor biases. We show that our network, trained with pedestrian data from a headset, can produce statistically consistent measurement and uncertainty to be used as the update step in the filter, and the tightly-coupled system outperforms velocity integration approaches in position estimates, and AHRS attitude filter in orientation estimates.
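The tight-fusion idea can be sketched in a toy 1D setting (all numbers illustrative, not from the paper): pure integration drifts because of a sensor bias, while a "network" (simulated here as a noisy oracle) regresses displacement together with an uncertainty, and a Kalman filter over [position, bias] fuses the two.

```python
import numpy as np

# Toy 1D analogue of TLIO-style fusion: biased velocity integration vs.
# a Kalman filter that fuses simulated displacement measurements (with
# their uncertainty) and estimates the bias on the fly.
rng = np.random.default_rng(2)
dt, steps = 0.1, 300
true_bias, vel = 0.3, 1.0
true_pos = 0.0

x = np.zeros(2)                             # estimated [position, bias]
P = np.diag([1.0, 1.0])
F = np.array([[1.0, -dt], [0.0, 1.0]])      # estimated bias is subtracted out
Q = np.diag([1e-4, 1e-6])
H = np.array([[1.0, 0.0]])                  # measurement observes position

drift_only = 0.0
for _ in range(steps):
    true_pos += vel * dt
    imu_vel = vel + true_bias               # biased velocity from the IMU
    drift_only += imu_vel * dt              # naive integration drifts
    # predict: integrate biased velocity, compensate with estimated bias
    x = F @ x + np.array([imu_vel * dt, 0.0])
    P = F @ P @ F.T + Q
    # update: "network" displacement measurement with uncertainty sigma
    sigma = 0.05
    z = true_pos + rng.normal(0.0, sigma)
    S = H @ P @ H.T + sigma**2
    K = P @ H.T / S
    x = x + (K * (z - H @ x)).ravel()
    P = (np.eye(2) - K @ H) @ P

print("drift-only position error:", abs(drift_only - true_pos))
print("filter position error:", abs(x[0] - true_pos))
print("estimated bias:", x[1], "(true:", true_bias, ")")
```

The paper's stochastic cloning EKF is considerably richer (3D pose, velocity, full IMU bias states), but the mechanism is the same: a learned relative measurement with its regressed covariance enters as the filter's update step.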

Domain Adaptation of Learned Features for Visual Localization

Sungyong Baik, Hyo Jin Kim, Tianwei Shen, Eddy Ilg, Kyoung Mu Lee and Christopher Sweeney

British Machine Vision Conference (BMVC) 2020

We tackle the problem of visual localization under changing conditions, such as time of day, weather, and seasons. Recent learned local features based on deep neural networks have shown superior performance over classical hand-crafted local features. However, in a real-world scenario, there often exists a large domain gap between training and target images, which can significantly degrade the localization accuracy. While existing methods utilize a large amount of data to tackle the problem, we present a novel and practical approach, where only a few examples are needed to reduce the domain gap. In particular, we propose a few-shot domain adaptation framework for learned local features that deals with varying conditions in visual localization. The experimental results demonstrate the superior performance over baselines, while using a scarce number of training examples from the target domain.


Overcoming Limitations of Mixture Density Networks: A Sampling and Fitting Framework for Multimodal Future Prediction

Osama Makansi, Eddy Ilg, Özgün Çiçek and Thomas Brox

Conference on Computer Vision and Pattern Recognition (CVPR) 2019

Future prediction is a fundamental principle of intelligence that helps plan actions and avoid possible dangers. As the future is uncertain to a large extent, modeling the uncertainty and multimodality of the future states is of great relevance. Existing approaches are rather limited in this regard and mostly yield a single hypothesis of the future or, at the best, strongly constrained mixture components that suffer from instabilities in training and mode collapse. In this work, we present an approach that involves the prediction of several samples of the future with a winner-takes-all loss and iterative grouping of samples to multiple modes. Moreover, we discuss how to evaluate predicted multimodal distributions, including the common real scenario, where only a single sample from the ground-truth distribution is available for evaluation. We show on synthetic and real data that the proposed approach triggers good estimates of multimodal distributions and avoids mode collapse.
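The two ingredients can be sketched with toy 1D numbers (illustrative, not from the paper): a winner-takes-all (WTA) loss penalizes only the hypothesis closest to the observed future, so hypotheses can specialize on different outcomes, and sampled futures are then grouped into modes (here via the largest gap in sorted order, a stand-in for the paper's fitting stage).

```python
import numpy as np

# Winner-takes-all loss plus a simple grouping of samples into modes.
rng = np.random.default_rng(3)

def wta_loss(hypotheses, target):
    # only the best hypothesis receives an error signal
    return np.min((hypotheses - target) ** 2)

hypotheses = np.array([-2.0, 0.1, 2.1])        # three predicted futures
# ground-truth futures drawn from a bimodal distribution around -2 and +2
targets = np.concatenate([rng.normal(-2, 0.1, 50), rng.normal(2, 0.1, 50)])
loss = np.mean([wta_loss(hypotheses, t) for t in targets])

# group the samples into two modes via the largest gap in sorted order
samples = np.sort(targets)
split = np.argmax(np.diff(samples)) + 1
modes = [samples[:split].mean(), samples[split:].mean()]
print("mean WTA loss:", loss, "modes:", modes)
```

Note how the WTA loss stays small even though no single hypothesis could cover both modes; an averaging loss would instead pull all hypotheses toward the (never observed) mean of zero.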


What Makes Good Synthetic Training Data for Learning Disparity and Optical Flow Estimation?

Nikolaus Mayer, Eddy Ilg, Philipp Fischer, Caner Hazirbas, Daniel Cremers, Alexey Dosovitskiy and Thomas Brox

Int. Journal of Computer Vision (IJCV) 2018

The finding that very large networks can be trained efficiently and reliably has led to a paradigm shift in computer vision from engineered solutions to learning formulations. As a result, the research challenge shifts from devising algorithms to creating suitable and abundant training data for supervised learning. How to efficiently create such training data? The dominant data acquisition method in visual recognition is based on web data and manual annotation. Yet, for many computer vision problems, such as stereo or optical flow estimation, this approach is not feasible because humans cannot manually enter a pixel-accurate flow field. In this paper, we promote the use of synthetically generated data for the purpose of training deep networks on such tasks. We suggest multiple ways to generate such data and evaluate the influence of dataset properties on the performance and generalization properties of the resulting networks. We also demonstrate the benefit of learning schedules that use different types of data at selected stages of the training process.

FusionNet and AugmentedFlowNet: Selective Proxy Ground Truth for Training on Unlabeled Images

Osama Makansi, Eddy Ilg and Thomas Brox

arXiv Paper 2018

Recent work has shown that convolutional neural networks (CNNs) can be used to estimate optical flow with high quality and fast runtime. This makes them preferable for real-world applications. However, such networks require very large training datasets. Engineering the training data is difficult and/or laborious. This paper shows how to augment a network trained on an existing synthetic dataset with large amounts of additional unlabelled data. In particular, we introduce a selection mechanism to assemble from multiple estimates a joint optical flow field, which outperforms that of all input methods. The latter can be used as proxy-ground-truth to train a network on real-world data and to adapt it to specific domains of interest. Our experimental results show that the performance of networks improves considerably, both, in cross-domain and in domain-specific scenarios. As a consequence, we obtain state-of-the-art results on the KITTI benchmarks.
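The selection mechanism can be illustrated on a toy 1D "image" (sizes and flows illustrative): two candidate flow estimators are each correct in only part of the image, and per pixel we keep the candidate whose warped second frame best matches the first frame, yielding the proxy ground truth.

```python
import numpy as np

# Per-pixel selection of candidate flows by photometric warping error.
rng = np.random.default_rng(4)
n = 64
frame2 = rng.uniform(0, 1, n)
true_flow = np.where(np.arange(n) < n // 2, 2, -3)   # piecewise-constant shift
frame1 = frame2[(np.arange(n) + true_flow) % n]      # frame1 = frame2 shifted

def warp_error(flow):
    # photometric error after warping frame2 with a candidate flow
    return np.abs(frame1 - frame2[(np.arange(n) + flow) % n])

candidates = [np.full(n, 2), np.full(n, -3)]         # two imperfect estimators
errors = np.stack([warp_error(f) for f in candidates])
best = np.argmin(errors, axis=0)                     # per-pixel selection
fused = np.stack(candidates)[best, np.arange(n)]
accuracy = np.mean(fused == true_flow)
print("proxy-GT accuracy:", accuracy)
```

The assembled field is better than either input, which is exactly why it can serve as proxy ground truth for training on unlabeled data.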

Occlusions, Motion and Depth Boundaries with a Generic Network for Disparity, Optical Flow or Scene Flow Estimation

Eddy Ilg, Tonmoy Saikia, Margret Keuper and Thomas Brox

European Conference on Computer Vision (ECCV) 2018

Occlusions play an important role in optical flow and disparity estimation, since matching costs are not available in occluded areas and occlusions indicate motion boundaries. Moreover, occlusions are relevant for motion segmentation and scene flow estimation. In this paper, we present an efficient learning-based approach to estimate occlusion areas jointly with optical flow or disparities. The estimated occlusions and motion boundaries clearly improve over the state of the art. Moreover, we present networks with state-of-the-art performance on the popular KITTI benchmark and good generic performance. Making use of the estimated occlusions, we also show improved results on motion segmentation and scene flow estimation.

Uncertainty Estimates and Multi-Hypotheses Networks for Optical Flow

Eddy Ilg, Özgün Çiçek, Silvio Galesso, Aaron Klein, Osama Makansi, Frank Hutter and Thomas Brox

European Conference on Computer Vision (ECCV) 2018

Optical flow estimation can be formulated as an end-to-end supervised learning problem, which yields estimates with a superior accuracy-runtime tradeoff compared to alternative methodology. In this paper, we make such networks estimate their local uncertainty about the correctness of their prediction, which is vital information when building decisions on top of the estimations. For the first time we compare several strategies and techniques to estimate uncertainty in a large-scale computer vision task like optical flow estimation. Moreover, we introduce a new network architecture and loss function that enforce complementary hypotheses and provide uncertainty estimates efficiently with a single forward pass and without the need for sampling or ensembles. We demonstrate the quality of the uncertainty estimates, which is clearly above previous confidence measures on optical flow and allows for interactive frame rates.
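One common way to make a network output such uncertainties (shown here as an illustrative assumption, not the paper's exact loss) is to predict a scale parameter sigma alongside the flow and train with a negative log-likelihood; for a fixed prediction error |e|, a Laplacian NLL is minimized exactly at sigma = |e|, so a well-trained sigma tracks the actual error magnitude.

```python
import numpy as np

# Laplacian negative log-likelihood as an uncertainty-learning loss:
# sweeping sigma for a fixed error shows the minimum sits at sigma = |e|.
def laplace_nll(pred, sigma, target):
    return np.log(2 * sigma) + np.abs(pred - target) / sigma

err = 0.5                                   # fixed prediction error
sigmas = np.linspace(0.05, 2.0, 2000)
nll = laplace_nll(0.0, sigmas, err)
best_sigma = sigmas[np.argmin(nll)]
print("NLL-optimal sigma:", best_sigma, "actual error:", err)
```

This is the basic mechanism behind single-forward-pass uncertainty: the loss itself rewards calibrated scale predictions, with no sampling or ensembling needed at test time.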


End-to-End Learning of Video Super-Resolution with Motion Compensation

Osama Makansi, Eddy Ilg and Thomas Brox

German Conference on Pattern Recognition (GCPR) 2017

Learning approaches have shown great success in the task of super-resolving an image given a low resolution input. Video superresolution aims for exploiting additionally the information from multiple images. Typically, the images are related via optical flow and consecutive image warping. In this paper, we provide an end-to-end video superresolution network that, in contrast to previous works, includes the estimation of optical flow in the overall network architecture. We analyze the usage of optical flow for video super-resolution and find that common off-the-shelf image warping does not allow video super-resolution to benefit much from optical flow. We rather propose an operation for motion compensation that performs warping from low to high resolution directly. We show that with this network configuration, video superresolution can benefit from optical flow and we obtain state-of-the-art results on the popular test sets. We also show that the processing of whole images rather than independent patches is responsible for a large increase in accuracy.

DeMoN: Depth and Motion Network for Learning Monocular Stereo

Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Nikolaus Mayer, Eddy Ilg, Alexey Dosovitskiy and Thomas Brox

Conference on Computer Vision and Pattern Recognition (CVPR) 2017

In this paper we formulate structure from motion as a learning problem. We train a convolutional network end-to-end to compute depth and camera motion from successive, unconstrained image pairs. The architecture is composed of multiple stacked encoder-decoder networks, the core part being an iterative network that is able to improve its own predictions. The network estimates not only depth and motion, but additionally surface normals, optical flow between the images and confidence of the matching. A crucial component of the approach is a training loss based on spatial relative differences. Compared to traditional two-frame structure from motion methods, results are more accurate and more robust. In contrast to the popular depth-from-single-image networks, DeMoN learns the concept of matching and, thus, better generalizes to structures not seen during training.

Lucid Data Dreaming for Object Tracking

Anna Khoreva, Rodrigo Benenson, Eddy Ilg, Thomas Brox and Bernt Schiele

Conference on Computer Vision and Pattern Recognition Workshop (CVPRW) 2017

Convolutional networks reach top quality in pixel-level object tracking but require a large amount of training data (1k to 10k) to deliver such results. We propose a new training strategy which achieves state-of-the-art results across three evaluation datasets while using 20× to 100× less annotated data than competing methods. Our approach is suitable for both single and multiple object tracking. Instead of using large training sets hoping to generalize across domains, we generate in-domain training data using the provided annotation on the first frame of each video to synthesize ("lucid dream") plausible future video frames. In-domain per-video training data allows us to train high-quality appearance- and motion-based models, as well as tune the post-processing stage. This approach allows us to reach competitive results even when training from only a single annotated frame, without ImageNet pre-training. Our results indicate that using a larger training set is not automatically better, and that for the tracking task a smaller training set that is closer to the target domain is more effective. This changes the mindset regarding how many training samples and general "objectness" knowledge are required for the object tracking task.
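The core of the data-synthesis idea — producing plausible "future" training pairs from a single annotated frame by transforming the image and its mask consistently — can be sketched as follows (a hypothetical minimal version; the actual method additionally composites foreground onto modified backgrounds and simulates illumination changes and non-rigid deformation):

```python
import numpy as np
from scipy.ndimage import shift as nd_shift

def lucid_dream_pair(frame, mask, offset=(1.0, 0.0)):
    """Synthesize one plausible training pair from a single annotated
    frame: apply the same small geometric transform to the image
    (bilinear, order=1) and to its mask (nearest neighbor, order=0,
    so the labels stay binary).
    """
    new_frame = nd_shift(frame, offset, order=1, mode="nearest")
    new_mask = nd_shift(mask, offset, order=0, mode="constant")
    return new_frame, new_mask

frame = np.arange(12.0).reshape(3, 4)
mask = np.zeros((3, 4))
mask[1, 1] = 1.0
new_frame, new_mask = lucid_dream_pair(frame, mask)
```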

FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks

Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy and Thomas Brox

Conference on Computer Vision and Pattern Recognition (CVPR) 2017

The FlowNet demonstrated that optical flow estimation can be cast as a learning problem. However, the state of the art with regard to the quality of the flow has still been defined by traditional methods. Particularly on small displacements and real-world data, FlowNet cannot compete with variational methods. In this paper, we advance the concept of end-to-end learning of optical flow and make it work really well. The large improvements in quality and speed are caused by three major contributions: first, we focus on the training data and show that the schedule of presenting data during training is very important. Second, we develop a stacked architecture that includes warping of the second image with intermediate optical flow. Third, we elaborate on small displacements by introducing a subnetwork specializing on small motions. FlowNet 2.0 is only marginally slower than the original FlowNet but decreases the estimation error by more than 50%. It performs on par with state-of-the-art methods, while running at interactive frame rates. Moreover, we present faster variants that allow optical flow computation at up to 140fps with accuracy matching the original FlowNet.
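The stacking-with-warping idea can be sketched as a residual refinement step: warp the second image toward the first with the current flow estimate, then let the next subnetwork predict a correction (a hedged sketch; `subnet` and `warp` are hypothetical placeholders for a trained network and a bilinear warping operation):

```python
import numpy as np

def refine_flow(img1, img2, flow, subnet, warp):
    """One stacking step: warp img2 toward img1 with the current flow,
    feed the pair (plus the flow) to the next subnetwork, and add the
    predicted residual to the running estimate."""
    warped2 = warp(img2, flow)
    residual = subnet(img1, warped2, flow)
    return flow + residual

# Toy demo: a subnetwork that predicts a zero residual leaves the flow
# estimate unchanged (placeholders, not trained models).
img = np.zeros((4, 4))
flow = np.ones((4, 4, 2))
identity_warp = lambda im, f: im
zero_subnet = lambda a, b, f: np.zeros_like(f)
refined = refine_flow(img, img, flow, zero_subnet, identity_warp)
```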


A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation

Nikolaus Mayer, Eddy Ilg, Philip Häusser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy and Thomas Brox

Conference on Computer Vision and Pattern Recognition (CVPR) 2016

Recent work has shown that optical flow estimation can be formulated as a supervised learning task and can be successfully solved with convolutional networks. Training of the so-called FlowNet was enabled by a large synthetically generated dataset. The present paper extends the concept of optical flow estimation via convolutional networks to disparity and scene flow estimation. To this end, we propose three synthetic stereo video datasets with sufficient realism, variation, and size to successfully train large networks. Our datasets are the first large-scale datasets to enable training and evaluation of scene flow methods. Besides the datasets, we present a convolutional network for real-time disparity estimation that provides state-of-the-art results. By combining a flow and disparity estimation network and training it jointly, we demonstrate the first scene flow estimation with a convolutional network.


FlowNet: Learning Optical Flow with Convolutional Networks

Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Häusser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers and Thomas Brox

Int. Conference on Computer Vision (ICCV) 2015

Convolutional neural networks (CNNs) have recently been very successful in a variety of computer vision tasks, especially on those linked to recognition. Optical flow estimation has not been among the tasks CNNs succeeded at. In this paper we construct CNNs which are capable of solving the optical flow estimation problem as a supervised learning task. We propose and compare two architectures: a generic architecture and another one including a layer that correlates feature vectors at different image locations. Since existing ground truth data sets are not sufficiently large to train a CNN, we generate a large synthetic Flying Chairs dataset. We show that networks trained on this unrealistic data still generalize very well to existing datasets such as Sintel and KITTI, achieving competitive accuracy at frame rates of 5 to 10 fps.
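The layer "that correlates feature vectors at different image locations" can be sketched as follows (a hedged NumPy illustration of the idea: for each position in one feature map, take dot products with the other feature map at all displacements within a small neighborhood; the real layer operates on learned CNN features and a larger displacement range):

```python
import numpy as np

def correlation(f1, f2, max_disp=1):
    """Correlation between two feature maps f1, f2 of shape (H, W, C).

    Returns (H, W, D) with D = (2*max_disp+1)**2: for every location in
    f1, the dot product with f2 displaced by (dy, dx), dy and dx in
    [-max_disp, max_disp]. Out-of-bounds positions are zero-padded.
    """
    h, w, c = f1.shape
    k = 2 * max_disp + 1
    out = np.zeros((h, w, k * k))
    padded = np.zeros((h + 2 * max_disp, w + 2 * max_disp, c))
    padded[max_disp:max_disp + h, max_disp:max_disp + w] = f2
    idx = 0
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            shifted = padded[max_disp + dy:max_disp + dy + h,
                             max_disp + dx:max_disp + dx + w]
            out[..., idx] = (f1 * shifted).sum(axis=-1)
            idx += 1
    return out

f = np.ones((2, 2, 3))
out = correlation(f, f, max_disp=1)
```

For identical inputs, the zero-displacement channel (the center of the 3x3 neighborhood) is simply the per-pixel squared feature norm.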


Reconstruction of Rigid Body Models from Motion Distorted Laser Range Data Using Optical Flow

Eddy Ilg, Rainer Kuemmerle, Wolfram Burgard, Thomas Brox

Int. Conference on Robotics and Automation (ICRA) 2014

The setup of tilting a 2D laser range finder up and down is a widespread strategy to acquire 3D point clouds. This setup requires that the scene is static while the robot takes a 3D scan. If an object moves through the scene during the measurement process and these movements are not taken into account, the resulting model will be distorted. This paper presents an approach to reconstruct the 3D model of a moving rigid object from the inconsistent set of 2D measurements with the help of a camera. Our approach utilizes optical flow in the camera images to estimate the motion in the image plane and point-line constraints to compensate for the missing information about the motion in depth. We combine multiple sweeps and/or views into a single consistent model using a point-to-plane ICP approach and optimize single sweeps by smoothing the resulting trajectory. Experiments obtained in real outdoor scenarios with moving cars demonstrate that our approach yields accurate models.
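The point-to-plane ICP error used for combining sweeps can be illustrated by its residual (a minimal sketch for already-matched point pairs; the full algorithm additionally searches for correspondences and solves for a rigid transform each iteration):

```python
import numpy as np

def point_to_plane_error(src, dst, dst_normals):
    """Point-to-plane residuals for matched 3D point pairs.

    src, dst, dst_normals: (N, 3) arrays. The residual is the
    point-to-point difference projected onto the destination surface
    normal, so sliding tangentially along the surface costs nothing --
    the key difference from point-to-point ICP.
    """
    return np.einsum("ij,ij->i", src - dst, dst_normals)

# Demo: both destination points lie on the plane z = 0 (normal +z).
src = np.array([[1.0, 2.0, 0.5], [3.0, 0.0, 0.0]])
dst = np.zeros((2, 3))
normals = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
r = point_to_plane_error(src, dst, normals)
```

Only the offset along the normal contributes: the first point is 0.5 above the plane, the second lies on it despite its tangential offset.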


Computer Vision and Machine Perception Lab
Saarland University
Building E1.7, Room 1.05
66123 Saarbrücken, Germany
Google Maps
Front Office:
Mona Linn
Prof. Ilg Office Hours:
Thursdays 2:45 to 4:45pm
in E1.7, Room 1.05
+49 681 302 64047


Please contact us for questions. The author is not responsible for the content of external pages.