Multitask-Learning Paper Overview 1

3 minute read

An Overview of Multi-Task Learning in Deep Neural Networks

Reference: Ruder, S., “An Overview of Multi-Task Learning in Deep Neural Networks”, arXiv:1706.05098, June 2017

paper source

MTL in Deep Neural Networks:

Sharing types:

  • Hard Sharing
[Figure: hard parameter sharing of hidden layers across tasks]

Hard parameter sharing is the most commonly used approach to MTL, especially with neural networks. It is applied by sharing the hidden layers between all tasks while keeping task-specific output layers, which lowers the risk of overfitting.

“The more tasks we are learning simultaneously, the more our model has to find a representation that captures all of the tasks and the less is our chance of overfitting on our original task.”

  • Soft Sharing
[Figure: soft parameter sharing with per-task models]

With soft parameter sharing, on the other hand, each task has its own model and parameters. The distance between the parameters of the different tasks is then regularized to encourage them to be similar. A minimal code sketch of both sharing types is given below.
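
A minimal PyTorch sketch of the two sharing types; the layer sizes, the two example task heads, and the helper names (HardSharingModel, soft_sharing_penalty) are illustrative assumptions, not anything prescribed by the paper.

```python
import torch
import torch.nn as nn

# Hard parameter sharing: one trunk of hidden layers shared by all tasks,
# plus one task-specific output head per task.
class HardSharingModel(nn.Module):
    def __init__(self, in_dim=32, hidden_dim=64, task_out_dims=(10, 1)):
        super().__init__()
        self.shared = nn.Sequential(                  # shared hidden layers
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.heads = nn.ModuleList(                   # task-specific output layers
            [nn.Linear(hidden_dim, d) for d in task_out_dims]
        )

    def forward(self, x):
        h = self.shared(x)
        return [head(h) for head in self.heads]

# Soft parameter sharing: each task keeps its own model; an L2 penalty on the
# distance between corresponding parameters encourages them to stay similar.
def soft_sharing_penalty(model_a, model_b):
    return sum(
        ((p_a - p_b) ** 2).sum()
        for p_a, p_b in zip(model_a.parameters(), model_b.parameters())
    )
```

Under soft sharing, each task's loss would simply add \(\lambda \times\) soft_sharing_penalty as an extra regularization term.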

Underlying MTL Mechanisms:

  • Implicit Data Augmentation
    • effectively increases the sample size
    • the different noise patterns of the tasks average out, smoothing the overall noise
  • Attention Focusing
    • Focuses the model on the features that really matter
    • Other tasks provide evidence for the relevance or irrelevance of features
  • Eavesdropping
    • Some feature G is easier for task A to learn than for task B; through MTL, task B can eavesdrop on task A and learn G better
  • Representation Bias
    • Helps with model generalization
  • Regularization:
    • Acts as a “regularizer” by introducing an inductive bias -> reduces the risk of overfitting (see the joint training sketch below)
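
All of these mechanisms arise from optimizing a combined objective, so the shared parameters receive gradients from every task. A minimal sketch of one joint training step; the toy layer sizes and the two hypothetical tasks (a classification head and a regression head) are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Toy hard-sharing setup: a shared trunk with a classification head (task A)
# and a regression head (task B). All sizes are illustrative.
shared = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
head_a, head_b = nn.Linear(64, 10), nn.Linear(64, 1)
params = list(shared.parameters()) + list(head_a.parameters()) + list(head_b.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()

def training_step(x, y_a, y_b, w_a=1.0, w_b=1.0):
    """One joint step: the shared trunk gets gradients from both task losses."""
    optimizer.zero_grad()
    h = shared(x)
    loss = w_a * ce(head_a(h), y_a) + w_b * mse(head_b(h), y_b)
    loss.backward()
    optimizer.step()
    return loss.item()
```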

Recent Work on MTL in Deep Learning

[Figures: architectures of recent deep MTL approaches surveyed in the paper]

MTL in non-neural models

  • Block-sparse Regularization:
    • Assumes: tasks used in multi-task learning are closely related
    • enforces the lasso (L1 norm) to zero out features
    • Block-sparse regularization: \(l_1/l_q\) norms
  • Learning task relationships:
    • No assumption about the relationships between tasks
    • A constraint that enforces a clustering of tasks, penalizing both the norms of the task column vectors \(a_{·,1},\dots, a_{·,T}\) as well as their variance:

      \(\Omega = ||\bar{a}||^2+\frac{\lambda}{T}\sum^{T}_{t=1}||a_{·,t}-\bar{a}||^2\), where \(\bar{a}=\frac{1}{T}\sum_{t=1}^{T}a_{·,t}\) is the mean parameter vector. Unlike the block-sparse penalty, this one pulls all parameter vectors \(a_{·,t}\) toward their mean \(\bar{a}\) (a small numerical sketch of both penalties follows below).
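
A small NumPy sketch of both non-neural penalties; the parameter matrix, its dimensions, and the value of \(\lambda\) are random/illustrative assumptions.

```python
import numpy as np

# Task parameter matrix A with one column a_{.,t} per task
# (d features x T tasks); the values here are random placeholders.
d, T = 5, 4
A = np.random.randn(d, T)

# Block-sparse (l1/l_inf) regularization: l_inf over tasks for each feature
# row, then l1 over the rows, which drives entire feature rows to zero.
block_sparse = np.abs(A).max(axis=1).sum()

# Clustering penalty from above:
# Omega = ||a_bar||^2 + (lambda / T) * sum_t ||a_{.,t} - a_bar||^2
lam = 0.1                      # illustrative trade-off value
a_bar = A.mean(axis=1)         # mean parameter vector
omega = (a_bar ** 2).sum() + (lam / T) * ((A - a_bar[:, None]) ** 2).sum()
```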

Auxiliary Tasks

MTL is a natural fit in situations where we are interested in obtaining predictions for multiple tasks at once. Such scenarios are common for instance in finance or economics forecasting, where we might want to predict the value of many possibly related indicators, or in bioinformatics where we might want to predict symptoms for multiple diseases simultaneously.

In scenarios such as drug discovery, where tens or hundreds of active compounds should be predicted, MTL accuracy increases continuously with the number of tasks.

  • Related task
    • Using a related task as an auxiliary task for MTL
      • use tasks that predict different characteristics of the road as auxiliary tasks for predicting the steering direction in a self-driving car;
      • use head pose estimation and facial attribute inference as auxiliary tasks for facial landmark detection;
      • jointly learn query classification and web search;
      • jointly predict the class and the coordinates of an object in an image;
      • jointly predict the phoneme duration and frequency profile for text-to-speech.
  • Adversarial

    Often, labeled data for a related task is unavailable. In some circumstances, however, we have access to a task that is the opposite of what we want to achieve; such an adversarial auxiliary task can be trained through a gradient reversal layer (a sketch follows at the end of this section).

  • Hints

    Learn features that might not be easy to learn using the original task alone.

  • Focusing attention

    Focus attention on parts of the image that a network might normally ignore.

  • Quantization smoothing
  • Predicting inputs
  • Using the future to predict the present
  • Representation learning
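
As noted under the adversarial bullet above, an opposite auxiliary task is commonly trained through a gradient reversal layer, so the shared representation learns to maximize rather than minimize the auxiliary loss. A minimal PyTorch sketch; the scaling factor lambd and the usage lines are assumptions.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips (and scales) gradients on the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse the gradient so the shared encoder is trained to
        # *maximize* the adversarial auxiliary loss.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage sketch (encoder and adv_head are hypothetical modules):
#   features = encoder(x)
#   adv_logits = adv_head(grad_reverse(features, lambd=0.5))
```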