Shashank Tripathi
Research
My research lies at the intersection of machine learning, computer vision, and computer graphics.
Specifically, I am interested in 3D modeling of human bodies, modeling human-object interactions, and physics-inspired human motion understanding. In the past, I have worked on synthetic data for applications like object detection and human pose estimation from limited supervision.
Before diving into human body research, I dabbled in visual servoing, medical image analysis, pedestrian detection, and reinforcement learning.
Publications
DECO: Dense Estimation of 3D Human-Scene COntact in the Wild
Shashank Tripathi, Agniv Chatterjee, Jean-Claude Passy, Hongwei Yi, Dimitrios Tzionas, Michael J. Black
International Conference on Computer Vision (ICCV)
2023
(Oral presentation)
DECO estimates dense vertex-level 3D human-scene and human-object contact across the full body mesh and works on diverse and challenging human-object interactions in arbitrary in-the-wild images. DECO is trained on DAMON, a new and unique dataset with 3D contact annotations for in-the-wild images, manually annotated using a custom 3D contact labeling tool.
paper | abstract | project | dataset | video | bibtex | poster
Understanding how humans use physical contact to interact with the world is key to enabling human-centric artificial intelligence. While inferring 3D contact is crucial for modeling realistic and physically-plausible human-object interactions, existing methods either focus on 2D, consider body joints rather than the surface, use coarse 3D body regions, or do not generalize to in-the-wild images. In contrast, we focus on inferring dense, 3D contact between the full body surface and objects in arbitrary images. To achieve this, we first collect DAMON, a new dataset containing dense vertex-level contact annotations paired with RGB images containing complex human-object and human-scene contact. Second, we train DECO, a novel 3D contact detector that uses both body-part-driven and scene-context-driven attention to estimate vertex-level contact on the SMPL body. DECO builds on the insight that human observers recognize contact by reasoning about the contacting body parts, their proximity to scene objects, and the surrounding scene context. We perform extensive evaluations of our detector on DAMON as well as on the RICH and BEHAVE datasets. We significantly outperform existing SOTA methods across all benchmarks. We also show qualitatively that DECO generalizes well to diverse and challenging real-world human interactions in natural images. The code, data, and models are available for research purposes.
@inproceedings{tripathi2023deco,
title = {{DECO}: Dense Estimation of {3D} Human-Scene Contact In The Wild},
author = {Tripathi, Shashank and Chatterjee, Agniv and Passy, Jean-Claude and Yi, Hongwei and Tzionas, Dimitrios and Black, Michael J.},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2023},
pages = {8001-8013}
}
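For readers who want a concrete picture of the vertex-level prediction task, here is a minimal PyTorch sketch of a contact head that fuses scene-context and body-part features with attention and outputs per-vertex contact probabilities on the SMPL mesh. The module name, feature dimensions, and token counts are illustrative assumptions, not the released DECO code.
import torch
import torch.nn as nn

class ToyContactHead(nn.Module):
    """Illustrative only: per-vertex contact probabilities from two feature streams."""
    def __init__(self, feat_dim=256, num_vertices=6890):  # 6890 = SMPL vertex count
        super().__init__()
        self.vertex_queries = nn.Parameter(torch.randn(num_vertices, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(feat_dim, 1)

    def forward(self, scene_feats, part_feats):
        # scene_feats, part_feats: (B, N_tokens, feat_dim) from two image branches.
        tokens = torch.cat([scene_feats, part_feats], dim=1)
        queries = self.vertex_queries.unsqueeze(0).expand(tokens.shape[0], -1, -1)
        fused, _ = self.attn(queries, tokens, tokens)              # (B, V, feat_dim)
        return torch.sigmoid(self.classifier(fused)).squeeze(-1)  # (B, V) contact probs

head = ToyContactHead()
probs = head(torch.randn(2, 49, 256), torch.randn(2, 49, 256))
print(probs.shape)  # torch.Size([2, 6890])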
EMOTE: Emotional Speech-Driven Animation with Content-Emotion Disentanglement
Radek Danecek, Kiran Chhatre, Shashank Tripathi, Yandong Wen, Michael J. Black, Timo Bolkart
SIGGRAPH Asia 2023
Given audio input and an emotion label, EMOTE generates an animated 3D head that has state-of-the-art lip synchronization while expressing the emotion. The method is trained from 2D video sequences using a novel video emotion loss and a mechanism to disentangle emotion from speech.
paper | abstract | project | bibtex
To be widely adopted, 3D facial avatars must be animated easily, realistically, and directly from speech signals. While the best recent methods generate 3D animations that are synchronized with the input audio, they largely ignore the impact of emotions on facial expressions. Realistic facial animation requires lip-sync together with the natural expression of emotion. To that end, we propose EMOTE (Expressive Model Optimized for Talking with Emotion), which generates 3D talking-head avatars that maintain lip-sync from speech while enabling explicit control over the expression of emotion. To achieve this, we supervise EMOTE with decoupled losses for speech (i.e., lip-sync) and emotion. These losses are based on two key observations: (1) deformations of the face due to speech are spatially localized around the mouth and have high temporal frequency, whereas (2) facial expressions may deform the whole face and occur over longer intervals. Thus, we train EMOTE with a per-frame lip-reading loss to preserve the speech-dependent content, while supervising emotion at the sequence level. Furthermore, we employ a content-emotion exchange mechanism in order to supervise different emotions on the same audio, while maintaining the lip motion synchronized with the speech. To employ deep perceptual losses without getting undesirable artifacts, we devise a motion prior in the form of a temporal VAE. Due to the absence of high-quality aligned emotional 3D face datasets with speech, EMOTE is trained with 3D pseudo-ground-truth extracted from an emotional video dataset (i.e., MEAD). Extensive qualitative and perceptual evaluations demonstrate that EMOTE produces speech-driven facial animations with better lip-sync than state-of-the-art methods trained on the same data, while offering additional, high-quality emotional control.
@inproceedings{EMOTE,
title = {Emotional Speech-Driven Animation with Content-Emotion Disentanglement},
author = {Danecek, Radek and Chhatre, Kiran and Tripathi, Shashank and Wen, Yandong and Black, Michael and Bolkart, Timo},
publisher = {ACM},
year = {2023},
doi = {10.1145/3610548.3618183},
url = {https://emote.is.tue.mpg.de/index.html}
}
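The decoupled supervision described above can be summarized in a few lines: speech content is penalized per frame around the mouth, while emotion is penalized once per sequence. The sketch below is an assumed simplification with made-up tensor shapes and weights, not the authors' training code.
import torch
import torch.nn.functional as F

def content_emotion_loss(pred_lip_feats, gt_lip_feats,
                         pred_emotion_logits, gt_emotion_label,
                         w_content=1.0, w_emotion=0.5):
    # Per-frame content term: (B, T, D) lip features compared frame by frame.
    content = F.mse_loss(pred_lip_feats, gt_lip_feats)
    # Sequence-level emotion term: a single label per sequence.
    emotion = F.cross_entropy(pred_emotion_logits, gt_emotion_label)
    return w_content * content + w_emotion * emotion

loss = content_emotion_loss(torch.randn(4, 60, 128), torch.randn(4, 60, 128),  # 60 frames
                            torch.randn(4, 8), torch.randint(0, 8, (4,)))      # 8 emotions
print(loss.item())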
3D Human Pose Estimation via Intuitive Physics
Shashank Tripathi, Lea Müller, Chun-Hao P. Huang, Omid Taheri, Michael J. Black, Dimitrios Tzionas
Computer Vision and Pattern Recognition (CVPR)
2023
IPMAN estimates a 3D body from a color image in a "stable" configuration by encouraging plausible floor contact and overlapping CoP and CoM. It exploits interpenetration of the body mesh with the ground plane as a heuristic for pressure.
paper | abstract | project | dataset | video | bibtex | poster
The estimation of 3D human body shape and pose from images has advanced rapidly. While the results are often well aligned with image features in the camera view, the 3D pose is often physically implausible; bodies lean, float, or penetrate the floor. This is because most methods ignore the fact that bodies are typically supported by the scene. To address this, some methods exploit physics engines to enforce physical plausibility. Such methods, however, are not differentiable, rely on unrealistic proxy bodies, and are difficult to integrate into existing optimization and learning frameworks. To account for this, we take a different approach that exploits novel intuitive-physics (IP) terms that can be inferred from a 3D SMPL body interacting with the scene. Specifically, we infer biomechanically relevant features such as the pressure heatmap of the body on the floor, the Center of Pressure (CoP) from the heatmap, and the SMPL body’s Center of Mass (CoM) projected on the floor. With these, we develop IPMAN, to estimate a 3D body from a color image in a “stable” configuration by encouraging plausible floor contact and overlapping CoP and CoM. Our IP terms are intuitive, easy to implement, fast to compute, and can be integrated into any SMPL-based optimization or regression method; we show examples of both. To evaluate our method, we present MoYo, a dataset with synchronized multi-view color images and 3D bodies with complex poses, body-floor contact, and ground-truth CoM and pressure. Evaluation on MoYo, RICH and Human3.6M show that our IP terms produce more plausible results than the state of the art; they improve accuracy for static poses, while not hurting dynamic ones. Code and data will be available for research.
@inproceedings{tripathi2023ipman,
title = {{3D} Human Pose Estimation via Intuitive Physics},
author = {Tripathi, Shashank and M{\"u}ller, Lea and Huang, Chun-Hao P. and Taheri, Omid
and Black, Michael J. and Tzionas, Dimitrios},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR)},
month = {June},
year = {2023}
}
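The stability idea is simple enough to state in a few lines of NumPy. The sketch below is an assumed simplification of the intuitive-physics terms (soft pressure from proximity to a ground plane at z = 0, uniform vertex mass for the CoM), not the paper's exact formulation.
import numpy as np

def stability_terms(vertices, alpha=10.0):
    """vertices: (V, 3) body mesh vertices; ground plane assumed at z = 0."""
    heights = vertices[:, 2]
    # Soft pressure proxy: ~1 for vertices at or below the floor, decays above it.
    pressure = np.exp(-alpha * np.clip(heights, 0.0, None))
    cop = (pressure[:, None] * vertices[:, :2]).sum(0) / (pressure.sum() + 1e-8)
    com_xy = vertices[:, :2].mean(0)          # uniform-mass CoM projected on the floor
    stability = np.linalg.norm(cop - com_xy)  # small when CoP and CoM overlap
    return cop, com_xy, stability

verts = np.random.rand(6890, 3)               # stand-in for SMPL vertices
print(stability_terms(verts)[2])
A term of this form is differentiable in the vertices, which is what makes it easy to drop into SMPL-based optimization or regression.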
BITE: Beyond Priors for Improved Three-D Dog Pose Estimation
Nadine Rüegg, Shashank Tripathi, Konrad Schindler, Michael J. Black, Silvia Zuffi
Computer Vision and Pattern Recognition (CVPR)
2023
BITE enables 3D shape and pose estimation of dogs from a single input image. The model handles a wide range of shapes and breeds, as well as challenging postures far from the available training poses, like sitting or lying on the ground.
paper | abstract | project | video | bibtex
We address the problem of inferring the 3D shape and pose of dogs from images. Given the lack of 3D training data, this problem is challenging, and the best methods lag behind those designed to estimate human shape and pose. To make progress, we attack the problem from multiple sides at once. First, we need a good 3D shape prior, like those available for humans. To that end, we learn a dog-specific 3D parametric model, called D-SMAL. Second, existing methods focus on dogs in standing poses because when they sit or lie down, their legs are self occluded and their bodies deform. Without access to a good pose prior or 3D data, we need an alternative approach. To that end, we exploit contact with the ground as a form of side information. We consider an existing large dataset of dog images and label any 3D contact of the dog with the ground. We exploit body-ground contact in estimating dog pose and find that it significantly improves results. Third, we develop a novel neural network architecture to infer and exploit this contact information. Fourth, to make progress, we have to be able to measure it. Current evaluation metrics are based on 2D features like keypoints and silhouettes, which do not directly correlate with 3D errors. To address this, we create a synthetic dataset containing rendered images of scanned 3D dogs. With these advances, our method recovers significantly better dog shape and pose than the state of the art, and we evaluate this improvement in 3D. Our code, model and test dataset are publicly available for research purposes at https://bite.is.tue.mpg.de.
@inproceedings{bite2023rueegg,
title = {{BITE}: Beyond Priors for Improved Three-{D} Dog Pose Estimation},
author = {R\"uegg, Nadine and Tripathi, Shashank and Schindler, Konrad and Black, Michael J. and Zuffi, Silvia},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR)},
pages = {8867-8876},
month = {June},
year = {2023}
}
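As a rough illustration of how body-ground contact labels can act as side information, the toy loss below pulls labeled contact vertices onto the ground plane; the function name, vertex count, and plane convention are assumptions for the example.
import torch

def ground_contact_loss(vertices, contact_labels, ground_z=0.0):
    """vertices: (B, V, 3); contact_labels: (B, V) in {0, 1}."""
    height = vertices[..., 2] - ground_z
    # Penalize labeled contact vertices that hover above (or sink below) the floor.
    per_vertex = contact_labels * height.abs()
    return per_vertex.sum() / (contact_labels.sum() + 1e-8)

verts = torch.randn(2, 3889, 3)                        # vertex count is arbitrary here
labels = torch.randint(0, 2, (2, 3889)).float()
print(ground_contact_loss(verts, labels).item())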
MIME: Human-Aware 3D Scene Generation
Hongwei Yi, Chun-Hao P. Huang, Shashank Tripathi, Lea Hering, Justus Thies, Michael J. Black
Computer Vision and Pattern Recognition (CVPR)
2023
MIME takes 3D human motion capture and generates plausible 3D scenes that are consistent with the motion. Why? Most mocap sessions capture the person but not the scene.
paper | abstract | project | video | bibtex
Generating realistic 3D worlds occupied by moving humans has many applications in games, architecture, and synthetic data creation. But generating such scenes is expensive and labor intensive. Recent work generates human poses and motions given a 3D scene. Here, we take the opposite approach and generate 3D indoor scenes given 3D human motion. Such motions can come from archival motion capture or from IMU sensors worn on the body, effectively turning human movement into a "scanner" of the 3D world. Intuitively, human movement indicates the free space in a room and human contact indicates surfaces or objects that support activities such as sitting, lying or touching. We propose MIME (Mining Interaction and Movement to infer 3D Environments), which is a generative model of indoor scenes that produces furniture layouts that are consistent with the human movement. MIME uses an auto-regressive transformer architecture that takes the already generated objects in the scene as well as the human motion as input, and outputs the next plausible object. To train MIME, we build a dataset by populating the 3D FRONT scene dataset with 3D humans. Our experiments show that MIME produces more diverse and plausible 3D scenes than a recent generative scene method that does not know about human movement. Code and data will be available for research.
@inproceedings{yi2022mime,
title = {{MIME}: Human-Aware {3D} Scene Generation},
author = {Yi, Hongwei and Huang, Chun-Hao P. and Tripathi, Shashank and Hering, Lea and
Thies, Justus and Black, Michael J.},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR)},
pages={12965-12976},
month = {June},
year = {2023}
}
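To make the auto-regressive generation loop concrete, here is a toy version in PyTorch: objects placed so far and the human motion are encoded together, and the model proposes parameters for the next object. All interfaces, dimensions, and the 7-number object encoding are assumptions, not MIME's released API.
import torch
import torch.nn as nn

class ToySceneDecoder(nn.Module):
    def __init__(self, d=64, obj_dim=7):          # e.g. class id + 3D position + 3D size
        super().__init__()
        self.embed_obj = nn.Linear(obj_dim, d)
        self.embed_human = nn.Linear(3, d)        # per-frame root position of the motion
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, obj_dim)

    def forward(self, objects, motion):
        tokens = torch.cat([self.embed_human(motion), self.embed_obj(objects)], dim=1)
        return self.head(self.encoder(tokens)[:, -1])   # parameters of the next object

model = ToySceneDecoder()
motion = torch.randn(1, 120, 3)                  # 120 frames of root trajectory
objects = torch.zeros(1, 1, 7)                   # start token
for _ in range(5):                               # greedily place five objects
    nxt = model(objects, motion)
    objects = torch.cat([objects, nxt.unsqueeze(1)], dim=1)
print(objects.shape)  # torch.Size([1, 6, 7])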
PERI: Part Aware Emotion Recognition in the Wild
Akshita Mittel, Shashank Tripathi
European Conference on Computer Vision Workshops (ECCVW)
2022
An in-the-wild emotion recognition network that leverages both body pose and facial landmarks using a novel part aware spatial (PAS) image representation and context infusion (Cont-In) blocks.
paper | abstract
Emotion recognition aims to interpret the emotional states of a person based on various inputs including audio, visual, and textual cues. This paper focuses on emotion recognition using visual features. To leverage the correlation between facial expression and the emotional state of a person, pioneering methods rely primarily on facial features. However, facial features are often unreliable in natural unconstrained scenarios, such as in crowded scenes, as the face lacks pixel resolution and contains artifacts due to occlusion and blur. To address this, in the wild emotion recognition exploits full-body person crops as well as the surrounding scene context. In a bid to use body pose for emotion recognition, such methods fail to realize the potential that facial expressions, when available, offer. Thus, the aim of this paper is two-fold. First, we demonstrate our method, PERI, to leverage both body pose and facial landmarks. We create part aware spatial (PAS) images by extracting key regions from the input image using a mask generated from both body pose and facial landmarks. This allows us to exploit body pose in addition to facial context whenever available. Second, to reason from the PAS images, we introduce context infusion (Cont-In) blocks. These blocks attend to part-specific information, and pass them onto the intermediate features of an emotion recognition network. Our approach is conceptually simple and can be applied to any existing emotion recognition method. We provide our results on the publicly available in the wild EMOTIC dataset. Compared to existing methods, PERI achieves superior performance and leads to significant improvements in the mAP of emotion categories, while decreasing Valence, Arousal and Dominance errors. Importantly, we observe that our method improves performance in both images with fully visible faces as well as in images with occluded or blurred faces.
@inproceedings{mittel2022peri,
title = {{PERI}: Part Aware Emotion Recognition in the Wild},
author = {Mittel, Akshita and Tripathi, Shashank},
booktitle = {Computer Vision -- ECCV 2022 Workshops},
publisher = {Springer Nature Switzerland},
pages = {76--92},
year = {2023}
}
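A context-infusion block of the kind described above can be sketched as a small convolutional branch that injects the part-aware spatial (PAS) map into intermediate backbone features; the additive fusion and all shapes below are my assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextInfusion(nn.Module):
    def __init__(self, channels, pas_channels=1):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(pas_channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1))

    def forward(self, feats, pas_map):
        # Resize the PAS map to the feature resolution, then infuse it additively.
        pas = F.interpolate(pas_map, size=feats.shape[-2:], mode="bilinear",
                            align_corners=False)
        return feats + self.proj(pas)

block = ContextInfusion(channels=256)
out = block(torch.randn(2, 256, 28, 28), torch.rand(2, 1, 224, 224))
print(out.shape)  # torch.Size([2, 256, 28, 28])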
Occluded Human Mesh Recovery
Rawal Khirodkar, Shashank Tripathi, Kris Kitani
Computer Vision and Pattern Recognition (CVPR)
2022
A novel top-down mesh recovery architecture capable of leveraging image spatial context for handling multi-person occlusion and crowding.
paper | abstract | project
Top-down methods for monocular human mesh recovery have two stages: (1) detect human bounding boxes; (2) treat each bounding box as an independent single-human mesh recovery task. Unfortunately, the single-human assumption does not hold in images with multi-human occlusion and crowding. Consequently, top-down methods have difficulties in recovering accurate 3D human meshes under severe person-person occlusion. To address this, we present Occluded Human Mesh Recovery (OCHMR) - a novel top-down mesh recovery approach that incorporates image spatial context to overcome the limitations of the single-human assumption. The approach is conceptually simple and can be applied to any existing top-down architecture. Along with the input image, we condition the top-down model on spatial context from the image in the form of body-center heatmaps. To reason from the predicted body centermaps, we introduce Contextual Normalization (CoNorm) blocks to adaptively modulate intermediate features of the top-down model. The contextual conditioning helps our model disambiguate between two severely overlapping human bounding-boxes, making it robust to multi-person occlusion. Compared with state-of-the-art methods, OCHMR achieves superior performance on challenging multi-person benchmarks like 3DPW, CrowdPose and OCHuman. Specifically, our proposed contextual reasoning architecture applied to the SPIN model with ResNet-50 backbone results in 75.2 PMPJPE on 3DPW-PC, 23.6 AP on CrowdPose and 37.7 AP on OCHuman datasets, a significant improvement of 6.9 mm, 6.4 AP and 20.8 AP respectively over the baseline. Code and models will be released.
@inproceedings{khirodkar_ochmr_2022,
title = {Occluded Human Mesh Recovery},
author = {Khirodkar, Rawal and Tripathi, Shashank and Kitani, Kris},
booktitle = {IEEE/CVF Conf.~on Computer Vision and Pattern Recognition (CVPR)},
month = jun,
year = {2022},
doi = {},
month_numeric = {6}
}
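The contextual conditioning can be pictured as a conditional-normalization layer: the body-center heatmaps predict per-channel scale and shift that modulate intermediate features. The block below is a sketch under that reading of the abstract; the normalization choice and channel counts are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoNormSketch(nn.Module):
    def __init__(self, channels, heatmap_channels=2):   # e.g. local + global centermaps
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_gamma = nn.Conv2d(heatmap_channels, channels, kernel_size=1)
        self.to_beta = nn.Conv2d(heatmap_channels, channels, kernel_size=1)

    def forward(self, feats, centermaps):
        maps = F.interpolate(centermaps, size=feats.shape[-2:], mode="bilinear",
                             align_corners=False)
        return self.norm(feats) * (1 + self.to_gamma(maps)) + self.to_beta(maps)

block = CoNormSketch(channels=256)
out = block(torch.randn(2, 256, 56, 56), torch.rand(2, 2, 224, 224))
print(out.shape)  # torch.Size([2, 256, 56, 56])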
AGORA: Avatars in Geography Optimized for Regression Analysis
Priyanka Patel, Chun-Hao P. Huang, Joachim Tesch, David T. Hoffmann, Shashank Tripathi and Michael J. Black
Computer Vision and Pattern Recognition (CVPR)
2021
A synthetic dataset with high realism and highly accurate ground truth containing 4240 textured scans and SMPL-X fits.
paper | abstract | project | video
While the accuracy of 3D human pose estimation from images has steadily improved on benchmark datasets, the best methods still fail in many real-world scenarios. This suggests that there is a domain gap between current datasets and common scenes containing people. To obtain ground-truth 3D pose, current datasets limit the complexity of clothing, environmental conditions, number of subjects, and occlusion. Moreover, current datasets evaluate sparse 3D joint locations corresponding to the major joints of the body, ignoring the hand pose and the face shape. To evaluate the current state-of-the-art methods on more challenging images, and to drive the field to address new problems, we introduce AGORA, a synthetic dataset with high realism and highly accurate ground truth. Here we use 4240 commercially-available, high-quality, textured human scans in diverse poses and natural clothing; this includes 257 scans of children. We create reference 3D poses and body shapes by fitting the SMPL-X body model (with face and hands) to the 3D scans, taking into account clothing. We create around 14K training and 3K test images by rendering between 5 and 15 people per image using either image-based lighting or rendered 3D environments, taking care to make the images physically plausible and photoreal. In total, AGORA consists of 173K individual person crops. We evaluate existing state-of-the-art methods for 3D human pose estimation on this dataset and find that most methods perform poorly on images of children. Hence, we extend the SMPL-X model to better capture the shape of children. Additionally, we fine-tune methods on AGORA and show improved performance on both AGORA and 3DPW, confirming the realism of the dataset. We provide all the registered 3D reference training data, rendered images, and a web-based evaluation site at https://agora.is.tue.mpg.de/.
PoseNet3D: Learning Temporally Consistent 3D Human Pose via Knowledge Distillation
Shashank Tripathi, Siddhant Ranade, Ambrish Tyagi and Amit Agrawal
International Conference on 3D Vision (3DV)
2020
(Oral presentation)
Temporally consistent recovery of 3D human pose from 2D joints without using 3D data in any form.
paper | abstract | videos
Recovering 3D human pose from 2D joints is a highly unconstrained problem. We propose a novel neural network architecture, PoseNet3D, that takes 2D joints as input and outputs 3D skeletons and SMPL pose parameters. By casting our learning approach in a Knowledge Distillation framework, we avoid using any 3D data such as paired 2D-3D data, unpaired 3D data, motion capture sequences or multi-view images during training. We first train a teacher network that outputs 3D skeletons, using only 2D poses for training. The teacher network distills its knowledge to a student network that predicts 3D pose in SMPL representation. Finally, both the teacher and the student networks are jointly fine-tuned in an end-to-end manner using self-consistency and adversarial losses, improving the accuracy of the individual networks. Results on the Human3.6M dataset for 3D human pose estimation demonstrate that our approach reduces the 3D joint prediction error by 18% or more compared to previous methods. Qualitative results show that the recovered 3D poses and meshes are natural, realistic, and flow smoothly over consecutive frames.
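A minimal sketch of the distillation objective, assuming a toy orthographic camera and generic joint tensors (not the paper's code): the student matches the teacher's 3D skeleton while both remain consistent with the observed 2D joints.
import torch
import torch.nn.functional as F

def project(joints3d):
    # Toy orthographic projection: simply drop the depth coordinate.
    return joints3d[..., :2]

def distillation_loss(student_j3d, teacher_j3d, joints2d, w_distill=1.0, w_reproj=1.0):
    distill = F.mse_loss(student_j3d, teacher_j3d.detach())   # follow the frozen teacher
    reproj = F.mse_loss(project(student_j3d), joints2d)       # stay faithful to 2D input
    return w_distill * distill + w_reproj * reproj

loss = distillation_loss(torch.randn(8, 17, 3), torch.randn(8, 17, 3), torch.randn(8, 17, 2))
print(loss.item())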
Learning to Generate Synthetic Data via Compositing
Shashank Tripathi, Siddhartha Chandra, Amit Agrawal, Ambrish Tyagi, James Rehg and Visesh Chari
Computer Vision and Pattern Recognition (CVPR)
2019
Efficient, task-aware and realistic synthesis of composite images for training classification and object detection models.
paper | abstract | poster
We present a task-aware approach to synthetic data generation. Our framework employs a trainable synthesizer network that is optimized to produce meaningful training samples by assessing the strengths and weaknesses of a 'target' network. The synthesizer and target networks are trained in an adversarial manner wherein each network is updated with a goal to outdo the other. Additionally, we ensure the synthesizer generates realistic data by pairing it with a discriminator trained on real-world images. Further, to make the target classifier invariant to blending artefacts, we introduce these artefacts to background regions of the training images so the target does not over-fit to them. We demonstrate the efficacy of our approach by applying it to different target networks including a classification network on AffNIST, and two object detection networks (SSD, Faster-RCNN) on different datasets. On the AffNIST benchmark, our approach is able to surpass the baseline results with just half the training examples. On the VOC person detection benchmark, we show improvements of up to 2.7% as a result of our data augmentation. Similarly on the GMU detection benchmark, we report a performance boost of 3.5% in mAP over the baseline method, outperforming the previous state of the art approaches by up to 7.5% on specific categories.
@inproceedings{tripathi2019learning,
title={Learning to generate synthetic data via compositing},
author={Tripathi, Shashank and Chandra, Siddhartha and Agrawal, Amit and Tyagi, Ambrish
and Rehg, James M and Chari, Visesh},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition},
pages={461--470},
year={2019}
}
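The adversarial objective for the synthesizer can be written schematically as follows; the loss weights and interfaces are assumptions for illustration, not the paper's implementation. The synthesizer is rewarded when the target network fails on its composites and when a discriminator judges them realistic.
import torch
import torch.nn.functional as F

def synthesizer_loss(target_logits, true_labels, disc_score, w_adv=1.0, w_real=0.5):
    # Reward composites that the target network misclassifies.
    fool_target = -F.cross_entropy(target_logits, true_labels)
    # Standard GAN term: the discriminator should believe the composites are real.
    look_real = F.binary_cross_entropy_with_logits(disc_score, torch.ones_like(disc_score))
    return w_adv * fool_target + w_real * look_real

loss = synthesizer_loss(torch.randn(16, 10), torch.randint(0, 10, (16,)), torch.randn(16, 1))
print(loss.item())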
C2F: Coarse-to-Fine Vision Control System for Automated Microassembly
Shashank Tripathi, Devesh Jain and Himanshu Dutt Sharma
Nanotechnology and Nanoscience-Asia
2018
Automated, visual-servoing-based closed-loop system to perform 3D micromanipulation and microassembly tasks.
paper | abstract | video
In this paper, the authors present the development of a completely automated system to perform 3D micromanipulation and microassembly tasks. The microassembly workstation consists of a 3 degree-of-freedom (DOF) MM3A® micromanipulator arm attached to a microgripper, two 2-DOF PI® linear micromotion stages, one optical microscope coupled with a CCD image sensor, and two CMOS cameras for coarse vision. The whole control strategy is subdivided into sequential vision-based routines: manipulator detection and coarse alignment, autofocus and fine alignment of the microgripper, target object detection, and performing the required assembly tasks. A section comparing various objective functions useful in the autofocusing regime is included. The control system is built entirely in the image frame, eliminating the need for system calibration, hence improving speed of operation. A micromanipulation experiment performing pick-and-place of a micromesh is illustrated. This demonstrates a three-fold reduction in setup and run time for fundamental micromanipulation tasks, as compared to manual operation. Accuracy, repeatability and reliability of the programmed system are analyzed.
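As an example of the kind of autofocus objective the paper compares, here is the variance-of-Laplacian focus measure; this particular objective is my illustrative choice rather than the one the authors recommend.
import numpy as np
from scipy.ndimage import laplace

def focus_score(gray_image):
    """Higher variance of the Laplacian indicates a sharper (better-focused) image."""
    return laplace(gray_image.astype(np.float64)).var()

stack = [np.random.rand(480, 640) for _ in range(5)]   # images along the focus axis
best = max(range(len(stack)), key=lambda i: focus_score(stack[i]))
print("best focus index:", best)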
Sub-cortical Shape Morphology and Voxel-based Features for Alzheimer's Disease Classification
Shashank Tripathi, Seyed Hossein Nozadi, Mahsa Shakeri and Samuel Kadoury
IEEE International Symposium on Biomedical Imaging (ISBI)
2017
Alzheimer's disease patient classification using a combination of grey-matter voxel-based intensity variations and 3D structural (shape) features extracted from MRI brain scans.
paper | abstract | poster
Neurodegenerative pathologies, such as Alzheimer’s disease, are linked with morphological alterations and tissue variations in subcortical structures which can be assessed from medical imaging and biological data. In this work, we present an unsupervised framework for the classification of Alzheimer’s disease (AD) patients, stratifying patients into four diagnostic groups, namely: AD, early Mild Cognitive Impairment (MCI), late MCI and normal controls, by combining shape and voxel-based features from 12 sub-cortical areas. An automated anatomical labeling using an atlas-based segmentation approach is proposed to extract multiple regions of interest known to be linked with AD progression. We take advantage of gray-matter voxel-based intensity variations and structural alterations extracted with a spherical harmonics framework to learn the discriminative features between multiple diagnostic classes. The proposed method is validated on 600 patients from the ADNI database by training binary SVM classifiers on dimensionality-reduced features, using both linear and RBF kernels. Results show near state-of-the-art classification accuracy (>88%), especially for the more challenging discrimination tasks: AD vs. LMCI (76.81%), NC vs. EMCI (75.46%) and EMCI vs. LMCI (70.95%). By combining multimodality features, this pipeline demonstrates the potential of exploiting complementary features to improve cognitive assessment.
@inproceedings{tripathi2017sub,
title={Sub-cortical shape morphology and voxel-based features for Alzheimer's disease
classification},
author={Tripathi, Shashank and Nozadi, Seyed Hossein and Shakeri, Mahsa and Kadoury,
Samuel},
booktitle={Biomedical Imaging (ISBI 2017), 2017 IEEE 14th International Symposium on},
pages={991--994},
year={2017},
organization={IEEE}
}
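The classification stage maps directly onto a standard scikit-learn pipeline; the sketch below uses synthetic features in place of the real shape and voxel descriptors, and the component counts are arbitrary.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.random.randn(200, 500)        # 200 subjects, 500 combined shape + voxel features
y = np.random.randint(0, 2, 200)     # a binary task, e.g. AD vs. LMCI

for kernel in ("linear", "rbf"):     # dimensionality reduction + binary SVM
    clf = make_pipeline(StandardScaler(), PCA(n_components=50), SVC(kernel=kernel))
    print(kernel, cross_val_score(clf, X, y, cv=5).mean())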
Deep Spectral-Based Shape Features for Alzheimer’s Disease Classification
Mahsa Shakeri, Hervé Lombaert, Shashank Tripathi and Samuel Kadoury
MICCAI Spectral and Shape Analysis in Medical Imaging (SeSAMI)
2016
Alzheimer's disease classification using a deep-learning variational auto-encoder on shape-based features.
paper | abstract
Alzheimer’s disease (AD) and mild cognitive impairment (MCI) are the most prevalent neurodegenerative brain diseases in the elderly population. Recent studies on medical imaging and biological data have shown morphological alterations of subcortical structures in patients with these pathologies. In this work, we take advantage of these structural deformations for classification purposes. First, triangulated surface meshes are extracted from segmented hippocampus structures in MRI and point-to-point correspondences are established among a population of surfaces using a spectral matching method. Then, a deep learning variational auto-encoder is applied on the vertex coordinates of the mesh models to learn the low-dimensional feature representation. A multi-layer perceptron using softmax activation is trained simultaneously to classify Alzheimer’s patients from normal subjects. Experiments on the ADNI dataset demonstrate the potential of the proposed method in classification of normal individuals from early MCI (EMCI), late MCI (LMCI), and AD subjects, with classification rates outperforming the standard SVM-based approach.
@inproceedings{shakeri2016deep,
title={Deep spectral-based shape features for alzheimer’s disease classification},
author={Shakeri, Mahsa and Lombaert, Herve and Tripathi, Shashank and Kadoury, Samuel
and Alzheimer’s Disease Neuroimaging Initiative and others},
booktitle={International Workshop on Spectral and Shape Analysis in Medical Imaging},
pages={15--24},
year={2016},
organization={Springer}
}
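A compact sketch of the model family described above, assuming flattened hippocampus vertex coordinates as input and arbitrary layer sizes: a variational auto-encoder with a jointly trained softmax classifier on the latent code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShapeVAEClassifier(nn.Module):
    def __init__(self, n_vertices=1002, latent=32, n_classes=2):  # mesh size is arbitrary
        super().__init__()
        d = n_vertices * 3
        self.enc = nn.Linear(d, 256)
        self.mu = nn.Linear(256, latent)
        self.logvar = nn.Linear(256, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(), nn.Linear(256, d))
        self.cls = nn.Linear(latent, n_classes)

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        return self.dec(z), self.cls(mu), mu, logvar

model = ShapeVAEClassifier()
x = torch.randn(8, 1002 * 3)                    # 8 meshes, flattened vertex coordinates
recon, logits, mu, logvar = model(x)
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
loss = F.mse_loss(recon, x) + kl + F.cross_entropy(logits, torch.randint(0, 2, (8,)))
print(loss.item())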
Miscellaneous
Some other unpublished work: