Summary for 2021-05-31, created on 2021-12-21

Efficient and Modular Implicit Differentiation arxiv:2105.15183 📈 206

Mathieu Blondel, Quentin Berthet, Marco Cuturi, Roy Frostig, Stephan Hoyer, Felipe Llinares-López, Fabian Pedregosa, Jean-Philippe Vert

**Abstract:** Automatic differentiation (autodiff) has revolutionized machine learning. It allows expressing complex computations by composing elementary ones in creative ways and removes the burden of computing their derivatives by hand. More recently, differentiation of optimization problem solutions has attracted widespread attention with applications such as optimization layers, and in bi-level problems such as hyper-parameter optimization and meta-learning. However, so far, implicit differentiation remained difficult to use for practitioners, as it often required case-by-case tedious mathematical derivations and implementations. In this paper, we propose a unified, efficient and modular approach for implicit differentiation of optimization problems. In our approach, the user defines directly in Python a function $F$ capturing the optimality conditions of the problem to be differentiated. Once this is done, we leverage autodiff of $F$ and implicit differentiation to automatically differentiate the optimization problem. Our approach thus combines the benefits of implicit differentiation and autodiff. It is efficient as it can be added on top of any state-of-the-art solver and modular as the optimality condition specification is decoupled from the implicit differentiation mechanism. We show that seemingly simple principles allow to recover many exiting implicit differentiation methods and create new ones easily. We demonstrate the ease of formulating and solving bi-level optimization problems using our framework. We also showcase an application to the sensitivity analysis of molecular dynamics.

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers arxiv:2105.15203 📈 49

Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo

**Abstract:** We present SegFormer, a simple, efficient yet powerful semantic segmentation framework which unifies Transformers with lightweight multilayer perception (MLP) decoders. SegFormer has two appealing features: 1) SegFormer comprises a novel hierarchically structured Transformer encoder which outputs multiscale features. It does not need positional encoding, thereby avoiding the interpolation of positional codes which leads to decreased performance when the testing resolution differs from training. 2) SegFormer avoids complex decoders. The proposed MLP decoder aggregates information from different layers, and thus combining both local attention and global attention to render powerful representations. We show that this simple and lightweight design is the key to efficient segmentation on Transformers. We scale our approach up to obtain a series of models from SegFormer-B0 to SegFormer-B5, reaching significantly better performance and efficiency than previous counterparts. For example, SegFormer-B4 achieves 50.3% mIoU on ADE20K with 64M parameters, being 5x smaller and 2.2% better than the previous best method. Our best model, SegFormer-B5, achieves 84.0% mIoU on Cityscapes validation set and shows excellent zero-shot robustness on Cityscapes-C. Code will be released at: github.com/NVlabs/SegFormer.

Consistency Regularization for Variational Auto-Encoders arxiv:2105.14859 📈 45

Samarth Sinha, Adji B. Dieng

**Abstract:** Variational auto-encoders (VAEs) are a powerful approach to unsupervised learning. They enable scalable approximate posterior inference in latent-variable models using variational inference (VI). A VAE posits a variational family parameterized by a deep neural network called an encoder that takes data as input. This encoder is shared across all the observations, which amortizes the cost of inference. However the encoder of a VAE has the undesirable property that it maps a given observation and a semantics-preserving transformation of it to different latent representations. This "inconsistency" of the encoder lowers the quality of the learned representations, especially for downstream tasks, and also negatively affects generalization. In this paper, we propose a regularization method to enforce consistency in VAEs. The idea is to minimize the Kullback-Leibler (KL) divergence between the variational distribution when conditioning on the observation and the variational distribution when conditioning on a random semantic-preserving transformation of this observation. This regularization is applicable to any VAE. In our experiments we apply it to four different VAE variants on several benchmark datasets and found it always improves the quality of the learned representations but also leads to better generalization. In particular, when applied to the Nouveau Variational Auto-Encoder (NVAE), our regularization method yields state-of-the-art performance on MNIST and CIFAR-10. We also applied our method to 3D data and found it learns representations of superior quality as measured by accuracy on a downstream classification task.

Towards Real-time and Light-weight Line Segment Detection arxiv:2106.00186 📈 40

Geonmo Gu, Byungsoo Ko, SeoungHyun Go, Sung-Hyun Lee, Jingeun Lee, Minchul Shin

**Abstract:** Previous deep learning-based line segment detection (LSD) suffer from the immense model size and high computational cost for line prediction. This constrains them from real-time inference on computationally restricted environments. In this paper, we propose a real-time and light-weight line segment detector for resource-constrained environments named Mobile LSD (M-LSD). We design an extremely efficient LSD architecture by minimizing the backbone network and removing the typical multi-module process for line prediction in previous methods. To maintain competitive performance with such a light-weight network, we present novel training schemes: Segments of Line segment (SoL) augmentation and geometric learning scheme. SoL augmentation splits a line segment into multiple subparts, which are used to provide auxiliary line data during the training process. Moreover, the geometric learning scheme allows a model to capture additional geometry cues from matching loss, junction and line segmentation, length and degree regression. Compared with TP-LSD-Lite, previously the best real-time LSD method, our model (M-LSD-tiny) achieves competitive performance with 2.5% of model size and an increase of 130.5% in inference speed on GPU when evaluated with Wireframe and YorkUrban datasets. Furthermore, our model runs at 56.8 FPS and 48.6 FPS on Android and iPhone mobile devices, respectively. To the best of our knowledge, this is the first real-time deep LSD method available on mobile devices.

PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World arxiv:2106.00188 📈 29

Rowan Zellers, Ari Holtzman, Matthew Peters, Roozbeh Mottaghi, Aniruddha Kembhavi, Ali Farhadi, Yejin Choi

**Abstract:** We propose PIGLeT: a model that learns physical commonsense knowledge through interaction, and then uses this knowledge to ground language. We factorize PIGLeT into a physical dynamics model, and a separate language model. Our dynamics model learns not just what objects are but also what they do: glass cups break when thrown, plastic ones don't. We then use it as the interface to our language model, giving us a unified model of linguistic form and grounded meaning. PIGLeT can read a sentence, simulate neurally what might happen next, and then communicate that result through a literal symbolic representation, or natural language. Experimental results show that our model effectively learns world dynamics, along with how to communicate them. It is able to correctly forecast "what happens next" given an English sentence over 80% of the time, outperforming a 100x larger, text-to-text approach by over 10%. Likewise, its natural language summaries of physical interactions are also judged by humans as more accurate than LM alternatives. We present comprehensive analysis showing room for future work.

Anti-Koopmanism arxiv:2106.00106 📈 22

Efrain Gonzalez, Moad Abudia, Michael Jury, Rushikesh Kamalapurkar, Joel A. Rosenfeld

**Abstract:** This article addresses several longstanding misconceptions concerning Koopman operators, including the existence of lattices of eigenfunctions, common eigenfunctions between Koopman operators, and boundedness and compactness of Koopman operators, among others. Counterexamples are provided for each misconception. This manuscript also proves that the Gaussian RBF's native space only supports bounded Koopman operator corresponding to affine dynamics, which shows that the assumption of boundedness is very limiting. A framework for DMD is presented that requires only densely defined Koopman operators over reproducing kernel Hilbert spaces, and the effectiveness of this approach is demonstrated through reconstruction examples.

Rawlsian Fair Adaptation of Deep Learning Classifiers arxiv:2105.14890 📈 16

Kulin Shah, Pooja Gupta, Amit Deshpande, Chiranjib Bhattacharyya

**Abstract:** Group-fairness in classification aims for equality of a predictive utility across different sensitive sub-populations, e.g., race or gender. Equality or near-equality constraints in group-fairness often worsen not only the aggregate utility but also the utility for the least advantaged sub-population. In this paper, we apply the principles of Pareto-efficiency and least-difference to the utility being accuracy, as an illustrative example, and arrive at the Rawls classifier that minimizes the error rate on the worst-off sensitive sub-population. Our mathematical characterization shows that the Rawls classifier uniformly applies a threshold to an ideal score of features, in the spirit of fair equality of opportunity. In practice, such a score or a feature representation is often computed by a black-box model that has been useful but unfair. Our second contribution is practical Rawlsian fair adaptation of any given black-box deep learning model, without changing the score or feature representation it computes. Given any score function or feature representation and only its second-order statistics on the sensitive sub-populations, we seek a threshold classifier on the given score or a linear threshold classifier on the given feature representation that achieves the Rawls error rate restricted to this hypothesis class. Our technical contribution is to formulate the above problems using ambiguous chance constraints, and to provide efficient algorithms for Rawlsian fair adaptation, along with provable upper bounds on the Rawls error rate. Our empirical results show significant improvement over state-of-the-art group-fair algorithms, even without retraining for fairness.

Cascaded Head-colliding Attention arxiv:2105.14850 📈 16

Lin Zheng, Zhiyong Wu, Lingpeng Kong

**Abstract:** Transformers have advanced the field of natural language processing (NLP) on a variety of important tasks. At the cornerstone of the Transformer architecture is the multi-head attention (MHA) mechanism which models pairwise interactions between the elements of the sequence. Despite its massive success, the current framework ignores interactions among different heads, leading to the problem that many of the heads are redundant in practice, which greatly wastes the capacity of the model. To improve parameter efficiency, we re-formulate the MHA as a latent variable model from a probabilistic perspective. We present cascaded head-colliding attention (CODA) which explicitly models the interactions between attention heads through a hierarchical variational distribution. We conduct extensive experiments and demonstrate that CODA outperforms the transformer baseline, by $0.6$ perplexity on \texttt{Wikitext-103} in language modeling, and by $0.6$ BLEU on \texttt{WMT14 EN-DE} in machine translation, due to its improvements on the parameter efficiency.\footnote{Our implementation is publicly available at \url{https://github.com/LZhengisme/CODA}.}

Why does CTC result in peaky behavior? arxiv:2105.14849 📈 16

Albert Zeyer, Ralf Schlüter, Hermann Ney

**Abstract:** The peaky behavior of CTC models is well known experimentally. However, an understanding about why peaky behavior occurs is missing, and whether this is a good property. We provide a formal analysis of the peaky behavior and gradient descent convergence properties of the CTC loss and related training criteria. Our analysis provides a deep understanding why peaky behavior occurs and when it is suboptimal. On a simple example which should be trivial to learn for any model, we prove that a feed-forward neural network trained with CTC from uniform initialization converges towards peaky behavior with a 100% error rate. Our analysis further explains why CTC only works well together with the blank label. We further demonstrate that peaky behavior does not occur on other related losses including a label prior model, and that this improves convergence.

Optimal Spectral Recovery of a Planted Vector in a Subspace arxiv:2105.15081 📈 14

Cheng Mao, Alexander S. Wein

**Abstract:** Recovering a planted vector $v$ in an $n$-dimensional random subspace of $\mathbb{R}^N$ is a generic task related to many problems in machine learning and statistics, such as dictionary learning, subspace recovery, and principal component analysis. In this work, we study computationally efficient estimation and detection of a planted vector $v$ whose $\ell_4$ norm differs from that of a Gaussian vector with the same $\ell_2$ norm. For instance, in the special case of an $N ρ$-sparse vector $v$ with Rademacher nonzero entries, our results include the following: (1) We give an improved analysis of (a slight variant of) the spectral method proposed by Hopkins, Schramm, Shi, and Steurer, showing that it approximately recovers $v$ with high probability in the regime $n ρ\ll \sqrt{N}$. In contrast, previous work required either $ρ\ll 1/\sqrt{n}$ or $n \sqrtρ \lesssim \sqrt{N}$ for polynomial-time recovery. Our result subsumes both of these conditions (up to logarithmic factors) and also treats the dense case $ρ= 1$ which was not previously considered. (2) Akin to $\ell_\infty$ bounds for eigenvector perturbation, we establish an entrywise error bound for the spectral estimator via a leave-one-out analysis, from which it follows that thresholding recovers $v$ exactly. (3) We study the associated detection problem and show that in the regime $n ρ\gg \sqrt{N}$, any spectral method from a large class (and more generally, any low-degree polynomial of the input) fails to detect the planted vector. This establishes optimality of our upper bounds and offers evidence that no polynomial-time algorithm can succeed when $n ρ\gg \sqrt{N}$.

Document-level Event Extraction via Heterogeneous Graph-based Interaction Model with a Tracker arxiv:2105.14924 📈 14

Runxin Xu, Tianyu Liu, Lei Li, Baobao Chang

**Abstract:** Document-level event extraction aims to recognize event information from a whole piece of article. Existing methods are not effective due to two challenges of this task: a) the target event arguments are scattered across sentences; b) the correlation among events in a document is non-trivial to model. In this paper, we propose Heterogeneous Graph-based Interaction Model with a Tracker (GIT) to solve the aforementioned two challenges. For the first challenge, GIT constructs a heterogeneous graph interaction network to capture global interactions among different sentences and entity mentions. For the second, GIT introduces a Tracker module to track the extracted events and hence capture the interdependency among the events. Experiments on a large-scale dataset (Zheng et al., 2019) show GIT outperforms the previous methods by 2.8 F1. Further analysis reveals GIT is effective in extracting multiple correlated events and event arguments that scatter across the document. Our code is available at https://github.com/RunxinXu/GIT.

On Compositional Generalization of Neural Machine Translation arxiv:2105.14802 📈 12

Yafu Li, Yongjing Yin, Yulong Chen, Yue Zhang

**Abstract:** Modern neural machine translation (NMT) models have achieved competitive performance in standard benchmarks such as WMT. However, there still exist significant issues such as robustness, domain generalization, etc. In this paper, we study NMT models from the perspective of compositional generalization by building a benchmark dataset, CoGnition, consisting of 216k clean and consistent sentence pairs. We quantitatively analyze effects of various factors using compound translation error rate, then demonstrate that the NMT model fails badly on compositional generalization, although it performs remarkably well under traditional metrics.

An Exploratory Analysis of Multilingual Word-Level Quality Estimation with Cross-Lingual Transformers arxiv:2106.00143 📈 10

Tharindu Ranasinghe, Constantin Orasan, Ruslan Mitkov

**Abstract:** Most studies on word-level Quality Estimation (QE) of machine translation focus on language-specific models. The obvious disadvantages of these approaches are the need for labelled data for each language pair and the high cost required to maintain several language-specific models. To overcome these problems, we explore different approaches to multilingual, word-level QE. We show that these QE models perform on par with the current language-specific models. In the cases of zero-shot and few-shot QE, we demonstrate that it is possible to accurately predict word-level quality for any given new language pair from models trained on other language pairs. Our findings suggest that the word-level QE models based on powerful pre-trained transformers that we propose in this paper generalise well across languages, making them more useful in real-world scenarios.

GANs Can Play Lottery Tickets Too arxiv:2106.00134 📈 10

Xuxi Chen, Zhenyu Zhang, Yongduo Sui, Tianlong Chen

**Abstract:** Deep generative adversarial networks (GANs) have gained growing popularity in numerous scenarios, while usually suffer from high parameter complexities for resource-constrained real-world applications. However, the compression of GANs has less been explored. A few works show that heuristically applying compression techniques normally leads to unsatisfactory results, due to the notorious training instability of GANs. In parallel, the lottery ticket hypothesis shows prevailing success on discriminative models, in locating sparse matching subnetworks capable of training in isolation to full model performance. In this work, we for the first time study the existence of such trainable matching subnetworks in deep GANs. For a range of GANs, we certainly find matching subnetworks at 67%-74% sparsity. We observe that with or without pruning discriminator has a minor effect on the existence and quality of matching subnetworks, while the initialization weights used in the discriminator play a significant role. We then show the powerful transferability of these subnetworks to unseen tasks. Furthermore, extensive experimental results demonstrate that our found subnetworks substantially outperform previous state-of-the-art GAN compression approaches in both image generation (e.g. SNGAN) and image-to-image translation GANs (e.g. CycleGAN). Codes available at https://github.com/VITA-Group/GAN-LTH.

Crowdsourcing Learning as Domain Adaptation: A Case Study on Named Entity Recognition arxiv:2105.14980 📈 10

Xin Zhang, Guangwei Xu, Yueheng Sun, Meishan Zhang, Pengjun Xie

**Abstract:** Crowdsourcing is regarded as one prospective solution for effective supervised learning, aiming to build large-scale annotated training data by crowd workers. Previous studies focus on reducing the influences from the noises of the crowdsourced annotations for supervised models. We take a different point in this work, regarding all crowdsourced annotations as gold-standard with respect to the individual annotators. In this way, we find that crowdsourcing could be highly similar to domain adaptation, and then the recent advances of cross-domain methods can be almost directly applied to crowdsourcing. Here we take named entity recognition (NER) as a study case, suggesting an annotator-aware representation learning model that inspired by the domain adaptation methods which attempt to capture effective domain-aware features. We investigate both unsupervised and supervised crowdsourcing learning, assuming that no or only small-scale expert annotations are available. Experimental results on a benchmark crowdsourced NER dataset show that our method is highly effective, leading to a new state-of-the-art performance. In addition, under the supervised setting, we can achieve impressive performance gains with only a very small scale of expert annotations.

Supporting Cognitive and Emotional Empathic Writing of Students arxiv:2105.14815 📈 10

Thiemo Wambsganss, Christina Niklaus, Matthias Söllner, Siegfried Handschuh, Jan Marco Leimeister

**Abstract:** We present an annotation approach to capturing emotional and cognitive empathy in student-written peer reviews on business models in German. We propose an annotation scheme that allows us to model emotional and cognitive empathy scores based on three types of review components. Also, we conducted an annotation study with three annotators based on 92 student essays to evaluate our annotation scheme. The obtained inter-rater agreement of α=0.79 for the components and the multi-π=0.41 for the empathy scores indicate that the proposed annotation scheme successfully guides annotators to a substantial to moderate agreement. Moreover, we trained predictive models to detect the annotated empathy structures and embedded them in an adaptive writing support system for students to receive individual empathy feedback independent of an instructor, time, and location. We evaluated our tool in a peer learning exercise with 58 students and found promising results for perceived empathy skill learning, perceived feedback accuracy, and intention to use. Finally, we present our freely available corpus of 500 empathy-annotated, student-written peer reviews on business models and our annotation guidelines to encourage future research on the design and development of empathy support systems.

SDNet: mutil-branch for single image deraining using swin arxiv:2105.15077 📈 9

Fuxiang Tan, YuTing Kong, Yingying Fan, Feng Liu, Daxin Zhou, Hao zhang, Long Chen, Liang Gao, Yurong Qian

**Abstract:** Rain streaks degrade the image quality and seriously affect the performance of subsequent computer vision tasks, such as autonomous driving, social security, etc. Therefore, removing rain streaks from a given rainy images is of great significance. Convolutional neural networks(CNN) have been widely used in image deraining tasks, however, the local computational characteristics of convolutional operations limit the development of image deraining tasks. Recently, the popular transformer has global computational features that can further facilitate the development of image deraining tasks. In this paper, we introduce Swin-transformer into the field of image deraining for the first time to study the performance and potential of Swin-transformer in the field of image deraining. Specifically, we improve the basic module of Swin-transformer and design a three-branch model to implement single-image rain removal. The former implements the basic rain pattern feature extraction, while the latter fuses different features to further extract and process the image features. In addition, we employ a jump connection to fuse deep features and shallow features. In terms of experiments, the existing public dataset suffers from image duplication and relatively homogeneous background. So we propose a new dataset Rain3000 to validate our model. Therefore, we propose a new dataset Rain3000 for validating our model. Experimental results on the publicly available datasets Rain100L, Rain100H and our dataset Rain3000 show that our proposed method has performance and inference speed advantages over the current mainstream single-image rain streaks removal models.The source code will be available at https://github.com/H-tfx/SDNet.

The use of Generative Adversarial Networks to characterise new physics in multi-lepton final states at the LHC arxiv:2105.14933 📈 9

Thabang Lebese, Bruce Mellado, Xifeng Ruan

**Abstract:** Semi-supervision in Machine Learning can be used in searches for new physics where the signal plus background regions are not labelled. This strongly reduces model dependency in the search for signals Beyond the Standard Model. This approach displays the drawback in that over-fitting can give rise to fake signals. Tossing toy Monte Carlo (MC) events can be used to estimate the corresponding trials factor through a frequentist inference. However, MC events that are based on full detector simulations are resource intensive. Generative Adversarial Networks (GANs) can be used to mimic MC generators. GANs are powerful generative models, but often suffer from training instability. We henceforth show a review of GANs. We advocate the use of Wasserstein GAN (WGAN) with weight clipping and WGAN with gradient penalty (WGAN-GP) where the norm of gradient of the critic is penalized with respect to its input. Following the emergence of multi-lepton anomalies, we apply GANs for the generation of di-leptons final states in association with $b$-quarks at the LHC. A good agreement between the MC and the WGAN-GP generated events is found for the observables selected in the study.

Bounded logit attention: Learning to explain image classifiers arxiv:2105.14824 📈 9

Thomas Baumhauer, Djordje Slijepcevic, Matthias Zeppelzauer

**Abstract:** Explainable artificial intelligence is the attempt to elucidate the workings of systems too complex to be directly accessible to human cognition through suitable side-information referred to as "explanations". We present a trainable explanation module for convolutional image classifiers we call bounded logit attention (BLA). The BLA module learns to select a subset of the convolutional feature map for each input instance, which then serves as an explanation for the classifier's prediction. BLA overcomes several limitations of the instancewise feature selection method "learning to explain" (L2X) introduced by Chen et al. (2018): 1) BLA scales to real-world sized image classification problems, and 2) BLA offers a canonical way to learn explanations of variable size. Due to its modularity BLA lends itself to transfer learning setups and can also be employed as a post-hoc add-on to trained classifiers. Beyond explainability, BLA may serve as a general purpose method for differentiable approximation of subset selection. In a user study we find that BLA explanations are preferred over explanations generated by the popular (Grad-)CAM method.

Counterfactual Invariance to Spurious Correlations: Why and How to Pass Stress Tests arxiv:2106.00545 📈 8

Victor Veitch, Alexander D'Amour, Steve Yadlowsky, Jacob Eisenstein

**Abstract:** Informally, a 'spurious correlation' is the dependence of a model on some aspect of the input data that an analyst thinks shouldn't matter. In machine learning, these have a know-it-when-you-see-it character; e.g., changing the gender of a sentence's subject changes a sentiment predictor's output. To check for spurious correlations, we can 'stress test' models by perturbing irrelevant parts of input data and seeing if model predictions change. In this paper, we study stress testing using the tools of causal inference. We introduce counterfactual invariance as a formalization of the requirement that changing irrelevant parts of the input shouldn't change model predictions. We connect counterfactual invariance to out-of-domain model performance, and provide practical schemes for learning (approximately) counterfactual invariant predictors (without access to counterfactual examples). It turns out that both the means and implications of counterfactual invariance depend fundamentally on the true underlying causal structure of the data -- in particular, whether the label causes the features or the features cause the label. Distinct causal structures require distinct regularization schemes to induce counterfactual invariance. Similarly, counterfactual invariance implies different domain shift guarantees depending on the underlying causal structure. This theory is supported by empirical results on text classification.

Telling Stories through Multi-User Dialogue by Modeling Character Relations arxiv:2105.15054 📈 8

Wai Man Si, Prithviraj Ammanabrolu, Mark O. Riedl

**Abstract:** This paper explores character-driven story continuation, in which the story emerges through characters' first- and second-person narration as well as dialogue -- requiring models to select language that is consistent with a character's persona and their relationships with other characters while following and advancing the story. We hypothesize that a multi-task model that trains on character dialogue plus character relationship information improves transformer-based story continuation. To this end, we extend the Critical Role Dungeons and Dragons Dataset (Rameshkumar and Bailey, 2020) -- consisting of dialogue transcripts of people collaboratively telling a story while playing the role-playing game Dungeons and Dragons -- with automatically extracted relationships between each pair of interacting characters as well as their personas. A series of ablations lend evidence to our hypothesis, showing that our multi-task model using character relationships improves story continuation accuracy over strong baselines.

Privileged Graph Distillation for Cold Start Recommendation arxiv:2105.14975 📈 8

Shuai Wang, Kun Zhang, Le Wu, Haiping Ma, Richang Hong, Meng Wang

**Abstract:** The cold start problem in recommender systems is a long-standing challenge, which requires recommending to new users (items) based on attributes without any historical interaction records. In these recommendation systems, warm users (items) have privileged collaborative signals of interaction records compared to cold start users (items), and these Collaborative Filtering (CF) signals are shown to have competing performance for recommendation. Many researchers proposed to learn the correlation between collaborative signal embedding space and the attribute embedding space to improve the cold start recommendation, in which user and item categorical attributes are available in many online platforms. However, the cold start recommendation is still limited by two embedding spaces modeling and simple assumptions of space transformation. As user-item interaction behaviors and user (item) attributes naturally form a heterogeneous graph structure, in this paper, we propose a privileged graph distillation model~(PGD). The teacher model is composed of a heterogeneous graph structure for warm users and items with privileged CF links. The student model is composed of an entity-attribute graph without CF links. Specifically, the teacher model can learn better embeddings of each entity by injecting complex higher-order relationships from the constructed heterogeneous graph. The student model can learn the distilled output with privileged CF embeddings from the teacher embeddings. Our proposed model is generally applicable to different cold start scenarios with new user, new item, or new user-new item. Finally, extensive experimental results on the real-world datasets clearly show the effectiveness of our proposed model on different types of cold start problems, with average $6.6\%, 5.6\%, $ and $17.1\%$ improvement over state-of-the-art baselines on three datasets, respectively.

Do Multilingual Neural Machine Translation Models Contain Language Pair Specific Attention Heads? arxiv:2105.14940 📈 8

Zae Myung Kim, Laurent Besacier, Vassilina Nikoulina, Didier Schwab

**Abstract:** Recent studies on the analysis of the multilingual representations focus on identifying whether there is an emergence of language-independent representations, or whether a multilingual model partitions its weights among different languages. While most of such work has been conducted in a "black-box" manner, this paper aims to analyze individual components of a multilingual neural translation (NMT) model. In particular, we look at the encoder self-attention and encoder-decoder attention heads (in a many-to-one NMT model) that are more specific to the translation of a certain language pair than others by (1) employing metrics that quantify some aspects of the attention weights such as "variance" or "confidence", and (2) systematically ranking the importance of attention heads with respect to translation quality. Experimental results show that surprisingly, the set of most important attention heads are very similar across the language pairs and that it is possible to remove nearly one-third of the less important heads without hurting the translation quality greatly.

A unified view of likelihood ratio and reparameterization gradients arxiv:2105.14900 📈 8

Paavo Parmas, Masashi Sugiyama

**Abstract:** Reparameterization (RP) and likelihood ratio (LR) gradient estimators are used to estimate gradients of expectations throughout machine learning and reinforcement learning; however, they are usually explained as simple mathematical tricks, with no insight into their nature. We use a first principles approach to explain that LR and RP are alternative methods of keeping track of the movement of probability mass, and the two are connected via the divergence theorem. Moreover, we show that the space of all possible estimators combining LR and RP can be completely parameterized by a flow field $u(x)$ and an importance sampling distribution $q(x)$. We prove that there cannot exist a single-sample estimator of this type outside our characterized space, thus, clarifying where we should be searching for better Monte Carlo gradient estimators.

BaMBNet: A Blur-aware Multi-branch Network for Defocus Deblurring arxiv:2105.14766 📈 8

Pengwei Liang, Junjun Jiang, Xianming Liu, Jiayi Ma

**Abstract:** The defocus deblurring raised from the finite aperture size and exposure time is an essential problem in the computational photography. It is very challenging because the blur kernel is spatially varying and difficult to estimate by traditional methods. Due to its great breakthrough in low-level tasks, convolutional neural networks (CNNs) have been introduced to the defocus deblurring problem and achieved significant progress. However, they apply the same kernel for different regions of the defocus blurred images, thus it is difficult to handle these nonuniform blurred images. To this end, this study designs a novel blur-aware multi-branch network (BaMBNet), in which different regions (with different blur amounts) should be treated differentially. In particular, we estimate the blur amounts of different regions by the internal geometric constraint of the DP data, which measures the defocus disparity between the left and right views. Based on the assumption that different image regions with different blur amounts have different deblurring difficulties, we leverage different networks with different capacities (\emph{i.e.} parameters) to process different image regions. Moreover, we introduce a meta-learning defocus mask generation algorithm to assign each pixel to a proper branch. In this way, we can expect to well maintain the information of the clear regions while recovering the missing details of the blurred regions. Both quantitative and qualitative experiments demonstrate that our BaMBNet outperforms the state-of-the-art methods. Source code will be available at https://github.com/junjun-jiang/BaMBNet.

G-Transformer for Document-level Machine Translation arxiv:2105.14761 📈 8

Guangsheng Bao, Yue Zhang, Zhiyang Teng, Boxing Chen, Weihua Luo

**Abstract:** Document-level MT models are still far from satisfactory. Existing work extend translation unit from single sentence to multiple sentences. However, study shows that when we further enlarge the translation unit to a whole document, supervised training of Transformer can fail. In this paper, we find such failure is not caused by overfitting, but by sticking around local minima during training. Our analysis shows that the increased complexity of target-to-source attention is a reason for the failure. As a solution, we propose G-Transformer, introducing locality assumption as an inductive bias into Transformer, reducing the hypothesis space of the attention from target to source. Experiments show that G-Transformer converges faster and more stably than Transformer, achieving new state-of-the-art BLEU scores for both non-pretraining and pre-training settings on three benchmark datasets.

Variational Combinatorial Sequential Monte Carlo Methods for Bayesian Phylogenetic Inference arxiv:2106.00075 📈 7

Antonio Khalil Moretti, Liyi Zhang, Christian A. Naesseth, Hadiah Venner, David Blei, Itsik Pe'er

**Abstract:** Bayesian phylogenetic inference is often conducted via local or sequential search over topologies and branch lengths using algorithms such as random-walk Markov chain Monte Carlo (MCMC) or Combinatorial Sequential Monte Carlo (CSMC). However, when MCMC is used for evolutionary parameter learning, convergence requires long runs with inefficient exploration of the state space. We introduce Variational Combinatorial Sequential Monte Carlo (VCSMC), a powerful framework that establishes variational sequential search to learn distributions over intricate combinatorial structures. We then develop nested CSMC, an efficient proposal distribution for CSMC and prove that nested CSMC is an exact approximation to the (intractable) locally optimal proposal. We use nested CSMC to define a second objective, VNCSMC which yields tighter lower bounds than VCSMC. We show that VCSMC and VNCSMC are computationally efficient and explore higher probability spaces than existing methods on a range of tasks.

MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens arxiv:2105.15168 📈 7

Jiemin Fang, Lingxi Xie, Xinggang Wang, Xiaopeng Zhang, Wenyu Liu, Qi Tian

**Abstract:** Transformers have offered a new methodology of designing neural networks for visual recognition. Compared to convolutional networks, Transformers enjoy the ability of referring to global features at each stage, yet the attention module brings higher computational overhead that obstructs the application of Transformers to process high-resolution visual data. This paper aims to alleviate the conflict between efficiency and flexibility, for which we propose a specialized token for each region that serves as a messenger (MSG). Hence, by manipulating these MSG tokens, one can flexibly exchange visual information across regions and the computational complexity is reduced. We then integrate the MSG token into a multi-scale architecture named MSG-Transformer. In standard image classification and object detection, MSG-Transformer achieves competitive performance and the inference on both GPU and CPU is accelerated. Code is available at https://github.com/hustvl/MSG-Transformer.

Variational Autoencoders: A Harmonic Perspective arxiv:2105.14866 📈 7

Alexander Camuto, Matthew Willetts

**Abstract:** In this work we study Variational Autoencoders (VAEs) from the perspective of harmonic analysis. By viewing a VAE's latent space as a Gaussian Space, a variety of measure space, we derive a series of results that show that the encoder variance of a VAE controls the frequency content of the functions parameterised by the VAE encoder and decoder neural networks. In particular we demonstrate that larger encoder variances reduce the high frequency content of these functions. Our analysis allows us to show that increasing this variance effectively induces a soft Lipschitz constraint on the decoder network of a VAE, which is a core contributor to the adversarial robustness of VAEs. We further demonstrate that adding Gaussian noise to the input of a VAE allows us to more finely control the frequency content and the Lipschitz constant of the VAE encoder networks. To support our theoretical analysis we run experiments with VAEs with small fully-connected neural networks and with larger convolutional networks, demonstrating empirically that our theory holds for a variety of neural network architectures.

Q-attention: Enabling Efficient Learning for Vision-based Robotic Manipulation arxiv:2105.14829 📈 7

Stephen James, Andrew J. Davison

**Abstract:** Despite the success of reinforcement learning methods, they have yet to have their breakthrough moment when applied to a broad range of robotic manipulation tasks. This is partly due to the fact that reinforcement learning algorithms are notoriously difficult and time consuming to train, which is exacerbated when training from images rather than full-state inputs. As humans perform manipulation tasks, our eyes closely monitor every step of the process with our gaze focusing sequentially on the objects being manipulated. With this in mind, we present our Attention-driven Robotic Manipulation (ARM) algorithm, which is a general manipulation algorithm that can be applied to a range of sparse-rewarded tasks, given only a small number of demonstrations. ARM splits the complex task of manipulation into a 3 stage pipeline: (1) a Q-attention agent extracts interesting pixel locations from RGB and point cloud inputs, (2) a next-best pose agent that accepts crops from the Q-attention agent and outputs poses, and (3) a control agent that takes the goal pose and outputs joint actions. We show that current learning algorithms fail on a range of RLBench tasks, whilst ARM is successful.

Transfer Learning for Sequence Generation: from Single-source to Multi-source arxiv:2105.14809 📈 7

Xuancheng Huang, Jingfang Xu, Maosong Sun, Yang Liu

**Abstract:** Multi-source sequence generation (MSG) is an important kind of sequence generation tasks that takes multiple sources, including automatic post-editing, multi-source translation, multi-document summarization, etc. As MSG tasks suffer from the data scarcity problem and recent pretrained models have been proven to be effective for low-resource downstream tasks, transferring pretrained sequence-to-sequence models to MSG tasks is essential. Although directly finetuning pretrained models on MSG tasks and concatenating multiple sources into a single long sequence is regarded as a simple method to transfer pretrained models to MSG tasks, we conjecture that the direct finetuning method leads to catastrophic forgetting and solely relying on pretrained self-attention layers to capture cross-source information is not sufficient. Therefore, we propose a two-stage finetuning method to alleviate the pretrain-finetune discrepancy and introduce a novel MSG model with a fine encoder to learn better representations in MSG tasks. Experiments show that our approach achieves new state-of-the-art results on the WMT17 APE task and multi-source translation task using the WMT14 test set. When adapted to document-level translation, our framework outperforms strong baselines significantly.

Automatic audiovisual synchronisation for ultrasound tongue imaging arxiv:2105.15162 📈 6

Aciel Eshky, Joanne Cleland, Manuel Sam Ribeiro, Eleanor Sugden, Korin Richmond, Steve Renals

**Abstract:** Ultrasound tongue imaging is used to visualise the intra-oral articulators during speech production. It is utilised in a range of applications, including speech and language therapy and phonetics research. Ultrasound and speech audio are recorded simultaneously, and in order to correctly use this data, the two modalities should be correctly synchronised. Synchronisation is achieved using specialised hardware at recording time, but this approach can fail in practice resulting in data of limited usability. In this paper, we address the problem of automatically synchronising ultrasound and audio after data collection. We first investigate the tolerance of expert ultrasound users to synchronisation errors in order to find the thresholds for error detection. We use these thresholds to define accuracy scoring boundaries for evaluating our system. We then describe our approach for automatic synchronisation, which is driven by a self-supervised neural network, exploiting the correlation between the two signals to synchronise them. We train our model on data from multiple domains with different speaker characteristics, different equipment, and different recording environments, and achieve an accuracy >92.4% on held-out in-domain data. Finally, we introduce a novel resource, the Cleft dataset, which we gathered with a new clinical subgroup and for which hardware synchronisation proved unreliable. We apply our model to this out-of-domain data, and evaluate its performance subjectively with expert users. Results show that users prefer our model's output over the original hardware output 79.3% of the time. Our results demonstrate the strength of our approach and its ability to generalise to data from new domains.

Toward Understanding the Feature Learning Process of Self-supervised Contrastive Learning arxiv:2105.15134 📈 6

Zixin Wen, Yuanzhi Li

**Abstract:** How can neural networks trained by contrastive learning extract features from the unlabeled data? Why does contrastive learning usually need much stronger data augmentations than supervised learning to ensure good representations? These questions involve both the optimization and statistical aspects of deep learning, but can hardly be answered by analyzing supervised learning, where the target functions are the highest pursuit. Indeed, in self-supervised learning, it is inevitable to relate to the optimization/generalization of neural networks to how they can encode the latent structures in the data, which we refer to as the feature learning process. In this work, we formally study how contrastive learning learns the feature representations for neural networks by analyzing its feature learning process. We consider the case where our data are comprised of two types of features: the more semantically aligned sparse features which we want to learn from, and the other dense features we want to avoid. Theoretically, we prove that contrastive learning using $\mathbf{ReLU}$ networks provably learns the desired sparse features if proper augmentations are adopted. We present an underlying principle called $\textbf{feature decoupling}$ to explain the effects of augmentations, where we theoretically characterize how augmentations can reduce the correlations of dense features between positive samples while keeping the correlations of sparse features intact, thereby forcing the neural networks to learn from the self-supervision of sparse features. Empirically, we verified that the feature decoupling principle matches the underlying mechanism of contrastive learning in practice.

Dominant Patterns: Critical Features Hidden in Deep Neural Networks arxiv:2105.15057 📈 6

Zhixing Ye, Shaofei Qin, Sizhe Chen, Xiaolin Huang

**Abstract:** In this paper, we find the existence of critical features hidden in Deep NeuralNetworks (DNNs), which are imperceptible but can actually dominate the outputof DNNs. We call these features dominant patterns. As the name suggests, for a natural image, if we add the dominant pattern of a DNN to it, the output of this DNN is determined by the dominant pattern instead of the original image, i.e., DNN's prediction is the same with the dominant pattern's. We design an algorithm to find such patterns by pursuing the insensitivity in the feature space. A direct application of the dominant patterns is the Universal Adversarial Perturbations(UAPs). Numerical experiments show that the found dominant patterns defeat state-of-the-art UAP methods, especially in label-free settings. In addition, dominant patterns are proved to have the potential to attack downstream tasks in which DNNs share the same backbone. We claim that DNN-specific dominant patterns reveal some essential properties of a DNN and are of great importance for its feature analysis and robustness enhancement.

A remark on a paper of Krotov and Hopfield [arXiv:2008.06996] arxiv:2105.15034 📈 6

Fei Tang, Michael Kopp

**Abstract:** In their recent paper titled "Large Associative Memory Problem in Neurobiology and Machine Learning" [arXiv:2008.06996] the authors gave a biologically plausible microscopic theory from which one can recover many dense associative memory models discussed in the literature. We show that the layers of the recent "MLP-mixer" [arXiv:2105.01601] as well as the essentially equivalent model in [arXiv:2105.02723] are amongst them.

Retweet communities reveal the main sources of hate speech arxiv:2105.14898 📈 6

Bojan Evkoski, Andraz Pelicon, Igor Mozetic, Nikola Ljubesic, Petra Kralj Novak

**Abstract:** We address a challenging problem of identifying main sources of hate speech on Twitter. On one hand, we carefully annotate a large set of tweets for hate speech, and deploy advanced deep learning to produce high quality hate speech classification models. On the other hand, we create retweet networks, detect communities and monitor their evolution through time. This combined approach is applied to three years of Slovenian Twitter data. We report a number of interesting results. Hate speech is dominated by offensive tweets, related to political and ideological issues. The share of unacceptable tweets is moderately increasing with time, from the initial 20% to 30% by the end of 2020. Unacceptable tweets are retweeted significantly more often than acceptable tweets. About 60% of unacceptable tweets are produced by a single right-wing community of only moderate size. Institutional Twitter accounts and media accounts post significantly less unacceptable tweets than individual accounts. However, the main sources of unacceptable tweets are anonymous accounts, and accounts that were suspended or closed during the last three years.

Adaptive Multi-Source Causal Inference arxiv:2105.14877 📈 6

Thanh Vinh Vo, Pengfei Wei, Trong Nghia Hoang, Tze-Yun Leong

**Abstract:** Data scarcity is a tremendous challenge in causal effect estimation. In this paper, we propose to exploit additional data sources to facilitate estimating causal effects in the target population. Specifically, we leverage additional source datasets which share similar causal mechanisms with the target observations to help infer causal effects of the target population. We propose three levels of knowledge transfer, through modelling the outcomes, treatments, and confounders. To achieve consistent positive transfer, we introduce learnable parametric transfer factors to adaptively control the transfer strength, and thus achieving a fair and balanced knowledge transfer between the sources and the target. The proposed method can infer causal effects in the target population without prior knowledge of data discrepancy between the additional data sources and the target. Experiments on both synthetic and real-world datasets show the effectiveness of the proposed method as compared with recent baselines.

Bangla Natural Language Processing: A Comprehensive Review of Classical, Machine Learning, and Deep Learning Based Methods arxiv:2105.14875 📈 6

Ovishake Sen, Mohtasim Fuad, MD. Nazrul Islam, Jakaria Rabbi, MD. Kamrul Hasan, Mohammed Baz, Mehedi Masud, Md. Abdul Awal, Awal Ahmed Fime, Md. Tahmid Hasan Fuad, Delowar Sikder, MD. Akil Raihan Iftee

**Abstract:** The Bangla language is the seventh most spoken language, with 265 million native and non-native speakers worldwide. However, English is the predominant language for online resources and technical knowledge, journals, and documentation. Consequently, many Bangla-speaking people, who have limited command of English, face hurdles to utilize English resources. To bridge the gap between limited support and increasing demand, researchers conducted many experiments and developed valuable tools and techniques to create and process Bangla language materials. Many efforts are also ongoing to make it easy to use the Bangla language in the online and technical domains. There are some review papers to understand the past, previous, and future Bangla Natural Language Processing (BNLP) trends. The studies are mainly concentrated on the specific domains of BNLP, such as sentiment analysis, speech recognition, optical character recognition, and text summarization. There is an apparent scarcity of resources that contain a comprehensive study of the recent BNLP tools and methods. Therefore, in this paper, we present a thorough review of 71 BNLP research papers and categorize them into 11 categories, namely Information Extraction, Machine Translation, Named Entity Recognition, Parsing, Parts of Speech Tagging, Question Answering System, Sentiment Analysis, Spam and Fake Detection, Text Summarization, Word Sense Disambiguation, and Speech Processing and Recognition. We study articles published between 1999 to 2021, and 50% of the papers were published after 2015. We discuss Classical, Machine Learning and Deep Learning approaches with different datasets while addressing the limitations and current and future trends of the BNLP.

Refined Deep Neural Network and U-Net for Polyps Segmentation arxiv:2105.14848 📈 6

Quoc-Huy Trinh, Minh-Van Nguyen, Thiet-Gia Huynh, Minh-Triet Tran

**Abstract:** The Medico: Multimedia Task 2020 focuses on developing an efficient and accurate computer-aided diagnosis system for automatic segmentation [3]. We participate in task 1, Polyps segmentation task, which is to develop algorithms for segmenting polyps on a comprehensive dataset. In this task, we propose methods combining Residual module, Inception module, Adaptive Convolutional neural network with U-Net model, and PraNet for semantic segmentation of various types of polyps in endoscopic images. We select 5 runs with different architecture and parameters in our methods. Our methods show potential results in accuracy and efficiency through multiple experiments, and our team is in the Top 3 best results with a Jaccard index of 0.765.

Meta-HAR: Federated Representation Learning for Human Activity Recognition arxiv:2106.00615 📈 5

Chenglin Li, Di Niu, Bei Jiang, Xiao Zuo, Jianming Yang

**Abstract:** Human activity recognition (HAR) based on mobile sensors plays an important role in ubiquitous computing. However, the rise of data regulatory constraints precludes collecting private and labeled signal data from personal devices at scale. Federated learning has emerged as a decentralized alternative solution to model training, which iteratively aggregates locally updated models into a shared global model, therefore being able to leverage decentralized, private data without central collection. However, the effectiveness of federated learning for HAR is affected by the fact that each user has different activity types and even a different signal distribution for the same activity type. Furthermore, it is uncertain if a single global model trained can generalize well to individual users or new users with heterogeneous data. In this paper, we propose Meta-HAR, a federated representation learning framework, in which a signal embedding network is meta-learned in a federated manner, while the learned signal representations are further fed into a personalized classification network at each user for activity prediction. In order to boost the representation ability of the embedding network, we treat the HAR problem at each user as a different task and train the shared embedding network through a Model-Agnostic Meta-learning framework, such that the embedding network can generalize to any individual user. Personalization is further achieved on top of the robustly learned representations in an adaptation procedure. We conducted extensive experiments based on two publicly available HAR datasets as well as a newly created HAR dataset. Results verify that Meta-HAR is effective at maintaining high test accuracies for individual users, including new users, and significantly outperforms several baselines, including Federated Averaging, Reptile and even centralized learning in certain cases.

3D map creation using crowdsourced GNSS data arxiv:2106.00107 📈 5

Terence Lines, Ana Basiri

**Abstract:** 3D maps are increasingly useful for many applications such as drone navigation, emergency services, and urban planning. However, creating 3D maps and keeping them up-to-date using existing technologies, such as laser scanners, is expensive. This paper proposes and implements a novel approach to generate 2.5D (otherwise known as 3D level-of-detail (LOD) 1) maps for free using Global Navigation Satellite Systems (GNSS) signals, which are globally available and are blocked only by obstacles between the satellites and the receivers. This enables us to find the patterns of GNSS signal availability and create 3D maps. The paper applies algorithms to GNSS signal strength patterns based on a boot-strapped technique that iteratively trains the signal classifiers while generating the map. Results of the proposed technique demonstrate the ability to create 3D maps using automatically processed GNSS data. The results show that the third dimension, i.e. height of the buildings, can be estimated with below 5 metre accuracy, which is the benchmark recommended by the CityGML standard.

Continual 3D Convolutional Neural Networks for Real-time Processing of Videos arxiv:2106.00050 📈 5

Lukas Hedegaard, Alexandros Iosifidis

**Abstract:** This paper introduces Continual 3D Convolutional Neural Networks (Co3D CNNs), a new computational formulation of spatio-temporal 3D CNNs, in which videos are processed frame-by-frame rather than by clip. In online processing tasks demanding frame-wise predictions, Co3D CNNs dispense with the computational redundancies of regular 3D CNNs, namely the repeated convolutions over frames, which appear in multiple clips. While yielding an order of magnitude in computational savings, Co3D CNNs have memory requirements comparable with that of corresponding regular 3D CNNs and are less affected by changes in the size of the temporal receptive field. We show that Continual 3D CNNs initialised on the weights from preexisting state-of-the-art video recognition models reduce the floating point operations for frame-wise computations by 10.0-12.4x while improving accuracy on Kinetics-400 by 2.3-3.8. Moreover, we investigate the transient start-up response of Co3D CNNs and perform an extensive benchmark of online processing speed as well as accuracy for publicly available state-of-the-art 3D CNNs on modern hardware.

Diffusion Self-Organizing Map on the Hypersphere arxiv:2106.00014 📈 5

M. Andrecut

**Abstract:** We discuss a diffusion based implementation of the self-organizing map on the unit hypersphere. We show that this approach can be efficiently implemented using just linear algebra methods, we give a python numpy implementation, and we illustrate the approach using the well known MNIST dataset.

A Simple and General Debiased Machine Learning Theorem with Finite Sample Guarantees arxiv:2105.15197 📈 5

Victor Chernozhukov, Whitney K. Newey, Rahul Singh

**Abstract:** Debiased machine learning is a meta algorithm based on bias correction and sample splitting to calculate confidence intervals for functionals (i.e. scalar summaries) of machine learning algorithms. For example, an analyst may desire the confidence interval for a treatment effect estimated with a neural network. We provide a nonasymptotic debiased machine learning theorem that encompasses any global or local functional of any machine learning algorithm that satisfies a few simple, interpretable conditions. Formally, we prove consistency, Gaussian approximation, and semiparametric efficiency by finite sample arguments. The rate of convergence is root-n for global functionals, and it degrades gracefully for local functionals. Our results culminate in a simple set of conditions that an analyst can use to translate modern learning theory rates into traditional statistical inference. The conditions reveal a new double robustness property for ill posed inverse problems.

Reinforcement Learning-based Dynamic Service Placement in Vehicular Networks arxiv:2105.15022 📈 5

Anum Talpur, Mohan Gurusamy

**Abstract:** The emergence of technologies such as 5G and mobile edge computing has enabled provisioning of different types of services with different resource and service requirements to the vehicles in a vehicular network.The growing complexity of traffic mobility patterns and dynamics in the requests for different types of services has made service placement a challenging task. A typical static placement solution is not effective as it does not consider the traffic mobility and service dynamics. In this paper, we propose a reinforcement learning-based dynamic (RL-Dynamic) service placement framework to find the optimal placement of services at the edge servers while considering the vehicle's mobility and dynamics in the requests for different types of services. We use SUMO and MATLAB to carry out simulation experiments. In our learning framework, for the decision module, we consider two alternative objective functions-minimizing delay and minimizing edge server utilization. We developed an ILP based problem formulation for the two objective functions. The experimental results show that 1) compared to static service placement, RL-based dynamic service placement achieves fair utilization of edge server resources and low service delay, and 2) compared to delay-optimized placement, server utilization optimized placement utilizes resources more effectively, achieving higher fairness with lower edge-server utilization.

Generalization Error Rates in Kernel Regression: The Crossover from the Noiseless to Noisy Regime arxiv:2105.15004 📈 5

Hugo Cui, Bruno Loureiro, Florent Krzakala, Lenka Zdeborová

**Abstract:** In this manuscript we consider Kernel Ridge Regression (KRR) under the Gaussian design. Exponents for the decay of the excess generalization error of KRR have been reported in various works under the assumption of power-law decay of eigenvalues of the features co-variance. These decays were, however, provided for sizeably different setups, namely in the noiseless case with constant regularization and in the noisy optimally regularized case. Intermediary settings have been left substantially uncharted. In this work, we unify and extend this line of work, providing characterization of all regimes and excess error decay rates that can be observed in terms of the interplay of noise and regularization. In particular, we show the existence of a transition in the noisy setting between the noiseless exponents to its noisy values as the sample complexity is increased. Finally, we illustrate how this crossover can also be observed on real data sets.

Representation Learning Beyond Linear Prediction Functions arxiv:2105.14989 📈 5

Ziping Xu, Ambuj Tewari

**Abstract:** Recent papers on the theory of representation learning has shown the importance of a quantity called diversity when generalizing from a set of source tasks to a target task. Most of these papers assume that the function mapping shared representations to predictions is linear, for both source and target tasks. In practice, researchers in deep learning use different numbers of extra layers following the pretrained model based on the difficulty of the new task. This motivates us to ask whether diversity can be achieved when source tasks and the target task use different prediction function spaces beyond linear functions. We show that diversity holds even if the target task uses a neural network with multiple layers, as long as source tasks use linear functions. If source tasks use nonlinear prediction functions, we provide a negative result by showing that depth-1 neural networks with ReLu activation function need exponentially many source tasks to achieve diversity. For a general function class, we find that eluder dimension gives a lower bound on the number of tasks required for diversity. Our theoretical results imply that simpler tasks generalize better. Though our theoretical results are shown for the global minimizer of empirical risks, their qualitative predictions still hold true for gradient-based optimization algorithms as verified by our simulations on deep neural networks.

Feasibility Assessment of Multitasking in MRI Neuroimaging Analysis: Tissue Segmentation, Cross-Modality Conversion and Bias correction arxiv:2105.14986 📈 5

Mohammad Eslami, Solale Tabarestani, Malek Adjouadi

**Abstract:** Neuroimaging is essential in brain studies for the diagnosis and identification of disease, structure, and function of the brain in its healthy and disease states. Literature shows that there are advantages of multitasking with some deep learning (DL) schemes in challenging neuroimaging applications. This study examines the feasibility of using multitasking in three different applications, including tissue segmentation, cross-modality conversion, and bias-field correction. These applications reflect five different scenarios in which multitasking is explored and 280 training and testing sessions conducted for empirical evaluations. Two well-known networks, U-Net as a well-known convolutional neural network architecture, and a closed architecture based on the conditional generative adversarial network are implemented. Different metrics such as the normalized cross-correlation coefficient and Dice scores are used for comparison of methods and results of the different experiments. Statistical analysis is also provided by paired t-test. The present study explores the pros and cons of these methods and their practical impacts on multitasking in different implementation scenarios. This investigation shows that bias correction and cross-modality conversion applications are significantly easier than the segmentation application, and having multitasking with segmentation is not reasonable if one of them is identified as the main target application. However, when the main application is the segmentation of tissues, multitasking with cross-modality conversion is beneficial, especially for the U-net architecture.

Fast, Accurate and Interpretable Time Series Classification Through Randomization arxiv:2105.14876 📈 5

Nestor Cabello, Elham Naghizade, Jianzhong Qi, Lars Kulik

**Abstract:** Time series classification (TSC) aims to predict the class label of a given time series, which is critical to a rich set of application areas such as economics and medicine. State-of-the-art TSC methods have mostly focused on classification accuracy and efficiency, without considering the interpretability of their classifications, which is an important property required by modern applications such as appliance modeling and legislation such as the European General Data Protection Regulation. To address this gap, we propose a novel TSC method - the Randomized-Supervised Time Series Forest (r-STSF). r-STSF is highly efficient, achieves state-of-the-art classification accuracy and enables interpretability. r-STSF takes an efficient interval-based approach to classify time series according to aggregate values of discriminatory sub-series (intervals). To achieve state-of-the-art accuracy, r-STSF builds an ensemble of randomized trees using the discriminatory sub-series. It uses four time series representations, nine aggregation functions and a supervised binary-inspired search combined with a feature ranking metric to identify highly discriminatory sub-series. The discriminatory sub-series enable interpretable classifications. Experiments on extensive datasets show that r-STSF achieves state-of-the-art accuracy while being orders of magnitude faster than most existing TSC methods. It is the only classifier from the state-of-the-art group that enables interpretability. Our findings also highlight that r-STSF is the best TSC method when classifying complex time series datasets.

Towards Lower Bounds on the Depth of ReLU Neural Networks arxiv:2105.14835 📈 5

Christoph Hertrich, Amitabh Basu, Marco Di Summa, Martin Skutella

**Abstract:** We contribute to a better understanding of the class of functions that is represented by a neural network with ReLU activations and a given architecture. Using techniques from mixed-integer optimization, polyhedral theory, and tropical geometry, we provide a mathematical counterbalance to the universal approximation theorems which suggest that a single hidden layer is sufficient for learning tasks. In particular, we investigate whether the class of exactly representable functions strictly increases by adding more layers (with no restrictions on size). This problem has potential impact on algorithmic and statistical aspects because of the insight it provides into the class of functions represented by neural hypothesis classes. However, to the best of our knowledge, this question has not been investigated in the neural network literature. We also present upper bounds on the sizes of neural networks required to represent functions in these neural hypothesis classes.

An exact counterfactual-example-based approach to tree-ensemble models interpretability arxiv:2105.14820 📈 5

Pierre Blanchart

**Abstract:** Explaining the decisions of machine learning models is becoming a necessity in many areas where trust in ML models decision is key to their accreditation/adoption. The ability to explain models decisions also allows to provide diagnosis in addition to the model decision, which is highly valuable in scenarios such as fault detection. Unfortunately, high-performance models do not exhibit the necessary transparency to make their decisions fully understandable. And the black-boxes approaches, which are used to explain such model decisions, suffer from a lack of accuracy in tracing back the exact cause of a model decision regarding a given input. Indeed, they do not have the ability to explicitly describe the decision regions of the model around that input, which is necessary to determine what influences the model towards one decision or the other. We thus asked ourselves the question: is there a category of high-performance models among the ones currently used for which we could explicitly and exactly characterise the decision regions in the input feature space using a geometrical characterisation? Surprisingly we came out with a positive answer for any model that enters the category of tree ensemble models, which encompasses a wide range of high-performance models such as XGBoost, LightGBM, random forests ... We could derive an exact geometrical characterisation of their decision regions under the form of a collection of multidimensional intervals. This characterisation makes it straightforward to compute the optimal counterfactual (CF) example associated with a query point. We demonstrate several possibilities of the approach, such as computing the CF example based only on a subset of features. This allows to obtain more plausible explanations by adding prior knowledge about which variables the user can control. An adaptation to CF reasoning on regression problems is also envisaged.

WiCluster: Passive Indoor 2D/3D Positioning using WiFi without Precise Labels arxiv:2107.01002 📈 4

Ilia Karmanov, Farhad G. Zanjani, Simone Merlin, Ishaque Kadampot, Daniel Dijkman

**Abstract:** We introduce WiCluster, a new machine learning (ML) approach for passive indoor positioning using radio frequency (RF) channel state information (CSI). WiCluster can predict both a zone-level position and a precise 2D or 3D position, without using any precise position labels during training. Prior CSI-based indoor positioning work has relied on non-parametric approaches using digital signal-processing (DSP) and, more recently, parametric approaches (e.g., fully supervised ML methods). However these do not handle the complexity of real-world environments well and do not meet requirements for large-scale commercial deployments: the accuracy of DSP-based method deteriorates significantly in non-line-of-sight conditions, while supervised ML methods need large amounts of hard-to-acquire centimeter accuracy position labels. In contrast, WiCluster is precise, requires weaker label-information that can be easily collected, and works well in non-line-of-sight conditions. Our first contribution is a novel dimensionality reduction method for charting. It combines a triplet-loss with a multi-scale clustering-loss to map the high-dimensional CSI representation to a 2D/3D latent space. Our second contribution is two weakly supervised losses that map this latent space into a Cartesian map, resulting in meter-accuracy position results. These losses only require simple to acquire priors: a sketch of the floorplan, approximate access-point locations and a few CSI packets that are labelled with the corresponding zone in the floorplan. Thirdly, we report results and a robustness study for 2D positioning in two single-floor office buildings and 3D positioning in a two-story home.

Concurrent Adversarial Learning for Large-Batch Training arxiv:2106.00221 📈 4

Yong Liu, Xiangning Chen, Minhao Cheng, Cho-Jui Hsieh, Yang You

**Abstract:** Large-batch training has become a commonly used technique when training neural networks with a large number of GPU/TPU processors. As batch size increases, stochastic optimizers tend to converge to sharp local minima, leading to degraded test performance. Current methods usually use extensive data augmentation to increase the batch size, but we found the performance gain with data augmentation decreases as batch size increases, and data augmentation will become insufficient after certain point. In this paper, we propose to use adversarial learning to increase the batch size in large-batch training. Despite being a natural choice for smoothing the decision surface and biasing towards a flat region, adversarial learning has not been successfully applied in large-batch training since it requires at least two sequential gradient computations at each step, which will at least double the running time compared with vanilla training even with a large number of processors. To overcome this issue, we propose a novel Concurrent Adversarial Learning (ConAdv) method that decouple the sequential gradient computations in adversarial learning by utilizing staled parameters. Experimental results demonstrate that ConAdv can successfully increase the batch size on both ResNet-50 and EfficientNet training on ImageNet while maintaining high accuracy. In particular, we show ConAdv along can achieve 75.3\% top-1 accuracy on ImageNet ResNet-50 training with 96K batch size, and the accuracy can be further improved to 76.2\% when combining ConAdv with data augmentation. This is the first work successfully scales ResNet-50 training batch size to 96K.

Adaptive Conformal Inference Under Distribution Shift arxiv:2106.00170 📈 4

Isaac Gibbs, Emmanuel Candès

**Abstract:** We develop methods for forming prediction sets in an online setting where the data generating distribution is allowed to vary over time in an unknown fashion. Our framework builds on ideas from conformal inference to provide a general wrapper that can be combined with any black box method that produces point predictions of the unseen label or estimated quantiles of its distribution. While previous conformal inference methods rely on the assumption that the data points are exchangeable, our adaptive approach provably achieves the desired coverage frequency over long-time intervals irrespective of the true data generating process. We accomplish this by modelling the distribution shift as a learning problem in a single parameter whose optimal value is varying over time and must be continuously re-estimated. We test our method, adaptive conformal inference, on two real world datasets and find that its predictions are robust to visible and significant distribution shifts.

Corpus-Based Paraphrase Detection Experiments and Review arxiv:2106.00145 📈 4

Tedo Vrbanec, Ana Mestrovic

**Abstract:** Paraphrase detection is important for a number of applications, including plagiarism detection, authorship attribution, question answering, text summarization, text mining in general, etc. In this paper, we give a performance overview of various types of corpus-based models, especially deep learning (DL) models, with the task of paraphrase detection. We report the results of eight models (LSI, TF-IDF, Word2Vec, Doc2Vec, GloVe, FastText, ELMO, and USE) evaluated on three different public available corpora: Microsoft Research Paraphrase Corpus, Clough and Stevenson and Webis Crowd Paraphrase Corpus 2011. Through a great number of experiments, we decided on the most appropriate approaches for text pre-processing: hyper-parameters, sub-model selection-where they exist (e.g., Skipgram vs. CBOW), distance measures, and semantic similarity/paraphrase detection threshold. Our findings and those of other researchers who have used deep learning models show that DL models are very competitive with traditional state-of-the-art approaches and have potential that should be further developed.

Unifying Distillation with Personalization in Federated Learning arxiv:2105.15191 📈 4

Siddharth Divi, Habiba Farrukh, Berkay Celik

**Abstract:** Federated learning (FL) is a decentralized privacy-preserving learning technique in which clients learn a joint collaborative model through a central aggregator without sharing their data. In this setting, all clients learn a single common predictor (FedAvg), which does not generalize well on each client's local data due to the statistical data heterogeneity among clients. In this paper, we address this problem with PersFL, a discrete two-stage personalized learning algorithm. In the first stage, PersFL finds the optimal teacher model of each client during the FL training phase. In the second stage, PersFL distills the useful knowledge from optimal teachers into each user's local model. The teacher model provides each client with some rich, high-level representation that a client can easily adapt to its local model, which overcomes the statistical heterogeneity present at different clients. We evaluate PersFL on CIFAR-10 and MNIST datasets using three data-splitting strategies to control the diversity between clients' data distributions. We empirically show that PersFL outperforms FedAvg and three state-of-the-art personalization methods, pFedMe, Per-FedAvg, and FedPer on majority data-splits with minimal communication cost. Further, we study the performance of PersFL on different distillation objectives, how this performance is affected by the equitable notion of fairness among clients, and the number of required communication rounds. PersFL code is available at https://tinyurl.com/hdh5zhxs for public use and validation.

Reinforced Generative Adversarial Network for Abstractive Text Summarization arxiv:2105.15176 📈 4

Tianyang Xu, Chunyun Zhang

**Abstract:** Sequence-to-sequence models provide a viable new approach to generative summarization, allowing models that are no longer limited to simply selecting and recombining sentences from the original text. However, these models have three drawbacks: their grasp of the details of the original text is often inaccurate, and the text generated by such models often has repetitions, while it is difficult to handle words that are beyond the word list. In this paper, we propose a new architecture that combines reinforcement learning and adversarial generative networks to enhance the sequence-to-sequence attention model. First, we use a hybrid pointer-generator network that copies words directly from the source text, contributing to accurate reproduction of information without sacrificing the ability of generators to generate new words. Second, we use both intra-temporal and intra-decoder attention to penalize summarized content and thus discourage repetition. We apply our model to our own proposed COVID-19 paper title summarization task and achieve close approximations to the current model on ROUEG, while bringing better readability.

DISSECT: Disentangled Simultaneous Explanations via Concept Traversals arxiv:2105.15164 📈 4

Asma Ghandeharioun, Been Kim, Chun-Liang Li, Brendan Jou, Brian Eoff, Rosalind W. Picard

**Abstract:** Explaining deep learning model inferences is a promising venue for scientific understanding, improving safety, uncovering hidden biases, evaluating fairness, and beyond, as argued by many scholars. One of the principal benefits of counterfactual explanations is allowing users to explore "what-if" scenarios through what does not and cannot exist in the data, a quality that many other forms of explanation such as heatmaps and influence functions are inherently incapable of doing. However, most previous work on generative explainability cannot disentangle important concepts effectively, produces unrealistic examples, or fails to retain relevant information. We propose a novel approach, DISSECT, that jointly trains a generator, a discriminator, and a concept disentangler to overcome such challenges using little supervision. DISSECT generates Concept Traversals (CTs), defined as a sequence of generated examples with increasing degrees of concepts that influence a classifier's decision. By training a generative model from a classifier's signal, DISSECT offers a way to discover a classifier's inherent "notion" of distinct concepts automatically rather than rely on user-predefined concepts. We show that DISSECT produces CTs that (1) disentangle several concepts, (2) are influential to a classifier's decision and are coupled to its reasoning due to joint training (3), are realistic, (4) preserve relevant information, and (5) are stable across similar inputs. We validate DISSECT on several challenging synthetic and realistic datasets where previous methods fall short of satisfying desirable criteria for interpretability and show that it performs consistently well and better than existing methods. Finally, we present experiments showing applications of DISSECT for detecting potential biases of a classifier and identifying spurious artifacts that impact predictions.

Can Attention Enable MLPs To Catch Up With CNNs? arxiv:2105.15078 📈 4

Meng-Hao Guo, Zheng-Ning Liu, Tai-Jiang Mu, Dun Liang, Ralph R. Martin, Shi-Min Hu

**Abstract:** In the first week of May, 2021, researchers from four different institutions: Google, Tsinghua University, Oxford University and Facebook, shared their latest work [16, 7, 12, 17] on arXiv.org almost at the same time, each proposing new learning architectures, consisting mainly of linear layers, claiming them to be comparable, or even superior to convolutional-based models. This sparked immediate discussion and debate in both academic and industrial communities as to whether MLPs are sufficient, many thinking that learning architectures are returning to MLPs. Is this true? In this perspective, we give a brief history of learning architectures, including multilayer perceptrons (MLPs), convolutional neural networks (CNNs) and transformers. We then examine what the four newly proposed architectures have in common. Finally, we give our views on challenges and directions for new learning architectures, hoping to inspire future research.

First Full-Event Reconstruction from Imaging Atmospheric Cherenkov Telescope Real Data with Deep Learning arxiv:2105.14927 📈 4

Mikaël Jacquemont, Thomas Vuillaume, Alexandre Benoit, Gilles Maurin, Patrick Lambert, Giovanni Lamanna

**Abstract:** The Cherenkov Telescope Array is the future of ground-based gamma-ray astronomy. Its first prototype telescope built on-site, the Large Size Telescope 1, is currently under commissioning and taking its first scientific data. In this paper, we present for the first time the development of a full-event reconstruction based on deep convolutional neural networks and its application to real data. We show that it outperforms the standard analysis, both on simulated and on real data, thus validating the deep approach for the CTA data analysis. This work also illustrates the difficulty of moving from simulated data to actual data.

Gradient-based Data Subversion Attack Against Binary Classifiers arxiv:2105.14803 📈 4

Rosni K Vasu, Sanjay Seetharaman, Shubham Malaviya, Manish Shukla, Sachin Lodha

**Abstract:** Machine learning based data-driven technologies have shown impressive performances in a variety of application domains. Most enterprises use data from multiple sources to provide quality applications. The reliability of the external data sources raises concerns for the security of the machine learning techniques adopted. An attacker can tamper the training or test datasets to subvert the predictions of models generated by these techniques. Data poisoning is one such attack wherein the attacker tries to degrade the performance of a classifier by manipulating the training data. In this work, we focus on label contamination attack in which an attacker poisons the labels of data to compromise the functionality of the system. We develop Gradient-based Data Subversion strategies to achieve model degradation under the assumption that the attacker has limited-knowledge of the victim model. We exploit the gradients of a differentiable convex loss function (residual errors) with respect to the predicted label as a warm-start and formulate different strategies to find a set of data instances to contaminate. Further, we analyze the transferability of attacks and the susceptibility of binary classifiers. Our experiments show that the proposed approach outperforms the baselines and is computationally efficient.

Low-Dose CT Denoising Using a Structure-Preserving Kernel Prediction Network arxiv:2105.14758 📈 4

Lu Xu, Yuwei Zhang, Ying Liu, Daoye Wang, Mu Zhou, Jimmy Ren, Jingwei Wei, Zhaoxiang Ye

**Abstract:** Low-dose CT has been a key diagnostic imaging modality to reduce the potential risk of radiation overdose to patient health. Despite recent advances, CNN-based approaches typically apply filters in a spatially invariant way and adopt similar pixel-level losses, which treat all regions of the CT image equally and can be inefficient when fine-grained structures coexist with non-uniformly distributed noises. To address this issue, we propose a Structure-preserving Kernel Prediction Network (StructKPN) that combines the kernel prediction network with a structure-aware loss function that utilizes the pixel gradient statistics and guides the model towards spatially-variant filters that enhance noise removal, prevent over-smoothing and preserve detailed structures for different regions in CT imaging. Extensive experiments demonstrated that our approach achieved superior performance on both synthetic and non-synthetic datasets, and better preserves structures that are highly desired in clinical screening and low-dose protocol optimization.

Active Learning of Continuous-time Bayesian Networks through Interventions arxiv:2105.14742 📈 4

Dominik Linzner, Heinz Koeppl

**Abstract:** We consider the problem of learning structures and parameters of Continuous-time Bayesian Networks (CTBNs) from time-course data under minimal experimental resources. In practice, the cost of generating experimental data poses a bottleneck, especially in the natural and social sciences. A popular approach to overcome this is Bayesian optimal experimental design (BOED). However, BOED becomes infeasible in high-dimensional settings, as it involves integration over all possible experimental outcomes. We propose a novel criterion for experimental design based on a variational approximation of the expected information gain. We show that for CTBNs, a semi-analytical expression for this criterion can be calculated for structure and parameter learning. By doing so, we can replace sampling over experimental outcomes by solving the CTBNs master-equation, for which scalable approximations exist. This alleviates the computational burden of sampling possible experimental outcomes in high-dimensions. We employ this framework in order to recommend interventional sequences. In this context, we extend the CTBN model to conditional CTBNs in order to incorporate interventions. We demonstrate the performance of our criterion on synthetic and real-world data.

Similarity Embedding Networks for Robust Human Activity Recognition arxiv:2106.15283 📈 3

Chenglin Li, Carrie Lu Tong, Di Niu, Bei Jiang, Xiao Zuo, Lei Cheng, Jian Xiong, Jianming Yang

**Abstract:** Deep learning models for human activity recognition (HAR) based on sensor data have been heavily studied recently. However, the generalization ability of deep models on complex real-world HAR data is limited by the availability of high-quality labeled activity data, which are hard to obtain. In this paper, we design a similarity embedding neural network that maps input sensor signals onto real vectors through carefully designed convolutional and LSTM layers. The embedding network is trained with a pairwise similarity loss, encouraging the clustering of samples from the same class in the embedded real space, and can be effectively trained on a small dataset and even on a noisy dataset with mislabeled samples. Based on the learned embeddings, we further propose both nonparametric and parametric approaches for activity recognition. Extensive evaluation based on two public datasets has shown that the proposed similarity embedding network significantly outperforms state-of-the-art deep models on HAR classification tasks, is robust to mislabeled samples in the training set, and can also be used to effectively denoise a noisy dataset.

Byakto Speech: Real-time long speech synthesis with convolutional neural network: Transfer learning from English to Bangla arxiv:2106.03937 📈 3

Zabir Al Nazi, Sayed Mohammed Tasmimul Huda

**Abstract:** Speech synthesis is one of the challenging tasks to automate by deep learning, also being a low-resource language there are very few attempts at Bangla speech synthesis. Most of the existing works can't work with anything other than simple Bangla characters script, very short sentences, etc. This work attempts to solve these problems by introducing Byakta, the first-ever open-source deep learning-based bilingual (Bangla and English) text to a speech synthesis system. A speech recognition model-based automated scoring metric was also proposed to evaluate the performance of a TTS model. We also introduce a test benchmark dataset for Bangla speech synthesis models for evaluating speech quality. The TTS is available at https://github.com/zabir-nabil/bangla-tts

Federated Estimation of Causal Effects from Observational Data arxiv:2106.00456 📈 3

Thanh Vinh Vo, Trong Nghia Hoang, Young Lee, Tze-Yun Leong

**Abstract:** Many modern applications collect data that comes in federated spirit, with data kept locally and undisclosed. Till date, most insight into the causal inference requires data to be stored in a central repository. We present a novel framework for causal inference with federated data sources. We assess and integrate local causal effects from different private data sources without centralizing them. Then, the treatment effects on subjects from observational data using a non-parametric reformulation of the classical potential outcomes framework is estimated. We model the potential outcomes as a random function distributed by Gaussian processes, whose defining parameters can be efficiently learned from multiple data sources, respecting privacy constraints. We demonstrate the promise and efficiency of the proposed approach through a set of simulated and real-world benchmark examples.

Duckworth-Lewis-Stern Method Comparison with Machine Learning Approach arxiv:2106.00175 📈 3

Kumail Abbas, Sajjad Haider

**Abstract:** This work presents an analysis of the Duckworth-Lewis-Stern (DLS) method for One Day International (ODI) cricket matches. The accuracy of the DLS method is compared against various supervised learning algorithms for result prediction. The result of a cricket match is predicted during the second inning. The paper also optimized DLS resource table which is used in the Duckworth-Lewis (D/L) formula to increase its predictive power. Finally, an Unpredictability Index is developed that ranks different cricket playing nations according to how unpredictable they are while playing an ODI match.

Enhancing Trajectory Prediction using Sparse Outputs: Application to Team Sports arxiv:2106.00173 📈 3

Brandon Victor, Aiden Nibali, Zhen He, David L. Carey

**Abstract:** Sophisticated trajectory prediction models that effectively mimic team dynamics have many potential uses for sports coaches, broadcasters and spectators. However, through experiments on soccer data we found that it can be surprisingly challenging to train a deep learning model for player trajectory prediction which outperforms linear extrapolation on average distance between predicted and true future trajectories. We propose and test a novel method for improving training by predicting a sparse trajectory and interpolating using constant acceleration, which improves performance for several models. This interpolation can also be used on models that aren't trained with sparse outputs, and we find that this consistently improves performance for all tested models. Additionally, we find that the accuracy of predicted trajectories for a subset of players can be improved by conditioning on the full trajectories of the other players, and that this is further improved when combined with sparse predictions. We also propose a novel architecture using graph networks and multi-head attention (GraN-MA) which achieves better performance than other tested state-of-the-art models on our dataset and is trivially adapted for both sparse trajectories and full-trajectory conditioned trajectory prediction.

A Road-map to Robot Task Execution with the Functional Object-Oriented Network arxiv:2106.00158 📈 3

David Paulius, Alejandro Agostini, Yu Sun, Dongheui Lee

**Abstract:** Following work on joint object-action representations, the functional object-oriented network (FOON) was introduced as a knowledge graph representation for robots. Taking the form of a bipartite graph, a FOON contains symbolic or high-level information that would be pertinent to a robot's understanding of its environment and tasks in a way that mirrors human understanding of actions. In this work, we outline a road-map for future development of FOON and its application in robotic systems for task planning as well as knowledge acquisition from demonstration. We propose preliminary ideas to show how a FOON can be created in a real-world scenario with a robot and human teacher in a way that can jointly augment existing knowledge in a FOON and teach a robot the skills it needs to replicate the demonstrated actions and solve a given manipulation problem.

Explanations for Monotonic Classifiers arxiv:2106.00154 📈 3

Joao Marques-Silva, Thomas Gerspacher, Martin Cooper, Alexey Ignatiev, Nina Narodytska

**Abstract:** In many classification tasks there is a requirement of monotonicity. Concretely, if all else remains constant, increasing (resp. decreasing) the value of one or more features must not decrease (resp. increase) the value of the prediction. Despite comprehensive efforts on learning monotonic classifiers, dedicated approaches for explaining monotonic classifiers are scarce and classifier-specific. This paper describes novel algorithms for the computation of one formal explanation of a (black-box) monotonic classifier. These novel algorithms are polynomial in the run time complexity of the classifier and the number of features. Furthermore, the paper presents a practically efficient model-agnostic algorithm for enumerating formal explanations.

Clustering-friendly Representation Learning via Instance Discrimination and Feature Decorrelation arxiv:2106.00131 📈 3

Yaling Tao, Kentaro Takagi, Kouta Nakata

**Abstract:** Clustering is one of the most fundamental tasks in machine learning. Recently, deep clustering has become a major trend in clustering techniques. Representation learning often plays an important role in the effectiveness of deep clustering, and thus can be a principal cause of performance degradation. In this paper, we propose a clustering-friendly representation learning method using instance discrimination and feature decorrelation. Our deep-learning-based representation learning method is motivated by the properties of classical spectral clustering. Instance discrimination learns similarities among data and feature decorrelation removes redundant correlation among features. We utilize an instance discrimination method in which learning individual instance classes leads to learning similarity among instances. Through detailed experiments and examination, we show that the approach can be adapted to learning a latent space for clustering. We design novel softmax-formulated decorrelation constraints for learning. In evaluations of image clustering using CIFAR-10 and ImageNet-10, our method achieves accuracy of 81.5% and 95.4%, respectively. We also show that the softmax-formulated constraints are compatible with various neural networks.

Probabilistic Deep Learning with Probabilistic Neural Networks and Deep Probabilistic Models arxiv:2106.00120 📈 3

Daniel T. Chang

**Abstract:** Probabilistic deep learning is deep learning that accounts for uncertainty, both model uncertainty and data uncertainty. It is based on the use of probabilistic models and deep neural networks. We distinguish two approaches to probabilistic deep learning: probabilistic neural networks and deep probabilistic models. The former employs deep neural networks that utilize probabilistic layers which can represent and process uncertainty; the latter uses probabilistic models that incorporate deep neural network components which capture complex non-linear stochastic relationships between the random variables. We discuss some major examples of each approach including Bayesian neural networks and mixture density networks (for probabilistic neural networks), and variational autoencoders, deep Gaussian processes and deep mixed effects models (for deep probabilistic models). TensorFlow Probability is a library for probabilistic modeling and inference which can be used for both approaches of probabilistic deep learning. We include its code examples for illustration.

Early Detection of COVID-19 Hotspots Using Spatio-Temporal Data arxiv:2106.00072 📈 3

Shixiang Zhu, Alexander Bukharin, Liyan Xie, Khurram Yamin, Shihao Yang, Pinar Keskinocak, Yao Xie

**Abstract:** Recently, the Centers for Disease Control and Prevention (CDC) has worked with other federal agencies to identify counties with increasing coronavirus disease 2019 (COVID-19) incidence (hotspots) and offers support to local health departments to limit the spread of the disease. Understanding the spatio-temporal dynamics of hotspot events is of great importance to support policy decisions and prevent large-scale outbreaks. This paper presents a spatio-temporal Bayesian framework for early detection of COVID-19 hotspots (at the county level) in the United States. We assume both the observed number of cases and hotspots depend on a class of latent random variables, which encode the underlying spatio-temporal dynamics of the transmission of COVID-19. Such latent variables follow a zero-mean Gaussian process, whose covariance is specified by a non-stationary kernel function. The most salient feature of our kernel function is that deep neural networks are introduced to enhance the model's representative power while still enjoying the interpretability of the kernel. We derive a sparse model and fit the model using a variational learning strategy to circumvent the computational intractability for large data sets. Our model demonstrates better interpretability and superior hotspot-detection performance compared to other baseline methods.

Low-Resource Spoken Language Identification Using Self-Attentive Pooling and Deep 1D Time-Channel Separable Convolutions arxiv:2106.00052 📈 3

Roman Bedyakin, Nikolay Mikhaylovskiy

**Abstract:** This memo describes NTR/TSU winning submission for Low Resource ASR challenge at Dialog2021 conference, language identification track. Spoken Language Identification (LID) is an important step in a multilingual Automated Speech Recognition (ASR) system pipeline. Traditionally, the ASR task requires large volumes of labeled data that are unattainable for most of the world's languages, including most of the languages of Russia. In this memo, we show that a convolutional neural network with a Self-Attentive Pooling layer shows promising results in low-resource setting for the language identification task and set up a SOTA for the Low Resource ASR challenge dataset. Additionally, we compare the structure of confusion matrices for this and significantly more diverse VoxForge dataset and state and substantiate the hypothesis that whenever the dataset is diverse enough so that the other classification factors, like gender, age etc. are well-averaged, the confusion matrix for LID system bears the language similarity measure.

Not All Images are Worth 16x16 Words: Dynamic Transformers for Efficient Image Recognition arxiv:2105.15075 📈 3

Yulin Wang, Rui Huang, Shiji Song, Zeyi Huang, Gao Huang

**Abstract:** Vision Transformers (ViT) have achieved remarkable success in large-scale image recognition. They split every 2D image into a fixed number of patches, each of which is treated as a token. Generally, representing an image with more tokens would lead to higher prediction accuracy, while it also results in drastically increased computational cost. To achieve a decent trade-off between accuracy and speed, the number of tokens is empirically set to 16x16 or 14x14. In this paper, we argue that every image has its own characteristics, and ideally the token number should be conditioned on each individual input. In fact, we have observed that there exist a considerable number of "easy" images which can be accurately predicted with a mere number of 4x4 tokens, while only a small fraction of "hard" ones need a finer representation. Inspired by this phenomenon, we propose a Dynamic Transformer to automatically configure a proper number of tokens for each input image. This is achieved by cascading multiple Transformers with increasing numbers of tokens, which are sequentially activated in an adaptive fashion at test time, i.e., the inference is terminated once a sufficiently confident prediction is produced. We further design efficient feature reuse and relationship reuse mechanisms across different components of the Dynamic Transformer to reduce redundant computations. Extensive empirical results on ImageNet, CIFAR-10, and CIFAR-100 demonstrate that our method significantly outperforms the competitive baselines in terms of both theoretical computational efficiency and practical inference speed. Code and pre-trained models (based on PyTorch and MindSpore) are available at https://github.com/blackfeather-wang/Dynamic-Vision-Transformer and https://github.com/blackfeather-wang/Dynamic-Vision-Transformer-MindSpore.

The effectiveness of feature attribution methods and its correlation with automatic evaluation scores arxiv:2105.14944 📈 3

Giang Nguyen, Daeyoung Kim, Anh Nguyen

**Abstract:** Explaining the decisions of an Artificial Intelligence (AI) model is increasingly critical in many real-world, high-stake applications. Hundreds of papers have either proposed new feature attribution methods, discussed or harnessed these tools in their work. However, despite humans being the target end-users, most attribution methods were only evaluated on proxy automatic-evaluation metrics (Zhang et al. 2018; Zhou et al. 2016; Petsiuk et al. 2018). In this paper, we conduct the first user study to measure attribution map effectiveness in assisting humans in ImageNet classification and Stanford Dogs fine-grained classification, and when an image is natural or adversarial (i.e., contains adversarial perturbations). Overall, feature attribution is surprisingly not more effective than showing humans nearest training-set examples. On a harder task of fine-grained dog categorization, presenting attribution maps to humans does not help, but instead hurts the performance of human-AI teams compared to AI alone. Importantly, we found automatic attribution-map evaluation measures to correlate poorly with the actual human-AI team performance. Our findings encourage the community to rigorously test their methods on the downstream human-in-the-loop applications and to rethink the existing evaluation metrics.

Semi-orthogonal Embedding for Efficient Unsupervised Anomaly Segmentation arxiv:2105.14737 📈 3

Jin-Hwa Kim, Do-Hyeong Kim, Saehoon Yi, Taehoon Lee

**Abstract:** We present the efficiency of semi-orthogonal embedding for unsupervised anomaly segmentation. The multi-scale features from pre-trained CNNs are recently used for the localized Mahalanobis distances with significant performance. However, the increased feature size is problematic to scale up to the bigger CNNs, since it requires the batch-inverse of multi-dimensional covariance tensor. Here, we generalize an ad-hoc method, random feature selection, into semi-orthogonal embedding for robust approximation, cubically reducing the computational cost for the inverse of multi-dimensional covariance tensor. With the scrutiny of ablation studies, the proposed method achieves a new state-of-the-art with significant margins for the MVTec AD, KolektorSDD, KolektorSDD2, and mSTC datasets. The theoretical and empirical analyses offer insights and verification of our straightforward yet cost-effective approach.

Explainability via Interactivity? Supporting Nonexperts' Sensemaking of Pretrained CNN by Interacting with Their Daily Surroundings arxiv:2107.01996 📈 2

Chao Wang, Pengcheng An

**Abstract:** Current research on Explainable AI (XAI) heavily targets on expert users (data scientists or AI developers). However, increasing importance has been argued for making AI more understandable to nonexperts, who are expected to leverage AI techniques, but have limited knowledge about AI. We present a mobile application to support nonexperts to interactively make sense of Convolutional Neural Networks (CNN); it allows users to play with a pretrained CNN by taking pictures of their surrounding objects. We use an up-to-date XAI technique (Class Activation Map) to intuitively visualize the model's decision (the most important image regions that lead to a certain result). Deployed in a university course, this playful learning tool was found to support design students to gain vivid understandings about the capabilities and limitations of pretrained CNNs in real-world environments. Concrete examples of students' playful explorations are reported to characterize their sensemaking processes reflecting different depths of thought.

Data Fusion for Deep Learning on Transport Mode Detection: A Case Study arxiv:2106.05876 📈 2

Hugues Moreau, Andréa Vassilev, Liming Chen

**Abstract:** In Transport Mode Detection, a great diversity of methodologies exist according to the choice made on sensors, preprocessing, model used, etc. In this domain, the comparisons between each option are not always complete. Experiments on a public, real-life dataset are led here to evaluate carefully each of the choices that were made, with a specific emphasis on data fusion methods. Our most surprising finding is that none of the methods we implemented from the literature is better than a simple late fusion. Two important decisions are the choice of a sensor and the choice of a representation for the data: we found that using 2D convolutions on spectrograms with a logarithmic axis for the frequencies was better than 1-dimensional temporal representations.

Analysis of convolutional neural network image classifiers in a hierarchical max-pooling model with additional local pooling arxiv:2106.05233 📈 2

Benjamin Walter

**Abstract:** Image classification is considered, and a hierarchical max-pooling model with additional local pooling is introduced. Here the additional local pooling enables the hierachical model to combine parts of the image which have a variable relative distance towards each other. Various convolutional neural network image classifiers are introduced and compared in view of their rate of convergence. The finite sample size performance of the estimates is analyzed by applying them to simulated and real data.

An ordinal CNN approach for the assessment of neurological damage in Parkinson's disease patients arxiv:2106.05230 📈 2

Javier Barbero-Gómez, Pedro-Antonio Gutiérrez, Víctor-Manuel Vargas, Juan-Antonio Vallejo-Casas, César Hervás-Martínez

**Abstract:** 3D image scans are an assessment tool for neurological damage in Parkinson's disease (PD) patients. This diagnosis process can be automatized to help medical staff through Decision Support Systems (DSSs), and Convolutional Neural Networks (CNNs) are good candidates, because they are effective when applied to spatial data. This paper proposes a 3D CNN ordinal model for assessing the level or neurological damage in PD patients. Given that CNNs need large datasets to achieve acceptable performance, a data augmentation method is adapted to work with spatial data. We consider the Ordinal Graph-based Oversampling via Shortest Paths (OGO-SP) method, which applies a gamma probability distribution for inter-class data generation. A modification of OGO-SP is proposed, the OGO-SP-$β$ algorithm, which applies the beta distribution for generating synthetic samples in the inter-class region, a better suited distribution when compared to gamma. The evaluation of the different methods is based on a novel 3D image dataset provided by the Hospital Universitario 'Reina Sofía' (Córdoba, Spain). We show how the ordinal methodology improves the performance with respect to the nominal one, and how OGO-SP-$β$ yields better performance than OGO-SP.

Neural message passing for joint paratope-epitope prediction arxiv:2106.00757 📈 2

Alice Del Vecchio, Andreea Deac, Pietro Liò, Petar Veličković

**Abstract:** Antibodies are proteins in the immune system which bind to antigens to detect and neutralise them. The binding sites in an antibody-antigen interaction are known as the paratope and epitope, respectively, and the prediction of these regions is key to vaccine and synthetic antibody development. Contrary to prior art, we argue that paratope and epitope predictors require asymmetric treatment, and propose distinct neural message passing architectures that are geared towards the specific aspects of paratope and epitope prediction, respectively. We obtain significant improvements on both tasks, setting the new state-of-the-art and recovering favourable qualitative predictions on antigens of relevance to COVID-19.

Fine-grained Generalization Analysis of Structured Output Prediction arxiv:2106.00115 📈 2

Waleed Mustafa, Yunwen Lei, Antoine Ledent, Marius Kloft

**Abstract:** In machine learning we often encounter structured output prediction problems (SOPPs), i.e. problems where the output space admits a rich internal structure. Application domains where SOPPs naturally occur include natural language processing, speech recognition, and computer vision. Typical SOPPs have an extremely large label set, which grows exponentially as a function of the size of the output. Existing generalization analysis implies generalization bounds with at least a square-root dependency on the cardinality $d$ of the label set, which can be vacuous in practice. In this paper, we significantly improve the state of the art by developing novel high-probability bounds with a logarithmic dependency on $d$. Moreover, we leverage the lens of algorithmic stability to develop generalization bounds in expectation without any dependency on $d$. Our results therefore build a solid theoretical foundation for learning in large-scale SOPPs. Furthermore, we extend our results to learning with weakly dependent data.

Text Summarization with Latent Queries arxiv:2106.00104 📈 2

Yumo Xu, Mirella Lapata

**Abstract:** The availability of large-scale datasets has driven the development of neural models that create summaries from single documents, for generic purposes. When using a summarization system, users often have specific intents with various language realizations, which, depending on the information need, can range from a single keyword to a long narrative composed of multiple questions. Existing summarization systems, however, often either fail to support or act robustly on this query focused summarization task. We introduce LaQSum, the first unified text summarization system that learns Latent Queries from documents for abstractive summarization with any existing query forms. Under a deep generative framework, our system jointly optimizes a latent query model and a conditional language model, allowing users to plug-and-play queries of any type at test time. Despite learning from only generic summarization data and requiring no further optimization for downstream summarization tasks, our system robustly outperforms strong comparison systems across summarization benchmarks with different query types, document settings, and target domains.

Generalized AdaGrad (G-AdaGrad) and Adam: A State-Space Perspective arxiv:2106.00092 📈 2

Kushal Chakrabarti, Nikhil Chopra

**Abstract:** Accelerated gradient-based methods are being extensively used for solving non-convex machine learning problems, especially when the data points are abundant or the available data is distributed across several agents. Two of the prominent accelerated gradient algorithms are AdaGrad and Adam. AdaGrad is the simplest accelerated gradient method, which is particularly effective for sparse data. Adam has been shown to perform favorably in deep learning problems compared to other methods. In this paper, we propose a new fast optimizer, Generalized AdaGrad (G-AdaGrad), for accelerating the solution of potentially non-convex machine learning problems. Specifically, we adopt a state-space perspective for analyzing the convergence of gradient acceleration algorithms, namely G-AdaGrad and Adam, in machine learning. Our proposed state-space models are governed by ordinary differential equations. We present simple convergence proofs of these two algorithms in the deterministic settings with minimal assumptions. Our analysis also provides intuition behind improving upon AdaGrad's convergence rate. We provide empirical results on MNIST dataset to reinforce our claims on the convergence and performance of G-AdaGrad and Adam.

Automating Visualization Quality Assessment: a Case Study in Higher Education arxiv:2106.00077 📈 2

Nicolas Steven Holliman

**Abstract:** We present a case study in the use of machine+human mixed intelligence for visualization quality assessment, applying automated visualization quality metrics to support the human assessment of data visualizations produced as coursework by students taking higher education courses. A set of image informatics algorithms including edge congestion, visual saliency and colour analysis generate machine analysis of student visualizations. The insight from the image informatics outputs has proved helpful for the marker in assessing the work and is also provided to the students as part of a written report on their work. Student and external reviewer comments suggest that the addition of the image informatics outputs to the standard feedback document was a positive step. We review the ethical challenges of working with assessment data and of automating assessment processes.

GRAVITAS: Graphical Reticulated Attack Vectors for Internet-of-Things Aggregate Security arxiv:2106.00073 📈 2

Jacob Brown, Tanujay Saha, Niraj K. Jha

**Abstract:** Internet-of-Things (IoT) and cyber-physical systems (CPSs) may consist of thousands of devices connected in a complex network topology. The diversity and complexity of these components present an enormous attack surface, allowing an adversary to exploit security vulnerabilities of different devices to execute a potent attack. Though significant efforts have been made to improve the security of individual devices in these systems, little attention has been paid to security at the aggregate level. In this article, we describe a comprehensive risk management system, called GRAVITAS, for IoT/CPS that can identify undiscovered attack vectors and optimize the placement of defenses within the system for optimal performance and cost. While existing risk management systems consider only known attacks, our model employs a machine learning approach to extrapolate undiscovered exploits, enabling us to identify attacks overlooked by manual penetration testing (pen-testing). The model is flexible enough to analyze practically any IoT/CPS and provide the system administrator with a concrete list of suggested defenses that can reduce system vulnerability at optimal cost. GRAVITAS can be employed by governments, companies, and system administrators to design secure IoT/CPS at scale, providing a quantitative measure of security and efficiency in a world where IoT/CPS devices will soon be ubiquitous.

Using machine learning for quantum annealing accuracy prediction arxiv:2106.00065 📈 2

Aaron Barbosa, Elijah Pelofske, Georg Hahn, Hristo N. Djidjev

**Abstract:** Quantum annealers, such as the device built by D-Wave Systems, Inc., offer a way to compute solutions of NP-hard problems that can be expressed in Ising or QUBO (quadratic unconstrained binary optimization) form. Although such solutions are typically of very high quality, problem instances are usually not solved to optimality due to imperfections of the current generations quantum annealers. In this contribution, we aim to understand some of the factors contributing to the hardness of a problem instance, and to use machine learning models to predict the accuracy of the D-Wave 2000Q annealer for solving specific problems. We focus on the Maximum Clique problem, a classic NP-hard problem with important applications in network analysis, bioinformatics, and computational chemistry. By training a machine learning classification model on basic problem characteristics such as the number of edges in the graph, or annealing parameters such as D-Wave's chain strength, we are able to rank certain features in the order of their contribution to the solution hardness, and present a simple decision tree which allows to predict whether a problem will be solvable to optimality with the D-Wave 2000Q. We extend these results by training a machine learning regression model that predicts the clique size found by D-Wave.

PUDLE: Implicit Acceleration of Dictionary Learning by Backpropagation arxiv:2106.00058 📈 2

Bahareh Tolooshams, Demba Ba

**Abstract:** The dictionary learning problem, representing data as a combination of few atoms, has long stood as a popular method for learning representations in statistics and signal processing. The most popular dictionary learning algorithm alternates between sparse coding and dictionary update steps, and a rich literature has studied its theoretical convergence. The growing popularity of neurally plausible unfolded sparse coding networks has led to the empirical finding that backpropagation through such networks performs dictionary learning. This paper offers the first theoretical proof for these empirical results through PUDLE, a Provable Unfolded Dictionary LEarning method. We highlight the impact of loss, unfolding, and backpropagation on convergence. We discover an implicit acceleration: as a function of unfolding, the backpropagated gradient converges faster and is more accurate than the gradient from alternating minimization. We complement our findings through synthetic and image denoising experiments. The findings support the use of accelerated deep learning optimizers and unfolded networks for dictionary learning.

Persistent Homology Captures the Generalization of Neural Networks Without A Validation Set arxiv:2106.00012 📈 2

Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé, Marta Villegas

**Abstract:** The training of neural networks is usually monitored with a validation (holdout) set to estimate the generalization of the model. This is done instead of measuring intrinsic properties of the model to determine whether it is learning appropriately. In this work, we suggest studying the training of neural networks with Algebraic Topology, specifically Persistent Homology (PH). Using simplicial complex representations of neural networks, we study the PH diagram distance evolution on the neural network learning process with different architectures and several datasets. Results show that the PH diagram distance between consecutive neural network states correlates with the validation accuracy, implying that the generalization error of a neural network could be intrinsically estimated without any holdout set.

Fast Policy Extragradient Methods for Competitive Games with Entropy Regularization arxiv:2105.15186 📈 2

Shicong Cen, Yuting Wei, Yuejie Chi

**Abstract:** This paper investigates the problem of computing the equilibrium of competitive games, which is often modeled as a constrained saddle-point optimization problem with probability simplex constraints. Despite recent efforts in understanding the last-iterate convergence of extragradient methods in the unconstrained setting, the theoretical underpinnings of these methods in the constrained settings, especially those using multiplicative updates, remain highly inadequate, even when the objective function is bilinear. Motivated by the algorithmic role of entropy regularization in single-agent reinforcement learning and game theory, we develop provably efficient extragradient methods to find the quantal response equilibrium (QRE) -- which are solutions to zero-sum two-player matrix games with entropy regularization -- at a linear rate. The proposed algorithms can be implemented in a decentralized manner, where each player executes symmetric and multiplicative updates iteratively using its own payoff without observing the opponent's actions directly. In addition, by controlling the knob of entropy regularization, the proposed algorithms can locate an approximate Nash equilibrium of the unregularized matrix game at a sublinear rate without assuming the Nash equilibrium to be unique. Our methods also lead to efficient policy extragradient algorithms for solving (entropy-regularized) zero-sum Markov games at similar rates. All of our convergence rates are nearly dimension-free, which are independent of the size of the state and action spaces up to logarithm factors, highlighting the positive role of entropy regularization for accelerating convergence.

Policies for the Dynamic Traveling Maintainer Problem with Alerts arxiv:2105.15119 📈 2

Paulo da Costa, Peter Verleijsdonk, Simon Voorberg, Alp Akcay, Stella Kapodistria, Willem van Jaarsveld, Yingqian Zhang

**Abstract:** Companies require modern capital assets such as wind turbines, trains and hospital equipment to experience minimal downtime. Ideally, assets are maintained right before failure to ensure maximum availability at minimum maintenance costs. To this end, two challenges arise: failure times of assets are unknown a priori and assets can be part of a larger asset network. Nowadays, it is common for assets to be equipped with real-time monitoring that emits alerts, typically triggered by the first signs of degradation. Thus, it becomes crucial to plan maintenance considering information received via alerts, asset locations and maintenance costs. This problem is referred to as the Dynamic Traveling Maintainer Problem with Alerts (DTMPA). We propose a modeling framework for the DTMPA, where the alerts are early and imperfect indicators of failures. The objective is to minimize discounted maintenance costs accrued over an infinite time horizon. We propose three methods to solve this problem, leveraging different information levels from the alert signals. The proposed methods comprise various greedy heuristics that rank assets based on proximity, urgency and economic risk; a Traveling Maintainer Heuristic employing combinatorial optimization to optimize near-future costs; a Deep Reinforcement Learning (DRL) method trained to minimize the long-term costs using exclusively the history of alerts. In a simulated environment, all methods can approximate optimal policies with access to perfect condition information for small asset networks. For larger networks, where computing the optimal policy is intractable, the proposed methods yield competitive maintenance policies, with DRL consistently achieving the lowest costs.

Detecting Fetal Alcohol Spectrum Disorder in children using Artificial Neural Network arxiv:2105.15074 📈 2

Vannessa de J. Duarte, Paul Leger, Sergio Contreras, Hiroaki Fukuda

**Abstract:** Fetal alcohol spectrum disorder (FASD) is a syndrome whose only difference compared to other children's conditions is the mother's alcohol consumption during pregnancy. An earlier diagnosis of FASD improving the quality of life of children and adolescents. For this reason, this study focus on evaluating the use of the artificial neural network (ANN) to classify children with FASD and explore how accurate it is. ANN has been used to diagnose cancer, diabetes, and other diseases in the medical area, being a tool that presents good results. The data used is from a battery of tests from children for 5-18 years old (include tests of psychometric, saccade eye movement, and diffusion tensor imaging (DTI)). We study the different configurations of ANN with dense layers. The first one predicts 75\% of the outcome correctly for psychometric data. The others models include a feature layer, and we used it to predict FASD using every test individually. The models accurately predict over 70\% of the cases, and psychometric and memory guides predict over 88\% accuracy. The results suggest that the ANN approach is a competitive and efficient methodology to detect FASD. However, we could be careful in used as a diagnostic technique.

Max-Margin is Dead, Long Live Max-Margin! arxiv:2105.15069 📈 2

Alex Nowak-Vila, Alessandro Rudi, Francis Bach

**Abstract:** The foundational concept of Max-Margin in machine learning is ill-posed for output spaces with more than two labels such as in structured prediction. In this paper, we show that the Max-Margin loss can only be consistent to the classification task under highly restrictive assumptions on the discrete loss measuring the error between outputs. These conditions are satisfied by distances defined in tree graphs, for which we prove consistency, thus being the first losses shown to be consistent for Max-Margin beyond the binary setting. We finally address these limitations by correcting the concept of Max-Margin and introducing the Restricted-Max-Margin, where the maximization of the loss-augmented scores is maintained, but performed over a subset of the original domain. The resulting loss is also a generalization of the binary support vector machine and it is consistent under milder conditions on the discrete loss.

Scorpion detection and classification systems based on computer vision and deep learning for health security purposes arxiv:2105.15041 📈 2

Francisco Luis Giambelluca, Marcelo A. Cappelletti, Jorge Osio, Luis A. Giambelluca

**Abstract:** In this paper, two novel automatic and real-time systems for the detection and classification of two genera of scorpions found in La Plata city (Argentina) were developed using computer vision and deep learning techniques. The object detection technique was implemented with two different methods, YOLO (You Only Look Once) and MobileNet, based on the shape features of the scorpions. High accuracy values of 88% and 91%, and high recall values of 90% and 97%, have been achieved for both models, respectively, which guarantees that they can successfully detect scorpions. In addition, the MobileNet method has been shown to have excellent performance to detect scorpions within an uncontrolled environment and to perform multiple detections. The MobileNet model was also used for image classification in order to successfully distinguish between dangerous scorpion (Tityus) and non-dangerous scorpion (Bothriurus) with the purpose of providing a health security tool. Applications for smartphones were developed, with the advantage of the portability of the systems, which can be used as a help tool for emergency services, or for biological research purposes. The developed systems can be easily scalable to other genera and species of scorpions to extend the region where these applications can be used.

DiaKG: an Annotated Diabetes Dataset for Medical Knowledge Graph Construction arxiv:2105.15033 📈 2

Dejie Chang, Mosha Chen, Chaozhen Liu, Liping Liu, Dongdong Li, Wei Li, Fei Kong, Bangchang Liu, Xiaobin Luo, Ji Qi, Qiao Jin, Bin Xu

**Abstract:** Knowledge Graph has been proven effective in modeling structured information and conceptual knowledge, especially in the medical domain. However, the lack of high-quality annotated corpora remains a crucial problem for advancing the research and applications on this task. In order to accelerate the research for domain-specific knowledge graphs in the medical domain, we introduce DiaKG, a high-quality Chinese dataset for Diabetes knowledge graph, which contains 22,050 entities and 6,890 relations in total. We implement recent typical methods for Named Entity Recognition and Relation Extraction as a benchmark to evaluate the proposed dataset thoroughly. Empirical results show that the DiaKG is challenging for most existing methods and further analysis is conducted to discuss future research direction for improvements. We hope the release of this dataset can assist the construction of diabetes knowledge graphs and facilitate AI-based applications.

SNIPS: Solving Noisy Inverse Problems Stochastically arxiv:2105.14951 📈 2

Bahjat Kawar, Gregory Vaksman, Michael Elad

**Abstract:** In this work we introduce a novel stochastic algorithm dubbed SNIPS, which draws samples from the posterior distribution of any linear inverse problem, where the observation is assumed to be contaminated by additive white Gaussian noise. Our solution incorporates ideas from Langevin dynamics and Newton's method, and exploits a pre-trained minimum mean squared error (MMSE) Gaussian denoiser. The proposed approach relies on an intricate derivation of the posterior score function that includes a singular value decomposition (SVD) of the degradation operator, in order to obtain a tractable iterative algorithm for the desired sampling. Due to its stochasticity, the algorithm can produce multiple high perceptual quality samples for the same noisy observation. We demonstrate the abilities of the proposed paradigm for image deblurring, super-resolution, and compressive sensing. We show that the samples produced are sharp, detailed and consistent with the given measurements, and their diversity exposes the inherent uncertainty in the inverse problem being solved.

A Multilingual Modeling Method for Span-Extraction Reading Comprehension arxiv:2105.14880 📈 2

Gaochen Wu, Bin Xu, Dejie Chang, Bangchang Liu

**Abstract:** Span-extraction reading comprehension models have made tremendous advances enabled by the availability of large-scale, high-quality training datasets. Despite such rapid progress and widespread application, extractive reading comprehension datasets in languages other than English remain scarce, and creating such a sufficient amount of training data for each language is costly and even impossible. An alternative to creating large-scale high-quality monolingual span-extraction training datasets is to develop multilingual modeling approaches and systems which can transfer to the target language without requiring training data in that language. In this paper, in order to solve the scarce availability of extractive reading comprehension training data in the target language, we propose a multilingual extractive reading comprehension approach called XLRC by simultaneously modeling the existing extractive reading comprehension training data in a multilingual environment using self-adaptive attention and multilingual attention. Specifically, we firstly construct multilingual parallel corpora by translating the existing extractive reading comprehension datasets (i.e., CMRC 2018) from the target language (i.e., Chinese) into different language families (i.e., English). Secondly, to enhance the final target representation, we adopt self-adaptive attention (SAA) to combine self-attention and inter-attention to extract the semantic relations from each pair of the target and source languages. Furthermore, we propose multilingual attention (MLA) to learn the rich knowledge from various language families. Experimental results show that our model outperforms the state-of-the-art baseline (i.e., RoBERTa_Large) on the CMRC 2018 task, which demonstrate the effectiveness of our proposed multi-lingual modeling approach and show the potentials in multilingual NLP tasks.

Verdi: Quality Estimation and Error Detection for Bilingual Corpora arxiv:2105.14878 📈 2

Mingjun Zhao, Haijiang Wu, Di Niu, Zixuan Wang, Xiaoli Wang

**Abstract:** Translation Quality Estimation is critical to reducing post-editing efforts in machine translation and to cross-lingual corpus cleaning. As a research problem, quality estimation (QE) aims to directly estimate the quality of translation in a given pair of source and target sentences, and highlight the words that need corrections, without referencing to golden translations. In this paper, we propose Verdi, a novel framework for word-level and sentence-level post-editing effort estimation for bilingual corpora. Verdi adopts two word predictors to enable diverse features to be extracted from a pair of sentences for subsequent quality estimation, including a transformer-based neural machine translation (NMT) model and a pre-trained cross-lingual language model (XLM). We exploit the symmetric nature of bilingual corpora and apply model-level dual learning in the NMT predictor, which handles a primal task and a dual task simultaneously with weight sharing, leading to stronger context prediction ability than single-direction NMT models. By taking advantage of the dual learning scheme, we further design a novel feature to directly encode the translated target information without relying on the source context. Extensive experiments conducted on WMT20 QE tasks demonstrate that our method beats the winner of the competition and outperforms other baseline methods by a great margin. We further use the sentence-level scores provided by Verdi to clean a parallel corpus and observe benefits on both model performance and training efficiency.

SN-Graph: a Minimalist 3D Object Representation for Classification arxiv:2105.14784 📈 2

Siyu Zhang, Hui Cao, Yuqi Liu, Shen Cai, Yanting Zhang, Yuanzhan Li, Xiaoyu Chi

**Abstract:** Using deep learning techniques to process 3D objects has achieved many successes. However, few methods focus on the representation of 3D objects, which could be more effective for specific tasks than traditional representations, such as point clouds, voxels, and multi-view images. In this paper, we propose a Sphere Node Graph (SN-Graph) to represent 3D objects. Specifically, we extract a certain number of internal spheres (as nodes) from the signed distance field (SDF), and then establish connections (as edges) among the sphere nodes to construct a graph, which is seamlessly suitable for 3D analysis using graph neural network (GNN). Experiments conducted on the ModelNet40 dataset show that when there are fewer nodes in the graph or the tested objects are rotated arbitrarily, the classification accuracy of SN-Graph is significantly higher than the state-of-the-art methods.

Energy-Efficient and Federated Meta-Learning via Projected Stochastic Gradient Ascent arxiv:2105.14772 📈 2

Anis Elgabli, Chaouki Ben Issaid, Amrit S. Bedi, Mehdi Bennis, Vaneet Aggarwal

**Abstract:** In this paper, we propose an energy-efficient federated meta-learning framework. The objective is to enable learning a meta-model that can be fine-tuned to a new task with a few number of samples in a distributed setting and at low computation and communication energy consumption. We assume that each task is owned by a separate agent, so a limited number of tasks is used to train a meta-model. Assuming each task was trained offline on the agent's local data, we propose a lightweight algorithm that starts from the local models of all agents, and in a backward manner using projected stochastic gradient ascent (P-SGA) finds a meta-model. The proposed method avoids complex computations such as computing hessian, double looping, and matrix inversion, while achieving high performance at significantly less energy consumption compared to the state-of-the-art methods such as MAML and iMAML on conducted experiments for sinusoid regression and image classification tasks.

Bio-inspired visual attention for silicon retinas based on spiking neural networks applied to pattern classification arxiv:2105.14753 📈 2

Amélie Gruel, Jean Martinet

**Abstract:** Visual attention can be defined as the behavioral and cognitive process of selectively focusing on a discrete aspect of sensory cues while disregarding other perceivable information. This biological mechanism, more specifically saliency detection, has long been used in multimedia indexing to drive the analysis only on relevant parts of images or videos for further processing. The recent advent of silicon retinas (or event cameras -- sensors that measure pixel-wise changes in brightness and output asynchronous events accordingly) raises the question of how to adapt attention and saliency to the unconventional type of such sensors' output. Silicon retina aims to reproduce the biological retina behaviour. In that respect, they produce punctual events in time that can be construed as neural spikes and interpreted as such by a neural network. In particular, Spiking Neural Networks (SNNs) represent an asynchronous type of artificial neural network closer to biology than traditional artificial networks, mainly because they seek to mimic the dynamics of neural membrane and action potentials over time. SNNs receive and process information in the form of spike trains. Therefore, they make for a suitable candidate for the efficient processing and classification of incoming event patterns measured by silicon retinas. In this paper, we review the biological background behind the attentional mechanism, and introduce a case study of event videos classification with SNNs, using a biology-grounded low-level computational attention mechanism, with interesting preliminary results.

Hierarchical Deep Network with Uncertainty-aware Semi-supervised Learning for Vessel Segmentation arxiv:2105.14732 📈 2

Chenxin Li, Wenao Ma, Liyan Sun, Xinghao Ding, Yue Huang, Guisheng Wang, Yizhou Yu

**Abstract:** The analysis of organ vessels is essential for computer-aided diagnosis and surgical planning. But it is not a easy task since the fine-detailed connected regions of organ vessel bring a lot of ambiguity in vessel segmentation and sub-type recognition, especially for the low-contrast capillary regions. Furthermore, recent two-staged approaches would accumulate and even amplify these inaccuracies from the first-stage whole vessel segmentation into the second-stage sub-type vessel pixel-wise classification. Moreover, the scarcity of manual annotation in organ vessels poses another challenge. In this paper, to address the above issues, we propose a hierarchical deep network where an attention mechanism localizes the low-contrast capillary regions guided by the whole vessels, and enhance the spatial activation in those areas for the sub-type vessels. In addition, we propose an uncertainty-aware semi-supervised training framework to alleviate the annotation-hungry limitation of deep models. The proposed method achieves the state-of-the-art performance in the benchmarks of both retinal artery/vein segmentation in fundus images and liver portal/hepatic vessel segmentation in CT images.

Transfer Learning as an Enhancement for Reconfiguration Management of Cyber-Physical Production Systems arxiv:2105.14730 📈 2

Benjamin Maschler, Timo Müller, Andreas Löcklin, Michael Weyrich

**Abstract:** Reconfiguration demand is increasing due to frequent requirement changes for manufacturing systems. Recent approaches aim at investigating feasible configuration alternatives from which they select the optimal one. This relies on processes whose behavior is not reliant on e.g. the production sequence. However, when machine learning is used, components' behavior depends on the process' specifics, requiring additional concepts to successfully conduct reconfiguration management. Therefore, we propose the enhancement of the comprehensive reconfiguration management with transfer learning. This provides the ability to assess the machine learning dependent behavior of the different CPPS configurations with reduced effort and further assists the recommissioning of the chosen one. A real cyber-physical production system from the discrete manufacturing domain is utilized to demonstrate the aforementioned proposal.

The Role of Entropy in Guiding a Connection Prover arxiv:2105.14706 📈 2

Zsolt Zombori, Josef Urban, Miroslav Olšák

**Abstract:** In this work we study how to learn good algorithms for selecting reasoning steps in theorem proving. We explore this in the connection tableau calculus implemented by leanCoP where the partial tableau provides a clean and compact notion of a state to which a limited number of inferences can be applied. We start by incorporating a state-of-the-art learning algorithm -- a graph neural network (GNN) -- into the plCoP theorem prover. Then we use it to observe the system's behaviour in a reinforcement learning setting, i.e., when learning inference guidance from successful Monte-Carlo tree searches on many problems. Despite its better pattern matching capability, the GNN initially performs worse than a simpler previously used learning algorithm. We observe that the simpler algorithm is less confident, i.e., its recommendations have higher entropy. This leads us to explore how the entropy of the inference selection implemented via the neural network influences the proof search. This is related to research in human decision-making under uncertainty, and in particular the probability matching theory. Our main result shows that a proper entropy regularisation, i.e., training the GNN not to be overconfident, greatly improves plCoP's performance on a large mathematical corpus.

Control Occupation Kernel Regression for Nonlinear Control-Affine Systems arxiv:2106.00103 📈 1

Moad Abudia, Tejasvi Channagiri, Joel A. Rosenfeld, Rushikesh Kamalapurkar

**Abstract:** This manuscript presents an algorithm for obtaining an approximation of nonlinear high order control affine dynamical systems, that leverages the controlled trajectories as the central unit of information. As the fundamental basis elements leveraged in approximation, higher order control occupation kernels represent iterated integration after multiplication by a given controller in a vector valued reproducing kernel Hilbert space. In a regularized regression setting, the unique optimizer for a particular optimization problem is expressed as a linear combination of these occupation kernels, which converts an infinite dimensional optimization problem to a finite dimensional optimization problem through the representer theorem. Interestingly, the vector valued structure of the Hilbert space allows for simultaneous approximation of the drift and control effectiveness components of the control affine system. Several experiments are performed to demonstrate the effectiveness of the approach.

HEMET: A Homomorphic-Encryption-Friendly Privacy-Preserving Mobile Neural Network Architecture arxiv:2106.00038 📈 1

Qian Lou, Lei Jiang

**Abstract:** Recently Homomorphic Encryption (HE) is used to implement Privacy-Preserving Neural Networks (PPNNs) that perform inferences directly on encrypted data without decryption. Prior PPNNs adopt mobile network architectures such as SqueezeNet for smaller computing overhead, but we find naïvely using mobile network architectures for a PPNN does not necessarily achieve shorter inference latency. Despite having less parameters, a mobile network architecture typically introduces more layers and increases the HE multiplicative depth of a PPNN, thereby prolonging its inference latency. In this paper, we propose a \textbf{HE}-friendly privacy-preserving \textbf{M}obile neural n\textbf{ET}work architecture, \textbf{HEMET}. Experimental results show that, compared to state-of-the-art (SOTA) PPNNs, HEMET reduces the inference latency by $59.3\%\sim 61.2\%$, and improves the inference accuracy by $0.4 \% \sim 0.5\%$.

A Novel Automatic Modulation Classification Scheme Based on Multi-Scale Networks arxiv:2105.15037 📈 1

Hao Zhang, Fuhui Zhou, Qihui Wu, Wei Wu, Rose Qingyang Hu

**Abstract:** Automatic modulation classification enables intelligent communications and it is of crucial importance in today's and future wireless communication networks. Although many automatic modulation classification schemes have been proposed, they cannot tackle the intra-class diversity problem caused by the dynamic changes of the wireless communication environment. In order to overcome this problem, inspired by face recognition, a novel automatic modulation classification scheme is proposed by using the multi-scale network in this paper. Moreover, a novel loss function that combines the center loss and the cross entropy loss is exploited to learn both discriminative and separable features in order to further improve the classification performance. Extensive simulation results demonstrate that our proposed automatic modulation classification scheme can achieve better performance than the benchmark schemes in terms of the classification accuracy. The influence of the network parameters and the loss function with the two-stage training strategy on the classification accuracy of our proposed scheme are investigated.

Boosting the Performance of Video Compression Artifact Reduction with Reference Frame Proposals and Frequency Domain Information arxiv:2105.14962 📈 1

Yi Xu, Minyi Zhao, Jing Liu, Xinjian Zhang, Longwen Gao, Shuigeng Zhou, Huyang Sun

**Abstract:** Many deep learning based video compression artifact removal algorithms have been proposed to recover high-quality videos from low-quality compressed videos. Recently, methods were proposed to mine spatiotemporal information via utilizing multiple neighboring frames as reference frames. However, these post-processing methods take advantage of adjacent frames directly, but neglect the information of the video itself, which can be exploited. In this paper, we propose an effective reference frame proposal strategy to boost the performance of the existing multi-frame approaches. Besides, we introduce a loss based on fast Fourier transformation~(FFT) to further improve the effectiveness of restoration. Experimental results show that our method achieves better fidelity and perceptual performance on MFQE 2.0 dataset than the state-of-the-art methods. And our method won Track 1 and Track 2, and was ranked the 2nd in Track 3 of NTIRE 2021 Quality enhancement of heavily compressed videos Challenge.

CTSpine1K: A Large-Scale Dataset for Spinal Vertebrae Segmentation in Computed Tomography arxiv:2105.14711 📈 1

Yang Deng, Ce Wang, Yuan Hui, Qian Li, Jun Li, Shiwei Luo, Mengke Sun, Quan Quan, Shuxin Yang, You Hao, Pengbo Liu, Honghu Xiao, Chunpeng Zhao, Xinbao Wu, S. Kevin Zhou

**Abstract:** Spine-related diseases have high morbidity and cause a huge burden of social cost. Spine imaging is an essential tool for noninvasively visualizing and assessing spinal pathology. Segmenting vertebrae in computed tomography (CT) images is the basis of quantitative medical image analysis for clinical diagnosis and surgery planning of spine diseases. Current publicly available annotated datasets on spinal vertebrae are small in size. Due to the lack of a large-scale annotated spine image dataset, the mainstream deep learning-based segmentation methods, which are data-driven, are heavily restricted. In this paper, we introduce a large-scale spine CT dataset, called CTSpine1K, curated from multiple sources for vertebra segmentation, which contains 1,005 CT volumes with over 11,100 labeled vertebrae belonging to different spinal conditions. Based on this dataset, we conduct several spinal vertebrae segmentation experiments to set the first benchmark. We believe that this large-scale dataset will facilitate further research in many spine-related image analysis tasks, including but not limited to vertebrae segmentation, labeling, 3D spine reconstruction from biplanar radiographs, image super-resolution, and enhancement.

Know Your Model (KYM): Increasing Trust in AI and Machine Learning arxiv:2106.11036 📈 0

Mary Roszel, Robert Norvill, Jean Hilger, Radu State

**Abstract:** The widespread utilization of AI systems has drawn attention to the potential impacts of such systems on society. Of particular concern are the consequences that prediction errors may have on real-world scenarios, and the trust humanity places in AI systems. It is necessary to understand how we can evaluate trustworthiness in AI and how individuals and entities alike can develop trustworthy AI systems. In this paper, we analyze each element of trustworthiness and provide a set of 20 guidelines that can be leveraged to ensure optimal AI functionality while taking into account the greater ethical, technical, and practical impacts to humanity. Moreover, the guidelines help ensure that trustworthiness is provable and can be demonstrated, they are implementation agnostic, and they can be applied to any AI system in any sector.

Locally Valid and Discriminative Prediction Intervals for Deep Learning Models arxiv:2106.00225 📈 0

Zhen Lin, Shubhendu Trivedi, Jimeng Sun

**Abstract:** Crucial for building trust in deep learning models for critical real-world applications is efficient and theoretically sound uncertainty quantification, a task that continues to be challenging. Useful uncertainty information is expected to have two key properties: It should be valid (guaranteeing coverage) and discriminative (more uncertain when the expected risk is high). Moreover, when combined with deep learning (DL) methods, it should be scalable and affect the DL model performance minimally. Most existing Bayesian methods lack frequentist coverage guarantees and usually affect model performance. The few available frequentist methods are rarely discriminative and/or violate coverage guarantees due to unrealistic assumptions. Moreover, many methods are expensive or require substantial modifications to the base neural network. Building upon recent advances in conformal prediction [13, 33] and leveraging the classical idea of kernel regression, we propose Locally Valid and Discriminative prediction intervals (LVD), a simple, efficient, and lightweight method to construct discriminative prediction intervals (PIs) for almost any DL model. With no assumptions on the data distribution, such PIs also offer finite-sample local coverage guarantees (contrasted to the simpler marginal coverage). We empirically verify, using diverse datasets, that besides being the only locally valid method for DL, LVD also exceeds or matches the performance (including coverage rate and prediction accuracy) of existing uncertainty quantification methods, while offering additional benefits in scalability and flexibility.

Iterative Hierarchical Attention for Answering Complex Questions over Long Documents arxiv:2106.00200 📈 0

Haitian Sun, William W. Cohen, Ruslan Salakhutdinov

**Abstract:** We propose a new model, DocHopper, that iteratively attends to different parts of long, hierarchically structured documents to answer complex questions. Similar to multi-hop question-answering (QA) systems, at each step, DocHopper uses a query $q$ to attend to information from a document, combines this ``retrieved'' information with $q$ to produce the next query. However, in contrast to most previous multi-hop QA systems, DocHopper is able to ``retrieve'' either short passages or long sections of the document, thus emulating a multi-step process of ``navigating'' through a long document to answer a question. To enable this novel behavior, DocHopper does not combine document information with $q$ by concatenating text to the text of $q$, but by combining a compact neural representation of $q$ with a compact neural representation of a hierarchical part of the document, which can potentially be quite large. We experiment with DocHopper on four different QA tasks that require reading long and complex documents to answer multi-hop questions, and show that DocHopper achieves state-of-the-art results on three of the datasets. Additionally, DocHopper is efficient at inference time, being 3--10 times faster than the baselines.

Effect of Pre-Training Scale on Intra- and Inter-Domain Full and Few-Shot Transfer Learning for Natural and Medical X-Ray Chest Images arxiv:2106.00116 📈 0

Mehdi Cherti, Jenia Jitsev

**Abstract:** Transfer learning aims to exploit pre-trained models for more efficient follow-up training on wide range of downstream tasks and datasets, enabling successful training also on small data. Recently, strong improvement was shown for transfer learning and model generalization when increasing model, data and compute budget scale in the pre-training. To compare effect of scale both in intra- and inter-domain full and few-shot transfer, in this study we combine for the first time large openly available medical X-Ray chest imaging datasets to reach a dataset scale comparable to ImageNet-1k. We then conduct pre-training and transfer to different natural or medical targets while varying network size and source data scale and domain, being either large natural (ImageNet-1k/21k) or large medical chest X-Ray datasets. We observe strong improvement due to larger pre-training scale for intra-domain natural-natural and medical-medical transfer. For inter-domain natural-medical transfer, we find improvements due to larger pre-training scale on larger X-Ray targets in full shot regime, while for smaller targets and for few-shot regime the improvement is not visible. Remarkably, large networks pre-trained on very large natural ImageNet-21k are as good or better than networks pre-trained on largest available medical X-Ray data when performing transfer to large X-Ray targets. We conclude that high quality models for inter-domain transfer can be also obtained by substantially increasing scale of model and generic natural source data, removing necessity for large domain-specific medical source data in the pre-training. Code is available at: \url{https://github.com/SLAMPAI/large-scale-pretraining-transfer}}

A Methodology for Exploring Deep Convolutional Features in Relation to Hand-Crafted Features with an Application to Music Audio Modeling arxiv:2106.00110 📈 0

Anna K. Yanchenko, Mohammadreza Soltani, Robert J. Ravier, Sayan Mukherjee, Vahid Tarokh

**Abstract:** Understanding the features learned by deep models is important from a model trust perspective, especially as deep systems are deployed in the real world. Most recent approaches for deep feature understanding or model explanation focus on highlighting input data features that are relevant for classification decisions. In this work, we instead take the perspective of relating deep features to well-studied, hand-crafted features that are meaningful for the application of interest. We propose a methodology and set of systematic experiments for exploring deep features in this setting, where input feature importance approaches for deep feature understanding do not apply. Our experiments focus on understanding which hand-crafted and deep features are useful for the classification task of interest, how robust these features are for related tasks and how similar the deep features are to the meaningful hand-crafted features. Our proposed method is general to many application areas and we demonstrate its utility on orchestral music audio data.

SHAQ: Incorporating Shapley Value Theory into Multi-Agent Q-Learning arxiv:2105.15013 📈 0

Jianhong Wang, Jinxin Wang, Yuan Zhang, Yunjie Gu, Tae-Kyun Kim

**Abstract:** Value factorisation proves to be a useful technique in multi-agent reinforcement learning (MARL), but the underlying mechanism is not yet fully understood. This paper explores a theoretical framework for value factorisation with interpretability. We generalise Shapley value in coalitional game theory to Markov convex game (MCG) and use it as a value factorisation method for MARL. We show that the generalised Shapley value possesses several features such as (1) efficiency: the sum of optimal generalised Shapley values is equal to the optimal global value, (2) fairness in factorisation of the global value, and (3) sensitiveness to dummy agents. Moreover, we show that MCG with the grand coalition and the generalised Shapley value is within $ε$-core, which means no agents would deviate from the grand coalition. Since MCG with the grand coalition is equivalent to global reward game, it is the first time that Shapley value is rigorously proved to be rationally applied as a value factorisation method for global reward game. Moreover, extending from the Bellman operator we propose Shapley-Q operator that is proved to converge to the optimal generalised Shapley value. With stochastic approximation, a new MARL algorithm called Shapley Q-learning (SHAQ) is yielded. We show the performance of SHAQ on Predator-Prey for modelling relative overgeneralisation and StarCraft Multi-Agent Challenge (SMAC). In experiments, we also demonstrate the interpretability of SHAQ that is lacking in the state-of-the-art baselines.

Locally Private $k$-Means Clustering with Constant Multiplicative Approximation and Near-Optimal Additive Error arxiv:2105.15007 📈 0

Anamay Chaturvedi, Matthew Jones, Huy L. Nguyen

**Abstract:** Given a data set of size $n$ in $d'$-dimensional Euclidean space, the $k$-means problem asks for a set of $k$ points (called centers) so that the sum of the $\ell_2^2$-distances between points of a given data set of size $n$ and the set of $k$ centers is minimized. Recent work on this problem in the locally private setting achieves constant multiplicative approximation with additive error $\tilde{O} (n^{1/2 + a} \cdot k \cdot \max \{\sqrt{d}, \sqrt{k} \})$ and proves a lower bound of $Ω(\sqrt{n})$ on the additive error for any solution with a constant number of rounds. In this work we bridge the gap between the exponents of $n$ in the upper and lower bounds on the additive error with two new algorithms. Given any $α>0$, our first algorithm achieves a multiplicative approximation guarantee which is at most a $(1+α)$ factor greater than that of any non-private $k$-means clustering algorithm with $k^{\tilde{O}(1/α^2)} \sqrt{d' n} \mbox{poly}\log n$ additive error. Given any $c>\sqrt{2}$, our second algorithm achieves $O(k^{1 + \tilde{O}(1/(2c^2-1))} \sqrt{d' n} \mbox{poly} \log n)$ additive error with constant multiplicative approximation. Both algorithms go beyond the $Ω(n^{1/2 + a})$ factor that occurs in the additive error for arbitrarily small parameters $a$ in previous work, and the second algorithm in particular shows for the first time that it is possible to solve the locally private $k$-means problem in a constant number of rounds with constant factor multiplicative approximation and polynomial dependence on $k$ in the additive error arbitrarily close to linear.

Two Coupled Rejection Metrics Can Tell Adversarial Examples Apart arxiv:2105.14785 📈 0

Tianyu Pang, Huishuai Zhang, Di He, Yinpeng Dong, Hang Su, Wei Chen, Jun Zhu, Tie-Yan Liu

**Abstract:** Correctly classifying adversarial examples is an essential but challenging requirement for safely deploying machine learning models. As reported in RobustBench, even the state-of-the-art adversarially trained models struggle to exceed 67% robust test accuracy on CIFAR-10, which is far from practical. A complementary way towards robustness is to introduce a rejection option, allowing the model to not return predictions on uncertain inputs, where confidence is a commonly used certainty proxy. Along with this routine, we find that confidence and a rectified confidence (R-Con) can form two coupled rejection metrics, which could provably distinguish wrongly classified inputs from correctly classified ones. This intriguing property sheds light on using coupling strategies to better detect and reject adversarial examples. We evaluate our rectified rejection (RR) module on CIFAR-10, CIFAR-10-C, and CIFAR-100 under several attacks including adaptive ones, and demonstrate that the RR module is compatible with different adversarial training frameworks on improving robustness, with little extra computation. The code is available at https://github.com/P2333/Rectified-Rejection.

Active Hierarchical Exploration with Stable Subgoal Representation Learning arxiv:2105.14750 📈 0

Siyuan Li, Jin Zhang, Jianhao Wang, Yang Yu, Chongjie Zhang

**Abstract:** Goal-conditioned hierarchical reinforcement learning (GCHRL) provides a promising approach to solving long-horizon tasks. Recently, its success has been extended to more general settings by concurrently learning hierarchical policies and subgoal representations. Although GCHRL possesses superior exploration ability by decomposing tasks via subgoals, existing GCHRL methods struggle in temporally extended tasks with sparse external rewards, since the high-level policy learning relies on external rewards. As the high-level policy selects subgoals in an online learned representation space, the dynamic change of the subgoal space severely hinders effective high-level exploration. In this paper, we propose a novel regularization that contributes to both stable and efficient subgoal representation learning. Building upon the stable representation, we design measures of novelty and potential for subgoals, and develop an active hierarchical exploration strategy that seeks out new promising subgoals and states without intrinsic rewards. Experimental results show that our approach significantly outperforms state-of-the-art baselines in continuous control tasks with sparse rewards.

1xN Pattern for Pruning Convolutional Neural Networks arxiv:2105.14713 📈 0

Mingbao Lin, Yuxin Zhang, Yuchao Li, Bohong Chen, Fei Chao, Mengdi Wang, Shen Li, Yonghong Tian, Rongrong Ji

**Abstract:** Though network pruning receives popularity in reducing the complexity of convolutional neural networks (CNNs), it remains an open issue to concurrently maintain model accuracy as well as achieve significant speedups on general CPUs. In this paper, we propose a novel 1xN pruning pattern to break this limitation. In particular, consecutive N output kernels with the same input channel index are grouped into one block, which serves as a basic pruning granularity of our pruning pattern. Our 1xN pattern prunes these blocks considered unimportant. We also provide a workflow of filter rearrangement that first rearranges the weight matrix in the output channel dimension to derive more influential blocks for accuracy improvements and then applies similar rearrangement to the next-layer weights in the input channel dimension to ensure correct convolutional operations. Moreover, the output computation after our 1xN pruning can be realized via a parallelized block-wise vectorized operation, leading to significant speedups on general CPUs. The efficacy of our pruning pattern is proved with experiments on ILSVRC-2012. For example, Given the pruning rate of 50% and N=4, our pattern obtains about 3.0% improvements over filter pruning in the top-1 accuracy of MobileNet-V2. Meanwhile, it obtains 56.04ms inference savings on Cortex-A7 CPU over weight pruning. Our project is made available at https://github.com/lmbxmu/1xN.

Robustifying $\ell_\infty$ Adversarial Training to the Union of Perturbation Models arxiv:2105.14710 📈 0

Ameya D. Patil, Michael Tuttle, Alexander G. Schwing, Naresh R. Shanbhag

**Abstract:** Classical adversarial training (AT) frameworks are designed to achieve high adversarial accuracy against a single attack type, typically $\ell_\infty$ norm-bounded perturbations. Recent extensions in AT have focused on defending against the union of multiple perturbations but this benefit is obtained at the expense of a significant (up to $10\times$) increase in training complexity over single-attack $\ell_\infty$ AT. In this work, we expand the capabilities of widely popular single-attack $\ell_\infty$ AT frameworks to provide robustness to the union of ($\ell_\infty, \ell_2, \ell_1$) perturbations while preserving their training efficiency. Our technique, referred to as Shaped Noise Augmented Processing (SNAP), exploits a well-established byproduct of single-attack AT frameworks -- the reduction in the curvature of the decision boundary of networks. SNAP prepends a given deep net with a shaped noise augmentation layer whose distribution is learned along with network parameters using any standard single-attack AT. As a result, SNAP enhances adversarial accuracy of ResNet-18 on CIFAR-10 against the union of ($\ell_\infty, \ell_2, \ell_1$) perturbations by 14%-to-20% for four state-of-the-art (SOTA) single-attack $\ell_\infty$ AT frameworks, and, for the first time, establishes a benchmark for ResNet-50 and ResNet-101 on ImageNet.

Next Page