Summary for 2021-10-16, created on 2021-12-15

Sharpness-Aware Minimization Improves Language Model Generalization arxiv:2110.08529 📈 10

Dara Bahri, Hossein Mobahi, Yi Tay

**Abstract:** The allure of superhuman-level capabilities has led to considerable interest in language models like GPT-3 and T5, wherein the research has, by and large, revolved around new model architectures, training tasks, and loss objectives, along with substantial engineering efforts to scale up model capacity and dataset size. Comparatively little work has been done to improve the generalization of these models through better optimization. In this work, we show that Sharpness-Aware Minimization (SAM), a recently proposed optimization procedure that encourages convergence to flatter minima, can substantially improve the generalization of language models without much computational overhead. We show that SAM is able to boost performance on SuperGLUE, GLUE, Web Questions, Natural Questions, Trivia QA, and TyDiQA, with particularly large gains when training data for these tasks is limited.

Lifelong Topological Visual Navigation arxiv:2110.08488 📈 9

Rey Reza Wiyatno, Anqi Xu, Liam Paull

**Abstract:** The ability for a robot to navigate with only the use of vision is appealing due to its simplicity. Traditional vision-based navigation approaches required a prior map-building step that was arduous and prone to failure, or could only exactly follow previously executed trajectories. Newer learning-based visual navigation techniques reduce the reliance on a map and instead directly learn policies from image inputs for navigation. There are currently two prevalent paradigms: end-to-end approaches forego the explicit map representation entirely, and topological approaches which still preserve some loose connectivity of the space. However, while end-to-end methods tend to struggle in long-distance navigation tasks, topological map-based solutions are prone to failure due to spurious edges in the graph. In this work, we propose a learning-based topological visual navigation method with graph update strategies that improve lifelong navigation performance over time. We take inspiration from sampling-based planning algorithms to build image-based topological graphs, resulting in sparser graphs yet with higher navigation performance compared to baseline methods. Also, unlike controllers that learn from fixed training environments, we show that our model can be finetuned using a relatively small dataset from the real-world environment where the robot is deployed. We further assess performance of our system in real-world deployments.

Towards Robust Waveform-Based Acoustic Models arxiv:2110.08634 📈 8

Dino Oglic, Zoran Cvetkovic, Peter Sollich, Steve Renals, Bin Yu

**Abstract:** We propose an approach for learning robust acoustic models in adverse environments, characterized by a significant mismatch between training and test conditions. This problem is of paramount importance for the deployment of speech recognition systems that need to perform well in unseen environments. Our approach is an instance of vicinal risk minimization, which aims to improve risk estimates during training by replacing the delta functions that define the empirical density over the input space with an approximation of the marginal population density in the vicinity of the training samples. More specifically, we assume that local neighborhoods centered at training samples can be approximated using a mixture of Gaussians, and demonstrate theoretically that this can incorporate robust inductive bias into the learning process. We characterize the individual mixture components implicitly via data augmentation schemes, designed to address common sources of spurious correlations in acoustic models. To avoid potential confounding effects on robustness due to information loss, which has been associated with standard feature extraction techniques (e.g., FBANK and MFCC features), we focus our evaluation on the waveform-based setting. Our empirical results show that the proposed approach can generalize to unseen noise conditions, with 150% relative improvement in out-of-distribution generalization compared to training using the standard risk minimization principle. Moreover, the results demonstrate competitive performance relative to models learned using a training sample designed to match the acoustic conditions characteristic of test utterances (i.e., optimal vicinal densities).

Convolutional Deep Denoising Autoencoders for Radio Astronomical Images arxiv:2110.08618 📈 8

Claudio Gheller, Franco Vazza

**Abstract:** We apply a Machine Learning technique known as Convolutional Denoising Autoencoder to denoise synthetic images of state-of-the-art radio telescopes, with the goal of detecting the faint, diffused radio sources predicted to characterise the radio cosmic web. In our application, denoising is intended to address both the reduction of random instrumental noise and the minimisation of additional spurious artefacts like the sidelobes, resulting from the aperture synthesis technique. The effectiveness and the accuracy of the method are analysed for different kinds of corrupted input images, together with its computational performance. Specific attention has been devoted to create realistic mock observations for the training, exploiting the outcomes of cosmological numerical simulations, to generate images corresponding to LOFAR HBA 8 hours observations at 150 MHz. Our autoencoder can effectively denoise complex images identifying and extracting faint objects at the limits of the instrumental sensitivity. The method can efficiently scale on large datasets, exploiting high performance computing solutions, in a fully automated way (i.e. no human supervision is required after training). It can accurately perform image segmentation, identifying low brightness outskirts of diffused sources, proving to be a viable solution for detecting challenging extended objects hidden in noisy radio observations.

MAAD: A Model and Dataset for "Attended Awareness" in Driving arxiv:2110.08610 📈 7

Deepak Gopinath, Guy Rosman, Simon Stent, Katsuya Terahata, Luke Fletcher, Brenna Argall, John Leonard

**Abstract:** We propose a computational model to estimate a person's attended awareness of their environment. We define attended awareness to be those parts of a potentially dynamic scene which a person has attended to in recent history and which they are still likely to be physically aware of. Our model takes as input scene information in the form of a video and noisy gaze estimates, and outputs visual saliency, a refined gaze estimate, and an estimate of the person's attended awareness. In order to test our model, we capture a new dataset with a high-precision gaze tracker including 24.5 hours of gaze sequences from 23 subjects attending to videos of driving scenes. The dataset also contains third-party annotations of the subjects' attended awareness based on observations of their scan path. Our results show that our model is able to reasonably estimate attended awareness in a controlled setting, and in the future could potentially be extended to real egocentric driving data to help enable more effective ahead-of-time warnings in safety systems and thereby augment driver performance. We also demonstrate our model's effectiveness on the tasks of saliency, gaze calibration, and denoising, using both our dataset and an existing saliency dataset. We make our model and dataset available at https://github.com/ToyotaResearchInstitute/att-aware/.

A Unified Speaker Adaptation Approach for ASR arxiv:2110.08545 📈 7

Yingzhu Zhao, Chongjia Ni, Cheung-Chi Leung, Shafiq Joty, Eng Siong Chng, Bin Ma

**Abstract:** Transformer models have been used in automatic speech recognition (ASR) successfully and yields state-of-the-art results. However, its performance is still affected by speaker mismatch between training and test data. Further finetuning a trained model with target speaker data is the most natural approach for adaptation, but it takes a lot of compute and may cause catastrophic forgetting to the existing speakers. In this work, we propose a unified speaker adaptation approach consisting of feature adaptation and model adaptation. For feature adaptation, we employ a speaker-aware persistent memory model which generalizes better to unseen test speakers by making use of speaker i-vectors to form a persistent memory. For model adaptation, we use a novel gradual pruning method to adapt to target speakers without changing the model architecture, which to the best of our knowledge, has never been explored in ASR. Specifically, we gradually prune less contributing parameters on model encoder to a certain sparsity level, and use the pruned parameters for adaptation, while freezing the unpruned parameters to keep the original model performance. We conduct experiments on the Librispeech dataset. Our proposed approach brings relative 2.74-6.52% word error rate (WER) reduction on general speaker adaptation. On target speaker adaptation, our method outperforms the baseline with up to 20.58% relative WER reduction, and surpasses the finetuning method by up to relative 2.54%. Besides, with extremely low-resource adaptation data (e.g., 1 utterance), our method could improve the WER by relative 6.53% with only a few epochs of training.

Analyzing Dynamic Adversarial Training Data in the Limit arxiv:2110.08514 📈 7

Eric Wallace, Adina Williams, Robin Jia, Douwe Kiela

**Abstract:** To create models that are robust across a wide range of test inputs, training datasets should include diverse examples that span numerous phenomena. Dynamic adversarial data collection (DADC), where annotators craft examples that challenge continually improving models, holds promise as an approach for generating such diverse training sets. Prior work has shown that running DADC over 1-3 rounds can help models fix some error types, but it does not necessarily lead to better generalization beyond adversarial test data. We argue that running DADC over many rounds maximizes its training-time benefits, as the different rounds can together cover many of the task-relevant phenomena. We present the first study of longer-term DADC, where we collect 20 rounds of NLI examples for a small set of premise paragraphs, with both adversarial and non-adversarial approaches. Models trained on DADC examples make 26% fewer errors on our expert-curated test set compared to models trained on non-adversarial data. Our analysis shows that DADC yields examples that are more difficult, more lexically and syntactically diverse, and contain fewer annotation artifacts compared to non-adversarial examples.

Multimodal Dialogue Response Generation arxiv:2110.08515 📈 6

Qingfeng Sun, Yujing Wang, Can Xu, Kai Zheng, Yaming Yang, Huang Hu, Fei Xu, Jessica Zhang, Xiubo Geng, Daxin Jiang

**Abstract:** Responsing with image has been recognized as an important capability for an intelligent conversational agent. Yet existing works only focus on exploring the multimodal dialogue models which depend on retrieval-based methods, but neglecting generation methods. To fill in the gaps, we first present a multimodal dialogue generation model, which takes the dialogue history as input, then generates a textual sequence or an image as response. Learning such a model often requires multimodal dialogues containing both texts and images which are difficult to obtain. Motivated by the challenge in practice, we consider multimodal dialogue generation under a natural assumption that only limited training examples are available. In such a low-resource setting, we devise a novel conversational agent, Divter, in order to isolate parameters that depend on multimodal dialogues from the entire generation model. By this means, the major part of the model can be learned from a large number of text-only dialogues and text-image pairs respectively, then the whole parameters can be well fitted using the limited training examples. Extensive experiments demonstrate our method achieves state-of-the-art results in both automatic and human evaluation, and can generate informative text and high-resolution image responses.

Terminal Embeddings in Sublinear Time arxiv:2110.08691 📈 5

Yeshwanth Cherapanamjeri, Jelani Nelson

**Abstract:** Recently (Elkin, Filtser, Neiman 2017) introduced the concept of a {\it terminal embedding} from one metric space $(X,d_X)$ to another $(Y,d_Y)$ with a set of designated terminals $T\subset X$. Such an embedding $f$ is said to have distortion $ρ\ge 1$ if $ρ$ is the smallest value such that there exists a constant $C>0$ satisfying \begin{equation*} \forall x\in T\ \forall q\in X,\ C d_X(x, q) \le d_Y(f(x), f(q)) \le C ρd_X(x, q) . \end{equation*} In the case that $X,Y$ are both Euclidean metrics with $Y$ being $m$-dimensional, recently (Narayanan, Nelson 2019), following work of (Mahabadi, Makarychev, Makarychev, Razenshteyn 2018), showed that distortion $1+ε$ is achievable via such a terminal embedding with $m = O(ε^{-2}\log n)$ for $n := |T|$. This generalizes the Johnson-Lindenstrauss lemma, which only preserves distances within $T$ and not to $T$ from the rest of space. The downside is that evaluating the embedding on some $q\in \mathbb{R}^d$ required solving a semidefinite program with $Θ(n)$ constraints in $m$ variables and thus required some superlinear $\mathrm{poly}(n)$ runtime. Our main contribution in this work is to give a new data structure for computing terminal embeddings. We show how to pre-process $T$ to obtain an almost linear-space data structure that supports computing the terminal embedding image of any $q\in\mathbb{R}^d$ in sublinear time $n^{1-Θ(ε^2)+o(1)} + dn^{o(1)}$. To accomplish this, we leverage tools developed in the context of approximate nearest neighbor search.

Transformer with a Mixture of Gaussian Keys arxiv:2110.08678 📈 5

Tam Nguyen, Tan M. Nguyen, Dung Le, Khuong Nguyen, Anh Tran, Richard G. Baraniuk, Nhat Ho, Stanley J. Osher

**Abstract:** Multi-head attention is a driving force behind state-of-the-art transformers which achieve remarkable performance across a variety of natural language processing (NLP) and computer vision tasks. It has been observed that for many applications, those attention heads learn redundant embedding, and most of them can be removed without degrading the performance of the model. Inspired by this observation, we propose Transformer with a Mixture of Gaussian Keys (Transformer-MGK), a novel transformer architecture that replaces redundant heads in transformers with a mixture of keys at each head. These mixtures of keys follow a Gaussian mixture model and allow each attention head to focus on different parts of the input sequence efficiently. Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires less FLOPs to compute while achieving comparable or better accuracy across tasks. Transformer-MGK can also be easily extended to use with linear attentions. We empirically demonstrate the advantage of Transformer-MGK in a range of practical applications including language modeling and tasks that involve very long sequences. On the Wikitext-103 and Long Range Arena benchmark, Transformer-MGKs with 4 heads attain comparable or better performance to the baseline transformers with 8 heads.

A Variational Bayesian Approach to Learning Latent Variables for Acoustic Knowledge Transfer arxiv:2110.08598 📈 5

Hu Hu, Sabato Marco Siniscalchi, Chao-Han Huck Yang, Chin-Hui Lee

**Abstract:** We propose a variational Bayesian (VB) approach to learning distributions of latent variables in deep neural network (DNN) models for cross-domain knowledge transfer, to address acoustic mismatches between training and testing conditions. Instead of carrying out point estimation in conventional maximum a posteriori estimation with a risk of having a curse of dimensionality in estimating a huge number of model parameters, we focus our attention on estimating a manageable number of latent variables of DNNs via a VB inference framework. To accomplish model transfer, knowledge learnt from a source domain is encoded in prior distributions of latent variables and optimally combined, in a Bayesian sense, with a small set of adaptation data from a target domain to approximate the corresponding posterior distributions. Experimental results on device adaptation in acoustic scene classification show that our proposed VB approach can obtain good improvements on target devices, and consistently outperforms 13 state-of-the-art knowledge transfer algorithms.

ASR4REAL: An extended benchmark for speech models arxiv:2110.08583 📈 5

Morgane Riviere, Jade Copet, Gabriel Synnaeve

**Abstract:** Popular ASR benchmarks such as Librispeech and Switchboard are limited in the diversity of settings and speakers they represent. We introduce a set of benchmarks matching real-life conditions, aimed at spotting possible biases and weaknesses in models. We have found out that even though recent models do not seem to exhibit a gender bias, they usually show important performance discrepancies by accent, and even more important ones depending on the socio-economic status of the speakers. Finally, all tested models show a strong performance drop when tested on conversational speech, and in this precise context even a language model trained on a dataset as big as Common Crawl does not seem to have significant positive effect which reiterates the importance of developing conversational language models

BNAS v2: Learning Architectures for Binary Networks with Empirical Improvements arxiv:2110.08562 📈 5

Dahyun Kim, Kunal Pratap Singh, Jonghyun Choi

**Abstract:** Backbone architectures of most binary networks are well-known floating point (FP) architectures such as the ResNet family. Questioning that the architectures designed for FP networks might not be the best for binary networks, we propose to search architectures for binary networks (BNAS) by defining a new search space for binary architectures and a novel search objective. Specifically, based on the cell based search method, we define the new search space of binary layer types, design a new cell template, and rediscover the utility of and propose to use the Zeroise layer instead of using it as a placeholder. The novel search objective diversifies early search to learn better performing binary architectures. We show that our method searches architectures with stable training curves despite the quantization error inherent in binary networks. Quantitative analyses demonstrate that our searched architectures outperform the architectures used in state-of-the-art binary networks and outperform or perform on par with state-of-the-art binary networks that employ various techniques other than architectural changes. In addition, we further propose improvements to the training scheme of our searched architectures. With the new training scheme for our searched architectures, we achieve the state-of-the-art performance by binary networks by outperforming all previous methods by non-trivial margins.

HRKD: Hierarchical Relational Knowledge Distillation for Cross-domain Language Model Compression arxiv:2110.08551 📈 5

Chenhe Dong, Yaliang Li, Ying Shen, Minghui Qiu

**Abstract:** On many natural language processing tasks, large pre-trained language models (PLMs) have shown overwhelming performances compared with traditional neural network methods. Nevertheless, their huge model size and low inference speed have hindered the deployment on resource-limited devices in practice. In this paper, we target to compress PLMs with knowledge distillation, and propose a hierarchical relational knowledge distillation (HRKD) method to capture both hierarchical and domain relational information. Specifically, to enhance the model capability and transferability, we leverage the idea of meta-learning and set up domain-relational graphs to capture the relational information across different domains. And to dynamically select the most representative prototypes for each domain, we propose a hierarchical compare-aggregate mechanism to capture hierarchical relationships. Extensive experiments on public multi-domain datasets demonstrate the superior performance of our HRKD method as well as its strong few-shot learning ability. For reproducibility, we release the code at https://github.com/cheneydon/hrkd.

A Q-Learning-based Approach for Distributed Beam Scheduling in mmWave Networks arxiv:2110.08704 📈 4

Xiang Zhang, Shamik Sarkar, Arupjyoti Bhuyan, Sneha Kumar Kasera, Mingyue Ji

**Abstract:** We consider the problem of distributed downlink beam scheduling and power allocation for millimeter-Wave (mmWave) cellular networks where multiple base stations (BSs) belonging to different service operators share the same unlicensed spectrum with no central coordination or cooperation among them. Our goal is to design efficient distributed beam scheduling and power allocation algorithms such that the network-level payoff, defined as the weighted sum of the total throughput and a power penalization term, can be maximized. To this end, we propose a distributed scheduling approach to power allocation and adaptation for efficient interference management over the shared spectrum by modeling each BS as an independent Q-learning agent. As a baseline, we compare the proposed approach to the state-of-the-art non-cooperative game-based approach which was previously developed for the same problem. We conduct extensive experiments under various scenarios to verify the effect of multiple factors on the performance of both approaches. Experiment results show that the proposed approach adapts well to different interference situations by learning from experience and can achieve higher payoff than the game-based approach. The proposed approach can also be integrated into our previously developed Lyapunov stochastic optimization framework for the purpose of network utility maximization with optimality guarantee. As a result, the weights in the payoff function can be automatically and optimally determined by the virtual queue values from the sub-problems derived from the Lyapunov optimization framework.

Towards Instance-Optimal Offline Reinforcement Learning with Pessimism arxiv:2110.08695 📈 4

Ming Yin, Yu-Xiang Wang

**Abstract:** We study the offline reinforcement learning (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown Markov Decision Process (MDP) using the data coming from a policy $μ$. In particular, we consider the sample complexity problems of offline RL for finite-horizon MDPs. Prior works study this problem based on different data-coverage assumptions, and their learning guarantees are expressed by the covering coefficients which lack the explicit characterization of system quantities. In this work, we analyze the Adaptive Pessimistic Value Iteration (APVI) algorithm and derive the suboptimality upper bound that nearly matches \[ O\left(\sum_{h=1}^H\sum_{s_h,a_h}d^{π^\star}_h(s_h,a_h)\sqrt{\frac{\mathrm{Var}_{P_{s_h,a_h}}{(V^\star_{h+1}+r_h)}}{d^μ_h(s_h,a_h)}}\sqrt{\frac{1}{n}}\right). \] In complementary, we also prove a per-instance information-theoretical lower bound under the weak assumption that $d^μ_h(s_h,a_h)>0$ if $d^{π^\star}_h(s_h,a_h)>0$. Different from the previous minimax lower bounds, the per-instance lower bound (via local minimaxity) is a much stronger criterion as it applies to individual instances separately. Here $π^\star$ is a optimal policy, $μ$ is the behavior policy and $d_h^μ$ is the marginal state-action probability. We call the above equation the intrinsic offline reinforcement learning bound since it directly implies all the existing optimal results: minimax rate under uniform data-coverage assumption, horizon-free setting, single policy concentrability, and the tight problem-dependent results. Later, we extend the result to the assumption-free regime (where we make no assumption on $ μ$) and obtain the assumption-free intrinsic bound. Due to its generic form, we believe the intrinsic bound could help illuminate what makes a specific problem hard and reveal the fundamental challenges in offline RL.

Hydra: A System for Large Multi-Model Deep Learning arxiv:2110.08633 📈 4

Kabir Nagrecha, Arun Kumar

**Abstract:** Training deep learning (DL) models that do not fit into the memory of a single GPU is a vexed process, forcing users to procure multiple GPUs to adopt model-parallel execution. Unfortunately, sequential dependencies in neural architectures often block efficient multi-device training, leading to suboptimal performance. We present 'model spilling', a technique aimed at models such as Transformers and CNNs to move groups of layers, or shards, between DRAM and GPU memory, thus enabling arbitrarily large models to be trained even on just one GPU. We then present a set of novel techniques leveraging spilling to raise efficiency for multi-model training workloads such as model selection: a new hybrid of task- and model-parallelism, a new shard scheduling heuristic, and 'double buffering' to hide latency. We prototype our ideas into a system we call HYDRA to support seamless single-model and multi-model training of large DL models. Experiments with real benchmark workloads show that HYDRA is over 7x faster than regular model parallelism and over 50% faster than state-of-the-art industrial tools for pipeline parallelism.

Sparse Distillation: Speeding Up Text Classification by Using Bigger Models arxiv:2110.08536 📈 4

Qinyuan Ye, Madian Khabsa, Mike Lewis, Sinong Wang, Xiang Ren, Aaron Jaech

**Abstract:** Distilling state-of-the-art transformer models into lightweight student models is an effective way to reduce computation cost at inference time. However, the improved inference speed may be still unsatisfactory for certain time-sensitive applications. In this paper, we aim to further push the limit of inference speed by exploring a new area in the design space of the student model. More specifically, we consider distilling a transformer-based text classifier into a billion-parameter, sparsely-activated student model with a embedding-averaging architecture. Our experiments show that the student models retain 97% of the RoBERTa-Large teacher performance on a collection of six text classification tasks. Meanwhile, the student model achieves up to 600x speed-up on both GPUs and CPUs, compared to the teacher models. Further investigation shows that our pipeline is also effective in privacy-preserving and domain generalization settings.

Understanding Procedural Knowledge by Sequencing Multimodal Instructional Manuals arxiv:2110.08486 📈 4

Te-Lin Wu, Alex Spangher, Pegah Alipoormolabashi, Marjorie Freedman, Ralph Weischedel, Nanyun Peng

**Abstract:** The ability to sequence unordered events is an essential skill to comprehend and reason about real world task procedures, which often requires thorough understanding of temporal common sense and multimodal information, as these procedures are often communicated through a combination of texts and images. Such capability is essential for applications such as sequential task planning and multi-source instruction summarization. While humans are capable of reasoning about and sequencing unordered multimodal procedural instructions, whether current machine learning models have such essential capability is still an open question. In this work, we benchmark models' capability of reasoning over and sequencing unordered multimodal instructions by curating datasets from popular online instructional manuals and collecting comprehensive human annotations. We find models not only perform significantly worse than humans but also seem incapable of efficiently utilizing the multimodal information. To improve machines' performance on multimodal event sequencing, we propose sequentiality-aware pretraining techniques that exploit the sequential alignment properties of both texts and images, resulting in > 5% significant improvements.

A Survey on Awesome Korean NLP Datasets arxiv:2112.01624 📈 3

Byunghyun Ban

**Abstract:** English based datasets are commonly available from Kaggle, GitHub, or recently published papers. Although benchmark tests with English datasets are sufficient to show off the performances of new models and methods, still a researcher need to train and validate the models on Korean based datasets to produce a technology or product, suitable for Korean processing. This paper introduces 15 popular Korean based NLP datasets with summarized details such as volume, license, repositories, and other research results inspired by the datasets. Also, I provide high-resolution instructions with sample or statistics of datasets. The main characteristics of datasets are presented on a single table to provide a rapid summarization of datasets for researchers.

CAE-Transformer: Transformer-based Model to Predict Invasiveness of Lung Adenocarcinoma Subsolid Nodules from Non-thin Section 3D CT Scans arxiv:2110.08721 📈 3

Shahin Heidarian, Parnian Afshar, Anastasia Oikonomou, Konstantinos N. Plataniotis, Arash Mohammadi

**Abstract:** Lung cancer is the leading cause of mortality from cancer worldwide and has various histologic types, among which Lung Adenocarcinoma (LUAC) has recently been the most prevalent. Lung adenocarcinomas are classified as pre invasive, minimally invasive, and invasive adenocarcinomas. Timely and accurate knowledge of the invasiveness of lung nodules leads to a proper treatment plan and reduces the risk of unnecessary or late surgeries. Currently, the primary imaging modality to assess and predict the invasiveness of LUACs is the chest CT. The results based on CT images, however, are subjective and suffer from a low accuracy compared to the ground truth pathological reviews provided after surgical resections. In this paper, a predictive transformer-based framework, referred to as the "CAE-Transformer", is developed to classify LUACs. The CAE-Transformer utilizes a Convolutional Auto-Encoder (CAE) to automatically extract informative features from CT slices, which are then fed to a modified transformer model to capture global inter-slice relations. Experimental results on our in-house dataset of 114 pathologically proven Sub Solid Nodules (SSNs) demonstrate the superiority of the CAE-Transformer over the histogram/radiomics-based models and its deep learning-based counterparts, achieving an accuracy of 87.73%, sensitivity of 88.67%, specificity of 86.33%, and AUC of 0.913, using a 10-fold cross-validation.

Black-box Adversarial Attacks on Network-wide Multi-step Traffic State Prediction Models arxiv:2110.08712 📈 3

Bibek Poudel, Weizi Li

**Abstract:** Traffic state prediction is necessary for many Intelligent Transportation Systems applications. Recent developments of the topic have focused on network-wide, multi-step prediction, where state of the art performance is achieved via deep learning models, in particular, graph neural network-based models. While the prediction accuracy of deep learning models is high, these models' robustness has raised many safety concerns, given that imperceptible perturbations added to input can substantially degrade the model performance. In this work, we propose an adversarial attack framework by treating the prediction model as a black-box, i.e., assuming no knowledge of the model architecture, training data, and (hyper)parameters. However, we assume that the adversary can oracle the prediction model with any input and obtain corresponding output. Next, the adversary can train a substitute model using input-output pairs and generate adversarial signals based on the substitute model. To test the attack effectiveness, two state of the art, graph neural network-based models (GCGRNN and DCRNN) are examined. As a result, the adversary can degrade the target model's prediction accuracy up to $54\%$. In comparison, two conventional statistical models (linear regression and historical average) are also examined. While these two models do not produce high prediction accuracy, they are either influenced negligibly (less than $3\%$) or are immune to the adversary's attack.

NeuralArTS: Structuring Neural Architecture Search with Type Theory arxiv:2110.08710 📈 3

Robert Wu, Nayan Saxena, Rohan Jain

**Abstract:** Neural Architecture Search (NAS) algorithms automate the task of finding optimal deep learning architectures given an initial search space of possible operations. Developing these search spaces is usually a manual affair with pre-optimized search spaces being more efficient, rather than searching from scratch. In this paper we present a new framework called Neural Architecture Type System (NeuralArTS) that categorizes the infinite set of network operations in a structured type system. We further demonstrate how NeuralArTS can be applied to convolutional layers and propose several future directions.

On the Pareto Frontier of Regret Minimization and Best Arm Identification in Stochastic Bandits arxiv:2110.08627 📈 3

Zixin Zhong, Wang Chi Cheung, Vincent Y. F. Tan

**Abstract:** We study the Pareto frontier of two archetypal objectives in stochastic bandits, namely, regret minimization (RM) and best arm identification (BAI) with a fixed horizon. It is folklore that the balance between exploitation and exploration is crucial for both RM and BAI, but exploration is more critical in achieving the optimal performance for the latter objective. To make this precise, we first design and analyze the BoBW-lil'UCB$(γ)$ algorithm, which achieves order-wise optimal performance for RM or BAI under different values of $γ$. Complementarily, we show that no algorithm can simultaneously perform optimally for both the RM and BAI objectives. More precisely, we establish non-trivial lower bounds on the regret achievable by any algorithm with a given BAI failure probability. This analysis shows that in some regimes BoBW-lil'UCB$(γ)$ achieves Pareto-optimality up to constant or small terms. Numerical experiments further demonstrate that when applied to difficult instances, BoBW-lil'UCB outperforms a close competitor UCB$_α$ (Degenne et al., 2019), which is designed for RM and BAI with a fixed confidence.

SAGAN: Adversarial Spatial-asymmetric Attention for Noisy Nona-Bayer Reconstruction arxiv:2110.08619 📈 3

S M A Sharif, Rizwan Ali Naqvi, Mithun Biswas

**Abstract:** Nona-Bayer colour filter array (CFA) pattern is considered one of the most viable alternatives to traditional Bayer patterns. Despite the substantial advantages, such non-Bayer CFA patterns are susceptible to produce visual artefacts while reconstructing RGB images from noisy sensor data. This study addresses the challenges of learning RGB image reconstruction from noisy Nona-Bayer CFA comprehensively. We propose a novel spatial-asymmetric attention module to jointly learn bi-direction transformation and large-kernel global attention to reduce the visual artefacts. We combine our proposed module with adversarial learning to produce plausible images from Nona-Bayer CFA. The feasibility of the proposed method has been verified and compared with the state-of-the-art image reconstruction method. The experiments reveal that the proposed method can reconstruct RGB images from noisy Nona-Bayer CFA without producing any visually disturbing artefacts. Also, it can outperform the state-of-the-art image reconstruction method in both qualitative and quantitative comparison. Code available: https://github.com/sharif-apu/SAGAN_BMVC21.

Mapping illegal waste dumping sites with neural-network classification of satellite imagery arxiv:2110.08599 📈 3

Maria Roberta Devesa, Antonio Vazquez Brust

**Abstract:** Public health and habitat quality are crucial goals of urban planning. In recent years, the severe social and environmental impact of illegal waste dumping sites has made them one of the most serious problems faced by cities in the Global South, in a context of scarce information available for decision making. To help identify the location of dumping sites and track their evolution over time we adopt a data-driven model from the machine learning domain, analyzing satellite images. This allows us to take advantage of the increasing availability of geo-spatial open-data, high-resolution satellite imagery, and open source tools to train machine learning algorithms with a small set of known waste dumping sites in Buenos Aires, and then predict the location of other sites over vast areas at high speed and low cost. This case study shows the results of a collaboration between Dymaxion Labs and Fundación Bunge y Born to harness this technique in order to create a comprehensive map of potential locations of illegal waste dumping sites in the region.

Generative Adversarial Imitation Learning for End-to-End Autonomous Driving on Urban Environments arxiv:2110.08586 📈 3

Gustavo Claudio Karl Couto, Eric Aislan Antonelo

**Abstract:** Autonomous driving is a complex task, which has been tackled since the first self-driving car ALVINN in 1989, with a supervised learning approach, or behavioral cloning (BC). In BC, a neural network is trained with state-action pairs that constitute the training set made by an expert, i.e., a human driver. However, this type of imitation learning does not take into account the temporal dependencies that might exist between actions taken in different moments of a navigation trajectory. These type of tasks are better handled by reinforcement learning (RL) algorithms, which need to define a reward function. On the other hand, more recent approaches to imitation learning, such as Generative Adversarial Imitation Learning (GAIL), can train policies without explicitly requiring to define a reward function, allowing an agent to learn by trial and error directly on a training set of expert trajectories. In this work, we propose two variations of GAIL for autonomous navigation of a vehicle in the realistic CARLA simulation environment for urban scenarios. Both of them use the same network architecture, which process high dimensional image input from three frontal cameras, and other nine continuous inputs representing the velocity, the next point from the sparse trajectory and a high-level driving command. We show that both of them are capable of imitating the expert trajectory from start to end after training ends, but the GAIL loss function that is augmented with BC outperforms the former in terms of convergence time and training stability.

Visual-aware Attention Dual-stream Decoder for Video Captioning arxiv:2110.08578 📈 3

Zhixin Sun, Xian Zhong, Shuqin Chen, Lin Li, Luo Zhong

**Abstract:** Video captioning is a challenging task that captures different visual parts and describes them in sentences, for it requires visual and linguistic coherence. The attention mechanism in the current video captioning method learns to assign weight to each frame, promoting the decoder dynamically. This may not explicitly model the correlation and the temporal coherence of the visual features extracted in the sequence frames.To generate semantically coherent sentences, we propose a new Visual-aware Attention (VA) model, which concatenates dynamic changes of temporal sequence frames with the words at the previous moment, as the input of attention mechanism to extract sequence features.In addition, the prevalent approaches widely use the teacher-forcing (TF) learning during training, where the next token is generated conditioned on the previous ground-truth tokens. The semantic information in the previously generated tokens is lost. Therefore, we design a self-forcing (SF) stream that takes the semantic information in the probability distribution of the previous token as input to enhance the current token.The Dual-stream Decoder (DD) architecture unifies the TF and SF streams, generating sentences to promote the annotated captioning for both streams.Meanwhile, with the Dual-stream Decoder utilized, the exposure bias problem is alleviated, caused by the discrepancy between the training and testing in the TF learning.The effectiveness of the proposed Visual-aware Attention Dual-stream Decoder (VADD) is demonstrated through the result of experimental studies on Microsoft video description (MSVD) corpus and MSR-Video to text (MSR-VTT) datasets.

Virtual Augmentation Supported Contrastive Learning of Sentence Representations arxiv:2110.08552 📈 3

Dejiao Zhang, Wei Xiao, Henghui Zhu, Xiaofei Ma, Andrew O. Arnold

**Abstract:** Despite profound successes, contrastive representation learning relies on carefully designed data augmentations using domain specific knowledge. This challenge is magnified in natural language processing where no general rules exist for data augmentation due to the discrete nature of natural language. We tackle this challenge by presenting a Virtual augmentation Supported Contrastive Learning of sentence representations (VaSCL). Originating from the interpretation that data augmentation essentially constructs the neighborhoods of each training instance, we in turn utilize the neighborhood to generate effective data augmentations. Leveraging the large training batch size of contrastive learning, we approximate the neighborhood of an instance via its K-nearest in-batch neighbors in the representation space. We then define an instance discrimination task within this neighborhood, and generate the virtual augmentation in an adversarial training manner. We access the performance of VaSCL on a wide range of downstream tasks, and set a new state-of-the-art for unsupervised sentence representation learning.

An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-Trained Language Models arxiv:2110.08527 📈 3

Nicholas Meade, Elinor Poole-Dayan, Siva Reddy

**Abstract:** Recent work has shown that pre-trained language models capture social biases from the text corpora they are trained on. This has attracted attention to developing techniques that mitigate such biases. In this work, we perform a empirical survey of five recently proposed debiasing techniques: Counterfactual Data Augmentation (CDA), Dropout, Iterative Nullspace Projection, Self-Debias, and SentenceDebias. We quantify the effectiveness of each technique using three different bias benchmarks while also measuring the impact of these techniques on a model's language modeling ability, as well as its performance on downstream NLU tasks. We experimentally find that: (1) CDA and Self-Debias are the strongest of the debiasing techniques, obtaining improved scores on most of the bias benchmarks (2) Current debiasing techniques do not generalize well beyond gender bias; And (3) improvements on bias benchmarks such as StereoSet and CrowS-Pairs by using debiasing strategies are usually accompanied by a decrease in language modeling ability, making it difficult to determine whether the bias mitigation is effective.

BAPGAN: GAN-based Bone Age Progression of Femur and Phalange X-ray Images arxiv:2110.08509 📈 3

Shinji Nakazawa, Changhee Han, Joe Hasei, Ryuichi Nakahara, Toshifumi Ozaki

**Abstract:** Convolutional Neural Networks play a key role in bone age assessment for investigating endocrinology, genetic, and growth disorders under various modalities and body regions. However, no researcher has tackled bone age progression/regression despite its valuable potential applications: bone-related disease diagnosis, clinical knowledge acquisition, and museum education. Therefore, we propose Bone Age Progression Generative Adversarial Network (BAPGAN) to progress/regress both femur/phalange X-ray images while preserving identity and realism. We exhaustively confirm the BAPGAN's clinical potential via Frechet Inception Distance, Visual Turing Test by two expert orthopedists, and t-Distributed Stochastic Neighbor Embedding.

On Model Selection Consistency of Lasso for High-Dimensional Ising Models on Tree-like Graphs arxiv:2110.08500 📈 3

Xiangming Meng, Tomoyuki Obuchi, Yoshiyuki Kabashima

**Abstract:** We consider the problem of high-dimensional Ising model selection using neighborhood-based least absolute shrinkage and selection operator (Lasso). It is rigorously proved that under some mild coherence conditions on the population covariance matrix of the Ising model, consistent model selection can be achieved with sample sizes $n=Ω{(d^3\log{p})}$ for any tree-like graph in the paramagnetic phase, where $p$ is the number of variables and $d$ is the maximum node degree. When the same conditions are imposed directly on the sample covariance matrices, it is shown that a reduced sample size $n=Ω{(d^2\log{p})}$ suffices. The obtained sufficient conditions for consistent model selection with Lasso are the same in the scaling of the sample complexity as that of $\ell_1$-regularized logistic regression. Given the popularity and efficiency of Lasso, our rigorous analysis provides a theoretical backing for its practical use in Ising model selection.

Centroid Approximation for Bootstrap arxiv:2110.08720 📈 2

Mao Ye, Qiang Liu

**Abstract:** Bootstrap is a principled and powerful frequentist statistical tool for uncertainty quantification. Unfortunately, standard bootstrap methods are computationally intensive due to the need of drawing a large i.i.d. bootstrap sample to approximate the ideal bootstrap distribution; this largely hinders their application in large-scale machine learning, especially deep learning problems. In this work, we propose an efficient method to explicitly \emph{optimize} a small set of high quality "centroid" points to better approximate the ideal bootstrap distribution. We achieve this by minimizing a simple objective function that is asymptotically equivalent to the Wasserstein distance to the ideal bootstrap distribution. This allows us to provide an accurate estimation of uncertainty with a small number of bootstrap centroids, outperforming the naive i.i.d. sampling approach. Empirically, we show that our method can boost the performance of bootstrap in a variety of applications.

Pareto Navigation Gradient Descent: a First-Order Algorithm for Optimization in Pareto Set arxiv:2110.08713 📈 2

Mao Ye, Qiang Liu

**Abstract:** Many modern machine learning applications, such as multi-task learning, require finding optimal model parameters to trade-off multiple objective functions that may conflict with each other. The notion of the Pareto set allows us to focus on the set of (often infinite number of) models that cannot be strictly improved. But it does not provide an actionable procedure for picking one or a few special models to return to practical users. In this paper, we consider \emph{optimization in Pareto set (OPT-in-Pareto)}, the problem of finding Pareto models that optimize an extra reference criterion function within the Pareto set. This function can either encode a specific preference from the users, or represent a generic diversity measure for obtaining a set of diversified Pareto models that are representative of the whole Pareto set. Unfortunately, despite being a highly useful framework, efficient algorithms for OPT-in-Pareto have been largely missing, especially for large-scale, non-convex, and non-linear objectives in deep learning. A naive approach is to apply Riemannian manifold gradient descent on the Pareto set, which yields a high computational cost due to the need for eigen-calculation of Hessian matrices. We propose a first-order algorithm that approximately solves OPT-in-Pareto using only gradient information, with both high practical efficiency and theoretically guaranteed convergence property. Empirically, we demonstrate that our method works efficiently for a variety of challenging multi-task-related problems.

On the Statistical Analysis of Complex Tree-shaped 3D Objects arxiv:2110.08693 📈 2

Guan Wang, Hamid Laga, Anuj Srivastava

**Abstract:** How can one analyze detailed 3D biological objects, such as neurons and botanical trees, that exhibit complex geometrical and topological variation? In this paper, we develop a novel mathematical framework for representing, comparing, and computing geodesic deformations between the shapes of such tree-like 3D objects. A hierarchical organization of subtrees characterizes these objects -- each subtree has the main branch with some side branches attached -- and one needs to match these structures across objects for meaningful comparisons. We propose a novel representation that extends the Square-Root Velocity Function (SRVF), initially developed for Euclidean curves, to tree-shaped 3D objects. We then define a new metric that quantifies the bending, stretching, and branch sliding needed to deform one tree-shaped object into the other. Compared to the current metrics, such as the Quotient Euclidean Distance (QED) and the Tree Edit Distance (TED), the proposed representation and metric capture the full elasticity of the branches (i.e., bending and stretching) as well as the topological variations (i.e., branch death/birth and sliding). It completely avoids the shrinkage that results from the edge collapse and node split operations of the QED and TED metrics. We demonstrate the utility of this framework in comparing, matching, and computing geodesics between biological objects such as neurons and botanical trees. The framework is also applied to various shape analysis tasks: (i) symmetry analysis and symmetrization of tree-shaped 3D objects, (ii) computing summary statistics (means and modes of variations) of populations of tree-shaped 3D objects, (iii) fitting parametric probability distributions to such populations, and (iv) finally synthesizing novel tree-shaped 3D objects through random sampling from estimated probability distributions.

Classical-to-Quantum Transfer Learning for Spoken Command Recognition Based on Quantum Neural Networks arxiv:2110.08689 📈 2

Jun Qi, Javier Tejedor

**Abstract:** This work investigates an extension of transfer learning applied in machine learning algorithms to the emerging hybrid end-to-end quantum neural network (QNN) for spoken command recognition (SCR). Our QNN-based SCR system is composed of classical and quantum components: (1) the classical part mainly relies on a 1D convolutional neural network (CNN) to extract speech features; (2) the quantum part is built upon the variational quantum circuit with a few learnable parameters. Since it is inefficient to train the hybrid end-to-end QNN from scratch on a noisy intermediate-scale quantum (NISQ) device, we put forth a hybrid transfer learning algorithm that allows a pre-trained classical network to be transferred to the classical part of the hybrid QNN model. The pre-trained classical network is further modified and augmented through jointly fine-tuning with a variational quantum circuit (VQC). The hybrid transfer learning methodology is particularly attractive for the task of QNN-based SCR because low-dimensional classical features are expected to be encoded into quantum states. We assess the hybrid transfer learning algorithm applied to the hybrid classical-quantum QNN for SCR on the Google speech command dataset, and our classical simulation results suggest that the hybrid transfer learning can boost our baseline performance on the SCR task.

Noise-Augmented Privacy-Preserving Empirical Risk Minimization with Dual-purpose Regularizer and Privacy Budget Retrieval and Recycling arxiv:2110.08676 📈 2

Yinan Li, Fang Liu

**Abstract:** We propose Noise-Augmented Privacy-Preserving Empirical Risk Minimization (NAPP-ERM) that solves ERM with differential privacy guarantees. Existing privacy-preserving ERM approaches may be subject to over-regularization with the employment of an l2 term to achieve strong convexity on top of the target regularization. NAPP-ERM improves over the current approaches and mitigates over-regularization by iteratively realizing target regularization through appropriately designed augmented data and delivering strong convexity via a single adaptively weighted dual-purpose l2 regularizer. When the target regularization is for variable selection, we propose a new regularizer that achieves both privacy and sparsity guarantees simultaneously. Finally, we propose a strategy to retrieve privacy budget when the strong convexity requirement is met, which can be returned to users such that the DP of ERM is guaranteed at a lower privacy cost than originally planned, or be recycled to the ERM optimization procedure to reduce the injected DP noise and improve the utility of DP-ERM. From an implementation perspective, NAPP-ERM can be achieved by optimizing a non-perturbed object function given noise-augmented data and can thus leverage existing tools for non-private ERM optimization. We illustrate through extensive experiments the mitigation effect of the over-regularization and private budget retrieval by NAPP-ERM on variable selection and prediction.

Finding Critical Scenarios for Automated Driving Systems: A Systematic Literature Review arxiv:2110.08664 📈 2

Xinhai Zhang, Jianbo Tao, Kaige Tan, Martin Törngren, José Manuel Gaspar Sánchez, Muhammad Rusyadi Ramli, Xin Tao, Magnus Gyllenhammar, Franz Wotawa, Naveen Mohan, Mihai Nica, Hermann Felbinger

**Abstract:** Scenario-based approaches have been receiving a huge amount of attention in research and engineering of automated driving systems. Due to the complexity and uncertainty of the driving environment, and the complexity of the driving task itself, the number of possible driving scenarios that an ADS or ADAS may encounter is virtually infinite. Therefore it is essential to be able to reason about the identification of scenarios and in particular critical ones that may impose unacceptable risk if not considered. Critical scenarios are particularly important to support design, verification and validation efforts, and as a basis for a safety case. In this paper, we present the results of a systematic literature review in the context of autonomous driving. The main contributions are: (i) introducing a comprehensive taxonomy for critical scenario identification methods; (ii) giving an overview of the state-of-the-art research based on the taxonomy encompassing 86 papers between 2017 and 2020; and (iii) identifying open issues and directions for further research. The provided taxonomy comprises three main perspectives encompassing the problem definition (the why), the solution (the methods to derive scenarios), and the assessment of the established scenarios. In addition, we discuss open research issues considering the perspectives of coverage, practicability, and scenario space explosion.

Equivariant Discrete Normalizing Flows arxiv:2110.08649 📈 2

Avishek Joey Bose, Ivan Kobyzev

**Abstract:** At its core, generative modeling seeks to uncover the underlying factors that give rise to observed data that can often be modelled as the natural symmetries that manifest themselves through invariances and equivariances to certain transformations laws. However, current approaches are couched in the formalism of continuous normalizing flows that require the construction of equivariant vector fields -- inhibiting their simple application to conventional higher dimensional generative modelling domains like natural images. In this paper we focus on building equivariant normalizing flows using discrete layers. We first theoretically prove the existence of an equivariant map for compact groups whose actions are on compact spaces. We further introduce two new equivariant flows: $G$-coupling Flows and $G$-Residual Flows that elevate classical Coupling and Residual Flows with equivariant maps to a prescribed group $G$. Our construction of $G$-Residual Flows are also universal, in the sense that we prove an $G$-equivariant diffeomorphism can be exactly mapped by a $G$-residual flow. Finally, we complement our theoretical insights with experiments -- for the first time -- on image datasets like CIFAR-10 and show $G$-Equivariant Discrete Normalizing flows lead to increased data efficiency, faster convergence, and improved likelihood estimates.

Local Advantage Actor-Critic for Robust Multi-Agent Deep Reinforcement Learning arxiv:2110.08642 📈 2

Yuchen Xiao, Xueguang Lyu, Christopher Amato

**Abstract:** Policy gradient methods have become popular in multi-agent reinforcement learning, but they suffer from high variance due to the presence of environmental stochasticity and exploring agents (i.e., non-stationarity), which is potentially worsened by the difficulty in credit assignment. As a result, there is a need for a method that is not only capable of efficiently solving the above two problems but also robust enough to solve a variety of tasks. To this end, we propose a new multi-agent policy gradient method, called Robust Local Advantage (ROLA) Actor-Critic. ROLA allows each agent to learn an individual action-value function as a local critic as well as ameliorating environment non-stationarity via a novel centralized training approach based on a centralized critic. By using this local critic, each agent calculates a baseline to reduce variance on its policy gradient estimation, which results in an expected advantage action-value over other agents' choices that implicitly improves credit assignment. We evaluate ROLA across diverse benchmarks and show its robustness and effectiveness over a number of state-of-the-art multi-agent policy gradient algorithms.

Physics-guided Deep Markov Models for Learning Nonlinear Dynamical Systems with Uncertainty arxiv:2110.08607 📈 2

Wei Liu, Zhilu Lai, Kiran Bacsa, Eleni Chatzi

**Abstract:** In this paper, we propose a probabilistic physics-guided framework, termed Physics-guided Deep Markov Model (PgDMM). The framework is especially targeted to the inference of the characteristics and latent structure of nonlinear dynamical systems from measurement data, where it is typically intractable to perform exact inference of latent variables. A recently surfaced option pertains to leveraging variational inference to perform approximate inference. In such a scheme, transition and emission functions of the system are parameterized via feed-forward neural networks (deep generative models). However, due to the generalized and highly versatile formulation of neural network functions, the learned latent space is often prone to lack physical interpretation and structured representation. To address this, we bridge physics-based state space models with Deep Markov Models, thus delivering a hybrid modeling framework for unsupervised learning and identification for nonlinear dynamical systems. Specifically, the transition process can be modeled as a physics-based model enhanced with an additive neural network component, which aims to learn the discrepancy between the physics-based model and the actual dynamical system being monitored. The proposed framework takes advantage of the expressive power of deep learning, while retaining the driving physics of the dynamical system by imposing physics-driven restrictions on the side of the latent space. We demonstrate the benefits of such a fusion in terms of achieving improved performance on illustrative simulation examples and experimental case studies of nonlinear systems. Our results indicate that the physics-based models involved in the employed transition and emission functions essentially enforce a more structured and physically interpretable latent space, which is essential to generalization and prediction capabilities.

A MIMO Radar-based Few-Shot Learning Approach for Human-ID arxiv:2110.08595 📈 2

Pascal Weller, Fady Aziz, Sherif Abdulatif, Urs Schneider, Marco F. Huber

**Abstract:** Radar for deep learning-based human identification has become a research area of increasing interest. It has been shown that micro-Doppler (\(\upmu\)-D) can reflect the walking behavior through capturing the periodic limbs' micro-motions. One of the main aspects is maximizing the number of included classes while considering the real-time and training dataset size constraints. In this paper, a multiple-input-multiple-output (MIMO) radar is used to formulate micro-motion spectrograms of the elevation angular velocity (\(\upmu\)-\(ω\)). The effectiveness of concatenating this newly-formulated spectrogram with the commonly used \(\upmu\)-D is investigated. To accommodate for non-constrained real walking motion, an adaptive cycle segmentation framework is utilized and a metric learning network is trained on half gait cycles (\(\approx\) 0.5 s). Studies on the effects of various numbers of classes (5--20), different dataset sizes, and varying observation time windows 1--2 s are conducted. A non-constrained walking dataset of 22 subjects is collected with different aspect angles with respect to the radar. The proposed few-shot learning (FSL) approach achieves a classification error of 11.3 % with only 2 min of training data per subject.

Nys-Curve: Nyström-Approximated Curvature for Stochastic Optimization arxiv:2110.08577 📈 2

Hardik Tankaria, Dinesh Singh, Makoto Yamada

**Abstract:** The quasi-Newton methods generally provide curvature information by approximating the Hessian using the secant equation. However, the secant equation becomes insipid in approximating the Newton step owing to its use of the first-order derivatives. In this study, we propose an approximate Newton step-based stochastic optimization algorithm for large-scale empirical risk minimization of convex functions with linear convergence rates. Specifically, we compute a partial column Hessian of size ($d\times k$) with $k\ll d$ randomly selected variables, then use the \textit{Nyström method} to better approximate the full Hessian matrix. To further reduce the computational complexity per iteration, we directly compute the update step ($Δ\boldsymbol{w}$) without computing and storing the full Hessian or its inverse. Furthermore, to address large-scale scenarios in which even computing a partial Hessian may require significant time, we used distribution-preserving (DP) sub-sampling to compute a partial Hessian. The DP sub-sampling generates $p$ sub-samples with similar first and second-order distribution statistics and selects a single sub-sample at each epoch in a round-robin manner to compute the partial Hessian. We integrate our approximated Hessian with stochastic gradient descent and stochastic variance-reduced gradients to solve the logistic regression problem. The numerical experiments show that the proposed approach was able to obtain a better approximation of Newton\textquotesingle s method with performance competitive with the state-of-the-art first-order and the stochastic quasi-Newton methods.

Deep Image Debanding arxiv:2110.08569 📈 2

Raymond Zhou, Shahrukh Athar, Zhongling Wang, Zhou Wang

**Abstract:** Banding or false contour is an annoying visual artifact whose impact is even more pronounced in ultra high definition, high dynamic range, and wide colour gamut visual content, which is becoming increasingly popular. Since users associate a heightened expectation of quality with such content and banding leads to deteriorated visual quality-of-experience, the area of banding removal or debanding has taken paramount importance. Existing debanding approaches are mostly knowledge-driven. Despite the widespread success of deep learning in other areas of image processing and computer vision, data-driven debanding approaches remain surprisingly missing. In this work, we make one of the first attempts to develop a deep learning based banding artifact removal method for images and name it deep debanding network (deepDeband). For its training, we construct a large-scale dataset of 51,490 pairs of corresponding pristine and banded image patches. Performance evaluation shows that deepDeband is successful at greatly reducing banding artifacts in images, outperforming existing methods both quantitatively and visually.

DPNAS: Neural Architecture Search for Deep Learning with Differential Privacy arxiv:2110.08557 📈 2

Anda Cheng, Jiaxing Wang, Xi Sheryl Zhang, Qiang Chen, Peisong Wang, Jian Cheng

**Abstract:** Training deep neural networks (DNNs) for meaningful differential privacy (DP) guarantees severely degrades model utility. In this paper, we demonstrate that the architecture of DNNs has a significant impact on model utility in the context of private deep learning, whereas its effect is largely unexplored in previous studies. In light of this missing, we propose the very first framework that employs neural architecture search to automatic model design for private deep learning, dubbed as DPNAS. To integrate private learning with architecture search, we delicately design a novel search space and propose a DP-aware method for training candidate models. We empirically certify the effectiveness of the proposed framework. The searched model DPNASNet achieves state-of-the-art privacy/utility trade-offs, e.g., for the privacy budget of $(ε, δ)=(3, 1\times10^{-5})$, our model obtains test accuracy of $98.57\%$ on MNIST, $88.09\%$ on FashionMNIST, and $68.33\%$ on CIFAR-10. Furthermore, by studying the generated architectures, we provide several intriguing findings of designing private-learning-friendly DNNs, which can shed new light on model design for deep learning with differential privacy.

Locally Adaptive Structure and Texture Similarity for Image Quality Assessment arxiv:2110.08521 📈 2

Keyan Ding, Yi Liu, Xueyi Zou, Shiqi Wang, Kede Ma

**Abstract:** The latest advances in full-reference image quality assessment (IQA) involve unifying structure and texture similarity based on deep representations. The resulting Deep Image Structure and Texture Similarity (DISTS) metric, however, makes rather global quality measurements, ignoring the fact that natural photographic images are locally structured and textured across space and scale. In this paper, we describe a locally adaptive structure and texture similarity index for full-reference IQA, which we term A-DISTS. Specifically, we rely on a single statistical feature, namely the dispersion index, to localize texture regions at different scales. The estimated probability (of one patch being texture) is in turn used to adaptively pool local structure and texture measurements. The resulting A-DISTS is adapted to local image content, and is free of expensive human perceptual scores for supervised training. We demonstrate the advantages of A-DISTS in terms of correlation with human data on ten IQA databases and optimization of single image super-resolution methods.

Mode and Ridge Estimation in Euclidean and Directional Product Spaces: A Mean Shift Approach arxiv:2110.08505 📈 2

Yikun Zhang, Yen-Chi Chen

**Abstract:** The set of local modes and the ridge lines estimated from a dataset are important summary characteristics of the data-generating distribution. In this work, we consider estimating the local modes and ridges from point cloud data in a product space with two or more Euclidean/directional metric spaces. Specifically, we generalize the well-known (subspace constrained) mean shift algorithm to the product space setting and illuminate some pitfalls in such generalization. We derive the algorithmic convergence of the proposed method, provide practical guidelines on the implementation, and demonstrate its effectiveness on both simulated and real datasets.

Streaming Decision Trees and Forests arxiv:2110.08483 📈 2

Haoyin Xu, Jayanta Dey, Sambit Panda, Joshua T. Vogelstein

**Abstract:** Machine learning has successfully leveraged modern data and provided computational solutions to innumerable real-world problems, including physical and biomedical discoveries. Currently, estimators could handle both scenarios with all samples available and situations requiring continuous updates. However, there is still room for improvement on streaming algorithms based on batch decision trees and random forests, which are the leading methods in batch data tasks. In this paper, we explore the simplest partial fitting algorithm to extend batch trees and test our models: stream decision tree (SDT) and stream decision forest (SDF) on three classification tasks of varying complexities. For reference, both existing streaming trees (Hoeffding trees and Mondrian forests) and batch estimators are included in the experiments. In all three tasks, SDF consistently produces high accuracy, whereas existing estimators encounter space restraints and accuracy fluctuations. Thus, our streaming trees and forests show great potential for further improvements, which are good candidates for solving problems like distribution drift and transfer learning.

An Evolutionary Correlation-aware Feature Selection Method for Classification Problems arxiv:2110.13082 📈 1

Motahare Namakin, Modjtaba Rouhani, Mostafa Sabzekar

**Abstract:** The population-based optimization algorithms have provided promising results in feature selection problems. However, the main challenges are high time complexity. Moreover, the interaction between features is another big challenge in FS problems that directly affects the classification performance. In this paper, an estimation of distribution algorithm is proposed to meet three goals. Firstly, as an extension of EDA, the proposed method generates only two individuals in each iteration that compete based on a fitness function and evolve during the algorithm, based on our proposed update procedure. Secondly, we provide a guiding technique for determining the number of features for individuals in each iteration. As a result, the number of selected features of the final solution will be optimized during the evolution process. The two mentioned advantages can increase the convergence speed of the algorithm. Thirdly, as the main contribution of the paper, in addition to considering the importance of each feature alone, the proposed method can consider the interaction between features. Thus, it can deal with complementary features and consequently increase classification performance. To do this, we provide a conditional probability scheme that considers the joint probability distribution of selecting two features. The introduced probabilities successfully detect correlated features. Experimental results on a synthetic dataset with correlated features prove the performance of our proposed approach facing these types of features. Furthermore, the results on 13 real-world datasets obtained from the UCI repository show the superiority of the proposed method in comparison with some state-of-the-art approaches.

A Learning-based Approach Towards Automated Tuning of SSD Configurations arxiv:2110.08685 📈 1

Daixuan Li, Jian Huang

**Abstract:** Thanks to the mature manufacturing techniques, solid-state drives (SSDs) are highly customizable for applications today, which brings opportunities to further improve their storage performance and resource utilization. However, the SSD efficiency is usually determined by many hardware parameters, making it hard for developers to manually tune them and determine the optimal SSD configurations. In this paper, we present an automated learning-based framework, named LearnedSSD, that utilizes both supervised and unsupervised machine learning (ML) techniques to drive the tuning of hardware configurations for SSDs. LearnedSSD automatically extracts the unique access patterns of a new workload using its block I/O traces, maps the workload to previously workloads for utilizing the learned experiences, and recommends an optimal SSD configuration based on the validated storage performance. LearnedSSD accelerates the development of new SSD devices by automating the hard-ware parameter configurations and reducing the manual efforts. We develop LearnedSSD with simple yet effective learning algorithms that can run efficiently on multi-core CPUs. Given a target storage workload, our evaluation shows that LearnedSSD can always deliver an optimal SSD configuration for the target workload, and this configuration will not hurt the performance of non-target workloads.

Blockchain and Federated Edge Learning for Privacy-Preserving Mobile Crowdsensing arxiv:2110.08671 📈 1

Qin Hu, Zhilin Wang, Minghui Xu, Xiuzhen Cheng

**Abstract:** Mobile crowdsensing (MCS) counting on the mobility of massive workers helps the requestor accomplish various sensing tasks with more flexibility and lower cost. However, for the conventional MCS, the large consumption of communication resources for raw data transmission and high requirements on data storage and computing capability hinder potential requestors with limited resources from using MCS. To facilitate the widespread application of MCS, we propose a novel MCS learning framework leveraging on blockchain technology and the new concept of edge intelligence based on federated learning (FL), which involves four major entities, including requestors, blockchain, edge servers and mobile devices as workers. Even though there exist several studies on blockchain-based MCS and blockchain-based FL, they cannot solve the essential challenges of MCS with respect to accommodating resource-constrained requestors or deal with the privacy concerns brought by the involvement of requestors and workers in the learning process. To fill the gaps, four main procedures, i.e., task publication, data sensing and submission, learning to return final results, and payment settlement and allocation, are designed to address major challenges brought by both internal and external threats, such as malicious edge servers and dishonest requestors. Specifically, a mechanism design based data submission rule is proposed to guarantee the data privacy of mobile devices being truthfully preserved at edge servers; consortium blockchain based FL is elaborated to secure the distributed learning process; and a cooperation-enforcing control strategy is devised to elicit full payment from the requestor. Extensive simulations are carried out to evaluate the performance of our designed schemes.

Fast Strain Estimation and Frame Selection in Ultrasound Elastography using Machine Learning arxiv:2110.08668 📈 1

Abdelrahman Zayed, Hassan Rivaz

**Abstract:** Ultrasound Elastography aims to determine the mechanical properties of the tissue by monitoring tissue deformation due to internal or external forces. Tissue deformations are estimated from ultrasound radio frequency (RF) signals and are often referred to as time delay estimation (TDE). Given two RF frames I1 and I2, we can compute a displacement image which shows the change in the position of each sample in I1 to a new position in I2. Two important challenges in TDE include high computational complexity and the difficulty in choosing suitable RF frames. Selecting suitable frames is of high importance because many pairs of RF frames either do not have acceptable deformation for extracting informative strain images or are decorrelated and deformation cannot be reliably estimated. Herein, we introduce a method that learns 12 displacement modes in quasi-static elastography by performing Principal Component Analysis (PCA) on displacement fields of a large training database. In the inference stage, we use dynamic programming (DP) to compute an initial displacement estimate of around 1% of the samples, and then decompose this sparse displacement into a linear combination of the 12 displacement modes. Our method assumes that the displacement of the whole image could also be described by this linear combination of principal components. We then use the GLobal Ultrasound Elastography (GLUE) method to fine-tune the result yielding the exact displacement image. Our method, which we call PCA-GLUE, is more than 10 times faster than DP in calculating the initial displacement map while giving the same result. Our second contribution in this paper is determining the suitability of the frame pair I1 and I2 for strain estimation, which we achieve by using the weight vector that we calculated for PCA-GLUE as an input to a multi-layer perceptron (MLP) classifier.

A theoretical and empirical study of new adaptive algorithms with additional momentum steps and shifted updates for stochastic non-convex optimization arxiv:2110.08531 📈 1

Cristian Daniel Alecsa

**Abstract:** In the following paper we introduce new adaptive algorithms endowed with momentum terms for stochastic non-convex optimization problems. We investigate the almost sure convergence to stationary points, along with a finite-time horizon analysis with respect to a chosen final iteration, and we also inspect the worst-case iteration complexity. An estimate for the expectation of the squared Euclidean norm of the gradient is given and the theoretical analysis that we perform is assisted by various computational simulations for neural network training.

AugmentedCode: Examining the Effects of Natural Language Resources in Code Retrieval Models arxiv:2110.08512 📈 1

Mehdi Bahrami, N. C. Shrikanth, Yuji Mizobuchi, Lei Liu, Masahiro Fukuyori, Wei-Peng Chen, Kazuki Munakata

**Abstract:** Code retrieval is allowing software engineers to search codes through a natural language query, which relies on both natural language processing and software engineering techniques. There have been several attempts on code retrieval from searching snippet codes to function codes. In this paper, we introduce Augmented Code (AugmentedCode) retrieval which takes advantage of existing information within the code and constructs augmented programming language to improve the code retrieval models' performance. We curated a large corpus of Python and showcased the the framework and the results of augmented programming language which outperforms on CodeSearchNet and CodeBERT with a Mean Reciprocal Rank (MRR) of 0.73 and 0.96, respectively. The outperformed fine-tuned augmented code retrieval model is published in HuggingFace at https://huggingface.co/Fujitsu/AugCode and a demonstration video is available at: https://youtu.be/mnZrUTANjGs .

What can we learn from universal Turing machines? arxiv:2110.08511 📈 1

Maurice Margenstern

**Abstract:** In the present paper, we construct what we call a pedagogical universal Turing machine. We try to understand which comparisons with biological phenomena can be deduced from its encoding and from its working.

Fast Projection onto the Capped Simplex with Applications to Sparse Regression in Bioinformatics arxiv:2110.08471 📈 0

Andersen Ang, Jianzhu Ma, Nianjun Liu, Kun Huang, Yijie Wang

**Abstract:** We consider the problem of projecting a vector onto the so-called k-capped simplex, which is a hyper-cube cut by a hyperplane. For an n-dimensional input vector with bounded elements, we found that a simple algorithm based on Newton's method is able to solve the projection problem to high precision with a complexity roughly about O(n), which has a much lower computational cost compared with the existing sorting-based methods proposed in the literature. We provide a theory for partial explanation and justification of the method. We demonstrate that the proposed algorithm can produce a solution of the projection problem with high precision on large scale datasets, and the algorithm is able to significantly outperform the state-of-the-art methods in terms of runtime (about 6-8 times faster than a commercial software with respect to CPU time for input vector with 1 million variables or more). We further illustrate the effectiveness of the proposed algorithm on solving sparse regression in a bioinformatics problem. Empirical results on the GWAS dataset (with 1,500,000 single-nucleotide polymorphisms) show that, when using the proposed method to accelerate the Projected Quasi-Newton (PQN) method, the accelerated PQN algorithm is able to handle huge-scale regression problem and it is more efficient (about 3-6 times faster) than the current state-of-the-art methods.

Next Page