Researcher
Audio Research Group
Tampere University
(previously known as Tampere University of Technology)
Email: firstname.lastname@tuni.fi
Office: Hervanta campus, TC316
A machine learning method for the automatic detection of pronunciation errors made by non-native speakers of English is proposed. It consists of training word-specific binary classifiers on a collected dataset of isolated words containing possible pronunciation errors typical of Finnish native speakers. The classifiers predict whether the typical error is present in a given word utterance. They operate on sequences of acoustic features extracted from consecutive frames of an audio recording of a word utterance. The proposed architecture includes a convolutional neural network, a recurrent neural network, or a combination of the two. The optimal topology and hyperparameters are obtained in a Bayesian optimisation setting using a tree-structured Parzen estimator. A dataset of 80 words uttered naturally by 120 speakers is collected. The performance of the proposed system, evaluated on a well-represented subset of the dataset, shows that it is capable of detecting pronunciation errors in most of the words (46/49) with high accuracy (mean accuracy gain over the zero rule of 12.21 percentage points).
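For illustration, below is a minimal Keras sketch of such a word-specific convolutional recurrent binary classifier. The input shape, layer sizes and the choice of log-mel-style framewise features are assumptions made for this example only; the paper's actual topology and hyperparameters were selected by Bayesian optimisation with a tree-structured Parzen estimator and are not reproduced here.

```python
# Minimal sketch (not the optimised configuration from the paper) of a
# convolutional recurrent binary classifier over framewise acoustic features.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_crnn(n_frames=100, n_bands=40):
    inputs = layers.Input(shape=(n_frames, n_bands, 1))
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D((1, 2))(x)            # pool along frequency only
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((1, 2))(x)
    x = layers.Reshape((n_frames, -1))(x)         # keep time steps for the RNN
    x = layers.GRU(64)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)  # error present / absent
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_crnn()
model.summary()
```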
@INPROCEEDINGS{Diment19_PL,
author={Diment, Aleksandr and Fagerlund, Eemi and Benfield, Adrian and Virtanen, Tuomas},
booktitle={Proceedings of the International Joint Conference on Neural Networks (IJCNN)},
title={Detection of Typical Pronunciation Errors in Non-native English Speech Using Convolutional Recurrent Neural Networks},
year={2019}
}
Each edition of the challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) contained several tasks involving sound event detection in different setups. DCASE 2017 presented participants with three such tasks, each having specific datasets and detection requirements: Task 2, in which target sound events were very rare in both training and testing data, Task 3, having overlapping events annotated in real-life audio, and Task 4, in which only weakly-labeled data was available for training. In this paper, we present the three tasks, including the datasets and baseline systems, and analyze the challenge entries for each task. We observe the popularity of methods using deep neural networks and the still widely used mel-frequency-based representations, with only a few approaches standing out as radically different. Analysis of the systems' behavior reveals that task-specific optimization plays a big role in producing good performance; however, this optimization often closely follows the ranking metric, and its maximization/minimization does not result in universally good performance. We also introduce the calculation of confidence intervals based on a jackknife resampling procedure, to perform statistical analysis of the challenge results. The analysis indicates that while the 95% confidence intervals for many systems overlap, there are significant differences in performance between the top systems and the baseline for all tasks.
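As an illustration of the jackknife-based confidence intervals mentioned above, here is a hedged numpy/scipy sketch of a standard leave-one-out jackknife 95% interval for an arbitrary evaluation metric; the resampling unit and the metrics used in the actual challenge analysis may differ.

```python
# Leave-one-out jackknife confidence interval for a metric (illustrative).
import numpy as np
from scipy import stats

def jackknife_ci(y_true, y_pred, metric, alpha=0.05):
    n = len(y_true)
    full = metric(y_true, y_pred)
    # leave-one-out replicates of the metric
    loo = np.array([metric(np.delete(y_true, i), np.delete(y_pred, i))
                    for i in range(n)])
    pseudo = n * full - (n - 1) * loo            # jackknife pseudo-values
    est = pseudo.mean()
    se = pseudo.std(ddof=1) / np.sqrt(n)
    t = stats.t.ppf(1 - alpha / 2, df=n - 1)
    return est, (est - t * se, est + t * se)

# Example with plain accuracy as the metric
acc = lambda t, p: np.mean(t == p)
y_true = np.array([0, 1, 1, 0, 1, 1, 0, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 0, 1, 1])
print(jackknife_ci(y_true, y_pred, acc))
```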
@article{Mesaros2019_TASLP,
author = "Mesaros, Annamaria and Diment, Aleksandr and Elizalde, Benjamin and Heittola, Toni and Vincent, Emmanuel and Raj, Bhiksha and Virtanen, Tuomas",
doi = "10.1109/TASLP.2019.2907016",
title = "Sound event detection in the {DCASE} 2017 {C}hallenge",
journal = "IEEE/ACM Transactions on Audio, Speech, and Language Processing",
note = "in press",
year = "2019",
keywords = "Sound event detection, weak labels, pattern recognition, jackknife estimates, confidence intervals",
}
In this paper we propose a method for the separation of moving sound sources. The method is based on first tracking the sources, then estimating the source spectrograms using multichannel non-negative matrix factorization (NMF), and finally extracting the sources from the mixture by single-channel Wiener filtering. We propose a novel multichannel NMF model with time-varying mixing of the sources, represented by spatial covariance matrices (SCMs), and provide update equations for optimizing the model parameters by minimizing the squared Frobenius norm. The SCMs of the model are obtained from the estimated directions of arrival of the tracked sources at each time frame. The evaluation is based on established objective separation criteria and uses real recordings of two and three simultaneously moving sound sources. The compared methods include conventional beamforming and ideal ratio mask separation. The proposed method is shown to exceed the separation quality of the other evaluated blind approaches according to all measured quantities. Additionally, we evaluate the method's susceptibility to tracking errors by comparing against the separation quality achieved using annotated ground truth source trajectories.
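The final extraction step, single-channel Wiener filtering of the mixture with the estimated source spectrograms, can be sketched as follows. The tracking and the multichannel NMF that would produce the per-source power spectrograms are assumed to be done elsewhere, and the STFT parameters are illustrative.

```python
# Wiener-filter extraction of sources from a mixture, given estimated
# per-source power spectrograms of the same shape as the mixture STFT.
import numpy as np
import librosa

def wiener_separate(mixture, source_power_specs, n_fft=2048, hop=512):
    """mixture: 1-D signal; source_power_specs: list of (freq, time) arrays."""
    X = librosa.stft(mixture, n_fft=n_fft, hop_length=hop)
    total = sum(source_power_specs) + 1e-12
    separated = []
    for P in source_power_specs:
        mask = P / total                 # Wiener gain per time-frequency bin
        separated.append(librosa.istft(mask * X, hop_length=hop,
                                       length=len(mixture)))
    return separated
```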
@ARTICLE{Nikunen18,
author={J. {Nikunen} and A. {Diment} and T. {Virtanen}},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
title={Separation of Moving Sound Sources Using Multichannel NMF and Acoustic Tracking},
year={2018},
volume={26},
number={2},
pages={281-295},
keywords={acoustic signal processing;blind source separation;covariance matrices;direction-of-arrival estimation;matrix decomposition;optimisation;source separation;time-varying filters;Wiener filters;acoustic tracking;single-channel Wiener filtering;spatial covariance matrices;ideal ratio mask separation;annotated ground truth source trajectories;source spectrogram estimation;multichannel nonnegative matrix factorization model;multichannel NMF model;source tracking;moving sound source separation;time-varying source mixing;SCM;directions of arrival estimation;objective separation criteria;beamforming;optimisation model;Acoustics;Direction-of-arrival estimation;Microphones;Mathematical model;Estimation;Spectrogram;Array signal processing;Sound source separation;moving sound sources;time-varying mixing model;microphone arrays;acoustic source tracking},
doi={10.1109/TASLP.2017.2774925},
ISSN={2329-9290},
month={Feb},
}
Many machine learning tasks have been shown to be solvable with impressive levels of success given large amounts of training data and computational power. For tasks that lack sufficient data to achieve high performance, transfer learning can be applied: the new task is performed with prior knowledge of the nature of the data, gained by first performing a different task for which training data is abundant. Having previously been shown successful for machine vision and natural language processing, transfer learning is investigated in this work for audio analysis. We propose to solve the problem of weak label classification (tagging) with small amounts of training data by transferring abstract knowledge about the nature of audio data from another tagging task. Three neural network architectures are proposed and evaluated, showing notable classification accuracy gains.
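The general transfer scheme can be sketched in Keras as below: layers trained on a large tagging task are frozen and only a new sigmoid output layer is trained on the small target tagging task. The checkpoint name, tag count and layer indexing are hypothetical, and this is not the exact architecture evaluated in the paper.

```python
# Hedged sketch of feature transfer for audio tagging (multi-label).
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.models.load_model("source_tagger.h5")   # hypothetical checkpoint
for layer in base.layers[:-1]:
    layer.trainable = False                              # freeze transferred layers

n_target_tags = 8                                        # size of the small task
features = base.layers[-2].output                        # penultimate representation
outputs = layers.Dense(n_target_tags, activation="sigmoid",
                       name="target_tags")(features)
target_model = models.Model(base.input, outputs)
target_model.compile(optimizer="adam", loss="binary_crossentropy")
```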
@INPROCEEDINGS{Diment17_TL,
author={Diment, Aleksandr and Virtanen, Tuomas},
booktitle={Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017 IEEE Workshop on},
title={Transfer Learning of Weakly Labelled Audio},
year={2017},
month={Oct},
keywords={transfer learning, tagging, weak labels, audio}
}
@book{DCASE17_book_bib,
title = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
author = "Tuomas Virtanen and Annamaria Mesaros and Toni Heittola and Aleksandr Diment and Emmanuel Vincent and Emmanouil Benetos and Elizalde, {Benjamin Martinez}",
year = "2017",
month = "11",
publisher = "Tampere University of Technology. Laboratory of Signal Processing"
}
The DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation, using a multilayer perceptron and log mel-band energies, but differ in the structure of the output layer and the decision-making process, as well as in the evaluation of the system output using task-specific metrics.
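A hedged sketch of a baseline-style front end and classifier, framewise log mel-band energies fed to a small multilayer perceptron, is given below; the parameter values are illustrative and do not reproduce the official baseline configuration.

```python
# Framewise log mel-band energies + MLP classifier (illustrative parameters).
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def log_mel_energies(path, n_mels=40, n_fft=2048, hop=1024):
    y, sr = librosa.load(path, sr=44100)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(mel).T            # (frames, n_mels)

# X: stacked frames from the training files, y: per-frame labels
# clf = MLPClassifier(hidden_layer_sizes=(50, 50), max_iter=200).fit(X, y)
# Frame-level predictions are then aggregated (e.g. majority vote) per file.
```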
@inproceedings{DCASE2017challenge,
Author = "Mesaros, A. and Heittola, T. and Diment, A. and Elizalde, B. and Shah, A. and Badlani, R. and Vincent, E. and Raj, B. and Virtanen, T.",
title = "{DCASE} 2017 Challenge Setup: Tasks, Datasets and Baseline System",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
month = "November",
year = "2017",
keywords = "Sound scene analysis, Acoustic scene classification, Sound event detection, Audio tagging, Rare sound events"
}
This paper presents a novel application of convolutional neural networks (CNNs) for the task of acoustic scene classification (ASC). We propose the use of a CNN trained to classify short sequences of audio, represented by their log-mel spectrogram. We also introduce a training method that can be used under particular circumstances in order to make full use of small datasets. The proposed system is tested and evaluated on three different ASC datasets and compared to other state-of-the-art systems which competed in the “Detection and Classification of Acoustic Scenes and Events” (DCASE) challenges held in 2016 and 2013. The best accuracy scores obtained by our system on the DCASE 2016 datasets are 79.0% (development) and 86.2% (evaluation), which constitute 6.4% and 9% improvements with respect to the baseline system. Finally, when tested on the DCASE 2013 evaluation dataset, the proposed system reaches 77.0% accuracy, improving the challenge winner's score by 1%.
@INPROCEEDINGS{Valenti17,
AUTHOR="Michele Valenti and Aleksandr Diment and Giambattista Parascandolo and Stefano Squartini and Tuomas Virtanen",
TITLE="{DCASE} 2016 Acoustic Scene Classification Using Convolutional Neural Networks",
BOOKTITLE="Proc. of the 2017 International Joint Conference on Neural Networks",
ADDRESS="Anchorage, Alaska, USA",
DAYS="14-19",
MONTH=may,
YEAR=2017,
}
This workshop paper presents our contribution to the task of acoustic scene classification proposed for the "detection and classification of acoustic scenes and events" (D-CASE) 2016 challenge. We propose the use of a convolutional neural network trained to classify short sequences of audio, represented by their log-mel spectrogram. In addition, we use a training method that can be applied when the validation performance of the system saturates as the training proceeds. The performance is evaluated on the public acoustic scene classification development dataset provided for the D-CASE challenge. The best accuracy score obtained by our configuration on a four-fold cross-validation setup is 79.0%, which constitutes an 8.8% relative improvement with respect to the baseline system, based on a Gaussian mixture model classifier.
@INPROCEEDINGS{Valenti16_DCASE,
AUTHOR="Michele Valenti and Aleksandr Diment and Giambattista Parascandolo and Stefano Squartini and Tuomas Virtanen",
TITLE="{DCASE} 2016 Acoustic Scene Classification Using Convolutional Neural Networks",
BOOKTITLE="Proc. of the 2016 Workshop on Detection and Classification of Acoustic Scenes and Events",
ADDRESS="Budapest, Hungary",
DAYS=3,
MONTH=sep,
YEAR=2016,
KEYWORDS="Acoustic scene classification, convolutional neural networks, DCASE, computational audio processing"
}
Detection of whispered speech in the presence of high levels of background noise has applications in fraudulent behaviour recognition; for instance, it can serve as an indicator of possible insider trading. We propose a deep neural network (DNN)-based whispering detection system, which operates on both magnitude and phase features, including the group delay feature from all-pole models (APGD). We show that the APGD feature outperforms the conventional ones. Trained and evaluated on a diverse collected dataset of whispered and normal speech with emulated phone line distortions and significant amounts of added background noise, the proposed system achieves accuracies as high as 91.8%.
@INPROCEEDINGS{Diment16_WHI,
AUTHOR="Aleksandr Diment and Mikko Parviainen and Tuomas Virtanen and Roman Zelov
and Alex Glasman",
TITLE="{Noise-Robust} Detection of Whispering in Telephone Calls Using Deep Neural
Networks",
BOOKTITLE="2016 24th European Signal Processing Conference (EUSIPCO) (EUSIPCO 2016)",
ADDRESS="Budapest, Hungary",
DAYS=27,
MONTH=aug,
YEAR=2016,
KEYWORDS="whispering, noise robustness, deep neural networks"
}
This paper proposes dictionary learning with archetypes for audio processing. Archetypes are so-called pure types: each is a combination of a few data points, and the data points can in turn be approximated as combinations of archetypes. The concept has been found useful in various problems, but it has not yet been applied to audio analysis. The algorithm performs archetypal analysis that minimises the generalised Kullback-Leibler divergence between an observation and the model, a divergence shown to be suitable for audio. The methodology is evaluated in a source separation scenario (mixtures of speech) and shows results comparable to the state of the art, with perceptual measures indicating its superiority over all of the competing methods in the case of medium-size dictionaries.
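The model structure can be sketched as follows: archetypes are convex combinations of data points (Z = XB), observations are approximated as convex combinations of archetypes (X ≈ ZA), and the fit is measured with the generalised Kullback-Leibler divergence. The dedicated update rules of the paper are not reproduced; the random, simplex-normalised coefficient matrices below only illustrate the constraints and the objective.

```python
# Archetypal dictionary model and generalised KL objective (illustration only).
import numpy as np

def generalised_kl(X, X_hat, eps=1e-12):
    X, X_hat = X + eps, X_hat + eps
    return np.sum(X * np.log(X / X_hat) - X + X_hat)

rng = np.random.default_rng(0)
X = np.abs(rng.random((257, 100)))          # magnitude spectrogram (freq x frames)
k = 10                                      # dictionary (archetype) size

# stochastic (column-simplex) coefficient matrices, here random for illustration
B = rng.random((100, k)); B /= B.sum(axis=0, keepdims=True)
A = rng.random((k, 100)); A /= A.sum(axis=0, keepdims=True)

Z = X @ B                                   # archetypes lie in the data's convex hull
print(generalised_kl(X, Z @ A))             # objective the updates would minimise
```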
@INPROCEEDINGS{Diment15_AA,
author={Diment, Aleksandr and Virtanen, Tuomas},
booktitle={Applications of Signal Processing to Audio and Acoustics (WASPAA), 2015 IEEE Workshop on},
title={Archetypal analysis for audio dictionary learning},
year={2015},
month={Oct},
keywords={archetypes, audio analysis, non-negative matrix factorisation, sparse representation}
}
This paper proposes a method for binaural reconstruction of a sound scene captured with a portable-sized array consisting of several microphones. The proposed processing separates the scene into a sum of a small number of sources, and the spectrogram of each of these is in turn represented as a small number of latent components. The direction of arrival (DOA) of each source is estimated, followed by binaural rendering of each source at its estimated direction. For representing the sources, the proposed method uses low-rank complex-valued non-negative matrix factorization combined with a DOA-based spatial covariance matrix model. The binaural reconstruction is achieved by applying the binaural cues (head-related transfer function) associated with the estimated source DOA to the separated source signals. The binaural rendering quality of the proposed method was evaluated using a speech intelligibility test. The test results indicated that, in a three-speaker scenario, the proposed binaural rendering improved the intelligibility of speech over stereo recordings and over separation by a minimum variance distortionless response beamformer with the same binaural synthesis. An additional listening test evaluating the subjective quality of the rendered output indicated no added processing artifacts by the proposed method in comparison to the unprocessed stereo recording.
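The final rendering step, applying the binaural cues of the estimated source directions to the separated signals, can be sketched as below; the HRIR pairs and the separated sources are assumed to be equal-length arrays obtained elsewhere.

```python
# Binaural rendering by convolving each separated source with the HRIR pair
# of its estimated direction and summing per ear (illustrative sketch).
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(sources, hrirs):
    """sources: equal-length 1-D signals; hrirs: list of (left, right) HRIR
    pairs of equal length, one pair per source."""
    left = sum(fftconvolve(s, h_l) for s, (h_l, h_r) in zip(sources, hrirs))
    right = sum(fftconvolve(s, h_r) for s, (h_l, h_r) in zip(sources, hrirs))
    return np.stack([left, right])           # (2, samples) binaural output
```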
@article{Nikunen2015,
title = "Binaural rendering of microphone array captures based on source separation",
journal = "Speech Communication",
year = "2015",
issn = "0167-6393",
doi = "http://dx.doi.org/10.1016/j.specom.2015.09.005",
url = "http://www.sciencedirect.com/science/article/pii/S0167639315001004",
author = "Joonas Nikunen and Aleksandr Diment and Tuomas Virtanen and Miikka Vilermo",
keywords = "Binaural processing, Source separation, Speech intelligibility, Non-negative matrix factorization",
}
A feature based on the group delay function from all-pole models (APGD) is proposed for environmental sound event recognition. The commonly used spectral features take into account merely the magnitude information, whereas the phase is overlooked due to the complications related to its interpretation. Additional information concealed in the phase is hypothesised to be beneficial for sound event recognition. The APGD is an approach to inferring phase information, which has shown applicability for speech and music analysis and is now studied in environmental audio. The evaluation is performed within a multi-label deep neural network (DNN) framework on a diverse real-life dataset of environmental sounds. It shows performance improvement compared to the baseline log mel-band energy case. Combined with the magnitude-based features, APGD demonstrates further improvement.
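A hedged sketch of an APGD-style feature for a single frame is shown below: an all-pole (LPC) model is fitted and the group delay of the resulting filter 1/A(z) is taken as the feature vector. The model order, number of frequency bins and framing are illustrative and may differ from the configuration used in the paper.

```python
# All-pole group delay for one windowed frame (illustrative parameters).
import numpy as np
import librosa
from scipy.signal import group_delay

def apgd_frame(frame, order=20, n_bins=256):
    a = librosa.lpc(frame.astype(float), order=order)   # denominator A(z)
    _, gd = group_delay(([1.0], a), w=n_bins)            # group delay of 1/A(z)
    return gd

# toy frame: a sinusoid with a little noise, Hann-windowed
rng = np.random.default_rng(0)
frame = np.sin(2 * np.pi * 220 * np.arange(1024) / 16000)
frame = (frame + 0.01 * rng.standard_normal(1024)) * np.hanning(1024)
print(apgd_frame(frame).shape)
```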
@INPROCEEDINGS{Diment15_APGD4events,
AUTHOR="Aleksandr Diment and Emre Cakir and Toni Heittola and Tuomas Virtanen",
TITLE="Automatic recognition of environmental sound events using all-pole group
delay features",
BOOKTITLE="European Signal Processing Conference 2015 (EUSIPCO 2015)",
ADDRESS="Nice, France",
DAYS=31,
MONTH=aug,
YEAR=2015,
KEYWORDS="Phase spectrum; sound event recognition; audio classification; neural
networks"
}
In this work, a feature based on the group delay function from all-pole models (APGD) is proposed for pitched musical instrument recognition. Conventionally, spectrum-related features take into account merely the magnitude information, whereas the phase is often overlooked due to the complications related to its interpretation. However, there is often additional information concealed in the phase, which could be beneficial for recognition. The APGD is an elegant approach to inferring phase information, which avoids the issues related to interpreting the phase and does not require extensive parameter adjustment. Having shown its applicability for speech-related problems, it is now explored in terms of instrument recognition. The evaluation is performed with various instrument sets and shows noteworthy absolute accuracy gains of up to 7% compared to the baseline mel-frequency cepstral coefficients (MFCCs) case. Combined with the MFCCs and with feature selection, APGD demonstrates superiority over the baseline with all the evaluated sets.
@incollection{Diment14_APGD,
year={2014},
isbn={978-3-319-12975-4},
booktitle={Sound, Music, and Motion},
series={Lecture Notes in Computer Science},
editor={Aramaki, Mitsuko and Derrien, Olivier and Kronland-Martinet, Richard and Ystad, Sølvi},
doi={10.1007/978-3-319-12976-1_37},
title={Group Delay Function from All-Pole Models for Musical Instrument Recognition},
url={http://dx.doi.org/10.1007/978-3-319-12976-1_37},
publisher={Springer International Publishing},
keywords={Musical instrument recognition; Music information retrieval; All-pole group delay feature; Phase spectrum},
author={Diment, Aleksandr and Rajan, Padmanabhan and Heittola, Toni and Virtanen, Tuomas},
pages={606-618},
language={English}
}
In this work, the modified group delay feature (MODGDF) is proposed for pitched musical instrument recognition. Conventionally, the spectrum-related features used in instrument recognition take into account merely the magnitude information, whereas the phase is often overlooked due to the complications related to its interpretation. However, there is often additional information concealed in the phase, which could be beneficial for recognition. The MODGDF is a method of incorporating phase information which avoids the issues related to phase unwrapping. Having shown its applicability for speech-related problems, it is now explored in terms of musical instrument recognition. The evaluation is performed on separate note recordings of various instrument sets, and combined with the conventional mel-frequency cepstral coefficients (MFCCs), the MODGDF shows noteworthy absolute accuracy gains of up to 5.1% compared to the baseline MFCCs case.
@INPROCEEDINGS{Diment13_MODGDF,
AUTHOR = "Aleksandr Diment and Rajan Padmanabhan and Toni Heittola and Tuomas Virtanen",
TITLE = "Modified Group Delay Feature for Musical Instrument Recognition",
BOOKTITLE = "10th International Symposium on Computer Music Multidisciplinary Research (CMMR)",
ADDRESS = "Marseille, France",
DAYS = 15,
MONTH = oct,
YEAR = 2013,
KEYWORDS = "Musical instrument recognition; music information retrieval; modified group delay feature; phase spectrum"
}
We present a sound event detection system based on hidden Markov models. The system is evaluated with the development material provided in the AASP Challenge on Detection and Classification of Acoustic Scenes and Events. Two approaches using the same basic detection scheme are presented. The first, developed for acoustic scenes with non-overlapping sound events, is evaluated with the Office Live development dataset. The second, developed for acoustic scenes with some degree of overlapping sound events, is evaluated with the Office Synthetic development dataset.
@TECHREPORT{Diment13_AASP,
AUTHOR = "Aleksandr Diment and Toni Heittola and Tuomas Virtanen",
TITLE = "Sound Event Detection for Office Live and Office Synthetic AASP Challenge",
YEAR = 2013,
KEYWORDS = "Sound event detection"
}
Birds have been widely used as biological indicators for ecological research. They respond quickly to environmental changes and can be used to make inferences about other organisms (e.g., the insects they feed on). Traditional methods for collecting data about birds involve costly human effort. A promising alternative is acoustic monitoring. There are many advantages to recording audio of birds compared to human surveys, including increased temporal and spatial resolution and extent, applicability in remote sites, reduced observer bias, and potentially lower cost. However, it is an open problem for signal processing and machine learning to reliably identify bird sounds in real-world audio data collected in an acoustic monitoring scenario. Some of the major challenges include multiple simultaneously vocalizing birds, other sources of non-bird sound (e.g., buzzing insects), and background noise like wind, rain, and motor vehicles.
@INPROCEEDINGS{6661934,
author={Briggs, F. and Yonghong Huang and Raich, R. and Eftaxias, K. and Zhong Lei and Cukierski, W. and Hadley, S.F. and Hadley, A. and Betts, M. and Fern, X.Z. and Irvine, J. and Neal, L. and Thomas, A. and Fodor, G. and Tsoumakas, G. and Hong Wei Ng and Thi Ngoc Tho Nguyen and Huttunen, H. and Ruusuvuori, P. and Manninen, T. and Diment, A. and Virtanen, T. and Marzat, J. and Defretin, J. and Callender, D. and Hurlburt, C. and Larrey, K. and Milakov, M.},
booktitle={Machine Learning for Signal Processing (MLSP), 2013 IEEE International Workshop on},
title={The 9th annual MLSP competition: New methods for acoustic classification of multiple simultaneous bird species in a noisy environment},
year={2013},
month={Sept},
pages={1-8},
keywords={acoustic signal processing;audio recording;audio signal processing;ecology;learning (artificial intelligence);signal resolution;zoology;MLSP competition;acoustic classification;acoustic monitoring scenario;audio recording;background noise;biological indicators;bird sounds;ecological research;human effort;human surveys;machine learning;multiple simultaneous bird species;noisy environment;nonbird sound;real-world audio data;reduced observer bias;remote sites;signal processing;spatial resolution;temporal resolution;Birds;Histograms;Image segmentation;Noise;Rain;Spectrogram;Vectors},
doi={10.1109/MLSP.2013.6661934},
ISSN={1551-2541},
}
In this work, semi-supervised learning (SSL) techniques are explored in the context of musical instrument recognition. Conventional supervised approaches rely on annotated data to train the classifier, which implies costly manual annotation of the training data. SSL methods enable utilising additional unannotated data, which is significantly easier to obtain, allowing the overall development cost to be maintained at the same level while notably improving the performance. The implemented classifier incorporates a Gaussian mixture model-based SSL scheme utilising an iterative EM-based algorithm, as well as extensions facilitating a simpler convergence criterion. The evaluation is performed on a set of nine instruments while training on a dataset in which the relative size of the labelled data is as little as 15%. It yields a noteworthy absolute performance gain of 16% compared to the performance of the initial supervised models.
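A simplified sketch of the idea follows; it uses hard self-training assignments with scikit-learn's GaussianMixture instead of the soft EM formulation with labelled-data weighting described above, so it only illustrates how the unannotated data enters the training loop.

```python
# Hard-assignment (self-training) variant of GMM-based semi-supervised
# learning; assumes each class has enough labelled samples to fit a GMM.
import numpy as np
from sklearn.mixture import GaussianMixture

def ssl_gmm(X_lab, y_lab, X_unlab, n_classes, n_components=4, n_iter=5):
    models = [GaussianMixture(n_components, covariance_type="diag",
                              random_state=0).fit(X_lab[y_lab == c])
              for c in range(n_classes)]
    for _ in range(n_iter):
        scores = np.stack([m.score_samples(X_unlab) for m in models], axis=1)
        pseudo = scores.argmax(axis=1)             # most likely class per sample
        for c in range(n_classes):
            X_c = np.vstack([X_lab[y_lab == c], X_unlab[pseudo == c]])
            models[c] = GaussianMixture(n_components, covariance_type="diag",
                                        random_state=0).fit(X_c)
    return models
```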
@INPROCEEDINGS{Diment13_SSL,
AUTHOR = "Aleksandr Diment and Toni Heittola and Tuomas Virtanen",
TITLE = "Semi-supervised Learning for Musical Instrument Recognition",
BOOKTITLE = "21st European Signal Processing Conference 2013 (EUSIPCO 2013)",
ADDRESS = "Marrakech, Morocco",
DAYS = 9,
MONTH = sep,
YEAR = 2013,
KEYWORDS = "Music information retrieval; musical instrument recognition; semi-supervised learning"
}
The application areas of music information retrieval have been gaining popularity over the last decades. Musical instrument recognition is an example of a specific research topic in the field. In this thesis, semi-supervised learning techniques are explored in the context of musical instrument recognition. The conventional approaches to musical instrument recognition rely on annotated data, i.e. example recordings of the target instruments with associated information about the target labels, in order to perform training. This implies highly laborious and tedious manual annotation of the collected training data. Semi-supervised methods enable incorporating additional unannotated data into training. Such data consists of merely the recordings of the instruments and is therefore significantly easier to acquire. Hence, these methods allow keeping the overall development cost at the same level while notably improving the performance of a system.
The implemented musical instrument recognition system utilises a mixture model semi-supervised learning scheme in the form of two EM-based algorithms. Furthermore, upgraded versions, namely additional labelled data weighting and class-wise retraining, are proposed for improved performance and convergence criteria in the particular classification scenario. The evaluation is performed on sets consisting of four and ten instruments and yields overall average recognition accuracies of 95.3% and 68.4%, respectively. These correspond to absolute gains of 6.1% and 9.7% compared to the initial, purely supervised cases. Additional experiments are conducted on the effects of the proposed modifications, as well as on the optimal relative labelled dataset size. Overall, the obtained performance improvement is quite noteworthy, and future research directions include investigating the behaviour of the implemented algorithms along with the proposed and further extended approaches.
@MastersThesis{Diment13_MSc,
author = {Aleksandr Diment},
title = {Semi-supervised musical instrument recognition},
school = {Tampere University of Technology},
address = {Finland},
year = {2013},
}
In this bachelor's thesis, a dataset consisting of samples of various drums and percussion is built, and a method for visualizing it is described. An insight into the process of planning, recording and labeling an audio dataset is provided, and a neural network embedding based visualization technique is then applied to the dataset. The dataset visualization method consists of two parts. First, high-dimensional embeddings produced by OpenL3, a pre-trained deep convolutional neural network model, are obtained. Then, the embeddings are projected onto a 2-dimensional plane with the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm. Promising results are presented, as the method is able to separate different drum types and articulations into distinct clusters in the 2D plane; moreover, it is able to recognize dynamic variations inside the classes themselves and organize the samples accordingly. Additionally, Python programs are written to assist in the dataset labeling and visualization tasks, and the code and the dataset, alongside pre-computed embeddings, are published online.
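A sketch of the described pipeline, per-sample OpenL3 embeddings projected to 2-D with t-SNE, is given below. The file names, content type and embedding size are illustrative assumptions, not necessarily those used in the thesis.

```python
# OpenL3 embeddings averaged per sample, then t-SNE projection to 2-D.
import numpy as np
import librosa
import openl3
from sklearn.manifold import TSNE

def embed(path):
    audio, sr = librosa.load(path, sr=48000)
    emb, _ = openl3.get_audio_embedding(audio, sr, content_type="music",
                                        embedding_size=512)
    return emb.mean(axis=0)                  # one embedding vector per sample

paths = ["kick_01.wav", "snare_01.wav", "hihat_01.wav"]   # hypothetical files
X = np.stack([embed(p) for p in paths])
points = TSNE(n_components=2, perplexity=2).fit_transform(X)
print(points)                                # 2-D coordinates for plotting
```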
@thesis{Nieminen20_BSc,
author = {Elias Nieminen},
title = {Building and Visualizing a Percussion Dataset Using Deep Audio Embeddings and Dimensionality Reduction},
school = {Tampere University},
address = {Finland},
year = {2020},
type={Bachelor's Thesis}
}
Having large amounts of training data is necessary for the ever more popular neural networks to perform reliably. Data augmentation, i.e. the act of creating additional training data by performing label-preserving transformations on existing training data, is an efficient solution to this problem. While increasing the amount of data, introducing variations to the data via the transformations also has the power to make machine learning models more robust in real-life conditions with noisy environments and mismatches between the training and test data.
In this thesis, data augmentation techniques in audio analysis are reviewed, and a tool for audio data augmentation (TADA) is presented. TADA is capable of performing three audio data augmentation techniques, which are convolution with mobile device microphone impulse responses, convolution with room impulse responses, and addition of background noises. TADA is evaluated by using it in a pronunciation error classification task, where typical pronunciation errors of Finnish people uttering English words are classified. All the techniques are tested first individually and then also in combination.
The experiments are executed with both original and augmented data. In all experiments, using TADA improves the performance of the classifier when compared to training with only original data. Robustness against unseen devices and rooms also improves. Additional gain from performing combined augmentation starts to saturate only after augmenting the training data to 30 times the original amount. Based on the positive impact of TADA for the classification task, it is found that data augmentation with convolutional and additive noises is an effective combination for increasing robustness against environmental distortions and channel effects.
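A minimal sketch of the two convolutional augmentations and the additive-noise augmentation described above is given below, assuming 1-D numpy signals at a common sample rate; this illustrates the general technique rather than the TADA implementation.

```python
# Convolution with a (device or room) impulse response, and additive noise
# mixed in at a chosen SNR (illustrative augmentation sketch).
import numpy as np
from scipy.signal import fftconvolve

def convolve_ir(signal, impulse_response):
    out = fftconvolve(signal, impulse_response)[: len(signal)]
    return out / (np.max(np.abs(out)) + 1e-12)      # peak-normalise the result

def add_noise(signal, noise, snr_db):
    noise = noise[: len(signal)]
    p_sig = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10.0)))
    return signal + gain * noise
```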
@thesis{Eklund19_MSc,
author = {Ville-Veikko Eklund},
title = {Data Augmentation Techniques for Robust Audio Analysis},
school = {Tampere University},
address = {Finland},
year = {2019},
type={Master's Thesis}
}
In recent years, deep learning based approaches have dominated different types of classification problems. Usually these approaches require large amounts of training data to train a model capable of generalizing to unseen data of the same type. However, in some applications it might be difficult to gather training data efficiently, and it would be beneficial to classify new samples using only a few or even a single training example.
For us humans, knowledge from previously learned concepts is relatively easy to transfer to unfamiliar concepts; therefore, many researchers have experimented with this idea in machine learning classification tasks. The idea of using only a single labelled example to classify unseen data is known as one-shot learning and has been successful especially in the field of computer vision. Many of the modern approaches to one-shot learning utilize a special neural network architecture called a siamese network. This architecture can be trained to predict similarities between inputs and can be used for a metric-based approach to one-shot learning. Siamese networks have been used for different audio related tasks before; however, their usage in one-shot learning for audio classification has received less attention compared to computer vision.
The purpose of this thesis is to extend the idea of one-shot learning to environmental audio classification and see whether this approach is feasible. The proposed system was trained and evaluated on the ESC dataset, consisting of 50 different environmental audio categories. The final one-shot evaluation was done on 5 completely unseen classes, using only a single example of each class when performing the classification. The results show that convolutional siamese networks are indeed a valid approach to the difficult one-shot classification task for environmental audio.
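A hedged Keras sketch of a convolutional siamese network of the kind described above is shown below; the input shape and layer sizes are illustrative, not those of the thesis.

```python
# Siamese network: a shared convolutional encoder embeds two spectrograms,
# and a sigmoid head scores their similarity (illustrative sketch).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_encoder(input_shape=(128, 128, 1)):
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"), layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation="relu"), layers.MaxPooling2D(2),
        layers.Flatten(), layers.Dense(128, activation="relu"),
    ])

encoder = build_encoder()
in_a = layers.Input((128, 128, 1))
in_b = layers.Input((128, 128, 1))
emb_a, emb_b = encoder(in_a), encoder(in_b)          # shared weights
distance = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([emb_a, emb_b])
similarity = layers.Dense(1, activation="sigmoid")(distance)
siamese = models.Model([in_a, in_b], similarity)
siamese.compile(optimizer="adam", loss="binary_crossentropy")
# At one-shot test time, a query is compared against one example per class
# and assigned to the class with the highest similarity score.
```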
@thesis{Honka19_BSc,
author = {Tapio Honka},
title = {One-shot Learning with Siamese Networks for Environmental Audio},
school = {Tampere University},
address = {Finland},
year = {2019},
type={Bachelor's Thesis}
}
The goal of this thesis was to build an easily accessible and approachable environmental audio dataset for benchmarking machine learning algorithms. The samples in the created dataset were collected from Freesound, which is an online database consisting of sounds uploaded by its users. The dataset consists of three classes, each containing 50 instances of original recordings with annotations for 5-second segments. The classes are 'dog', 'bird' and 'rain', and the dataset was therefore given the name DBR dataset. Additionally, a script for evaluating support vector, random forest and k-NN classifiers with the dataset was written. The building and evaluation stages were documented step-by-step and a Jupyter notebook tutorial for using the dataset was created. The dataset, the scripts and the notebook were published online.
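A sketch of such an evaluation script is shown below, comparing a support vector machine, a random forest and a k-NN classifier with scikit-learn; the mean-MFCC features and parameter values are illustrative assumptions rather than the published setup.

```python
# Comparing three classifiers on clip-level features (illustrative sketch).
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def mfcc_features(path):
    y, sr = librosa.load(path, sr=22050)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)

# X: (n_clips, 20) feature matrix, y: labels ('dog', 'bird', 'rain')
# for clf in [SVC(), RandomForestClassifier(), KNeighborsClassifier()]:
#     print(type(clf).__name__, cross_val_score(clf, X, y, cv=5).mean())
```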
@thesis{Eklund18_BSc,
author = {Ville-Veikko Eklund},
title = {Audio dataset creation},
school = {Tampere University of Technology},
address = {Finland},
year = {2018},
type={Bachelor's Thesis}
}
In this thesis we investigate the use of deep neural networks applied to the field of computational audio scene analysis, in particular to acoustic scene classification. This task concerns the recognition of an acoustic scene, like a park or a home, performed by an artificial system. In our work we examine the use of deep models, aiming to contribute to one of their use cases which is, in our opinion, among the most poorly explored. The neural architecture we propose in this work is a convolutional neural network specifically designed to work on a time-frequency audio representation known as the log-mel spectrogram. The network output is an array of prediction scores, each of which is associated with one class of a set of 15 predefined classes. In addition, the architecture features batch normalization, a recently proposed regularization technique used to enhance the network performance and to speed up its training.
We also investigate the use of different audio sequence lengths as the classification unit for our network. Thanks to these experiments we observe that, for our artificial system, the recognition of long sequences is not easier than that of medium-length sequences, hence highlighting a counterintuitive behaviour. Moreover, we introduce a training procedure which aims to make the best of small datasets by using all the labeled data available for the network training. This procedure, possible under particular circumstances, constitutes a trade-off between an accurate training stop and an increased data representation available to the network. Finally, we compare our model to other systems, showing that its recognition ability can outperform both other neural architectures and other state-of-the-art statistical classifiers, like support vector machines and Gaussian mixture models. The proposed system reaches good accuracy scores on two different databases collected in 2013 and 2016. The best accuracy scores, obtained according to two cross-validation setups, are 77% and 79% respectively. These scores constitute 22% and 6.1% accuracy increments with respect to the corresponding baselines published together with the datasets.
@MastersThesis{Valenti16_MSc_bib,
author = {Michele Valenti},
title = {Convolutional Neural Networks for Acoustic Scene Classification},
school = {Tampere University of Technology},
address = {Finland},
year = {2016},
}