Researcher
Audio Research Group
Tampere University
(previously known as Tampere University of Technology)
Email: firstname.lastname@tuni.fi
Office: Hervanta campus, TC316
A machine learning method for the automatic detection of pronunciation errors made by non-native speakers of English is proposed. It consists of training word-specific binary classifiers on a collected dataset of isolated words containing possible pronunciation errors typical of Finnish native speakers. The classifiers predict whether the typical error is present in a given word utterance. They operate on sequences of acoustic features extracted from consecutive frames of an audio recording of a word utterance. The proposed architecture includes a convolutional neural network, a recurrent neural network, or a combination of the two. The optimal topology and hyperparameters are obtained in a Bayesian optimisation setting using a tree-structured Parzen estimator. A dataset of 80 words uttered naturally by 120 speakers is collected. The performance of the proposed system, evaluated on a well-represented subset of the dataset, shows that it is capable of detecting pronunciation errors in most of the words (46/49) with high accuracy (mean accuracy gain over the zero rule of 12.21 percentage points).
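For illustration, below is a minimal Keras sketch of such a word-specific convolutional recurrent binary classifier. The input shape, layer sizes and the choice of log-mel-style framewise features are assumptions made for this example only; the paper's actual topology and hyperparameters were selected by Bayesian optimisation with a tree-structured Parzen estimator and are not reproduced here.

```python
# Minimal sketch (not the optimised configuration from the paper) of a
# convolutional recurrent binary classifier over framewise acoustic features.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_crnn(n_frames=100, n_bands=40):
    inputs = layers.Input(shape=(n_frames, n_bands, 1))
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D((1, 2))(x)            # pool along frequency only
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((1, 2))(x)
    x = layers.Reshape((n_frames, -1))(x)         # keep time steps for the RNN
    x = layers.GRU(64)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)  # error present / absent
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_crnn()
model.summary()
```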
@INPROCEEDINGS{Diment19_PL,
author={Diment, Aleksandr and Fagerlund, Eemi and Benfield, Adrian and Virtanen, Tuomas},
booktitle={Proceedings of the International Joint Conference on Neural Networks (IJCNN)},
title={Detection of Typical Pronunciation Errors in Non-native English Speech Using Convolutional Recurrent Neural Networks},
year={2019}
}
Each edition of the challenge on Detection and Classification of Acoustic Scenes and Events (DCASE) contained several tasks involving sound event detection in different setups. DCASE 2017 presented participants with three such tasks, each having specific datasets and detection requirements: Task 2, in which target sound events were very rare in both training and testing data, Task 3, having overlapping events annotated in real-life audio, and Task 4, in which only weakly-labeled data was available for training. In this paper, we present the three tasks, including the datasets and baseline systems, and analyze the challenge entries for each task. We observe the popularity of methods using deep neural networks and the still widely used mel-frequency-based representations, with only a few approaches standing out as radically different. Analysis of the systems' behavior reveals that task-specific optimization plays a big role in producing good performance; however, this optimization often closely follows the ranking metric, and its maximization/minimization does not result in universally good performance. We also introduce the calculation of confidence intervals based on a jackknife resampling procedure, to perform statistical analysis of the challenge results. The analysis indicates that while the 95% confidence intervals for many systems overlap, there are significant differences in performance between the top systems and the baseline for all tasks.
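As an illustration of the jackknife-based confidence intervals mentioned above, here is a hedged numpy/scipy sketch of a standard leave-one-out jackknife 95% interval for an arbitrary evaluation metric; the resampling unit and the metrics used in the actual challenge analysis may differ.

```python
# Leave-one-out jackknife confidence interval for a metric (illustrative).
import numpy as np
from scipy import stats

def jackknife_ci(y_true, y_pred, metric, alpha=0.05):
    n = len(y_true)
    full = metric(y_true, y_pred)
    # leave-one-out replicates of the metric
    loo = np.array([metric(np.delete(y_true, i), np.delete(y_pred, i))
                    for i in range(n)])
    pseudo = n * full - (n - 1) * loo            # jackknife pseudo-values
    est = pseudo.mean()
    se = pseudo.std(ddof=1) / np.sqrt(n)
    t = stats.t.ppf(1 - alpha / 2, df=n - 1)
    return est, (est - t * se, est + t * se)

# Example with plain accuracy as the metric
acc = lambda t, p: np.mean(t == p)
y_true = np.array([0, 1, 1, 0, 1, 1, 0, 0, 1, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 1, 0, 1, 1])
print(jackknife_ci(y_true, y_pred, acc))
```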
@article{Mesaros2019_TASLP,
author = "Mesaros, Annamaria and Diment, Aleksandr and Elizalde, Benjamin and Heittola, Toni and Vincent, Emmanuel and Raj, Bhiksha and Virtanen, Tuomas",
doi = "10.1109/TASLP.2019.2907016",
title = "Sound event detection in the {DCASE} 2017 {C}hallenge",
journal = "IEEE/ACM Transactions on Audio, Speech, and Language Processing",
note = "in press",
year = "2019",
keywords = "Sound event detection, weak labels, pattern recognition, jackknife estimates, confidence intervals",
}
In this paper we propose a method for the separation of moving sound sources. The method is based on first tracking the sources, then estimating the source spectrograms using multichannel non-negative matrix factorization (NMF), and finally extracting the sources from the mixture by single-channel Wiener filtering. We propose a novel multichannel NMF model with time-varying mixing of the sources, represented by spatial covariance matrices (SCMs), and provide update equations for optimizing the model parameters by minimizing the squared Frobenius norm. The SCMs of the model are obtained from the estimated directions of arrival of the tracked sources at each time frame. The evaluation is based on established objective separation criteria and uses real recordings of two and three simultaneously moving sound sources. The compared methods include conventional beamforming and ideal ratio mask separation. The proposed method is shown to exceed the separation quality of the other evaluated blind approaches according to all measured quantities. Additionally, we evaluate the method's susceptibility to tracking errors by comparing against the separation quality achieved using annotated ground truth source trajectories.
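The final extraction step, single-channel Wiener filtering of the mixture with the estimated source spectrograms, can be sketched as follows. The tracking and the multichannel NMF that would produce the per-source power spectrograms are assumed to be done elsewhere, and the STFT parameters are illustrative.

```python
# Wiener-filter extraction of sources from a mixture, given estimated
# per-source power spectrograms of the same shape as the mixture STFT.
import numpy as np
import librosa

def wiener_separate(mixture, source_power_specs, n_fft=2048, hop=512):
    """mixture: 1-D signal; source_power_specs: list of (freq, time) arrays."""
    X = librosa.stft(mixture, n_fft=n_fft, hop_length=hop)
    total = sum(source_power_specs) + 1e-12
    separated = []
    for P in source_power_specs:
        mask = P / total                 # Wiener gain per time-frequency bin
        separated.append(librosa.istft(mask * X, hop_length=hop,
                                       length=len(mixture)))
    return separated
```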
@ARTICLE{Nikunen18,
author={J. {Nikunen} and A. {Diment} and T. {Virtanen}},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
title={Separation of Moving Sound Sources Using Multichannel NMF and Acoustic Tracking},
year={2018},
volume={26},
number={2},
pages={281-295},
keywords={acoustic signal processing;blind source separation;covariance matrices;direction-of-arrival estimation;matrix decomposition;optimisation;source separation;time-varying filters;Wiener filters;acoustic tracking;single-channel Wiener filtering;spatial covariance matrices;ideal ratio mask separation;annotated ground truth source trajectories;source spectrogram estimation;multichannel nonnegative matrix factorization model;multichannel NMF model;source tracking;moving sound source separation;time-varying source mixing;SCM;directions of arrival estimation;objective separation criteria;beamforming;optimisation model;Acoustics;Direction-of-arrival estimation;Microphones;Mathematical model;Estimation;Spectrogram;Array signal processing;Sound source separation;moving sound sources;time-varying mixing model;microphone arrays;acoustic source tracking},
doi={10.1109/TASLP.2017.2774925},
ISSN={2329-9290},
month={Feb},
}
Many machine learning tasks have been shown to be solvable with impressive levels of success given large amounts of training data and computational power. For tasks that lack sufficient data to achieve high performance, transfer learning can be applied: the new task is performed with prior knowledge of the nature of the data, gained by first performing a different task for which training data is abundant. Having previously been shown successful for machine vision and natural language processing, transfer learning is investigated in this work for audio analysis. We propose to solve the problem of weak label classification (tagging) with small amounts of training data by transferring abstract knowledge about the nature of audio data from another tagging task. Three neural network architectures are proposed and evaluated, showing notable classification accuracy gains.
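The general transfer scheme can be sketched in Keras as below: layers trained on a large tagging task are frozen and only a new sigmoid output layer is trained on the small target tagging task. The checkpoint name, tag count and layer indexing are hypothetical, and this is not the exact architecture evaluated in the paper.

```python
# Hedged sketch of feature transfer for audio tagging (multi-label).
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.models.load_model("source_tagger.h5")   # hypothetical checkpoint
for layer in base.layers[:-1]:
    layer.trainable = False                              # freeze transferred layers

n_target_tags = 8                                        # size of the small task
features = base.layers[-2].output                        # penultimate representation
outputs = layers.Dense(n_target_tags, activation="sigmoid",
                       name="target_tags")(features)
target_model = models.Model(base.input, outputs)
target_model.compile(optimizer="adam", loss="binary_crossentropy")
```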
@INPROCEEDINGS{Diment17_TL,
author={Diment, Aleksandr and Virtanen, Tuomas},
booktitle={Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017 IEEE Workshop on},
title={Transfer Learning of Weakly Labelled Audio},
year={2017},
month={Oct},
keywords={transfer learning, tagging, weak labels, audio}
}
@book{DCASE17_book_bib,
title = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
author = "Tuomas Virtanen and Annamaria Mesaros and Toni Heittola and Aleksandr Diment and Emmanuel Vincent and Emmanouil Benetos and Elizalde, {Benjamin Martinez}",
year = "2017",
month = "11",
publisher = "Tampere University of Technology. Laboratory of Signal Processing"
}
The DCASE 2017 Challenge consists of four tasks: acoustic scene classification, detection of rare sound events, sound event detection in real-life audio, and large-scale weakly supervised sound event detection for smart cars. This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset. The baseline systems for all tasks rely on the same implementation, using a multilayer perceptron and log mel-band energies, but differ in the structure of the output layer and the decision-making process, as well as in the evaluation of the system output using task-specific metrics.
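A hedged sketch of a baseline-style front end and classifier, framewise log mel-band energies fed to a small multilayer perceptron, is given below; the parameter values are illustrative and do not reproduce the official baseline configuration.

```python
# Framewise log mel-band energies + MLP classifier (illustrative parameters).
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def log_mel_energies(path, n_mels=40, n_fft=2048, hop=1024):
    y, sr = librosa.load(path, sr=44100)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(mel).T            # (frames, n_mels)

# X: stacked frames from the training files, y: per-frame labels
# clf = MLPClassifier(hidden_layer_sizes=(50, 50), max_iter=200).fit(X, y)
# Frame-level predictions are then aggregated (e.g. majority vote) per file.
```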
@inproceedings{DCASE2017challenge,
Author = "Mesaros, A. and Heittola, T. and Diment, A. and Elizalde, B. and Shah, A. and Badlani, R. and Vincent, E. and Raj, B. and Virtanen, T.",
title = "{DCASE} 2017 Challenge Setup: Tasks, Datasets and Baseline System",
booktitle = "Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)",
month = "November",
year = "2017",
keywords = "Sound scene analysis, Acoustic scene classification, Sound event detection, Audio tagging, Rare sound events"
}
This paper presents a novel application of convolutional neural networks (CNNs) for the task of acoustic scene classification (ASC). We propose the use of a CNN trained to classify short sequences of audio, represented by their log-mel spectrogram. We also introduce a training method that can be used under particular circumstances in order to make full use of small datasets. The proposed system is tested and evaluated on three different ASC datasets and compared to other state-of-the-art systems which competed in the “Detection and Classification of Acoustic Scenes and Events” (DCASE) challenges held in 2016 and 2013. The best accuracy scores obtained by our system on the DCASE 2016 datasets are 79.0% (development) and 86.2% (evaluation), which constitute 6.4% and 9% improvements with respect to the baseline system. Finally, when tested on the DCASE 2013 evaluation dataset, the proposed system reaches 77.0% accuracy, improving the challenge winner's score by 1%.
@INPROCEEDINGS{Valenti17,
AUTHOR="Michele Valenti and Aleksandr Diment and Giambattista Parascandolo and Stefano Squartini and Tuomas Virtanen",
TITLE="{DCASE} 2016 Acoustic Scene Classification Using Convolutional Neural Networks",
BOOKTITLE="Proc. of the 2017 International Joint Conference on Neural Networks",
ADDRESS="Anchorage, Alaska, USA",
DAYS="14-19",
MONTH=may,
YEAR=2017,
}
This workshop paper presents our contribution to the task of acoustic scene classification proposed for the "detection and classification of acoustic scenes and events" (D-CASE) 2016 challenge. We propose the use of a convolutional neural network trained to classify short sequences of audio, represented by their log-mel spectrogram. In addition, we use a training method that can be applied when the validation performance of the system saturates as the training proceeds. The performance is evaluated on the public acoustic scene classification development dataset provided for the D-CASE challenge. The best accuracy score obtained by our configuration on a four-fold cross-validation setup is 79.0%, which constitutes an 8.8% relative improvement with respect to the baseline system, based on a Gaussian mixture model classifier.
@INPROCEEDINGS{Valenti16_DCASE,
AUTHOR="Michele Valenti and Aleksandr Diment and Giambattista Parascandolo and Stefano Squartini and Tuomas Virtanen",
TITLE="{DCASE} 2016 Acoustic Scene Classification Using Convolutional Neural Networks",
BOOKTITLE="Proc. of the 2016 Workshop on Detection and Classification of Acoustic Scenes and Events",
ADDRESS="Budapest, Hungary",
DAYS=3,
MONTH=sep,
YEAR=2016,
KEYWORDS="Acoustic scene classification, convolutional neural networks, DCASE, computational audio processing"
}
Detection of whispered speech in the presence of high levels of background noise has applications in fraudulent behaviour recognition; for instance, it can serve as an indicator of possible insider trading. We propose a deep neural network (DNN)-based whispering detection system, which operates on both magnitude and phase features, including the group delay feature from all-pole models (APGD). We show that the APGD feature outperforms the conventional ones. Trained and evaluated on a diverse collected dataset of whispered and normal speech with emulated phone line distortions and significant amounts of added background noise, the proposed system achieves accuracies as high as 91.8%.
@INPROCEEDINGS{Diment16_WHI,
AUTHOR="Aleksandr Diment and Mikko Parviainen and Tuomas Virtanen and Roman Zelov
and Alex Glasman",
TITLE="{Noise-Robust} Detection of Whispering in Telephone Calls Using Deep Neural
Networks",
BOOKTITLE="2016 24th European Signal Processing Conference (EUSIPCO) (EUSIPCO 2016)",
ADDRESS="Budapest, Hungary",
DAYS=27,
MONTH=aug,
YEAR=2016,
KEYWORDS="whispering, noise robustness, deep neural networks"
}
This paper proposes dictionary learning with archetypes for audio processing. Archetypes are so-called pure types: each is a combination of a few data points, and the data points can in turn be approximated as combinations of archetypes. The concept has been found useful in various problems, but it has not yet been applied to audio analysis. The algorithm performs archetypal analysis that minimises the generalised Kullback-Leibler divergence between an observation and the model, a divergence shown to be suitable for audio. The methodology is evaluated in a source separation scenario (mixtures of speech) and shows results comparable to the state of the art, with perceptual measures indicating its superiority over all of the competing methods in the case of medium-size dictionaries.
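The model structure can be sketched as follows: archetypes are convex combinations of data points (Z = XB), observations are approximated as convex combinations of archetypes (X ≈ ZA), and the fit is measured with the generalised Kullback-Leibler divergence. The dedicated update rules of the paper are not reproduced; the random, simplex-normalised coefficient matrices below only illustrate the constraints and the objective.

```python
# Archetypal dictionary model and generalised KL objective (illustration only).
import numpy as np

def generalised_kl(X, X_hat, eps=1e-12):
    X, X_hat = X + eps, X_hat + eps
    return np.sum(X * np.log(X / X_hat) - X + X_hat)

rng = np.random.default_rng(0)
X = np.abs(rng.random((257, 100)))          # magnitude spectrogram (freq x frames)
k = 10                                      # dictionary (archetype) size

# stochastic (column-simplex) coefficient matrices, here random for illustration
B = rng.random((100, k)); B /= B.sum(axis=0, keepdims=True)
A = rng.random((k, 100)); A /= A.sum(axis=0, keepdims=True)

Z = X @ B                                   # archetypes lie in the data's convex hull
print(generalised_kl(X, Z @ A))             # objective the updates would minimise
```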
@INPROCEEDINGS{Diment15_AA,
author={Diment, Aleksandr and Virtanen, Tuomas},
booktitle={Applications of Signal Processing to Audio and Acoustics (WASPAA), 2015 IEEE Workshop on},
title={Archetypal analysis for audio dictionary learning},
year={2015},
month={Oct},
keywords={archetypes, audio analysis, non-negative matrix factorisation, sparse representation}
}
This paper proposes a method for binaural reconstruction of a sound scene captured with a portable-sized array consisting of several microphones. The proposed processing separates the scene into a sum of a small number of sources, and the spectrogram of each of these is in turn represented as a small number of latent components. The direction of arrival (DOA) of each source is estimated, followed by binaural rendering of each source at its estimated direction. For representing the sources, the proposed method uses low-rank complex-valued non-negative matrix factorization combined with a DOA-based spatial covariance matrix model. The binaural reconstruction is achieved by applying the binaural cues (head-related transfer function) associated with the estimated source DOA to the separated source signals. The binaural rendering quality of the proposed method was evaluated using a speech intelligibility test. The test results indicated that, in a three-speaker scenario, the proposed binaural rendering improved the intelligibility of speech over stereo recordings and over separation by a minimum variance distortionless response beamformer with the same binaural synthesis. An additional listening test evaluating the subjective quality of the rendered output indicated no added processing artifacts by the proposed method in comparison to the unprocessed stereo recording.
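The final rendering step, applying the binaural cues of the estimated source directions to the separated signals, can be sketched as below; the HRIR pairs and the separated sources are assumed to be equal-length arrays obtained elsewhere.

```python
# Binaural rendering by convolving each separated source with the HRIR pair
# of its estimated direction and summing per ear (illustrative sketch).
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(sources, hrirs):
    """sources: equal-length 1-D signals; hrirs: list of (left, right) HRIR
    pairs of equal length, one pair per source."""
    left = sum(fftconvolve(s, h_l) for s, (h_l, h_r) in zip(sources, hrirs))
    right = sum(fftconvolve(s, h_r) for s, (h_l, h_r) in zip(sources, hrirs))
    return np.stack([left, right])           # (2, samples) binaural output
```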
@article{Nikunen2015,
title = "Binaural rendering of microphone array captures based on source separation",
journal = "Speech Communication",
year = "2015",
issn = "0167-6393",
doi = "http://dx.doi.org/10.1016/j.specom.2015.09.005",
url = "http://www.sciencedirect.com/science/article/pii/S0167639315001004",
author = "Joonas Nikunen and Aleksandr Diment and Tuomas Virtanen and Miikka Vilermo",
keywords = "Binaural processing, Source separation, Speech intelligibility, Non-negative matrix factorization",
}
A feature based on the group delay function from all-pole models (APGD) is proposed for environmental sound event recognition. The commonly used spectral features take into account merely the magnitude information, whereas the phase is overlooked due to the complications related to its interpretation. Additional information concealed in the phase is hypothesised to be beneficial for sound event recognition. The APGD is an approach to inferring phase information, which has shown applicability for speech and music analysis and is now studied in environmental audio. The evaluation is performed within a multi-label deep neural network (DNN) framework on a diverse real-life dataset of environmental sounds. It shows performance improvement compared to the baseline log mel-band energy case. Combined with the magnitude-based features, APGD demonstrates further improvement.
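A hedged sketch of an APGD-style feature for a single frame is shown below: an all-pole (LPC) model is fitted and the group delay of the resulting filter 1/A(z) is taken as the feature vector. The model order, number of frequency bins and framing are illustrative and may differ from the configuration used in the paper.

```python
# All-pole group delay for one windowed frame (illustrative parameters).
import numpy as np
import librosa
from scipy.signal import group_delay

def apgd_frame(frame, order=20, n_bins=256):
    a = librosa.lpc(frame.astype(float), order=order)   # denominator A(z)
    _, gd = group_delay(([1.0], a), w=n_bins)            # group delay of 1/A(z)
    return gd

# toy frame: a sinusoid with a little noise, Hann-windowed
rng = np.random.default_rng(0)
frame = np.sin(2 * np.pi * 220 * np.arange(1024) / 16000)
frame = (frame + 0.01 * rng.standard_normal(1024)) * np.hanning(1024)
print(apgd_frame(frame).shape)
```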
@INPROCEEDINGS{Diment15_APGD4events,
AUTHOR="Aleksandr Diment and Emre Cakir and Toni Heittola and Tuomas Virtanen",
TITLE="Automatic recognition of environmental sound events using all-pole group
delay features",
BOOKTITLE="European Signal Processing Conference 2015 (EUSIPCO 2015)",
ADDRESS="Nice, France",
DAYS=31,
MONTH=aug,
YEAR=2015,
KEYWORDS="Phase spectrum; sound event recognition; audio classification; neural
networks"
}
In this work, a feature based on the group delay function from all-pole models (APGD) is proposed for pitched musical instrument recognition. Conventionally, spectrum-related features take into account merely the magnitude information, whereas the phase is often overlooked due to the complications related to its interpretation. However, there is often additional information concealed in the phase, which could be beneficial for recognition. The APGD is an elegant approach to inferring phase information, which avoids the issues related to interpreting the phase and does not require extensive parameter adjustment. Having shown its applicability for speech-related problems, it is now explored in terms of instrument recognition. The evaluation is performed with various instrument sets and shows noteworthy absolute accuracy gains of up to 7% compared to the baseline mel-frequency cepstral coefficients (MFCCs) case. Combined with the MFCCs and with feature selection, APGD demonstrates superiority over the baseline with all the evaluated sets.
@incollection{Diment14_APGD,
year={2014},
isbn={978-3-319-12975-4},
booktitle={Sound, Music, and Motion},
series={Lecture Notes in Computer Science},
editor={Aramaki, Mitsuko and Derrien, Olivier and Kronland-Martinet, Richard and Ystad, Sølvi},
doi={10.1007/978-3-319-12976-1_37},
title={Group Delay Function from All-Pole Models for Musical Instrument Recognition},
url={http://dx.doi.org/10.1007/978-3-319-12976-1_37},
publisher={Springer International Publishing},
keywords={Musical instrument recognition; Music information retrieval; All-pole group delay feature; Phase spectrum},
author={Diment, Aleksandr and Rajan, Padmanabhan and Heittola, Toni and Virtanen, Tuomas},
pages={606-618},
language={English}
}
In this work, the modified group delay feature (MODGDF) is proposed for pitched musical instrument recognition. Conventionally, the spectrum-related features used in instrument recognition take into account merely the magnitude information, whereas the phase is often overlooked due to the complications related to its interpretation. However, there is often additional information concealed in the phase, which could be beneficial for recognition. The MODGDF is a method of incorporating phase information which avoids the issues related to phase unwrapping. Having shown its applicability for speech-related problems, it is now explored in terms of musical instrument recognition. The evaluation is performed on separate note recordings of various instrument sets, and combined with the conventional mel-frequency cepstral coefficients (MFCCs), the MODGDF shows noteworthy absolute accuracy gains of up to 5.1% compared to the baseline MFCCs case.
@INPROCEEDINGS{Diment13_MODGDF,
AUTHOR = "Aleksandr Diment and Rajan Padmanabhan and Toni Heittola and Tuomas Virtanen",
TITLE = "Modified Group Delay Feature for Musical Instrument Recognition",
BOOKTITLE = "10th International Symposium on Computer Music Multidisciplinary Research (CMMR)",
ADDRESS = "Marseille, France",
DAYS = 15,
MONTH = oct,
YEAR = 2013,
KEYWORDS = "Musical instrument recognition; music information retrieval; modified group delay feature; phase spectrum"
}
We present a sound event detection system based on hidden Markov models. The system is evaluated with the development material provided in the AASP Challenge on Detection and Classification of Acoustic Scenes and Events. Two approaches using the same basic detection scheme are presented. The first, developed for acoustic scenes with non-overlapping sound events, is evaluated with the Office Live development dataset. The second, developed for acoustic scenes with some degree of overlapping sound events, is evaluated with the Office Synthetic development dataset.
@TECHREPORT{Diment13_AASP,
AUTHOR = "Aleksandr Diment and Toni Heittola and Tuomas Virtanen",
TITLE = "Sound Event Detection for Office Live and Office Synthetic AASP Challenge",
YEAR = 2013,
KEYWORDS = "Sound event detection"
}
Birds have been widely used as biological indicators for ecological research. They respond quickly to environmental changes and can be used to make inferences about other organisms (e.g., the insects they feed on). Traditional methods for collecting data about birds involve costly human effort. A promising alternative is acoustic monitoring. There are many advantages to recording audio of birds compared to human surveys, including increased temporal and spatial resolution and extent, applicability in remote sites, reduced observer bias, and potentially lower cost. However, it is an open problem for signal processing and machine learning to reliably identify bird sounds in real-world audio data collected in an acoustic monitoring scenario. Some of the major challenges include multiple simultaneously vocalizing birds, other sources of non-bird sound (e.g., buzzing insects), and background noise like wind, rain, and motor vehicles.
@INPROCEEDINGS{6661934,
author={Briggs, F. and Yonghong Huang and Raich, R. and Eftaxias, K. and Zhong Lei and Cukierski, W. and Hadley, S.F. and Hadley, A. and Betts, M. and Fern, X.Z. and Irvine, J. and Neal, L. and Thomas, A. and Fodor, G. and Tsoumakas, G. and Hong Wei Ng and Thi Ngoc Tho Nguyen and Huttunen, H. and Ruusuvuori, P. and Manninen, T. and Diment, A. and Virtanen, T. and Marzat, J. and Defretin, J. and Callender, D. and Hurlburt, C. and Larrey, K. and Milakov, M.},
booktitle={Machine Learning for Signal Processing (MLSP), 2013 IEEE International Workshop on},
title={The 9th annual MLSP competition: New methods for acoustic classification of multiple simultaneous bird species in a noisy environment},
year={2013},
month={Sept},
pages={1-8},
keywords={acoustic signal processing;audio recording;audio signal processing;ecology;learning (artificial intelligence);signal resolution;zoology;MLSP competition;acoustic classification;acoustic monitoring scenario;audio recording;background noise;biological indicators;bird sounds;ecological research;human effort;human surveys;machine learning;multiple simultaneous bird species;noisy environment;nonbird sound;real-world audio data;reduced observer bias;remote sites;signal processing;spatial resolution;temporal resolution;Birds;Histograms;Image segmentation;Noise;Rain;Spectrogram;Vectors},
doi={10.1109/MLSP.2013.6661934},
ISSN={1551-2541},
}
In this work, semi-supervised learning (SSL) techniques are explored in the context of musical instrument recognition. Conventional supervised approaches rely on annotated data to train the classifier, which implies costly manual annotation of the training data. SSL methods enable utilising additional unannotated data, which is significantly easier to obtain, allowing the overall development cost to be maintained at the same level while notably improving the performance. The implemented classifier incorporates a Gaussian mixture model-based SSL scheme utilising an iterative EM-based algorithm, as well as extensions facilitating a simpler convergence criterion. The evaluation is performed on a set of nine instruments while training on a dataset in which the relative size of the labelled data is as little as 15%. It yields a noteworthy absolute performance gain of 16% compared to the performance of the initial supervised models.
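A simplified sketch of the idea follows; it uses hard self-training assignments with scikit-learn's GaussianMixture instead of the soft EM formulation with labelled-data weighting described above, so it only illustrates how the unannotated data enters the training loop.

```python
# Hard-assignment (self-training) variant of GMM-based semi-supervised
# learning; assumes each class has enough labelled samples to fit a GMM.
import numpy as np
from sklearn.mixture import GaussianMixture

def ssl_gmm(X_lab, y_lab, X_unlab, n_classes, n_components=4, n_iter=5):
    models = [GaussianMixture(n_components, covariance_type="diag",
                              random_state=0).fit(X_lab[y_lab == c])
              for c in range(n_classes)]
    for _ in range(n_iter):
        scores = np.stack([m.score_samples(X_unlab) for m in models], axis=1)
        pseudo = scores.argmax(axis=1)             # most likely class per sample
        for c in range(n_classes):
            X_c = np.vstack([X_lab[y_lab == c], X_unlab[pseudo == c]])
            models[c] = GaussianMixture(n_components, covariance_type="diag",
                                        random_state=0).fit(X_c)
    return models
```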
@INPROCEEDINGS{Diment13_SSL,
AUTHOR = "Aleksandr Diment and Toni Heittola and Tuomas Virtanen",
TITLE = "Semi-supervised Learning for Musical Instrument Recognition",
BOOKTITLE = "21st European Signal Processing Conference 2013 (EUSIPCO 2013)",
ADDRESS = "Marrakech, Morocco",
DAYS = 9,
MONTH = sep,
YEAR = 2013,
KEYWORDS = "Music information retrieval; musical instrument recognition; semi-supervised learning"
}
The application areas of music information retrieval have been gaining popularity over the last decades. Musical instrument recognition is an example of a specific research topic in the field. In this thesis, semi-supervised learning techniques are explored in the context of musical instrument recognition. The conventional approaches to musical instrument recognition rely on annotated data, i.e. example recordings of the target instruments with associated information about the target labels, in order to perform training. This implies highly laborious and tedious manual annotation of the collected training data. Semi-supervised methods enable incorporating additional unannotated data into training. Such data consists of merely the recordings of the instruments and is therefore significantly easier to acquire. Hence, these methods allow keeping the overall development cost at the same level while notably improving the performance of a system.
The implemented musical instrument recognition system utilises a mixture model semi-supervised learning scheme in the form of two EM-based algorithms. Furthermore, upgraded versions, namely additional labelled data weighting and class-wise retraining, are proposed for improved performance and convergence criteria in the particular classification scenario. The evaluation is performed on sets consisting of four and ten instruments and yields overall average recognition accuracies of 95.3% and 68.4%, respectively. These correspond to absolute gains of 6.1% and 9.7% compared to the initial, purely supervised cases. Additional experiments are conducted on the effects of the proposed modifications, as well as on the optimal relative labelled dataset size. Overall, the obtained performance improvement is quite noteworthy, and future research directions include investigating the behaviour of the implemented algorithms along with the proposed and further extended approaches.
@MastersThesis{Diment13_MSc,
author = {Aleksandr Diment},
title = {Semi-supervised musical instrument recognition},
school = {Tampere University of Technology},
address = {Finland},
year = {2013},
}
In this bachelor's thesis, a dataset consisting of samples of various drums and percussion is built, and a method for visualizing it is described. An insight into the process of planning, recording and labeling an audio dataset is provided, and a neural network embedding based visualization technique is then applied to the dataset. The dataset visualization method consists of two parts. First, high-dimensional embeddings produced by OpenL3, a pre-trained deep convolutional neural network model, are obtained. Then, the embeddings are projected onto a 2-dimensional plane with the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm. Promising results are presented, as the method is able to separate different drum types and articulations into distinct clusters in the 2D plane; moreover, it is able to recognize dynamic variations inside the classes themselves and organize the samples accordingly. Additionally, Python programs are written to assist in the dataset labeling and visualization tasks, and the code and the dataset, alongside pre-computed embeddings, are published online.
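A sketch of the described pipeline, per-sample OpenL3 embeddings projected to 2-D with t-SNE, is given below. The file names, content type and embedding size are illustrative assumptions, not necessarily those used in the thesis.

```python
# OpenL3 embeddings averaged per sample, then t-SNE projection to 2-D.
import numpy as np
import librosa
import openl3
from sklearn.manifold import TSNE

def embed(path):
    audio, sr = librosa.load(path, sr=48000)
    emb, _ = openl3.get_audio_embedding(audio, sr, content_type="music",
                                        embedding_size=512)
    return emb.mean(axis=0)                  # one embedding vector per sample

paths = ["kick_01.wav", "snare_01.wav", "hihat_01.wav"]   # hypothetical files
X = np.stack([embed(p) for p in paths])
points = TSNE(n_components=2, perplexity=2).fit_transform(X)
print(points)                                # 2-D coordinates for plotting
```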
@thesis{Nieminen20_BSc,
author = {Elias Nieminen},
title = {Building and Visualizing a Percussion Dataset Using Deep Audio Embeddings and Dimensionality Reduction},
school = {Tampere University},
address = {Finland},
year = {2020},
type={Bachelor's Thesis}
}
Having large amounts of training data is necessary for the ever more popular neural networks to perform reliably. Data augmentation, i.e. the act of creating additional training data by performing label-preserving transformations on existing training data, is an efficient solution to this problem. While increasing the amount of data, introducing variations to the data via the transformations also has the power to make machine learning models more robust in real-life conditions with noisy environments and mismatches between the training and test data.
In this thesis, data augmentation techniques in audio analysis are reviewed, and a tool for audio data augmentation (TADA) is presented. TADA is capable of performing three audio data augmentation techniques, which are convolution with mobile device microphone impulse responses, convolution with room impulse responses, and addition of background noises. TADA is evaluated by using it in a pronunciation error classification task, where typical pronunciation errors of Finnish people uttering English words are classified. All the techniques are tested first individually and then also in combination.
The experiments are executed with both original and augmented data. In all experiments, using TADA improves the performance of the classifier when compared to training with only original data. Robustness against unseen devices and rooms also improves. Additional gain from performing combined augmentation starts to saturate only after augmenting the training data to 30 times the original amount. Based on the positive impact of TADA for the classification task, it is found that data augmentation with convolutional and additive noises is an effective combination for increasing robustness against environmental distortions and channel effects.
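A minimal sketch of the two convolutional augmentations and the additive-noise augmentation described above is given below, assuming 1-D numpy signals at a common sample rate; this illustrates the general technique rather than the TADA implementation.

```python
# Convolution with a (device or room) impulse response, and additive noise
# mixed in at a chosen SNR (illustrative augmentation sketch).
import numpy as np
from scipy.signal import fftconvolve

def convolve_ir(signal, impulse_response):
    out = fftconvolve(signal, impulse_response)[: len(signal)]
    return out / (np.max(np.abs(out)) + 1e-12)      # peak-normalise the result

def add_noise(signal, noise, snr_db):
    noise = noise[: len(signal)]
    p_sig = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10.0)))
    return signal + gain * noise
```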
@thesis{Eklund19_MSc,
author = {Ville-Veikko Eklund},
title = {Data Augmentation Techniques for Robust Audio Analysis},
school = {Tampere University},
address = {Finland},
year = {2019},
type={Master's Thesis}
}
In recent years, deep learning based approaches have dominated different types of classification problems. Usually these approaches require large amounts of training data to train a model capable of generalizing to unseen data of the same type. However, in some applications it might be difficult to gather training data efficiently, and it would be beneficial to classify new samples using only a few or even a single training example.
For us humans, knowledge from previously learned concepts is relatively easy to transfer to unfamiliar concepts; therefore, many researchers have experimented with this idea in machine learning classification tasks. The idea of using only a single labelled example to classify unseen data is known as one-shot learning and has been successful especially in the field of computer vision. Many of the modern approaches to one-shot learning utilize a special neural network architecture called a siamese network. This architecture can be trained to predict similarities between inputs and can be used for a metric-based approach to one-shot learning. Siamese networks have been used for different audio related tasks before; however, their usage in one-shot learning for audio classification has received less attention compared to computer vision.
The purpose of this thesis is to extend the idea of one-shot learning to environmental audio classification and see whether this approach is feasible. The proposed system was trained and evaluated on the ESC dataset, consisting of 50 different environmental audio categories. The final one-shot evaluation was done on 5 completely unseen classes, using only a single example of each class when performing the classification. The results show that convolutional siamese networks are indeed a valid approach to the difficult one-shot classification task for environmental audio.
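A hedged Keras sketch of a convolutional siamese network of the kind described above is shown below; the input shape and layer sizes are illustrative, not those of the thesis.

```python
# Siamese network: a shared convolutional encoder embeds two spectrograms,
# and a sigmoid head scores their similarity (illustrative sketch).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_encoder(input_shape=(128, 128, 1)):
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"), layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation="relu"), layers.MaxPooling2D(2),
        layers.Flatten(), layers.Dense(128, activation="relu"),
    ])

encoder = build_encoder()
in_a = layers.Input((128, 128, 1))
in_b = layers.Input((128, 128, 1))
emb_a, emb_b = encoder(in_a), encoder(in_b)          # shared weights
distance = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([emb_a, emb_b])
similarity = layers.Dense(1, activation="sigmoid")(distance)
siamese = models.Model([in_a, in_b], similarity)
siamese.compile(optimizer="adam", loss="binary_crossentropy")
# At one-shot test time, a query is compared against one example per class
# and assigned to the class with the highest similarity score.
```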
@thesis{Honka19_BSc,
author = {Tapio Honka},
title = {One-shot Learning with Siamese Networks for Environmental Audio},
school = {Tampere University},
address = {Finland},
year = {2019},
type={Bachelor's Thesis}
}
The goal of this thesis was to build an easily accessible and approachable environmental audio dataset for benchmarking machine learning algorithms. The samples in the created dataset were collected from Freesound, which is an online database consisting of sounds uploaded by its users. The dataset consists of three classes, each containing 50 instances of original recordings with annotations for 5-second segments. The classes are 'dog', 'bird' and 'rain', and the dataset was therefore given the name DBR dataset. Additionally, a script for evaluating support vector, random forest and k-NN classifiers with the dataset was written. The building and evaluation stages were documented step-by-step and a Jupyter notebook tutorial for using the dataset was created. The dataset, the scripts and the notebook were published online.
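A sketch of such an evaluation script is shown below, comparing a support vector machine, a random forest and a k-NN classifier with scikit-learn; the mean-MFCC features and parameter values are illustrative assumptions rather than the published setup.

```python
# Comparing three classifiers on clip-level features (illustrative sketch).
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def mfcc_features(path):
    y, sr = librosa.load(path, sr=22050)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)

# X: (n_clips, 20) feature matrix, y: labels ('dog', 'bird', 'rain')
# for clf in [SVC(), RandomForestClassifier(), KNeighborsClassifier()]:
#     print(type(clf).__name__, cross_val_score(clf, X, y, cv=5).mean())
```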
@thesis{Eklund18_BSc,
author = {Ville-Veikko Eklund},
title = {Audio dataset creation},
school = {Tampere University of Technology},
address = {Finland},
year = {2018},
type={Bachelor's Thesis}
}
In this thesis we investigate the use of deep neural networks applied to the field of computational audio scene analysis, in particular to acoustic scene classification. This task concerns the recognition of an acoustic scene, like a park or a home, performed by an artificial system. In our work we examine the use of deep models, aiming to contribute to one of their use cases which is, in our opinion, among the most poorly explored. The neural architecture we propose in this work is a convolutional neural network specifically designed to work on a time-frequency audio representation known as the log-mel spectrogram. The network output is an array of prediction scores, each of which is associated with one class of a set of 15 predefined classes. In addition, the architecture features batch normalization, a recently proposed regularization technique used to enhance the network performance and to speed up its training.
We also investigate the use of different audio sequence lengths as the classification unit for our network. Thanks to these experiments we observe that, for our artificial system, the recognition of long sequences is not easier than that of medium-length sequences, hence highlighting a counterintuitive behaviour. Moreover, we introduce a training procedure which aims to make the best of small datasets by using all the labeled data available for the network training. This procedure, possible under particular circumstances, constitutes a trade-off between an accurate training stop and an increased data representation available to the network. Finally, we compare our model to other systems, showing that its recognition ability can outperform both other neural architectures and other state-of-the-art statistical classifiers, like support vector machines and Gaussian mixture models. The proposed system reaches good accuracy scores on two different databases collected in 2013 and 2016. The best accuracy scores, obtained according to two cross-validation setups, are 77% and 79% respectively. These scores constitute 22% and 6.1% accuracy increments with respect to the corresponding baselines published together with the datasets.
@MastersThesis{Valenti16_MSc_bib,
author = {Michele Valenti},
title = {Convolutional Neural Networks for Acoustic Scene Classification},
school = {Tampere University of Technology},
address = {Finland},
year = {2016},
}