What is Optical Music Recognition?
Updated: Jan 16, 2020
I have always been keen to learn about Computer Vision, and even more so about music, which has always been a big part of my life. These two combined? A great research problem! Four months ago I started a 4-year Ph.D. program. So far it has been an insane journey, both in how much I have learned and in how eager I am to learn more.
The research problem I am working on is Optical Music Recognition (OMR); more specifically, I am investigating whether Deep Learning can improve the performance of current methods.
To help you understand this problem a little better, I will try to explain what OMR is, the conventional methods used, and the main issues that need to be tackled in the future.
Most of us have presumably used Google Translate and its camera translation feature by now. By just taking a picture of a text, we save time and avoid having to learn Chinese or other languages. Now let us think about how this feature would apply to music. Musicians still write on sheet music or blank paper. However, if they want to share their music, they have to transcribe it into a computer. A computer-readable music file would be more accessible. The motivation behind this research is therefore to let composers and musicians not only transcribe and edit music by taking a picture of the sheet music, but ultimately share and play their pieces as well. OMR would also assist in music statistics and make notation searchable, much like searching for text.
Calvo-Zaragoza et al. give a very clear and inclusive definition of OMR, calling it a research field rather than a simple problem.
Optical Music Recognition is a field of research that investigates how to computationally read music notation in documents.
The second part of the definition stresses “computationally read music notation in documents”: because the reading is performed by computers rather than humans, OMR does not concern music notation models themselves, but it builds upon that knowledge. Furthermore, it emphasizes the information captured by these systems, which I will explain in more detail in the sections below.
The research field was established at MIT in the late 1960s, using scanned printed music sheets. The pioneers in the field are Ichiro Fujinaga, Nicholas Carter, Kia Ng, David Bainbridge and Tim Bell. Their work is still an excellent foundation for today’s research. OMR is related to other fields such as music information retrieval, computer vision and document analysis.
Based on the studies carried out so far, a standard pipeline has emerged that reflects the approaches taken to solve the problem; see Figure 1.
The usual inputs to this pipeline are scans or pictures of printed or handwritten music sheets, such as those in Figure 2. These images are then subjected to image processing techniques. These techniques, which include binarization (converting to black and white), blurring, and deskewing (correcting rotation), help reduce the noise in the image.
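To make binarization a bit more concrete, here is a minimal sketch in plain Python. It assumes the image is a list of rows of grayscale values from 0 to 255 and uses Otsu's method to pick the threshold, which is one common choice and not necessarily what any particular OMR system uses. The function names `otsu_threshold` and `binarize` are my own, purely for illustration.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark from light pixels."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += hist[level]          # pixels at or below this level
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg     # pixels above this level
        if weight_fg == 0:
            break
        sum_bg += level * hist[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance: large when the two groups are well separated.
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize(image):
    """Map a grayscale image (list of rows) to 1 for ink and 0 for paper."""
    flat = [p for row in image for p in row]
    threshold = otsu_threshold(flat)
    return [[1 if p <= threshold else 0 for p in row] for row in image]


# Toy "scan": a dark vertical stroke on light paper.
scan = [[250, 30, 245],
        [240, 20, 250]]
binary = binarize(scan)
```

In practice you would reach for an image library rather than nested lists, but the idea stays the same: everything darker than the threshold is treated as ink.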
The enhanced images are then used for music object recognition. In this step, the algorithm tries to identify musical objects such as clefs, noteheads, bars, slurs, and others. At this stage, the objects are treated as primitives, separate from their semantic meaning.
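As a rough illustration of how candidate objects can be pulled out of a binarized page, here is a small sketch that groups touching ink pixels into connected components and returns their bounding boxes. This is a simple classical technique that I am using as a stand-in, not a full recognizer: it only finds blobs and does not say whether a blob is a clef or a notehead. It also assumes the staff lines have already been removed, since otherwise they would join most symbols into one big component. The function name `connected_components` is hypothetical.

```python
from collections import deque


def connected_components(binary):
    """Return bounding boxes (top, left, bottom, right) of 4-connected ink regions.

    `binary` is a list of rows where 1 means ink and 0 means paper.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from this unvisited ink pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and binary[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Two separate blobs: a 2x2 square and a short vertical stroke.
page = [[1, 1, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 1]]
boxes = connected_components(page)
```

Each box would then be cropped and handed to a classifier that decides what kind of primitive it contains.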
The next step then attempts to reconstruct the relationships between these primitives, together with their semantic meaning. It rebuilds the semantics based on the grammar rules of music notation.
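One small example of such a rule: a notehead only becomes a pitch once you know where it sits relative to the staff and which clef is active. The sketch below assumes a treble clef (bottom line is E4), five staff lines given as y coordinates from top to bottom, and image coordinates where y grows downward. The function name `notehead_to_pitch` and the input layout are my own assumptions for illustration.

```python
LETTERS = "CDEFGAB"
E4_INDEX = 4 * 7 + LETTERS.index("E")  # diatonic index of the bottom treble line


def notehead_to_pitch(staff_lines_y, notehead_y):
    """Map a notehead's vertical centre to a pitch name, assuming a treble clef.

    `staff_lines_y` holds the five staff-line y positions, top line first.
    """
    top_y, bottom_y = staff_lines_y[0], staff_lines_y[-1]
    half_space = (bottom_y - top_y) / 8  # one diatonic step = half a line gap
    # Count diatonic steps above (positive) or below (negative) the bottom line.
    steps = round((bottom_y - notehead_y) / half_space)
    index = E4_INDEX + steps
    return f"{LETTERS[index % 7]}{index // 7}"


staff = [100, 110, 120, 130, 140]
pitches = [notehead_to_pitch(staff, y) for y in (140, 125, 100, 150)]
# -> ['E4', 'A4', 'F5', 'C4']
```

Real scores need far more than this (key signatures, accidentals, clef changes, durations), which is exactly why this reconstruction stage relies on the grammar of music notation.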
The final output represents the musical meaning and description of the input score in a machine-readable form. Common file formats include MIDI, MusicXML, and MEI.
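To give a feel for what such an output looks like, here is a hedged sketch that writes a handful of recognized pitches as a very small MusicXML document using Python's standard `xml.etree.ElementTree`. It assumes every note is a quarter note in a single 4/4 measure with a treble clef, and I have not validated the result against the MusicXML schema. The function name `notes_to_musicxml` is hypothetical.

```python
import xml.etree.ElementTree as ET


def notes_to_musicxml(notes):
    """Serialize (step, octave) pairs as quarter notes in one MusicXML measure."""
    score = ET.Element("score-partwise", version="3.1")
    part_list = ET.SubElement(score, "part-list")
    score_part = ET.SubElement(part_list, "score-part", id="P1")
    ET.SubElement(score_part, "part-name").text = "Music"

    part = ET.SubElement(score, "part", id="P1")
    measure = ET.SubElement(part, "measure", number="1")

    # Measure attributes: one division per quarter note, C major, 4/4, treble clef.
    attributes = ET.SubElement(measure, "attributes")
    ET.SubElement(attributes, "divisions").text = "1"
    key = ET.SubElement(attributes, "key")
    ET.SubElement(key, "fifths").text = "0"
    time = ET.SubElement(attributes, "time")
    ET.SubElement(time, "beats").text = "4"
    ET.SubElement(time, "beat-type").text = "4"
    clef = ET.SubElement(attributes, "clef")
    ET.SubElement(clef, "sign").text = "G"
    ET.SubElement(clef, "line").text = "2"

    for step, octave in notes:
        note = ET.SubElement(measure, "note")
        pitch = ET.SubElement(note, "pitch")
        ET.SubElement(pitch, "step").text = step
        ET.SubElement(pitch, "octave").text = str(octave)
        ET.SubElement(note, "duration").text = "1"
        ET.SubElement(note, "type").text = "quarter"

    return ET.tostring(score, encoding="unicode")


xml_text = notes_to_musicxml([("C", 4), ("E", 4), ("G", 4), ("C", 5)])
```

The point is simply that the end product is structured text a notation program can open, edit, and play back, rather than pixels.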
We want to explore new ways of performing these steps using Deep Learning (DL). Most DL models build on artificial neural networks, which are inspired by biological neural networks. They consist of layers of so-called nodes: one input layer, one or more hidden layers, and an output layer. The deeper the network, the more intricate the features the model can learn and extract. The hidden layers in between are often referred to as a “black box,” because we cannot easily understand what happens inside them, although new research is focusing on this.
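The layer structure described above can be sketched as a forward pass in a few lines of plain Python. This is only a toy: the weights are made up by hand rather than learned, the choice of ReLU for the hidden layers and softmax for the output is my assumption, and real models are built with a DL framework. All function names here are hypothetical.

```python
import math


def dense(inputs, weights, biases):
    """One fully connected layer: each node is a weighted sum of inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Non-linearity applied after each hidden layer."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn the output layer's scores into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, layers):
    """Run input x through the hidden layers and then the output layer."""
    *hidden, (out_w, out_b) = layers
    for w, b in hidden:
        x = relu(dense(x, w, b))
    return softmax(dense(x, out_w, out_b))


# Hand-picked weights: 2 inputs -> 3 hidden nodes -> 2 output classes.
layers = [
    ([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.0, 0.0]),
    ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.0, 0.0]),
]
probabilities = forward([1.0, 2.0], layers)
```

Training is the process of adjusting those weights so that the probabilities match the ground truth; adding more hidden layers to the `layers` list is what makes the network "deeper".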
We plan to start by applying this approach to the second stage of OMR, that is, music object detection. To do this, we need a vast dataset containing images of music sheets. This dataset should also include ground truth so that the model can learn well from it. A part of the data, called test data, should not be seen by the model during training. This way, we can evaluate how well the model performs on things it has never seen before. The model should be designed based on the nature of the experiment, the input, and the desired output. We also propose standardizing the formats of inputs and outputs, as well as how they are evaluated.
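Holding out test data can be sketched as follows. This is a minimal version assuming the dataset is a list of (image, ground truth) pairs; the file names, the 20% test fraction, and the function name `train_test_split` are placeholders of mine, not details of our actual setup.

```python
import random


def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle the samples and hold out a fraction that the model never trains on."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical dataset: each sheet image paired with its annotation file.
dataset = [(f"page_{i}.png", f"page_{i}.xml") for i in range(10)]
train_set, test_set = train_test_split(dataset)
```

For sheet music it is usually worth splitting by piece or by writer rather than by page, so that near-identical pages do not end up on both sides, but that depends on the dataset.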
A. Rebelo, I. Fujinaga, F. Paszkiewicz, A. R. S. Marcal, C. Guedes, and J. S. Cardoso, “Optical music recognition: state-of-the-art and open issues,” Int J Multimed Info Retr, vol. 1, no. 3, pp. 173–190, Oct. 2012. [Online]. Available: http://link.springer.com/10.1007/s13735-012-0004-6
J. Calvo-Zaragoza, J. Hajič Jr., and A. Pacha, “Understanding Optical Music Recognition,” arXiv:1908.03608 [cs, eess], Aug. 2019. [Online]. Available: http://arxiv.org/abs/1908.03608