On the application of BERT models for nanopore methylation detection

Motivation: DNA methylation is a common epigenetic modification that is widely associated with various biological processes, such as gene expression, aging, and disease. Nanopore sequencing offers a promising approach to methylation detection by monitoring abnormal signal shifts caused by modified bases in target motif regions. Recently, model-based approaches, especially those built on deep learning models, have achieved significant performance improvements in nanopore methylation detection. In this work, we explore using bidirectional encoder representations from transformers (BERT) for this task, whose non-recurrent neural structure enables fast parallel computation.

Results: We find that the original BERT architecture does not perform as well as the bidirectional recurrent neural network (biRNN) on the nanopore methylation prediction task. Through further analysis, we observe recurring patterns of positional signal shift in the context window surrounding target 5-methylcytosine (5mC) and N6-methyladenine (6mA) motifs. We propose a refined BERT with relative position representation and center hidden units concatenation, which incorporates these task-specific characteristics into the model.
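
The sketch below is a minimal, illustrative PyTorch example (not the authors' implementation) of the two refinements named above: a relative-position term, simplified here to a learned bias added to the attention logits, and a classifier head that concatenates the hidden units around the window center instead of using a single summary token. The class names, per-base feature dimension, window length, and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class RelPosSelfAttention(nn.Module):
    """Single-head self-attention with a learned relative-position bias (simplified form)."""

    def __init__(self, d_model, max_rel_dist=10):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.scale = d_model ** -0.5
        self.max_rel_dist = max_rel_dist
        # One learned bias per clipped relative distance in [-max, +max].
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_rel_dist + 1))

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        logits = torch.einsum("btd,bsd->bts", q, k) * self.scale
        pos = torch.arange(x.size(1), device=x.device)
        rel = (pos[None, :] - pos[:, None]).clamp(-self.max_rel_dist,
                                                  self.max_rel_dist)
        logits = logits + self.rel_bias[rel + self.max_rel_dist]
        return torch.einsum("bts,bsd->btd", logits.softmax(dim=-1), v)


class CenterConcatBERT(nn.Module):
    """Encoder whose classifier concatenates the hidden units around the window center."""

    def __init__(self, input_dim=7, d_model=64, n_layers=3, center_k=3, n_classes=2):
        super().__init__()
        self.center_k = center_k
        self.embed = nn.Linear(input_dim, d_model)      # per-base signal features
        self.attn_layers = nn.ModuleList(
            [RelPosSelfAttention(d_model) for _ in range(n_layers)])
        self.ffn_layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(n_layers)])
        self.norms = nn.ModuleList(
            [nn.LayerNorm(d_model) for _ in range(2 * n_layers)])
        # The classifier sees the concatenated hidden units of the
        # 2*center_k + 1 positions around the target site.
        self.classifier = nn.Linear((2 * center_k + 1) * d_model, n_classes)

    def forward(self, feats):                   # feats: (batch, seq_len, input_dim)
        h = self.embed(feats)
        for i, (attn, ffn) in enumerate(zip(self.attn_layers, self.ffn_layers)):
            h = self.norms[2 * i](h + attn(h))
            h = self.norms[2 * i + 1](h + ffn(h))
        c = h.size(1) // 2                      # center position of the window
        center = h[:, c - self.center_k: c + self.center_k + 1, :]
        return self.classifier(center.flatten(1))


# Example: a batch of 8 windows, 21 bases each, 7 signal features per base.
model = CenterConcatBERT()
logits = model(torch.randn(8, 21, 7))           # -> (8, 2) class logits
```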

We perform systematic in-sample and cross-sample evaluations. The experimental results show that the refined BERT model achieves results competitive with, and in some cases better than, the state-of-the-art biRNN model, while its inference speed is about 6x faster. In addition, in the cross-sample evaluation on datasets from different research groups, the BERT models demonstrate good generalization performance.

Authors: Yao-zhong Zhang, Sera Hatakeyama, Kiyoshi Yamaguchi, Yoichi Furukawa, Satoru Miyano, Rui Yamaguchi, Seiya Imoto