Malang, Indonesia — Researchers from the Data Science Lab at STIKI Malang have made significant strides in tackling hate speech in the Javanese language. Led by Mukhlis Amien, the team has developed a robust machine learning model designed to detect hate speech in Javanese, a language spoken by over 84 million people but underrepresented in computational linguistics.
Addressing Low-Resource Language Challenges
The study highlights the challenges of working with low-resource languages like Javanese, chiefly limited data availability and significant linguistic diversity. To overcome these hurdles, the researchers combined a synthesized dataset with machine learning techniques such as transfer learning, unsupervised learning, and data augmentation.
“Our goal was to create a model that could effectively identify hate speech despite the complexities and variations within the Javanese language,” said Amien. “We combined various NLP techniques to adapt to these challenges, making significant advancements in this area.”
Innovative Methodology
The team’s approach involved several key steps:
- Data Collection and Pre-processing: They gathered diverse forms of Javanese speech, translated existing English and Indonesian hate speech data into Javanese, and augmented this data using advanced NLP techniques.
- Model Development: The team fine-tuned the pre-trained multilingual model XLM-RoBERTa on the augmented Javanese dataset. They also employed unsupervised learning and domain adaptation to improve the model’s accuracy.
- Evaluation and Refinement: The model’s performance was evaluated using standard metrics and an in-depth error analysis. An iterative refinement process, including active learning, ensured continuous improvement.
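
The evaluation and active learning step can be illustrated with a minimal Python sketch. The article does not say which metrics or which selection strategy the team used, so this assumes precision, recall, and F1 on the hate speech class, plus entropy-based uncertainty sampling to pick examples for annotation. The function names and sample values are hypothetical and are not taken from the study.

```python
from math import log2


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive (hate speech) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def binary_entropy(p):
    """Entropy in bits of a binary prediction with positive-class probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def select_for_annotation(probabilities, k):
    """Indices of the k unlabeled examples the model is least certain about."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: binary_entropy(probabilities[i]),
        reverse=True,
    )
    return ranked[:k]


# Illustrative labels and scores only (assumed, not results from the study).
gold = [1, 0, 1, 1, 0]
predicted = [1, 0, 0, 1, 1]
precision, recall, f1 = precision_recall_f1(gold, predicted)

unlabeled_scores = [0.95, 0.52, 0.10, 0.45, 0.99]
to_label = select_for_annotation(unlabeled_scores, k=2)
```

In a loop of this kind, the selected examples would be labeled by annotators, added to the training set, and the model fine-tuned again before the next round of evaluation.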
Promising Results and Future Directions
The model showed promising results in detecting hate speech in Javanese, which the team takes as evidence that the synthesized dataset approach is effective. The researchers say the framework could also be adapted to other low-resource languages.
“This research marks a substantial advancement in fostering linguistic diversity and shielding communities from the harmful effects of hate speech,” Amien added.
The team’s future work will focus on expanding the dataset, improving model robustness, and integrating more granular linguistic features to refine detection. Regular updates to the model are intended to keep it in step with the evolving nature of language and hate speech patterns.
About STIKI Malang
STIKI Malang, a leading institution in Indonesia, continues to contribute to the fields of data science and computational linguistics. This latest research underscores its commitment to developing solutions that address real-world challenges.