# First post - language detection
This is my first blog post. In this post I would like to talk about language detection.
# Problem Statement
Language detection is a very common problem that comes up early in data mining.
Suppose we have multiple content posts saved in a .csv file, and we need to detect the language of each one.
Let's solve this problem using python3 and the langdetect library.
This solution is very simple, but at the same time good enough for most cases.
First of all, we need to import the libraries in Python:
import pandas as pd
from langdetect import detect
As a second step, we need to load the data from the .csv file.
df = pd.read_csv("input_file.csv")
Suppose the schema contains a field 'PostContent', in which we need to detect the language. We can check the available columns by calling this line of code:
print(df.columns)
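The same check can be made explicit, so that a missing column fails early with a readable message. This is only a minimal sketch: the in-memory sample data and the helper name `require_column` are assumptions for illustration, not part of pandas.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for input_file.csv
sample_csv = io.StringIO(
    "PostId,PostContent\n"
    "1,This is an English sentence.\n"
    "2,Ceci est une phrase en français.\n"
)
df = pd.read_csv(sample_csv)


def require_column(frame, name):
    """Raise a readable error if the expected column is missing."""
    if name not in frame.columns:
        raise KeyError(f"Column {name!r} not found; available: {list(frame.columns)}")


require_column(df, "PostContent")
print(df.columns.tolist())
```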
Now we need to define a safe function with fallback logic for the case where the language cannot be detected. Let's use English, with the code 'en', as the fallback.
def safe_detect(s):
    try:
        return detect(str(s))
    except Exception:
        # Fall back to English when detection fails (e.g. empty or non-text input)
        return 'en'
Now, we are ready to process the whole column in one line.
df['PostLang'] = df['PostContent'].apply(safe_detect)
So, we have detected the language and stored it in a new column 'PostLang'. Let's save the result to a file.
df.to_csv('output_file.csv')
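Note that `to_csv` also writes the row index as an extra first column by default. If that is not wanted, passing `index=False` leaves it out. Here is a minimal sketch using an in-memory buffer instead of a real file; the sample rows and their language labels are written by hand for illustration.

```python
import io

import pandas as pd

# Hypothetical result frame; the labels here are hand-written, not detected
df = pd.DataFrame({
    "PostContent": ["Hello world", "Bonjour le monde"],
    "PostLang": ["en", "fr"],
})

# Write without the index column, then read the data back to confirm the layout
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.columns.tolist())
```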
In this simple example we saw how to use the 'langdetect' library for language detection on your data. In the next posts we will look at how to train and use more advanced models for language detection.