The steps for normalizing text with the NLTK library are as follows:
# Import NLTK and download its data packages ('all' fetches every corpus and model)
import nltk
nltk.download('all')
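Downloading 'all' pulls in the full NLTK data collection, which is far more than these steps use. As a rough sketch, the download can be limited to the resources the later steps appear to need. The resource list and the helper name download_required are assumptions for illustration: "punkt" and "stopwords" cover tokenization and stop words, while "punkt_tab" is only requested by newer NLTK releases.

```python
import nltk

# Assumed resource names for the steps in this article:
# "punkt" / "punkt_tab" for word_tokenize and sent_tokenize,
# "stopwords" for the English stop-word list.
REQUIRED_RESOURCES = ["punkt", "punkt_tab", "stopwords"]


def download_required(resources=REQUIRED_RESOURCES):
    """Hypothetical helper: download each resource, return name -> success flag."""
    # nltk.download returns True on success and False on failure.
    return {name: nltk.download(name, quiet=True) for name in resources}
```

Calling download_required() once before the rest of the script should be enough; a False value in the returned dictionary points to a resource that could not be fetched on the installed NLTK version.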
Tokenize the text with word_tokenize (for words) and sent_tokenize (for sentences). For example:
from nltk.tokenize import word_tokenize
text = "Hello, how are you?"
tokens = word_tokenize(text)
print(tokens)
# Remove English stop words (compared case-insensitively)
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in tokens if word.lower() not in stop_words]
print(filtered_words)
# Reduce each remaining word to its stem with the Porter stemmer
from nltk.stem import PorterStemmer
ps = PorterStemmer()
stemmed_words = [ps.stem(word) for word in filtered_words]
print(stemmed_words)
# Lowercase, drop tokens that are not purely alphanumeric (e.g. punctuation), and rejoin
normalized_text = ' '.join([word.lower() for word in stemmed_words if word.isalnum()])
print(normalized_text)
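The steps above can also be wrapped into a single reusable function. The sketch below is a variant rather than the exact code from the steps: it swaps word_tokenize for RegexpTokenizer so that it runs without any downloaded data, and it uses a tiny hand-written stop-word set. The names normalize and DEMO_STOP_WORDS are hypothetical, and the stop-word set is only there for illustration.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Hypothetical, deliberately small stop-word set for illustration only;
# stopwords.words('english') is the fuller list used in the steps above.
DEMO_STOP_WORDS = {"a", "an", "the", "is", "are", "over", "how", "you"}


def normalize(text, stop_words=DEMO_STOP_WORDS):
    """Tokenize, drop stop words, stem, and rejoin into one string."""
    # r"\w+" keeps runs of word characters, so punctuation never becomes a token.
    tokenizer = RegexpTokenizer(r"\w+")
    stemmer = PorterStemmer()

    tokens = tokenizer.tokenize(text)
    kept = [token for token in tokens if token.lower() not in stop_words]
    # PorterStemmer.stem lowercases its input by default.
    return " ".join(stemmer.stem(token) for token in kept)


print(normalize("The cats are running over the hills."))  # cat run hill
```

Passing set(stopwords.words('english')) as the stop_words argument brings the function back in line with the stop-word step above, once the stopwords corpus has been downloaded.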
With the steps above, NLTK can be used to normalize text so that it is easier to analyze and process.