This is a simple binary classification problem: using the tweet text together with the keyword and location columns (both of which contain NaN values), predict whether a tweet is about a real disaster.
Since my knowledge of natural language processing is close to nonexistent, I built a text model and a tabular-data model separately and combined them with soft voting.
Competition link: https://www.kaggle.com/competitions/nlp-getting-started/submissions
Full code
Libraries and data loading¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from tensorflow import keras
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
import nltk
import string
import tensorflow_hub as hub
import tensorflow as tf
import sklearn
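One setup note: nltk.word_tokenize and the English stopword list used below depend on NLTK corpora that are not bundled with the package. If they are not already installed on the machine, a one-time download such as the following is needed (resource names as of the NLTK version used here; newer versions may also require 'punkt_tab'):
import nltk
nltk.download('punkt')      # tokenizer models used by nltk.word_tokenize
nltk.download('stopwords')  # English stopword list used by nltk.corpus.stopwords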
data = pd.read_csv("D:/kaggle_datasets/nlp-getting-started/train.csv")
X_test = pd.read_csv("D:/kaggle_datasets/nlp-getting-started/test.csv")
submission = pd.read_csv("D:/kaggle_datasets/nlp-getting-started/sample_submission.csv")
y = data["target"]
data
id | keyword | location | text | target | |
---|---|---|---|---|---|
0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |
... | ... | ... | ... | ... | ... |
7608 | 10869 | NaN | NaN | Two giant cranes holding a bridge collapse int... | 1 |
7609 | 10870 | NaN | NaN | @aria_ahrary @TheTawniest The out of control w... | 1 |
7610 | 10871 | NaN | NaN | M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt... | 1 |
7611 | 10872 | NaN | NaN | Police investigating after an e-bike collided ... | 1 |
7612 | 10873 | NaN | NaN | The Latest: More Homes Razed by Northern Calif... | 1 |
7613 rows × 5 columns
X = data["text"]
X_for_tree = data[["keyword", "location"]]
The data is split so that an LSTM can be trained on the text and a RandomForest on the tabular data.
X is the text data for the LSTM, and X_for_tree is the tabular data (keyword, location) for the RandomForest.
Noise removal¶
Text data carries a lot of noise, so cleaning it up is an essential step.
X[:10]
0 our deeds are the reason of this earthquake ma...
1 forest fire near la ronge sask canada
2 all residents asked to shelter in place are be...
3 people receive wildfires evacuation orders in...
4 just got sent this photo from ruby alaska as s...
5 rockyfire update california hwy closed in bo...
6 flood disaster heavy rain causes flash floodin...
7 im on top of the hill and i can see a fire in ...
8 theres an emergency evacuation happening now i...
9 im afraid that the tornado is coming to our area
dtype: object
def clean_text(text):
    '''Make text lowercase, remove text in square brackets, remove links, remove HTML tags,
    remove punctuation, remove newlines and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                              # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                # links
    text = re.sub(r'<.*?>+', '', text)                               # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', '', text)                                   # newlines
    text = re.sub(r'\w*\d\w*', '', text)                             # words containing numbers
    return text
X = pd.Series(map(clean_text, X))
X[:10]
0 our deeds are the reason of this earthquake ma...
1 forest fire near la ronge sask canada
2 all residents asked to shelter in place are be...
3 people receive wildfires evacuation orders in...
4 just got sent this photo from ruby alaska as s...
5 rockyfire update california hwy closed in bo...
6 flood disaster heavy rain causes flash floodin...
7 im on top of the hill and i can see a fire in ...
8 theres an emergency evacuation happening now i...
9 im afraid that the tornado is coming to our area
dtype: object
Tokenization¶
Tokenize the sentences so the model can learn word by word instead of treating each sentence as a single unit.
X = [nltk.word_tokenize(sentence) for sentence in X]
pd.DataFrame(X[:5])
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | our | deeds | are | the | reason | of | this | earthquake | may | allah | ... | all | None | None | None | None | None | None | None | None | None |
1 | forest | fire | near | la | ronge | sask | canada | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
2 | all | residents | asked | to | shelter | in | place | are | being | notified | ... | no | other | evacuation | or | shelter | in | place | orders | are | expected |
3 | people | receive | wildfires | evacuation | orders | in | california | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
4 | just | got | sent | this | photo | from | ruby | alaska | as | smoke | ... | pours | into | a | school | None | None | None | None | None | None |
5 rows × 22 columns
Stopword removal¶
Text data contains many stopwords that add little to model training.
[I see fire at the mountain] -> [see fire mountain] is the kind of preprocessing applied here.
It sounds like a daunting task, but the nltk library already ships with an English stopword list, so it can be done very simply.
nltk.corpus.stopwords.words('english')[:5]
['i', 'me', 'my', 'myself', 'we']
pd.DataFrame(X[:5])
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | our | deeds | are | the | reason | of | this | earthquake | may | allah | ... | all | None | None | None | None | None | None | None | None | None |
1 | forest | fire | near | la | ronge | sask | canada | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
2 | all | residents | asked | to | shelter | in | place | are | being | notified | ... | no | other | evacuation | or | shelter | in | place | orders | are | expected |
3 | people | receive | wildfires | evacuation | orders | in | california | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
4 | just | got | sent | this | photo | from | ruby | alaska | as | smoke | ... | pours | into | a | school | None | None | None | None | None | None |
5 rows × 22 columns
def remove_stopwords(text):
    # keep only the tokens that are not in NLTK's English stopword list
    words = [w for w in text if w not in nltk.corpus.stopwords.words('english')]
    return words
X = list(map(remove_stopwords, X))
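A side note on speed: the comprehension above checks each token against nltk.corpus.stopwords.words('english'), a list lookup repeated for every word. Building the stopword set once gives the same result much faster on the full dataset; a minimal drop-in sketch (the name remove_stopwords_fast is hypothetical):
# Build the stopword set once; membership tests on a set are O(1)
english_stopwords = set(nltk.corpus.stopwords.words('english'))

def remove_stopwords_fast(text):
    # same filtering as remove_stopwords above, just with a precomputed set
    return [w for w in text if w not in english_stopwords]

# X = list(map(remove_stopwords_fast, X))  # equivalent to the call above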
pd.DataFrame(X[:5])
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | deeds | reason | earthquake | may | allah | forgive | us | None | None | None | None |
1 | forest | fire | near | la | ronge | sask | canada | None | None | None | None |
2 | residents | asked | shelter | place | notified | officers | evacuation | shelter | place | orders | expected |
3 | people | receive | wildfires | evacuation | orders | california | None | None | None | None | None |
4 | got | sent | photo | ruby | alaska | smoke | wildfires | pours | school | None | None |
Integer encoding¶
The model cannot learn from the words themselves, so each word has to be encoded as an integer.
pd.DataFrame(X[:5])
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | deeds | reason | earthquake | may | allah | forgive | us | None | None | None | None |
1 | forest | fire | near | la | ronge | sask | canada | None | None | None | None |
2 | residents | asked | shelter | place | notified | officers | evacuation | shelter | place | orders | expected |
3 | people | receive | wildfires | evacuation | orders | california | None | None | None | None | None |
4 | got | sent | photo | ruby | alaska | smoke | wildfires | pours | school | None | None |
tokenizer = keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(X)
X = tokenizer.texts_to_sequences(X)
pd.DataFrame(X[:5])
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4164 | 724 | 143 | 53 | 1512 | 4165 | 12.0 | NaN | NaN | NaN | NaN |
1 | 95 | 4 | 125 | 579 | 6288 | 6289 | 1188.0 | NaN | NaN | NaN | NaN |
2 | 1513 | 1391 | 1881 | 530 | 6290 | 1514 | 149.0 | 1881.0 | 530.0 | 1189.0 | 911.0 |
3 | 8 | 4166 | 1190 | 149 | 1189 | 23 | NaN | NaN | NaN | NaN | NaN |
4 | 20 | 1021 | 228 | 6291 | 1882 | 156 | 1190.0 | 6292.0 | 96.0 | NaN | NaN |
Padding¶
An LSTM will be trained on the text data, and RNN-family models do not train well when the sequences get too long.
If the data is used as-is, padding is done to the length of the longest tweet; to avoid that, I padded to a length that still keeps most of the information.
print("Average langth of tweet:", sum(map(len, X))/len(X))
plt.hist([len(tweet) for tweet in X])
Average length of tweet: 8.932221200577958
(array([2.160e+02, 6.370e+02, 1.172e+03, 2.213e+03, 1.472e+03, 1.160e+03,
6.730e+02, 6.200e+01, 6.000e+00, 2.000e+00]),
array([ 0. , 2.3, 4.6, 6.9, 9.2, 11.5, 13.8, 16.1, 18.4, 20.7, 23. ]),
<BarContainer object of 10 artists>)
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(10,5))
ax1.hist([len(X[i]) for i in range(len(X)) if y[i]==1])
ax1.set_title("Real")
ax2.hist([len(X[i]) for i in range(len(X)) if y[i]==0], color="orange")
ax2.set_title("Fake")
Text(0.5, 1.0, 'Fake')
# count tweets longer than 20 tokens to see how much a maxlen of 20 would cut off
count = 0
for tweet in X:
    if len(tweet) > 20:
        count += 1
print((len(X)-count) / len(X))
0.9997372914751084
X = keras.preprocessing.sequence.pad_sequences(X, maxlen=20, padding="post")
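One detail worth noting, since about 0.03% of the tweets are longer than 20 tokens: pad_sequences truncates over-length sequences from the front by default (truncating='pre'), so those tweets lose their first words. A tiny demo of the default behaviour, plus the argument to change it if keeping the beginning matters:
demo = keras.preprocessing.sequence.pad_sequences([[1, 2, 3, 4, 5]], maxlen=3, padding="post")
print(demo)  # [[3 4 5]] -> the leading tokens were dropped; pass truncating="post" to drop trailing tokens instead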
Below is a function, written for the test data, that bundles all of the preceding steps together.
def preprocess(text_sequence):
    # apply the same cleaning, tokenization, stopword removal, integer encoding and padding as above
    text_sequence = pd.Series(map(clean_text, text_sequence))
    text_sequence = [nltk.word_tokenize(sentence) for sentence in text_sequence]
    text_sequence = list(map(remove_stopwords, text_sequence))
    text_sequence = tokenizer.texts_to_sequences(text_sequence)
    text_sequence = keras.preprocessing.sequence.pad_sequences(text_sequence, maxlen=20, padding="post")
    return text_sequence
Tabular data encoding and data split¶
In the tabular data, the keyword and location columns also need to be turned into integers.
They were encoded with sklearn's LabelEncoder, and location values in the validation set that the encoder has not seen were replaced with nan.
Since the encoder was already fit on nan (the training set contains missing values), this works without errors.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, stratify=y, random_state=42)
X_train_tree, X_valid_tree, y_train, y_valid = train_test_split(X_for_tree, y, stratify=y, random_state=42)
k_encoder = LabelEncoder()
X_train_tree["keyword"] = k_encoder.fit_transform(X_train_tree["keyword"])
X_valid_tree["keyword"] = k_encoder.transform(X_valid_tree["keyword"])
l_encoder = LabelEncoder()
X_train_tree["location"] = l_encoder.fit_transform(X_train_tree["location"])
temp = np.array(X_valid_tree["location"])
for index in range(len(temp)):
    # replace locations unseen during fit with nan so transform does not fail
    if temp[index] not in l_encoder.classes_:
        temp[index] = np.nan
X_valid_tree["location"] = temp
X_valid_tree["location"] = l_encoder.transform(X_valid_tree["location"])
Model training¶
def make_model(iter=1, hidden=100):
    Input = keras.Input(shape=[20])
    # no embedding yet: the integer word IDs are simply reshaped to (20, 1) and fed to the LSTM
    x = keras.layers.Reshape((20, 1))(Input)
    for _ in range(iter):
        x = keras.layers.Bidirectional(keras.layers.LSTM(hidden, return_sequences=True))(x)
    x = keras.layers.Bidirectional(keras.layers.LSTM(hidden))(x)
    x = keras.layers.Dense((hidden + 40) // 2, activation="relu")(x)
    output = keras.layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs=Input, outputs=output)
    model.compile(loss="binary_crossentropy", metrics="accuracy", optimizer="adam")
    return model
model = make_model(1, 100)
history = model.fit(X_train, y_train, validation_data=(X_valid, y_valid), epochs=30)
Epoch 1/30
179/179 [==============================] - 13s 22ms/step - loss: 0.6727 - accuracy: 0.5787 - val_loss: 0.6894 - val_accuracy: 0.5867
Epoch 2/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6656 - accuracy: 0.5952 - val_loss: 0.6598 - val_accuracy: 0.6008
Epoch 3/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6637 - accuracy: 0.6012 - val_loss: 0.6582 - val_accuracy: 0.6071
Epoch 4/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6640 - accuracy: 0.5915 - val_loss: 0.6572 - val_accuracy: 0.6035
Epoch 5/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6616 - accuracy: 0.5978 - val_loss: 0.6717 - val_accuracy: 0.5830
Epoch 6/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6594 - accuracy: 0.6029 - val_loss: 0.6512 - val_accuracy: 0.6155
Epoch 7/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6611 - accuracy: 0.6001 - val_loss: 0.6536 - val_accuracy: 0.6056
Epoch 8/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6598 - accuracy: 0.5998 - val_loss: 0.6543 - val_accuracy: 0.6014
Epoch 9/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6596 - accuracy: 0.6043 - val_loss: 0.6679 - val_accuracy: 0.5809
Epoch 10/30
179/179 [==============================] - 3s 14ms/step - loss: 0.6606 - accuracy: 0.5926 - val_loss: 0.6527 - val_accuracy: 0.6077
Epoch 11/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6585 - accuracy: 0.5977 - val_loss: 0.6506 - val_accuracy: 0.6155
Epoch 12/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6595 - accuracy: 0.6075 - val_loss: 0.6544 - val_accuracy: 0.6113
Epoch 13/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6560 - accuracy: 0.6064 - val_loss: 0.6502 - val_accuracy: 0.6134
Epoch 14/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6553 - accuracy: 0.6045 - val_loss: 0.6473 - val_accuracy: 0.6192
Epoch 15/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6543 - accuracy: 0.6075 - val_loss: 0.6444 - val_accuracy: 0.6192
Epoch 16/30
179/179 [==============================] - 3s 14ms/step - loss: 0.6495 - accuracy: 0.6159 - val_loss: 0.6414 - val_accuracy: 0.6229
Epoch 17/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6530 - accuracy: 0.6129 - val_loss: 0.6476 - val_accuracy: 0.6145
Epoch 18/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6493 - accuracy: 0.6124 - val_loss: 0.6452 - val_accuracy: 0.6203
Epoch 19/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6455 - accuracy: 0.6174 - val_loss: 0.6449 - val_accuracy: 0.6119
Epoch 20/30
179/179 [==============================] - 3s 14ms/step - loss: 0.6471 - accuracy: 0.6146 - val_loss: 0.6465 - val_accuracy: 0.6224
Epoch 21/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6453 - accuracy: 0.6167 - val_loss: 0.6439 - val_accuracy: 0.6261
Epoch 22/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6410 - accuracy: 0.6234 - val_loss: 0.6445 - val_accuracy: 0.6245
Epoch 23/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6393 - accuracy: 0.6260 - val_loss: 0.6417 - val_accuracy: 0.6224
Epoch 24/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6359 - accuracy: 0.6290 - val_loss: 0.6386 - val_accuracy: 0.6245
Epoch 25/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6338 - accuracy: 0.6252 - val_loss: 0.6381 - val_accuracy: 0.6345
Epoch 26/30
179/179 [==============================] - 3s 15ms/step - loss: 0.6307 - accuracy: 0.6264 - val_loss: 0.6438 - val_accuracy: 0.6313
Epoch 27/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6289 - accuracy: 0.6301 - val_loss: 0.6379 - val_accuracy: 0.6350
Epoch 28/30
179/179 [==============================] - 3s 14ms/step - loss: 0.6276 - accuracy: 0.6355 - val_loss: 0.6411 - val_accuracy: 0.6308
Epoch 29/30
179/179 [==============================] - 3s 14ms/step - loss: 0.6203 - accuracy: 0.6453 - val_loss: 0.6402 - val_accuracy: 0.6381
Epoch 30/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6198 - accuracy: 0.6367 - val_loss: 0.6492 - val_accuracy: 0.6387
I grid-searched the model shape over iter and hidden, and the parameters above are the ones used to train on the text data.
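The grid search itself is not included in the notebook; a minimal sketch of what it could look like, reusing make_model and the train/validation split above (the candidate values and epoch count here are assumptions, not the ones actually searched):
best_params, best_acc = None, 0.0
for iter_ in [1, 2, 3]:              # number of stacked bidirectional LSTM blocks
    for hidden in [50, 100, 200]:    # LSTM units per direction
        m = make_model(iter_, hidden)
        m.fit(X_train, y_train, validation_data=(X_valid, y_valid), epochs=10, verbose=0)
        _, acc = m.evaluate(X_valid, y_valid, verbose=0)
        if acc > best_acc:
            best_params, best_acc = (iter_, hidden), acc
print("Best (iter, hidden):", best_params, "val_accuracy:", best_acc)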
plt.figure(figsize=(14, 10))
plt.plot(history.history["accuracy"], label="Train")
plt.plot(history.history["val_accuracy"], label="Valid")
plt.legend()
<matplotlib.legend.Legend at 0x24d4b13ab20>
model.evaluate(X_valid, y_valid)
60/60 [==============================] - 1s 8ms/step - loss: 0.6492 - accuracy: 0.6387
[0.6491609811782837, 0.6386554837226868]
tree_model = RandomForestClassifier()
tree_model.fit(X_train_tree, y_train)
RandomForestClassifier()
X_valid_tree
keyword | location | |
---|---|---|
6509 | 189 | NaN |
3768 | 107 | NaN |
5507 | 159 | NaN |
5116 | 148 | NaN |
18 | 221 | NaN |
... | ... | ... |
1280 | 36 | NaN |
7206 | 210 | NaN |
12 | 221 | NaN |
4078 | 117 | United States |
1654 | 47 | NaN |
1904 rows × 2 columns
tree_model.score(X_valid_tree, y_valid)
0.717436974789916
The tree model performs better than the LSTM.
My limited knowledge of NLP preprocessing and RNN-family models probably kept me from building a model that properly learns this complex data.
tree_score = tree_model.predict_proba(X_valid_tree)
ann_score = np.concatenate((1-model.predict(X_valid), model.predict(X_valid)), axis=1)
tree_score
array([[0.65883849, 0.34116151],
[0.69167114, 0.30832886],
[0.5269497 , 0.4730503 ],
...,
[0.31059442, 0.68940558],
[0. , 1. ],
[0.49417857, 0.50582143]])
ann_score
array([[0.57062364, 0.4293764 ],
[0.38166547, 0.61833453],
[0.7892802 , 0.21071981],
...,
[0.49951833, 0.50048167],
[0.5015851 , 0.4984149 ],
[0.35483283, 0.6451672 ]], dtype=float32)
sklearn.metrics.accuracy_score(np.argmax(tree_score + ann_score, axis=1), y_valid)
0.7363445378151261
For the soft voting ensemble, each model's predicted probability for each label is summed and the label with the highest combined score is chosen.
Submission¶
Before submitting, the models need to be retrained on the full training data.
model = make_model(1, 100)
model.fit(X,y, epochs=30)
Epoch 1/30
238/238 [==============================] - 9s 13ms/step - loss: 0.6678 - accuracy: 0.5851
Epoch 2/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6617 - accuracy: 0.5983
Epoch 3/30
238/238 [==============================] - 3s 13ms/step - loss: 0.6610 - accuracy: 0.6011
Epoch 4/30
238/238 [==============================] - 3s 13ms/step - loss: 0.6606 - accuracy: 0.5998
Epoch 5/30
238/238 [==============================] - 3s 13ms/step - loss: 0.6589 - accuracy: 0.5996
Epoch 6/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6579 - accuracy: 0.6019
Epoch 7/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6554 - accuracy: 0.6069
Epoch 8/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6556 - accuracy: 0.6057
Epoch 9/30
238/238 [==============================] - 3s 13ms/step - loss: 0.6508 - accuracy: 0.6154
Epoch 10/30
238/238 [==============================] - 3s 13ms/step - loss: 0.6505 - accuracy: 0.6153
Epoch 11/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6480 - accuracy: 0.6168
Epoch 12/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6492 - accuracy: 0.6180
Epoch 13/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6448 - accuracy: 0.6162
Epoch 14/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6459 - accuracy: 0.6220
Epoch 15/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6439 - accuracy: 0.6203
Epoch 16/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6392 - accuracy: 0.6285
Epoch 17/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6387 - accuracy: 0.6300
Epoch 18/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6349 - accuracy: 0.6312
Epoch 19/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6317 - accuracy: 0.6322
Epoch 20/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6301 - accuracy: 0.6365
Epoch 21/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6294 - accuracy: 0.6377
Epoch 22/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6256 - accuracy: 0.6371
Epoch 23/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6226 - accuracy: 0.6434
Epoch 24/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6185 - accuracy: 0.6409
Epoch 25/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6160 - accuracy: 0.6448
Epoch 26/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6097 - accuracy: 0.6489
Epoch 27/30
238/238 [==============================] - 3s 13ms/step - loss: 0.6053 - accuracy: 0.6530
Epoch 28/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6024 - accuracy: 0.6603
Epoch 29/30
238/238 [==============================] - 3s 12ms/step - loss: 0.5993 - accuracy: 0.6540
Epoch 30/30
238/238 [==============================] - 3s 12ms/step - loss: 0.5909 - accuracy: 0.6649
<keras.callbacks.History at 0x24ea6885b50>
k_encoder = LabelEncoder()
data["keyword"] = k_encoder.fit_transform(data["keyword"])
l_encoder = LabelEncoder()
data["location"] = l_encoder.fit_transform(data["location"])
tree_model = RandomForestClassifier()
tree_model.fit(data[["keyword", "location"]], y)
RandomForestClassifier()
The same preprocessing has to be applied to X_test.
X_test_text = preprocess(X_test["text"])
temp = np.array(X_test["location"])
for index in range(len(temp)):
    # replace unseen test locations with nan, same as for the validation set
    if temp[index] not in l_encoder.classes_:
        temp[index] = np.nan
X_test["location"] = temp
X_test["keyword"] = k_encoder.transform(X_test["keyword"])
X_test["location"] = l_encoder.transform(X_test["location"])
ann_score = np.concatenate((1-model.predict(X_test_text), model.predict(X_test_text)), axis=1)
tree_score = tree_model.predict_proba(X_test[["keyword", "location"]])
Final_score = np.argmax(tree_score+ann_score, axis=1)
submission["target"] = Final_score
submission.to_csv("./Ensemble_rnn_tree.csv", index=None)
The final score is about 72% accuracy, which ranks around the bottom of the leaderboard (roughly the top 90%). I plan to raise the score further by vectorizing the text properly or by using word embeddings; a rough sketch of that direction follows below.
Kaggle notebook link: https://www.kaggle.com/code/hykhhijk/soft-voting-using-lstm-and-random-forest/notebook
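As a pointer for that next step, here is a minimal sketch of swapping the Reshape layer for a trainable Embedding layer so the LSTM sees dense word vectors instead of raw integer IDs (the embedding size and layer sizes are assumptions, not tuned values):
vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding

def make_embedding_model(hidden=100, embed_dim=64):
    inputs = keras.Input(shape=[20])
    # each integer ID becomes a dense embed_dim-dimensional vector; padded zeros are masked out
    x = keras.layers.Embedding(vocab_size, embed_dim, mask_zero=True)(inputs)
    x = keras.layers.Bidirectional(keras.layers.LSTM(hidden))(x)
    x = keras.layers.Dense((hidden + 40) // 2, activation="relu")(x)
    outputs = keras.layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs=inputs, outputs=outputs)
    model.compile(loss="binary_crossentropy", metrics="accuracy", optimizer="adam")
    return model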