This is a simple binary classification problem: using the tweet text together with the keyword and location columns (both of which contain NaN values), predict whether a tweet is about a real disaster.
Since my knowledge of natural language processing is close to nonexistent, I built a text model and a tabular-data model separately and combined them with soft voting.
Competition link: https://www.kaggle.com/competitions/nlp-getting-started/submissions
Full code
Libraries and data loading¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from tensorflow import keras
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
import nltk
import string
import tensorflow_hub as hub
import tensorflow as tf
import sklearn
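One setup note: nltk.word_tokenize and the English stopword list used below depend on NLTK corpora that are not bundled with the package. If they are not already installed on the machine, a one-time download such as the following is needed (resource names as of the NLTK version used here; newer versions may also require 'punkt_tab'):
import nltk
nltk.download('punkt')      # tokenizer models used by nltk.word_tokenize
nltk.download('stopwords')  # English stopword list used by nltk.corpus.stopwords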
data = pd.read_csv("D:/kaggle_datasets/nlp-getting-started/train.csv")
X_test = pd.read_csv("D:/kaggle_datasets/nlp-getting-started/test.csv")
submission = pd.read_csv("D:/kaggle_datasets/nlp-getting-started/sample_submission.csv")
y = data["target"]
data
id | keyword | location | text | target | |
---|---|---|---|---|---|
0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |
... | ... | ... | ... | ... | ... |
7608 | 10869 | NaN | NaN | Two giant cranes holding a bridge collapse int... | 1 |
7609 | 10870 | NaN | NaN | @aria_ahrary @TheTawniest The out of control w... | 1 |
7610 | 10871 | NaN | NaN | M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt... | 1 |
7611 | 10872 | NaN | NaN | Police investigating after an e-bike collided ... | 1 |
7612 | 10873 | NaN | NaN | The Latest: More Homes Razed by Northern Calif... | 1 |
7613 rows × 5 columns
X = data["text"]
X_for_tree = data[["keyword", "location"]]
The data is split so that an LSTM can be trained on the text and a RandomForest on the tabular data.
X is the text data for the LSTM, and X_for_tree is the tabular data (keyword, location) for the RandomForest.
Noise removal¶
Text data carries a lot of noise, so cleaning it up is an essential step.
X[:10]
0 our deeds are the reason of this earthquake ma...
1 forest fire near la ronge sask canada
2 all residents asked to shelter in place are be...
3 people receive wildfires evacuation orders in...
4 just got sent this photo from ruby alaska as s...
5 rockyfire update california hwy closed in bo...
6 flood disaster heavy rain causes flash floodin...
7 im on top of the hill and i can see a fire in ...
8 theres an emergency evacuation happening now i...
9 im afraid that the tornado is coming to our area
dtype: object
def clean_text(text):
    '''Make text lowercase, remove text in square brackets, remove links, remove HTML tags,
    remove punctuation, remove newlines and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                              # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                # links
    text = re.sub(r'<.*?>+', '', text)                               # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', '', text)                                   # newlines
    text = re.sub(r'\w*\d\w*', '', text)                             # words containing numbers
    return text
X = pd.Series(map(clean_text, X))
X[:10]
0 our deeds are the reason of this earthquake ma...
1 forest fire near la ronge sask canada
2 all residents asked to shelter in place are be...
3 people receive wildfires evacuation orders in...
4 just got sent this photo from ruby alaska as s...
5 rockyfire update california hwy closed in bo...
6 flood disaster heavy rain causes flash floodin...
7 im on top of the hill and i can see a fire in ...
8 theres an emergency evacuation happening now i...
9 im afraid that the tornado is coming to our area
dtype: object
Tokenization¶
Tokenize the sentences so the model can learn word by word instead of treating each sentence as a single unit.
X = [nltk.word_tokenize(sentence) for sentence in X]
pd.DataFrame(X[:5])
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | our | deeds | are | the | reason | of | this | earthquake | may | allah | ... | all | None | None | None | None | None | None | None | None | None |
1 | forest | fire | near | la | ronge | sask | canada | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
2 | all | residents | asked | to | shelter | in | place | are | being | notified | ... | no | other | evacuation | or | shelter | in | place | orders | are | expected |
3 | people | receive | wildfires | evacuation | orders | in | california | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
4 | just | got | sent | this | photo | from | ruby | alaska | as | smoke | ... | pours | into | a | school | None | None | None | None | None | None |
5 rows × 22 columns
Stopword removal¶
Text data contains many stopwords that add little to model training.
[I see fire at the mountain] -> [see fire mountain] is the kind of preprocessing applied here.
It sounds like a daunting task, but the nltk library already ships with an English stopword list, so it can be done very simply.
nltk.corpus.stopwords.words('english')[:5]
['i', 'me', 'my', 'myself', 'we']
pd.DataFrame(X[:5])
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | our | deeds | are | the | reason | of | this | earthquake | may | allah | ... | all | None | None | None | None | None | None | None | None | None |
1 | forest | fire | near | la | ronge | sask | canada | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
2 | all | residents | asked | to | shelter | in | place | are | being | notified | ... | no | other | evacuation | or | shelter | in | place | orders | are | expected |
3 | people | receive | wildfires | evacuation | orders | in | california | None | None | None | ... | None | None | None | None | None | None | None | None | None | None |
4 | just | got | sent | this | photo | from | ruby | alaska | as | smoke | ... | pours | into | a | school | None | None | None | None | None | None |
5 rows × 22 columns
def remove_stopwords(text):
    # keep only the tokens that are not in NLTK's English stopword list
    words = [w for w in text if w not in nltk.corpus.stopwords.words('english')]
    return words
X = list(map(remove_stopwords, X))
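A side note on speed: the comprehension above checks each token against nltk.corpus.stopwords.words('english'), a list lookup repeated for every word. Building the stopword set once gives the same result much faster on the full dataset; a minimal drop-in sketch (the name remove_stopwords_fast is hypothetical):
# Build the stopword set once; membership tests on a set are O(1)
english_stopwords = set(nltk.corpus.stopwords.words('english'))

def remove_stopwords_fast(text):
    # same filtering as remove_stopwords above, just with a precomputed set
    return [w for w in text if w not in english_stopwords]

# X = list(map(remove_stopwords_fast, X))  # equivalent to the call above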
pd.DataFrame(X[:5])
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | deeds | reason | earthquake | may | allah | forgive | us | None | None | None | None |
1 | forest | fire | near | la | ronge | sask | canada | None | None | None | None |
2 | residents | asked | shelter | place | notified | officers | evacuation | shelter | place | orders | expected |
3 | people | receive | wildfires | evacuation | orders | california | None | None | None | None | None |
4 | got | sent | photo | ruby | alaska | smoke | wildfires | pours | school | None | None |
Integer encoding¶
The model cannot learn from the words themselves, so each word has to be encoded as an integer.
pd.DataFrame(X[:5])
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | deeds | reason | earthquake | may | allah | forgive | us | None | None | None | None |
1 | forest | fire | near | la | ronge | sask | canada | None | None | None | None |
2 | residents | asked | shelter | place | notified | officers | evacuation | shelter | place | orders | expected |
3 | people | receive | wildfires | evacuation | orders | california | None | None | None | None | None |
4 | got | sent | photo | ruby | alaska | smoke | wildfires | pours | school | None | None |
tokenizer = keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(X)
X = tokenizer.texts_to_sequences(X)
pd.DataFrame(X[:5])
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4164 | 724 | 143 | 53 | 1512 | 4165 | 12.0 | NaN | NaN | NaN | NaN |
1 | 95 | 4 | 125 | 579 | 6288 | 6289 | 1188.0 | NaN | NaN | NaN | NaN |
2 | 1513 | 1391 | 1881 | 530 | 6290 | 1514 | 149.0 | 1881.0 | 530.0 | 1189.0 | 911.0 |
3 | 8 | 4166 | 1190 | 149 | 1189 | 23 | NaN | NaN | NaN | NaN | NaN |
4 | 20 | 1021 | 228 | 6291 | 1882 | 156 | 1190.0 | 6292.0 | 96.0 | NaN | NaN |
Padding¶
An LSTM will be trained on the text data, and RNN-family models do not train well when the sequences get too long.
If the data is used as-is, padding is done to the length of the longest tweet; to avoid that, I padded to a length that still keeps most of the information.
print("Average langth of tweet:", sum(map(len, X))/len(X))
plt.hist([len(tweet) for tweet in X])
Average length of tweet: 8.932221200577958
(array([2.160e+02, 6.370e+02, 1.172e+03, 2.213e+03, 1.472e+03, 1.160e+03,
6.730e+02, 6.200e+01, 6.000e+00, 2.000e+00]),
array([ 0. , 2.3, 4.6, 6.9, 9.2, 11.5, 13.8, 16.1, 18.4, 20.7, 23. ]),
<BarContainer object of 10 artists>)
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(10,5))
ax1.hist([len(X[i]) for i in range(len(X)) if y[i]==1])
ax1.set_title("Real")
ax2.hist([len(X[i]) for i in range(len(X)) if y[i]==0], color="orange")
ax2.set_title("Fake")
Text(0.5, 1.0, 'Fake')
# count tweets longer than 20 tokens to see how much a maxlen of 20 would cut off
count = 0
for tweet in X:
    if len(tweet) > 20:
        count += 1
print((len(X)-count) / len(X))
0.9997372914751084
X = keras.preprocessing.sequence.pad_sequences(X, maxlen=20, padding="post")
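One detail worth noting, since about 0.03% of the tweets are longer than 20 tokens: pad_sequences truncates over-length sequences from the front by default (truncating='pre'), so those tweets lose their first words. A tiny demo of the default behaviour, plus the argument to change it if keeping the beginning matters:
demo = keras.preprocessing.sequence.pad_sequences([[1, 2, 3, 4, 5]], maxlen=3, padding="post")
print(demo)  # [[3 4 5]] -> the leading tokens were dropped; pass truncating="post" to drop trailing tokens instead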
Below is a function, written for the test data, that bundles all of the preceding steps together.
def preprocess(text_sequence):
    # apply the same cleaning, tokenization, stopword removal, integer encoding and padding as above
    text_sequence = pd.Series(map(clean_text, text_sequence))
    text_sequence = [nltk.word_tokenize(sentence) for sentence in text_sequence]
    text_sequence = list(map(remove_stopwords, text_sequence))
    text_sequence = tokenizer.texts_to_sequences(text_sequence)
    text_sequence = keras.preprocessing.sequence.pad_sequences(text_sequence, maxlen=20, padding="post")
    return text_sequence
Tabular data encoding and data split¶
In the tabular data, the keyword and location columns also need to be turned into integers.
They were encoded with sklearn's LabelEncoder, and location values in the validation set that the encoder has not seen were replaced with nan.
Since the encoder was already fit on nan (the training set contains missing values), this works without errors.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, stratify=y, random_state=42)
X_train_tree, X_valid_tree, y_train, y_valid = train_test_split(X_for_tree, y, stratify=y, random_state=42)
k_encoder = LabelEncoder()
X_train_tree["keyword"] = k_encoder.fit_transform(X_train_tree["keyword"])
X_valid_tree["keyword"] = k_encoder.transform(X_valid_tree["keyword"])
l_encoder = LabelEncoder()
X_train_tree["location"] = l_encoder.fit_transform(X_train_tree["location"])
temp = np.array(X_valid_tree["location"])
for index in range(len(temp)):
    # replace locations unseen during fit with nan so transform does not fail
    if temp[index] not in l_encoder.classes_:
        temp[index] = np.nan
X_valid_tree["location"] = temp
X_valid_tree["location"] = l_encoder.transform(X_valid_tree["location"])
Model training¶
def make_model(iter=1, hidden=100):
    Input = keras.Input(shape=[20])
    # no embedding yet: the integer word IDs are simply reshaped to (20, 1) and fed to the LSTM
    x = keras.layers.Reshape((20, 1))(Input)
    for _ in range(iter):
        x = keras.layers.Bidirectional(keras.layers.LSTM(hidden, return_sequences=True))(x)
    x = keras.layers.Bidirectional(keras.layers.LSTM(hidden))(x)
    x = keras.layers.Dense((hidden + 40) // 2, activation="relu")(x)
    output = keras.layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs=Input, outputs=output)
    model.compile(loss="binary_crossentropy", metrics="accuracy", optimizer="adam")
    return model
model = make_model(1, 100)
history = model.fit(X_train, y_train, validation_data=(X_valid, y_valid), epochs=30)
Epoch 1/30
179/179 [==============================] - 13s 22ms/step - loss: 0.6727 - accuracy: 0.5787 - val_loss: 0.6894 - val_accuracy: 0.5867
Epoch 2/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6656 - accuracy: 0.5952 - val_loss: 0.6598 - val_accuracy: 0.6008
Epoch 3/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6637 - accuracy: 0.6012 - val_loss: 0.6582 - val_accuracy: 0.6071
Epoch 4/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6640 - accuracy: 0.5915 - val_loss: 0.6572 - val_accuracy: 0.6035
Epoch 5/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6616 - accuracy: 0.5978 - val_loss: 0.6717 - val_accuracy: 0.5830
Epoch 6/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6594 - accuracy: 0.6029 - val_loss: 0.6512 - val_accuracy: 0.6155
Epoch 7/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6611 - accuracy: 0.6001 - val_loss: 0.6536 - val_accuracy: 0.6056
Epoch 8/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6598 - accuracy: 0.5998 - val_loss: 0.6543 - val_accuracy: 0.6014
Epoch 9/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6596 - accuracy: 0.6043 - val_loss: 0.6679 - val_accuracy: 0.5809
Epoch 10/30
179/179 [==============================] - 3s 14ms/step - loss: 0.6606 - accuracy: 0.5926 - val_loss: 0.6527 - val_accuracy: 0.6077
Epoch 11/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6585 - accuracy: 0.5977 - val_loss: 0.6506 - val_accuracy: 0.6155
Epoch 12/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6595 - accuracy: 0.6075 - val_loss: 0.6544 - val_accuracy: 0.6113
Epoch 13/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6560 - accuracy: 0.6064 - val_loss: 0.6502 - val_accuracy: 0.6134
Epoch 14/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6553 - accuracy: 0.6045 - val_loss: 0.6473 - val_accuracy: 0.6192
Epoch 15/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6543 - accuracy: 0.6075 - val_loss: 0.6444 - val_accuracy: 0.6192
Epoch 16/30
179/179 [==============================] - 3s 14ms/step - loss: 0.6495 - accuracy: 0.6159 - val_loss: 0.6414 - val_accuracy: 0.6229
Epoch 17/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6530 - accuracy: 0.6129 - val_loss: 0.6476 - val_accuracy: 0.6145
Epoch 18/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6493 - accuracy: 0.6124 - val_loss: 0.6452 - val_accuracy: 0.6203
Epoch 19/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6455 - accuracy: 0.6174 - val_loss: 0.6449 - val_accuracy: 0.6119
Epoch 20/30
179/179 [==============================] - 3s 14ms/step - loss: 0.6471 - accuracy: 0.6146 - val_loss: 0.6465 - val_accuracy: 0.6224
Epoch 21/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6453 - accuracy: 0.6167 - val_loss: 0.6439 - val_accuracy: 0.6261
Epoch 22/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6410 - accuracy: 0.6234 - val_loss: 0.6445 - val_accuracy: 0.6245
Epoch 23/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6393 - accuracy: 0.6260 - val_loss: 0.6417 - val_accuracy: 0.6224
Epoch 24/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6359 - accuracy: 0.6290 - val_loss: 0.6386 - val_accuracy: 0.6245
Epoch 25/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6338 - accuracy: 0.6252 - val_loss: 0.6381 - val_accuracy: 0.6345
Epoch 26/30
179/179 [==============================] - 3s 15ms/step - loss: 0.6307 - accuracy: 0.6264 - val_loss: 0.6438 - val_accuracy: 0.6313
Epoch 27/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6289 - accuracy: 0.6301 - val_loss: 0.6379 - val_accuracy: 0.6350
Epoch 28/30
179/179 [==============================] - 3s 14ms/step - loss: 0.6276 - accuracy: 0.6355 - val_loss: 0.6411 - val_accuracy: 0.6308
Epoch 29/30
179/179 [==============================] - 3s 14ms/step - loss: 0.6203 - accuracy: 0.6453 - val_loss: 0.6402 - val_accuracy: 0.6381
Epoch 30/30
179/179 [==============================] - 2s 14ms/step - loss: 0.6198 - accuracy: 0.6367 - val_loss: 0.6492 - val_accuracy: 0.6387
I grid-searched the model shape over iter and hidden, and the parameters above are the ones used to train on the text data.
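The grid search itself is not included in the notebook; a minimal sketch of what it could look like, reusing make_model and the train/validation split above (the candidate values and epoch count here are assumptions, not the ones actually searched):
best_params, best_acc = None, 0.0
for iter_ in [1, 2, 3]:              # number of stacked bidirectional LSTM blocks
    for hidden in [50, 100, 200]:    # LSTM units per direction
        m = make_model(iter_, hidden)
        m.fit(X_train, y_train, validation_data=(X_valid, y_valid), epochs=10, verbose=0)
        _, acc = m.evaluate(X_valid, y_valid, verbose=0)
        if acc > best_acc:
            best_params, best_acc = (iter_, hidden), acc
print("Best (iter, hidden):", best_params, "val_accuracy:", best_acc)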
plt.figure(figsize=(14, 10))
plt.plot(history.history["accuracy"], label="Train")
plt.plot(history.history["val_accuracy"], label="Valid")
plt.legend()
<matplotlib.legend.Legend at 0x24d4b13ab20>
model.evaluate(X_valid, y_valid)
60/60 [==============================] - 1s 8ms/step - loss: 0.6492 - accuracy: 0.6387
[0.6491609811782837, 0.6386554837226868]
tree_model = RandomForestClassifier()
tree_model.fit(X_train_tree, y_train)
RandomForestClassifier()
X_valid_tree
keyword | location | |
---|---|---|
6509 | 189 | NaN |
3768 | 107 | NaN |
5507 | 159 | NaN |
5116 | 148 | NaN |
18 | 221 | NaN |
... | ... | ... |
1280 | 36 | NaN |
7206 | 210 | NaN |
12 | 221 | NaN |
4078 | 117 | United States |
1654 | 47 | NaN |
1904 rows × 2 columns
tree_model.score(X_valid_tree, y_valid)
0.717436974789916
The tree model performs better than the LSTM.
My limited knowledge of NLP preprocessing and RNN-family models probably kept me from building a model that properly learns this complex data.
tree_score = tree_model.predict_proba(X_valid_tree)
ann_score = np.concatenate((1-model.predict(X_valid), model.predict(X_valid)), axis=1)
tree_score
array([[0.65883849, 0.34116151],
[0.69167114, 0.30832886],
[0.5269497 , 0.4730503 ],
...,
[0.31059442, 0.68940558],
[0. , 1. ],
[0.49417857, 0.50582143]])
ann_score
array([[0.57062364, 0.4293764 ],
[0.38166547, 0.61833453],
[0.7892802 , 0.21071981],
...,
[0.49951833, 0.50048167],
[0.5015851 , 0.4984149 ],
[0.35483283, 0.6451672 ]], dtype=float32)
sklearn.metrics.accuracy_score(np.argmax(tree_score + ann_score, axis=1), y_valid)
0.7363445378151261
For the soft voting ensemble, each model's predicted probability for each label is summed and the label with the highest combined score is chosen.
Submission¶
Before submitting, the models need to be retrained on the full training data.
model = make_model(1, 100)
model.fit(X,y, epochs=30)
Epoch 1/30
238/238 [==============================] - 9s 13ms/step - loss: 0.6678 - accuracy: 0.5851
Epoch 2/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6617 - accuracy: 0.5983
Epoch 3/30
238/238 [==============================] - 3s 13ms/step - loss: 0.6610 - accuracy: 0.6011
Epoch 4/30
238/238 [==============================] - 3s 13ms/step - loss: 0.6606 - accuracy: 0.5998
Epoch 5/30
238/238 [==============================] - 3s 13ms/step - loss: 0.6589 - accuracy: 0.5996
Epoch 6/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6579 - accuracy: 0.6019
Epoch 7/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6554 - accuracy: 0.6069
Epoch 8/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6556 - accuracy: 0.6057
Epoch 9/30
238/238 [==============================] - 3s 13ms/step - loss: 0.6508 - accuracy: 0.6154
Epoch 10/30
238/238 [==============================] - 3s 13ms/step - loss: 0.6505 - accuracy: 0.6153
Epoch 11/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6480 - accuracy: 0.6168
Epoch 12/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6492 - accuracy: 0.6180
Epoch 13/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6448 - accuracy: 0.6162
Epoch 14/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6459 - accuracy: 0.6220
Epoch 15/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6439 - accuracy: 0.6203
Epoch 16/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6392 - accuracy: 0.6285
Epoch 17/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6387 - accuracy: 0.6300
Epoch 18/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6349 - accuracy: 0.6312
Epoch 19/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6317 - accuracy: 0.6322
Epoch 20/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6301 - accuracy: 0.6365
Epoch 21/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6294 - accuracy: 0.6377
Epoch 22/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6256 - accuracy: 0.6371
Epoch 23/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6226 - accuracy: 0.6434
Epoch 24/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6185 - accuracy: 0.6409
Epoch 25/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6160 - accuracy: 0.6448
Epoch 26/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6097 - accuracy: 0.6489
Epoch 27/30
238/238 [==============================] - 3s 13ms/step - loss: 0.6053 - accuracy: 0.6530
Epoch 28/30
238/238 [==============================] - 3s 12ms/step - loss: 0.6024 - accuracy: 0.6603
Epoch 29/30
238/238 [==============================] - 3s 12ms/step - loss: 0.5993 - accuracy: 0.6540
Epoch 30/30
238/238 [==============================] - 3s 12ms/step - loss: 0.5909 - accuracy: 0.6649
<keras.callbacks.History at 0x24ea6885b50>
k_encoder = LabelEncoder()
data["keyword"] = k_encoder.fit_transform(data["keyword"])
l_encoder = LabelEncoder()
data["location"] = l_encoder.fit_transform(data["location"])
tree_model = RandomForestClassifier()
tree_model.fit(data[["keyword", "location"]], y)
RandomForestClassifier()
The same preprocessing has to be applied to X_test.
X_test_text = preprocess(X_test["text"])
temp = np.array(X_test["location"])
for index in range(len(temp)):
    # replace unseen test locations with nan, same as for the validation set
    if temp[index] not in l_encoder.classes_:
        temp[index] = np.nan
X_test["location"] = temp
X_test["keyword"] = k_encoder.transform(X_test["keyword"])
X_test["location"] = l_encoder.transform(X_test["location"])
ann_score = np.concatenate((1-model.predict(X_test_text), model.predict(X_test_text)), axis=1)
tree_score = tree_model.predict_proba(X_test[["keyword", "location"]])
Final_score = np.argmax(tree_score+ann_score, axis=1)
submission["target"] = Final_score
submission.to_csv("./Ensemble_rnn_tree.csv", index=None)
The final score is about 72% accuracy, which ranks around the bottom of the leaderboard (roughly the top 90%). I plan to raise the score further by vectorizing the text properly or by using word embeddings; a rough sketch of that direction follows below.
Kaggle notebook link: https://www.kaggle.com/code/hykhhijk/soft-voting-using-lstm-and-random-forest/notebook
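As a pointer for that next step, here is a minimal sketch of swapping the Reshape layer for a trainable Embedding layer so the LSTM sees dense word vectors instead of raw integer IDs (the embedding size and layer sizes are assumptions, not tuned values):
vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding

def make_embedding_model(hidden=100, embed_dim=64):
    inputs = keras.Input(shape=[20])
    # each integer ID becomes a dense embed_dim-dimensional vector; padded zeros are masked out
    x = keras.layers.Embedding(vocab_size, embed_dim, mask_zero=True)(inputs)
    x = keras.layers.Bidirectional(keras.layers.LSTM(hidden))(x)
    x = keras.layers.Dense((hidden + 40) // 2, activation="relu")(x)
    outputs = keras.layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs=inputs, outputs=outputs)
    model.compile(loss="binary_crossentropy", metrics="accuracy", optimizer="adam")
    return model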