Intern Project 2. Preparation Functions (3)

불굴의관돌이 2021. 1. 13. 16:54

This is actually the part of the preparation I put the most effort into: how to relate the trained words to one another and present them as a word cloud.

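My approach was to split the words into three groups (positive-only words, negative-only words, and words that are both positive and negative), then compute the total number of words, the total frequency of each group, and the frequency of each word, and combine these into a score that I assigned to each word.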

That way, when new data is compared against it, the importance of each word can be judged from the weights of the trained data. I will skip the calculation itself here.
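Since the actual formula is left out, the following is only a hypothetical illustration of how the three quantities (total number of words, total frequency of a group, frequency of each word) could be combined into one score. It is not the formula used in this project; the function name `group_scores` and the way the terms are multiplied are assumptions made for the example.

```python
from collections import Counter


def group_scores(group_tokens, total_word_count):
    """Hypothetical score: a word's share of its group, scaled by the group's share of all words."""
    counts = Counter(group_tokens)                 # frequency of each word in this group
    group_total = sum(counts.values())             # total frequency of the group
    group_share = len(counts) / total_word_count   # distinct words in the group / all distinct words
    return {w: (c / group_total) * group_share for w, c in counts.items()}


# Toy data: tokens seen only in positive reviews, out of 6 distinct words overall (assumed numbers).
positive_only = ["좋다", "좋다", "추천", "만족"]
scores = group_scores(positive_only, total_word_count=6)
print(scores)
```

Whatever the real combination looks like, the result is the same kind of object: one weight per word, which is what the pickle files loaded below contain.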

# Imports the functions in this post rely on (okt is assumed to be KoNLPy's Okt tagger).
from functools import lru_cache

import matplotlib.pyplot as plt
import pandas as pd
from konlpy.tag import Okt
from wordcloud import WordCloud

okt = Okt()


@lru_cache(maxsize=1)  # read the pickles only once instead of on every scoring() call
def call_dict():
    """Merge the positive, negative, and mixed word tables into one {word: weight} dict."""
    all_dic = {}
    # Same order as before (positive, negative, both): for a duplicated word the later table wins.
    for name in ("real_po", "real_ne", "real_with"):
        df = pd.read_pickle(f"C:\\data\\{name}.pkl")
        all_dic.update(zip(df['word'], df['weight']))
    return all_dic
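One detail of building the dictionary this way is worth seeing on small data: when the same word appears in more than one table, `dict(zip(...))` keeps the value that comes last. The words and weights below are made up for illustration and are not taken from the real pickle files.

```python
# Toy stand-ins for the 'word' and 'weight' columns of two tables (assumed values).
po_word, po_weight = ["좋다", "배송"], [0.8, 0.3]
with_word, with_weight = ["배송"], [0.5]

all_word = po_word + with_word
all_weight = po_weight + with_weight
all_dic = dict(zip(all_word, all_weight))

# "배송" is in both tables, so the weight from the later table (with_*) is kept.
print(all_dic)
```

If the three groups are truly disjoint this never matters, but it is a cheap thing to check before trusting the merged weights.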

# Score each incoming review with the predefined weights and group the results per line (review).
def scoring(sentence):
    dictionary = call_dict()
    stopwords = ['도', '는', '다', '의', '가', '이', '은', '한', '에', '하', '고', '을', '를', '인', '듯', '과', '와', '네', '들', '지', '임', '게', '요', '거', '로', '으로',
                 '것', '수', '할', '하는', '제', '에서', '그', '데', '번', '해도', '죠', '된', '건', '바', '구', '세', '최신', '.']
    word_tokens = [x for x in okt.morphs(sentence) if x not in stopwords]

    word, score = [], []
    for x in word_tokens:
        # Keep only multi-character tokens that have a nonzero weight in the dictionary.
        if len(x) > 1 and dictionary.get(x):
            word.append(x)
            score.append(dictionary[x])

    # Normalize each score by the total score of this review.
    summation = sum(score)
    weight = [sc / summation for sc in score]
    # A word repeated in the review ends up as a single entry.
    dict_n = dict(zip(word, weight))
    # Return the 10 highest-weighted words as {word: weight}.
    s_dic = sorted(dict_n.items(), key=lambda x: x[1], reverse=True)
    return dict(s_dic[:10])
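To make the normalization and top-N step easier to follow without KoNLPy or the pickle files, here is a minimal sketch on hand-written tokens. The helper name `top_weights`, the toy dictionary, and the `n` parameter are assumptions for the example; the logic mirrors the filter, normalize, sort, and slice steps of `scoring`.

```python
def top_weights(tokens, dictionary, n=10):
    """Keep known multi-character tokens, normalize their scores, return the top n."""
    pairs = [(t, dictionary[t]) for t in tokens if len(t) > 1 and dictionary.get(t)]
    total = sum(score for _, score in pairs)
    weights = {t: score / total for t, score in pairs}
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])


# Toy weights (assumed values); "가격" is not in the dictionary and is dropped.
toy_dictionary = {"배송": 2.0, "빠르다": 6.0, "포장": 2.0}
tokens = ["배송", "빠르다", "포장", "가격"]

result = top_weights(tokens, toy_dictionary, n=2)
print(result)  # {'빠르다': 0.6, '배송': 0.2}
```

Because the weights are relative to the review's own total, a word's value says how much of that review's score it accounts for, not how important it is across all reviews.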
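## Use the scoring function to collect the reviews loaded from the CSV file
## The location parameter is looked up and changed when running from the main function
def merge_all(data):
    # data is a DataFrame whose 'content' column holds the review texts
    merged = {}
    for line in data['content']:
        # When a word shows up in several reviews, the weight from the earliest review is kept.
        merged = {**scoring(line), **merged}
    return merged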
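The order inside `{**new, **merged}` decides which weight survives when a word appears in more than one review: the dictionary unpacked last wins, so the value already stored in `merged` is kept. A small sketch with made-up per-review results shows the effect.

```python
# Made-up outputs of scoring() for two reviews (assumed values).
per_review = [
    {"배송": 0.6, "포장": 0.4},
    {"배송": 0.1, "가격": 0.9},
]

merged = {}
for review in per_review:
    # merged is unpacked last, so existing entries are not overwritten.
    merged = {**review, **merged}

print(merged["배송"])  # 0.6, the weight from the first review
```

If summing the weights across reviews, or keeping the maximum, fits the word cloud better, this is the one line to change.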

def wordcloud(df):
    tokens = merge_all(df)
    # Malgun Gothic is used so that Hangul renders correctly; the path assumes Windows.
    cloud = WordCloud(font_path='C:/Windows/Fonts/malgun.ttf', background_color='white', colormap='Accent_r',
                      width=1500, height=1000).generate_from_frequencies(tokens)
    plt.imshow(cloud)
    plt.axis('off')

    plt.show()
