
[Python Machine Learning Complete Guide (파이썬 머신러닝 완벽가이드)] Recommender Systems - Content-Based

아네스 2020. 12. 21. 20:57

The code still renders oddly when I upload it. Not sure why.

Cosine similarity? Something to look up on YouTube.

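1. Sorting by cosine similarity alone also ranks movies with a low vote_average as highly similar.

Let's bring vote_average into the selection as well.

2. That is better than before, but I want to filter out movies that got an average of 10 from a single vote.

3. Let's pick movies that are both highly similar and well rated by many voters.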

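A weighted rating like the following can be used to build a new column and select on it.

weighted rating = (v / (v + m)) * R + (m / (v + m)) * C

Here v is the movie's vote_count, R is the movie's vote_average, C is the mean vote_average over all movies, and m is a vote-count threshold (the code below uses the 60th percentile of vote_count).

In this formula, when v is below m, the weight v / (v + m) on the movie's own rating becomes very small even if that rating is high (e.g. v = 10, m = 300 gives a weight of about 0.03).

Only when v is well above m does the movie's own rating carry real weight (e.g. v = 1000, m = 300 gives a weight of about 0.77).

The second term accounts for the fact that some sites rate generously and others rate harshly: it pulls movies with few votes toward the overall mean C.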
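To make the effect of v relative to m concrete, here is a minimal sketch of the same weighted rating as a standalone function. The values of m, C, v and R are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
def weighted_rating(v, R, m, C):
    """Blend a movie's own rating R with the global mean C, weighted by vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical numbers, chosen only to show how v relative to m changes the result.
m, C = 300, 6.0
few_votes = weighted_rating(v=10, R=10.0, m=m, C=C)     # perfect score, almost no votes
many_votes = weighted_rating(v=1000, R=8.5, m=m, C=C)   # lower score, many votes

print(round(few_votes, 3))   # 6.129
print(round(many_votes, 3))  # 7.923
```

The movie rated 10.0 by only ten voters ends up close to the mean C, while the well-voted movie keeps most of its own rating.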

In [65]:

from IPython.display import display, HTML
display(HTML("<style>.container { width:1200px !important; }</style>"))

In [51]:

import pandas as pd
import numpy as np
import warnings; warnings.filterwarnings('ignore')

movies = pd.read_csv("./tmdb_5000_movies.csv")
print(movies.shape)
movies.head(5)

(4803, 20)

Out[51]:

	budget	genres	homepage	id	keywords	original_language	original_title	overview	popularity	production_companies	production_countries	release_date	revenue	runtime	spoken_languages	status	tagline	title	vote_average	vote_count
0	237000000	[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {...	http://www.avatarmovie.com/	19995	[{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "sp...	en	Avatar	In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, ...	150.437577	[{"name": "Ingenious Film Partners", "id": 289}, {"name": "Twentieth Century Fox Film Corporatio...	[{"iso_3166_1": "US", "name": "United States of America"}, {"iso_3166_1": "GB", "name": "United ...	2009-12-10	2787965087	162.0	[{"iso_639_1": "en", "name": "English"}, {"iso_639_1": "es", "name": "Espa\u00f1ol"}]	Released	Enter the World of Pandora.	Avatar	7.2	11800
1	300000000	[{"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 28, "name": "Action"}]	http://disney.go.com/disneypictures/pirates/	285	[{"id": 270, "name": "ocean"}, {"id": 726, "name": "drug abuse"}, {"id": 911, "name": "exotic is...	en	Pirates of the Caribbean: At World's End	Captain Barbossa, long believed to be dead, has come back to life and is headed to the edge of t...	139.082615	[{"name": "Walt Disney Pictures", "id": 2}, {"name": "Jerry Bruckheimer Films", "id": 130}, {"na...	[{"iso_3166_1": "US", "name": "United States of America"}]	2007-05-19	961000000	169.0	[{"iso_639_1": "en", "name": "English"}]	Released	At the end of the world, the adventure begins.	Pirates of the Caribbean: At World's End	6.9	4500
2	245000000	[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 80, "name": "Crime"}]	http://www.sonypictures.com/movies/spectre/	206647	[{"id": 470, "name": "spy"}, {"id": 818, "name": "based on novel"}, {"id": 4289, "name": "secret...	en	Spectre	A cryptic message from Bond’s past sends him on a trail to uncover a sinister organization. Whil...	107.376788	[{"name": "Columbia Pictures", "id": 5}, {"name": "Danjaq", "id": 10761}, {"name": "B24", "id": ...	[{"iso_3166_1": "GB", "name": "United Kingdom"}, {"iso_3166_1": "US", "name": "United States of ...	2015-10-26	880674609	148.0	[{"iso_639_1": "fr", "name": "Fran\u00e7ais"}, {"iso_639_1": "en", "name": "English"}, {"iso_639...	Released	A Plan No One Escapes	Spectre	6.3	4466
3	250000000	[{"id": 28, "name": "Action"}, {"id": 80, "name": "Crime"}, {"id": 18, "name": "Drama"}, {"id": ...	http://www.thedarkknightrises.com/	49026	[{"id": 849, "name": "dc comics"}, {"id": 853, "name": "crime fighter"}, {"id": 949, "name": "te...	en	The Dark Knight Rises	Following the death of District Attorney Harvey Dent, Batman assumes responsibility for Dent's c...	112.312950	[{"name": "Legendary Pictures", "id": 923}, {"name": "Warner Bros.", "id": 6194}, {"name": "DC E...	[{"iso_3166_1": "US", "name": "United States of America"}]	2012-07-16	1084939099	165.0	[{"iso_639_1": "en", "name": "English"}]	Released	The Legend Ends	The Dark Knight Rises	7.6	9106
4	260000000	[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 878, "name": "Science Fic...	http://movies.disney.com/john-carter	49529	[{"id": 818, "name": "based on novel"}, {"id": 839, "name": "mars"}, {"id": 1456, "name": "medal...	en	John Carter	John Carter is a war-weary, former military captain who's inexplicably transported to the myster...	43.926995	[{"name": "Walt Disney Pictures", "id": 2}]	[{"iso_3166_1": "US", "name": "United States of America"}]	2012-03-07	284139100	132.0	[{"iso_639_1": "en", "name": "English"}]	Released	Lost in our world, found in another.	John Carter	6.1	2124

In [35]:

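# .copy() makes movies_df an independent DataFrame, so the column assignments below do not act on a slice of movies
movies_df = movies[['id','title', 'genres', 'vote_average', 'vote_count',
                 'popularity', 'keywords', 'overview']].copy()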

In [36]:

pd.set_option('display.max_colwidth', 100)
movies_df[['genres','keywords']][:1]

Out[36]:

	genres	keywords
0	[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {...	[{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "sp...

In [37]:

movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 8 columns):
id              4803 non-null int64
title           4803 non-null object
genres          4803 non-null object
vote_average    4803 non-null float64
vote_count      4803 non-null int64
popularity      4803 non-null float64
keywords        4803 non-null object
overview        4800 non-null object
dtypes: float64(2), int64(2), object(4)
memory usage: 300.3+ KB

First-pass text processing: parse the strings into Python dictionaries, then convert them to lists

In [38]:

from ast import literal_eval
# literal_eval parses a string into the Python object it represents (here, a list of dicts)
movies_df['genres'] = movies_df['genres'].apply(literal_eval)
movies_df['keywords'] = movies_df['keywords'].apply(literal_eval)

In [39]:

movies_df['keywords'].head(1)

Out[39]:

0    [{'id': 1463, 'name': 'culture clash'}, {'id': 2964, 'name': 'future'}, {'id': 3386, 'name': 'sp...
Name: keywords, dtype: object

In [40]:

movies_df['genres'] = movies_df['genres'].apply(lambda x: [y['name'] for y in x])
movies_df['keywords'] = movies_df['keywords'].apply(lambda x : [y['name'] for y in x])
movies_df[['genres','keywords']][:1]

Out[40]:

	genres	keywords
0	[Action, Adventure, Fantasy, Science Fiction]	[culture clash, future, space war, space colony, society, space travel, futuristic, romance, spa...
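The two steps above (parse the string, then keep only the names) can be tried on a single value. This is a small sketch using a shortened sample string in the same shape as the raw genres column; the variable names are only for illustration.

```python
from ast import literal_eval

# A shortened sample in the same shape as one raw 'genres' value.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = literal_eval(raw)             # str -> list of dicts
names = [d["name"] for d in parsed]    # keep only the 'name' values

print(type(parsed).__name__, names)    # list ['Action', 'Adventure']
```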

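Movie recommendation with genre-based content filtering: Count-vectorize the genre strings, then compare movies by cosine similarity

Count-based feature vectorization of the genre strings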

In [44]:

(' ').join(['test1', 'test2']) # str.join example

Out[44]:

'test1 test2'

In [47]:

from sklearn.feature_extraction.text import CountVectorizer

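# Join each genre list into a space-separated string so CountVectorizer can split it into words.
movies_df['genres_literal'] = movies_df['genres'].apply(lambda x : (' ').join(x))
# min_df=1 (the default) keeps every term; recent scikit-learn versions reject the integer 0.
count_vect = CountVectorizer(min_df = 1, ngram_range=(1,2))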
genre_mat = count_vect.fit_transform(movies_df['genres_literal'])
print(genre_mat.shape)

(4803, 276)
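To see what kind of terms ngram_range=(1,2) produces, here is a rough pure-Python sketch that lists the unigrams and bigrams of one genre string. It is a simplified stand-in (lowercase, split on whitespace), not CountVectorizer's actual tokenizer, and word_ngrams is a hypothetical helper name.

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return word n-grams for n_min <= n <= n_max (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Genres of the first movie, joined by spaces as in genres_literal.
avatar_terms = word_ngrams("Action Adventure Fantasy Science Fiction")
print(len(avatar_terms))  # 9
print(avatar_terms)
```

Because the genres are joined with spaces, "Science Fiction" is split into two words, and a bigram such as "fantasy science" appears even though it is not a genre.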

Cosine similarity between movies based on genre

In [48]:

from sklearn.metrics.pairwise import cosine_similarity

genre_sim = cosine_similarity(genre_mat, genre_mat)
print(genre_sim.shape)
print(genre_sim[:2])

(4803, 4803)
[[1.         0.59628479 0.4472136  ... 0.         0.         0.        ]
 [0.59628479 1.         0.4        ... 0.         0.         0.        ]]
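The value 0.59628479 between the first two movies can be checked by hand. The sketch below counts unigrams and bigrams with a simplified tokenizer (an assumption, not scikit-learn's implementation) and applies the cosine formula, dot(a, b) / (|a| * |b|), to the genre strings of the first two rows.

```python
import math
from collections import Counter

def ngram_counts(text):
    """Count unigrams and bigrams (lowercased, whitespace-split): a simplified stand-in."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

avatar = ngram_counts("Action Adventure Fantasy Science Fiction")   # 9 terms
pirates = ngram_counts("Adventure Fantasy Action")                  # 5 terms
sim = cosine(avatar, pirates)
print(round(sim, 8))  # 0.59628479
```

The two movies share four terms (action, adventure, fantasy, "adventure fantasy"), so the similarity is 4 / (sqrt(9) * sqrt(5)).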

In [49]:

genre_sim_sorted_ind = genre_sim.argsort()[:,::-1]
print(genre_sim_sorted_ind[:1])

[[   0 3494  813 ... 3038 3037 2401]]
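What argsort followed by a reversed slice does on one row can be sketched in plain Python: it returns positions ordered from the highest value to the lowest. The similarity values here are hypothetical.

```python
sims = [0.2, 1.0, 0.9, 0.5]  # hypothetical similarities of four movies to movie 1

# Positions ordered from most to least similar, like argsort()[::-1] on a single row.
order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
print(order)  # [1, 2, 3, 0]
```

When several movies have exactly the same similarity, their relative order depends on how the sort breaks ties, so a movie is not guaranteed to appear first in its own row.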

A function that returns the movies most similar in genre to a given movie

In [55]:

def find_sim_movie(df, sorted_ind, title_name, top_n = 10):
    # Select the row(s) of df whose 'title' column equals title_name
    title_movie = df[df['title']== title_name]
    
    # Get the index of the matching row(s) as an ndarray, then take the
    # top_n most similar indexes from sorted_ind (here genre_sim_sorted_ind)
    title_index = title_movie.index.values
    similar_indexes = sorted_ind[title_index, :(top_n)]
    
    # Print the extracted top_n indexes; the result is a 2-D array,
    # so flatten it to 1-D before using it to index the DataFrame
    print(similar_indexes)
    similar_indexes = similar_indexes.reshape(-1)
    
    return df.iloc[similar_indexes]

In [56]:

similar_movies = find_sim_movie(movies_df, genre_sim_sorted_ind, 'The Godfather',10)
similar_movies[['title', 'vote_average']]

[[2731 1243 3636 1946 2640 4065 1847 4217  883 3866]]

Out[56]:

	title	vote_average
2731	The Godfather: Part II	8.3
1243	Mean Streets	7.2
3636	Light Sleeper	5.7
1946	The Bad Lieutenant: Port of Call - New Orleans	6.0
2640	Things to Do in Denver When You're Dead	6.7
4065	Mi America	0.0
1847	GoodFellas	8.2
4217	Kids	6.8
883	Catch Me If You Can	7.7
3866	City of God	8.1

Keep the similar content, but order it by rating

In [57]:

movies_df[['title', 'vote_average','vote_count']].sort_values('vote_average', ascending = False)[:10]

Out[57]:

	title	vote_average	vote_count
3519	Stiff Upper Lips	10.0	1
4247	Me You and Five Bucks	10.0	2
4045	Dancer, Texas Pop. 81	10.0	1
4662	Little Big Top	10.0	1
3992	Sardaarji	9.5	2
2386	One Man's Hero	9.3	2
2970	There Goes My Baby	8.5	2
1881	The Shawshank Redemption	8.5	8205
2796	The Prisoner of Zenda	8.4	11
3337	The Godfather	8.4	5893

In [59]:

C = movies_df['vote_average'].mean()
m = movies_df['vote_count'].quantile(0.6)
print(f'C : {round(C,3)}, m: {round(m,3)}')

C : 6.092, m: 370.2

In [61]:

percentile = 0.6
m = movies_df['vote_count'].quantile(percentile)
C = movies_df['vote_average'].mean()

def weighted_vote_average(record):
    v = record['vote_count']
    R = record['vote_average']
    return ((v/(v+m))*R) + ((m/(m+v)) *C)

movies_df['weighted_vote'] = movies_df.apply(weighted_vote_average, axis = 1) # computed row by row

In [62]:

movies_df[['title','vote_average', 'weighted_vote', 'vote_count']].sort_values('weighted_vote', ascending = False)[:10]

Out[62]:

	title	vote_average	weighted_vote	vote_count
1881	The Shawshank Redemption	8.5	8.396052	8205
3337	The Godfather	8.4	8.263591	5893
662	Fight Club	8.3	8.216455	9413
3232	Pulp Fiction	8.3	8.207102	8428
65	The Dark Knight	8.2	8.136930	12002
1818	Schindler's List	8.3	8.126069	4329
3865	Whiplash	8.3	8.123248	4254
809	Forrest Gump	8.2	8.105954	7927
2294	Spirited Away	8.3	8.105867	3840
2731	The Godfather: Part II	8.3	8.079586	3338

In [64]:

def find_sim_movie(df, sorted_ind, title_name, top_n=10):
    title_movie = df[df['title'] == title_name]
    title_index = title_movie.index.values
    
    # Take the top_n * 2 indexes with the highest genre similarity
    similar_indexes = sorted_ind[title_index, :(top_n*2)]
    similar_indexes = similar_indexes.reshape(-1)
    # Exclude the reference movie's own index
    similar_indexes = similar_indexes[similar_indexes != title_index]
    
    # From those candidates, return the top_n with the highest weighted_vote
    return df.iloc[similar_indexes].sort_values('weighted_vote', ascending = False)[:top_n]

similar_movies = find_sim_movie(movies_df, genre_sim_sorted_ind, 'The Godfather', 10)
similar_movies[['title', 'vote_average', 'weighted_vote']]

Out[64]:

	title	vote_average	weighted_vote
2731	The Godfather: Part II	8.3	8.079586
1847	GoodFellas	8.2	7.976937
3866	City of God	8.1	7.759693
1663	Once Upon a Time in America	8.2	7.657811
883	Catch Me If You Can	7.7	7.557097
281	American Gangster	7.4	7.141396
4041	This Is England	7.4	6.739664
1149	American Hustle	6.8	6.717525
1243	Mean Streets	7.2	6.626569
2839	Rounders	6.9	6.530427
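The re-ranking idea in the second find_sim_movie (take twice as many similar candidates, drop the movie itself, then sort by weighted vote) can be sketched without pandas. The rerank helper and the (index, weighted_vote) pairs below are hypothetical and only illustrate the logic.

```python
def rerank(candidates, self_index, top_n):
    """candidates: list of (index, weighted_vote), already in descending similarity order."""
    # Candidate pool: the top_n * 2 most similar entries, minus the reference movie.
    pool = [c for c in candidates[: top_n * 2] if c[0] != self_index]
    # Within the pool, order by weighted vote and keep top_n.
    return sorted(pool, key=lambda c: c[1], reverse=True)[:top_n]

# Hypothetical (index, weighted_vote) pairs, most similar first.
candidates = [(7, 6.1), (3, 8.0), (0, 8.3), (5, 5.2), (9, 7.4), (2, 6.9)]
top = rerank(candidates, self_index=0, top_n=2)
print(top)  # [(3, 8.0), (7, 6.1)]
```

Index 9 has a higher weighted vote than index 7 but is left out because it falls outside the top_n * 2 similarity pool, which is the trade-off this approach makes between similarity and rating.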
