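Uploading code still renders oddly here. Not sure why.
Cosine similarity?? Let's look it up on YouTube.
1. Sorting by cosine similarity alone also ranks movies with a low vote_average as highly similar.
Let's include vote_average when extracting recommendations.
2. That is better than before, but I want to filter out movies that got a single vote and an average of 10 points.
3. Let's extract movies that are both highly similar and well voted (well rated).
We can build a new column from the weighted rating below and rank by it:
weighted rating = (v / (v + m)) * R + (m / (v + m)) * C
Here v is the movie's vote count, m is the minimum vote count required, R is the movie's average rating, and C is the mean rating over all movies.
When v is much lower than m, the weight on the movie's own rating, v / (v + m), becomes very small no matter how good that rating is (e.g. v = 10, m = 300).
Only when v is higher than m does the movie's own rating carry a reasonable weight (e.g. v = 1000, m = 300).
The second term accounts for the site's overall level: some sites rate generously on average and some rate harshly, and weighting by C takes this into account.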
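To make the two cases concrete, here is a minimal sketch of the weighted rating, (v / (v + m)) * R + (m / (v + m)) * C, evaluated for v = 10 and v = 1000 with m = 300. The values R = 10.0 and C = 6.0 are illustrative assumptions, not numbers from the dataset.

```python
def weighted_rating(v, R, m, C):
    """Blend a movie's own rating R with the global mean C, weighted by vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

m = 300    # minimum vote count (value used in the notes above)
C = 6.0    # assumed global mean rating, for illustration only
R = 10.0   # assumed perfect average rating for both movies

few = weighted_rating(10, R, m, C)      # only 10 votes
many = weighted_rating(1000, R, m, C)   # 1000 votes

print(round(few, 3))   # 6.129, pulled almost all the way down to C
print(round(many, 3))  # 9.077, stays close to the movie's own rating
```

With only 10 votes the weight on R is 10/310 (about 0.03), so the perfect score barely moves the result away from C.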
In [65]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:1200px !important; }</style>"))
In [51]:
import pandas as pd
import numpy as np
import warnings; warnings.filterwarnings('ignore')
movies = pd.read_csv("./tmdb_5000_movies.csv")
print(movies.shape)
movies.head(5)
(4803, 20)
Out[51]:
 | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 237000000 | [{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {... | http://www.avatarmovie.com/ | 19995 | [{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "sp... | en | Avatar | In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, ... | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289}, {"name": "Twentieth Century Fox Film Corporatio... | [{"iso_3166_1": "US", "name": "United States of America"}, {"iso_3166_1": "GB", "name": "United ... | 2009-12-10 | 2787965087 | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso_639_1": "es", "name": "Espa\u00f1ol"}] | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 |
1 | 300000000 | [{"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 28, "name": "Action"}] | http://disney.go.com/disneypictures/pirates/ | 285 | [{"id": 270, "name": "ocean"}, {"id": 726, "name": "drug abuse"}, {"id": 911, "name": "exotic is... | en | Pirates of the Caribbean: At World's End | Captain Barbossa, long believed to be dead, has come back to life and is headed to the edge of t... | 139.082615 | [{"name": "Walt Disney Pictures", "id": 2}, {"name": "Jerry Bruckheimer Films", "id": 130}, {"na... | [{"iso_3166_1": "US", "name": "United States of America"}] | 2007-05-19 | 961000000 | 169.0 | [{"iso_639_1": "en", "name": "English"}] | Released | At the end of the world, the adventure begins. | Pirates of the Caribbean: At World's End | 6.9 | 4500 |
2 | 245000000 | [{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 80, "name": "Crime"}] | http://www.sonypictures.com/movies/spectre/ | 206647 | [{"id": 470, "name": "spy"}, {"id": 818, "name": "based on novel"}, {"id": 4289, "name": "secret... | en | Spectre | A cryptic message from Bond’s past sends him on a trail to uncover a sinister organization. Whil... | 107.376788 | [{"name": "Columbia Pictures", "id": 5}, {"name": "Danjaq", "id": 10761}, {"name": "B24", "id": ... | [{"iso_3166_1": "GB", "name": "United Kingdom"}, {"iso_3166_1": "US", "name": "United States of ... | 2015-10-26 | 880674609 | 148.0 | [{"iso_639_1": "fr", "name": "Fran\u00e7ais"}, {"iso_639_1": "en", "name": "English"}, {"iso_639... | Released | A Plan No One Escapes | Spectre | 6.3 | 4466 |
3 | 250000000 | [{"id": 28, "name": "Action"}, {"id": 80, "name": "Crime"}, {"id": 18, "name": "Drama"}, {"id": ... | http://www.thedarkknightrises.com/ | 49026 | [{"id": 849, "name": "dc comics"}, {"id": 853, "name": "crime fighter"}, {"id": 949, "name": "te... | en | The Dark Knight Rises | Following the death of District Attorney Harvey Dent, Batman assumes responsibility for Dent's c... | 112.312950 | [{"name": "Legendary Pictures", "id": 923}, {"name": "Warner Bros.", "id": 6194}, {"name": "DC E... | [{"iso_3166_1": "US", "name": "United States of America"}] | 2012-07-16 | 1084939099 | 165.0 | [{"iso_639_1": "en", "name": "English"}] | Released | The Legend Ends | The Dark Knight Rises | 7.6 | 9106 |
4 | 260000000 | [{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 878, "name": "Science Fic... | http://movies.disney.com/john-carter | 49529 | [{"id": 818, "name": "based on novel"}, {"id": 839, "name": "mars"}, {"id": 1456, "name": "medal... | en | John Carter | John Carter is a war-weary, former military captain who's inexplicably transported to the myster... | 43.926995 | [{"name": "Walt Disney Pictures", "id": 2}] | [{"iso_3166_1": "US", "name": "United States of America"}] | 2012-03-07 | 284139100 | 132.0 | [{"iso_639_1": "en", "name": "English"}] | Released | Lost in our world, found in another. | John Carter | 6.1 | 2124 |
In [35]:
# Take an explicit copy so the column assignments below do not act on a view of `movies`
movies_df = movies[['id','title', 'genres', 'vote_average', 'vote_count',
                    'popularity', 'keywords', 'overview']].copy()
In [36]:
pd.set_option('display.max_colwidth', 100)
movies_df[['genres','keywords']][:1]
Out[36]:
 | genres | keywords |
---|---|---|
0 | [{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {... | [{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}, {"id": 3386, "name": "sp... |
In [37]:
movies_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 8 columns):
id 4803 non-null int64
title 4803 non-null object
genres 4803 non-null object
vote_average 4803 non-null float64
vote_count 4803 non-null int64
popularity 4803 non-null float64
keywords 4803 non-null object
overview 4800 non-null object
dtypes: float64(2), int64(2), object(4)
memory usage: 300.3+ KB
First-pass text processing: parse the strings into Python dictionaries, then convert them to lists
In [38]:
from ast import literal_eval
# Parse each string into the Python object it represents (a list of dicts)
movies_df['genres'] = movies_df['genres'].apply(literal_eval)
movies_df['keywords'] = movies_df['keywords'].apply(literal_eval)
In [39]:
movies_df['keywords'].head(1)
Out[39]:
0 [{'id': 1463, 'name': 'culture clash'}, {'id': 2964, 'name': 'future'}, {'id': 3386, 'name': 'sp...
Name: keywords, dtype: object
In [40]:
movies_df['genres'] = movies_df['genres'].apply(lambda x: [y['name'] for y in x])
movies_df['keywords'] = movies_df['keywords'].apply(lambda x : [y['name'] for y in x])
movies_df[['genres','keywords']][:1]
Out[40]:
 | genres | keywords |
---|---|---|
0 | [Action, Adventure, Fantasy, Science Fiction] | [culture clash, future, space war, space colony, society, space travel, futuristic, romance, spa... |
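The two steps above (parse the string, then keep only each `name`) can be tried on a single value. This is a small sketch using a shortened genre string in the same shape as the CSV column. The comparison with `json.loads` is an aside of mine, on the assumption that the column holds valid JSON as the sample rows suggest.

```python
import json
from ast import literal_eval

# A shortened value in the same shape as the 'genres' column
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = literal_eval(raw)           # str -> list of dicts
names = [d['name'] for d in parsed]  # keep only the genre names

print(type(parsed).__name__)  # list
print(names)                  # ['Action', 'Adventure']

# For this sample, json.loads gives the same result
print(json.loads(raw) == parsed)  # True
```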
In [44]:
' '.join(['test1', 'test2'])  # str.join example
Out[44]:
'test1 test2'
In [47]:
from sklearn.feature_extraction.text import CountVectorizer
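# Join each genre list into a whitespace-separated string so CountVectorizer can split it into words.
movies_df['genres_literal'] = movies_df['genres'].apply(lambda x : (' ').join(x))
# min_df=1 keeps every term; recent scikit-learn versions reject the integer min_df=0.
count_vect = CountVectorizer(min_df=1, ngram_range=(1,2))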
genre_mat = count_vect.fit_transform(movies_df['genres_literal'])
print(genre_mat.shape)
(4803, 276)
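To see what `ngram_range=(1,2)` produces, here is a small sketch that vectorizes only the genre strings of the first three movies (Avatar, Pirates of the Caribbean: At World's End, Spectre). The variable names `sample_genres`, `vect` and `mat` are mine, for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Genre strings of the first three movies in the dataset
sample_genres = [
    'Action Adventure Fantasy Science Fiction',
    'Adventure Fantasy Action',
    'Action Adventure Crime',
]

# Same settings as the notebook cell: single words plus adjacent word pairs
vect = CountVectorizer(min_df=1, ngram_range=(1, 2))
mat = vect.fit_transform(sample_genres)

print(mat.shape)  # (3, 12): 6 single words + 6 word pairs
print(sorted(vect.vocabulary_))
```

Note that a two-word genre such as "Science Fiction" is split into `science`, `fiction` and the pair `science fiction`, and that word order matters for pairs (`adventure fantasy` and `fantasy action` are different features).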
Computing pairwise cosine similarity between movies by genre
In [48]:
from sklearn.metrics.pairwise import cosine_similarity
genre_sim = cosine_similarity(genre_mat, genre_mat)
print(genre_sim.shape)
print(genre_sim[:2])
(4803, 4803)
[[1. 0.59628479 0.4472136 ... 0. 0. 0. ]
[0.59628479 1. 0.4 ... 0. 0. 0. ]]
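For anyone else wondering what cosine similarity actually computes: it is the dot product of two vectors divided by the product of their lengths. The sketch below recomputes the first few values printed above by hand for the first three movies. The helper `ngram_counts` is a hypothetical stand-in that only approximates CountVectorizer (it simply splits on whitespace), which happens to be enough for these genre strings.

```python
import math

def ngram_counts(text, n_max=2):
    """Count word 1-grams and 2-grams in a whitespace-separated string."""
    words = text.lower().split()
    counts = {}
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            gram = ' '.join(words[i:i + n])
            counts[gram] = counts.get(gram, 0) + 1
    return counts

def cosine(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|) for sparse count dicts."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

avatar = ngram_counts('Action Adventure Fantasy Science Fiction')
pirates = ngram_counts('Adventure Fantasy Action')
spectre = ngram_counts('Action Adventure Crime')

print(round(cosine(avatar, pirates), 8))   # 0.59628479
print(round(cosine(avatar, spectre), 7))   # 0.4472136
print(round(cosine(pirates, spectre), 8))  # 0.4
```

Avatar and Pirates share four features (`action`, `adventure`, `fantasy`, `adventure fantasy`) out of 9 and 5, so the value is 4 / (3 * sqrt(5)).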
In [49]:
genre_sim_sorted_ind = genre_sim.argsort()[:,::-1]
print(genre_sim_sorted_ind[:1])
[[ 0 3494 813 ... 3038 3037 2401]]
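`argsort()` returns the positions that would sort each row in ascending order, so reversing the columns with `[:, ::-1]` gives "most similar first". A tiny sketch on a made-up 3 x 3 similarity matrix (the numbers are illustrative assumptions):

```python
import numpy as np

# Made-up symmetric similarity matrix for three items
sim = np.array([[1.0, 0.2, 0.7],
                [0.2, 1.0, 0.5],
                [0.7, 0.5, 1.0]])

# Ascending argsort per row, then reverse the column order -> descending
sorted_ind = sim.argsort()[:, ::-1]
print(sorted_ind)
# [[0 2 1]
#  [1 2 0]
#  [2 0 1]]
```

Each row starts with the item itself here because its similarity of 1.0 is the unique maximum. When several movies have identical genres they all tie at 1.0, and the order among ties is not meaningful, so the movie itself is not guaranteed to come first.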
Building a function that returns movies with high genre similarity to a given movie
In [55]:
def find_sim_movie(df, sorted_ind, title_name, top_n = 10):
    # Select the row(s) of df (movies_df) whose 'title' column equals title_name
    title_movie = df[df['title']== title_name]
    # Get the index of the matching row(s) as an ndarray, then take the top_n
    # most similar indexes from sorted_ind (genre_sim_sorted_ind)
    title_index = title_movie.index.values
    similar_indexes = sorted_ind[title_index, :(top_n)]
    # Print the extracted top_n indexes; they come back as a 2-D array,
    # so flatten to 1-D before using them as DataFrame positions
    print(similar_indexes)
    similar_indexes = similar_indexes.reshape(-1)
    return df.iloc[similar_indexes]
In [56]:
similar_movies = find_sim_movie(movies_df, genre_sim_sorted_ind, 'The Godfather',10)
similar_movies[['title', 'vote_average']]
[[2731 1243 3636 1946 2640 4065 1847 4217 883 3866]]
Out[56]:
 | title | vote_average |
---|---|---|
2731 | The Godfather: Part II | 8.3 |
1243 | Mean Streets | 7.2 |
3636 | Light Sleeper | 5.7 |
1946 | The Bad Lieutenant: Port of Call - New Orleans | 6.0 |
2640 | Things to Do in Denver When You're Dead | 6.7 |
4065 | Mi America | 0.0 |
1847 | GoodFellas | 8.2 |
4217 | Kids | 6.8 |
883 | Catch Me If You Can | 7.7 |
3866 | City of God | 8.1 |
Let's re-rank similar content so that well-rated movies come first
In [57]:
movies_df[['title', 'vote_average','vote_count']].sort_values('vote_average', ascending = False)[:10]
Out[57]:
 | title | vote_average | vote_count |
---|---|---|---|
3519 | Stiff Upper Lips | 10.0 | 1 |
4247 | Me You and Five Bucks | 10.0 | 2 |
4045 | Dancer, Texas Pop. 81 | 10.0 | 1 |
4662 | Little Big Top | 10.0 | 1 |
3992 | Sardaarji | 9.5 | 2 |
2386 | One Man's Hero | 9.3 | 2 |
2970 | There Goes My Baby | 8.5 | 2 |
1881 | The Shawshank Redemption | 8.5 | 8205 |
2796 | The Prisoner of Zenda | 8.4 | 11 |
3337 | The Godfather | 8.4 | 5893 |
In [59]:
C = movies_df['vote_average'].mean()
m = movies_df['vote_count'].quantile(0.6)
print(f'C : {round(C,3)}, m: {round(m,3)}')
C : 6.092, m: 370.2
In [61]:
percentile = 0.6
m = movies_df['vote_count'].quantile(percentile)
C = movies_df['vote_average'].mean()
def weighted_vote_average(record):
    v = record['vote_count']
    R = record['vote_average']
    # Weighted rating: the movie's own rating R and the global mean C, mixed by vote count
    return ((v/(v+m))*R) + ((m/(m+v)) *C)
movies_df['weighted_vote'] = movies_df.apply(weighted_vote_average, axis = 1)  # applied row by row
In [62]:
movies_df[['title','vote_average', 'weighted_vote', 'vote_count']].sort_values('weighted_vote', ascending = False)[:10]
Out[62]:
 | title | vote_average | weighted_vote | vote_count |
---|---|---|---|---|
1881 | The Shawshank Redemption | 8.5 | 8.396052 | 8205 |
3337 | The Godfather | 8.4 | 8.263591 | 5893 |
662 | Fight Club | 8.3 | 8.216455 | 9413 |
3232 | Pulp Fiction | 8.3 | 8.207102 | 8428 |
65 | The Dark Knight | 8.2 | 8.136930 | 12002 |
1818 | Schindler's List | 8.3 | 8.126069 | 4329 |
3865 | Whiplash | 8.3 | 8.123248 | 4254 |
809 | Forrest Gump | 8.2 | 8.105954 | 7927 |
2294 | Spirited Away | 8.3 | 8.105867 | 3840 |
2731 | The Godfather: Part II | 8.3 | 8.079586 | 3338 |
In [64]:
def find_sim_movie(df, sorted_ind, title_name, top_n=10):
    title_movie = df[df['title'] ==title_name]
    title_index = title_movie.index.values
    # Take twice top_n candidates with the highest genre similarity
    similar_indexes = sorted_ind[title_index, :(top_n*2)]
    similar_indexes = similar_indexes.reshape(-1)
    # Exclude the reference movie itself
    similar_indexes = similar_indexes[similar_indexes != title_index]
    # From the 2 * top_n candidates, return the top_n with the highest weighted_vote
    return df.iloc[similar_indexes].sort_values('weighted_vote', ascending = False)[:top_n]
similar_movies = find_sim_movie(movies_df, genre_sim_sorted_ind, 'The Godfather', 10)
similar_movies[['title', 'vote_average', 'weighted_vote']]
Out[64]:
 | title | vote_average | weighted_vote |
---|---|---|---|
2731 | The Godfather: Part II | 8.3 | 8.079586 |
1847 | GoodFellas | 8.2 | 7.976937 |
3866 | City of God | 8.1 | 7.759693 |
1663 | Once Upon a Time in America | 8.2 | 7.657811 |
883 | Catch Me If You Can | 7.7 | 7.557097 |
281 | American Gangster | 7.4 | 7.141396 |
4041 | This Is England | 7.4 | 6.739664 |
1149 | American Hustle | 6.8 | 6.717525 |
1243 | Mean Streets | 7.2 | 6.626569 |
2839 | Rounders | 6.9 | 6.530427 |
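The re-ranking idea (take 2 * top_n candidates by similarity, drop the reference movie, then sort those candidates by weighted_vote) can be checked on toy data. Everything in this sketch, including the titles 'A' to 'E' and their scores, is made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E'],
    'weighted_vote': [7.0, 5.0, 8.0, 6.0, 9.0],
})

# Hypothetical similarity ranking for movie 'A' (row 0), most similar first
toy_sorted_ind = np.array([[0, 1, 2, 3, 4]])
title_index = 0
top_n = 2

# Candidate window: the 2 * top_n most similar rows
candidates = toy_sorted_ind[title_index, :top_n * 2]   # [0 1 2 3]
# Drop the reference movie itself
candidates = candidates[candidates != title_index]     # [1 2 3]
# Re-rank the remaining candidates by weighted_vote
result = toy.iloc[candidates].sort_values('weighted_vote', ascending=False)[:top_n]

print(result['title'].tolist())  # ['C', 'D']
```

'E' has the best weighted_vote but never appears, because it falls outside the similarity window. The factor of 2 is the knob that trades similarity against rating: a wider window lets better-rated but less similar movies in.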