Its easy to read csv files in python into a dataframe. Problem i am encountering is, i have a file with many columns, in the program, i don't need all the columns at once, i want to process, specific set of columns for one set of logic and use other set of columns somewhere in the program, similarly i had to write many programs. So, when i looked into the code for review, i saw too many pd.read_csv()
1filename = 'movies_metadata.csv'2all_cols = ['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',3 'imdb_id', 'original_language', 'original_title', 'overview',4 'popularity', 'poster_path', 'production_companies',5 'production_countries', 'release_date', 'revenue', 'runtime',6 'spoken_languages', 'status', 'tagline', 'title', 'video',7 'vote_average', 'vote_count']8
9use_cols = ['imdb_id', 'original_language', 'original_title', 'overview']10
11osfp_file = os.path.join(os.getcwd(), '..', 'data', 'mc_files', filename)12df = pd.read_csv(osfp_file, sep='|', names=all_cols, usecols=use_cols ,skip_blank_lines=True, dtype=object)
Somewhere else,
1filename = 'movies_metadata.csv'2all_cols = ['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',3 'imdb_id', 'original_language', 'original_title', 'overview',4 'popularity', 'poster_path', 'production_companies',5 'production_countries', 'release_date', 'revenue', 'runtime',6 'spoken_languages', 'status', 'tagline', 'title', 'video',7 'vote_average', 'vote_count']8
9use_cols = ['imdb_id', 'tagline', 'title', 'popularity', 'vote_count', 'vote_average']10
11osfp_file = os.path.join(os.getcwd(), '..', 'data', 'mc_files', filename)12df = pd.read_csv(osfp_file, sep='|', names=all_cols, usecols=use_cols ,skip_blank_lines=True, dtype=object)
Like this i have to read many csv files. So, if any input files change, i have to change in many places. So, to cut all these multiple changes, i have created a module(single python file) and added a UDF to read CSV file and pass use_cols as argument.
1# Table : movie_meta2def read_movie_meta(use_cols): 3 filename = 'movie_metadata.csv'4 print('Reading {}'.format(filename))5 all_cols = ['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',6 'imdb_id', 'original_language', 'original_title', 'overview',7 'popularity', 'poster_path', 'production_companies',8 'production_countries', 'release_date', 'revenue', 'runtime',9 'spoken_languages', 'status', 'tagline', 'title', 'video',10 'vote_average', 'vote_count']11
12 #use_cols = ['imdb_id', 'original_language', 'original_title', 'overview']13
14 osfp_file = os.path.join(os.getcwd(), '..', 'data', 'mc_files', filename)15 df = pd.read_csv(osfp_file, sep='|', names=all_cols, usecols=use_cols ,skip_blank_lines=True, dtype=object)16 return df17
18# Get data from movie_meta19use_cols = ['imdb_id', 'original_language', 'original_title', 'overview']20df_mm = read_movie_meta(use_cols)21print('{} = {}'.format('df_mm', df_mm.shape))
Now, my code is small and any change to input file, i have to change in only one place.
This might look basic, but helped a lot.