Skip to content
bobby_dreamer

Process to read file into dataframe

python, pandas1 min read

Its easy to read csv files in python into a dataframe. Problem i am encountering is, i have a file with many columns, in the program, i don't need all the columns at once, i want to process, specific set of columns for one set of logic and use other set of columns somewhere in the program, similarly i had to write many programs. So, when i looked into the code for review, i saw too many pd.read_csv()

1filename = 'movies_metadata.csv'
2all_cols = ['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
3 'imdb_id', 'original_language', 'original_title', 'overview',
4 'popularity', 'poster_path', 'production_companies',
5 'production_countries', 'release_date', 'revenue', 'runtime',
6 'spoken_languages', 'status', 'tagline', 'title', 'video',
7 'vote_average', 'vote_count']
8
9use_cols = ['imdb_id', 'original_language', 'original_title', 'overview']
10
11osfp_file = os.path.join(os.getcwd(), '..', 'data', 'mc_files', filename)
12df = pd.read_csv(osfp_file, sep='|', names=all_cols, usecols=use_cols ,skip_blank_lines=True, dtype=object)

Somewhere else,

1filename = 'movies_metadata.csv'
2all_cols = ['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
3 'imdb_id', 'original_language', 'original_title', 'overview',
4 'popularity', 'poster_path', 'production_companies',
5 'production_countries', 'release_date', 'revenue', 'runtime',
6 'spoken_languages', 'status', 'tagline', 'title', 'video',
7 'vote_average', 'vote_count']
8
9use_cols = ['imdb_id', 'tagline', 'title', 'popularity', 'vote_count', 'vote_average']
10
11osfp_file = os.path.join(os.getcwd(), '..', 'data', 'mc_files', filename)
12df = pd.read_csv(osfp_file, sep='|', names=all_cols, usecols=use_cols ,skip_blank_lines=True, dtype=object)

Like this i have to read many csv files. So, if any input files change, i have to change in many places. So, to cut all these multiple changes, i have created a module(single python file) and added a UDF to read CSV file and pass use_cols as argument.

1# Table : movie_meta
2def read_movie_meta(use_cols):
3 filename = 'movie_metadata.csv'
4 print('Reading {}'.format(filename))
5 all_cols = ['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
6 'imdb_id', 'original_language', 'original_title', 'overview',
7 'popularity', 'poster_path', 'production_companies',
8 'production_countries', 'release_date', 'revenue', 'runtime',
9 'spoken_languages', 'status', 'tagline', 'title', 'video',
10 'vote_average', 'vote_count']
11
12 #use_cols = ['imdb_id', 'original_language', 'original_title', 'overview']
13
14 osfp_file = os.path.join(os.getcwd(), '..', 'data', 'mc_files', filename)
15 df = pd.read_csv(osfp_file, sep='|', names=all_cols, usecols=use_cols ,skip_blank_lines=True, dtype=object)
16 return df
17
18# Get data from movie_meta
19use_cols = ['imdb_id', 'original_language', 'original_title', 'overview']
20df_mm = read_movie_meta(use_cols)
21print('{} = {}'.format('df_mm', df_mm.shape))

Now, my code is small and any change to input file, i have to change in only one place.

This might look basic, but helped a lot.