Process to read file into dataframe

06.07.2020 — python, pandas — 1 min read

Its easy to read csv files in python into a dataframe. Problem i am encountering is, i have a file with many columns, in the program, i don't need all the columns at once, i want to process, specific set of columns for one set of logic and use other set of columns somewhere in the program, similarly i had to write many programs. So, when i looked into the code for review, i saw too many pd.read_csv()

1filename = 'movies_metadata.csv'
2all_cols = ['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
3       'imdb_id', 'original_language', 'original_title', 'overview',
4       'popularity', 'poster_path', 'production_companies',
5       'production_countries', 'release_date', 'revenue', 'runtime',
6       'spoken_languages', 'status', 'tagline', 'title', 'video',
7       'vote_average', 'vote_count']
8
9use_cols = ['imdb_id', 'original_language', 'original_title', 'overview']
10
11osfp_file  = os.path.join(os.getcwd(), '..', 'data', 'mc_files', filename)
12df = pd.read_csv(osfp_file, sep='|', names=all_cols, usecols=use_cols ,skip_blank_lines=True, dtype=object)

Somewhere else,

1filename = 'movies_metadata.csv'
2all_cols = ['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
3       'imdb_id', 'original_language', 'original_title', 'overview',
4       'popularity', 'poster_path', 'production_companies',
5       'production_countries', 'release_date', 'revenue', 'runtime',
6       'spoken_languages', 'status', 'tagline', 'title', 'video',
7       'vote_average', 'vote_count']
8
9use_cols = ['imdb_id', 'tagline', 'title', 'popularity', 'vote_count', 'vote_average']
10
11osfp_file  = os.path.join(os.getcwd(), '..', 'data', 'mc_files', filename)
12df = pd.read_csv(osfp_file, sep='|', names=all_cols, usecols=use_cols ,skip_blank_lines=True, dtype=object)

Like this i have to read many csv files. So, if any input files change, i have to change in many places. So, to cut all these multiple changes, i have created a module(single python file) and added a UDF to read CSV file and pass use_cols as argument.

1# Table : movie_meta
2def read_movie_meta(use_cols):    
3    filename = 'movie_metadata.csv'
4    print('Reading {}'.format(filename))
5    all_cols = ['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
6          'imdb_id', 'original_language', 'original_title', 'overview',
7          'popularity', 'poster_path', 'production_companies',
8          'production_countries', 'release_date', 'revenue', 'runtime',
9          'spoken_languages', 'status', 'tagline', 'title', 'video',
10          'vote_average', 'vote_count']
11
12    #use_cols = ['imdb_id', 'original_language', 'original_title', 'overview']
13
14    osfp_file  = os.path.join(os.getcwd(), '..', 'data', 'mc_files', filename)
15    df = pd.read_csv(osfp_file, sep='|', names=all_cols, usecols=use_cols ,skip_blank_lines=True, dtype=object)
16    return df
17
18# Get data from movie_meta
19use_cols = ['imdb_id', 'original_language', 'original_title', 'overview']
20df_mm = read_movie_meta(use_cols)
21print('{} = {}'.format('df_mm', df_mm.shape))

Now, my code is small and any change to input file, i have to change in only one place.

This might look basic, but helped a lot.