
Python tips

python, pandas · 1 min read

Last updated : 06/Aug/2022

Entries are in reverse chronological order.



Find and delete specific keys from dict

Let's assume the nc dict has keys cur_7, cur_6, cur_5, cur_4, cur_3, cur_2, cur_1 and cur, and we want to keep cur and delete every cur_N key whose numeric suffix is greater than 4.

# Placeholder values; only the keys matter here
nc = {'cur_7': 7, 'cur_6': 6, 'cur_5': 5, 'cur_4': 4,
      'cur_3': 3, 'cur_2': 2, 'cur_1': 1, 'cur': 0}
temp = []
for k in nc:
    if k == 'cur':
        continue
    num = int(k[4:])    # numeric suffix after 'cur_'
    if num > 4:
        temp.append(k)

print(temp)             # ['cur_7', 'cur_6', 'cur_5']
for k in temp:          # delete after iterating; popping mid-loop raises RuntimeError
    nc.pop(k, None)
print(nc)
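
The same filtering can also be done in one pass with a dict comprehension; a minimal sketch using the same nc dict:

nc = {k: v for k, v in nc.items() if k == 'cur' or int(k[4:]) <= 4}
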
Updating dataframe with another dataframe
# Column names must match between df1 and df2 for the update to apply
df1.set_index(['sc_code'], inplace=True)
df1.update(df2.set_index(['sc_code']))
df1.reset_index(inplace=True)
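
A self-contained sketch of the same pattern with made-up data (sc_code and price are placeholder columns):

import pandas as pd

df1 = pd.DataFrame({'sc_code': [500001, 500002], 'price': [10.0, 20.0]})
df2 = pd.DataFrame({'sc_code': [500002], 'price': [25.0]})

df1.set_index(['sc_code'], inplace=True)
df1.update(df2.set_index(['sc_code']))   # overwrites matching cells in place
df1.reset_index(inplace=True)
print(df1)                               # 500002's price is now 25.0
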
Find null on specific columns
# Find rows which have null/NaN in any column
df[df.isnull().any(axis=1)]


cols = ['cur_4', 'cur_3', 'cur_2', 'cur_1', 'cur']

# Keep rows where all of the listed columns are null
df_pivot_avg = df_pivot_avg[df_pivot_avg[cols].isnull().all(axis=1)].copy()

# Keep rows where all of the listed columns are non-null
df_pivot_avg = df_pivot_avg[df_pivot_avg[cols].notnull().all(axis=1)].copy()
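
A quick sanity check on a toy frame to see how any and all differ across the selected columns:

import numpy as np
import pandas as pd

df = pd.DataFrame({'cur_1': [1, np.nan, np.nan],
                   'cur':   [2, 3, np.nan]})
cols = ['cur_1', 'cur']

print(df[df[cols].isnull().all(axis=1)])    # row 2: both columns NaN
print(df[df[cols].notnull().all(axis=1)])   # row 0: no NaN at all
print(df[df[cols].isnull().any(axis=1)])    # rows 1 and 2: at least one NaN
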
Convert specific column types
s = df.select_dtypes(include='object').columns
df[s] = df[s].astype("float")
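
Note that astype("float") raises a ValueError if any value is a non-numeric string; if that is a risk, pd.to_numeric with errors='coerce' turns the bad values into NaN instead. A sketch with toy data:

import pandas as pd

df = pd.DataFrame({'a': ['1.5', '2.0'], 'b': ['x', '3']})   # 'x' is not numeric
s = df.select_dtypes(include='object').columns
df[s] = df[s].apply(pd.to_numeric, errors='coerce')         # 'x' becomes NaN
print(df.dtypes)                                            # both columns float64
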
Convert dictionary entries into variables named after the keys
# Contents of input file walkPath.txt:
#   bucket=bucket-name
#   pattern=*
#   root=./tmp

from pathlib import Path

if Path('walkPath.txt').is_file():   # file exists
    temp = [line.rstrip() for line in open('walkPath.txt')]
    for i in temp:
        key, val = i.split('=')
        exec(key + '=val')           # creates a variable named after key
        print(key, val)

# If it was an actual dictionary
d = {'bucket': 'bucket-name', 'root': './tmp'}
for key, val in d.items():
    exec(key + '=val')
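
Since exec runs arbitrary code, if the names are only needed at module level, updating globals() is a plainer alternative; a sketch:

d = {'bucket': 'bucket-name', 'root': './tmp'}
globals().update(d)    # injects bucket and root as module-level names
print(bucket, root)    # bucket-name ./tmp
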
Getting time differences for comparisons
import datetime
import time

t1 = datetime.datetime.now()
print("Local Time: ", datetime.datetime.now())
time.sleep(5)
t2 = datetime.datetime.now()
print("Local Time: ", t2)
print("Time taken: ", t2 - t1)

print((t2 - t1).total_seconds())
minutes_diff = (t2 - t1).total_seconds() / 60.0
print(minutes_diff)

# Output
# Local Time:  2021-09-29 08:04:43.191698
# Local Time:  2021-09-29 08:04:48.195868
# Time taken:  0:00:05.004170
# 5.00417
# 0.08340283333333334
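
If you only need the elapsed interval and not wall-clock timestamps, time.perf_counter is the usual tool since it is monotonic; a small sketch:

import time

t1 = time.perf_counter()
time.sleep(5)
t2 = time.perf_counter()
print(f"Time taken: {t2 - t1:.3f}s")   # ~5.000s
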
To list installed Python versions
(env) D:\BigData\12. Python\1. Tryouts\Beam>py -0p
Installed Pythons found by py Launcher for Windows
 (venv) D:\BigData\12. Python\1. Tryouts\Beam\env\Scripts\python.exe *
 -3.9-64        python3.9.exe
 -3.7-64        C:\Users\Sushanth\Anaconda3\python.exe
 -3.5-32        C:\Users\Sushanth\AppData\Local\Programs\Python\Python35-32\python.exe
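
The same launcher can pin a specific version when running a script or creating a venv (myscript.py is a placeholder):

py -3.9 myscript.py      # run with Python 3.9
py -3.7 -m venv env37    # create a venv from Python 3.7
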
Virtual Environments
cd your-project
python3 -m venv env

# Activate (Windows); on Linux/macOS use: source env/bin/activate
.\env\Scripts\activate

# Install any packages
pip3 install google-cloud-storage
pip3 install --upgrade google-cloud-storage

pip3 install wheel
pip3 install apache-beam[gcp]

# Listing installed packages
pip3 list
pip3 freeze --all > requirements.txt

# When done
deactivate

# Later, in a new environment, install the packages listed in requirements.txt
pip3 install -r requirements.txt
Drop duplicates, but ignore nulls
# Creating a sample dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame()
names = ['Gandolf', 'Gandolf', 'Gimli', 'Frodo', 'Legolas', 'Bilbo', 'Aragorn', 'Arwen']
types = ['Wizard', 'Wizard', 'Dwarf', 'Hobbit', 'Elf', 'Hobbit', np.nan, np.nan]
magic = [0, 10, 1, 4, 6, 4, np.nan, np.nan]
aggression = [0, 7, 10, 2, 5, 1, np.nan, np.nan]
stealth = [0, 8, 2, 5, 10, 5, np.nan, np.nan]
df['names'], df['type'], df['magic_power'] = names, types, magic
df['aggression'], df['stealth'] = aggression, stealth

df
+----+---------+--------+---------------+--------------+-----------+
|    | names   | type   |   magic_power |   aggression |   stealth |
|----+---------+--------+---------------+--------------+-----------|
|  0 | Gandolf | Wizard |             0 |            0 |         0 |
|  1 | Gandolf | Wizard |            10 |            7 |         8 |
|  2 | Gimli   | Dwarf  |             1 |           10 |         2 |
|  3 | Frodo   | Hobbit |             4 |            2 |         5 |
|  4 | Legolas | Elf    |             6 |            5 |        10 |
|  5 | Bilbo   | Hobbit |             4 |            1 |         5 |
|  6 | Aragorn | nan    |           nan |          nan |       nan |
|  7 | Arwen   | nan    |           nan |          nan |       nan |
+----+---------+--------+---------------+--------------+-----------+

Here we have two sets of genuine duplicates (Wizard and Hobbit), and keep='last' keeps the last occurrence and drops the first. But note that it also treats the two NaN rows as duplicates of each other, so Aragorn gets dropped too.

df1 = df[(~df['type'].duplicated(keep='last'))]
+----+---------+--------+---------------+--------------+-----------+
|    | names   | type   |   magic_power |   aggression |   stealth |
|----+---------+--------+---------------+--------------+-----------|
|  1 | Gandolf | Wizard |            10 |            7 |         8 |
|  2 | Gimli   | Dwarf  |             1 |           10 |         2 |
|  4 | Legolas | Elf    |             6 |            5 |        10 |
|  5 | Bilbo   | Hobbit |             4 |            1 |         5 |
|  7 | Arwen   | nan    |           nan |          nan |       nan |
+----+---------+--------+---------------+--------------+-----------+

The general idea is that NaNs are supposed to be incomparable, since NaN basically means 'no value'; comparing missing values and keeping the 'last' of them makes no sense. So we OR the duplicate mask with isna(), which keeps the types unique while preserving every null row:

df1 = df[(~df['type'].duplicated(keep='last')) | df['type'].isna()]
+----+---------+--------+---------------+--------------+-----------+
|    | names   | type   |   magic_power |   aggression |   stealth |
|----+---------+--------+---------------+--------------+-----------|
|  1 | Gandolf | Wizard |            10 |            7 |         8 |
|  2 | Gimli   | Dwarf  |             1 |           10 |         2 |
|  4 | Legolas | Elf    |             6 |            5 |        10 |
|  5 | Bilbo   | Hobbit |             4 |            1 |         5 |
|  6 | Aragorn | nan    |           nan |          nan |       nan |
|  7 | Arwen   | nan    |           nan |          nan |       nan |
+----+---------+--------+---------------+--------------+-----------+
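
The same pattern extends to duplicates defined over several columns via the subset argument of DataFrame.duplicated; a sketch on the same df:

mask = df.duplicated(subset=['type', 'magic_power'], keep='last')
df1 = df[~mask | df['type'].isna()]
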
Convert two arrays to dictionary
keys = ['a', 'b', 'c']
values = [1, 2, 3]
dictionary = dict(zip(keys, values))
print(dictionary)

# Output (dicts preserve insertion order since Python 3.7)
{'a': 1, 'b': 2, 'c': 3}
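
Going the other way is just as short:

keys = list(dictionary.keys())      # ['a', 'b', 'c']
values = list(dictionary.values())  # [1, 2, 3]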

View specific row by index
df_joined.loc[[28]]    # label-based: the row whose index label is 28
df_joined.iloc[[27]]   # position-based: the 28th row (zero-based position 27)

Read all columns into pandas as object
df = pd.read_csv(osfp_file, sep='|', names=all_cols, usecols=use_cols, skip_blank_lines=True, dtype=object)

Common pandas options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 200)

# To display all rows
pd.set_option('display.max_rows', None)

# Setting a limit to display at most 100 rows
pd.set_option('display.max_rows', 100)

# When there are lots of columns (100+ by default), df.info() won't list them all:
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3984 entries, 0 to 3983
# Columns: 114 entries, name to process
# dtypes: datetime64[ns](10), float64(68), int64(1), object(35)
# memory usage: 2.9+ MB

# When that happens, you can use
df.info(verbose=True, show_counts=True)   # null_counts= in pandas older than 1.2

# The option that sets that default of 100
pd.options.display.max_info_columns

print('Max columns in display setting: ', pd.get_option('max_columns'))

Display all columns of specific datatype

If you follow a column naming convention where the datatype can be identified from the column name itself, you can use the methods below to get those columns. Here we are trying to get the date columns.

Method 1

[col for col in df_companies.columns if 'date' in col.lower()]

Method 2

df_companies['lastraded'].dtypes
# Output:
# dtype('<M8[ns]')

df_companies.dtypes[df_companies.dtypes == '<M8[ns]']

# Output:
# lastraded       datetime64[ns]
# QresultDate     datetime64[ns]
# PLResultDate    datetime64[ns]
# BSResultDate    datetime64[ns]
# RResultDate     datetime64[ns]
# CFResultDate    datetime64[ns]
# YRResultDate    datetime64[ns]
# HiDate          datetime64[ns]
# LoDate          datetime64[ns]
# TVDate          datetime64[ns]
# dtype: object

Method 3

dtCols = list(df_companies.select_dtypes(include=['datetime']).columns)