Last updated : 06/Aug/2022
Events are in reverse chronological order.
Let's assume the dict nc has keys like cur_7, cur_6, cur_5, cur_4, cur_3, cur_2, cur_1, cur,
and we want to keep the key cur (and cur_1 through cur_4)
while deleting every key whose numeric suffix is greater than 4.
```python
nc = {'cur_7': 7, 'cur_6': 6, 'cur_5': 5, 'cur_4': 4,
      'cur_3': 3, 'cur_2': 2, 'cur_1': 1, 'cur': 0}
temp = []
for k in nc:
    if k == 'cur':
        continue
    num = int(k[4:])
    if num > 4:
        temp.append(k)

print(temp)
for k in temp:
    nc.pop(k, None)
print(nc)
```
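The same filtering can also be done in a single pass with a dict comprehension. This is just a sketch with made-up values; the suffix parsing assumes the cur_&lt;n&gt; naming convention above:

```python
# Sample dict with the naming convention cur_1 ... cur_7 plus 'cur'
nc = {f'cur_{i}': i for i in range(1, 8)}
nc['cur'] = 0

# Keep 'cur' and any key whose numeric suffix is 4 or less
nc = {k: v for k, v in nc.items() if k == 'cur' or int(k[4:]) <= 4}
print(nc)
```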
```python
# Column names must be the same in df1 and df2 for the update to work
df1.set_index(['sc_code'], inplace=True)
df1.update(df2.set_index(['sc_code']))
df1.reset_index(inplace=True)
```
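A minimal, self-contained sketch of the pattern above (the sc_code key is from the snippet; the price column and the values are made up):

```python
import pandas as pd

df1 = pd.DataFrame({'sc_code': [1, 2], 'price': [10.0, 20.0]})
df2 = pd.DataFrame({'sc_code': [2], 'price': [25.0]})

# update() aligns on the index, so both frames are indexed by the key first
df1.set_index(['sc_code'], inplace=True)
df1.update(df2.set_index(['sc_code']))
df1.reset_index(inplace=True)
print(df1)
```

Rows present only in df2 are ignored; update() only overwrites matching cells in df1.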
```python
# Find rows which have null/NaN in any column
df[df.isnull().any(axis=1)]

cols = ['cur_4', 'cur_3', 'cur_2', 'cur_1', 'cur']

# Find rows where all of the specified columns are null
df_pivot_avg = df_pivot_avg[df_pivot_avg[cols].isnull().all(axis=1)].copy()

# Find rows where all of the specified columns are non-null
df_pivot_avg = df_pivot_avg[df_pivot_avg[cols].notnull().all(axis=1)].copy()
```
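As a quick sanity check of the any/all distinction, on a toy frame (column names made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, np.nan], 'b': [4, 5, np.nan]})

any_null = df[df.isnull().any(axis=1)]              # NaN in at least one column
all_null = df[df[['a', 'b']].isnull().all(axis=1)]  # NaN in every listed column
print(len(any_null), len(all_null))
```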
```python
s = df.select_dtypes(include='object').columns
df[s] = df[s].astype("float")
```
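astype("float") raises if any object column contains a non-numeric string. When the data may be dirty, pd.to_numeric with errors='coerce' turns bad values into NaN instead (a sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({'x': ['1.5', '2.0', 'bad']}, dtype=object)

s = df.select_dtypes(include='object').columns
# errors='coerce' converts unparseable values to NaN instead of raising
df[s] = df[s].apply(pd.to_numeric, errors='coerce')
print(df.dtypes)
```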
```python
# Contents of input file walkPath.txt:
#   bucket=bucket-name
#   pattern=*
#   root=./tmp

from pathlib import Path

# Py Code
if Path('walkPath.txt').is_file():  # file exists
    temp = [line.rstrip() for line in open('walkPath.txt')]
    for i in temp:
        key, val = i.split('=')
        exec(key + '=val')
        print(key, val)

# If it were an actual dictionary
d = {'bucket': 'bucket-name', 'root': './tmp'}
for key, val in d.items():
    exec(key + '=val')
```
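Running exec() on file contents is risky if the file isn't fully trusted. A safer alternative is to parse the same key=value lines into a plain dict and look values up by key. A sketch, with the walkPath.txt contents inlined via StringIO so it is self-contained:

```python
import io

# Stand-in for open('walkPath.txt')
f = io.StringIO("bucket=bucket-name\npattern=*\nroot=./tmp\n")

config = {}
for line in f:
    key, val = line.rstrip().split('=', 1)  # split on the first '=' only
    config[key] = val

print(config['bucket'], config['root'])
```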
```python
import datetime
import time

t1 = datetime.datetime.now()
print("Local Time: ", t1)
time.sleep(5)
t2 = datetime.datetime.now()
print("Local Time: ", t2)
print("Time taken: ", t2 - t1)

print((t2 - t1).total_seconds())
minutes_diff = (t2 - t1).total_seconds() / 60.0
print(minutes_diff)
```

```
# Output
Local Time:  2021-09-29 08:04:43.191698
Local Time:  2021-09-29 08:04:48.195868
Time taken:  0:00:05.004170
5.00417
0.08340283333333334
```
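For measuring elapsed time specifically, time.perf_counter() is the usual choice: it is monotonic, so it isn't affected by system clock changes the way datetime.now() can be. A sketch with a shorter sleep:

```python
import time

start = time.perf_counter()
time.sleep(0.1)  # stand-in for the work being timed
elapsed = time.perf_counter() - start
print(f"Time taken: {elapsed:.3f} seconds")
```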
```
(env) D:\BigData\12. Python\1. Tryouts\Beam>py -0p
Installed Pythons found by py Launcher for Windows
 (venv) D:\BigData\12. Python\1. Tryouts\Beam\env\Scripts\python.exe *
 -3.9-64        python3.9.exe
 -3.7-64        C:\Users\Sushanth\Anaconda3\python.exe
 -3.5-32        C:\Users\Sushanth\AppData\Local\Programs\Python\Python35-32\python.exe
```
```
cd your-project
python3 -m venv env

.\env\Scripts\activate

# Install any packages
pip3 install google-cloud-storage
pip3 install --upgrade google-cloud-storage

pip3 install wheel
pip3 install apache-beam[gcp]

# Listing installed packages
pip3 list
pip3 freeze --all > requirements.txt

# When done
deactivate

# Later, in a new environment, the command below installs the packages listed in requirements.txt
pip3 install -r requirements.txt
```
```python
import numpy as np
import pandas as pd

# Creating a sample dataframe
df = pd.DataFrame()
names = ['Gandolf', 'Gandolf', 'Gimli', 'Frodo', 'Legolas', 'Bilbo', 'Aragorn', 'Arwen']
types = ['Wizard', 'Wizard', 'Dwarf', 'Hobbit', 'Elf', 'Hobbit', np.nan, np.nan]
magic = [0, 10, 1, 4, 6, 4, np.nan, np.nan]
aggression = [0, 7, 10, 2, 5, 1, np.nan, np.nan]
stealth = [0, 8, 2, 5, 10, 5, np.nan, np.nan]
df['names'], df['type'], df['magic_power'] = names, types, magic
df['aggression'], df['stealth'] = aggression, stealth
```

```
df
+----+---------+--------+---------------+--------------+-----------+
|    | names   | type   |   magic_power |   aggression |   stealth |
|----+---------+--------+---------------+--------------+-----------|
|  0 | Gandolf | Wizard |             0 |            0 |         0 |
|  1 | Gandolf | Wizard |            10 |            7 |         8 |
|  2 | Gimli   | Dwarf  |             1 |           10 |         2 |
|  3 | Frodo   | Hobbit |             4 |            2 |         5 |
|  4 | Legolas | Elf    |             6 |            5 |        10 |
|  5 | Bilbo   | Hobbit |             4 |            1 |         5 |
|  6 | Aragorn | nan    |           nan |          nan |       nan |
|  7 | Arwen   | nan    |           nan |          nan |       nan |
+----+---------+--------+---------------+--------------+-----------+
```
Here the type column has duplicate values, and as per the command below we keep the last occurrence and eliminate the earlier ones. Note that pandas also treats the NaNs as duplicates of each other, which is why Aragorn's row disappears.
```python
df1 = df[~df['type'].duplicated(keep='last')]
```

```
+----+---------+--------+---------------+--------------+-----------+
|    | names   | type   |   magic_power |   aggression |   stealth |
|----+---------+--------+---------------+--------------+-----------|
|  1 | Gandolf | Wizard |            10 |            7 |         8 |
|  2 | Gimli   | Dwarf  |             1 |           10 |         2 |
|  4 | Legolas | Elf    |             6 |            5 |        10 |
|  5 | Bilbo   | Hobbit |             4 |            1 |         5 |
|  7 | Arwen   | nan    |           nan |          nan |       nan |
+----+---------+--------+---------------+--------------+-----------+
```
The general idea is that NaNs are supposed to be incomparable, since NaN basically means 'no value'. Comparing missing values and keeping only the last one does not make sense. So instead we do the following, which maintains uniqueness among the real values while keeping all the nulls.
```python
df1 = df[(~df['type'].duplicated(keep='last')) | df['type'].isna()]
```

```
+----+---------+--------+---------------+--------------+-----------+
|    | names   | type   |   magic_power |   aggression |   stealth |
|----+---------+--------+---------------+--------------+-----------|
|  1 | Gandolf | Wizard |            10 |            7 |         8 |
|  2 | Gimli   | Dwarf  |             1 |           10 |         2 |
|  4 | Legolas | Elf    |             6 |            5 |        10 |
|  5 | Bilbo   | Hobbit |             4 |            1 |         5 |
|  6 | Aragorn | nan    |           nan |          nan |       nan |
|  7 | Arwen   | nan    |           nan |          nan |       nan |
+----+---------+--------+---------------+--------------+-----------+
```
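drop_duplicates has the same NaN behavior as duplicated(): NaNs compare equal to each other, so without the isna() trick only one NaN row survives. A self-contained sketch using a cut-down version of the frame above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'names': ['Gandolf', 'Gandolf', 'Aragorn', 'Arwen'],
                   'type':  ['Wizard',  'Wizard',  np.nan,    np.nan]})

# keep='last' drops the earlier duplicates, including one of the NaN rows
out = df.drop_duplicates(subset='type', keep='last')
print(out)
```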
```python
keys = ['a', 'b', 'c']
values = [1, 2, 3]
dictionary = dict(zip(keys, values))
print(dictionary)

# output (dicts preserve insertion order in Python 3.7+)
{'a': 1, 'b': 2, 'c': 3}
```
```python
df_joined.loc[[28]]   # label-based: the row whose index label is 28
df_joined.iloc[[27]]  # position-based: the 28th row
```
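The distinction is label vs position, which is easiest to see on a frame whose index labels don't match the row positions (a toy example; df_joined above is assumed to have a default RangeIndex):

```python
import pandas as pd

df = pd.DataFrame({'val': [10, 20, 30]}, index=[27, 28, 29])

print(df.loc[[28]])   # label-based: the row whose index label is 28
print(df.iloc[[1]])   # position-based: the second row (same row here)
```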
```python
df = pd.read_csv(osfp_file, sep='|', names=all_cols, usecols=use_cols,
                 skip_blank_lines=True, dtype=object)
```
```python
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 200)

# To display all rows
pd.set_option('display.max_rows', None)

# Setting a limit to display max 100 rows
pd.set_option('display.max_rows', 100)

# When there are lots of columns (by default 100+), df.info() won't show all of them:
#   <class 'pandas.core.frame.DataFrame'>
#   RangeIndex: 3984 entries, 0 to 3983
#   Columns: 114 entries, name to process
#   dtypes: datetime64[ns](10), float64(68), int64(1), object(35)
#   memory usage: 2.9+ MB

# When that happens, you can use
df.info(verbose=True, null_counts=True)  # in pandas >= 1.2 use show_counts=True instead

# The option that sets that default of 100
pd.options.display.max_info_columns

print('Max columns in display setting: ', pd.get_option('max_columns'))
```
If you have a column naming convention where the datatype can be identified from the column name itself, you can use the methods below to get those columns. Here we are trying to get the date columns.
Method 1
```python
[col for col in df_companies.columns if 'date' in col.lower()]
```
Method 2
```python
df_companies['lastraded'].dtypes
# Output:
# dtype('<M8[ns]')

df_companies.dtypes[df_companies.dtypes == '<M8[ns]']
```

```
# Output:
lastraded       datetime64[ns]
QresultDate     datetime64[ns]
PLResultDate    datetime64[ns]
BSResultDate    datetime64[ns]
RResultDate     datetime64[ns]
CFResultDate    datetime64[ns]
YRResultDate    datetime64[ns]
HiDate          datetime64[ns]
LoDate          datetime64[ns]
TVDate          datetime64[ns]
dtype: object
```
Method 3
```python
dtCols = list(df_companies.select_dtypes(include=['datetime']).columns)
```
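A quick, self-contained check of Method 3 on a toy frame (the column names here are made up):

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b'],
                   'lastTraded': pd.to_datetime(['2021-01-01', '2021-02-01']),
                   'price': [1.0, 2.0]})

# select_dtypes picks out only the datetime64 columns
dtCols = list(df.select_dtypes(include=['datetime']).columns)
print(dtCols)
```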