— python, pandas, notes — 6 min read
Null is simply the absence of a value in a variable. You can use null when you cannot specify a default value because any concrete value would carry a meaning of its own.
    >>> def has_no_return():        #<- Defining a function which doesn't return anything
    ...     pass
    ...
    >>> has_no_return()             #<- When called, it doesn't return anything, as expected
    >>> print(has_no_return())      #<- print() actually needs something to print
    None                            #<- The function didn't return anything, so None is printed
    >>> # a hidden value called None
Why is it so important in Python? There are two ways to say a variable is null in Python: None and NaN. This is confusing, and it causes unnecessary issues and breaks things.
None is an object in Python, and its class is NoneType.
    >>> type(None)
    <class 'NoneType'>
None is a keyword, just like True and False, so you cannot declare it as a variable.
None is a singleton. That is, the NoneType class only ever has a single instance, None, in the program. You can create many variables and assign None to them, and all of them will point to the same instance of None.
    >>> id(None)
    1560644480
    >>> a = None
    >>> b = None
    >>> id(a)
    1560644480
    >>> id(b)
    1560644480
When checking whether a value is null or not, use the identity operators (is, is not) rather than the equality operators (==, !=). See Sidetrack 1.
None is falsy, meaning it evaluates to False in a boolean context. You can test whether a value is truthy or falsy as shown below.
    >>> a = 'hi'
    >>> if a:                       #<- 'a' has the value 'hi'
    ...     print(a)
    ... else:
    ...     print('Other than a')
    ...
    hi                              #<- The condition tested True and printed 'hi'

    >>> a = ''
    >>> if a:                       #<- 'a' has a blank value
    ...     print(a)
    ... else:
    ...     print('Other than a')
    ...
    Other than a                    #<- The condition tested False. It should have printed <blank>, right?
                                    #   What happened? Falsy
Truthy and Falsy are in Sidetrack 2
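Because empty strings, 0 and empty containers are falsy too, a plain truth test cannot tell None apart from them. The sketch below illustrates the difference; `describe` is a hypothetical helper introduced only for this example.

```python
def describe(value):
    """Classify a value as missing (None), falsy, or truthy."""
    if value is None:      # identity check: only None matches
        return "missing"
    if not value:          # falsy but not None: '', 0, [], ...
        return "falsy"
    return "truthy"


results = [describe(v) for v in (None, "", 0, "hi")]
print(results)
```

Checking `is None` first keeps a legitimate value such as 0 or '' from being mistaken for a missing one.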
NaN means Not-a-Number.
The IEEE-754 standard defines a NaN as a number with all ones in the exponent, and a non-zero significand. The highest-order bit in the significand specifies whether the NaN is a signaling or quiet one. The remaining bits of the significand form what is referred to as the payload of the NaN.
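The bit layout described above can be inspected from Python with the standard `struct` module. This is only a sketch for 64-bit floats (11 exponent bits, 52 significand bits); `float_bits` is a hypothetical helper name, not part of any library.

```python
import struct


def float_bits(x):
    """Return the (sign, exponent, significand) fields of a 64-bit float."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = raw >> 63
    exponent = (raw >> 52) & 0x7FF         # 11 exponent bits
    significand = raw & ((1 << 52) - 1)    # 52 significand bits
    return sign, exponent, significand


sign, exponent, significand = float_bits(float("nan"))
print(exponent == 0x7FF)     # all ones in the exponent
print(significand != 0)      # non-zero significand
print(significand >> 51)     # highest-order significand bit (quiet/signaling flag)
```

For comparison, infinity has the same all-ones exponent but a zero significand, which is what separates it from a NaN.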
    >>> import numpy as np
    >>> np.nan == np.nan            #<- It is how it is.
    False
To know why it is like that, refer to Sidetrack 3.
There are multiple ways to check whether a value is NaN. The recommendation is: if you are already using pandas, use pandas; if you are using NumPy, use NumPy; if you are using neither, use math. Why? An import takes space: import math is around 2MB, while the other two are > 10MB.
    import pandas as pd
    import numpy as np
    import math

    # For a single variable all three libraries return a single boolean
    x1 = float("nan")

    print("It's pd.isna : {}".format(pd.isna(x1)))
    print("It's pd.isnull : {}".format(pd.isnull(x1)))
    print("It's np.isnan : {}".format(np.isnan(x1)))
    print("It's math.isnan : {}".format(math.isnan(x1)))

    # Output
    # It's pd.isna : True
    # It's pd.isnull : True
    # It's np.isnan : True
    # It's math.isnan : True
    print(math.nan is math.nan)       #<- True
    print(math.nan is np.nan)         #<- False
    print(math.nan is float('nan'))   #<- False
Why? They are all different objects with different IDs, so an identity check cannot be used to detect NaN.
    >>> id(math.nan), id(np.nan), id(float('nan'))
    (32474464, 32473712, 225025248)
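Since neither `is` nor `==` can detect a NaN, a small wrapper around `math.isnan` is one way to check safely. The sketch below is an illustration only; `is_nan` is a hypothetical helper, and the `isinstance` guard is an assumption added so that None and strings do not raise a TypeError.

```python
import math


def is_nan(value):
    """True only for a float NaN; safe for None, strings and other types."""
    return isinstance(value, float) and math.isnan(value)


a = float("nan")
b = float("nan")
print(a is b)                     # two separate objects
print(a == b)                     # NaN never compares equal
print(is_nan(a), is_nan(b))
print(is_nan(None), is_nan("nan"), is_nan(1.0))
```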
NumPy and pandas will silently convert an array or Series to float or object dtype when it contains np.nan or None, unless you handle the missing values yourself.
Here, because of None the array is converted to Object dtype.
    >>> vals1 = np.array([1, None, 3, 4])
    >>> vals1
    array([1, None, 3, 4], dtype=object)
    >>> vals1.sum()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Users\Sushanth\AppData\Local\Programs\Python\Python35-32\lib\site-packages\numpy\core\_methods.py", line 38, in _sum
        return umr_sum(a, axis, dtype, out, keepdims, initial, where)
    TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
    >>>
np.nan makes it a float64. If a number had been there instead of NaN, the array would have had an integer dtype (int32 on this Windows build; typically int64 on other 64-bit platforms).
    >>> vals1 = np.array([1, np.nan, 3, 4])
    >>> vals1
    array([ 1., nan,  3.,  4.])
    >>> type(vals1), vals1.dtype
    (<class 'numpy.ndarray'>, dtype('float64'))
    >>> vals1.sum()
    nan
The sum() function behaved differently in the two cases. With the object dtype it raised a TypeError, and with float64 it returned nan.
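One way to avoid the object dtype is to request a float array explicitly, in which case NumPy stores None as nan. The sketch below assumes the input is a plain list of numbers and None values; the variable names are illustrative.

```python
import numpy as np

raw = [1, None, 3, 4]

# Asking for a float dtype turns None into nan instead of
# falling back to an object array.
vals = np.array(raw, dtype=float)
print(vals.dtype)

# nan-aware reduction ignores the missing entry.
total = np.nansum(vals)
print(total)
```

After the conversion the array behaves like the float64 example above, so the nan-aware functions can be applied to it.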
The example below shows that any arithmetic involving NaN produces NaN.
    >>> 1 + np.nan
    nan
This holds when you are not using pandas. In the example below, pandas sums the column without raising an error or returning nan.
    >>> import pandas as pd
    >>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4]})
    >>> df['A'].sum()
    10
    >>> df = pd.DataFrame({'A': [0, 1, 2, 3, np.nan]})
    >>> df['A'].sum()
    6.0
Why? sum() in pandas has an option skipna (a bool) and its default value is True. So by default, sum() excludes all NA/null values when computing the result. When working in pandas, it is always better to check the documentation to see whether such features or options are available.
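To see the effect of the option, the same Series can be summed with and without skipping nulls. This is a small sketch; the variable names are illustrative.

```python
import math

import numpy as np
import pandas as pd

s = pd.Series([0, 1, 2, 3, np.nan])

default_total = s.sum()              # skipna=True: the NaN is ignored
strict_total = s.sum(skipna=False)   # the NaN propagates into the result

print(default_total)
print(strict_total)
```

Passing skipna=False is useful when a missing value should make the result visibly missing rather than be silently dropped.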
There are only a few ways of handling nulls: ignoring them, detecting them, dropping them, filling them and counting them.
Below are some ways numpy provides to ignore nans and perform simple calculations.
    >>> vals1 = np.array([1, np.nan, 3, 4])
    >>> np.nansum(vals1), np.nanmin(vals1), np.nanmax(vals1)
    (8.0, 1.0, 4.0)
    >>> data = pd.Series([1, np.nan, 3, 4])
    >>> data.isnull()
    0    False
    1     True
    2    False
    3    False
    dtype: bool

    >>> data[data.isnull()]         #<- data.isna() also gives the same result
    1   NaN
    dtype: float64

    >>> data[data.notnull()]
    0    1.0
    2    3.0
    3    4.0
    dtype: float64
    >>>
    >>> data = pd.Series([1, np.nan, 3, 4])
    >>> data.dropna()
    0    1.0                        #<- Here the data in index[1] is dropped
    2    3.0
    3    4.0
    dtype: float64
    >>>
Note that you cannot drop a single value from a DataFrame. In the example below, an entire row is removed. Options are available to drop an entire column as well. Sometimes this type of result may not be desirable.
    >>> df = pd.DataFrame([[1, np.nan, 5],
    ...                    [2, 3, 6],
    ...                    [np.nan, 4, 7]])

    >>> df
         0    1  2
    0  1.0  NaN  5
    1  2.0  3.0  6
    2  NaN  4.0  7

    >>> df.dropna()
         0    1  2
    1  2.0  3.0  6
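When dropping every row that has any null is too aggressive, `dropna` also accepts a `subset` argument that limits the check to chosen columns. The sketch below reuses the same data and picks column 0 purely as an example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 5],
                   [2, 3, 6],
                   [np.nan, 4, 7]])

# Only a null in column 0 causes a row to be dropped;
# the null in column 1 of the first row is tolerated.
kept = df.dropna(subset=[0])
print(kept)
```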
df.dropna() has multiple options:
df.dropna(axis='columns'): drops all columns that have a null value. axis=1 can be used instead of axis='columns'.
df.dropna(axis='rows'): drops all rows that have a null value. axis=0 can be used instead of axis='rows'.
df.dropna(how='any'): (default) drops rows or columns that have any null value. The default axis is rows.
df.dropna(how='all'): drops rows or columns in which all values are null. By default it drops rows (axis=0).

    >>> df = pd.DataFrame([[1, np.nan, 5],
    ...                    [2, 3, 6],
    ...                    [np.nan, 4, 7]])

    >>> df.dropna(axis='columns')
       2
    0  5
    1  6
    2  7

    >>> df.dropna(axis='rows')
         0    1  2
    1  2.0  3.0  6

    >>> df.dropna(how='any')
         0    1  2
    1  2.0  3.0  6

    >>> df = pd.DataFrame([[np.nan, np.nan, np.nan],
    ...                    [np.nan, 3, 6],
    ...                    [np.nan, 4, 7]])
    >>>
    >>> df.dropna(how='all')            #<- It dropped the first row
        0    1    2
    1 NaN  3.0  6.0
    2 NaN  4.0  7.0
    >>>
    >>> df.dropna(how='all', axis=1)    #<- It dropped the first column
         1    2
    0  NaN  NaN
    1  3.0  6.0
    2  4.0  7.0
For more control over which rows or columns are kept, you can specify thresh. thresh=2 means a row or column must have at least 2 non-null values to be kept.
    >>> df = pd.DataFrame([[np.nan, np.nan, np.nan],
    ...                    [np.nan, 3, 6],
    ...                    [np.nan, 4, 7]])

    >>> df.dropna(thresh=2)             #<- Default is axis='rows' or axis=0, so the first row is dropped
        0    1    2
    1 NaN  3.0  6.0
    2 NaN  4.0  7.0
    >>>
    >>> df.dropna(axis=1, thresh=2)
         1    2
    0  NaN  NaN
    1  3.0  6.0
    2  4.0  7.0
    # This option fills all nulls with a predefined value.
    >>> data = pd.Series([1, np.nan, 3, 4])
    >>> data.fillna(0)
    0    1.0
    1    0.0
    2    3.0
    3    4.0
    dtype: float64

    # Note: pandas 2.1+ deprecates fillna(method=...) in favour of .ffill() / .bfill()
    >>> data.fillna(method='bfill')     #<- bfill is backward fill. Data in index[2] is filled in index[1]
    0    1.0
    1    3.0
    2    3.0
    3    4.0
    dtype: float64

    >>> df = pd.DataFrame([[np.nan, 1, np.nan],
    ...                    [2, np.nan, 3],
    ...                    [np.nan, 4, 7]])
    >>>
    >>> df.fillna(method='ffill', axis=1)   # Forward filling at column level (Left -> Right)
         0    1    2                    #<- df[column][row]
    0  NaN  1.0  1.0                    #<- Data in df[1][0] is filled in df[2][0]
    1  2.0  2.0  3.0                    #<- Data in df[0][1] is filled in df[1][1]
    2  NaN  4.0  7.0                    #<- There is nothing to forward fill, as the nan is in df[0][2]

    >>> df.fillna(method='bfill', axis=0)   # Backward filling at row level (Bottom -> Top)
         0    1    2
    0  2.0  1.0  3.0                    #<- Data in df[0][1] & df[2][1] is filled in df[0][0] & df[2][0]
    1  2.0  4.0  3.0                    #<- Data in df[1][2] is filled in df[1][1]
    2  NaN  4.0  7.0                    #<- The null in df[0][2] is left as is.
Above we have seen filling nulls for the full dataframe. It can be filled column-wise as well. Below are some examples
    # pandas fillna()
    df['col1'] = df['col1'].fillna(0)

    # pandas replace() with NumPy's np.nan
    df['col1'] = df['col1'].replace(np.nan, 0)
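Several columns can also be filled in one call by passing a dict that maps column names to fill values. The sketch below uses made-up data; filling col2 with its own mean is just one possible choice, shown here as an assumption rather than a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [1.0, np.nan, 3.0],
                   "col2": [np.nan, 4.0, 8.0]})

# Each key is a column name, each value is what its nulls become.
filled = df.fillna({"col1": 0, "col2": df["col2"].mean()})
print(filled)
```

fillna returns a new DataFrame here, so the original df still contains its nulls.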
    >>> df = pd.DataFrame([[1, 1, np.nan],
    ...                    [np.nan, np.nan, np.nan],
    ...                    [np.nan, 4, 7]])
    >>> df
         0    1    2
    0  1.0  1.0  NaN
    1  NaN  NaN  NaN
    2  NaN  4.0  7.0
    >>> df.isnull().sum(axis=1)         #<- Counts all nulls in columns by row
    0    1
    1    3
    2    1
    dtype: int64

    >>> df.isnull().sum(axis=0)         #<- Counts all the nulls in rows by column
    0    2
    1    1
    2    2
    dtype: int64
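Along the same lines, the share of nulls per column and the overall total can be derived from the boolean mask. This is a small sketch on the same data; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 4, 7]])

mask = df.isnull()

null_share = mask.mean()               # fraction of nulls per column
total_nulls = int(mask.sum().sum())    # nulls in the whole DataFrame

print(null_share)
print(total_nulls)
```

The mean works because True counts as 1 and False as 0 when the mask is averaged.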
The two identity operators available are is and is not.
    >>> a = 'hi'
    >>> b = 'hello'
    >>> id(a)
    21506528                        #<- Variable a's ID
    >>> id(b)
    21100928                        #<- Variable b's ID

    >>> type(a)
    <class 'str'>                   #<- Data type of variable a is string

    >>> id(str)
    1560662608                      #<- ID of str

    >>> id(type(a))
    1560662608                      #<- ID of the data type of variable a. It is the ID of the str class
    >>> id(type(b))
    1560662608                      #<- ID of the data type of variable b. It is the ID of the str class

    >>> type(a) is str
    True                            #<- So obviously, it's going to be True.

    >>> b = 1                       #<- Assigning an integer value to variable b
    >>> id(b)
    1560762640                      #<- Now variable b has a different ID
    >>> type(b)
    <class 'int'>
    >>> type(b) is not str
    True                            #<- Now we definitely know that variable b is not a string
Equality operators: == and != check whether two values are equal (what "equal" means is defined by each object's class).
    >>> a = 'hi'
    >>> print(a)
    hi
    >>> a is None
    False
    >>> a == None
    False
    >>> a != None                   #<- At this point you might think the equality operator is enough to check for None.
    True                            #   It's not recommended
It's not recommended because PEP 8 says so:
"Comparisons to singletons like None should always be done with 'is' or 'is not', never the equality operators."
Check the below example, copied from realpython.com
    >>> class BrokenComparison:
    ...     def __eq__(self, other):
    ...         return True
    ...
    >>> b = BrokenComparison()
    >>> b == None
    True
The equality operators can be fooled when you’re comparing user-defined objects that override them. Here, the equality operator == returns the wrong answer. The identity operator is, on the other hand, can’t be fooled because you can’t override it.
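The same class can be used to see that the identity check gives the right answer. This is a short sketch reusing the BrokenComparison example; the variable names are illustrative.

```python
class BrokenComparison:
    def __eq__(self, other):
        return True    # claims to be equal to everything


b = BrokenComparison()

equal_to_none = (b == None)    # deliberately the wrong check (PEP 8 advises against it)
is_none = (b is None)          # identity cannot be overridden

print(equal_to_none)
print(is_none)
```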
Comparing None to None with == also works, but it is not recommended, so be careful not to rely on it. None by definition is the absence of a value (null). What happens here is that both operands are the same singleton object, with the same id() and the same memory location, so the comparison is True. Python tests an object's identity first, meaning it checks whether the objects are found at the same memory address.
    >>> None == None
    True
    >>> id(None)                    #<- Both None references point to the same id()
    1560644480
Expanding on the above a bit more. Check [Ref1].
    >>> nans = [None for i in range(2)]             #<- Adding None to a list
    >>> list(map(id, nans))                         #<- Printing the id()'s of None
    [1539541888, 1539541888]                        #<- As expected they have the same ID

    >>> nans = [np.nan for i in range(2)]           #<- Adding numpy.nan to a list
    >>> list(map(id, nans))                         #<- Printing the id()'s of np.nan
    [32473712, 32473712]                            #<- As expected they have the same ID

    >>> nans = [float("NaN") for i in range(2)]     #<- Adding float("NaN") to a list. The thing to remember
    >>> list(map(id, nans))                         #<- is that each call to float("NaN") creates a new object.
    [26864592, 201935840]                           #<- Different IDs, meaning they are different objects.
To check if the item is in the list, Python tests for object identity first, and then tests for equality only if the objects are different.
    >>> nans = [None, np.nan, float("NaN")]
    >>> None in nans                #<- Object identity returns True and Python recognises the item in the list.
    True
    >>> np.nan in nans              #<- Object identity returns True and Python recognises the item in the list.
    True
    >>> float("NaN") in nans        #<- False, because these are two different NaN objects, as seen in the map example above
    False
    >>> fnan = float("NaN")         #<- This is obviously True because you are referring to the same item.
    >>> fnan in [fnan]
    True
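Since the `in` operator depends on which NaN object happens to be in the list, an explicit scan with `math.isnan` is a more predictable way to ask whether a list holds any NaN. A minimal sketch follows; `contains_nan` is a hypothetical helper name.

```python
import math


def contains_nan(items):
    """True if any element is a float NaN, regardless of object identity."""
    return any(isinstance(x, float) and math.isnan(x) for x in items)


nans = [None, float("nan")]

found_by_in = float("nan") in nans    # a fresh NaN object, and NaN != NaN
found_by_scan = contains_nan(nans)

print(found_by_in)
print(found_by_scan)
```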
This is always false, we should learn to live with it. Good points here by Stephen Canon.
    >>> np.nan == np.nan
    False
    >>> a = np.array([2, [3], 4])
    >>> a[1] == [3]
    True
    >>> a == [3]
    array([False,  True, False])

    >>> b = np.array([None, [np.nan]])
    >>> b[1] == [np.nan]            #<- Comparing two lists holding the same NaN object; the id()'s are compared.
    True
    >>> b == [np.nan]               #<- Here the comparison is False. NumPy checks the values, and both are different.
    array([False, False])
    >>> lst = [1, 2, 3]
    >>> id(lst)
    194154944
    >>> lst == lst[:]
    True                            # <- This is True since the lists are "equivalent"
    >>> lst is lst[:]
    False                           # <- This is False since they're actually different objects
    >>> id(lst[:])
    194156064
    >>>
When you compare values, there can be only two results, True or False, which are booleans. As of now I don't think there is a programming language that supports Not-a-Boolean (NaB). Expressions usually evaluate to one of these two values.
We can test expressions like below without operators,
    a = 10
    if a:
        print(a)
    else:
        print('i hope variable has a value initialized')

    # Output
    # 10

    a = 0
    if a:
        print(a)
    else:
        print('i hope variable has a value initialized')

    # Output
    # i hope variable has a value initialized
What happened in the second example is because of the concept of truthy and falsy values.
Below are falsy values
[]
()
{}
set()
"", ''
0, 0.0
False
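The falsy values listed above, together with None mentioned earlier, can be checked with bool(). This is a quick sketch; the variable names are illustrative.

```python
falsy_values = [[], (), {}, set(), "", 0, 0.0, False, None]

# bool() applies the same truth test that an if statement uses.
flags = [bool(v) for v in falsy_values]
print(flags)

# Non-empty containers and strings are truthy, whatever they hold.
print(bool([0]), bool("0"))
```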
Below are truthy values
Simple example in using Truthy
    name = "sushanth"
    if len(name) > 0:
        print('Hello {}'.format(name))
    else:
        print('Wassap')

    name = "sushanth"
    if name:
        print('Hello {}'.format(name))
    else:
        print('Wassap')
This is from "Reflexivity, and other pillars of civilization". This is a good read.
Equality is reflexive (every value is equal to itself, at any longitude and temperature, no excuses and no exceptions); and the purpose of assignment is to make the value of the target equal to the value of the source.
754 enters the picture
Now assume that the value of x is a NaN. If you use a programming language that supports IEEE 754 (as they all do, I think, today) the test in
if x = x then …
is supposed to yield False. Yes, that is specified in the standard: NaN is never equal to NaN (even with the same payload); nor is it equal to anything else; the result of an equality comparison involving NaN will always be False.
I am by no means a numerics expert; I know that IEEE 754 was a tremendous advance, and that it was designed by some of the best minds in the field, headed by Velvel Kahan who received a Turing Award in part for that success.
Why the result is False ? The conclusion is not that the result should be False. The rational conclusion is that True and False are both unsatisfactory solutions. The reason is very simple: in a proper theory (I will sketch it below) the result of such a comparison should be some special undefined below; the same way that IEEE 754 extends the set of floating-point numbers with NaN, a satisfactory solution would extend the set of booleans with some NaB (Not a Boolean). But there is no NaB, probably because no one (understandably) wanted to bother, and also because being able to represent a value of type BOOLEAN on a single bit is, if not itself a pillar of civilization, one of the secrets of a happy life.
If both True and False are unsatisfactory solutions, we should use the one that is the “least” bad, according to some convincing criterion . That is not the attitude that 754 takes; it seems to consider (as illustrated by the justification cited above) that False is somehow less committing than True. But it is not! Stating that something is false is just as much of a commitment as stating that it is true. False is no closer to NaB than True is. A better criterion is: which of the two possibilities is going to be least damaging to time-honored assumptions embedded in mathematics? One of these assumptions is the reflexivity of equality: come rain or shine, x is equal to itself. Its counterpart for programming is that after an assignment the target will be equal to the original value of the source. This applies to numbers, and it applies to a NaN as well.
Note that this argument does not address equality between different NaNs. The standard as it is states that a specific NaN, with a specific payload, is not equal to itself.
And this is where I stopped and decided not to go further into this subject.