Pandas dataframe is a 2D data structure which looks like a table. There are different ways to create a dataframe, mostly everybody sticks to few formats they like and stick to it, thats good. But, sometimes when you have a problem and you want somebody’s opinion or get a suggestion or solution, you cannot give all your code and ask them to execute and start explaining the problem. Its a lot easier if you could generate a small sample data and explain it and from suggestion/solution you could copy the concept to your main program.
Whenever you ask a question in StackOverflow or any forum and when preparing to ask the question itself, you would know that it would be a lot easier if you provided a sample data and say this the problem you are facing and this what i have tried, so far ? In python pandas world, you need to build small dataframes to build those examples.
Here are few ways of creating new dataframes and generating data.
Empty dataframe
Tuple technique
Dictionary with value lists
Array of Arrays / Lists of List
Array of dictionary
pd.Series to dataframe
8x4 dataframe with some random data
Single column list of alphabets
Dictionary to Dataframe
Raw text to dataframe
Dummy data : Generating a very big table
Empty dataframe
# Starting from empty dataframe
df = pd.DataFrame()
names = [ 'Gandolf' , 'Gimli' , 'Frodo' , 'Legolas' , 'Bilbo' ]
types = [ 'Wizard' , 'Dwarf' , 'Hobbit' , 'Elf' , 'Hobbit' ]
magic = [ 10 , 1 , 4 , 6 , 4 ]
aggression = [ 7 , 10 , 2 , 5 , 1 ]
stealth = [ 8 , 2 , 5 , 10 , 5 ]
df[ 'names' ], df[ 'type' ], df[ 'magic_power' ] = names, types, magic
df[ 'aggression' ], df[ 'stealth' ] = aggression, stealth
Output :
+ ---- + -------- -+ -------- + -------------- -+ -------------- + ---------- -+
| | names | type | magic_power | aggression | stealth |
| ---- + -------- -+ -------- + -------------- -+ -------------- + ---------- -|
| 0 | Gandolf | Wizard | 10 | 7 | 8 |
| 1 | Gimli | Dwarf | 1 | 10 | 2 |
| 2 | Frodo | Hobbit | 4 | 2 | 5 |
| 3 | Legolas | Elf | 6 | 5 | 10 |
| 4 | Bilbo | Hobbit | 4 | 1 | 5 |
+ ---- + -------- -+ -------- + -------------- -+ -------------- + ---------- -+
Empty Dataframe with columns
df_mv = pd.DataFrame( columns = [ 'movie_id' , 'name' , 'short_description' , 'plot' , 'rating' ])
Output :
+ ------------ + -------- + -------------------- -+ -------- + ---------- +
| movie_id | name | short_description | plot | rating |
| ------------ + -------- + -------------------- -+ -------- + ---------- |
+ ------------ + -------- + -------------------- -+ -------- + ---------- +
# When you add index, a null row will be added(meaning all columns will have null values)
df_mv = pd.DataFrame( columns = [ 'movie_id' , 'name' , 'short_description' , 'plot' , 'rating' ], index = [ 0 ])
Output :
+ ---- + ------------ + -------- + -------------------- -+ -------- + ---------- +
| | movie_id | name | short_description | plot | rating |
| ---- + ------------ + -------- + -------------------- -+ -------- + ---------- |
| 0 | nan | nan | nan | nan | nan |
+ ---- + ------------ + -------- + -------------------- -+ -------- + ---------- +
# Adding values
df_mv.loc[ 0 ] = [ 1 , 'Abyss' , 'Higher form of intelligent aliens' , 'aliens under the ocean' , 9 ]
Output :
+ ---- + ------------ + -------- + ---------------------------------- -+ ------------------------ + ---------- +
| | movie_id | name | short_description | plot | rating |
| ---- + ------------ + -------- + ---------------------------------- -+ ------------------------ + ---------- |
| 0 | 1 | Abyss | Higher form of intelligent aliens | aliens under the ocean | 9 |
+ ---- + ------------ + -------- + ---------------------------------- -+ ------------------------ + ---------- +
Tuple technique
Here index is the first
tuple variable
first = ( 'Mike' , 'Dorothee' , 'Tom' , 'Bill' , 'Pete' , 'Kate' )
last = ( 'Meyer' , 'Maier' , 'Meyer' , 'Mayer' , 'Meyr' , 'Mair' )
job = ( 'data analyst' , 'programmer' , 'computer scientist' ,
'data scientist' , 'programmer' , 'psychiatrist' )
language = ( 'Python' , 'Perl' , 'Java' , 'Pithon' , 'Pythen' , 'Brainfuck' )
data = list ( zip (last, job, language)
Output :
[( 'Meyer' , 'data analyst' , 'Python' ),
( 'Maier' , 'Web Developer' , 'NodeJS' ),
( 'Meyer' , 'Enterprise Engineer' , 'Java' ),
( 'Mayer' , 'Mainframe Developer' , 'COBOL' ),
( 'Meyr' , 'Analyst' , 'SAS' ),
( 'Mair' , 'psychiatrist' , 'Brain Mechainic' )]
df = pd.DataFrame(data, columns = [ 'last' , 'job' , 'language' ], index = first)
Output :
+ ---------- + -------- + -------------------- -+ ---------------- -+
| | last | job | language |
| ---------- + -------- + -------------------- -+ ---------------- -|
| Mike | Meyer | data analyst | Python |
| Dorothee | Maier | Web Developer | NodeJS |
| Tom | Meyer | Enterprise Engineer | Java |
| Bill | Mayer | Mainframe Developer | COBOL |
| Pete | Meyr | Analyst | SAS |
| Kate | Mair | psychiatrist | Brain Mechainic |
+ ---------- + -------- + -------------------- -+ ---------------- -+
Note : Instead of a tuple , if it was a list , it would work in the same way.
Dictionary with value lists
df = pd.DataFrame({ 'A' : [ 0 , 1 , 2 , 3 , 4 ],
'B' : [ 'foo' , 'bar' , 'bloo' , 'blee' , 'bloo' ],
'C' : [ 'green' , 'red' , 'blue' , 'yellow' , 'green' ]})
Output :
+ ---- -+ ------ + -------- +
| A | B | C |
| ---- -+ ------ + -------- |
| 0 | foo | green |
| 1 | bar | red |
| 2 | bloo | blue |
| 3 | blee | yellow |
| 4 | bloo | green |
+ ---- -+ ------ + -------- +
Array of Arrays / Lists of List
data = [[ 'tom' , 10 ], [ 'nick' , 15 ], [ 'juli' , 14 ]]
df = pd.DataFrame(data, columns = [ 'Name' , 'Age' ])
Output :
+ -------- + ------ -+
| Name | Age |
| -------- + ------ -|
| tom | 10 |
| nick | 15 |
| juli | 14 |
+ -------- + ------ -+
Array of dictionary
# Third item in array doesn't have key 'a'
data = [{ 'a' : 1 , 'b' : 2 , 'c' : 3 }, { 'a' : 10 , 'b' : 20 , 'c' : 30 }
, { 'b' : 2 , 'c' : 3 }]
df = pd.DataFrame(data)
Output :
+ ---- -+ ---- -+ ---- -+
| a | b | c |
| ---- -+ ---- -+ ---- -|
| 1 | 2 | 3 |
| 10 | 20 | 30 |
| nan | 2 | 3 |
+ ---- -+ ---- -+ ---- -+
# 'a1' completely happens to be a new column
data = [{ 'a' : 1 , 'b' : 2 , 'c' : 3 }, { 'a' : 10 , 'b' : 20 , 'c' : 30 }
, { 'b' : 2 , 'c' : 3 }]
df = pd.DataFrame(data, columns = [ 'a' , 'a1' , 'b' , 'c' ])
Output :
+ ---- -+ ------ + ---- -+ ---- -+
| a | a1 | b | c |
| ---- -+ ------ + ---- -+ ---- -|
| 1 | nan | 2 | 3 |
| 10 | nan | 20 | 30 |
| nan | nan | 2 | 3 |
+ ---- -+ ------ + ---- -+ ---- -+
pd.Series to dataframe
d = { 'one' : pd.Series([ 10 , 20 , 30 , 40 ], index = [ 'a' , 'b' , 'c' , 'd' ]),
'two' : pd.Series([ 10 , 20 , 30 , 40 ], index = [ 'd' , 'c' , 'b' , 'a' ])}
df = pd.DataFrame(d)
Output :
+ ------ -+ ------ -+
| one | two |
| ------ -+ ------ -|
| 10 | 40 |
| 20 | 30 |
| 30 | 20 |
| 40 | 10 |
+ ------ -+ ------ -+
8x4 dataframe with some random data
# Generates date
dates = pd.date_range( '1/1/2000' , periods = 8 )
df = pd.DataFrame(np.random.randn( 8 , 4 ), index = dates, columns = [ 'A' , 'B' , 'C' , 'D' ])
Output :
+ -------------------- -+ ------------ + ---------- -+ ---------- -+ ------------ +
| | A | B | C | D |
| -------------------- -+ ------------ + ---------- -+ ---------- -+ ------------ |
| 2000 - 0 1 - 0 1 00 : 00 : 00 | - 1.05128 | 0.238627 | 0.106378 | 1.10883 |
| 2000 - 0 1 - 0 2 00 : 00 : 00 | 0.314559 | 1.8427 | 0.282821 | - 1.47774 |
| 2000 - 0 1 - 0 3 00 : 00 : 00 | - 0.638806 | 0.139111 | 1.08201 | 0.35632 |
| 2000 - 0 1 - 0 4 00 : 00 : 00 | 1.44253 | - 0.113561 | 0.829517 | 0.581576 |
| 2000 - 0 1 - 0 5 00 : 00 : 00 | - 1.40117 | - 2.12707 | 0.127889 | 0.721586 |
| 2000 - 0 1 - 0 6 00 : 00 : 00 | 0.724627 | 0.186909 | - 0.56129 | - 0.528885 |
| 2000 - 0 1 - 0 7 00 : 00 : 00 | - 0.0975346 | - 0.343885 | - 0.589784 | 0.0576514 |
| 2000 - 0 1 - 0 8 00 : 00 : 00 | - 1.56196 | 0.343319 | - 0.172569 | - 0.232157 |
+ -------------------- -+ ------------ + ---------- -+ ---------- -+ ------------ +
Single column list of alphabets
import string
df = pd.DataFrame({ 'alpha' : list (string.ascii_lowercase)})
Output :
+ ---- + -------- -+
| | alpha |
| ---- + -------- -|
| 0 | a |
| 1 | b |
| 2 | c |
| 3 | d |
| 4 | e |
| 5 | f |
| 6 | g |
| 7 | h |
| 8 | i |
| 9 | j |
| 10 | k |
| 11 | l |
| 12 | m |
| 13 | n |
| 14 | o |
| 15 | p |
| 16 | q |
| 17 | r |
| 18 | s |
| 19 | t |
| 20 | u |
| 21 | v |
| 22 | w |
| 23 | x |
| 24 | y |
| 25 | z |
+ ---- + -------- -+
dictionary to Dataframe
data = { 'row_1' : [ 3 , 2 , 1 , 0 ], 'row_2' : [ 'a' , 'b' , 'c' , 'd' ]}
df = pd.DataFrame.from_dict(data, orient = 'index' ,
columns = [ 'A' , 'B' , 'C' , 'D' ])
Output :
+ ------ -+ ---- -+ ---- -+ ---- -+ ---- -+
| | A | B | C | D |
| ------ -+ ---- -+ ---- -+ ---- -+ ---- -|
| row_2 | a | b | c | d |
| row_1 | 3 | 2 | 1 | 0 |
+ ------ -+ ---- -+ ---- -+ ---- -+ ---- -+
Here we create dataframe from a complex dictionary
data = { "date" : "2018-01-02" , "data" :{ "AAPL" :{ "open" : "170.16" , "close" : "172.26" , "high" : "172.30" , "low" : "169.26" , "volume" : "25555934" }
, "MSFT" :{ "open" : "86.13" , "close" : "85.95" , "high" : "86.31" , "low" : "85.50" , "volume" : "22483797" }
}
}
df = pd.DataFrame.from_dict(data)
Output :
+ ------ + ------------------------------------------------------------------------------------------------ + ------------ +
| | data | date |
| ------ + ------------------------------------------------------------------------------------------------ + ------------ |
| AAPL | { 'high' : '172.30' , 'volume' : '25555934' , 'close' : '172.26' , 'low' : '169.26' , 'open' : '170.16' } | 2018 - 0 1 - 0 2 |
| MSFT | { 'high' : '86.31' , 'volume' : '22483797' , 'close' : '85.95' , 'low' : '85.50' , 'open' : '86.13' } | 2018 - 0 1 - 0 2 |
+ ------ + ------------------------------------------------------------------------------------------------ + ------------ +
df = pd.concat([df[ 'date' ],df[ 'data' ].apply(pd.Series)], axis = 1 )
Output :
+ ------ + ------------ + -------- -+ -------- + -------- + -------- + ---------- +
| | date | close | high | low | open | volume |
| ------ + ------------ + -------- -+ -------- + -------- + -------- + ---------- |
| AAPL | 2018 - 0 1 - 0 2 | 172.26 | 172.3 | 169.26 | 170.16 | 25555934 |
| MSFT | 2018 - 0 1 - 0 2 | 85.95 | 86.31 | 85.5 | 86.13 | 22483797 |
+ ------ + ------------ + -------- -+ -------- + -------- + -------- + ---------- +
Raw text to dataframe
import sys
if sys.version_info[ 0 ] < 3 :
from StringIO import StringIO
else :
from io import StringIO
import pandas as pd
DF1 = StringIO( """
Date Fruit Num Color
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
""" )
DF2 = StringIO( """
Date Fruit Num Color
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25 Apple 22.1 Red
2013-11-25 Orange 8.6 Orange
""" )
df1 = pd.read_csv( DF1 , sep = '\s+' )
df2 = pd.read_csv( DF2 , sep = '\s+' )
dfs_dictionary = { 'DF1' :df1, 'DF2' :df2}
df = pd.concat(dfs_dictionary)
print (tabulate(df, headers = 'keys' , tablefmt = 'psql' , showindex = True ))
# Output
+ ------------ + ------------ + -------- -+ ------ -+ -------- -+
| | Date | Fruit | Num | Color |
| ------------ + ------------ + -------- -+ ------ -+ -------- -|
| ( 'DF1' , 0 ) | 2013 - 11 - 24 | Banana | 22.1 | Yellow |
| ( 'DF1' , 1 ) | 2013 - 11 - 24 | Orange | 8.6 | Orange |
| ( 'DF1' , 2 ) | 2013 - 11 - 24 | Apple | 7.6 | Green |
| ( 'DF1' , 3 ) | 2013 - 11 - 24 | Celery | 10.2 | Green |
| ( 'DF2' , 0 ) | 2013 - 11 - 24 | Banana | 22.1 | Yellow |
| ( 'DF2' , 1 ) | 2013 - 11 - 24 | Orange | 8.6 | Orange |
| ( 'DF2' , 2 ) | 2013 - 11 - 24 | Apple | 7.6 | Green |
| ( 'DF2' , 3 ) | 2013 - 11 - 24 | Celery | 10.2 | Green |
| ( 'DF2' , 4 ) | 2013 - 11 - 25 | Apple | 22.1 | Red |
| ( 'DF2' , 5 ) | 2013 - 11 - 25 | Orange | 8.6 | Orange |
+ ------------ + ------------ + -------- -+ ------ -+ -------- -+
Dummy data : Generating a very big table
Not sure how or when this would be useful, but you could do this as well.
import numpy as np
# Add more zeros, you can hit memory limit
df = pd.DataFrame(np.random.randn( 6000 , 5000 ))
df.shape
# Output
( 6000 , 5000 )
df.info()
# Output
<class 'pandas.core.frame.DataFrame' >
RangeIndex: 6000 entries, 0 to 5999
Columns: 5000 entries, 0 to 4999
dtypes: float64( 5000 )
memory usage: 228.9 MB