Pandas dataframe is a 2D data structure which looks like a table. There are different ways to create a dataframe, mostly everybody sticks to few formats they like and stick to it, thats good. But, sometimes when you have a problem and you want somebody’s opinion or get a suggestion or solution, you cannot give all your code and ask them to execute and start explaining the problem. Its a lot easier if you could generate a small sample data and explain it and from suggestion/solution you could copy the concept to your main program.
Whenever you ask a question in StackOverflow or any forum and when preparing to ask the question itself, you would know that it would be a lot easier if you provided a sample data and say this the problem you are facing and this what i have tried, so far ? In python pandas world, you need to build small dataframes to build those examples.
Here are few ways of creating new dataframes and generating data.
- Empty dataframe
- Tuple technique
- Dictionary with value lists
- Array of Arrays / Lists of List
- Array of dictionary
- pd.Series to dataframe
- 8x4 dataframe with some random data
- Single column list of alphabets
- Dictionary to Dataframe
- Raw text to dataframe
- Dummy data : Generating a very big table
Empty dataframe
# Starting from empty dataframe
df = pd.DataFrame()
names = ['Gandolf','Gimli','Frodo','Legolas','Bilbo']
types = ['Wizard','Dwarf','Hobbit','Elf','Hobbit']
magic = [10, 1, 4, 6, 4]
aggression = [7, 10, 2, 5, 1]
stealth = [8, 2, 5, 10, 5]
df['names'], df['type'], df['magic_power'] = names, types, magic
df['aggression'], df['stealth'] = aggression, stealth
Output :
+----+---------+--------+---------------+--------------+-----------+
| | names | type | magic_power | aggression | stealth |
|----+---------+--------+---------------+--------------+-----------|
| 0 | Gandolf | Wizard | 10 | 7 | 8 |
| 1 | Gimli | Dwarf | 1 | 10 | 2 |
| 2 | Frodo | Hobbit | 4 | 2 | 5 |
| 3 | Legolas | Elf | 6 | 5 | 10 |
| 4 | Bilbo | Hobbit | 4 | 1 | 5 |
+----+---------+--------+---------------+--------------+-----------+
Empty Dataframe with columns
df_mv = pd.DataFrame(columns = ['movie_id', 'name', 'short_description', 'plot', 'rating'])
Output :
+------------+--------+---------------------+--------+----------+
| movie_id | name | short_description | plot | rating |
|------------+--------+---------------------+--------+----------|
+------------+--------+---------------------+--------+----------+
# When you add index, a null row will be added(meaning all columns will have null values)
df_mv = pd.DataFrame(columns = ['movie_id', 'name', 'short_description', 'plot', 'rating'], index=[0])
Output :
+----+------------+--------+---------------------+--------+----------+
| | movie_id | name | short_description | plot | rating |
|----+------------+--------+---------------------+--------+----------|
| 0 | nan | nan | nan | nan | nan |
+----+------------+--------+---------------------+--------+----------+
# Adding values
df_mv.loc[0] = [1, 'Abyss', 'Higher form of intelligent aliens', 'aliens under the ocean', 9]
Output :
+----+------------+--------+-----------------------------------+------------------------+----------+
| | movie_id | name | short_description | plot | rating |
|----+------------+--------+-----------------------------------+------------------------+----------|
| 0 | 1 | Abyss | Higher form of intelligent aliens | aliens under the ocean | 9 |
+----+------------+--------+-----------------------------------+------------------------+----------+
Tuple technique
Here index is the first
tuple variable
first = ('Mike', 'Dorothee', 'Tom', 'Bill', 'Pete', 'Kate')
last = ('Meyer', 'Maier', 'Meyer', 'Mayer', 'Meyr', 'Mair')
job = ('data analyst', 'programmer', 'computer scientist',
'data scientist', 'programmer', 'psychiatrist')
language= ('Python', 'Perl', 'Java', 'Pithon', 'Pythen', 'Brainfuck')
data = list(zip(last, job, language)
Output :
[('Meyer', 'data analyst', 'Python'),
('Maier', 'Web Developer', 'NodeJS'),
('Meyer', 'Enterprise Engineer', 'Java'),
('Mayer', 'Mainframe Developer', 'COBOL'),
('Meyr', 'Analyst', 'SAS'),
('Mair', 'psychiatrist', 'Brain Mechainic')]
df = pd.DataFrame(data, columns =['last', 'job', 'language'], index=first)
Output :
+----------+--------+---------------------+-----------------+
| | last | job | language |
|----------+--------+---------------------+-----------------|
| Mike | Meyer | data analyst | Python |
| Dorothee | Maier | Web Developer | NodeJS |
| Tom | Meyer | Enterprise Engineer | Java |
| Bill | Mayer | Mainframe Developer | COBOL |
| Pete | Meyr | Analyst | SAS |
| Kate | Mair | psychiatrist | Brain Mechainic |
+----------+--------+---------------------+-----------------+
Note : Instead of a tuple, if it was a list, it would work in the same way.
Dictionary with value lists
df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
'B': ['foo', 'bar', 'bloo', 'blee', 'bloo'],
'C': ['green', 'red', 'blue', 'yellow', 'green']})
Output :
+-----+------+--------+
| A | B | C |
|-----+------+--------|
| 0 | foo | green |
| 1 | bar | red |
| 2 | bloo | blue |
| 3 | blee | yellow |
| 4 | bloo | green |
+-----+------+--------+
Array of Arrays / Lists of List
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])
Output :
+--------+-------+
| Name | Age |
|--------+-------|
| tom | 10 |
| nick | 15 |
| juli | 14 |
+--------+-------+
Array of dictionary
# Third item in array doesn't have key 'a'
data = [{'a': 1, 'b': 2, 'c':3}, {'a':10, 'b': 20, 'c': 30}
, {'b': 2, 'c':3}]
df = pd.DataFrame(data)
Output :
+-----+-----+-----+
| a | b | c |
|-----+-----+-----|
| 1 | 2 | 3 |
| 10 | 20 | 30 |
| nan | 2 | 3 |
+-----+-----+-----+
# 'a1' completely happens to be a new column
data = [{'a': 1, 'b': 2, 'c':3}, {'a':10, 'b': 20, 'c': 30}
, {'b': 2, 'c':3}]
df = pd.DataFrame(data, columns=['a', 'a1', 'b', 'c'])
Output :
+-----+------+-----+-----+
| a | a1 | b | c |
|-----+------+-----+-----|
| 1 | nan | 2 | 3 |
| 10 | nan | 20 | 30 |
| nan | nan | 2 | 3 |
+-----+------+-----+-----+
pd.Series to dataframe
d = {'one' : pd.Series([10, 20, 30, 40], index =['a', 'b', 'c', 'd']),
'two' : pd.Series([10, 20, 30, 40], index =['d', 'c', 'b', 'a'])}
df = pd.DataFrame(d)
Output :
+-------+-------+
| one | two |
|-------+-------|
| 10 | 40 |
| 20 | 30 |
| 30 | 20 |
| 40 | 10 |
+-------+-------+
8x4 dataframe with some random data
# Generates date
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
Output :
+---------------------+------------+-----------+-----------+------------+
| | A | B | C | D |
|---------------------+------------+-----------+-----------+------------|
| 2000-01-01 00:00:00 | -1.05128 | 0.238627 | 0.106378 | 1.10883 |
| 2000-01-02 00:00:00 | 0.314559 | 1.8427 | 0.282821 | -1.47774 |
| 2000-01-03 00:00:00 | -0.638806 | 0.139111 | 1.08201 | 0.35632 |
| 2000-01-04 00:00:00 | 1.44253 | -0.113561 | 0.829517 | 0.581576 |
| 2000-01-05 00:00:00 | -1.40117 | -2.12707 | 0.127889 | 0.721586 |
| 2000-01-06 00:00:00 | 0.724627 | 0.186909 | -0.56129 | -0.528885 |
| 2000-01-07 00:00:00 | -0.0975346 | -0.343885 | -0.589784 | 0.0576514 |
| 2000-01-08 00:00:00 | -1.56196 | 0.343319 | -0.172569 | -0.232157 |
+---------------------+------------+-----------+-----------+------------+
Single column list of alphabets
import string
df = pd.DataFrame({'alpha': list(string.ascii_lowercase)})
Output :
+----+---------+
| | alpha |
|----+---------|
| 0 | a |
| 1 | b |
| 2 | c |
| 3 | d |
| 4 | e |
| 5 | f |
| 6 | g |
| 7 | h |
| 8 | i |
| 9 | j |
| 10 | k |
| 11 | l |
| 12 | m |
| 13 | n |
| 14 | o |
| 15 | p |
| 16 | q |
| 17 | r |
| 18 | s |
| 19 | t |
| 20 | u |
| 21 | v |
| 22 | w |
| 23 | x |
| 24 | y |
| 25 | z |
+----+---------+
dictionary to Dataframe
data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
df = pd.DataFrame.from_dict(data, orient='index',
columns=['A', 'B', 'C', 'D'])
Output :
+-------+-----+-----+-----+-----+
| | A | B | C | D |
|-------+-----+-----+-----+-----|
| row_2 | a | b | c | d |
| row_1 | 3 | 2 | 1 | 0 |
+-------+-----+-----+-----+-----+
Here we create dataframe from a complex dictionary
data = {"date":"2018-01-02","data":{"AAPL":{"open":"170.16","close":"172.26","high":"172.30","low":"169.26","volume":"25555934"}
,"MSFT":{"open":"86.13","close":"85.95","high":"86.31","low":"85.50","volume":"22483797"}
}
}
df = pd.DataFrame.from_dict(data)
Output :
+------+------------------------------------------------------------------------------------------------+------------+
| | data | date |
|------+------------------------------------------------------------------------------------------------+------------|
| AAPL | {'high': '172.30', 'volume': '25555934', 'close': '172.26', 'low': '169.26', 'open': '170.16'} | 2018-01-02 |
| MSFT | {'high': '86.31', 'volume': '22483797', 'close': '85.95', 'low': '85.50', 'open': '86.13'} | 2018-01-02 |
+------+------------------------------------------------------------------------------------------------+------------+
df = pd.concat([df['date'],df['data'].apply(pd.Series)], axis=1)
Output :
+------+------------+---------+--------+--------+--------+----------+
| | date | close | high | low | open | volume |
|------+------------+---------+--------+--------+--------+----------|
| AAPL | 2018-01-02 | 172.26 | 172.3 | 169.26 | 170.16 | 25555934 |
| MSFT | 2018-01-02 | 85.95 | 86.31 | 85.5 | 86.13 | 22483797 |
+------+------------+---------+--------+--------+--------+----------+
Raw text to dataframe
import sys
if sys.version_info[0] < 3:
from StringIO import StringIO
else:
from io import StringIO
import pandas as pd
DF1 = StringIO("""
Date Fruit Num Color
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
""")
DF2 = StringIO("""
Date Fruit Num Color
2013-11-24 Banana 22.1 Yellow
2013-11-24 Orange 8.6 Orange
2013-11-24 Apple 7.6 Green
2013-11-24 Celery 10.2 Green
2013-11-25 Apple 22.1 Red
2013-11-25 Orange 8.6 Orange
""")
df1 = pd.read_csv(DF1, sep='\s+')
df2 = pd.read_csv(DF2, sep='\s+')
dfs_dictionary = {'DF1':df1,'DF2':df2}
df=pd.concat(dfs_dictionary)
print(tabulate(df, headers='keys', tablefmt='psql', showindex=True))
# Output
+------------+------------+---------+-------+---------+
| | Date | Fruit | Num | Color |
|------------+------------+---------+-------+---------|
| ('DF1', 0) | 2013-11-24 | Banana | 22.1 | Yellow |
| ('DF1', 1) | 2013-11-24 | Orange | 8.6 | Orange |
| ('DF1', 2) | 2013-11-24 | Apple | 7.6 | Green |
| ('DF1', 3) | 2013-11-24 | Celery | 10.2 | Green |
| ('DF2', 0) | 2013-11-24 | Banana | 22.1 | Yellow |
| ('DF2', 1) | 2013-11-24 | Orange | 8.6 | Orange |
| ('DF2', 2) | 2013-11-24 | Apple | 7.6 | Green |
| ('DF2', 3) | 2013-11-24 | Celery | 10.2 | Green |
| ('DF2', 4) | 2013-11-25 | Apple | 22.1 | Red |
| ('DF2', 5) | 2013-11-25 | Orange | 8.6 | Orange |
+------------+------------+---------+-------+---------+
Dummy data : Generating a very big table
Not sure how or when this would be useful, but you could do this as well.
import numpy as np
# Add more zeros, you can hit memory limit
df = pd.DataFrame(np.random.randn(6000, 5000))
df.shape
# Output
(6000, 5000)
df.info()
# Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000 entries, 0 to 5999
Columns: 5000 entries, 0 to 4999
dtypes: float64(5000)
memory usage: 228.9 MB