Skip to content
bobby_dreamer

Creating pandas dataframes

python, pandas1 min read

Pandas dataframe is a 2D data structure which looks like a table. There are different ways to create a dataframe, mostly everybody sticks to few formats they like and stick to it, thats good. But, sometimes when you have a problem and you want somebody's opinion or get a suggestion or solution, you cannot give all your code and ask them to execute and start explaining the problem. Its a lot easier if you could generate a small sample data and explain it and from suggestion/solution you could copy the concept to your main program.

Whenever you ask a question in StackOverflow or any forum and when preparing to ask the question itself, you would know that it would be a lot easier if you provided a sample data and say this the problem you are facing and this what i have tried, so far ? In python pandas world, you need to build small dataframes to build those examples.

Here are few ways of creating new dataframes and generating data.

  1. Empty dataframe
  2. Tuple technique
  3. Dictionary with value lists
  4. Array of Arrays / Lists of List
  5. Array of dictionary
  6. pd.Series to dataframe
  7. 8x4 dataframe with some random data
  8. Single column list of alphabets
  9. Dictionary to Dataframe
  10. Raw text to dataframe
  11. Dummy data : Generating a very big table


Empty dataframe
1# Starting from empty dataframe
2df = pd.DataFrame()
3
4names = ['Gandolf','Gimli','Frodo','Legolas','Bilbo']
5types = ['Wizard','Dwarf','Hobbit','Elf','Hobbit']
6magic = [10, 1, 4, 6, 4]
7aggression = [7, 10, 2, 5, 1]
8stealth = [8, 2, 5, 10, 5]
9
10df['names'], df['type'], df['magic_power'] = names, types, magic
11df['aggression'], df['stealth'] = aggression, stealth
12
13Output :
14+----+---------+--------+---------------+--------------+-----------+
15| | names | type | magic_power | aggression | stealth |
16|----+---------+--------+---------------+--------------+-----------|
17| 0 | Gandolf | Wizard | 10 | 7 | 8 |
18| 1 | Gimli | Dwarf | 1 | 10 | 2 |
19| 2 | Frodo | Hobbit | 4 | 2 | 5 |
20| 3 | Legolas | Elf | 6 | 5 | 10 |
21| 4 | Bilbo | Hobbit | 4 | 1 | 5 |
22+----+---------+--------+---------------+--------------+-----------+

Empty Dataframe with columns

1df_mv = pd.DataFrame(columns = ['movie_id', 'name', 'short_description', 'plot', 'rating'])
2
3Output :
4+------------+--------+---------------------+--------+----------+
5| movie_id | name | short_description | plot | rating |
6|------------+--------+---------------------+--------+----------|
7+------------+--------+---------------------+--------+----------+
8
9# When you add index, a null row will be added(meaning all columns will have null values)
10df_mv = pd.DataFrame(columns = ['movie_id', 'name', 'short_description', 'plot', 'rating'], index=[0])
11
12Output :
13+----+------------+--------+---------------------+--------+----------+
14| | movie_id | name | short_description | plot | rating |
15|----+------------+--------+---------------------+--------+----------|
16| 0 | nan | nan | nan | nan | nan |
17+----+------------+--------+---------------------+--------+----------+
18
19# Adding values
20df_mv.loc[0] = [1, 'Abyss', 'Higher form of intelligent aliens', 'aliens under the ocean', 9]
21Output :
22+----+------------+--------+-----------------------------------+------------------------+----------+
23| | movie_id | name | short_description | plot | rating |
24|----+------------+--------+-----------------------------------+------------------------+----------|
25| 0 | 1 | Abyss | Higher form of intelligent aliens | aliens under the ocean | 9 |
26+----+------------+--------+-----------------------------------+------------------------+----------+

Tuple technique

Here index is the first tuple variable

1first = ('Mike', 'Dorothee', 'Tom', 'Bill', 'Pete', 'Kate')
2last = ('Meyer', 'Maier', 'Meyer', 'Mayer', 'Meyr', 'Mair')
3job = ('data analyst', 'programmer', 'computer scientist',
4 'data scientist', 'programmer', 'psychiatrist')
5language= ('Python', 'Perl', 'Java', 'Pithon', 'Pythen', 'Brainfuck')
6
7data = list(zip(last, job, language)
8Output :
9[('Meyer', 'data analyst', 'Python'),
10 ('Maier', 'Web Developer', 'NodeJS'),
11 ('Meyer', 'Enterprise Engineer', 'Java'),
12 ('Mayer', 'Mainframe Developer', 'COBOL'),
13 ('Meyr', 'Analyst', 'SAS'),
14 ('Mair', 'psychiatrist', 'Brain Mechainic')]
15
16df = pd.DataFrame(data, columns =['last', 'job', 'language'], index=first)
17Output :
18+----------+--------+---------------------+-----------------+
19| | last | job | language |
20|----------+--------+---------------------+-----------------|
21| Mike | Meyer | data analyst | Python |
22| Dorothee | Maier | Web Developer | NodeJS |
23| Tom | Meyer | Enterprise Engineer | Java |
24| Bill | Mayer | Mainframe Developer | COBOL |
25| Pete | Meyr | Analyst | SAS |
26| Kate | Mair | psychiatrist | Brain Mechainic |
27+----------+--------+---------------------+-----------------+
28
29Note : Instead of a tuple, if it was a list, it would work in the same way.

Dictionary with value lists
1df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
2 'B': ['foo', 'bar', 'bloo', 'blee', 'bloo'],
3 'C': ['green', 'red', 'blue', 'yellow', 'green']})
4
5Output :
6+-----+------+--------+
7| A | B | C |
8|-----+------+--------|
9| 0 | foo | green |
10| 1 | bar | red |
11| 2 | bloo | blue |
12| 3 | blee | yellow |
13| 4 | bloo | green |
14+-----+------+--------+

Array of Arrays / Lists of List
1data = [['tom', 10], ['nick', 15], ['juli', 14]]
2df = pd.DataFrame(data, columns = ['Name', 'Age'])
3
4Output :
5+--------+-------+
6| Name | Age |
7|--------+-------|
8| tom | 10 |
9| nick | 15 |
10| juli | 14 |
11+--------+-------+

Array of dictionary
1# Third item in array doesn't have key 'a'
2data = [{'a': 1, 'b': 2, 'c':3}, {'a':10, 'b': 20, 'c': 30}
3 , {'b': 2, 'c':3}]
4df = pd.DataFrame(data)
5
6Output :
7+-----+-----+-----+
8| a | b | c |
9|-----+-----+-----|
10| 1 | 2 | 3 |
11| 10 | 20 | 30 |
12| nan | 2 | 3 |
13+-----+-----+-----+
14
15# 'a1' completely happens to be a new column
16data = [{'a': 1, 'b': 2, 'c':3}, {'a':10, 'b': 20, 'c': 30}
17 , {'b': 2, 'c':3}]
18df = pd.DataFrame(data, columns=['a', 'a1', 'b', 'c'])
19
20Output :
21+-----+------+-----+-----+
22| a | a1 | b | c |
23|-----+------+-----+-----|
24| 1 | nan | 2 | 3 |
25| 10 | nan | 20 | 30 |
26| nan | nan | 2 | 3 |
27+-----+------+-----+-----+

pd.Series to dataframe
1d = {'one' : pd.Series([10, 20, 30, 40], index =['a', 'b', 'c', 'd']),
2 'two' : pd.Series([10, 20, 30, 40], index =['d', 'c', 'b', 'a'])}
3df = pd.DataFrame(d)
4
5Output :
6+-------+-------+
7| one | two |
8|-------+-------|
9| 10 | 40 |
10| 20 | 30 |
11| 30 | 20 |
12| 40 | 10 |
13+-------+-------+

8x4 dataframe with some random data
1# Generates date
2dates = pd.date_range('1/1/2000', periods=8)
3
4df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
5
6Output :
7+---------------------+------------+-----------+-----------+------------+
8| | A | B | C | D |
9|---------------------+------------+-----------+-----------+------------|
10| 2000-01-01 00:00:00 | -1.05128 | 0.238627 | 0.106378 | 1.10883 |
11| 2000-01-02 00:00:00 | 0.314559 | 1.8427 | 0.282821 | -1.47774 |
12| 2000-01-03 00:00:00 | -0.638806 | 0.139111 | 1.08201 | 0.35632 |
13| 2000-01-04 00:00:00 | 1.44253 | -0.113561 | 0.829517 | 0.581576 |
14| 2000-01-05 00:00:00 | -1.40117 | -2.12707 | 0.127889 | 0.721586 |
15| 2000-01-06 00:00:00 | 0.724627 | 0.186909 | -0.56129 | -0.528885 |
16| 2000-01-07 00:00:00 | -0.0975346 | -0.343885 | -0.589784 | 0.0576514 |
17| 2000-01-08 00:00:00 | -1.56196 | 0.343319 | -0.172569 | -0.232157 |
18+---------------------+------------+-----------+-----------+------------+

Single column list of alphabets
1import string
2df = pd.DataFrame({'alpha': list(string.ascii_lowercase)})
3
4Output :
5+----+---------+
6| | alpha |
7|----+---------|
8| 0 | a |
9| 1 | b |
10| 2 | c |
11| 3 | d |
12| 4 | e |
13| 5 | f |
14| 6 | g |
15| 7 | h |
16| 8 | i |
17| 9 | j |
18| 10 | k |
19| 11 | l |
20| 12 | m |
21| 13 | n |
22| 14 | o |
23| 15 | p |
24| 16 | q |
25| 17 | r |
26| 18 | s |
27| 19 | t |
28| 20 | u |
29| 21 | v |
30| 22 | w |
31| 23 | x |
32| 24 | y |
33| 25 | z |
34+----+---------+

dictionary to Dataframe
1data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
2df = pd.DataFrame.from_dict(data, orient='index',
3 columns=['A', 'B', 'C', 'D'])
4
5Output :
6+-------+-----+-----+-----+-----+
7| | A | B | C | D |
8|-------+-----+-----+-----+-----|
9| row_2 | a | b | c | d |
10| row_1 | 3 | 2 | 1 | 0 |
11+-------+-----+-----+-----+-----+

Here we create dataframe from a complex dictionary

1data = {"date":"2018-01-02","data":{"AAPL":{"open":"170.16","close":"172.26","high":"172.30","low":"169.26","volume":"25555934"}
2 ,"MSFT":{"open":"86.13","close":"85.95","high":"86.31","low":"85.50","volume":"22483797"}
3 }
4 }
5df = pd.DataFrame.from_dict(data)
6
7Output :
8+------+------------------------------------------------------------------------------------------------+------------+
9| | data | date |
10|------+------------------------------------------------------------------------------------------------+------------|
11| AAPL | {'high': '172.30', 'volume': '25555934', 'close': '172.26', 'low': '169.26', 'open': '170.16'} | 2018-01-02 |
12| MSFT | {'high': '86.31', 'volume': '22483797', 'close': '85.95', 'low': '85.50', 'open': '86.13'} | 2018-01-02 |
13+------+------------------------------------------------------------------------------------------------+------------+
14
15df = pd.concat([df['date'],df['data'].apply(pd.Series)], axis=1)
16Output :
17+------+------------+---------+--------+--------+--------+----------+
18| | date | close | high | low | open | volume |
19|------+------------+---------+--------+--------+--------+----------|
20| AAPL | 2018-01-02 | 172.26 | 172.3 | 169.26 | 170.16 | 25555934 |
21| MSFT | 2018-01-02 | 85.95 | 86.31 | 85.5 | 86.13 | 22483797 |
22+------+------------+---------+--------+--------+--------+----------+

Raw text to dataframe
1import sys
2if sys.version_info[0] < 3:
3 from StringIO import StringIO
4else:
5 from io import StringIO
6import pandas as pd
7DF1 = StringIO("""
8Date Fruit Num Color
92013-11-24 Banana 22.1 Yellow
102013-11-24 Orange 8.6 Orange
112013-11-24 Apple 7.6 Green
122013-11-24 Celery 10.2 Green
13""")
14DF2 = StringIO("""
15Date Fruit Num Color
162013-11-24 Banana 22.1 Yellow
172013-11-24 Orange 8.6 Orange
182013-11-24 Apple 7.6 Green
192013-11-24 Celery 10.2 Green
202013-11-25 Apple 22.1 Red
212013-11-25 Orange 8.6 Orange
22""")
23df1 = pd.read_csv(DF1, sep='\s+')
24df2 = pd.read_csv(DF2, sep='\s+')
25
26dfs_dictionary = {'DF1':df1,'DF2':df2}
27df=pd.concat(dfs_dictionary)
28
29print(tabulate(df, headers='keys', tablefmt='psql', showindex=True))
30# Output
31+------------+------------+---------+-------+---------+
32| | Date | Fruit | Num | Color |
33|------------+------------+---------+-------+---------|
34| ('DF1', 0) | 2013-11-24 | Banana | 22.1 | Yellow |
35| ('DF1', 1) | 2013-11-24 | Orange | 8.6 | Orange |
36| ('DF1', 2) | 2013-11-24 | Apple | 7.6 | Green |
37| ('DF1', 3) | 2013-11-24 | Celery | 10.2 | Green |
38| ('DF2', 0) | 2013-11-24 | Banana | 22.1 | Yellow |
39| ('DF2', 1) | 2013-11-24 | Orange | 8.6 | Orange |
40| ('DF2', 2) | 2013-11-24 | Apple | 7.6 | Green |
41| ('DF2', 3) | 2013-11-24 | Celery | 10.2 | Green |
42| ('DF2', 4) | 2013-11-25 | Apple | 22.1 | Red |
43| ('DF2', 5) | 2013-11-25 | Orange | 8.6 | Orange |
44+------------+------------+---------+-------+---------+

Dummy data : Generating a very big table

Not sure how or when this would be useful, but you could do this as well.

1import numpy as np
2
3# Add more zeros, you can hit memory limit
4df = pd.DataFrame(np.random.randn(6000, 5000))
5
6df.shape
7# Output
8(6000, 5000)
9
10df.info()
11# Output
12<class 'pandas.core.frame.DataFrame'>
13RangeIndex: 6000 entries, 0 to 5999
14Columns: 5000 entries, 0 to 4999
15dtypes: float64(5000)
16memory usage: 228.9 MB