Creating pandas dataframes

08.08.2020 — python, pandas — 1 min read

Pandas dataframe is a 2D data structure which looks like a table. There are different ways to create a dataframe, mostly everybody sticks to few formats they like and stick to it, thats good. But, sometimes when you have a problem and you want somebody's opinion or get a suggestion or solution, you cannot give all your code and ask them to execute and start explaining the problem. Its a lot easier if you could generate a small sample data and explain it and from suggestion/solution you could copy the concept to your main program.

Whenever you ask a question in StackOverflow or any forum and when preparing to ask the question itself, you would know that it would be a lot easier if you provided a sample data and say this the problem you are facing and this what i have tried, so far ? In python pandas world, you need to build small dataframes to build those examples.

Here are few ways of creating new dataframes and generating data.

Empty dataframe
Tuple technique
Dictionary with value lists
Array of Arrays / Lists of List
Array of dictionary
pd.Series to dataframe
8x4 dataframe with some random data
Single column list of alphabets
Dictionary to Dataframe
Raw text to dataframe
Dummy data : Generating a very big table

Empty dataframe

1# Starting from empty dataframe 
2df = pd.DataFrame()
3
4names = ['Gandolf','Gimli','Frodo','Legolas','Bilbo']
5types = ['Wizard','Dwarf','Hobbit','Elf','Hobbit']
6magic = [10, 1, 4, 6, 4]
7aggression = [7, 10, 2, 5, 1]
8stealth = [8, 2, 5, 10, 5]
9
10df['names'], df['type'], df['magic_power'] = names, types, magic
11df['aggression'], df['stealth'] = aggression, stealth
12
13Output : 
14+----+---------+--------+---------------+--------------+-----------+
15|    | names   | type   |   magic_power |   aggression |   stealth |
16|----+---------+--------+---------------+--------------+-----------|
17|  0 | Gandolf | Wizard |            10 |            7 |         8 |
18|  1 | Gimli   | Dwarf  |             1 |           10 |         2 |
19|  2 | Frodo   | Hobbit |             4 |            2 |         5 |
20|  3 | Legolas | Elf    |             6 |            5 |        10 |
21|  4 | Bilbo   | Hobbit |             4 |            1 |         5 |
22+----+---------+--------+---------------+--------------+-----------+

Empty Dataframe with columns

1df_mv = pd.DataFrame(columns = ['movie_id', 'name', 'short_description', 'plot', 'rating'])
2
3Output : 
4+------------+--------+---------------------+--------+----------+
5| movie_id   | name   | short_description   | plot   | rating   |
6|------------+--------+---------------------+--------+----------|
7+------------+--------+---------------------+--------+----------+
8
9# When you add index, a null row will be added(meaning all columns will have null values)
10df_mv = pd.DataFrame(columns = ['movie_id', 'name', 'short_description', 'plot', 'rating'], index=[0])
11
12Output : 
13+----+------------+--------+---------------------+--------+----------+
14|    |   movie_id |   name |   short_description |   plot |   rating |
15|----+------------+--------+---------------------+--------+----------|
16|  0 |        nan |    nan |                 nan |    nan |      nan |
17+----+------------+--------+---------------------+--------+----------+
18
19# Adding values 
20df_mv.loc[0] = [1, 'Abyss', 'Higher form of intelligent aliens', 'aliens under the ocean', 9]
21Output : 
22+----+------------+--------+-----------------------------------+------------------------+----------+
23|    |   movie_id | name   | short_description                 | plot                   |   rating |
24|----+------------+--------+-----------------------------------+------------------------+----------|
25|  0 |          1 | Abyss  | Higher form of intelligent aliens | aliens under the ocean |        9 |
26+----+------------+--------+-----------------------------------+------------------------+----------+

Tuple technique

Here index is the first tuple variable

1first   = ('Mike', 'Dorothee', 'Tom', 'Bill', 'Pete', 'Kate')
2last    = ('Meyer', 'Maier', 'Meyer', 'Mayer', 'Meyr', 'Mair')
3job     = ('data analyst', 'programmer', 'computer scientist', 
4           'data scientist', 'programmer', 'psychiatrist')
5language= ('Python', 'Perl', 'Java', 'Pithon', 'Pythen', 'Brainfuck')
6
7data = list(zip(last, job, language)
8Output : 
9[('Meyer', 'data analyst', 'Python'),
10 ('Maier', 'Web Developer', 'NodeJS'),
11 ('Meyer', 'Enterprise Engineer', 'Java'),
12 ('Mayer', 'Mainframe Developer', 'COBOL'),
13 ('Meyr', 'Analyst', 'SAS'),
14 ('Mair', 'psychiatrist', 'Brain Mechainic')]
15
16df = pd.DataFrame(data, columns =['last', 'job', 'language'], index=first) 
17Output : 
18+----------+--------+---------------------+-----------------+
19|          | last   | job                 | language        |
20|----------+--------+---------------------+-----------------|
21| Mike     | Meyer  | data analyst        | Python          |
22| Dorothee | Maier  | Web Developer       | NodeJS          |
23| Tom      | Meyer  | Enterprise Engineer | Java            |
24| Bill     | Mayer  | Mainframe Developer | COBOL           |
25| Pete     | Meyr   | Analyst             | SAS             |
26| Kate     | Mair   | psychiatrist        | Brain Mechainic |
27+----------+--------+---------------------+-----------------+
28
29Note : Instead of a tuple, if it was a list, it would work in the same way.

Dictionary with value lists

1df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
2                   'B': ['foo', 'bar', 'bloo', 'blee', 'bloo'],
3                   'C': ['green', 'red', 'blue', 'yellow', 'green']})
4
5Output : 
6+-----+------+--------+
7|   A | B    | C      |
8|-----+------+--------|
9|   0 | foo  | green  |
10|   1 | bar  | red    |
11|   2 | bloo | blue   |
12|   3 | blee | yellow |
13|   4 | bloo | green  |
14+-----+------+--------+

Array of Arrays / Lists of List

1data = [['tom', 10], ['nick', 15], ['juli', 14]] 
2df = pd.DataFrame(data, columns = ['Name', 'Age']) 
3
4Output : 
5+--------+-------+
6| Name   |   Age |
7|--------+-------|
8| tom    |    10 |
9| nick   |    15 |
10| juli   |    14 |
11+--------+-------+

Array of dictionary

1# Third item in array doesn't have key 'a'
2data = [{'a': 1, 'b': 2, 'c':3}, {'a':10, 'b': 20, 'c': 30}
3        , {'b': 2, 'c':3}] 
4df = pd.DataFrame(data) 
5
6Output : 
7+-----+-----+-----+
8|   a |   b |   c |
9|-----+-----+-----|
10|   1 |   2 |   3 |
11|  10 |  20 |  30 |
12| nan |   2 |   3 |
13+-----+-----+-----+
14
15# 'a1' completely happens to be a new column
16data = [{'a': 1, 'b': 2, 'c':3}, {'a':10, 'b': 20, 'c': 30}
17        , {'b': 2, 'c':3}] 
18df = pd.DataFrame(data, columns=['a', 'a1', 'b', 'c']) 
19
20Output : 
21+-----+------+-----+-----+
22|   a |   a1 |   b |   c |
23|-----+------+-----+-----|
24|   1 |  nan |   2 |   3 |
25|  10 |  nan |  20 |  30 |
26| nan |  nan |   2 |   3 |
27+-----+------+-----+-----+

pd.Series to dataframe

1d = {'one' : pd.Series([10, 20, 30, 40], index =['a', 'b', 'c', 'd']), 
2    'two' : pd.Series([10, 20, 30, 40], index =['d', 'c', 'b', 'a'])} 
3df = pd.DataFrame(d) 
4
5Output : 
6+-------+-------+
7|   one |   two |
8|-------+-------|
9|    10 |    40 |
10|    20 |    30 |
11|    30 |    20 |
12|    40 |    10 |
13+-------+-------+

8x4 dataframe with some random data

1# Generates date 
2dates = pd.date_range('1/1/2000', periods=8)
3
4df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
5
6Output : 
7+---------------------+------------+-----------+-----------+------------+
8|                     |          A |         B |         C |          D |
9|---------------------+------------+-----------+-----------+------------|
10| 2000-01-01 00:00:00 | -1.05128   |  0.238627 |  0.106378 |  1.10883   |
11| 2000-01-02 00:00:00 |  0.314559  |  1.8427   |  0.282821 | -1.47774   |
12| 2000-01-03 00:00:00 | -0.638806  |  0.139111 |  1.08201  |  0.35632   |
13| 2000-01-04 00:00:00 |  1.44253   | -0.113561 |  0.829517 |  0.581576  |
14| 2000-01-05 00:00:00 | -1.40117   | -2.12707  |  0.127889 |  0.721586  |
15| 2000-01-06 00:00:00 |  0.724627  |  0.186909 | -0.56129  | -0.528885  |
16| 2000-01-07 00:00:00 | -0.0975346 | -0.343885 | -0.589784 |  0.0576514 |
17| 2000-01-08 00:00:00 | -1.56196   |  0.343319 | -0.172569 | -0.232157  |
18+---------------------+------------+-----------+-----------+------------+

Single column list of alphabets

1import string
2df = pd.DataFrame({'alpha': list(string.ascii_lowercase)})
3
4Output : 
5+----+---------+
6|    | alpha   |
7|----+---------|
8|  0 | a       |
9|  1 | b       |
10|  2 | c       |
11|  3 | d       |
12|  4 | e       |
13|  5 | f       |
14|  6 | g       |
15|  7 | h       |
16|  8 | i       |
17|  9 | j       |
18| 10 | k       |
19| 11 | l       |
20| 12 | m       |
21| 13 | n       |
22| 14 | o       |
23| 15 | p       |
24| 16 | q       |
25| 17 | r       |
26| 18 | s       |
27| 19 | t       |
28| 20 | u       |
29| 21 | v       |
30| 22 | w       |
31| 23 | x       |
32| 24 | y       |
33| 25 | z       |
34+----+---------+

dictionary to Dataframe

1data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
2df = pd.DataFrame.from_dict(data, orient='index',
3                       columns=['A', 'B', 'C', 'D'])
4
5Output : 
6+-------+-----+-----+-----+-----+
7|       | A   | B   | C   | D   |
8|-------+-----+-----+-----+-----|
9| row_2 | a   | b   | c   | d   |
10| row_1 | 3   | 2   | 1   | 0   |
11+-------+-----+-----+-----+-----+

Here we create dataframe from a complex dictionary

1data = {"date":"2018-01-02","data":{"AAPL":{"open":"170.16","close":"172.26","high":"172.30","low":"169.26","volume":"25555934"}
2                                    ,"MSFT":{"open":"86.13","close":"85.95","high":"86.31","low":"85.50","volume":"22483797"}
3                                    }
4        }
5df = pd.DataFrame.from_dict(data)
6
7Output : 
8+------+------------------------------------------------------------------------------------------------+------------+
9|      | data                                                                                           | date       |
10|------+------------------------------------------------------------------------------------------------+------------|
11| AAPL | {'high': '172.30', 'volume': '25555934', 'close': '172.26', 'low': '169.26', 'open': '170.16'} | 2018-01-02 |
12| MSFT | {'high': '86.31', 'volume': '22483797', 'close': '85.95', 'low': '85.50', 'open': '86.13'}     | 2018-01-02 |
13+------+------------------------------------------------------------------------------------------------+------------+
14
15df = pd.concat([df['date'],df['data'].apply(pd.Series)], axis=1)
16Output : 
17+------+------------+---------+--------+--------+--------+----------+
18|      | date       |   close |   high |    low |   open |   volume |
19|------+------------+---------+--------+--------+--------+----------|
20| AAPL | 2018-01-02 |  172.26 | 172.3  | 169.26 | 170.16 | 25555934 |
21| MSFT | 2018-01-02 |   85.95 |  86.31 |  85.5  |  86.13 | 22483797 |
22+------+------------+---------+--------+--------+--------+----------+

Raw text to dataframe

1import sys
2if sys.version_info[0] < 3:
3    from StringIO import StringIO
4else:
5    from io import StringIO
6import pandas as pd
7DF1 = StringIO("""
8Date       Fruit  Num  Color 
92013-11-24 Banana 22.1 Yellow
102013-11-24 Orange  8.6 Orange
112013-11-24 Apple   7.6 Green
122013-11-24 Celery 10.2 Green
13""")
14DF2 = StringIO("""
15Date       Fruit  Num  Color 
162013-11-24 Banana 22.1 Yellow
172013-11-24 Orange  8.6 Orange
182013-11-24 Apple   7.6 Green
192013-11-24 Celery 10.2 Green
202013-11-25 Apple  22.1 Red
212013-11-25 Orange  8.6 Orange
22""")
23df1 = pd.read_csv(DF1, sep='\s+')
24df2 = pd.read_csv(DF2, sep='\s+')
25
26dfs_dictionary = {'DF1':df1,'DF2':df2}
27df=pd.concat(dfs_dictionary)
28
29print(tabulate(df, headers='keys', tablefmt='psql', showindex=True))
30# Output
31+------------+------------+---------+-------+---------+
32|            | Date       | Fruit   |   Num | Color   |
33|------------+------------+---------+-------+---------|
34| ('DF1', 0) | 2013-11-24 | Banana  |  22.1 | Yellow  |
35| ('DF1', 1) | 2013-11-24 | Orange  |   8.6 | Orange  |
36| ('DF1', 2) | 2013-11-24 | Apple   |   7.6 | Green   |
37| ('DF1', 3) | 2013-11-24 | Celery  |  10.2 | Green   |
38| ('DF2', 0) | 2013-11-24 | Banana  |  22.1 | Yellow  |
39| ('DF2', 1) | 2013-11-24 | Orange  |   8.6 | Orange  |
40| ('DF2', 2) | 2013-11-24 | Apple   |   7.6 | Green   |
41| ('DF2', 3) | 2013-11-24 | Celery  |  10.2 | Green   |
42| ('DF2', 4) | 2013-11-25 | Apple   |  22.1 | Red     |
43| ('DF2', 5) | 2013-11-25 | Orange  |   8.6 | Orange  |
44+------------+------------+---------+-------+---------+

Dummy data : Generating a very big table

Not sure how or when this would be useful, but you could do this as well.

1import numpy as np
2
3# Add more zeros, you can hit memory limit
4df = pd.DataFrame(np.random.randn(6000, 5000))
5
6df.shape
7# Output 
8(6000, 5000)
9
10df.info()
11# Output 
12<class 'pandas.core.frame.DataFrame'>
13RangeIndex: 6000 entries, 0 to 5999
14Columns: 5000 entries, 0 to 4999
15dtypes: float64(5000)
16memory usage: 228.9 MB