Pandas dataframe is a 2D data structure which looks like a table. There are different ways to create a dataframe, mostly everybody sticks to few formats they like and stick to it, thats good. But, sometimes when you have a problem and you want somebody's opinion or get a suggestion or solution, you cannot give all your code and ask them to execute and start explaining the problem. Its a lot easier if you could generate a small sample data and explain it and from suggestion/solution you could copy the concept to your main program.
Whenever you ask a question in StackOverflow or any forum and when preparing to ask the question itself, you would know that it would be a lot easier if you provided a sample data and say this the problem you are facing and this what i have tried, so far ? In python pandas world, you need to build small dataframes to build those examples.
Here are few ways of creating new dataframes and generating data.
1# Starting from empty dataframe 2df = pd.DataFrame()3
4names = ['Gandolf','Gimli','Frodo','Legolas','Bilbo']5types = ['Wizard','Dwarf','Hobbit','Elf','Hobbit']6magic = [10, 1, 4, 6, 4]7aggression = [7, 10, 2, 5, 1]8stealth = [8, 2, 5, 10, 5]9
10df['names'], df['type'], df['magic_power'] = names, types, magic11df['aggression'], df['stealth'] = aggression, stealth12
13Output : 14+----+---------+--------+---------------+--------------+-----------+15| | names | type | magic_power | aggression | stealth |16|----+---------+--------+---------------+--------------+-----------|17| 0 | Gandolf | Wizard | 10 | 7 | 8 |18| 1 | Gimli | Dwarf | 1 | 10 | 2 |19| 2 | Frodo | Hobbit | 4 | 2 | 5 |20| 3 | Legolas | Elf | 6 | 5 | 10 |21| 4 | Bilbo | Hobbit | 4 | 1 | 5 |22+----+---------+--------+---------------+--------------+-----------+
Empty Dataframe with columns
1df_mv = pd.DataFrame(columns = ['movie_id', 'name', 'short_description', 'plot', 'rating'])2
3Output : 4+------------+--------+---------------------+--------+----------+5| movie_id | name | short_description | plot | rating |6|------------+--------+---------------------+--------+----------|7+------------+--------+---------------------+--------+----------+8
9# When you add index, a null row will be added(meaning all columns will have null values)10df_mv = pd.DataFrame(columns = ['movie_id', 'name', 'short_description', 'plot', 'rating'], index=[0])11
12Output : 13+----+------------+--------+---------------------+--------+----------+14| | movie_id | name | short_description | plot | rating |15|----+------------+--------+---------------------+--------+----------|16| 0 | nan | nan | nan | nan | nan |17+----+------------+--------+---------------------+--------+----------+18
19# Adding values 20df_mv.loc[0] = [1, 'Abyss', 'Higher form of intelligent aliens', 'aliens under the ocean', 9]21Output : 22+----+------------+--------+-----------------------------------+------------------------+----------+23| | movie_id | name | short_description | plot | rating |24|----+------------+--------+-----------------------------------+------------------------+----------|25| 0 | 1 | Abyss | Higher form of intelligent aliens | aliens under the ocean | 9 |26+----+------------+--------+-----------------------------------+------------------------+----------+
Here index is the first
tuple variable
1first = ('Mike', 'Dorothee', 'Tom', 'Bill', 'Pete', 'Kate')2last = ('Meyer', 'Maier', 'Meyer', 'Mayer', 'Meyr', 'Mair')3job = ('data analyst', 'programmer', 'computer scientist', 4 'data scientist', 'programmer', 'psychiatrist')5language= ('Python', 'Perl', 'Java', 'Pithon', 'Pythen', 'Brainfuck')6
7data = list(zip(last, job, language)8Output : 9[('Meyer', 'data analyst', 'Python'),10 ('Maier', 'Web Developer', 'NodeJS'),11 ('Meyer', 'Enterprise Engineer', 'Java'),12 ('Mayer', 'Mainframe Developer', 'COBOL'),13 ('Meyr', 'Analyst', 'SAS'),14 ('Mair', 'psychiatrist', 'Brain Mechainic')]15
16df = pd.DataFrame(data, columns =['last', 'job', 'language'], index=first) 17Output : 18+----------+--------+---------------------+-----------------+19| | last | job | language |20|----------+--------+---------------------+-----------------|21| Mike | Meyer | data analyst | Python |22| Dorothee | Maier | Web Developer | NodeJS |23| Tom | Meyer | Enterprise Engineer | Java |24| Bill | Mayer | Mainframe Developer | COBOL |25| Pete | Meyr | Analyst | SAS |26| Kate | Mair | psychiatrist | Brain Mechainic |27+----------+--------+---------------------+-----------------+28
29Note : Instead of a tuple, if it was a list, it would work in the same way.
1df = pd.DataFrame({'A': [0, 1, 2, 3, 4],2 'B': ['foo', 'bar', 'bloo', 'blee', 'bloo'],3 'C': ['green', 'red', 'blue', 'yellow', 'green']})4
5Output : 6+-----+------+--------+7| A | B | C |8|-----+------+--------|9| 0 | foo | green |10| 1 | bar | red |11| 2 | bloo | blue |12| 3 | blee | yellow |13| 4 | bloo | green |14+-----+------+--------+
1data = [['tom', 10], ['nick', 15], ['juli', 14]] 2df = pd.DataFrame(data, columns = ['Name', 'Age']) 3
4Output : 5+--------+-------+6| Name | Age |7|--------+-------|8| tom | 10 |9| nick | 15 |10| juli | 14 |11+--------+-------+
1# Third item in array doesn't have key 'a'2data = [{'a': 1, 'b': 2, 'c':3}, {'a':10, 'b': 20, 'c': 30}3 , {'b': 2, 'c':3}] 4df = pd.DataFrame(data) 5
6Output : 7+-----+-----+-----+8| a | b | c |9|-----+-----+-----|10| 1 | 2 | 3 |11| 10 | 20 | 30 |12| nan | 2 | 3 |13+-----+-----+-----+14
15# 'a1' completely happens to be a new column16data = [{'a': 1, 'b': 2, 'c':3}, {'a':10, 'b': 20, 'c': 30}17 , {'b': 2, 'c':3}] 18df = pd.DataFrame(data, columns=['a', 'a1', 'b', 'c']) 19
20Output : 21+-----+------+-----+-----+22| a | a1 | b | c |23|-----+------+-----+-----|24| 1 | nan | 2 | 3 |25| 10 | nan | 20 | 30 |26| nan | nan | 2 | 3 |27+-----+------+-----+-----+
1d = {'one' : pd.Series([10, 20, 30, 40], index =['a', 'b', 'c', 'd']), 2 'two' : pd.Series([10, 20, 30, 40], index =['d', 'c', 'b', 'a'])} 3df = pd.DataFrame(d) 4
5Output : 6+-------+-------+7| one | two |8|-------+-------|9| 10 | 40 |10| 20 | 30 |11| 30 | 20 |12| 40 | 10 |13+-------+-------+
1# Generates date 2dates = pd.date_range('1/1/2000', periods=8)3
4df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])5
6Output : 7+---------------------+------------+-----------+-----------+------------+8| | A | B | C | D |9|---------------------+------------+-----------+-----------+------------|10| 2000-01-01 00:00:00 | -1.05128 | 0.238627 | 0.106378 | 1.10883 |11| 2000-01-02 00:00:00 | 0.314559 | 1.8427 | 0.282821 | -1.47774 |12| 2000-01-03 00:00:00 | -0.638806 | 0.139111 | 1.08201 | 0.35632 |13| 2000-01-04 00:00:00 | 1.44253 | -0.113561 | 0.829517 | 0.581576 |14| 2000-01-05 00:00:00 | -1.40117 | -2.12707 | 0.127889 | 0.721586 |15| 2000-01-06 00:00:00 | 0.724627 | 0.186909 | -0.56129 | -0.528885 |16| 2000-01-07 00:00:00 | -0.0975346 | -0.343885 | -0.589784 | 0.0576514 |17| 2000-01-08 00:00:00 | -1.56196 | 0.343319 | -0.172569 | -0.232157 |18+---------------------+------------+-----------+-----------+------------+
1import string2df = pd.DataFrame({'alpha': list(string.ascii_lowercase)})3
4Output : 5+----+---------+6| | alpha |7|----+---------|8| 0 | a |9| 1 | b |10| 2 | c |11| 3 | d |12| 4 | e |13| 5 | f |14| 6 | g |15| 7 | h |16| 8 | i |17| 9 | j |18| 10 | k |19| 11 | l |20| 12 | m |21| 13 | n |22| 14 | o |23| 15 | p |24| 16 | q |25| 17 | r |26| 18 | s |27| 19 | t |28| 20 | u |29| 21 | v |30| 22 | w |31| 23 | x |32| 24 | y |33| 25 | z |34+----+---------+
1data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}2df = pd.DataFrame.from_dict(data, orient='index',3 columns=['A', 'B', 'C', 'D'])4
5Output : 6+-------+-----+-----+-----+-----+7| | A | B | C | D |8|-------+-----+-----+-----+-----|9| row_2 | a | b | c | d |10| row_1 | 3 | 2 | 1 | 0 |11+-------+-----+-----+-----+-----+
Here we create dataframe from a complex dictionary
1data = {"date":"2018-01-02","data":{"AAPL":{"open":"170.16","close":"172.26","high":"172.30","low":"169.26","volume":"25555934"}2 ,"MSFT":{"open":"86.13","close":"85.95","high":"86.31","low":"85.50","volume":"22483797"}3 }4 }5df = pd.DataFrame.from_dict(data)6
7Output : 8+------+------------------------------------------------------------------------------------------------+------------+9| | data | date |10|------+------------------------------------------------------------------------------------------------+------------|11| AAPL | {'high': '172.30', 'volume': '25555934', 'close': '172.26', 'low': '169.26', 'open': '170.16'} | 2018-01-02 |12| MSFT | {'high': '86.31', 'volume': '22483797', 'close': '85.95', 'low': '85.50', 'open': '86.13'} | 2018-01-02 |13+------+------------------------------------------------------------------------------------------------+------------+14
15df = pd.concat([df['date'],df['data'].apply(pd.Series)], axis=1)16Output : 17+------+------------+---------+--------+--------+--------+----------+18| | date | close | high | low | open | volume |19|------+------------+---------+--------+--------+--------+----------|20| AAPL | 2018-01-02 | 172.26 | 172.3 | 169.26 | 170.16 | 25555934 |21| MSFT | 2018-01-02 | 85.95 | 86.31 | 85.5 | 86.13 | 22483797 |22+------+------------+---------+--------+--------+--------+----------+
1import sys2if sys.version_info[0] < 3:3 from StringIO import StringIO4else:5 from io import StringIO6import pandas as pd7DF1 = StringIO("""8Date Fruit Num Color 92013-11-24 Banana 22.1 Yellow102013-11-24 Orange 8.6 Orange112013-11-24 Apple 7.6 Green122013-11-24 Celery 10.2 Green13""")14DF2 = StringIO("""15Date Fruit Num Color 162013-11-24 Banana 22.1 Yellow172013-11-24 Orange 8.6 Orange182013-11-24 Apple 7.6 Green192013-11-24 Celery 10.2 Green202013-11-25 Apple 22.1 Red212013-11-25 Orange 8.6 Orange22""")23df1 = pd.read_csv(DF1, sep='\s+')24df2 = pd.read_csv(DF2, sep='\s+')25
26dfs_dictionary = {'DF1':df1,'DF2':df2}27df=pd.concat(dfs_dictionary)28
29print(tabulate(df, headers='keys', tablefmt='psql', showindex=True))30# Output31+------------+------------+---------+-------+---------+32| | Date | Fruit | Num | Color |33|------------+------------+---------+-------+---------|34| ('DF1', 0) | 2013-11-24 | Banana | 22.1 | Yellow |35| ('DF1', 1) | 2013-11-24 | Orange | 8.6 | Orange |36| ('DF1', 2) | 2013-11-24 | Apple | 7.6 | Green |37| ('DF1', 3) | 2013-11-24 | Celery | 10.2 | Green |38| ('DF2', 0) | 2013-11-24 | Banana | 22.1 | Yellow |39| ('DF2', 1) | 2013-11-24 | Orange | 8.6 | Orange |40| ('DF2', 2) | 2013-11-24 | Apple | 7.6 | Green |41| ('DF2', 3) | 2013-11-24 | Celery | 10.2 | Green |42| ('DF2', 4) | 2013-11-25 | Apple | 22.1 | Red |43| ('DF2', 5) | 2013-11-25 | Orange | 8.6 | Orange |44+------------+------------+---------+-------+---------+
Not sure how or when this would be useful, but you could do this as well.
1import numpy as np2
3# Add more zeros, you can hit memory limit4df = pd.DataFrame(np.random.randn(6000, 5000))5
6df.shape7# Output 8(6000, 5000)9
10df.info()11# Output 12<class 'pandas.core.frame.DataFrame'>13RangeIndex: 6000 entries, 0 to 599914Columns: 5000 entries, 0 to 499915dtypes: float64(5000)16memory usage: 228.9 MB