Class 08 – More Line Fitting | Big Data with Python

We have done some line fitting, and we are going to do a lot more! Half the life of a scientist is taking data, plotting it, and looking for a correlation, and I want to make sure you can do this in your sleep. What we are adding now is reading data from a file so we can handle much larger datasets. Today we are going to read an Excel file into pandas, the Python library for working with tabular data, and then fit a line. We are going to look at GDP versus life expectancy, a common dataset in sustainable development. The data came from a website, but I also put the Excel file on GitHub. All files are here: https://github.com/bmaillou/BigDataPython/tree/master/08-MoreLineFitting
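As a rough sketch of the workflow described above, the example below fits a line to a small table and prints the m, b, r, and p values. It assumes scipy.stats.linregress for the fit, which may not be the function used in the class notebook. The file name, the column names, and the numbers are all made up for illustration; the real data is in the Excel file on GitHub.

```python
import pandas as pd
from scipy import stats

# In class the table would come from the Excel file, for example:
#   df = pd.read_excel("gdp_life_expectancy.xlsx")   # hypothetical file name
# To keep this sketch runnable without the file, a tiny made-up table is used.
df = pd.DataFrame({
    "gdp_per_capita": [1000, 2000, 4000, 8000, 16000, 32000],   # made-up values
    "life_expectancy": [55.0, 60.0, 64.0, 70.0, 75.0, 79.0],    # made-up values
})

# linregress returns the slope (m), intercept (b), r-value, p-value, and standard error
fit = stats.linregress(df["gdp_per_capita"], df["life_expectancy"])

print(f"m = {fit.slope:.3g}")
print(f"b = {fit.intercept:.3g}")
print(f"r = {fit.rvalue:.3f}")
print(f"p = {fit.pvalue:.3g}")
```

Reading from a file only changes the first step; once the data is in a DataFrame, the fit is the same no matter how many rows there are.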

VIDEO

new website for videos.

NEW VIDEO NO QUIZ

Homework

Due Wednesday February 19, 2025

1. You need to find some data from the web or a paper. It could be from a paper you have read. It could be from another class. Find data with at least 10 data points. I would do somewhere from 10-50 points. Do NOT use data with dates. Dates are hard. If you use time for an x-axis it should be hours or something similar but not months/days/years with slashes. We will learn how to do dates later in the semester. All other data should work.

2. Put the data in an excel file. Add your name to the name of the excel file. This way I will be able to differentiate and read all the files.

3. Read in the Excel file and fit a best fit line. Remember to add a figure caption and all other parts of a figure. Add the text box showing the m, b, r, and p values to the graph. In the figure caption, say if the result is significant.

4. Make a new sheet in your Excel file that is a copy of your data. If your data had a significant correlation (p<0.05), change the data in the second sheet to make it not significant (p>0.05). If your data had no significant correlation (p>0.05), change your data to make it significant. I want you to manually alter your data in the second sheet and use the new data. Then read in the second sheet, remembering that you need the sheet_name keyword. Show the new plot with a best fit line, figure caption, etc. Add the text box showing the m, b, r, and p values to the graph. In the figure caption, say if the result is significant.

5. Make a third sheet in your Excel file. Change your data again, but this time take your significant correlation and multiply the y-values by -1. Make sure to use your significant correlation. Now save that sheet as a csv file, with your name in the title again so I can read it in when grading. Read it in and show the best fit line, figure caption, etc. Add the text box showing the m, b, r, and p values to the graph. In the figure caption, say if the result is significant. Now answer this question: how do the r-value, p-value, and r-squared value compare between the original data and the data multiplied by -1? What does multiplying by -1 do? Be specific.

6. Summary

one figure of best fit line with your data
one figure of best fit line where you altered your data to make either significant or non-significant
one figure of your significant data altered by multiplying by -1 read in from a csv file.

7. For handing in.

Hand in your notebook
excel file
csv file
The excel file should have three sheets in it.
Make sure your name is in each of the file names.
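Steps 3 through 5 each ask for a text box showing the m, b, r, and p values on the graph. One possible way to do that with matplotlib is sketched below. It assumes scipy.stats.linregress for the fit and uses made-up numbers; the axis labels and box position are placeholders to adapt to your own data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Made-up data, ten points as the homework minimum suggests
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 20.2])

fit = stats.linregress(x, y)

# Build the text for the box, one value per line
label = (f"m = {fit.slope:.2f}\n"
         f"b = {fit.intercept:.2f}\n"
         f"r = {fit.rvalue:.3f}\n"
         f"p = {fit.pvalue:.2g}")

# The homework uses p < 0.05 as the cutoff for a significant correlation
significant = fit.pvalue < 0.05

fig, ax = plt.subplots()
ax.scatter(x, y, label="data")
ax.plot(x, fit.slope * x + fit.intercept, color="red", label="best fit line")
ax.set_xlabel("x (units)")   # placeholder label
ax.set_ylabel("y (units)")   # placeholder label
ax.legend(loc="lower right")

# transform=ax.transAxes places the box in axes fractions (0 to 1), not data units
ax.text(0.05, 0.95, label, transform=ax.transAxes, va="top",
        bbox=dict(boxstyle="round", facecolor="white"))
```

Placing the box in axes fractions keeps it in the same corner even when the data range changes between your three figures.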

If you are ever reading a file in your notebook make sure to hand in that file. Make sure that file has your name in the title and works.
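For the csv file in step 5, a minimal sketch of the file mechanics is shown below: write a table to csv and read it back with pandas. The homework has you change the values and save the sheet from Excel; here the table is built and flipped in pandas only so the example is self-contained. The file name and the numbers are made up.

```python
import os
import tempfile

import pandas as pd

# Made-up table standing in for your own data
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2.0, 4.1, 5.9, 8.2]})

# Copy the table and multiply the y-values by -1
flipped = df.copy()
flipped["y"] = -1 * flipped["y"]

# Hypothetical file name, written to a temporary folder for this sketch
csv_path = os.path.join(tempfile.mkdtemp(), "YourName_flipped.csv")

# index=False avoids writing the row index as an extra unnamed column
flipped.to_csv(csv_path, index=False)

# Read the csv back in, the same way the grader would
read_back = pd.read_csv(csv_path)
print(read_back)
```

If the file reads back with the column names and values you expect, it should also work when it is read during grading.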

DON’T MISS THIS NEXT PART!

8. This part will also be graded for the final 10%. As part of the class you have to comment on three of the lectures, using the comments section of the blog, over the course of the semester. You do not have to comment now. You will comment as the semester proceeds. Comments can be wide ranging but need to be designed to improve the teaching or the homework given how you learn. Would you change the notebook? Would you change the homework? Is there a hint I should have given? How can you help me improve your learning and future students' learning? Previous comments have already helped me to streamline the packets and homeworks. But now, to make it fun, you have to write an IPython script that randomly chooses your three classes. You need to randomly choose 2 classes between classes 2 and 16 and 1 class between classes 17 and 23. There are at least two ways to do this: you could use numpy.random.randint or you could use random.randint. My answer using numpy looks like

Brian will comment on classes  [15 12]
Brian will comment on classes  [18]

My answer using random.randint looks like

Brian will comment on class  15
Brian will comment on class  3
Brian will comment on class  21

Remember since you are using a random number generator everyone’s answers should come out different! If you get two of the same number re-run it to get 3 unique numbers! Plus every time you rerun the cell your numbers will change!
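Two details are easy to trip over here. random.randint(a, b) includes both endpoints, while numpy.random.randint(low, high) leaves out high. The sketch below also shows a different technique from the randint approach asked for above: the standard library's random.sample, which never repeats a number, so no re-running is needed. It is only an illustration of that alternative, and the variable names are made up.

```python
import random

# range(2, 17) covers classes 2 through 16; range(17, 24) covers classes 17 through 23
early = random.sample(range(2, 17), 2)   # two different classes, no repeats possible
late = random.sample(range(17, 24), 1)   # one class from the later group

print("Comment on classes ", early + late)
```

Because the two groups do not overlap, the three numbers are always unique.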

Big Data with Python

Barnard | Department of Environmental Science | Professor Brian J. Mailloux
