We have done some line fitting and will do more (pdf=more-line-fit-2024). Half the life of a scientist is taking data, plotting it, and looking for a correlation. Plus I want to make sure you can do this in your sleep. What we are adding now is reading data from a file so we can handle much larger datasets. Today we are going to read an Excel file into pandas, which is the Python library for working with tabular data, and then fit a line. We are going to look at GDP versus life expectancy, a common dataset in sustainable development. This is the website where I got the data, but I also put the Excel file on GitHub.
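A minimal sketch of that workflow, assuming the fit is done with `scipy.stats.linregress` (the text above does not name the fitting function). The file name `gdp_life_expectancy.xlsx` and the column names `gdp` and `life_expectancy` are hypothetical, and the numbers in the stand-in table are made up for illustration; they are not the course dataset.

```python
import pandas as pd
from scipy import stats


def load_excel(path, sheet_name=0):
    """Read one sheet of an Excel file into a DataFrame (.xlsx needs openpyxl)."""
    return pd.read_excel(path, sheet_name=sheet_name)


def fit_line(df, x_col, y_col):
    """Fit y = m*x + b and return (m, b, r, p)."""
    result = stats.linregress(df[x_col], df[y_col])
    return result.slope, result.intercept, result.rvalue, result.pvalue


# Stand-in for: df = load_excel("gdp_life_expectancy.xlsx")  (hypothetical file name)
# The values below are invented so the example runs without the file.
df = pd.DataFrame({
    "gdp": [1.0, 2.0, 3.0, 4.0, 5.0],
    "life_expectancy": [52.0, 58.0, 61.0, 67.0, 70.0],
})

m, b, r, p = fit_line(df, "gdp", "life_expectancy")
print(f"m = {m:.3f}, b = {b:.3f}, r = {r:.3f}, p = {p:.4f}")
```

With the real file, the only change should be replacing the stand-in table with a call to `load_excel` and using the column names that appear in the spreadsheet.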
VIDEO
new website for videos.
NEW VIDEO NO QUIZ
Homework
Due Wednesday February 19, 2024 💙
1. You need to find some data from the web or a paper. It could be from a paper you have read. It could be from another class. Find data with at least 10 data points. I would do somewhere from 10-50 points. Do NOT use data with dates. Dates are hard. If you use time for an x-axis it should be hours or something similar but not months/days/years with slashes. We will learn how to do dates later in the semester. All other data should work.
2. Put the data in an excel file. Add your name to the name of the excel file. This way I will be able to differentiate and read all the files.
3. Read in the excel file and fit a best fit line. Remember to add a figure caption and all other parts of a figure. Add the text box showing m,b,r,p values to the graph. In the figure caption say if the result is significant.
4. Make a new sheet in your excel file that is a copy of your data. If your data had a significant correlation (p<0.05), change the data in the second sheet to make it not significant (p>0.05). Then read in the second sheet, but remember you need the sheet_name keyword. Show the new correlation. If your data had no significant correlation (p>0.05), change your data to make it significant. Show this new changed plot with a best fit line, figure caption etc. I want you to manually alter your data in a second sheet and use the new data. Add the text box showing m,b,r,p values to the graph. In the figure caption say if the result is significant.
5. Make a third sheet in your excel file. Change your data again, but this time take your significant correlation and multiply the y-values by -1. Now save that sheet as a csv file with your name in the title again so I can read it in when grading. Read it in and show the best fit line, figure caption etc. Add the text box showing m,b,r,p values to the graph. In the figure caption say if the result is significant. Now answer this question: how do the r-value, p-value, and r-squared value compare between the original data and the data you multiplied by -1? Be specific.
6. Summary
- one figure of best fit line with your data
- one figure of best fit line where you altered your data to make it either significant or non-significant
- one figure of data altered by multiplying by -1 read in from a csv file.
7. For handing in. Hand in your notebook, excel file, and csv file. The excel file should have three sheets in it.
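Steps 3 to 5 can be sketched roughly as follows. This assumes `scipy.stats.linregress` for the fit and matplotlib for the figure; neither is named in the homework text. The column names `x` and `y`, the file name in the comment, and the data values are all hypothetical. To stay self-contained the sketch writes the CSV to an in-memory buffer instead of a file on disk.

```python
import io

import matplotlib
matplotlib.use("Agg")  # draw off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats


def fit_and_plot(df, x_col, y_col, ax):
    """Scatter the data, draw the best fit line, and add an m, b, r, p text box."""
    fit = stats.linregress(df[x_col], df[y_col])
    ax.scatter(df[x_col], df[y_col])
    ax.plot(df[x_col], fit.slope * df[x_col] + fit.intercept)
    label = (f"m = {fit.slope:.3g}\nb = {fit.intercept:.3g}\n"
             f"r = {fit.rvalue:.3g}\np = {fit.pvalue:.3g}")
    # Axes coordinates: (0.05, 0.95) is the upper-left corner of the plot area.
    ax.text(0.05, 0.95, label, transform=ax.transAxes, va="top",
            bbox=dict(facecolor="white"))
    ax.set_xlabel(x_col)
    ax.set_ylabel(y_col)
    return fit


# Made-up data standing in for the first sheet. A second sheet would be read with
# pd.read_excel("yourname_data.xlsx", sheet_name="Sheet2")  (hypothetical names).
original = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "y": [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9],
})

# Step 5: multiply the y-values by -1, save as CSV, and read the CSV back in.
flipped = original.copy()
flipped["y"] = -flipped["y"]
buffer = io.StringIO()          # stands in for a file such as "yourname_data.csv"
flipped.to_csv(buffer, index=False)
buffer.seek(0)
from_csv = pd.read_csv(buffer)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
fit_original = fit_and_plot(original, "x", "y", axes[0])
fit_flipped = fit_and_plot(from_csv, "x", "y", axes[1])

for name, fit in [("original", fit_original), ("times -1", fit_flipped)]:
    print(f"{name}: r = {fit.rvalue:.4f}, p = {fit.pvalue:.3g}, "
          f"r^2 = {fit.rvalue ** 2:.4f}")
```

In a notebook you would drop the `matplotlib.use("Agg")` line and write to a real file with `flipped.to_csv("yourname_data.csv", index=False)`. The figure caption still has to be added separately.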
DON’T MISS THIS NEXT PART!
8. This part will also be graded for the final 10%. As part of the class you have to comment on three of the lectures by using the comments section of the blog over the course of the semester. You do not have to comment now. You will comment as the semester proceeds. Comments can be wide ranging but need to be designed to improve the teaching or the homework given how you learn. Would you change the notebook? Would you change the homework? Is there a hint I should have given? How can you help me improve your learning and future students' learning? Previous comments have already helped me to streamline the packets and homeworks. But now, to make it fun, you have to write an IPython script that randomly chooses your three classes. You need to randomly choose 2 classes between classes 2 and 16 and 1 class between classes 17 and 23. There are at least two ways to do this: you could use numpy.random.randint or you could use random.randint. My answer using numpy looks like
my answer using randint looks like
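A minimal sketch of both approaches, which may differ from the instructor's answers. The variable names are made up for illustration. The main thing to watch is that `numpy.random.randint` excludes its upper bound while `random.randint` includes it.

```python
import random

import numpy as np

# numpy.random.randint(low, high) draws from low up to high - 1,
# so the upper bounds are 17 and 24 to include classes 16 and 23.
early_np = np.random.randint(2, 17, size=2)   # two classes from 2 to 16
late_np = np.random.randint(17, 24)           # one class from 17 to 23

# random.randint(a, b) includes both endpoints.
early_py = [random.randint(2, 16) for _ in range(2)]
late_py = random.randint(17, 23)

print("numpy: ", list(early_np), late_np)
print("random:", early_py, late_py)
```

Either version can pick the same early class twice. If that happens, rerun the cell, or use `random.sample(range(2, 17), 2)`, which never repeats a value.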