Class 08 – More Line Fitting

 

We have done some line fitting and will do more (pdf=more-line-fit-2024).  Half the life of a scientist is taking data, plotting it, and looking for a correlation, and I want to make sure you can do this in your sleep.  What we are adding now is reading data from a file so we can handle much larger datasets.  Today we are going to read an Excel file into Pandas, the Python data analysis library, and fit a line.  We are going to look at GDP versus life expectancy, a common dataset in sustainable development.  This is the website where I got the data, but I also put the Excel file on GitHub.
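Something like the following minimal sketch is the idea; the file name and column names here are placeholders, not necessarily what is in the actual file on GitHub.

import pandas as pd
from scipy import stats

# placeholder file and column names -- swap in the real names from the file on GitHub
df = pd.read_excel('gdp_life_expectancy.xlsx')
print(df.head())                                 # always look at the first few rows

m, b, r, p, se = stats.linregress(df['GDP per capita'], df['Life expectancy'])
print(m, b, r, p)                                # slope, intercept, r-value, p-value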

VIDEO: new website for videos (new video, no quiz).

Homework

Due Wednesday February 19, 2024 💙

1.  You need to find some data from the web or a paper.  It could be from a paper you have read, or it could be from another class.  Find data with at least 10 data points; I would aim for somewhere between 10 and 50 points.  Do NOT use data with dates.  Dates are hard.  If you use time for an x-axis it should be hours or something similar, but not months/days/years with slashes.  We will learn how to handle dates later in the semester.  All other data should work.

2.  Put the data in an Excel file.  Add your name to the name of the Excel file.  This way I will be able to tell everyone's files apart and read them all.

3.  Read in the Excel file and fit a best fit line.  Remember to add a figure caption and all the other parts of a figure.  Add a text box showing the m, b, r, and p values to the graph.  In the figure caption say whether the result is significant.
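Here is a minimal sketch of the plot with the stats text box, assuming your data is already in a DataFrame df; the column names 'x' and 'y' are placeholders for your own columns.

import matplotlib.pyplot as plt
from scipy import stats

# df is the DataFrame from pd.read_excel(); 'x' and 'y' are placeholder column names
m, b, r, p, se = stats.linregress(df['x'], df['y'])

fig, ax = plt.subplots()
ax.plot(df['x'], df['y'], 'o', label='data')
ax.plot(df['x'], m * df['x'] + b, '-', label='best fit line')
# put the fit statistics in a text box in the upper-left corner of the axes
stats_text = f"m = {m:.3g}\nb = {b:.3g}\nr = {r:.3g}\np = {p:.3g}"
ax.text(0.05, 0.95, stats_text, transform=ax.transAxes, va='top',
        bbox=dict(facecolor='white', edgecolor='black'))
ax.set_xlabel('x label')
ax.set_ylabel('y label')
ax.legend()
plt.show()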

4.  Make a new sheet in your Excel file that is a copy of your data.  If your data had a significant correlation (p<0.05), change the data in the second sheet to make it not significant (p>0.05); if your data had no significant correlation (p>0.05), change it to make it significant (p<0.05).  Then read in the second sheet, remembering that you need the sheet_name keyword, and show the new correlation.  Show the new, changed plot with a best fit line, figure caption, etc.  I want you to manually alter your data in the second sheet and use the new data.  Add a text box showing the m, b, r, and p values to the graph.  In the figure caption say whether the result is significant.
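The sheet_name keyword looks roughly like this; the file and sheet names here are placeholders.

import pandas as pd

# read the second sheet by name; sheet_name=1 also works (sheets count from 0)
df2 = pd.read_excel('yourname_data.xlsx', sheet_name='Sheet2')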

5.  Make a third sheet in your Excel file.  Change your data again, but this time take your significant correlation and multiply the y-values by -1.  Now save that sheet as a csv file, again with your name in the file name so I can read it in when grading.  Read it in and show the best fit line, figure caption, etc.  Add a text box showing the m, b, r, and p values to the graph.  In the figure caption say whether the result is significant.  Now answer this question: how do the r-value, p-value, and r-squared value compare between the original data and the data multiplied by -1?  Be specific.
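Reading the csv back in is a one-liner with pandas; a minimal sketch (file and column names are placeholders):

import pandas as pd
from scipy import stats

# the third sheet saved out of Excel as a csv; file and column names are placeholders
df3 = pd.read_csv('yourname_data.csv')
m, b, r, p, se = stats.linregress(df3['x'], df3['y'])
print(r, p, r**2)                                # compare to the values from the original data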

6.  Summary

  • one figure of the best fit line with your data
  • one figure of the best fit line where you altered your data to make it either significant or not significant
  • one figure of the data altered by multiplying by -1, read in from a csv file.

7.  For handing in: hand in your notebook, Excel file, and csv file.  The Excel file should have three sheets in it.

DON’T MISS THIS NEXT PART!

8.  This part will also be graded and counts for the final 10%.  As part of the class you have to comment on three of the lectures, using the comments section of the blog, over the course of the semester.  You do not have to comment now; you will comment as the semester proceeds.  Comments can be wide ranging but need to be designed to improve the teaching or the homework given how you learn.  Would you change the notebook?  Would you change the homework?  Is there a hint I should have given?  How can you help me improve your learning and future students' learning?  Previous comments have already helped me streamline the packets and homeworks.  But now, to make it fun, you have to write an IPython script that randomly chooses your three classes.  You need to randomly choose 2 classes between classes 2 and 16 and 1 class between classes 17 and 23.  There are at least two ways to do this: you could use numpy.random.randint or you could use random.randint (a sketch of both approaches appears after the example output below).  My answer using numpy looks like

Brian will comment on classes  [15 12]
Brian will comment on classes  [18]

My answer using random.randint looks like

Brian will comment on class  15
Brian will comment on class  3
Brian will comment on class  21
Remember, since you are using a random number generator, everyone's answers should come out different!  If you get two of the same number, re-run it to get 3 unique numbers!  Plus, every time you rerun the cell your numbers will change!
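For reference, here is a minimal sketch of both approaches (not necessarily the exact script I used).  Note that numpy's upper bound is exclusive, while random.randint includes both ends.

import numpy as np
import random

# pick two classes between 2 and 16 and one between 17 and 23
# (numpy's upper bound is exclusive, so use 17 and 24)
early = np.random.randint(2, 17, size=2)
late = np.random.randint(17, 24, size=1)
print('I will comment on classes', early)
print('I will comment on classes', late)

# the same idea with the random module (randint includes both ends);
# if you get a duplicate, just re-run the cell
print('I will comment on class', random.randint(2, 16))
print('I will comment on class', random.randint(2, 16))
print('I will comment on class', random.randint(17, 23))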

15 thoughts on “Class 08 – More Line Fitting”

  1. I liked this class because it taught us how to do more statistical analysis and because it allowed us to work on our own data sets. I think the beginning of the notebook where we learn the different ways to read in data was crucial as the class continued and we worked on more complex data sets. I got confused when we got to the log section in the notebook. There are three sequential questions that built on each other. I think it could be beneficial to have a few hints for the back to back log plots. I think there should be answers for each of those parts. There is currently only a complete loglog answer, but if you get stuck, looking at the full answer will take away the opportunity to struggle through the rest of the parts of the question.

    I think the homework was straightforward and helped me think about what kind of data I currently have the skills to analyze.

  2. I really liked this class and thought that being able to choose which dataset we used really helped motivate me. Additionally, I found that code academy was good to use in terms of explaining a lot of the concepts we were using and I really preferred it to code.org.

  3. I really liked this class and this homework in particular because I was able to use my own data set. While the environmental science data sets were interesting, as an Economics major it was really interesting for me to use Python to visualize economics data that I had been using in my other courses. While I know this class is officially part of the Environmental Science department, I think students may enjoy working with other types of data sets as well more often throughout the course.

  4. This was easily one of the best, most pragmatic classes, as we saw the real-life application of using Python. It was amazing to see how quickly and efficiently we could make meaningful plots from huge data sets. I personally used the GDP data and was amazed at how I could plot multiple correlations and scatters to get a general sense of the trends in the data. The random number generator part is really interesting as well and helped bring out the computer science mystery of the class.

    I think some ways to potentially improve this class would be to go over multiple types of graphs we could make beyond the simple scatter, and to make curves if the data set demands it.

  5. Importing data into a notebook is a good skill because, at the first step of every analysis, a person always needs to pull data into an IPython Notebook. It was nice learning how to import data and then create a line using that data set, to show that it’s very possible to get started quickly. As far as Codecademy goes, it is always helpful, but I found myself learning more from the in-class assignments than from working through the online modules.

  6. It’s kind of funny that I was randomized to comment on the lesson that was concerned with generating the random numbers in the first place. Also, when I first completed the homework for this lesson/class (plotting some data from a study/experiment), I didn’t notice the random number generator part, so maybe just clarifying that we turn that in with the other HW notebook might’ve helped. Also, I thought the original HW assignment was pretty awesome and gave kids the opportunity to think about their final project early and what kinds of data graph well (at this point we were only looking at pretty basic graphs). This definitely made me feel like I had some power in Python and that I might actually be able to use it to make pretty professional looking graphs. That being said, I do remember this being a sort of rushed process, and so I settled for a dataset that wasn’t that interesting to plot or analyze. Maybe drawing this assignment out over a weekend would be good (instead of the Monday to Wednesday turnaround).

    Also, relabeling the titles of the classes and adding concrete dates for when each HW is due would help with the organization of the site and help from the student’s perspective (it was sometimes tough to tell what was due and when).

    elias

  7. I think this lesson was effective in demonstrating how to use linregress to make a best fit. I was confused about taking the log of the data, and I am not sure how you would know when to do that. Also, I don’t think I had time for the bonus section, and I think the class would benefit from more practice with poly1d and polyfit.

  8. I like how you showed a few different ways to plot the same line.

    It would have been helpful to have learned how to drop NA values during this lesson, because the assignment was so open ended that my dataset had NA values (and many do) and I wasn’t sure how to work around them.

  9. I thought this lesson was really critical for a few reasons. The first was being able to manipulate the general shape of a dataset; learning to apply a log-linear fit on the data really deepened my understanding of what a relationship between variables really means. Another reason I liked this unit was going out and exploring real data on our own. This boosted my confidence in finding and approaching unfamiliar datasets.

    I think the codecademy was helpful as well, though I would have preferred some of these lessons closer to the start of class. After finishing the code.org, we had already learned a significant amount of the codecademy material in class. I’m really just fishing for constructive criticism here (as there wasn’t any issue learning the material) but I talked to a few students and they agreed that pushing the codecademy earlier to line up with class material might improve the course flow.

  10. I liked the homework assignment for this lesson because we had some freedom to sift around for our own dataset to use. Another great first intro to how powerful python is for data analysis!

    I think for practice, it may be good to have us fit a couple different order polynomials to our dataset (and then discuss which fits best). For example, with my dataset I just fit with a straight line, but then down the line when I wanted to use the parabola-fit I didn’t remember how to do it right away (I had to go back to the lesson from class, which I have to do a lot of the time anyways), so an extra chance to practice with polyfit etc could be useful!

    –Jennifer Olson

  11. This lesson was really helpful in a lot of the more recent projects. It was really cool learning how to make a best fit line of a second order polynomial, but I think it would also be really cool to learn to make different trendlines, like a power function.

    One thing that I struggled with at the time was remembering to import stats so that I could include the statistics in a text box for the last graph.

  12. It was really cool to plot non-linear data and parabolic trendlines!! Fit orders and axis labels have become very important in the more recent lectures.
    One area where I struggled during this lecture was actually with reading in the data – I think it would be helpful to clarify at the beginning of the lesson that in order to read in the data, it should be saved in the same folder in which we are running Anaconda.

    • Good idea! I updated the notebook already to remind people about working directories, paths, and where you save things. I agree that this regularly messes me up.
