Let's just say that if we are not using lots of equations, we are really not dealing with stats in detail. So the OP in this thread is akin to using an axe, not a scalpel. But we should remember that the original definition of "hacker" is "one who builds furniture with an axe". So in the right hands, hands with lots of skill, an axe can do some pretty fine things.
If anyone is actually thinking of using linear regression for non-linear data, or time series, or anything like that, they should first do a search on "problems using linear regression for TOPIC". That will easily pull up relevant mathematical treatments: the problems with using linear regression for the topic, how to fix some of those problems, how some of those fixes create other problems, and perhaps even how some problems cannot be fixed. Anyone with some math skills can do this, and copy and paste the reasoning. The statistical reasoning. For the general case.
However, you will note that I specified a very specific case. A case that relies on methodological control. And if you do the search above, you are very unlikely to find anything that addresses methods as they affect testing a hypothesis. If you know math very well, but know little about research in the social sciences, you are unlikely to think of how methods and statistics interact in the testing of hypotheses in the social sciences.
Further, if you are unaware that explaining 30-40% of the variance of a dependent variable is likely to be considered a STRONG test of the theory being tested, and has a good chance of extending the level of prediction in the field (depending on the area), then you may not have appropriate standards for judging the quality of such a model as it speaks to the field of knowledge the researcher inhabits.
In non-technical terms:
Theory: what we think is the case.
Methods: how we gather data and set it up to test the case.
Statistics: how we decide what we can say, and to what degree of accuracy, about the case.
Problems with statistical estimation are methodological problems to the extent that they interfere with our ability to test the case.
Now, look at L and E again. Note that I specified a method where, if we only use data at three time points, both L and E predict the same results: the lines intersect at those three points. If I use 100 cases of "going viral", I get 100 data points at A, B, and C. And that PARTICULAR data is expected, by both L and E, to have a strong linear relationship.
However, methodologically, by doing that I have stripped out information from all other time points in the process, the process that is assumed in E. And I can't do ANYTHING to speak to the rate of change of the slope of the curve over time. That information is no longer present in the data. So if I do use non-linear techniques, which are designed to capture such information, I will find NOTHING relating to that information.
There is nothing to find in that limited data. But what can I say? Can I speak to how well L fits E in the general case? No. But can I say that if L produces poor fit, E is not very likely to be the case? Yes.
IOW, I can do a test of the theory, and reject a null hypothesis. Failure of L gives me good reason to believe that E would fail as well.
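A quick numeric sketch of the point. The exponential curve, the three time points, and the noise level here are my own hypothetical choices, not numbers from the thread; the sketch just shows that data sampled from an exponential at only three time points fits a line almost perfectly, while the same process observed across many time points does not:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical exponential process E: y = exp(t), plus a little noise.
# (Growth rate and noise level are assumptions for illustration.)
def observe(t):
    return np.exp(t) + rng.normal(0, 0.05, size=t.shape)

def linear_r2(t, y):
    """Fit the OLS line y = a + b*t and return R^2."""
    b, a = np.polyfit(t, y, 1)  # slope, intercept
    resid = y - (a + b * t)
    return 1 - np.sum(resid**2) / np.sum((y - y.mean()) ** 2)

# 100 "viral" cases observed only at three time points A, B, C:
# the line L fits that particular data almost perfectly.
t3 = np.tile([1.0, 2.0, 3.0], 100)
r2_three = linear_r2(t3, observe(t3))

# The same process observed across many time points: the line now
# shows clear lack of fit, because the curvature is back in the data.
t_many = np.linspace(0.0, 6.0, 300)
r2_many = linear_r2(t_many, observe(t_many))

print(f"R^2, three time points: {r2_three:.3f}")
print(f"R^2, many time points:  {r2_many:.3f}")
```

With the three-point design, the curvature information simply is not in the data, so the linear fit looks excellent; only the densely sampled version reveals that L fails, which is what licenses the rejection described above.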
So, why does so much social science research rely on linear regression? Part of the reason is that it is taught in the first stats class in many fields (often with an econometrics textbook), and some students, at weaker programs, qualitative research programs, and MA/MS programs, take no more stats than that. So they know linear regression, and can follow it. They have one tool, and use it for whatever they can, with higher or lower attention to the issues in doing so. But they can do a search, and figure things out from an existing base of knowledge.
But, here's the thing. Most every single stats topic AFTER that starts the same way. Book, monograph, paper, class lectures... the same way. From my classes on categorical data analysis to time series analysis, the first lecture started:
"If you use linear regression for this, this is how it goes wrong. If you try to correct, here is what happens. So to avoid all that, we use TECHNIQUE for this type of analysis."Those at top schools get this, as they tend to require multiple stats classes and not just one. If you look at the higher level journals, you will see that people get this.
And point of fact, I get it too. I don't try to use linear regression for everything, or even "when I can". I let what I need to test, and the nature of the data I have, determine the technique I use.
For example, social scientists are often called on to analyze data that has already been collected. In this case, we are limited in how we can exert methodological control. This is common in evaluation research.
I was recently contacted by a health researcher, who asked me how to do some case selection on data on a health intervention. That data was mostly biomedical measures like blood pressure. I suggested she construct some dummy variables (0,1) and then use those to sort cases. NOT use them to do analysis, but just sort cases. When she did what I said, she ended up with zero cases in her models. The reason is that she did not set the dummy variables = 0 at the start, so she ended up with some 1s and missing data, but no 0s. That was just an error in practical sequencing of setting up the dummies, which I only saw when she sent me the actual dataset.
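The sequencing error is easy to reproduce. Here is a minimal pandas sketch of the kind of mistake described above (the variable names and cutoff are hypothetical, not from her dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical data: flag cases with systolic BP >= 140 as "high_bp".
df = pd.DataFrame({"case_id": [1, 2, 3, 4],
                   "systolic": [170, 120, 150, 110]})

# Buggy sequencing: only the 1s are ever assigned. The rows that should
# be 0 are left as missing (NaN), so selecting == 0 returns zero cases.
df["high_bp_buggy"] = np.nan
df.loc[df["systolic"] >= 140, "high_bp_buggy"] = 1
n_zero_buggy = int((df["high_bp_buggy"] == 0).sum())  # 0 rows: the bug

# Correct sequencing: initialize the dummy to 0 first, then set the 1s.
df["high_bp"] = 0
df.loc[df["systolic"] >= 140, "high_bp"] = 1
n_zero = int((df["high_bp"] == 0).sum())              # 2 rows, as intended

print(n_zero_buggy, n_zero)
```

The dataset ends up with 1s and missing values but no 0s, exactly the situation that produced zero cases in her models.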
After further conversation, it became apparent that she wanted to use linear regression to predict how long people stayed in the intervention program. A good assessment goal. But there are problems. Blood pressure over time is highly intercorrelated, which is bad for linear regression. Could she use change scores instead of BP numbers? Maybe, I would expect that changes in BP (+10, -10) are less correlated over time than BP (170, 160).
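That intuition can be checked with a quick simulation. The numbers here are invented, and I am assuming BP drifts something like a random walk around a baseline; under that assumption, levels at adjacent times are highly correlated while change scores are not:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical: simulate BP for 200 people over 6 time points as a
# baseline plus a random walk (an assumption for illustration only).
n_people, n_times = 200, 6
baseline = rng.normal(150, 15, size=(n_people, 1))
steps = rng.normal(0, 8, size=(n_people, n_times - 1))
bp = np.concatenate([baseline, baseline + np.cumsum(steps, axis=1)], axis=1)

# Correlation of adjacent measurements (levels, e.g. 170 then 160).
r_levels = np.corrcoef(bp[:, 0], bp[:, 1])[0, 1]

# Correlation of adjacent change scores (e.g. +10 then -10).
diffs = np.diff(bp, axis=1)
r_changes = np.corrcoef(diffs[:, 0], diffs[:, 1])[0, 1]

print(f"adjacent levels:  r = {r_levels:.2f}")
print(f"adjacent changes: r = {r_changes:.2f}")
```

Under these assumptions the level-to-level correlation is strong, which is the intercorrelation problem for linear regression, while successive change scores are nearly uncorrelated.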
But I did not even suggest that. I suggested logistic regression.
https://www3.nd.edu/~rwilliam/stats2/l81.pdf
By using this, a dependent variable of "still in the program" can be constructed. Those who stayed from time 1 to time 2 are coded 1, those who dropped are coded 0. We can then determine the likelihood of staying in the program (compared to dropping out) from other factors at time 1 included as independent variables in the model.
In this case the health researcher had actually used logistic regression before, but for whatever reason did not think about how she might use it in this situation. Given her problems coding dummy variables, I assume part of the reason was a lack of experience with research methods in general and with stats software in particular, not her statistical knowledge in the mathematical sense. If she had designed the evaluation from the start, she might have used logistic regression from the start. But facing variables that were not dichotomous in her data, and a general tendency to AVOID destroying information in the data during analysis by going to a lower level of measurement, she got locked in.
Sometimes less is more.