Start Date: 09/15/2019
Course Type: Common Course |
Course Link: https://www.coursera.org/learn/exploratory-data-analysis
Welcome to Week 2 of Exploratory Data Analysis. This week covers some of the more advanced graphing systems available in R: the Lattice system and the ggplot2 system. While the base graphics system provides many important tools for visualizing data, it was part of the original R system and lacks many features that may be desirable in a plotting system, particularly when visualizing high dimensional data. The Lattice and ggplot2 systems also simplify the laying out of plots making it a much less tedious process.
This course covers the essential exploratory techniques for summarizing data. These techniques are t
Article | Example |
---|---|
Exploratory data analysis | John W. Tukey wrote the book "Exploratory Data Analysis" in 1977. Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses to test. In particular, he held that confusing the two types of analyses and employing them on the same set of data can lead to systematic bias owing to the issues inherent in testing hypotheses suggested by the data. |
Exploratory data analysis | In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, and handling missing values and making transformations of variables as needed. EDA encompasses IDA. |
Univariate analysis | Descriptive statistics describe a sample or population. They can be part of exploratory data analysis. |
Exploratory data analysis | Exploratory data analysis, robust statistics, nonparametric statistics, and the development of statistical programming languages facilitated statisticians' work on scientific and engineering problems. Such problems included the fabrication of semiconductors and the understanding of communications networks, which concerned Bell Labs. These statistical developments, all championed by Tukey, were designed to complement the analytic theory of testing statistical hypotheses, particularly the Laplacian tradition's emphasis on exponential families. |
Exploratory data analysis | Tukey defined data analysis in 1961 as: "[P]rocedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data." |
Data Desk | Data Desk is a software program for visual data analysis, visual data exploration, and statistics. It carries out Exploratory Data Analysis (EDA) and standard statistical analyses by means of dynamically linked graphic data displays that update any change simultaneously. |
Statistical hypothesis testing | Confirmatory data analysis can be contrasted with exploratory data analysis, which may not have pre-specified hypotheses. |
Exploratory data analysis | Findings from EDA are often orthogonal to the primary analysis task. To illustrate, consider an example from Cook et al. where the analysis task is to find the variables that best predict the tip that a dining party will give to the waiter. The variables available in the data collected for this task are: the tip amount, total bill, payer gender, smoking/non-smoking section, time of day, day of the week, and size of the party. The primary analysis task is approached by fitting a regression model with the tip rate as the response variable. The fitted model is |
Association rule learning | GUHA is a general method for exploratory data analysis that has theoretical foundations in observational calculi. |
Exploratory data analysis | Many EDA techniques have been adopted into data mining, as well as into big data analytics. They are also being taught to young students as a way to introduce them to statistical thinking. |
Exploratory data analysis | However, exploring the data reveals other interesting features not described by this model. |
Exploratory search | Consequently, exploratory search covers a broader class of activities than typical information retrieval, such as investigating, evaluating, comparing, and synthesizing, where new information is sought in a defined conceptual area; exploratory data analysis is another example of an information exploration activity. Such users therefore typically combine querying and browsing strategies to foster learning and investigation. |
Data analysis | Exploratory data analysis should be interpreted carefully. When testing multiple models at once, there is a high chance of finding at least one of them to be significant, but this can be due to a Type I error. It is important to always adjust the significance level when testing multiple models, for example with a Bonferroni correction. Also, one should not follow up an exploratory analysis with a confirmatory analysis in the same dataset. An exploratory analysis is used to find ideas for a theory, but not to test that theory as well. When a model is found through exploration of a dataset, following up that analysis with a confirmatory analysis in the same dataset could simply mean that the results of the confirmatory analysis are due to the same Type I error that produced the exploratory model in the first place. The confirmatory analysis therefore will not be more informative than the original exploratory analysis. |
Exploratory data analysis | What is learned from the plots is different from what is illustrated by the regression model, even though the experiment was not designed to investigate any of these other trends. The patterns found by exploring the data suggest hypotheses about tipping that may not have been anticipated in advance, and which could lead to interesting follow-up experiments where the hypotheses are formally stated and tested by collecting new data. |
Exploratory data analysis | Tukey's championing of EDA encouraged the development of statistical computing packages, especially "S" at Bell Labs. The "S" programming language inspired the systems "S"-PLUS and "R". This family of statistical-computing environments featured vastly improved dynamic visualization capabilities, which allowed statisticians to identify outliers, trends and patterns in data that merited further study. |
Order statistic | A similar important statistic in exploratory data analysis that is simply related to the order statistics is the sample interquartile range. |
Statistica | Statistica includes analytic and exploratory graphs in addition to standard 2- and 3-dimensional graphs. Brushing actions (interactive labeling, marking, and data exclusion) allow for investigation of outliers and exploratory data analysis. |
Exploratory data analysis | Tukey's EDA was related to two other developments in statistical theory: robust statistics and nonparametric statistics, both of which tried to reduce the sensitivity of statistical inferences to errors in formulating statistical models. Tukey promoted the use of the five-number summary of numerical data (the maximum, the minimum, the median, and the two quartiles) because the median and quartiles, being functions of the empirical distribution, are defined for all distributions, unlike the mean and standard deviation; moreover, the quartiles and median are more robust to skewed or heavy-tailed distributions than traditional summaries (the mean and standard deviation). The packages "S", "S"-PLUS, and "R" included routines using resampling statistics, such as Quenouille and Tukey's jackknife and Efron's bootstrap, which are nonparametric and robust (for many problems). |
Data analysis | Once the data is cleaned, it can be analyzed. Analysts may apply a variety of techniques referred to as exploratory data analysis to begin understanding the messages contained in the data. The process of exploration may result in additional data cleaning or additional requests for data, so these activities may be iterative in nature. Descriptive statistics such as the average or median may be generated to help understand the data. Data visualization may also be used to examine the data in graphical format, to obtain additional insight regarding the messages within the data. |
Data analysis | Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery for predictive rather than purely descriptive purposes, while business intelligence covers data analysis that relies heavily on aggregation, focusing on business information. In statistical applications data analysis can be divided into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA). EDA focuses on discovering new features in the data and CDA on confirming or falsifying existing hypotheses. Predictive analytics focuses on application of statistical models for predictive forecasting or classification, while text analytics applies statistical, linguistic, and structural techniques to extract and classify information from textual sources, a species of unstructured data. All are varieties of data analysis. |
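
The five-number summary and the sample interquartile range mentioned in the table above can be sketched in a few lines of Python. This is only an illustration, not the method used by the course or by any of the cited articles: the function names and the sample values are made up, and the quartiles use the "inclusive" interpolation rule from Python's standard `statistics` module, which is one of several quartile conventions and can differ slightly from Tukey's hinges on small samples.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum).

    Assumption: quartiles follow the 'inclusive' rule of
    statistics.quantiles; other conventions give slightly
    different cut points on small samples.
    """
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two observations")
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (data[0], q1, median, q3, data[-1])


def interquartile_range(values):
    """Sample interquartile range: upper quartile minus lower quartile."""
    _, q1, _, q3, _ = five_number_summary(values)
    return q3 - q1


if __name__ == "__main__":
    # Hypothetical data: the second sample replaces the largest value
    # with an extreme outlier to show which summaries move.
    clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    contaminated = clean[:-1] + [900]

    for label, sample in (("clean", clean), ("contaminated", contaminated)):
        print(label)
        print("  five-number summary:", five_number_summary(sample))
        print("  interquartile range:", interquartile_range(sample))
        print("  mean:", statistics.mean(sample))
        print("  standard deviation:", statistics.stdev(sample))
```

In this made-up sample, the single outlier changes only the maximum of the five-number summary, while the mean and standard deviation shift substantially. That is the sense in which the median and quartiles are described as robust.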