Data visualization, in the simplest terms, is a way to encode data as visual elements. The effective use of these visualizations can provide a powerful means to quickly perceive and understand phenomena, which would otherwise take much longer to comprehend if only using text-based descriptions.
Data visualization with Python and JS matters because it makes it easier for an analyst to recognize trends, patterns, and other useful information that may not be obvious from the raw numbers alone.
With Python, you get some of the best graphing libraries for plotting. Each of these libraries comes with features that allow you to create plots according to your preferences.
The following are the most popular plotting libraries available for Python:
For the sake of this article, however, we will focus on creating plots while referring to Pandas visualization, Seaborn, and Matplotlib. We will also look at how you can use particular features of each library.
Since this is a beginner's guide, we will first pay attention to the syntax, after which we can then consider looking at graphs.
For this article, we will use the Iris and Wine Reviews datasets, since both are easily available. To load them, we use the pandas *read_csv* function, as in the following code.
loading iris.py
loading wine_reviews.py
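A minimal sketch of loading both datasets with *read_csv*. The column names, the sample rows, and the use of in-memory text in place of real CSV files are assumptions made so that the example runs on its own; with real files you would pass the file path instead.

```python
import io

import pandas as pd

# Stand-ins for CSV files on disk. In practice you would pass a path instead,
# e.g. pd.read_csv("iris.csv"); that file name is an assumption.
iris_csv = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,class\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)
wine_csv = io.StringIO(
    ",country,points,price\n"
    "0,Italy,87,20.0\n"
    "1,US,90,65.0\n"
)

iris = pd.read_csv(iris_csv)
# index_col=0 treats the first (unnamed) column as the row index.
wine_reviews = pd.read_csv(wine_csv, index_col=0)

print(iris.head())
print(wine_reviews.head())
```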
Introduced by John Hunter in 2002, Matplotlib is a Python data visualization library that offers a great deal of freedom in plotting. In addition to being written in Python, it is built on the NumPy library. It can be used in Python and IPython shells, web application servers, and Jupyter notebooks.
Users can make use of the wide variety of plots that come with this library such as bar, scatter, line, histogram, and others. All of these plots come in handy in helping users to understand patterns, trends, and correlations.
One thing to keep in mind about Matplotlib is that it is a low-level library with a MATLAB-like interface. This gives the user a lot of freedom, at the cost of having to write more code.
You can use either pip or conda to install Matplotlib, as shown below:
conda install matplotlib
or
pip install matplotlib
Matplotlib comes in handy especially when you want to create bar charts, line charts, histograms, and others. You can import it using the following import command:
import matplotlib.pyplot as plt
You can use the *scatter* method in Matplotlib to make a scatter plot. In this example code, we also use *plt.subplots* to create a figure and axes, so that we can give the plot a title and labels.
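A minimal sketch of such a scatter plot, assuming a small hand-typed DataFrame standing in for the Iris data (the column names and rows are illustrative):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in for the Iris dataset (assumed column names).
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
})

# Create a figure and a single axes to draw on.
fig, ax = plt.subplots()

# Scatter sepal_length against sepal_width.
ax.scatter(iris["sepal_length"], iris["sepal_width"])

# The title and axis labels are set on the axes object.
ax.set_title("Iris Dataset")
ax.set_xlabel("sepal_length")
ax.set_ylabel("sepal_width")
```

In a script, calling plt.show() afterwards displays the figure; in a Jupyter notebook it is rendered automatically.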
To make the graph more interesting and easier to understand, you can add color to it. Assigning a color to each data point according to its class gives the graph more meaning, as in the illustration below.
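One way to color points by class is to map each class name to a color and pass the result to *scatter*. This is a sketch under assumptions: the DataFrame, its column names, and the chosen colors are illustrative.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in for the Iris dataset (assumed column names).
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# One color per class (the color choice is arbitrary).
colors = {"Iris-setosa": "r", "Iris-versicolor": "g", "Iris-virginica": "b"}
point_colors = iris["class"].map(colors).tolist()

fig, ax = plt.subplots()
# c accepts one color per data point.
ax.scatter(iris["sepal_length"], iris["sepal_width"], c=point_colors)
ax.set_title("Iris Dataset")
ax.set_xlabel("sepal_length")
ax.set_ylabel("sepal_width")
```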
The Matplotlib library allows for the creation of line charts using the *plot* function. You can also plot multiple columns in the same graph. The easiest way to do this is to loop through the columns that you are interested in and plot each one, like in the following code:
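A sketch of the looping approach, assuming an illustrative Iris-like DataFrame in which every column except "class" is numeric:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in for the Iris dataset (assumed column names).
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# Every column except the class label is plotted.
columns = iris.columns.drop("class")
# Use the row position as the x value.
x_data = range(len(iris))

fig, ax = plt.subplots()
for column in columns:
    # One line per column, labeled for the legend.
    ax.plot(x_data, iris[column], label=column)

ax.set_title("Iris Dataset")
ax.legend()
```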
You can also make a histogram to see how often particular values of a given variable occur. To create a histogram in Matplotlib, we use the *hist* function. In this illustration, let's refer back to the Wine Reviews dataset.
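A minimal histogram sketch. The "points" column name and the values are made up for illustration and do not come from the real Wine Reviews data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the Wine Reviews dataset (assumed column name).
wine_reviews = pd.DataFrame({"points": [85, 87, 87, 87, 88, 90, 90, 92]})

fig, ax = plt.subplots()
# hist returns the bin counts, the bin edges, and the drawn bars.
counts, bin_edges, patches = ax.hist(wine_reviews["points"])

ax.set_title("Wine Review Scores")
ax.set_xlabel("Points")
ax.set_ylabel("Frequency")
```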
One thing that you need to understand about bar charts in Matplotlib is that the library does not calculate the frequency of a class automatically. As such, you will have to use the pandas *value_counts* function for this purpose.
To create the bar chart itself, you can use the *bar* method. It's also good to keep in mind that bar charts work best for categorical data with few categories (below 30), because more than 30 classes would make the chart messy and hard to read.
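The two steps above, counting with *value_counts* and drawing with *bar*, can be sketched as follows. The "points" column and its values are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the Wine Reviews dataset (assumed column name).
wine_reviews = pd.DataFrame({"points": [85, 87, 87, 87, 88, 90, 90, 92]})

# Count how often each score occurs, ordered by score.
counts = wine_reviews["points"].value_counts().sort_index()

fig, ax = plt.subplots()
# One bar per distinct score; the bar height is the frequency.
ax.bar(counts.index, counts.values)
ax.set_title("Wine Review Scores")
ax.set_xlabel("Points")
ax.set_ylabel("Frequency")
```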
Pandas Visualization is a Python plotting interface that is easy to use. It offers a higher-level API than Matplotlib, which means that you can achieve similar results to those of Matplotlib while writing less code.
With Pandas Visualization, it is easy to plot directly from a Pandas DataFrame or Series.
To install Pandas, you can use either the pip or the conda command.
pip install pandas
or
conda install pandas
To make a scatter plot, call <dataset>.plot.scatter() and pass it two arguments: the name of the X column and the name of the Y column. You also have the option to pass a title for the plot. Here's the illustration code:
In the resulting image, the X and Y axes are labeled automatically with the column names.
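A minimal sketch, assuming an illustrative Iris-like DataFrame with "sepal_length" and "sepal_width" columns:

```python
import pandas as pd

# Illustrative stand-in for the Iris dataset (assumed column names).
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
})

# x and y are column names; title is optional.
ax = iris.plot.scatter(x="sepal_length", y="sepal_width", title="Iris Dataset")
```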
We can use the <dataframe>.plot.line() function to create line charts in Pandas. Unlike in Matplotlib, where we have to loop through the columns to plot several of them, Pandas does not require the same workaround.
If you do not specify any columns, Pandas plots all available numeric columns automatically. Here's an illustration code that you can try:
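A minimal sketch, assuming an illustrative Iris-like DataFrame. The "class" column is dropped explicitly so that only numeric columns are passed to the plot.

```python
import pandas as pd

# Illustrative stand-in for the Iris dataset (assumed column names).
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# One line per remaining column; no explicit loop is needed.
ax = iris.drop(columns=["class"]).plot.line(title="Iris Dataset")
```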
It's also good to keep in mind that Pandas will create a legend automatically for you in cases where there is more than a single feature.
To create histograms in Pandas, use the *plot.hist* function. You are not required to pass any arguments, but you have the option to specify some, such as the number of bins. Here's an illustration code:
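A minimal sketch with an explicit number of bins. The "points" column and its values are made up for illustration.

```python
import pandas as pd

# Made-up stand-in for the Wine Reviews dataset (assumed column name).
wine_reviews = pd.DataFrame({"points": [85, 87, 87, 87, 88, 90, 90, 92]})

# bins is optional; here the value range is split into 5 bins.
ax = wine_reviews["points"].plot.hist(bins=5)
```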
You also have the ability to make more than one histogram, like in the following illustration code:
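A sketch of one histogram per column, assuming an illustrative Iris-like DataFrame with four numeric columns. The figure size and bin count are arbitrary choices.

```python
import pandas as pd

# Illustrative stand-in for the Iris dataset (assumed column names).
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})

# subplots=True draws each column on its own axes;
# layout=(2, 2) arranges those axes in 2 rows and 2 columns.
axes = iris.plot.hist(subplots=True, layout=(2, 2), figsize=(10, 10), bins=20)
```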
In the code above, you will note that we have used a *subplots* argument, as well as a *layout* argument.
The *subplots* argument specifies that the outcome should have a separate plot for every feature. The *layout* argument, on the other hand, specifies the number of plots per row and column.
To create a bar chart in Pandas, use the *plot.bar()* method. Before doing this, however, you have to prepare your data using the *value_counts()* and *sort_index()* methods, as in the code illustration below:
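A minimal sketch of that chain of calls. The "points" column and its values are made up for illustration.

```python
import pandas as pd

# Made-up stand-in for the Wine Reviews dataset (assumed column name).
wine_reviews = pd.DataFrame({"points": [85, 87, 87, 87, 88, 90, 90, 92]})

# Count each score, order the counts by score, then draw vertical bars.
ax = wine_reviews["points"].value_counts().sort_index().plot.bar()
```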
If required, you can use the *plot.barh()* function to create a horizontal bar chart, like in the following example:
Bar charts are not limited to the number of occurrences. You can also plot an aggregated value, like in the following code:
In the example code above, we group the data by country and compute the mean wine price for each group. We then plot the 5 countries with the highest average wine price.
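A minimal horizontal variant, again with a made-up "points" column:

```python
import pandas as pd

# Made-up stand-in for the Wine Reviews dataset (assumed column name).
wine_reviews = pd.DataFrame({"points": [85, 87, 87, 87, 88, 90, 90, 92]})

# barh draws the same counts as horizontal bars.
ax = wine_reviews["points"].value_counts().sort_index().plot.barh()
```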
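A sketch of plotting an aggregate instead of a count: the mean price per country, limited to the five highest. The "country" and "price" columns and all values are made up for illustration.

```python
import pandas as pd

# Made-up stand-in for the Wine Reviews dataset (assumed column names).
wine_reviews = pd.DataFrame({
    "country": ["Italy", "Portugal", "US", "US",
                "France", "Italy", "Spain", "Germany"],
    "price": [20.0, 15.0, 14.0, 65.0, 24.0, 30.0, 28.0, 12.0],
})

# Mean price per country, highest first, keeping the top 5.
top_countries = (
    wine_reviews.groupby("country")["price"]
    .mean()
    .sort_values(ascending=False)
    .head(5)
)

ax = top_countries.plot.bar()
```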
Seaborn is a data visualization library for creating statistical graphics in Python. It is built on Matplotlib and integrates with pandas data structures. It also performs the mapping and aggregation needed to create informative visuals.
One thing that you will love about this library is the convenience that it offers when it comes to creating graphs. Where Matplotlib may need many lines of code to create a graph, Seaborn can often produce a similar output in a single line.
What's more, it comes with attractive default styles and an interface designed to work with Pandas dataframes.
Here's how to import Seaborn:
import seaborn as sns
To create a scatter plot in Seaborn, use the *sns.scatterplot* function. Seaborn is similar to Pandas in that you pass the X and Y column names.
Unlike in Pandas, however, you also have to pass the data as an additional argument, because you are not calling the function directly on the data.
Have a look at this illustration code:
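A minimal sketch, assuming an illustrative Iris-like DataFrame with "sepal_length" and "sepal_width" columns:

```python
import pandas as pd
import seaborn as sns

# Illustrative stand-in for the Iris dataset (assumed column names).
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
})

# x and y name the columns; data supplies the DataFrame.
ax = sns.scatterplot(x="sepal_length", y="sepal_width", data=iris)
```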
Just like in Matplotlib, you have the option to use color to highlight the data points. In this case, however, it's considerably easier, as is evident in the illustration code below:
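In Seaborn, coloring by class can be sketched with the *hue* argument. The DataFrame and its column names are illustrative assumptions.

```python
import pandas as pd
import seaborn as sns

# Illustrative stand-in for the Iris dataset (assumed column names).
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# hue picks the column that determines each point's color;
# Seaborn chooses the colors and adds a legend.
ax = sns.scatterplot(x="sepal_length", y="sepal_width", hue="class", data=iris)
```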
To make line charts in Seaborn, use the *sns.lineplot* function. The only argument that you are required to pass in this case is the data.
It's worth noting that you can also use the *sns.kdeplot* function, which draws a smoothed kernel density estimate of a variable's distribution. That way, your representation will be a lot cleaner whenever you have many outliers in the dataset.
Here's a sample code:
You can use the *sns.distplot* function to make histograms in Seaborn. Note that it has been deprecated since Seaborn 0.11 in favor of *sns.histplot* and *sns.displot*. The argument that you need to pass, in this case, is the column that you are interested in plotting.
In addition, you can choose to pass the number of bins. You also have the option to decide whether you're interested in creating a Gaussian kernel density estimate in the graph.
Here's a sample code without the Gaussian kernel density estimate:
And the same sample code but this time with the Gaussian kernel density estimate:
We use the *sns.countplot* function to make bar charts in Seaborn like in the following illustration code:
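A minimal sketch with a made-up "points" column. Unlike Matplotlib's *bar*, *countplot* counts the occurrences itself.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in for the Wine Reviews dataset (assumed column name).
wine_reviews = pd.DataFrame({"points": [85, 87, 87, 87, 88, 90, 90, 92]})

# One bar per distinct score; the height is the number of rows with it.
ax = sns.countplot(x="points", data=wine_reviews)
```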
Having understood the basic graph syntax of the Matplotlib, Seaborn, and Pandas Visualization libraries, we can look at other types of graphs that can also be used to extract insights.
The majority of these additional graphs go best with Seaborn, taking into consideration that the library has a high-level interface that makes it possible to make appealing yet complete graphs using just a few lines of code.
A box plot displays the five-number summary of the data. To create one, call the *sns.boxplot* function and pass the X and Y column names, as well as the data, as in the illustration code below:
Just like bar charts, box plots are best suited to data with only a few categories, lest you end up with a messy outcome.
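A minimal box plot sketch: one box of prices per score. The "points" and "price" columns and their values are made up for illustration.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in for the Wine Reviews dataset (assumed column names).
wine_reviews = pd.DataFrame({
    "points": [87, 87, 87, 88, 88, 88, 90, 90, 90],
    "price": [15.0, 20.0, 18.0, 14.0, 30.0, 22.0, 24.0, 65.0, 40.0],
})

# x is the category, y is the value summarized by each box.
ax = sns.boxplot(x="points", y="price", data=wine_reviews)
```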
A heatmap is a graphical representation in which the values of a matrix are represented using colors. As such, heatmaps are well suited to exploring feature correlation in datasets.
Use the *<dataset>.corr()* function to compute the correlation between the features in a dataset. Once you have the correlation matrix, you can make a heatmap using either Seaborn or Matplotlib. Here is an illustration code for Matplotlib:
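A sketch of a Matplotlib heatmap built with *imshow*, assuming an illustrative Iris-like DataFrame. The "class" column is dropped first because *corr* works on numeric columns.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Illustrative stand-in for the Iris dataset (assumed column names).
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor",
              "Iris-versicolor", "Iris-virginica", "Iris-virginica"],
})

# Correlation matrix of the numeric columns only.
corr = iris.drop(columns=["class"]).corr()

fig, ax = plt.subplots()
# Each matrix value becomes one colored cell.
im = ax.imshow(corr.values)

# Label both axes with the column names.
ax.set_xticks(np.arange(len(corr.columns)))
ax.set_yticks(np.arange(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticklabels(corr.columns)

# Rotate the x labels so they do not overlap.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
```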
You can also add annotations to the heatmap for the above code by adding the following:
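Annotations can be sketched as a loop that writes each correlation value into its cell with *ax.text*. The block repeats a compact heatmap setup so that it runs on its own; the DataFrame is an illustrative assumption.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative stand-in for the Iris dataset (assumed column names).
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})
corr = iris.corr()

fig, ax = plt.subplots()
ax.imshow(corr.values)

# Write the rounded correlation value in the center of each cell.
# imshow puts column j on the x axis and row i on the y axis.
for i in range(len(corr.index)):
    for j in range(len(corr.columns)):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}",
                ha="center", va="center", color="black")
```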
When using Seaborn, it's easier to make the heatmap and add annotations like in the following illustration code:
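A minimal Seaborn sketch, assuming an illustrative Iris-like DataFrame of numeric columns:

```python
import pandas as pd
import seaborn as sns

# Illustrative stand-in for the Iris dataset (assumed column names).
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})

# annot=True writes each correlation value into its cell.
ax = sns.heatmap(iris.corr(), annot=True)
```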
Faceting splits data variables across several subplots and combines those subplots into a single figure. This brings the added convenience of being able to explore a dataset easily and quickly.
In Seaborn, you can apply faceting using the *FacetGrid* class. To begin with, you define the FacetGrid by passing it the data and the column or row variable that should be used to split the data.
After this, call the *map* method on the *FacetGrid* object, selecting the type of plot that you want to use and the column that you would like to graph.
Here is an illustration code:
Depending on what outcome you want to achieve, you have the option to make your graphs more complicated and larger than the ones in the illustration code above.
The last type of graph that we shall look at in this section is the pair plot. The two main functions that can be used for it are Pandas' *scatter_matrix* and Seaborn's *pairplot*. You can use these two functions to create grids of pairwise relationships for a given dataset.
Here's an example code for Seaborn's pairplot:
And the following is an illustration code for Pandas scatter_matrix:
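A minimal sketch, assuming the same kind of illustrative Iris-like DataFrame. The figure size and alpha value are arbitrary choices.

```python
import pandas as pd
from pandas.plotting import scatter_matrix

# Illustrative stand-in for the Iris dataset (assumed column names).
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})

# Returns a 2-D array of axes, one per pair of columns.
axes = scatter_matrix(iris, alpha=1, figsize=(8, 8))
```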
Both the Pandas and the Seaborn illustration codes should give you a grid of graphs, where the diagonal is made up of histograms and the remaining cells contain scatter plots.
Data Visualization with Python and JS is simply a graphical representation of data to make it more accessible and straightforward to understand. This is especially important for people who are not used to raw data.
Seaborn, Pandas Visualization, and Matplotlib, as graphical representation libraries, come packed with a range of features that allow you to represent data just as you intend. As you may have noted from this article, you have the freedom to manipulate each of these libraries to make them suit your representation needs.