Deep neural networks are a fascinating and powerful concept that facilitates learning from complex data. The underlying mathematical construction is that of a directed graph, and there are a number of different architectures (i.e. choices of a directed graph) proposed for various problems. The default structure is fully connected, where every neuron in a fixed layer is connected to every neuron in the following layer. What this means is that the neural network works with formulas involving every feature in the preceding layer. Such architectures sometimes work, but in general the very large number of connections means that learning is slow and is often not accurate. In order to accelerate computation and increase accuracy we must develop methods that restrict or prune the connections in a useful way. Specific classes of architectures have been developed to address specific data types, in particular we have the convolutional architectures which address, among other things, time series and images. These constructions appear to be one-offs for particular classes of problems.
In this post, we will discuss how to take much more general data types and adapt the architecture to them. In fact, we will demonstrate cases where we can construct an architecture for a single data set – a custom neural network. In order to do this, we must first determine a way to think about the convolutional construction and understand how it can generalize.
We recall that for images, a convolutional neural net is composed of layers of neurons, in this case thought of as pixels, each of which is laid out as a collection of regular square grids. While the grid size can vary, typically the number of points in the grid becomes smaller for deeper layers. Often, however, the convolutional neural net has consecutive layers in which the grids are of identical size. In this case, the connections are local in the sense that neurons are connected to nearby neurons. For example, one might take the 3×3 grid patch with the original neuron in the center, and have the connection pattern below repeated around the whole grid.
In a convolutional net for time series, the layers of neurons would be arranged in linear segments instead. Again, the interlayer connections between neurons are specified by the requirement that a neuron in a given layer is connected with neurons in the next higher layer that lie in a small segment around the placement of the given neuron. In this case the picture is like this.
In the case of images, the features defining a data set of images consists of pixels, each one assigned a gray scale numerical value. The identification with pixels with points in the plane equips the pixels (i.e. the set of features) with a geometric structure. In particular, one can consider the pixels as having a notion of distance between them, and that pixels only are connected to pixels in the following layer that lie close to them in the grid.
This means that we have removed a large number of connections from the fully connected structure based on a geometric principle, which we will refer to as locality.
The times series situation is similar – in this case time points in one layer connect to time points in the next layer if and only if they are within a fixed distance of each other.
Why does this kind of pruning of connections make sense?
Let’s consider the case of a data set of time series, and suppose that each time series consists of 10^{4 }consecutive time points and measurements at those time points for example stock ticker data. Suppose further that we are trying to train a predictor for estimating the value of the measurement at time t_{1+1} from all the measurements at all values of time. That would mean that we would be attempting to search for features that include all possible combinations of the 10^{4} variables. Even if we only included pairs of time measurements, this would already enlarge the search space to 10^{8} combinations. If we are searching through all combinations, then the search space is of size 2^{10,000}, obviously a very large number.
By pruning all connections between time points that are further than one unit apart, we are restricting the way in which the neural net can modify the formulas to those combinations of features that consist only of sets of three consecutive time points, for example. This means that we are searching through a space consisting only of 104 combinations, one for each center of an interval of three time points.
This is an important efficiency.
Furthermore, it can easily be justified, since we believe that it is much more likely that features that are nearby (in time) will be relevant to the prediction task at hand than time points that are far removed in time.
A similar argument works for images, where the geometrically local structure encodes the idea that the features we are looking for to distinguish images are local, i.e. are concentrated in a small part of the image. Pixels that are far removed from each other are unlikely to produce combined features that are meaningful. We will refer to this condition imposed on connections as locality, a notion that makes sense in both of the two situations in question.
Naturally, there is great utility in extending this idea to new situations, beyond just time series and images. This means that we want our feature space to be equipped with a notion of distance, which can be encoded in the idea of an abstract metric on a set X. A metric on X is a function that assigns a non-negative real value d(x, y) to every pair of points (x, y), with x and y members of X, subject to three properties.
- d (x, y) = 0 if and only if x = y.
- d (x, y) = d (y, x)
- d (x, z) = d (x, y) + d (y, z)
These conditions abstract three important properties of the ordinary distance function on the line or in the plane or space. In particular, in the case of time series and images, the data sets are equipped with an a priori metric since they are subsets of the line and the plane, respectively. To see another example, consider the circle, where we can assign a notion of distance that assigns, to two different angles, the arc length of the shortest arc from the one to the other, as points on the circle. This quantity can easily be seen to be a metric on the circle. Let’s now recall the features _{θ} that were introduced in our previous post, one for each angular value on the circle. If we choose a discrete set on the circle, it can be equipped with a metric by restricting the distance on the circle. The points could be chosen to be regularly spaced, as in the diagram below.
We now have a collection of 16 features, equipped with a distance function, just as we had distance functions on the one-dimensional lattice in the time series case and the two-dimensional lattice in the case of images.
The key difference is that these features were not raw data from an image data set, but were derived features that were found to be of importance via the work in an earlier blog post. Nevertheless, it is possible to build a feed forward neural net with neurons in each layer in correspondence with the 16 points on the circle, and where connections from one such layer to the following one are local, i.e. a neuron in a layer is connected to a neuron in the following layer if and only if they are close, say adjacent to each other, in the circular geometry.
Using networks built in this way, but in combination with the grids coming from images, we are exploring how this kind of architecture (again, choices of a directed graph in the mathematical sense) can speed up the learning process in image convolutional neural nets. There are many such geometries that can be used to build neural net structures adapted to various kinds of data.
We have described how to use geometry to specify connections when one has two adjacent layers that are identical. However, and important construction in the image (and time series) examples is that of a pooling process, which has a grid in one layer and a smaller grid in the following layer. In the time series case, it looks like this.
Each neuron in the right layer is connected to two neurons in the left layer. A similar thing is done with images, having the effect of forming a new image at a lower resolution. A variant of this pooling procedure is available for the circular geometries as well. Here is a picture describing it.
The previous example shows how one can incorporate knowledge about features spaces into the construction of neural nets, and that knowledge can be a priori knowledge coming directly from the raw data (as in the case of images or time series) or it can be knowledge derived from research about the features, as in the case of the circular geometry on derived features from images.
But what about a situation in which one does not have research leading to a simple model of geometry on a space of features? There is still something one can do. The important idea is to use the methods described in our earlier blog post. Suppose that we have a data matrix D, whose rows are the data points, and whose columns are the features. Using such a matrix, one can put a metric on the set of rows in a number of ways (higher dimensional versions of the Euclidean metric, Hamming metrics, etc.) , and doing this turns out to be extremely useful.
It is the basis for Ayasdi’s software, where topological models of data sets are constructed using a metric on the data set. What was observed in our previous post, though, is that it is equally possible to put a metric on the columns of D, i.e. the features. By performing this simple transformation, we have a metric on the feature space, as we did in the cases described above.
One could use this geometry directly, but often there are many more features than the number of neurons one would prefer to have. What one can do is to create an Ayasdi topological model of the set of columns, using the distance function. In this model, each node is a collection of features, and I can assign to it a new feature that is the average value of all of the features that correspond to the node. We now have a graph structure so that each node corresponds to a feature.
We can use this to create a feed forward network, with each layer having neurons in one to one correspondence with the set of nodes in the topological model, and where we assign a edge from the neuron in a particular layer corresponding to a vertex of the topological model to a neuron in the following layer corresponding to a vertex if and only if and form an edge in the topological model. This creates a feed forward network with identical layers. There is also a way to construct pooling layers in this context, based on varying the resolution in the topological model.
The activation and weight information in this net gives rise to derived features for the data set, as well as solving a classification or estimation problem on the data set. One important point in using the topological model for the feature space is that it forces the neural network to deal with the results of groups of raw features, and therefore will not be able to overfit on a single raw feature. It should mitigate the overfitting that often occurs in this technology.
What we described above is a methodology that facilitates the construction of feed forward deep learning architectures adapted to classes of data sets, such as images, time series, etc.
Having said that, this methodology also facilitates the construction of a bespoke architectures for one-off data sets that fall outside the classes outlined above. As such, we have a unified approach that covers the vast majority of potential data sets.
Deep neural networks are a fascinating and powerful concept that facilitates learning from complex data. The underlying mathematical construction is that of a directed graph, and there are a number of different architectures (i.e. choices of a directed graph) proposed for various problems. The default structure is fully connected, where every neuron in a fixed layer is connected to every neuron in the following layer. What this means is that the neural network works with formulas involving every feature in the preceding layer. Such architectures sometimes work, but in general the very large number of connections means that learning is slow and is often not accurate. In order to accelerate computation and increase accuracy we must develop methods that restrict or prune the connections in a useful way. Specific classes of architectures have been developed to address specific data types, in particular we have the convolutional architectures which address, among other things, time series and images. These constructions appear to be one-offs for particular classes of problems.
In this post, we will discuss how to take much more general data types and adapt the architecture to them. In fact, we will demonstrate cases where we can construct an architecture for a single data set – a custom neural network. In order to do this, we must first determine a way to think about the convolutional construction and understand how it can generalize.
We recall that for images, a convolutional neural net is composed of layers of neurons, in this case thought of as pixels, each of which is laid out as a collection of regular square grids. While the grid size can vary, typically the number of points in the grid becomes smaller for deeper layers. Often, however, the convolutional neural net has consecutive layers in which the grids are of identical size. In this case, the connections are local in the sense that neurons are connected to nearby neurons. For example, one might take the 3×3 grid patch with the original neuron in the center, and have the connection pattern below repeated around the whole grid.
In a convolutional net for time series, the layers of neurons would be arranged in linear segments instead. Again, the interlayer connections between neurons are specified by the requirement that a neuron in a given layer is connected with neurons in the next higher layer that lie in a small segment around the placement of the given neuron. In this case the picture is like this.
In the case of images, the features defining a data set of images consists of pixels, each one assigned a gray scale numerical value. The identification with pixels with points in the plane equips the pixels (i.e. the set of features) with a geometric structure. In particular, one can consider the pixels as having a notion of distance between them, and that pixels only are connected to pixels in the following layer that lie close to them in the grid.
This means that we have removed a large number of connections from the fully connected structure based on a geometric principle, which we will refer to as locality.
The times series situation is similar – in this case time points in one layer connect to time points in the next layer if and only if they are within a fixed distance of each other.
Why does this kind of pruning of connections make sense?
Let’s consider the case of a data set of time series, and suppose that each time series consists of 10^{4 }consecutive time points and measurements at those time points for example stock ticker data. Suppose further that we are trying to train a predictor for estimating the value of the measurement at time t_{1+1} from all the measurements at all values of time. That would mean that we would be attempting to search for features that include all possible combinations of the 10^{4} variables. Even if we only included pairs of time measurements, this would already enlarge the search space to 10^{8} combinations. If we are searching through all combinations, then the search space is of size 2^{10,000}, obviously a very large number.
By pruning all connections between time points that are further than one unit apart, we are restricting the way in which the neural net can modify the formulas to those combinations of features that consist only of sets of three consecutive time points, for example. This means that we are searching through a space consisting only of 104 combinations, one for each center of an interval of three time points.
This is an important efficiency.
Furthermore, it can easily be justified, since we believe that it is much more likely that features that are nearby (in time) will be relevant to the prediction task at hand than time points that are far removed in time.
A similar argument works for images, where the geometrically local structure encodes the idea that the features we are looking for to distinguish images are local, i.e. are concentrated in a small part of the image. Pixels that are far removed from each other are unlikely to produce combined features that are meaningful. We will refer to this condition imposed on connections as locality, a notion that makes sense in both of the two situations in question.
Naturally, there is great utility in extending this idea to new situations, beyond just time series and images. This means that we want our feature space to be equipped with a notion of distance, which can be encoded in the idea of an abstract metric on a set X. A metric on X is a function that assigns a non-negative real value d(x, y) to every pair of points (x, y), with x and y members of X, subject to three properties.
- d (x, y) = 0 if and only if x = y.
- d (x, y) = d (y, x)
- d (x, z) = d (x, y) + d (y, z)
These conditions abstract three important properties of the ordinary distance function on the line or in the plane or space. In particular, in the case of time series and images, the data sets are equipped with an a priori metric since they are subsets of the line and the plane, respectively. To see another example, consider the circle, where we can assign a notion of distance that assigns, to two different angles, the arc length of the shortest arc from the one to the other, as points on the circle. This quantity can easily be seen to be a metric on the circle. Let’s now recall the features _{θ} that were introduced in our previous post, one for each angular value on the circle. If we choose a discrete set on the circle, it can be equipped with a metric by restricting the distance on the circle. The points could be chosen to be regularly spaced, as in the diagram below.
We now have a collection of 16 features, equipped with a distance function, just as we had distance functions on the one-dimensional lattice in the time series case and the two-dimensional lattice in the case of images.
The key difference is that these features were not raw data from an image data set, but were derived features that were found to be of importance via the work in an earlier blog post. Nevertheless, it is possible to build a feed forward neural net with neurons in each layer in correspondence with the 16 points on the circle, and where connections from one such layer to the following one are local, i.e. a neuron in a layer is connected to a neuron in the following layer if and only if they are close, say adjacent to each other, in the circular geometry.
Using networks built in this way, but in combination with the grids coming from images, we are exploring how this kind of architecture (again, choices of a directed graph in the mathematical sense) can speed up the learning process in image convolutional neural nets. There are many such geometries that can be used to build neural net structures adapted to various kinds of data.
We have described how to use geometry to specify connections when one has two adjacent layers that are identical. However, and important construction in the image (and time series) examples is that of a pooling process, which has a grid in one layer and a smaller grid in the following layer. In the time series case, it looks like this.
Each neuron in the right layer is connected to two neurons in the left layer. A similar thing is done with images, having the effect of forming a new image at a lower resolution. A variant of this pooling procedure is available for the circular geometries as well. Here is a picture describing it.
The previous example shows how one can incorporate knowledge about features spaces into the construction of neural nets, and that knowledge can be a priori knowledge coming directly from the raw data (as in the case of images or time series) or it can be knowledge derived from research about the features, as in the case of the circular geometry on derived features from images.
But what about a situation in which one does not have research leading to a simple model of geometry on a space of features? There is still something one can do. The important idea is to use the methods described in our earlier blog post. Suppose that we have a data matrix D, whose rows are the data points, and whose columns are the features. Using such a matrix, one can put a metric on the set of rows in a number of ways (higher dimensional versions of the Euclidean metric, Hamming metrics, etc.) , and doing this turns out to be extremely useful.
It is the basis for Ayasdi’s software, where topological models of data sets are constructed using a metric on the data set. What was observed in our previous post, though, is that it is equally possible to put a metric on the columns of D, i.e. the features. By performing this simple transformation, we have a metric on the feature space, as we did in the cases described above.
One could use this geometry directly, but often there are many more features than the number of neurons one would prefer to have. What one can do is to create an Ayasdi topological model of the set of columns, using the distance function. In this model, each node is a collection of features, and I can assign to it a new feature that is the average value of all of the features that correspond to the node. We now have a graph structure so that each node corresponds to a feature.
We can use this to create a feed forward network, with each layer having neurons in one to one correspondence with the set of nodes in the topological model, and where we assign a edge from the neuron in a particular layer corresponding to a vertex of the topological model to a neuron in the following layer corresponding to a vertex if and only if and form an edge in the topological model. This creates a feed forward network with identical layers. There is also a way to construct pooling layers in this context, based on varying the resolution in the topological model.
The activation and weight information in this net gives rise to derived features for the data set, as well as solving a classification or estimation problem on the data set. One important point in using the topological model for the feature space is that it forces the neural network to deal with the results of groups of raw features, and therefore will not be able to overfit on a single raw feature. It should mitigate the overfitting that often occurs in this technology.
What we described above is a methodology that facilitates the construction of feed forward deep learning architectures adapted to classes of data sets, such as images, time series, etc.
Having said that, this methodology also facilitates the construction of a bespoke architectures for one-off data sets that fall outside the classes outlined above. As such, we have a unified approach that covers the vast majority of potential data sets.