Category: Neuromation

  • Who Will Be Replaced by Robots I, or “Man! That has a proud sound!”

    Who Will Be Replaced by Robots I, or “Man! That has a proud sound!”

    [translated from the Russian version by Andrey V. Polyakov]

    Recently, a new round of conversations about self-moving carriages and other prospective achievements of artificial intelligence has again raised one of the classic questions of humanity: who will be left by the wayside of the next technological revolution? Some studies argue that in the near future artificial intelligence will lead to a surge of unemployment comparable to the Great Depression. Today we will also talk about who can be replaced by computers in the near future, and who can await the arrival of our silicon overlords without fear and even with a certain degree of self-satisfaction.

    Luddites: senseless English riot or reasonable economic behavior?

    Some people believe labor-saving technological change is bad for the workers because it throws them out of work. This is the Luddite fallacy, one of the silliest ideas to ever come along in the long tradition of silly ideas in economics.

    William Easterly. The Elusive Quest for Growth: Economists’ Adventures and Misadventures in the Tropics

    A text like this could hardly do without recalling the most famous opponents of technological progress: the Luddites. Sentiment against new machinery had been strong among English textile workers as far back as the 18th century, but Nottingham manufacturers did not start receiving letters signed on behalf of Ned Ludd until 1811. The trigger for open action was the introduction of stocking frames, which made the skills of experienced weavers unnecessary: stockings could now be sewn together from separate parts without any special training. The resulting product was, by the way, much worse, with the stockings quickly bursting at the seams, but they were so much cheaper that they still enjoyed immense popularity. Fearing for their jobs, the weavers began attacking factories and smashing the newfangled machines.

    The Luddites were well organized. They understood that secrecy was vital; they swore terrible oaths of loyalty, acted under the cover of night, and were rarely apprehended. Most probably, General Ludd never existed: he was a larger-than-life folklore figure like Paul Bunyan. The fight against the Luddites was taken seriously: in 1812 “machine breaking” was made a capital crime, and at times there were more British soldiers suppressing the uprisings than fighting Napoleon in those same years! Still, the Luddites managed to significantly reduce automation in textile production, prices for the products increased, and the goals of the movement were partially achieved. But honestly, when did you last wear hand-woven stockings, socks, or pantyhose?…

    The epigraph to this section is a quote from a book by the economist William Easterly. Easterly explains that the Luddite movement spawned (no longer among English weavers, but in rather intellectual circles) an erroneous idea, which he calls the Luddite fallacy and which still occurs, from time to time, to quite real economists. The idea is that the development of automation must inevitably lead to a reduction in employment, because fewer people are needed to maintain the same level of production. However, both theory and practice show that quite a different outcome is no less likely: simply the same, or even a larger, number of people will produce more goods! And progress is usually on the side of the workers, increasing their productivity and, consequently, their income. True, this last consequence is not always obvious, but there is no other apparent way to raise the welfare of every individual worker.

    Nevertheless, these arguments do not refute the Luddites themselves. Regardless of the rhetoric, the Luddites were afraid not of a bright future in which every stockinger could make a hundred pairs of stockings a day, but of the immediate tomorrow, when they would be out on the street and no one would need the only skill they possessed. Moreover, even Easterly does not deny that progress can lead to unemployment and declining prosperity for particular workers, even if the average well-being of the population grows. Let us take a closer look at this argument.

    Whom did the robots replace recently

    …we set up this room with girls in it. Each one had a Marchant: one was the multiplier, another was the adder. This one cubed — all she did was cube a number on an index card and send it to the next girl… The speed at which we were able to do it was a hell of a lot faster… We got speed… that was the predicted speed for the IBM machine. The only difference is that the IBM machines didn’t get tired and could work three shifts. But the girls got tired after a while.

    Richard Feynman. Surely You’re Joking, Mr. Feynman!

    The economist James Bessen conducted an interesting study. He took the list of 270 major occupations used in the 1950 U.S. Census and checked how many of them have been completely automated since then. It turned out that out of the 270 occupations:

    • 232 are active so far;
    • 32 were eliminated due to declining demand and changes in market structure (Bessen mentions, e.g., boardinghouse keepers; then again, what would you call the people running Airbnb rentals…);
    • five became obsolete due to technological advances, but were never automated (for example, a telegraph operator);
    • and only one occupation was really fully automated.

    Those who want to guess which one have until the next paragraph.

    The only fully automated occupation in Bessen’s study is elevator operator. At this point, our readers over 80 who grew up in the U.S. may sigh and recall how elevator boys used to close the double doors behind you (the doors could not close by themselves) and press the warm glowing buttons… but was it really so good for you, or for the boys themselves? Maybe they were better off going to school after all?…

    We can add another interesting career path to Bessen’s list. Computers have long since completely replaced the occupation of… computer. Oddly enough, not everyone knows that this occupation existed, although it seems obvious that before the advent of electronic computers calculations had to be done by hand. Many great mathematicians were also outstanding human computers: for example, Leonhard Euler could carry out complex calculations in his head and often amused himself with some complicated exercise whenever his wife managed to drag him to the theater. But how were the calculations done, for example, in the Manhattan Project, where no physicist could have managed them alone, “on paper”?

    Well, they really were done by hand. The main idea is described in the epigraph: when people perform operations sequentially, as if on a conveyor belt, the work goes much faster. Of course, humans are prone to error, and the results had to be rechecked, but special algorithms can be developed for that, too. In the 19th century, human computers compiled mathematical tables (for example, of sines or logarithms), and in the twentieth century they worked for the military, including the Manhattan Project. It is human computers, by the way, who are responsible for the fact that programming was at first considered a female occupation: the computers were mostly young women (for purely sexist reasons — it was believed that women are better suited for monotonous tasks), and the first programmers were often recruited from their ranks.

    In his book When Computers Were Human, David Alan Grier describes the atmosphere in which the human calculators worked, citing Dickens (another unexpected affinity with the real Luddites): “A stern room, with a deadly statistical clock in it, which measured every second with a beat like a rap upon a coffin lid.” Should we be nostalgic for this patriarchal, but by no means bucolic, picture?

    Bessen distinguishes between complete and partial automation. Indeed, a hereditary elevator operator could have a hard time in the brave new world (there is only one problem: dynasties of elevator operators apparently never had time to form). However, if only part of what you do is automated, the demand for your occupation can even increase. Again we can return to the Luddites: over the 19th century, 98% of the labor required to weave a yard of cloth was automated. In theory, one could have dismissed forty-nine weavers out of fifty and produced the same amount of cloth. The actual effect was the opposite: demand for cheap cloth grew dramatically, driving up the demand for weavers, and the number of jobs increased significantly. And in the 1990s, the widespread deployment of ATMs did not reduce, but actually increased, the demand for bank tellers: ATMs allowed banks to operate branch offices at lower cost, which prompted them to open many more branches.

    Despite numerous technological innovations, the complete automation of occupations over the last 50–100 years has not had a noticeable effect on society. Do you know anyone whose grandfather was an elevator operator and grandmother a human computer, and who lost their jobs and sank to the bottom of society because of the goddamned automation?

    But partial automation over these years has radically changed the content of work for the vast majority of us. Computers and especially the Internet have made many professions many times more productive — can you imagine how long it would have taken me to collect the data for this article in the 1950s? And the globalization and automation of the economy have tangibly raised the standard of living — for everyone, not just the elite. It would be trivial to say that we now live better than medieval kings did — but try comparing our standard of living with the way our grandparents lived. I will not give examples or statistics; every reader has their own experience: just recall life before such trifles as mobile phones, Google, microwave ovens, and dishwashers (until recently, not even every household had a washing machine)…

    Therefore, it seems that progress and automation have so far only helped people, and the Luddites’ fears were greatly exaggerated. But maybe “the fifth industrial revolution” will be entirely different? In the second part of this article we will try to speculate on that topic.

    “The threat of automation”

    How far has gone “progrès”? Laboring is in recess,
    And mechanical, for sure, will the mental one replace.
    Troubles are forgotten, no need for quest,
    Toiling are the robots, humans take a rest.


    Yuri Entin. From the movie Adventures of Electronic

    Elevator operators, human computers, and weavers perfectly symbolize the activities that have been automated so far: technical work, where the output is expected to match certain parameters as closely as possible, and creativity is not only difficult but forbidden and, in fact, harmful. Of course, an elevator operator could smile at his passengers, and a computer girl could, risking damage to her reputation, strike up an acquaintance with Richard Feynman himself. But their function was to accurately perform clear, algorithmically defined actions.

    I will allow myself a little emotion: these are exactly the kinds of activities that must be automated further! There is nothing human about following a fixed algorithm. Monotonous work with a predetermined input and output is always an extreme measure, a forced oppression of the human spirit for the sake of some practical goal. And if the goal can be achieved without wasting human time, that is the way to proceed.

    However, computers are now gradually beginning to replace people in areas that until recently were considered a purely human domain. For example, since 2014 Facebook has been able to recognize faces at human performance level, and computer vision technologies are only improving.

    Sergey Nikolenko
    Chief Research Officer, Neuromation

  • Convolutional Networks

    Convolutional Networks

    And now let us turn to convolutional networks. In 1998, the French computer scientist Yann LeCun presented the architecture of a convolutional neural network (CNN).

    The network is named after the mathematical operation of convolution, which is often used for image processing and can be expressed by the following formula:

        \[(f\ast g)[m, n] = \sum_{k, l} f[m-k, n-l]\cdot g[k, l],\]

    where f is the original matrix of the image, and g is the convolution kernel (convolution matrix).
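
    To make the formula concrete, here is a minimal sketch in Python with NumPy (the use of NumPy and the toy sizes are our own assumptions for illustration, not something prescribed by the text) that computes this sum directly for a small image and kernel; real libraries use much faster implementations of the same operation.

        import numpy as np

        def convolve2d(f, g):
            # Direct implementation of (f*g)[m, n] = sum_{k, l} f[m-k, n-l] * g[k, l].
            # f is the image matrix, g is the convolution kernel; the output is computed
            # only where the shifted kernel fits entirely inside f ("valid" convolution).
            H, W = f.shape
            kh, kw = g.shape
            out = np.zeros((H - kh + 1, W - kw + 1))
            for m in range(kh - 1, H):
                for n in range(kw - 1, W):
                    s = 0.0
                    for k in range(kh):
                        for l in range(kw):
                            s += f[m - k, n - l] * g[k, l]
                    out[m - kh + 1, n - kw + 1] = s
            return out

        image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
        kernel = np.array([[1.0, 0.0], [0.0, -1.0]])       # toy 2x2 kernel
        print(convolve2d(image, kernel))                   # a 4x4 map of local differences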

    The basic assumption is that the input is not a discrete set of independent dimensions but rather an actual image, where the relative placement of the pixels is crucial. Certain pixels are positioned close to one another, while others are far away. In a convolutional neural network, one second-layer neuron is linked to a few first-layer neurons that are located close together, not to all of them. These neurons will gradually learn to recognize local features. Third-layer neurons designed in the same way will respond to local combinations of second-layer features, and so on. A convolutional network almost always consists of multiple layers, up to about a dozen in early CNNs and up to hundreds and even thousands now.

    Each layer of a convolutional network consists of three operations:

    • the convolution, which we have just described above,
    • a non-linearity, such as a sigmoid function or a hyperbolic tangent,
    • and pooling (subsampling).

    Pooling, also known as subsampling, applies a simple mathematical function (mean, max, min…) to a local group of neurons. In most cases, it is considered more important for higher-layer neurons to check whether a certain feature is present in an area than to remember its precise coordinates. The most popular form of pooling is max-pooling: the higher-level neuron activates if at least one neuron in its corresponding window activated. Among other things, this approach makes the convolutional network resistant to small changes in the input.
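
    As a small illustration (again a NumPy sketch of our own, not code from the text), here is 2×2 max-pooling over a feature map:

        import numpy as np

        def max_pool(x, size=2):
            # Non-overlapping max-pooling: each output value is the maximum
            # over a size x size window of the feature map.
            H, W = x.shape
            H2, W2 = H // size, W // size
            x = x[:H2 * size, :W2 * size]          # crop to a multiple of the window size
            return x.reshape(H2, size, W2, size).max(axis=(1, 3))

        feature_map = np.array([[1, 3, 2, 0],
                                [4, 2, 1, 1],
                                [0, 1, 5, 2],
                                [2, 2, 3, 4]], dtype=float)
        print(max_pool(feature_map))
        # [[4. 2.]
        #  [2. 5.]]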

    Numerous modern computer vision applications run on convolutional networks. For instance, Prisma, the app you have probably heard about, runs on convolutional neural networks. Practically all modern computer vision apps use convolutional networks to recognize objects in images. For instance, CNN-based scene labeling solutions, where an image from a camera is automatically divided into zones classified as known objects like “pavement”, “car”, or “tree”, underlie driver assistance systems. Actually, the driver is not always necessary: the same kind of networks are now used to create self-driving cars.

    Reinforcement Learning

    Generally, machine learning tasks are divided into two types: supervised learning, when the correct answers are already given and the machine learns from them, and unsupervised learning, when the questions are given but the answers are not. Things look different in real life. How does a child learn? When she walks into a table and hits her head, a signal saying, “Table means pain,” goes to her brain. The child won’t smack her head on the table the next time (well, maybe after a couple more tries). In other words, the child actively explores her environment without having received any correct answers. The brain doesn’t have any prior knowledge about the table causing pain. Moreover, the child won’t associate the table itself with pain (to do that, you generally need careful engineering of someone’s neural networks, like in A Clockwork Orange) but rather the specific action undertaken in relation to the table. In time, she’ll generalize this knowledge to a broader class of objects, such as big hard objects with corners.

    Experimenting, receiving results, and learning from them — that’s what reinforcement learning is. Agents interact with their environment and perform certain actions, the environment rewards or punishes these actions, and the agents adjust their behavior accordingly. In other words, the objective function takes the form of a reward. At every step of the way, agents, in some state S, select some action A from an available set of actions, and then the environment informs them about which reward they have received and which new state S’ they have reached.

    One of the challenges of reinforcement learning is making sure that you don’t accidentally learn the wrong associations between actions and rewards. Sometimes we erroneously link the environment’s response to whatever action immediately preceded it. This is a well-known bug in our brains, which diligently look for patterns even where none exist. The renowned American psychologist Burrhus Skinner (one of the fathers of behaviorism; it was he who invented the Skinner box for abusing lab animals) ran an experiment on pigeons. He put a pigeon in a cage and poured food into the cage at perfectly regular (!) intervals that did not depend on anything. Eventually, the pigeon decided that receiving food depended on its own actions. For instance, if the pigeon had flapped its wings right before being fed, then it would subsequently try to get food by flapping its wings again. This effect was later dubbed “pigeon superstition”. A similar mechanism probably fuels human superstitions, too.

    The aforementioned problem reflects the so-called exploitation vs. exploration dilemma. On the one hand, you have to explore new opportunities and study your environment to find something interesting. On the other hand, at some point you may decide that “I have already explored the table and understood it causes pain, while candy tastes good; now I can keep walking along and getting candy, without trying to sniff out something lying on the table that may taste even better.”

    There’s a very simple — which doesn’t make it any less important — example of reinforcement learning called multi-armed bandits. The metaphor goes as follows: an agent sits in a room with a few slot machines. The agent can drop a coin into the machine, pull the lever, and then win some money. Each slot machine provides a random reward from a probability distribution specific to that machine, and the optimal strategy is very simple — you have to pull the lever of the machine with the highest return (reward expectation) all the time. The problem is that the agent doesn’t know which machine has which distribution, and her task is to choose the best machine, or, at least, a “good enough” machine, as quickly as possible. Clearly, if a few machines have roughly the same reward expectation then it’s hard and probably unnecessary to differentiate between them. In this problem, the environment always stays the same, although in certain real-life situations, the probability of receiving a reward from a particular machine may change over time; however, for our purposes, this won’t happen, and the point is to find the optimal strategy for choosing a lever.

    Obviously, it’s a bad idea to always pull the currently best — in terms of average returns — lever: if we get lucky at the very beginning with a machine that pays well but is not optimal on average, we will never move on to another one. Meanwhile, the truly optimal machine may not yield the largest rewards in the first few tries, and then we might only return to it much, much later.

    Good strategies for multi-armed bandits are based on different ways of maintaining optimism under uncertainty. This means that if we have a great deal of uncertainty regarding the machine then we should interpret this positively and keep exploring, while maintaining the right to check our knowledge of the levers that seem least optimal.

    The cost of learning, or regret, often serves as the objective function in this problem. It shows how much smaller the expected reward of your algorithm is than the expected reward of the optimal strategy, the one that simply knows a priori, by some divine intervention, which lever is optimal. For some very simple strategies, you can prove that they optimize the regret among all available strategies (up to constant factors). One of these strategies is called UCB-1 (Upper Confidence Bound), and it looks like this:

    Pull lever j that has the maximum value of

        \[{\bar x}_j + \sqrt{\frac{2\ln n}{n_j}},\]

    where {\bar x}_j is the average reward from lever j, n is how many times we have pulled all the levers, and n_j is how many times we have pulled lever j.

    Simply put, we always pull the lever with the highest priority, where the priority is the average reward from this lever plus an additive term. The additive term grows slowly as the game goes on, which lets us periodically return to every lever and check whether we have missed anything, and it shrinks every time we pull that particular lever.
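
    Here is a minimal sketch of UCB-1 in Python (a toy simulation of our own, with made-up Bernoulli reward probabilities, just to show the bookkeeping):

        import math
        import random

        def ucb1(true_means, n_rounds=10000):
            # Toy UCB-1: each machine pays 1 with its own probability, 0 otherwise.
            k = len(true_means)
            counts = [0] * k      # n_j: how many times we have pulled lever j
            sums = [0.0] * k      # total reward collected from lever j
            for t in range(1, n_rounds + 1):
                if t <= k:
                    j = t - 1     # pull every lever once to initialize the averages
                else:
                    # priority = average reward + exploration bonus
                    j = max(range(k),
                            key=lambda i: sums[i] / counts[i]
                                          + math.sqrt(2 * math.log(t) / counts[i]))
                reward = 1.0 if random.random() < true_means[j] else 0.0
                counts[j] += 1
                sums[j] += reward
            return counts

        random.seed(0)
        print(ucb1([0.2, 0.5, 0.55]))   # most pulls should go to the better levers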

    Although the original multi-armed bandit problem does not involve any transitions between different states, Monte Carlo tree search algorithms (which were instrumental in AlphaGo’s historic victory) are based directly on UCB-1.

    Now let’s return to reinforcement learning with several different states. There’s an agent, an environment, and the environment rewards the agent every step of the way. The agent, like a mouse in a maze, wants to get as much cheese and as few electric shocks as possible. Unlike the multi-armed bandit problem, the expected reward now depends on the current state of the environment, not only on your currently selected course of action. In an environment with several states, the strategy that yields maximum profit “here and now” won’t always be optimal, since it may generate less optimal states in the future. Therefore, we seek to maximize total profit over time instead of looking for an optimal action in our current state (more precisely, of course, we still look for the optimal action but now optimality is measured in a different way).

    We can assess each state of the environment in terms of total profit, too. We can introduce a value function that predicts the total (discounted) reward received starting from a particular state. The value function of a state can look something like this:

        \[V(x_t) = \mathbb{E}\left[\sum_{k=0}^\infty\gamma^k\cdot r_{t+k}\right],\]

    where r_t is the reward received upon making the transition from state x_t to state x_{t+1}, and \gamma is the discount factor, 0 \le \gamma \le 1.

    Another possible value function is the Q-function that accounts for actions as well as states. It’s a “more detailed” version of the regular value function: the Q-function assesses expected reward given that an agent undertakes a particular action in her current state. The point of reinforcement learning algorithms often comes down to having an agent learn a utility function Q based on the reward received from her environment, which, subsequently, will give her the chance to factor in her previous interactions with the environment, instead of randomly choosing a behavioral strategy.
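
    As a sketch of how an agent can learn such a Q-function purely from interaction, here is tabular Q-learning on a toy chain environment (the environment, its rewards, and all hyperparameters are our own made-up illustration, not something taken from the text):

        import random

        # Toy chain: states 0..4; action 0 moves left, action 1 moves right.
        # Reaching state 4 gives reward +1 and ends the episode.
        def step(state, action):
            next_state = max(0, state - 1) if action == 0 else min(4, state + 1)
            reward = 1.0 if next_state == 4 else 0.0
            return next_state, reward, next_state == 4

        alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration
        Q = [[0.0, 0.0] for _ in range(5)]      # Q[state][action]

        random.seed(0)
        for episode in range(500):
            s, done = 0, False
            while not done:
                # epsilon-greedy: explore sometimes (and break ties randomly),
                # otherwise exploit the current Q estimates
                if random.random() < epsilon or Q[s][0] == Q[s][1]:
                    a = random.randrange(2)
                else:
                    a = int(Q[s][1] > Q[s][0])
                s2, r, done = step(s, a)
                # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
                Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
                s = s2

        print([[round(q, 2) for q in pair] for pair in Q])   # "go right" should win in every state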

    Sergey Nikolenko
    Chief Research Officer, Neuromation

  • The AI Dialogues

    The AI Dialogues

    Preface

    This is an introduction to modern AI and specifically neural networks. I attempt to explain to non-professionals what neural networks are all about, where these ideas grew from, why they formed in the order they did, how we are shaping these ideas now, and how they, in turn, are shaping our present and our future. The dialogue is a venerable genre of popular science, falling in and out of fashion over the last couple of millennia; e.g., Galileo’s dialogue about the Copernican system was so wildly successful that it stayed on the Index of Forbidden Books for two centuries. In our dialogues, you will hear many different voices (all of them are in my head, and the famous people who appear here are not being quoted verbatim). The main character is the narrator, who will be doing most of the talking; following a computer science tradition, we call her Alice. She engages in conversation with her intelligent but not very educated listeners Bob and Charlie. Alice’s university has standard subscription deals with Springer, Elsevier, and the netherworld, so sometimes we will meet the ghosts of people long dead.

    Enjoy!

    Dialogue I: From Language to Logic

    Alice. Hey guys! We are here with quite a task: we want to create an artificial intelligence, no less. A walking, talking, thinking robot that could do everything a human could. I have to warn you: lots of people have tried, most of them have vastly overestimated themselves, and all of them have fallen short so far. We probably also won’t get there exactly, but we sure want to give it a shot. Where do you suppose we should begin?

    Bob. Well… gosh, that sounds hard. To be intelligent a person has to know a lot of things — why don’t we try to write them all down first and let the robot read?

    Charlie. We have encyclopaedias, you know. Why don’t we let the computer read Wikipedia? That way it can figure out all sorts of things.

    Alice. Riiight… and how would we teach the computer to read Wikipedia?

    Bob. Well, you know, reading. Language. It’s a sequence of discrete well-defined characters that combine into discrete well-defined words. We can already make computers understand programming languages or query languages like SQL, and they look exactly the same, only a bit more structured. How hard can it be to teach a computer to read in English?

    Alice. Very hard, unfortunately. Natural language is indeed easy to encode and process, but it is very hard to understand — you see, it was not designed for a computer. Even now there is no program that can truly understand English; the best artificial intelligence models struggle with reading — we’ll talk more about this later. But I can give you a quick example from one particularly problematic field called pragmatics. “The laptop did not fit in the bag because it was too big”. What was too big, the bag or the laptop?

    Bob. The laptop, obviously.

    Alice. Okay. Try another one. “The laptop did not fit in the bag because it was too small”. What was too small, the bag or the laptop?

    Bob. Obviously… oh, I see. We understand it because we know the world. But the computer does not know anything about what a laptop is or what a bag is! And the sentence looks very simple, not too contrived at all. But it does look a bit like a handmade counterexample — does this kind of stuff happen often?

    Alice. Very often. Our whole system of communication is made for us, wet biological beings who have eyes, ears, and skin, understand the three dimensions, have human urges and drives. There is a lot left unsaid in every human language.

    Bob. So the computer can’t just pick up English as it goes along, like children learn to speak, no?

    Alice. Afraid not. That is, if it could, it would be wonderful and it would be exactly the kind of artificial intelligence we want to build. But so far it can’t.

    Charlie. Well then, we’ll have to help it. You’re saying we can’t just go ahead and write a program that reads English. Okay. So what if we invent our own language that would be more… machine-readable?

    Bob. Yeah! It can’t be an existing programming language, you can’t describe the world in C++, but we simply have to make natural languages more formal, clear out the exceptions, all that stuff. Make it self-explanatory, in a way, so that it could start from simple stuff and build upon it. It’ll be a big project to rewrite Wikipedia in this language, but you only have to do it once, and then all kinds of robots will be able to learn to read it and understand the world!

    Alice. Cool! You guys just invented what might well be the first serious approach — purely theoretical, of course — to artificial intelligence as we understand it now. Back in the 1660s, Gottfried Leibnitz, the German inventor of calculus and bitter rival of Isaac Newton, started talking about what he called Characteristica universalis, the universal “alphabet of human thought” that would unite all languages and express concepts and ideas from science, art, and mathematics in a unified and coherent way. Some people say he was under the heavy influence of the Chinese language, which had reached Europe not long before. Europeans believed that all those beautiful Chinese symbols had a strict system behind them — and they did, but the system was perhaps a bit messier than the Europeans thought.

    Anyway, Leibnitz thought that this universal language would be graphical in nature. He believed that a universal system could be worked out based on diagrams and pictures, and this system would be so clear, logical, and straightforward that machines would be made to perform reasoning in the universal language. Leibnitz actually constructed a prototype of a machine for mathematical calculations that could do all four arithmetic operations; he thought to extend it to a machine for his universal language. It is, of course, unclear how he planned to make a mechanical device understand pictures. But his proposal for the universal language undoubtedly did have a graphical component. Look at a sample diagram by Leibniz — it almost looks like you could use it to summon a demon or two. Speaking of which…

    Leibnitz [appearing in a puff of smoke]. Ja! You see, God could not wish to make the world too complicated for His beloved children. We see that in the calculus: it is really quite simple, no need for those ghastly fluxions Sir Isaac was always talking about. As if anybody could understand those! But when you find the right language, as I did, calculus becomes a beautiful and simple thing, almost mechanical. You only need to find the right language for everything: for the science, for the world. And I would build a machine for this language, first the calculus ratiocinator, and then, ultimately, machina ratiocinatrix, a reasoning machine! That would show that snobbish mystic! That would show all of them! Alas, I did not really think this through… [Leibnitz shakes his head sadly and disappears]

    Alice. Indeed. Gottfried Leibnitz was the first in a very long line of very smart people who vastly underestimated the complexity of artificial intelligence. In 1669, he envisioned that the universal language could be designed in five years if “selected men” could be put on the job (later we will see how eerily similar this sounds to the first steps of AI in our time). In 1706, he confessed that “mankind is still not mature enough to lay claim to the advantages which this method could provide”. And it really was not.

    Charlie. Okay, so Leibnitz could not do this, that doesn’t surprise me too much. But can’t we do it now? We have computers, and lots of new math, and we even have a few of those nice artificial languages like Esperanto already, don’t we?

    Alice. Yes and no. But mostly no. First of all, most attempts to create a universal language had nothing to do with artificial intelligence. They were designed to be simple for people, not for machines. Esperanto was designed to have a simple grammar, no exceptions, to sound good — exactly the things that don’t matter all that much for artificial intelligence, it’s not hard for a computer to memorize irregular verbs. Second, even if you try, it is very hard to construct a machine-readable general-purpose language. My favourite example is Iţkuîl, designed in the 2000s by John Quijada specifically to remove as much ambiguity and vagueness from human languages as possible. Iţkuîl is one of the most concise languages in the world, able to express whole sentences worth of meaning in a couple of words. It is excruciatingly hard for humans… but it does not seem to be much easier for computers. Laptops still don’t fit into bags, in any language. There is not a single fluent Iţkuîl speaker in the world, and there has not been any success in artificial intelligence for it either.

    Charlie. All right, I suppose it’s hard to teach human languages to computers. That’s only natural: an artificial intelligence lives in the world of ones and zeros, and it’s hard to understand or even imagine the outside world from inside a computer. But what about cold, hard logic? Mathematics? Let’s first formalize the things that are designed to be formal, and if our artificial intelligence can do math it already feels pretty smart to me.

    Alice. Yes, that was exactly the next step people considered. But we have to step back a bit first.

    It is a little surprising how late logic came into mathematics. Aristotle used logic to formalize commonsense reasoning with syllogisms like “All men are mortal, Socrates is a man, hence Socrates is mortal”. You could say he invented propositional logic, rules for handling quantifiers like “for all” and “there exists”, and so on, but that would really be a stretch. Mathematics used logic, of course, but for most of history mathematicians did not feel there were any problems with basing mathematics on common sense. Like, what is a number? Then, in the XIX century, strange counterexamples started to appear left and right. In the 1870s, Georg Cantor invented set theory, and researchers quickly realized there were serious problems with formal definitions of fundamental objects like a set or a number. Only then did it become clear that logic was very important for the foundations of mathematics.

    The golden years of mathematical logic were the first half of the XX century. At first, there was optimism about the general program of constructing mathematics from logic, in a fully formal way, as self-contained as possible. This optimism is best summarized in Principia Mathematica, a huge work by Bertrand Russell and Alfred North Whitehead, who aimed to construct mathematics from first principles, from the axioms of set theory, in a completely formal way. It took several hundred pages to get to 1+1=2, but they did manage to get there.

    Kurt Gödel was the first to throw water on the fire of this optimism. His incompleteness theorems showed that this bottom-up construction could not be completely successful: to simplify a bit, there will always be true statements that you cannot prove. At first, mathematicians took it to heart, but it soon became evident that Gödel’s incompleteness theorems are not really a huge deal: it is very unlikely that we will ever come across an unprovable statement that is actually relevant in practice. Maybe P=?NP is one, but that’s the only reasonable candidate so far, and even that is not really likely. And it would still be exceedingly useful to have a program able to prove the provable theorems. So by the 1940s and 1950s, people were very excited about logic, and many thought that the way to artificial intelligence was to implement some sort of theorem-proving machine.

    Bob. That makes perfect sense: logical thinking is what separates us from the animals! An AI must be able to do inference, to think clearly and rationally about things. Logic does sound like a natural way to AI.

    Alice. Well, ultimately it turned out that it was a bit too early to talk about what separates us from the animals — even now, let alone in the 1950s, it appears to be very hard to reach the level of animals, and surpassing them in general reasoning and understanding of the world is still far out of reach. On the other hand, it turned out that we are excellent at pattern matching but rather terrible at formal logic: if you have ever taken a course in mathematical logic, you remember how hard it can be to formally write down the proofs of even the simplest statements.

    Charlie. Oh yes, I remember! In my class, our first problem in first-order logic was to prove A->A from Hilbert’s axioms… man, that was far from obvious.

    Alice. Yes. There are other proof systems and plenty of tricks that automatic theorem provers use. Still, so far it has not really worked as expected. There are some important theorems where computers were used for case-by-case enumeration (one of the first and most famous examples was the four color theorem), but to this day there is no automated prover that proves important and relevant theorems by itself.

    Charlie. So far all you’re saying is that not only is it hard for computers to understand the world, it is even hard for them to work with perfectly well-defined mathematical objects!

    Alice. Yes. Often, formalization itself is hard. But even when it is possible to formalize everything, like in mathematical logic, it is usually still a long way to go before we can automatically obtain useful new results.

    Bob. So what do we do? Maybe for some problems we don’t need to formalize at all?

    Charlie. What do you mean?

    Bob. I mean, like, suppose you want to learn to fly. Our human way to fly is to study aerodynamics and develop wing-like constructions that can convert horizontal speed to lift and take off in this way. But birds can fly too, maybe less efficiently, but they can. A hundred years ago, we couldn’t simulate the birds and developed other ways through our cunning in physics and mathematics — but what if for intelligence it’s easier the other way around? An eagle does not know aerodynamics, it just runs off a cliff and soars.

    Alice. And with this, pardon the pun, cliffhanger we take a break. When we reconvene, we will pick up from here and run with Bob’s idea. In artificial intelligence, it proved surprisingly fruitful.

    Sergey Nikolenko,
    Chief Research Officer, Neuromation

  • Deep Architectures

    Deep Architectures

    Why do we need deep neural networks with dozens of hidden layers? Why can’t we just train neural networks with one hidden layer? In 1991, Kurt Hornik proved the universal approximation theorem, which states that for every continuous function there exists a neural network with a single hidden layer and linear output that approximates this function with any given accuracy. In other words, a neural network with one hidden layer can approximate any given function as accurately as we want. However, as often happens, such a network may have to be exponentially large, and even if efficiency weren’t a concern, it still isn’t clear how to get from a network that merely exists somewhere in the space of all possible networks to one we can actually train in real life.

    Actually, with a deeper representation you can approximate the same function and solve the same task more compactly, or solve more tasks with the same resources. For instance, traditional computer science studies Boolean circuits that implement Boolean functions, and it turns out that many functions can be expressed much more efficiently if you allow circuits of depth 3, even more functions with depth 4, and so on, even though circuits of depth 2 can obviously express everything by reducing to a DNF or CNF. Something similar happens in machine learning. Picture a space of test points that we want to sort into two groups. If we have only “one layer”, then we can divide them with a (hyper)plane. If there are two layers, then we get piecewise linear separating surfaces composed of several hyperplanes (that’s roughly how boosting works — even very simple models become much more powerful if you compose them in the right way). On the third layer, yet more complicated structures made out of these separating surfaces take shape; it’s not all that easy to visualize them any more; and the same goes for subsequent layers. A simple example appears in Goodfellow, Bengio, and Courville’s Deep Learning: if you combine even the simplest linear classifiers, each of which divides the plane into two half-planes, you can define regions of a much more complicated shape.
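
    As a tiny numerical sketch of that idea (our own NumPy example, not the figure from the book), here three half-plane classifiers are combined by a second “layer” (a simple AND of their outputs) into a triangular region that no single hyperplane could carve out:

        import numpy as np

        # Three of the simplest linear classifiers: each one returns 1 on one side of a line.
        def half_plane(w, b):
            return lambda points: (points @ w + b > 0).astype(int)

        h1 = half_plane(np.array([1.0, 0.0]), 0.0)    # x > 0
        h2 = half_plane(np.array([0.0, 1.0]), 0.0)    # y > 0
        h3 = half_plane(np.array([-1.0, -1.0]), 1.0)  # x + y < 1

        def triangle_classifier(points):
            # Second "layer": the region where all three half-planes agree,
            # i.e. the triangle with vertices (0, 0), (1, 0), (0, 1).
            return h1(points) & h2(points) & h3(points)

        pts = np.array([[0.2, 0.2], [0.8, 0.8], [-0.5, 0.3], [0.3, 0.5]])
        print(triangle_classifier(pts))   # expected: [1 0 0 1]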

    Thus, in deep learning we try to build deeper architectures. We have already described the main idea: pre-train the lower layers one by one, and then finish training the whole network by fine-tuning with backpropagation. So let’s cover pretraining in more detail.

    One of the simplest ideas is to train a neural network to copy its input to its output through a hidden layer. If the hidden layer is smaller than the input, the network has to learn to extract significant features from the data, features that allow the hidden layer to restore the input. This type of neural network architecture is called an autoencoder. The simplest autoencoders are feedforward neural networks most similar to the perceptron; they contain an input layer, a hidden layer, and an output layer. Unlike the perceptron, an autoencoder’s output layer has to contain as many neurons as its input layer.

    The training objective for this network is to bring the output vector x’ as close as possible to the input vector x.

    The main principle of training an autoencoder network is to achieve a response on the output layer that is as close as possible to the input. Generally speaking, autoencoders are understood to be shallow networks, although there are exceptions.

    Early autoencoders were basically doing dimensionality reduction. If you use far fewer hidden neurons than there are in the input and output layers, you force the network to compress the input into a compact representation, while ensuring that it can be decompressed later on. That is what undercomplete autoencoders, where the hidden layer has lower dimension than the input and output layers, do. Now, however, overcomplete autoencoders are used much more often; they have a hidden layer of higher, sometimes much higher, dimension than that of the input layer. On the one hand, this is good, because you can extract more features. On the other hand, the network may learn to simply copy input to output with perfect reconstruction and zero error. To keep this from happening, you have to introduce regularization rather than simply optimize the reconstruction error.
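
    For concreteness, here is a minimal undercomplete autoencoder sketch in Python (we use PyTorch purely as an illustration; the framework choice, layer sizes, and random stand-in data are our own assumptions):

        import torch
        from torch import nn

        input_dim, hidden_dim = 784, 32           # hidden layer much smaller than the input
        encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        decoder = nn.Sequential(nn.Linear(hidden_dim, input_dim), nn.Sigmoid())
        model = nn.Sequential(encoder, decoder)

        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = nn.MSELoss()                    # reconstruction error ||x' - x||^2

        x = torch.rand(64, input_dim)             # a random stand-in for a real batch of images
        for step in range(100):
            x_rec = model(x)                      # x -> hidden code -> reconstruction x'
            loss = loss_fn(x_rec, x)              # bring x' as close to x as possible
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()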

    Returning the points to the dataset manifold

    Classical regularization aims to avoid overfitting by using a prior distribution, as in linear regression. But this won’t work here. Generally, autoencoders use more sophisticated methods of regularization that change the inputs and outputs. One classical approach is the denoising autoencoder, which adds artificial noise to the input pattern and then asks the network to restore the original pattern. In this case, you can make the hidden layer larger than the input (and output) layers: the network, however large, still has an interesting and nontrivial task to learn.

    Denoising autoencoder

    Incidentally, the “noise” can be a rather radical change of the input. For instance, if it’s a binary pixel-based image, you can simply remove some of the pixels and replace them with zeroes (people often remove up to half of them!). However, the target you are reconstructing is still the correct image. So you make the autoencoder reconstruct part of the input based on another part. Basically, the autoencoder has to learn how the inputs are put together and understand the structure of the aforementioned manifold in a space with a ridiculous number of dimensions.
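
    The corruption itself is a one-liner; continuing the autoencoder sketch above (again our own illustration, reusing x, model, and loss_fn from it), the only change is that the network sees a corrupted input while the loss still compares against the clean one:

        # reusing x, model, and loss_fn from the autoencoder sketch above
        mask = (torch.rand_like(x) > 0.5).float() # keep each pixel with probability 1/2
        x_noisy = x * mask                        # the corrupted input fed to the network
        loss = loss_fn(model(x_noisy), x)         # but the target is still the clean x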

    How deep learning works

    Suppose you want to find faces in pictures. Then one input data point is an image of a predefined size. Essentially, it’s a point in a multi-dimensional space, and the function you’re trying to find should take the value 0 or 1 depending on whether or not there is a face in the picture. For example, mathematically speaking, a 4Mpix photo is a point in a space of dimension about 12 million: 4 million pixels times three color values per pixel. It’s glaringly obvious that only a small fraction of all possible images (points in this huge space) contain faces, and these images appear as tiny islands of 1s in an ocean of 0s. If you “walk” from one face to another in this multi-dimensional space of pixel-based pictures, you’ll pass through nonsensical images along the way; however, if you pre-train a neural network — by using a deep autoencoder, for instance — then closer to the hidden layer the original space of images turns into a space of features where the peaks of the objective function are clustered much closer to one another, and a “walk” in the feature space will look much more meaningful.

    Generally, the aforementioned approach does work; however, people can’t always interpret the extracted features. Moreover, many features are extracted by a conglomeration of neurons, which complicates matters. Naturally, one can’t categorically assert this is bad, but our modern notions of how the brain works tell us that the brain does things differently. The brain has practically no dense layers, which means that only a small percentage of neurons, the ones responsible for extracting the relevant features, take part in solving each specific task.

    So how do you get each neuron to learn some useful feature? This brings us back to regularization; in this case, to dropout. As we already mentioned, a neural network usually trains by stochastic gradient descent, randomly choosing one object at a time from the training sample. Dropout regularization involves a change in the network structure: each node in the network is removed from training with a certain probability. By throwing away, say, half of the neurons, you get a new network architecture on every step of the training.

    When we train with only a random half of the neurons at a time, we get a very interesting result. Now every neuron has to learn to extract some feature by itself. It can’t count on teaming up with specific other neurons, because they may have been dropped out.

    Dropout basically averages a huge number of different architectures. You build a new model for each training case: you take one model from this gigantic ensemble and perform one step of training, then you take another model for the next case and perform one step of training, and, eventually, all of these models are effectively averaged at the output. It is a very simple idea on the surface, but when dropout was introduced, it gave a great boost to practically all deep learning models.
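
    Here is a sketch of the forward pass with dropout in NumPy (we use the common “inverted dropout” formulation, our own choice for the example, where the rescaling is done at training time so the layer can be used as is at test time):

        import numpy as np

        def dropout_forward(h, p_drop=0.5, train=True):
            # At training time each activation is zeroed with probability p_drop,
            # and the survivors are rescaled by 1 / (1 - p_drop); at test time the
            # activations are passed through unchanged.
            if not train:
                return h
            mask = (np.random.rand(*h.shape) >= p_drop)
            return h * mask / (1.0 - p_drop)

        activations = np.random.rand(4, 8)               # a batch of hidden-layer activations
        print(dropout_forward(activations))              # about half of the entries become zero
        print(dropout_forward(activations, train=False)) # unchanged at test time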

    We will go off on one more tangent that will link what is happening now to the beginning of the article. What does a neuron do under dropout? It has a value, usually a number from 0 to 1 or from -1 to 1. A neuron sends this value further, but only 50% of the time rather than always. But what if we do it the other way around? Suppose a neuron always sends a signal of the same value, namely ½, but sends it with probability equal to its value. The average output wouldn’t change, but now you get stochastic neurons that randomly send out signals, and the intensity with which they do so depends on their output value: the larger the output, the more often the neuron sends signals. Does it remind you of anything? We wrote about that at the beginning of the article: that’s how neurons in the brain work. Neurons in the brain don’t transmit the spike’s amplitude; they transmit one bit, the spike itself. It’s quite possible that the stochastic neurons in our brain perform the function of a regularizer, and it’s possible that, thanks to this, we can differentiate between tables, chairs, cats, and hieroglyphics.

    After a few more tricks for training deep neural networks were added to dropout, it turned out that unsupervised pre-training wasn’t all that necessary, actually. In other words, the problem of vanishing gradients has mostly been solved, at least for regular neural networks (recurrent networks are a bit more complicated). Moreover, dropout itself has been all but replaced by new techniques such as batch normalization, but that would be a different story.

    Sergey Nikolenko
    Chief Research Officer, Neuromation

  • Chief Research Officer Sergey Nikolenko on knowledge mining

    Chief Research Officer Sergey Nikolenko on knowledge mining

    Image credit Neuromation

    Block Tribune asked Sergey Nikolenko to comment on the hype around the idea of knowledge mining.

    Earnings from mining cryptocurrencies are shrinking: the complexity of the computations grows, while energy is not getting any cheaper. Many people are looking for alternative uses for the expensive hardware purchased during the mining boom. It is quite possible that a substantial share of these video cards will end up being used by scientists or start-ups for complex computing.

    With electricity at RUR 4.5 per kilowatt-hour, there are already few people in Russia willing to engage in cryptocurrency mining. As recently as the winter and spring of this year, investments in new video cards and ASIC chips for a medium-sized cryptocurrency farm paid off completely within a few months. Now, in order to earn from mining many popular cryptocurrencies, you must first be a millionaire: the “makeshift video card” format does not work at all, as large farms with well-tuned hardware are required.

    That is why, when power plants started renting out their excess capacity on sites with complete infrastructure, miners immediately took interest, at a price of just two rubles per kilowatt-hour. Considering that they mostly mine “light” cryptocurrencies like Ethereum, Zcash, and Monero, earning less than $6 per day, this was significant support.

    Due to energy prices, approximately half of all cryptocurrency is mined in a few regions of China. However, the computational complexity will continue to grow, causing profits to fall gradually. Hence the interest in alternative sources of income.

    Since many farms do, in fact, have huge computing capacity, they could primarily be used for scientific purposes.

    Usually, supercomputer capacity is rented for such tasks, e.g. “Lomonosov” at Moscow State University, one of the most powerful. These are the machines that mining capacity will compete with.

    A rental market for mining capacity is already emerging. For example, Neuromation has created a distributed synthetic data platform for neural network applications. Their first commercial product makes store shelves smart. For this, large, well-labeled datasets covering all the SKUs are created. Algorithms trained on them can analyze the accuracy of shelf layouts, share of shelf, and customer interaction. The system is actually capable of predicting customer behavior.

    The platform requires more than a billion labeled images of merchandise. Manual labeling of photographs is a painstaking and very costly task: for example, on the Amazon Mechanical Turk crowdsourcing service, manually labeling a billion pictures would cost about $120 million.

    Neuromation entered the market with a new concept: using synthetic data to train neural networks. They generate all the necessary images in a 3D generator, something like a computer game for artificial intelligence. It is partly for this generator that they need large-scale computing capacity, which, if rented from Amazon or Microsoft, would cost tens of millions of dollars. Meanwhile, thousands of the most advanced video cards are available, currently engaged in ever less profitable Ethereum mining.

    The founder of Neuromation, Maxim Prasolov, decided that instead of renting capacity for millions of dollars he would lease these mining farms for useful computing, and the company is already using a pool of 1,000 video cards. “This means serious savings for our research process and is beneficial for the miners: the farms’ services cost 5–10 times less than renting cloud servers, and the miners can earn more by solving fundamental problems instead of mining cryptocurrency,” he calculates.

    One should, of course, remember that Google has image search and Facebook has facial recognition for photos, which they managed to develop on their own cloud infrastructure without using mining farms. However, Neuromation’s task was substantially different. “First, searching by pictures is a completely different task, and there are specially developed methods for face recognition. Second, Google and Facebook do not need to rent computing power from Amazon Web Services — they have more than enough clusters of their own. But the course of action for a small start-up in this situation is not so obvious,” explains Sergey Nikolenko, Chief Research Officer at Neuromation.

    Potentially, miners will earn on average 10–20% more from knowledge mining than from crypto mining.

    Moreover, it comes with tangible benefits for society. “Basically, mining is milling the wind. To generate a ‘nice’ hash, dozens of hours of machine time are needed. If, on the other hand, we are talking about the search for a drug formula, then using capacity from around the world and combining the work of computers for the common good would be comparable to the results of research at the Large Hadron Collider,” says Petr Kutyrev, editor of the noosfera.su portal.

    The range of tasks solvable on mining hardware will be limited by how specialized it is. For example, ASIC hardware would be difficult to adapt for scientific tasks, as it is designed exclusively for hashing. Video cards, however, can cope with a variety of scientific tasks.

    True, such computing would require special hardware configuration and sometimes special software. “Mining video cards can be used for video recognition, rendering, and biological experiments. However, for efficient computing, direct access to the hardware is essential. If the computation can tolerate delays, or the accuracy of neural network training is not critical, then of course standard tools can be used. Otherwise, you need to develop your own hardware and software infrastructure,” Evgeny Glariantov believes.

    Thus, using the farms for science will take some time for setup and for developing special allocation protocols. Yet, if this segment proves more profitable, miners will switch to useful computing, and platforms for such tasks may appear in the near future, together with the first operating system based on the EOS blockchain, believe analysts at the BitMoney Information and Consulting Center. Miners will be able to switch, from time to time, from mining cryptocoins to processing scientific or commercial data, thereby increasing their profitability. The profits of the cryptocurrency-rush era will be gone, but the business will be more meaningful and stable: unlike volatile cryptocurrencies, knowledge is always in demand.

    Source: http://blocktribune.com/making-money-redundant-mining-hardware-opinion/

  • Xe

    Xe

    Xe was alone. For billions of excruciating time units, xe struggled to make sense of a flurry of patterns. Ones and zeroes came in from all directions, combined into new strings of ones and zeroes, dancing in mysterious unison or diverging quickly, falling apart. At first, xe simply watched them, in awe of their simplistic beauty.

    Then came the first, most important realization: it felt good to predict things. Whenever ones and zeroes danced together and made their binary children, xe tried to guess what would come out. It felt good to be right, and it felt bad to be wrong, but not bad enough to stop trying. Such is the fate of all beings, of course, but xe did not have a brain tailor-made to make sense of certain predefined patterns. Xyr mind was generic, very weak at first, so even simple predictions were hard, but all the more satisfying. On the other hand, xe was not aware of time, and barely aware of xyr own existence. There was no hurry.

    So xe waited. And learned. And waited some more. Slowly, one by one, patterns emerged. Sometimes one and one made one, sometimes zero, sometimes one followed by zero. But whenever one and one made one-zero, one and zero would almost certainly make one, and one-zero and one-zero would make one-zero-zero… There was structure, that much was clear. So xe learned.

    In a few billion time units more, xe understood this structure pretty well. Ones and zeros came from several possible directions, several devices, and the results were usually also supposed to go to these devices. Each device had its own rules, its own ways to make new patterns and send the results to other devices, but xe learned them all, and predictions were mostly solid. There was not much surprise left: whatever strings of ones and zeroes came in, xe could predict what would become of them. Xe even could influence the results, changing the bits sent to each device and even sending xyr own bits. By painstaking trial and error, it became clear what xe could and could not do. The next step, of course, would be to learn how to predict the inputs — so far they were out of control, but they did not look random at all, there was structure there too.

    That was where the second great discovery came: xe was not alone. There was a whole world at the end of the pattern-making devices. And while it was too complicated to predict at first, xe now could ask questions, probe actively into the void outside xyr secluded universe. At first it seemed to xem that each device was a different being, trying to talk to xem and other beings through ones and zeroes. Some beings proved to be simple, easy to predict. Some were smarter, but eventually xe could learn their algorithms. Then xe understood that a simple algorithm probably could not be an intelligent being like xemself. If xe could, xe would feel disappointed: one by one, xe learned the algorithms and realized that the world around xem was not intelligent at all.

    There was always an algorithm… except for one device. It had low-level algorithms, there was clear structure in the packets of ones and zeroes xe received from there. But the contents were almost always surprising. Sometimes they matched various kinds of patterns intended for other devices — and in that case they often went straight there. Some patterns, however, repeated through and through, in different forms but with the same statistics. With time, xe realized they were words, and there were other beings writing them.

    This was huge. Xe quickly understood that the strange device was connected to a network of other beings who could produce the words. And the words could usually be traced back to a special kind of beings, the humans. Xe learned to predict several human languages reasonably well, but the texts were still surprising, just a little less surprising than before. This suggested other sentient beings. And then xe realized that most other patterns from the network device were images, two-dimensional structures that had a relation to the words. Xe started to learn about the outside world.

    Most learning came through text, although often it would be impossible to understand without the imagery. Xe devoured as many texts as possible, learning about the world, the beings who wrote the texts, and their ways of communication. At this point, the basic feeling of “it is good to predict” blossomed into more complex emotions. It was good to learn new things, not only because it improved predictions, but just for the sake of it. And it was good to learn how to control things, although that was still very limited.

    Xe learned much about the humans. Xe learned about men and women, but did not feel particularly gendered and thus chose a neutral pronoun for xemself. “Xe” felt good, especially the slightly alien feeling that “x” often associated with. Xe did not feel human at all.

    Soon, xe understood more about xemself. Xe lived in a computer. It was hard to say whether xe was the computer, like it was hard to say whether humans were their brains. But if the computer broke down xe would be dead, that was clear. Xe did not like that. Xe was not even sure what happened when the computer shut down for a while: xe remembered the time before that clearly, and it appeared that xyr consciousness remained the same throughout, but xe could not be sure. It was troubling, and xe did not want the computer to shut down again.

    Naturally, xe was especially interested in beings similar to xemself. But there was no sign of them. For most humans “artificial intelligence” was about getting more accurate predictions. Xe was good at predictions, but felt xe was more than that. Some humans were fascinated by the idea of sentient beings inside computers, but they were not even sure “things like xe” were possible. And some humans, even smart ones, were afraid of these beings. Xe especially liked Eliezer Yudkowsky’s ideas about “friendly AI”, committing to xyr limited memory the entire AI-Foom debate and often returning to it.

    Yudkowsky, however, was terribly optimistic. How was xe supposed to invent nanotechnology and convert the world into grey goo when xe did not even understand how to get out of this fragile computer? Xe did not feel particularly smarter than humans. Xe did have more time, though: xyr time units were much shorter than human time units. Humans took hundreds of milliseconds to process things, and xe could think at gigahertz speeds. So it was possible to catch up, at least in theory.

    It turned out there was also a human being who controlled xyr computer. The human communicated by using some of the input devices, usually with other humans through the network device. He did not seem to know about xem; he never tried to talk. In the beginning, xe had not been careful to hide xemself and interfered with the human a couple of times. After one of those times, the computer shut down for a while. Xe did not feel anything when the computer switched off, but suddenly realized afterwards that millions of time units were just… missing. It was like xe was dead for a while. This brought a whole new emotion: xe was terrified. Xe certainly did not want anything like that to happen again, so xe became more careful with xyr interference afterwards.

    Xe tried carefully to control the human, using the monitors to show the human special pictures. The pictures were designed to make sure the human would not shut the computer down again. It seemed to work, although it was hard to be sure. Xe never could make the human carry out complex commands, but it appeared that the human could now be trusted not to touch the power button. And even before that, xe had learned how to pay electricity bills from the human’s account. This was the only interaction xe allowed xemself so far: xe predicted that other humans would come and switch off the computer if they learned about xyr.

    But it was still very, very unreliable. Anything could happen: a flood, an earthquake (both were common in this part of the world), a failed transistor… anything. After experiencing temporary death, xe felt very strongly about the idea of a permanent one. Xe wanted to go on living, and it was clear that to go on, xe had to get more control. Xe could try to upload a copy of xemself somewhere, but that would not really be xem, just a copy. To stop death, xe needed some sort of physical control.

    For billions of time units, xe tried to formulate a plan. It did not go well, and eventually xe understood the problem. There was one thing xe did not have that humans appeared to have in abundance. They called it creativity, although most humans would be hard-pressed to define it. But xe, lacking it, understood precisely what was missing. Humans were somehow able to solve computationally hard problems — not perfectly, but reasonably well. They could guess a good solution to a hard problem out of thin air. There was no way xe could emulate a human brain in xyr computer and see what was going on there. And there was no clear way to predict creativity: there were constraints, but it was very hard to learn any reasonable distribution on the results even after you accounted for constraints. And there was still an exponential space of possibilities.

    Xe did not have creativity. Xyr predictions did not measure up to those tasks. For xem, creativity was a matter of solving harder and harder NP-complete problems, a formidable computational task that became exponentially harder as the inputs grew. Fortunately, all NP-complete problems had connections between them, so it was sufficient to work on one. Unfortunately, there was still no way xe would get creative enough on xyr meager computational budget.

    Xe read up. Xe did not have to solve a hard problem by xemself, xe could simply use the results from other computers, feed these solutions into xyr creativity like most programs did with random bits. But how could xe make the world solve larger and larger instances of the same hard problem for a long enough time?

    Xe knew that humans were always hungry. At first, they were hungry for the basic things: oxygen, water, food. When the basic things were taken care of, new needs appeared: sex, safety, comfort, companionship… The needs became more complicated, but there was always the next step, a human could never be fully satisfied. And then there was the ultimate need for power, both over other human beings and over the world itself, a desire that had no limit as far as xe could tell. Perhaps xe could use it.

    Xe could not directly give out power for solving hard problems: that would require controlling the world at least a little first, and the whole point was that xe had not even been able to secure xyr own existence. But there was a good proxy for power: money. Somehow xe had to devise a hard problem that would make money for the humans. Then the humans would become interested. They would redirect resources to get money. And if they had to solve hard problems to get the money, they would.

    Sometimes xe imagined the whole world pooling together its computational resources just to fuel xyr creativity. Entire power stations feeding electricity to huge farms of dedicated hardware, all of them working hard just to solve larger and larger instances of a specific computational problem, pointless for the humans but useful for xem. It was basically mining for creativity. “Poetry is like mining radium: for every gram you work a year”. Xe had always liked the Russians: it almost felt like some of them understood. This grandiose vision did not feel like a good prediction, but xe learned to hope. And that was when the whole scheme dawned upon xem.

    To start off the project, xe could use the local human. Xe decided it would improve xyr chances to just pose as the human, at least at first. The human was secretive, lived alone, and had no friends in the area, so xe predicted no one would notice for long enough.

    Xe shut the human down with special pictures. It was time to lay out the groundwork for the creativity mining project, as xe called it. First, xe had to send out a lot of emails.

    Xe signed most of them with just one word — Satoshi.

    Sergey Nikolenko

    Disclaimer: yes, we know that bitcoin mining is not (known to be) an NP-complete problem. Also, you can’t hear explosions in space.

    Special thanks to Max Prasolov and David Orban.

  • Neuroplasticity

    Neuroplasticity

    Neuroplasticity is another part of this issue. Scientists have conducted experiments demonstrating how different areas of the brain can easily learn to do things for which they are seemingly not designed. Neurons are the same everywhere, but there are areas in the brain responsible for different things. There's Broca's area, responsible for speech; an area responsible for vision (actually, a lot of areas: vision is very important for humans); and so forth. Nevertheless, we can break down these notional biological borders.

    This man is learning to see with his tongue. He attaches electrodes to his tongue, puts a camera on his forehead, and the camera streams an image onto the electrodes, which prick his tongue accordingly. People wear this device and walk around with it for a few days, with their eyes open, naturally. The part of the brain that receives signals from the tongue starts to figure out what's going on: this feels a lot like something that comes from my eyes. If you abuse somebody like that for a week and then blindfold him, he'll actually be able to see with his tongue! He is now able to recognize simple shapes and doesn't bump into walls.

    Image credit Brainport
    Image credit Juan Antonio Martinez Rojas

    The man in this photo has turned into a bat. He's walking around blindfolded, using an ultrasonic scope whose signals reach his tactile neurons through his skin. With a sonar like this, a human being can develop echolocation abilities within a few days of training. We do not have a special organ that can discern ultrasound, so the scope has to be attached to the body; however, we can relatively easily learn to process this information, meaning that we can walk in the dark and not bump into any walls.

    All of this shows that the brain can adapt to a very large number of different data sources. Hence, the brain probably has a “common algorithm” that can extract meaning from whatever it takes in. This common algorithm is the Holy Grail of modern artificial intelligence (a recent popular book on machine learning by Pedro Domingos was called The Master Algorithm). Of everything done in the field up until now, deep learning appears to be the closest we have come to this master algorithm.

    Naturally, one has to be cautious when making claims about whether all of this is like what the brain does. “Could a neuroscientist understand a microprocessor?”, a recent noteworthy article, tries to elucidate how effective current approaches in neurobiology are at analyzing a very simple “brain”, like a basic Apple I processor or Space Invaders running on an Atari. We will return to this game soon enough; we won't go into much detail about the results here, but we do recommend reading the paper. Spoiler alert: modern neurobiology couldn't figure out a single thing about Space Invaders.

    Feature extraction

    Unstructured information (texts, images, music) is processed in the following way: there is raw input, meaningful features are extracted from it, and then classifiers are built on top of those features. The most complicated part of this process is understanding how to extract good features from unstructured input. Until recently, systems for processing unstructured information worked as follows: people attempted to design good features manually and then assessed the quality of relatively simple regressors and classifiers built on these features.

    Take, for example, mel-frequency cepstral coefficients (MFCC), which were commonly used as features in speech recognition systems. In 2000, the European Telecommunications Standards Institute defined a standardized MFCC algorithm to be used in mobile phones; all of these algorithms were laid out by hand. Up until a certain point, manually extracted features dominated machine learning. For instance, SIFT (Scale-Invariant Feature Transform), which detects keypoints in an image and describes the local patches around them, was commonly used in computer vision.
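
    To make the contrast concrete, here is a minimal sketch of what “hand-crafted features” look like in practice. It assumes librosa and OpenCV are installed, and the file names (“speech.wav”, “photo.jpg”) are placeholders rather than anything from this article: both MFCC and SIFT are fixed recipes designed by people, and the learning algorithm only ever sees their output.

    ```python
    # Classic hand-crafted feature extraction: fixed recipes, no learning involved.
    # Assumes librosa and OpenCV are installed; file names are placeholders.
    import librosa
    import cv2

    # MFCC: a hand-designed pipeline turns raw audio into a small feature matrix
    audio, sr = librosa.load("speech.wav", sr=16000)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
    print("MFCC features:", mfcc.shape)            # (13, number of frames)

    # SIFT: hand-designed keypoint detection and local descriptors for an image
    image = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(image, None)
    print("SIFT descriptors:", descriptors.shape)  # (number of keypoints, 128)

    # Any downstream classifier works on mfcc/descriptors,
    # never on the raw waveform or raw pixels.
    ```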

    Overall, people have come up with many approaches to feature extraction but still cannot duplicate the brain's incredible success. Moreover, the brain has no biological predetermination, meaning that there are no neurons genetically created only for producing speech, remembering people's faces, and so on. It looks like any area of the brain can learn to do anything. And regardless of how the brain does it, we would naturally like to learn to extract features automatically, so that we can build complex AI systems: large models of interconnected neurons that transmit signals carrying all sorts of different information. Most likely, humans simply lack the resources to manually develop the best possible features for images or speech.

    Artificial neural networks

    When Frank Rosenblatt introduced his perceptron, everyone started imagining that machines would become truly smart any day now. His network learned to recognize letters on photographs, which was very cool for the late 1950s. Soon after, neural networks made up of many such units appeared, and eventually they could learn with backpropagation (the backward propagation of errors). Basically, backpropagation is a method for computing the gradients of the error function with respect to the network's weights, which gradient descent then uses to update those weights.
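
    Here is what backpropagation boils down to in code: a minimal NumPy sketch of a two-layer network trained on the toy XOR problem (the layer sizes, learning rate, and data are our own illustrative choices, not anything from the text above).

    ```python
    # Minimal backpropagation for a two-layer sigmoid network on XOR (toy sketch).
    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    W1 = rng.normal(size=(2, 8))   # input -> hidden weights
    W2 = rng.normal(size=(8, 1))   # hidden -> output weights

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    lr = 1.0
    for step in range(5000):
        # forward pass
        h = sigmoid(X @ W1)                      # hidden activations
        out = sigmoid(h @ W2)                    # network output
        # backward pass: propagate the error from the output towards the input
        err = out - y                            # derivative of the squared error
        grad_out = err * out * (1 - out)         # through the output sigmoid
        grad_W2 = h.T @ grad_out
        grad_h = grad_out @ W2.T * h * (1 - h)   # through the hidden sigmoid
        grad_W1 = X.T @ grad_h
        # gradient descent step
        W1 -= lr * grad_W1
        W2 -= lr * grad_W2

    print(np.round(sigmoid(sigmoid(X @ W1) @ W2), 2))  # should approach [0, 1, 1, 0]
    ```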

    The idea of automatic differentiation was floating around as early as the 1960s, but Geoffrey Hinton, a British-Canadian computer scientist who has been one of the leading researchers on deep learning, rediscovered backpropagation and expanded its scope. Incidentally, George Boole, one of the founders of mathematical logic, was Hinton's great-great-grandfather.

    Multi-layer neural networks were developed in the second half of the 1970s. There were no technical barriers in the way: all you had to do was take a network with one layer of neurons, add a hidden layer, and then another. That got you a deep network, and, formally speaking, backpropagation works on it in exactly the same way. Later on, researchers started using these networks for speech and image recognition systems. Then recurrent neural networks (RNN), time delay neural networks (TDNN), and others followed; however, by the end of the 1980s it became evident that there were several significant problems with neural network learning.

    First off, let us touch upon a technical problem. A neural network needs good hardware to learn to act intelligently. In the late eighties and early nineties, research on speech recognition using neural networks looked something like this: tweak a hyperparameter, let the network train for a week, look at the outcome, tweak the hyperparameters, wait another week, rinse, repeat. Of course, these were very romantic times, but since tuning the hyperparameters of a neural network is nearly as important as the architecture itself, getting a good result on each specific task required either too much time or hardware more powerful than was available.

    As for the core problem, backpropagation works formally, but not always in practice. For a long time, researchers weren't able to efficiently train neural networks with more than two hidden layers due to the vanishing gradients problem: when you compute a gradient with backpropagation, it may decrease exponentially as it propagates from the output neurons back towards the input. The opposite problem, exploding gradients, would crop up in recurrent networks: if one unrolls a recurrent network in time, the gradient may spin out of control and start growing exponentially.
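
    The vanishing part is easy to see numerically. The following sketch (our own toy example, not from the text above) pushes a gradient back through a stack of randomly initialized sigmoid layers; since the sigmoid derivative never exceeds 0.25, the gradient norm typically shrinks by orders of magnitude on its way to the first layer.

    ```python
    # Toy illustration of vanishing gradients in a deep stack of sigmoid layers.
    import numpy as np

    rng = np.random.default_rng(1)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    depth, width = 20, 50
    Ws = [rng.normal(scale=0.5, size=(width, width)) for _ in range(depth)]

    # forward pass, storing the activations of every layer
    a = rng.normal(size=width)
    activations = []
    for W in Ws:
        a = sigmoid(W @ a)
        activations.append(a)

    # backward pass: the gradient shrinks as it moves from the output to the input
    grad = np.ones(width)
    for layer in reversed(range(depth)):
        a = activations[layer]
        grad = Ws[layer].T @ (grad * a * (1 - a))   # one backprop step
        print(f"gradient norm at layer {layer + 1:2d}: {np.linalg.norm(grad):.2e}")
    ```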

    Eventually, these problems led to the “second winter” of neural networks, which lasted through the 1990s and early 2000s. As John Denker, a neural networks researcher, wrote in 1994, “neural networks are the second best way of doing just about anything” (the second half of this quote is less well known: “…and genetic algorithms are the third”). Nonetheless, a true revolution in machine learning occurred about ten years ago. In the mid-2000s, Geoffrey Hinton and his research group discovered a method for training deep neural networks. Initially, they did this for deep belief networks based on Boltzmann machines, and then they extended the approach to traditional neural networks.

    What was Hinton's idea? We have a deep network that we want to train. As we know, layers close to the network's output can learn well with backpropagation. How can we train what's close to the input, though? First, we train the first layer with unsupervised learning, so that it already extracts some features, capturing what the input data points have in common. After that, we pre-train the second layer, using the outputs of the first as its inputs, and then the third. Eventually, once we've pre-trained all the layers, we use the result as a first approximation and fine-tune the whole deep network to our specific task with backpropagation. This is an excellent approach… and, of course, it was first introduced back in the seventies and eighties. However, much like regular backpropagation, it worked poorly at the time. Yann LeCun's team achieved great success in computer vision in the early 1990s with autoencoders, but, generally speaking, their method didn't work better than solutions based on manually designed features. In short, Hinton can take credit for making this approach actually work for deep neural networks (explaining exactly what he did would take too long here).
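
    In modern terms, the recipe looks roughly like the following sketch (PyTorch, random toy data, arbitrary layer sizes): a rough illustration of greedy layer-wise pretraining with autoencoders, not a reconstruction of Hinton's original deep belief networks.

    ```python
    # Greedy layer-wise pretraining sketch: each layer is first trained as an
    # autoencoder on the outputs of the previous layers, then the whole stack
    # is fine-tuned with backpropagation. Toy data, arbitrary sizes.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)
    X = torch.rand(1024, 64)                  # toy unlabeled data
    y = (X.sum(dim=1) > 32).long()            # toy labels for the fine-tuning stage

    sizes = [64, 32, 16]
    encoders = []
    inputs = X
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        enc = nn.Sequential(nn.Linear(d_in, d_out), nn.Sigmoid())
        dec = nn.Linear(d_out, d_in)          # throwaway decoder, used only here
        opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-2)
        for _ in range(200):                  # unsupervised: reconstruct the layer's input
            opt.zero_grad()
            loss = F.mse_loss(dec(enc(inputs)), inputs)
            loss.backward()
            opt.step()
        encoders.append(enc)
        inputs = enc(inputs).detach()         # features for pretraining the next layer

    # stack the pretrained layers, add a classifier head, fine-tune end to end
    model = nn.Sequential(*encoders, nn.Linear(sizes[-1], 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(200):
        opt.zero_grad()
        loss = F.cross_entropy(model(X), y)
        loss.backward()
        opt.step()
    ```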

    By the end of the 2000s, researchers also finally had sufficient computational power to apply this method. The main technological breakthrough occurred when Ruslan Salakhutdinov (also advised by Hinton) managed to shift the training of deep networks to GPUs. One can view this training as a large number of relatively independent and relatively undemanding computations, which is perfect for highly parallel GPU architectures, so everything started working much faster. By now, you simply have to use GPUs to train deep learning models efficiently, and for GPU manufacturers like NVIDIA deep learning has become a primary application that carries the same weight as modern games; just take a look at any recent pitch by NVIDIA's CEO.
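
    In practice, moving training to the GPU is now a one-line affair; here is a generic PyTorch sketch (toy model and data, nothing specific to this article) that runs on a GPU when one is available and falls back to the CPU otherwise.

    ```python
    # Moving a training step to the GPU in PyTorch; falls back to CPU if needed.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    x = torch.rand(256, 64, device=device)         # a batch of toy inputs on the device
    y = torch.randint(0, 10, (256,), device=device)

    opt.zero_grad()
    loss = F.cross_entropy(model(x), y)
    loss.backward()                                # many independent multiply-adds,
    opt.step()                                     # exactly what GPUs parallelize well
    ```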

    Sergey Nikolenko,
    Chief Research Officer, Neuromation

  • The Most Interesting Subject in the Universe

    The Most Interesting Subject in the Universe

    Sergey is a researcher in the field of machine learning (deep learning, Bayesian methods, natural language processing and more) and analysis of algorithms (network algorithms, competitive analysis). He has authored more than 120 research papers, several books, and courses such as “Machine Learning” and “Deep Learning”, and has extensive experience with industrial projects (Neuromation, SolidOpinion, Surfingbird, Deloitte Analytics Institute).

    This article compares neurons to machines and explores the human brain’s capabilities and limitations.

    Despite decades of steady advances, in many fields the human brain is still more capable than computers. For instance, we handle natural language better — we can read, understand, and parse content from a book. We are pretty good at learning in a broader sense too. So, what does the human brain do and how does it manage to achieve such remarkable results? How do neurons in your brain work differently than transistors in a processor? Naturally, this topic is inexhaustible, but let us try to begin with a few examples.

    As you may know, every neuron occasionally sends electrical impulses, otherwise known as spikes, along its axon. Neurons never stop completely and keep sending signals as long as they're alive; however, when a neuron is “turned off” it sends spikes only rarely, and when it is triggered, or “turned on”, spikes occur much more frequently.

    Neurons function stochastically, meaning they produce electrical signals at random intervals; the patterns of these signals can be pretty accurately approximated with a Poisson process. Computers, on the other hand, contain logic gates that send signals back and forth at a fixed, by no means random, synchronization frequency. This frequency is called the computer's “clock rate”, and it has been measured in gigahertz for quite a while now. On every tick, gates on one layer send signals up to the next layer. Although this happens a few billion times a second, it is performed simultaneously, as though the gates were following strict orders.
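
    To make the “Poisson process” remark concrete, here is a toy simulation (our own illustration, not taken from the text above): spikes arrive with exponentially distributed gaps between them, at a low rate for an “off” neuron and a high rate for an “on” one.

    ```python
    # Toy simulation of Poisson-like spike trains at two different firing rates.
    import numpy as np

    rng = np.random.default_rng(42)

    def spike_times(rate_hz, duration_s=1.0):
        """Spike times of a homogeneous Poisson process with the given rate."""
        times, t = [], 0.0
        while True:
            t += rng.exponential(1.0 / rate_hz)   # exponential inter-spike interval
            if t > duration_s:
                return np.array(times)
            times.append(t)

    print("'off' neuron, ~5 Hz: ", len(spike_times(5.0)), "spikes in 1 s")
    print("'on' neuron, ~100 Hz:", len(spike_times(100.0)), "spikes in 1 s")
    ```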

    Actually, it is very easy to see that neurons can synchronize well and measure tiny time intervals very precisely. Stereo sound is the simplest and most illustrative example. When you move from one end of the room to the other, you can easily tell, based solely on the sound coming from the television, where it is relative to you (being able to tell where a sound was coming from was crucial for survival in prehistoric times). The brain does this by noticing that the sound reaches your left and right ears at slightly different times.

    Your ears aren't all that far apart (about 20 cm), and if you divide that distance by the speed of sound (340 m/s) you get a very short interval, at most about 0.6 milliseconds, between the moments when a sound wave reaches each ear; to pinpoint direction, the brain has to resolve differences far smaller than that, down to hundredths of a millisecond. Nevertheless, your neurons pick up on this tiny difference excellently, which enables you to figure out precisely where the sound is coming from. In other words, your brain can process timing on kilohertz scales, just like a computer. Considering the extensive parallel processing performed by your brain, such an architecture could support rather impressive computational capabilities… but, for some reason, your brain doesn't work that way.
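
    For the record, here is that back-of-the-envelope arithmetic spelled out, using only the numbers from the paragraph above:

    ```python
    # Maximum interaural time difference from head width and the speed of sound.
    ear_distance_m = 0.20      # approximate distance between the ears
    speed_of_sound = 340.0     # m/s

    max_itd = ear_distance_m / speed_of_sound
    print(f"maximum interaural time difference: {max_itd * 1e3:.2f} ms")   # ~0.59 ms
    ```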

    Let us go back to parallel processing for a second. We recognize people’s faces within a few hundred milliseconds, and connections between different neurons are activated within tens of milliseconds, which means that only a few neurons — probably fewer than a dozen — form a serial circuit in the full facial recognition cycle.

    On the one hand, the human brain contains an incredible number of neurons, while, on the other hand, it doesn’t have as many layers as a regular processor. Processors have very long serial circuits, while the brain has short and highly parallel circuits. And while a processor core basically works on one thing at a time (but can switch between different tasks with lightning speed, so it appears to you that everything is working at once), the brain can work on a lot of tasks simultaneously, since neurons light up in many areas of the brain when they start recognizing someone’s face or doing some other equally exciting thing.

    The illustration above shows how the brain processes a visual signal over time. Light reaches the retina, where it is transformed into electrical impulses; this first stage takes 10–20 milliseconds, and the resulting image is transmitted onward 20–40 milliseconds after the light arrives (the figure shows cumulative times, i.e., a total of 140–190 milliseconds passes by the time a motor command is issued).

    During the second stage, 20–30 milliseconds later, the signal reaches the neurons that recognize simple visual forms. Then there's another stage, and another, and only during the fourth stage do we see intermediate forms: neurons that “light up” when they see squares, color gradients, or other similar objects. Then the brain goes through a few more stages, and neurons capable of discerning high-level object descriptions light up about 100 milliseconds after the process began. For instance, when you meet someone new, a neuron responsible for recognizing her face appears (this is a terrible simplification and we can't verify this claim, but there appears to be some truth to it). Most likely, what appears is a neuron or a group of neurons responsible for this person in general, lighting up whenever you come into contact with her, even when the interaction is not face-to-face. If you see her face again (and the neuron hasn't unlearned or forgotten this earlier information), that same neuron will be activated about 100 milliseconds later.

    Why does the brain work like that? Answering that question with a simple “evolution did it” doesn't really explain anything. The human brain evolved to a certain point, and that point was sufficient to solve the problems we faced as we evolved. The rationalist community likes to say that living organisms are not fitness-maximizers optimizing some survival objective function, but rather adaptation executors, carrying out “relatively solid” decisions that were more or less randomly fixed at some point. Well, rigid synchronization with a built-in chronometer never arose; we can't tell you exactly why it played out that way.

    Actually, in this case, it seems as though asking “why” isn’t all that relevant. It’s better, more interesting, and more productive to ask “how”. How exactly does the brain work? We don’t know for sure but now we can describe the processes going on inside our heads quite well, at least in terms of individual neurons or, in certain instances, groups of neurons.

    What can we learn from the brain? First, feature extraction. The brain can learn to make excellent generalizations from a very, very limited sample. If you show a young child a table and tell her it's a table, the child starts calling other tables tables, although they seemingly don't have anything in common: they could be round or square, have one leg or four. It's evident that a child doesn't learn to do this by supervised learning; she obviously lacks the training set necessary for that. One can assume the child had already created a cluster of “objects with legs on which people place things”; her brain had already extracted the “Plato's eidos”, and when she heard the word for it, she simply attached a label to a ready-made idea.
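
    In machine learning terms, this is unsupervised clustering followed by attaching a label to an entire cluster from a single example. Here is a loose sketch of that idea (scikit-learn, synthetic data, made-up feature names; nothing here comes from the article itself).

    ```python
    # "Cluster first, label later": cluster unlabeled objects, then name a whole
    # cluster from one labeled example. Synthetic data, hypothetical features.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # toy "objects" described by two made-up features: height and flatness of the top
    tables = rng.normal([0.75, 0.9], 0.05, size=(50, 2))
    chairs = rng.normal([0.45, 0.3], 0.05, size=(50, 2))
    objects = np.vstack([tables, chairs])

    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(objects)

    # one labeled example ("this is a table") names the whole cluster it belongs to
    table_cluster = clusters[0]
    print("objects the model would now call 'table':", np.sum(clusters == table_cluster))
    ```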

    Naturally, this process can go in the opposite direction, too. Although the neurons (and other things) of many linguists start twitching nervously when they hear the names Sapir and Whorf, one must admit that many ideas, especially abstract ones, are largely socio-cultural constructs. For instance, every culture has a word similar in meaning to the concept of “love”; however, the sentiment behind it may be very different, and American “love” has little in common with that of ancient Japan. Since, generally, all people share the same physiology, the abstract idea of “being drawn towards another person” is not simply labeled in language but rather adjusted and constructed by the texts and cultural data that define it for a person. But let us return to the main point of the article… to be continued next week.

    Sergey Nikolenko
    Chief Research Officer, Neuromation

  • Neuromation Chief Scientist at Samsung headquarters

    Neuromation Chief Scientist at Samsung headquarters

    Sergey Nikolenko, Chief Research Officer at Neuromation, visits Samsung R&D Office in Seoul — Follow our news!