9 July 2019
Introduction. Module system
In this series of posts I will try to figure out for myself:
- What is Node.js?
- Why does it exist and why is it popular?
- How to develop backend using Node.js?
I should say that I already have some experience with Node.js, but it's not well structured: I'm not sure about the main concepts, best practices, etc. It means that this journey should be breathtaking! (Some guy from the crowd: “You're breathtaking!”)
One more warning: I prefer to read sources instead of docs. There will be lots of source code and not so many links to docs. But you choose what you prefer.
Final warning: Patreon is shitty, I can't post code snippets here, so every piece of code is just a screenshot generated by Carbon. It's awesome, thanks to them! And the code snippets written by me will be available on GitHub.
Ryan Dahl presented Node.js at JSConf 2009 and said that Node.js was:
- Built on Google's V8.
- Evented, non-blocking I/O. Similar to EventMachine or Twisted.
- CommonJS module system.
You may not understand much of this yet, and that's okay. The most important point is that Node.js has non-blocking I/O. (Everything else we will discuss later, but this point is the key one.)
Usually when you build a server application using Python, PHP, Ruby or any other language, you accept the paradigm that any request should get an instant response. It means that if you suddenly need to go to a database to fetch some data for the response, request handling is “paused” and nothing else is processed until you get back from the database. And if you need any concurrency, you have to use special language features or even external tools (e.g. separate instances of your application for each request). But what if every piece of your code worked in a non-blocking paradigm and you didn't need to block handling when you have more than one request?
Ryan got this idea when he saw a Flickr progress bar that created lots of requests just to show the current progress of a file upload, and those requests were slowing down the upload itself. Awkward, huh?
If you've worked with web servers, you probably know that everybody around uses Apache or NGINX. You may also know that NGINX is considered faster than Apache. As Ryan said in his Node.js presentation, the key source of NGINX's blazing speed is event-based request handling. He cited WebFaction's research comparing these web servers by memory consumption:
The test is quite synthetic, but the key point is clear: the event-based architecture of NGINX uses less memory for a high number of concurrent connections.
The same point applies to Node.js. Of course, there are negative sides too (e.g. it's harder to write async code all the time), but they're not so important for those who are just starting to learn Node.js.
In 2009 there weren't many event-based web platforms, and those that existed were quite complicated. But the most exciting feature of Node.js was, well, JS. The appearance of Node.js cleared the way to the backend for thousands of frontend developers and other JS lovers.
Node.js has changed significantly since 2009. It was even split into two different projects because of conflicts between Joyent (the company that owned Node.js) and core contributors. But fortunately everything is going well today. Node.js v10 is the current stable release, and during this series of posts we will use exactly this version.
To install Node.js, just open nodejs.org, download the preferred release and install it as you usually do on your computer. After that you should be able to use node and npm in your terminal. If you can't, well, just Google how to fix it, it's not that hard.
If you decide to learn a new programming thing (framework, library, language, whatever), you should always start by downloading the sources if you can. Of course it's hard to read them when you don't even understand how the thing works, but when you run into a problem you will be able to dive deep into the sources and find the solution yourself. It's better to have a habit like this rather than googling Stack Overflow every time.
The source code is available on GitHub, or you can easily download an archive with it from the “Download” section of the official Node.js site.
Finally we can test it. Let's start with experiments with the module system. We create a class User that defines a method hello. Quite simple:
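Since the original snippets are screenshots, here is an assumed reconstruction of that class:

```javascript
// index.js: an assumed reconstruction of the screenshot
class User {
  constructor(name) {
    this.name = name;
  }

  hello() {
    console.log(`Hello, ${this.name}!`);
  }
}

const user = new User('Node.js');
user.hello(); // prints "Hello, Node.js!"
```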
Let's save it as index.js and run it in a terminal using the node command from the folder with index.js:
Well, it works as we expect. As you see, console.log works in Node.js the same way as in a browser. But we were talking about modules.
Let's say we want to move our User class to a separate file, like you can do with any JS library such as jQuery (oh my, jQuery in 2k19). In a browser we use the script tag to do it:
And the content of this script becomes available globally (if you don't prevent it). But it doesn't work the same way in Node.js. Let's move our class to the file user.js:
To import this class into index.js you should use the require function:
Now, to understand how it works, let's add a test console.log to the end of user.js:
And run index.js:
Aaaand there is an error. Surprisingly, Node.js couldn't find our class and couldn't instantiate the object. But user.js was required: we see our test string in the output!
The reason why it behaves so oddly is actually an awesome feature of Node.js (well, it's a usual thing for many other languages). Everything that you write in a required file is contained in that file and doesn't leak out unless you explicitly specify a way for it to do so. It allows us to use any variables and functions we like in separate files without being scared of overriding them.
But why? Is it some sort of Node.js C++ magic or what? No, it's just JS. Every module that you require is wrapped by Node.js in a special function known as the module wrapper.
Finding the module wrapper
In the documentation of the Node.js module system you can find this code:
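For reference, the wrapper from the docs looks like this; it's stored in a variable here so we can inspect it:

```javascript
// The module wrapper as shown in the Node.js docs; the body of every
// required module ends up inside a function like this one.
const wrapper = function(exports, require, module, __filename, __dirname) {
  // Module code actually lives in here
};

console.log(wrapper.length); // 5: the five "global" names every module gets
```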
Let's try to find it in the sources to understand how it works, why it works, and what exactly goes on inside Node.js when you run a file. This time I'm going to do it step by step just to show you how to search the source code. In later posts I will just show you the code without explaining my search process. And don't forget, I'm using Node.js v10.16.0; you may have a different version while reading this.
First you should open the sources in your text editor (WebStorm in my case). Then find “Find in Path” or a similar option and search through the folder with the Node.js sources. In this case the string exports, require, module, __filename, __dirname looks like an appropriate place to start, because the code around it could change (that “(function(” thing), but the arguments should stay the same, otherwise it wouldn't work. Well, let's search:
There are 26 matches in 9 files, but the first one looks like the key one:
Well, we see the variable that is used somewhere. By searching in the file we find where:
You may not understand the whole purpose of it, but you can certainly see that there is some Module object with a method wrap, which is used to wrap a script with the module wrapper. Well, let's find the usage of this method.
Surprisingly, we have it in the same file:
Let's look at the whole code of this _compile method:
What can we get out of it? A lot!
- The method has a description that actually tells us what it does.
- We see that the method gets content and filename from somewhere. The method is called “_compile”, which suggests that content is the code of the script being compiled and filename is the name of the file containing it.
- On line 7 a shebang is removed, so it can exist in the source code and Node.js will just ignore it.
- There's some patching-shmatching thing that we can't properly understand right now, and some lazy module loading system too. The latter is experimental, so we can just ignore it.
- On line 11 the source code of the script is wrapped in the anonymous function whose declaration we already know. After that it's “compiled” by the “virtual machine”.
- On line 78 the compiled code is called with the arguments we expected.
Line 78 is the most interesting here. First, now we know that the context of the evaluated script is the exports object. We don't know yet what it is, but we can check our guess:
Be careful! Right now we're talking about executing files! It means that we don't know what happens when we execute something like this:
You should check it by yourself and try to understand how it works.
Back to line 78. We know that the arguments match up like this:
- exports that is available inside the compiled JS file is this.exports from this source file;
- require is the result of evaluating makeRequireFunction(this);
- module is an instance of the class Module;
- filename is passed as an argument to the _compile method;
- dirname is derived from filename.
The last two are simple and boring. Let's discuss the first three.
exports, require, module
this.exports is defined in this way:
And at the moment of evaluation it should be an empty object. We can check it with code like this (you get “true” if you evaluate it):
To understand what makeRequireFunction(this) does, we should inspect two pieces of code: the function itself and the require and _load methods of the Module class that it invokes:
First, the function. You can see that:
- the require function has some other functions attached to it (in JS everything is an object, so why not);
- there is a cache of required modules;
- requiring is done by calling the require method of a Module instance.
Well, the require method is just a proxy for _load, whose main purpose is to check the cache for the loaded module and either return it, or create, cache and return it. tryModuleLoad here is quite simple:
The load method is bigger:
But the main purpose of this method is to load a module depending on its type. E.g. if it's not an ES Module and it ends with “.js”, this code will be evaluated:
Well, just “read and compile”.
There are also other extensions that Module can handle: json, node and mjs (the experimental ES Modules). E.g. here is the json handler:
Just “read and parse”.
Okay, we know what exports and require are. Finally, to test that the global variable module in our script is just an instance of Module, we can check it with instanceof and make sure it has some of Module's methods:
If you evaluate it, you will get “true [Function]”, which is fine. Also in the source code of _compile we can see that module is this, while exports is this.exports, which means that inside our code exports is just module.exports.
Phew. Let's sum up. What kind of useful knowledge did we get from this inspection?
- We can get a lot of useful information from the sources. Yes, it's possible to read the docs, but it's much better when you can literally touch all the internal things that do the magic.
- Each module has its own set of pseudo-global variables: exports, require, module, __filename and __dirname. And exports is actually module.exports, which is an empty object by default.
- Each module is cached when it's required for the first time and is then loaded from the cache. The cache's unique ID is the module's filename.
It may look like nothing, but it's actually some of the most useful knowledge about the Node.js module system. Do you want proof? Okay.
Proper module structure
Let's go back to our User and try to fix the code. We know that there's an exports object that we can use. Let's just put a link to our class there:
And because we don't know what to expect, let's log the result of requiring this file in index.js:
And run it:
Check the 3rd line. We now have our exported class there! Let's rewrite index.js to use it:
Well, it means that we can't assign our User class to exports. But we can assign it to module.exports:
And finally require it without destructuring assignment:
It works, and it proves our guess about the exporting system. Actually, we haven't even tried to break it by assigning an object to exports before doing it right, but you can do that on your own.
It's even easier to prove the second guess: require caching. Let's change our user.js to add a random number to the greeting, but generate this number outside of the class:
And change index.js to require it twice and check the result:
We see here that:
- The module was loaded only once: there is only one greeting.
- It's not only about logging: our random number was not regenerated during the second require either.
- The required objects are the same object, which means it was saved and then loaded from the cache (true on the 5th line).
The last feature worth discussing is cache overriding. The Node.js module system caches not only your modules but also native ones. We can't reach this cache from our scripts the way we can with user-defined modules (just require module and read its _cache property), but according to the sources it works exactly the same way:
It means that when you require any native module, you get the same object each time. So you can change this object and use these changes later. Here is test.js:
You may know that fs has no method called noop. But here is our index.js:
When we evaluate this index.js, here is what happens:
- fs is required and cached internally;
- test.js is required and parsed;
- during the parsing, fs is required again and the cached object is returned;
- anonymous function is written to module.exports and returned to index.js as a result of requiring test.js (it's also cached);
- noop property is added to fs, which means it's added to cached version of fs;
- finally, test function is called.
If Node.js didn't cache modules, test would throw an error about the undefined property noop of fs. But it does cache them, so in index.js fs.noop is added and then fired:
Now you should read the Modules section of the Node.js documentation to get more information about the useful things you can get from the require and module objects. Next time we will talk about global objects.
Code is available on GitHub.
16 July 2019
Last time we discussed Node.js history and the basics of the module system. Let's recall the main example from that post. We had a User class that we stored in a separate file and required in index.js. Here is user.js:
This is the proper way to build a module system. But as usual, there is a “hacky” way. As you remember, we said earlier that Node.js modules don't work like browser scripts and it's impossible to require a module and force its content to be available globally. Well, that's not entirely true.
Node.js runs on servers, and it should be able to do lots of “server” things: output something to stdout, get something from stdin, run a child process, process command-line arguments, etc. And there are tools for it: built-in modules and global objects.
Last time we concluded that module, exports, require, __filename and __dirname are not truly global. They exist only in the scope of modules. It's possible to use them for passing data between modules (e.g. module.exports), but there is no way to make a local object global using them. But wait. We used console to log some data to stdout, but this object isn't defined in the module scope and wasn't required as a built-in module in our code. Where is it defined, and how?
But how can we access window itself when it's the context? Well, easy:
window has a circular link to itself in its window property. That makes it possible to reference it by name inside code. (There are a lot of circular links in browser JS. E.g. window.parent in a parentless window is also just a link to window.)
The same rule works for Node.js. It has a global object too that works the same way, but it's called... oh, global.
The Node.js core is written in C++, but modules are written in JS. And global is defined at the C++ level, which means that all modules already have access to it, and it gives them the ability to add any global objects they want.
In Node.js v10 there are several truly global objects:
- setImmediate, setInterval, setTimeout (and clearImmediate, clearInterval, clearTimeout);
The most useful ones are console, process and functions for working with different timers.
Every global object is linked to global, of course. And you can link anything there too. For example, let's rewrite our User example to work without an explicit export. Easy:
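A sketch of that “hacky” version, condensed into one file (in the post it was split between user.js and index.js):

```javascript
// what user.js might do instead of module.exports:
global.User = class User {
  hello() {
    return 'Hello from the global scope!';
  }
};

// and what index.js could then do after require('./user'):
// User resolves through `global`, no import or destructuring needed
(function somewhereElse() {
  const user = new User();
  console.log(user.hello()); // prints "Hello from the global scope!"
})();
```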
But it's an anti-pattern, and usually nobody does things like this except for some monkey-patching. There are two reasons why it's bad.
First, it's too implicit. You need to remember all the requires and the things that happen in them. Yeah, it's possible to rely on IDE analysis, but usually this approach makes code unreadable and unsafe. Especially when you use things like this inside conditionals, where they can bypass your tests:
It may seem a little strange that I call global assignment an anti-pattern while every browser allows and even forces you to do it. But if you have any frontend development experience, you know that even there it's usually the worst way to organise your code.
Second, it works differently in different environments. As you may remember, last time I said that this === exports doesn't evaluate to true everywhere. When you run it using the string evaluation feature, you get this:
Why? Because there is no module wrapper around it! This code is evaluated at the top level, and the context of this code is global, not exports:
The same goes for the REPL:
When something doesn't work consistently and behaves too implicitly, you'd rather not use it. The same applies to global.
window vs global
There is no deep logical explanation for why global wasn't named window years ago. As Ryan said at JS Fest 2019 in Kiev, he didn't think much about naming and picked names somewhat randomly. Sometimes he chose something related to the browser (e.g. he replaced print with console.log), sometimes not. But there are a lot of similarities between the two objects.
First of all, as we already said, both have a circular link to themselves and both are the global contexts of script execution, which makes everything written into them accessible from any part of the application.
Second, they behave the same way with variables defined at the top level with var. When a user writes var a = 1 at the top level of execution (outside of any function), this definition creates a property on window:
Same works with global:
It doesn't work this way with let or const.
So why not call these objects by one name everywhere, right? Actually, there's a proposal that defines a universal name for them: globalThis! Feel free to read more on MDN. The problem lies not only between browsers and servers but deeper. E.g. the global object behaves weirdly in different browser contexts.
Another question you could ask is: “Why are some things global while others have to be required?” Well, it's hard to explain. Some objects are global because it was handy years ago. Some of them are global because they come from C++. E.g. process is defined at the C++ level and is just extended at the JS level. (Actually, there is a possibility that process and Buffer will be deprecated in ES Modules for security reasons. Node.js developers are trying to improve security these days, and this deprecation is one of the first steps in that direction.)
We're discussing “hidden” things like contexts, but it's always fun to touch them and try to do something with them. Of course it's a bit hard, because you'd need to know C++ to experiment with the Node.js sources. But we can use a virtual machine!
vm is a built-in Node.js module that works similarly to the eval function, but it compiles and runs your code within a V8 Virtual Machine context. Of course it's totally insecure and you shouldn't run any untrusted code there.
Let's create a simple example:
Here is what we get when we run it:
As you can see, it works exactly the same way as with global. foo is available globally without being required or defined, bar is defined using var and is also available globally because var forces it to be written into the context object, and baz is scoped and doesn't affect the context.
Actually we can create a name for our context object and this time call it window!
It also works exactly as we expect: window is a circular link to itself (and to context object), bar is accessible when it's written into window, etc.
Well, we've achieved something similar to what Node.js does each time it runs our code.
If you want to learn more about vm and global objects, you should read the VM and Globals docs. And there are a lot of useful things in the Process doc.
All code samples are available on GitHub.
We have already discussed modules and global objects. This post is about events and how they're threaded through every piece of the Node.js built-in modules and environment.
Let's start with the well-known global object process. According to the documentation it can handle lots of events: beforeExit, disconnect, exit, message, etc. Let's test one of them:
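An assumed reconstruction of that test (the exact messages are made up):

```javascript
console.log('Started');

// keep the process alive for two seconds
setTimeout(() => {
  console.log('Two seconds passed');
}, 2000);

process.on('exit', (code) => {
  // fired right before the process exits; only sync code works here
  console.log(`Exited with code ${code}`);
});
```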
When you run this code, you get the first output line right away and two more lines two seconds later:
The code above is quite clear: when the Node.js process finishes, the exit event is fired and our handler is invoked. But there are some unclear parts:
- How does the process object get these events? What fires them?
- How many objects have this on method for attaching listeners?
The answer is the Event Emitter.
Node.js is event-driven at every level of evaluation. The top level, which defines almost every built-in module, also defines one of the most important of them — events. It makes it possible to define your own modules that implement the Event Emitter pattern (sometimes it's called Publish-Subscribe (Pub-Sub) or Observer, but beware: they're not exactly the same).
An instance of the EventEmitter class has the following main methods:
- emit — triggers event and fires all attached listeners for this event;
- on — adds a listener to the queue;
- off — removes a listener from the queue.
There are some more methods, but these three are the core ones. Following this “spec”, it's easy to implement our own EventEmitter without inheriting from the built-in one:
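A minimal implementation along those lines (the original screenshot may differ in details):

```javascript
class EventEmitter {
  constructor() {
    this.listeners = new Map(); // event name → array of listeners
  }

  on(event, listener) {
    if (!this.listeners.has(event)) {
      this.listeners.set(event, []);
    }
    this.listeners.get(event).push(listener); // attachment order is preserved
    return this;
  }

  off(event, listener) {
    const queue = this.listeners.get(event) || [];
    const index = queue.indexOf(listener);
    if (index !== -1) {
      queue.splice(index, 1);
    }
    return this;
  }

  emit(event, ...args) {
    // listeners are called synchronously, in attachment order
    for (const listener of this.listeners.get(event) || []) {
      listener(...args);
    }
    return this;
  }
}

const emitter = new EventEmitter();
emitter.on('greet', (name) => console.log(`Hello, ${name}!`));
emitter.emit('greet', 'world'); // prints "Hello, world!"
```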
And it works as we expect:
Here we've implemented two important features almost without noticing.
First, we preserve the order of the attached handlers. It means that they are fired in exactly the same order they were attached.
Second, they are fired synchronously. Together with the previous point, this makes event handling more predictable and helps avoid race conditions. But it also makes it possible to hang the whole Node.js process if the programmer doesn't know how events work. Let's look at the example below:
What happens here:
- First, after all require statements we open current file for reading. We do it asynchronously.
- Then we create an instance of our EventEmitter and two synchronous functions: the first one waits for 5 seconds and logs a message, the second does the same after 2 seconds of waiting.
- Finally, we attach the defined functions as listeners, then log, emit the event, log, remove the listeners and log again.
We see that:
- Our handlers were fired in the order they were attached.
- They were fired synchronously, immediately after the emit method ran (the “Events emitted” message was displayed after the handlers' messages, not before).
- They block other async operations (file opening in our case; you can actually achieve the same “freezing” effect in browser JS).
And the reason is not our implementation of EventEmitter; the built-in one behaves the same way. The reason is the predictability of code evaluation. If an EventEmitter instance didn't preserve handler order and didn't fire handlers synchronously, code execution would be messy and unpredictable. You wouldn't be able to rely on it.
That's why we should not invoke a lot of heavy sync actions inside event handlers, otherwise our code would be inefficient and would eliminate the main advantage of Node.js — the event loop.
Every async action in Node.js is related to the event loop. This is the abstraction that allows Node.js to schedule async things, fire their callbacks as quickly as possible and not block code evaluation. Usually, when your code is evaluated, all sync things run first, and only after them are the remaining async things scheduled and fired (the same scheme applies to the async/sync code inside them, and so on).
There is the same event loop in the browser, so it's worth diving deep into the topic and learning how it works in detail. The best explanation I've seen is Philip Roberts' “What the heck is the event loop anyway?”.
Node.js's event loop is a bit more complicated, but it's basically the same thing. If you want to get a proper picture of it, there is Bert Belder's “Everything you need to know about Node.js event loop”. It's hard to understand without enough experience with Node.js APIs, but reading the docs usually helps.
Events, events are everywhere
As we said at the start of this post, almost all the main built-in modules have an event interface with methods like on, off and emit. The reason is simple — they all inherit them from EventEmitter, and that's their way to handle async events in JS without blocking everything. If you've worked enough with browser JS, you surely know the drill: to handle a click on a button, you attach a click listener; to be notified about page loading, a load listener on window; to get updates on an XHR request, an onreadystatechange callback on the XHR instance; and so on. The same applies to Node.js code: if you want to be notified about new HTTP requests, attach a listener to an HTTP server instance, etc.
And this style of writing code creates new problems:
- How do we notify the user when an error happens in an async action?
- Is it possible to cause memory leaks by attaching a lot of event listeners, and how do we prevent them if so?
Because events are everywhere, there's one event that EventEmitter instances handle in a special way: the error event. Let's create a small example:
When you run it, the main process will crash because of an “Unhandled 'error' event”:
You can see in the stack trace that it crashed somewhere inside EventEmitter's emit method. Well, it's here:
emit checks whether there is a handler defined for the error event. If not, it throws an error (which can be passed as the second argument of emit). So now we know how to prevent the crash:
So there is an agreement: if anything goes wrong with an object that implements the Event Emitter interface and there is no other way to handle it, it should emit an error event to make it possible for the user to handle the error themselves. That's why child_process emits error when it can't spawn a process, fs.FSWatcher emits error when it suddenly can't watch a file, net emits error when something goes wrong with a connection, etc.
As a result, it's good practice to add a handler for the error event wherever it can happen, like a regular final catch for a promise chain.
First, let's see how we can detect GC work in our code. Let's create a small example: an infinite process that allocates more and more memory:
When we run it, we get a long line of numbers:
These numbers show V8's memory usage caused by the evaluation of our code. As you can see, the numbers constantly increased, then suddenly stopped growing and settled around a constant value. Well, that was the GC. At first V8 didn't run it because memory consumption was too low. Then V8 started to run it, and the GC freed a huge part of the unused memory. Finally it started to work regularly and continued to free the same amounts of memory.
Actually, instead of guessing we can hook up to GC using Node.js performance tools:
Now we can see when and how often the GC works (you can even get more info about its work, e.g. its type, the time spent, etc.; read more in the docs).
But if everything works so well, how can we cause memory leaks with event emitters? Well, easy-peasy. Let's assume that the array from the example above is built inside an instance of some class, which also attaches an event listener to a separate event emitter:
Here is what happens when we run it:
Whoa! What happened? Well, the GC tried to free the allocated memory, but it couldn't, because we tied a long-living object to short-living ones.
During the instantiation of Producer, which we do every 200 ms, we not only generate a huge amount of data but also attach a listener to the EventEmitter instance defined in the global scope. And our listener holds part of its context (actually the whole context — this, but it could be any other Producer method or property, e.g. this.data). So what we get is a chain of connected objects: event emitter → our listener → the listener's context → the huge array. When the GC tries to remove useless objects, it can't do anything with the huge array, because it's reachable from the still-living EventEmitter instance. And that leads to constant memory growth.
Of course, it's possible to get rid of this behaviour and let the GC do its work. E.g. we can add a method that removes the event listener and call this method inside our setInterval. But the point is that one should be careful when using EventEmitter instances, especially globally defined ones, because it's possible to cause a memory leak that will lead to a crash.
Anyway, there was one more interesting thing in the log above: the warning. Usually it's not okay to add a lot of listeners to one EventEmitter instance, so Node.js warns you about it. But it's just a warning; it doesn't prevent adding new listeners:
As the warning says, we can increase or even turn off the limit using the setMaxListeners method. But of course it's a red button that should be pressed very carefully, and the limit should only be turned off in development code, never in production.
To learn more about EventEmitter, read the docs; there are more useful methods one can use.
If you want to learn more about the event loop inside Node.js and the misconceptions around it, there's an article by Daniel Khan.
And finally, you can read more about memory management in browsers in the Memory Management article on MDN.
The code samples from the article are available on GitHub.
1 August 2019
Streams, Part I
There are two main concepts in Node.js: events and streams. They're the foundation of the whole Node.js “event-drivenness”. We discussed events in the previous article. This one is about streams.
Streams are everywhere
It doesn't matter whether you're an experienced developer or not a developer at all: you know what streams are. Nowadays streams are everywhere. Every social network has something that is called a “Stream” or at least looks like one. Usually it's related to video: Streams on Twitch, Live Streaming on YouTube, Live Videos on Instagram. Some streams have even evolved into something so familiar that nobody calls them “streams” anymore. “Where do you listen to music?” “Well, I prefer Apple Music.” Streamed music is so natural these days that nobody cares what “streamed” means. Why is it even called “streamed” instead of “online”?
Well, because the key property of “streaming” is the possibility of infinite data, with all the pros and cons that it brings. For example, when you watch a video on YouTube, you know when it ends because you can see it on the player's timeline. But when you watch a stream, you don't have that knowledge.
The same logic applies to the streams in Node.js.
It's a generator that calls itself recursively and returns natural numbers. And it works:
But there is a problem — it can easily cause a stack overflow:
Yeah, well, you should always keep the call stack size in mind when you write recursive programs (especially with generators). Let's rewrite our stream-producing function so that it doesn't grow the call stack, keeping generators and a similar notation:
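A non-recursive version might look like this (the original screenshot isn't reproduced, so the names are assumed):

```javascript
// infinite generator of natural numbers, no recursion involved
function* naturalNumbers() {
  let n = 1;
  while (true) {
    yield n++;
  }
}

const numbers = naturalNumbers();
console.log(numbers.next().value); // 1
console.log(numbers.next().value); // 2
console.log(numbers.next().value); // 3
```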
Now it doesn't cause a stack overflow and works exactly as we expected. But it's still just a function that produces an infinite sequence of natural numbers and gives us the next one every time we ask. Where is the magic?
Let's assume that our naturalNumbers function isn't a function but a file that we can read line by line, item by item. In this case we want to have something that will do it for us. Something that can read more than one item at the time. Let's create a class that will instantiate such thing for us:
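Such a class might look roughly like this (a sketch; the names maxBufferSize and source are my reconstruction of the screenshot):

```javascript
function* naturalNumbers() {
  let n = 1;
  while (true) yield n++;
}

class ReadableStream {
  constructor(source = naturalNumbers(), maxBufferSize = 10) {
    this.source = source;               // where the data comes from
    this.maxBufferSize = maxBufferSize; // the buffer limit
    this.buffer = [];                   // currently available data
  }

  read() {
    // fill the buffer up to the limit, then hand all of it out
    while (this.buffer.length < this.maxBufferSize) {
      this.buffer.push(this.source.next().value);
    }
    return this.buffer.splice(0, this.buffer.length);
  }
}

const stream = new ReadableStream();
console.log(stream.read()); // [1, 2, ..., 10]
console.log(stream.read()); // [11, 12, ..., 20]
```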
Note here two things.
First, it doesn't matter how exactly our ReadableStream gets data. In real life, of course, nobody wants just a stream of natural numbers. But we can easily rewrite our constructor to take an argument that serves as the source of data. E.g. if it were a stream for reading files from the file system, the source of our data would be the OS API that lets us read files.
Second, we introduce here a thing called a buffer. It's important, because it's the heart of any stream (along with its state). The buffer contains the currently available data. And it's always limited by some constant, because we don't want the buffer to overflow. Why? Because otherwise there's nothing useful about streams. The main advantage of the buffer is that we can read data chunk by chunk without using all the available memory. As you remember from the previous posts, that's the key feature of Node.js.
Okay, how does it actually work? Let's look:
That's how. But it's a readable stream. It's cool that we can read things, but sometimes we want to write them. For example, you read a file and send it over the network. It's wasteful to read the whole file into memory and send it afterwards. It's much better to read it chunk by chunk and send it the same way. In this case you don't waste your machine's memory and can do lots of these operations simultaneously.
Because right now we aren't working with real streams, our WritableStream is damn simple:
Here we assume that someone will use write to pass some data (array of something), and each second our stream will flush the buffer by writing it somewhere (e.g. send over the network to the client).
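A minimal sketch of such a WritableStream (my reconstruction; the unref() call is my addition so the sketch lets the process exit cleanly):

```javascript
class WritableStream {
  constructor() {
    this.buffer = [];
    // flush the buffer every second, e.g. "send it over the network"
    setInterval(() => {
      const chunk = this.buffer.splice(0, this.buffer.length);
      if (chunk.length) console.log('flushed', chunk.length, 'items');
    }, 1000).unref();
  }

  write(data) {
    this.buffer.push(...data);
  }
}

const writable = new WritableStream();
writable.write([1, 2, 3]);
```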
Now, let's write the result of reading into it:
If we added console.log(this.buffer.length) to the write method and ran this code, we would see this:
Looks like it works. Roughly every 100 ms we read a data array from the readable stream and pass it to the writable one. And about every 1000 ms our writable stream flushes its buffer.
But there's another problem now. As we can see, reading here works much faster than writing. It means it's possible to eat up all the OS's memory, just because our writable stream doesn't have a bounded buffer. Let's add one:
Here it is. But is it useful? Seems like if the stream's buffer is full, we just ignore any incoming data. It means that we lose it, and that's definitely not a good idea. What can we do here?
On the one hand, we can throw an error and force the programmer to fix it somehow. On the other, we can do a gentler thing and add a return flag which signals the current state of the buffer to the programmer. Let's say that if the write method returns false, it means there's enough data for now.
Why can we go this way? Because usually the buffer size isn't a hard limit. The stream can accept more data; it just doesn't want to, because otherwise it starts consuming more memory than it should. And it's up to the developer to decide — should the stream eat more RAM or not. Now, let's rewrite write to add the flag:
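The write-with-a-flag version might look like this (sketch; maxBufferSize and the unref() call are my assumptions):

```javascript
class WritableStream {
  constructor(maxBufferSize = 50) {
    this.maxBufferSize = maxBufferSize;
    this.buffer = [];
    setInterval(() => {
      this.buffer.splice(0, this.buffer.length); // "flush"
    }, 1000).unref();
  }

  write(data) {
    this.buffer.push(...data);
    // false means "that's enough data for now" — the data is still
    // accepted, the buffer just grows beyond its soft limit
    return this.buffer.length < this.maxBufferSize;
  }
}

const writable = new WritableStream(5);
console.log(writable.write([1, 2, 3])); // true — still below the limit
console.log(writable.write([4, 5, 6])); // false — enough for now
```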
A slight change, but now we can handle buffer overflow:
Easy, right? If the buffer is full, we just wait a bit and start writing again.
But... why 200 ms? Why not 201? 300? Well, that's because we don't know when exactly we should continue writing. If only there were a way to get a notification... but wait.
Yep. As you remember, events are everywhere (there definitely should be a meme with Buzz Lightyear here, but I'm trying my best to be serious). So let's make our writable stream emit an event at the moment of flushing:
We've simplified the interface: instead of the on / off methods usual for an Event Emitter, we added once, which means our callbacks are removed by the stream after firing (the removal is also a bit simplified, and later it will shoot us in the leg, but it's okay).
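A sketch of the stream with the simplified once (my reconstruction; unref() is my addition so the sketch exits cleanly):

```javascript
class WritableStream {
  constructor(maxBufferSize = 50) {
    this.maxBufferSize = maxBufferSize;
    this.buffer = [];
    this.callbacks = { drain: [] };
    setInterval(() => {
      this.buffer.splice(0, this.buffer.length); // "flush"
      this.emit('drain');
    }, 1000).unref();
  }

  once(event, callback) {
    this.callbacks[event].push(callback);
  }

  emit(event) {
    // simplified removal: all listeners are dropped after firing
    const callbacks = this.callbacks[event];
    this.callbacks[event] = [];
    callbacks.forEach((callback) => callback());
  }

  write(data) {
    this.buffer.push(...data);
    return this.buffer.length < this.maxBufferSize;
  }
}
```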
Now, let's use it:
And it works like a charm:
Now when our writable stream can't swallow more data, it returns false; we pause our writing and add a listener for the drain event. By default the max size of the buffer is 50 items, and that's why it stops when we've written 60 (as you remember, we don't have a strict limit and return false only when the current buffer size is greater than what's allowed).
We eliminated one setInterval, but still have another one. We still read data every 100 ms and still do it 10 items at a time. It's impossible to use this strategy in real life, because there are lots of situations where it doesn't work. For example, if you read data from another server over TCP, it might not have arrived yet at the moment of reading. Well, it looks like we need one more event!
There should be a way for the readable stream to say “Okay, you can read now”. Of course we could emit an event like ready. But how do we notify the handler about the amount of available data? We could pass it as an argument to the event handler, but it still looks complicated, because the programmer would have to write all these wrappers and invoke the read method everywhere. (Yet another way is to not care about the size and let the consumer read as much as it can, but that's not the point now.)
Let's do it in a better way — pass the read data as an argument! Like this:
Now, when new data is available stream fires callbacks for data event and passes read data to the handlers. Here it is in action:
(As I said earlier, once would shoot us in the leg, and here it is. Because of the synchronous code evaluation, if we add the data handler without setTimeout, it will be removed inside ReadableStream while the callbacks array is being reset. But it's fine, let's pretend that nobody sees this.)
When we evaluate code above it reads data when it's available and writes it right after that.
Now let's join all pieces together.
It would be handier if there were a way to just tell Node.js: “Hey, here are two streams. Take them, read data from the first one and write it to the second”. So let's write a function for that:
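With the once-only listeners from above, such a pipe helper might look roughly like this (a reconstruction, not the exact screenshot code):

```javascript
// Reads from `readable` and writes to `writable`, pausing on backpressure.
function pipe(readable, writable) {
  const onData = (data) => {
    if (writable.write(data)) {
      readable.once('data', onData); // keep listening
    } else {
      // buffer is full: wait for a flush, then listen again
      writable.once('drain', () => readable.once('data', onData));
    }
  };
  readable.once('data', onData);
}
```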
Easy-peasy! Let's run it:
Isn't it awesome? Well... not quite yet. Let's add logging to write, read and pipe, and see what happens there:
Wow. Everything seemed to be correct, but then suddenly a batch of reads happened. Why? Well, because there are no limits on reading. Our “reading simulation” inside ReadableStream simulates reading every 100 ms. If we decrease that to 10 ms, we will see lots of “read 10” entries in the log going one after another. And only some of them will be handled by pipe, because our reading doesn't stop when there is no data listener. So let's add pause / resume methods:
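The updated ReadableStream might be sketched like this (my reconstruction: it reads from the source only when not paused and when someone is listening; unref() is my addition):

```javascript
class ReadableStream {
  constructor(source, chunkSize = 10) {
    this.source = source;
    this.paused = false;
    this.callbacks = { data: [] };
    // simulate data arriving from the source every 100 ms
    setInterval(() => {
      // read only when not paused and someone is actually listening
      if (this.paused || this.callbacks.data.length === 0) return;
      const chunk = [];
      for (let i = 0; i < chunkSize; i++) chunk.push(this.source.next().value);
      this.emit('data', chunk);
    }, 100).unref();
  }

  pause() { this.paused = true; }
  resume() { this.paused = false; }

  once(event, callback) {
    this.callbacks[event].push(callback);
  }

  emit(event, data) {
    const callbacks = this.callbacks[event];
    this.callbacks[event] = [];
    callbacks.forEach((callback) => callback(data));
  }
}
```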
(We don't care about method access here, so all of them are defined as public.)
Now everything works perfectly. ReadableStream reads data from the source only when there are data event listeners, which means that we won't lose anything. And it pauses reading until a new listener is added (that's because we implemented once instead of on / off; but again, for our example it's totally fine). It allows us to do read-write operations without using a lot of the computer's memory and as fast as possible.
It looks like now we understand what streams are and why they're useful, so let's move to the real streams in Node.js.
Streams in Node.js are quite similar to what we implemented above, but they can handle much more extreme situations.
First things first, there are four types of streams in Node.js:
- Readable — for reading data from somewhere;
- Writable — for writing data to somewhere;
- Duplex — for both reading and writing data;
- Transform — for reading, transforming and writing data.
All these streams are implemented by the stream module. But usually you work with its descendants. Let's talk about each of them separately.
Readable Streams in Node.js
Usually every Node.js developer works with Readable Streams every day, because they are implemented at least by fs.ReadStream, http.IncomingMessage, and sometimes process.stdin. As you may understand, when you create a web server, you need to work with the network, the file system and sometimes with the standard input stream.
Let's look at the example below:
When we run it we see:
Not so readable, huh? That's because we didn't ask fs to convert its internal buffer to a string. But let's first understand what happened here:
- On line 3 we created a readable stream using fs and passed __filename as an argument. The first argument is the path of the file that should be read by fs. So we asked it to read the current file.
- On line 5 we attached a listener to the readable event, where we read some data and log it.
- On line 10 we attached a listener to the end event, where we log “Done”.
Well, that was obvious. But what happened inside?
When we asked fs to create a readable stream for the current file, it created an object for the stream and asked the OS to open the file. To be more accurate, it invoked a C++ method called Open:
You may not understand C++, but you can see the important condition on lines 17–28. This function can be invoked with different sets of arguments, depending on how exactly the file should be read — asynchronously (AsyncCall) or synchronously (SyncCall). Well, fs.createReadStream internally invokes fs.open, which calls this C++ method with the 4th param set, forcing the method to pick the async path. (As you may guess, fs.openSync does the opposite.)
Why is it important? As you remember from the previous post, one of the main components of Node.js as a platform is the event loop. And there is an order of evaluation where sync actions are performed first and async ones are called later. File stream creation is always an async process, which means we can be sure that it is safe to create a stream, then add any listeners to it, and nothing will happen until the next cycle of the event loop (i.e. at least until the end of the evaluation of our sync code).
So, when we create the stream on line 3, Node.js creates the related objects but doesn't do anything fs-related right away. It continues to evaluate our next instructions.
Okay. Then, on lines 5 and 10 we add event listeners. We can do it, because Stream is inherited from EventEmitter. Literally, here are the first lines of stream source code:
The stream is created, the event handlers are set, but what happens next?
Well, actually, Node.js's ReadStream interface is kinda tricky, and it works similarly to the interface we implemented above. It has an overridden on method (here it's defined on Readable, because the logic is implemented in stream.Readable, not in fs):
As you see, adding a data or readable event listener forces the stream to try to read data from the source. But, again, the stream does it in an async way, so it's safe to add more listeners.
Anyway, when we add a readable listener, Node.js schedules the reading for the end of the sync actions, and when the right time comes, it gets data from the OS. Then Node.js checks the size of the received data. If it's lower than the highWaterMark constant (16 KB by default, but it can be changed), it tries to read more. Otherwise, when there is no more data or highWaterMark is reached, Node.js emits the readable event. It also emits the data event and passes the new data to the handlers of this event.
Finally, if there is no more data, Node.js emits readable one last time and closes the stream. If you try to read from the stream while handling this last readable event, you get null, which also informs you that the stream is completely finished. That's why we got null in our test.
To illustrate how it works during the lifecycle, let's add a listener to every possible event:
We also call pause to illustrate something. Let's evaluate it:
Okay. First we got the pause event, because we called pause manually. As you see, it's possible to get some events even before the file is opened, because the stream exists whether or not the file is open.
Then we got open and ready. They aren't related to stream.Readable itself; they are events of fs.
After them we got data, readable and end, as we did before. And finally, close. close is also an fs event.
We didn't get resume and error. There was no error, so that's fine. But why didn't we get resume while we got pause? Well, the pause and resume events are emitted only when someone calls the pause and resume methods.
But wait. We called pause, but data was streaming without any problem. Why? Check the implementation of the on method of Readable and you'll see that when we add a data handler, resume is called. Buuut... where is the resume event? We can get it if we change the order of attaching the listeners! Let's do that, and also log a piece of the stream's state — the flowing flag.
What the heck? Attaching a data listener changes the flowing state, pause flips it back, but resume is fired no matter what!
The answer isn't quite simple — the internal resume is fired asynchronously, but the state is changed synchronously. So first, while attaching the data listener, the on method switches the state to true and schedules the resume evaluation. Then we call pause, which changes the state back to false. But resume has already been scheduled, so it fires when all the sync work is done.
Here you should ask — why is it so weird?
It's because before using some methods and events, we should read the docs completely and carefully. The reason for this strange behaviour is that the readable and data events come from two different worlds.
You should use readable when you want to work with streams on a “low” level, using pause, resume, manual reading, etc.
But when you just want to get data from the stream and don't want to care about these manual things, rely on the data event. You shouldn't mix them; otherwise you get that odd behaviour. (The same applies to pipe, but we haven't discussed it yet.)
So, the correct way to read data from our stream looks like this:
It's cleaner, you don't need to check for null, and it works:
Oh, and I think it's time to make it readable by setting the encoding:
As we saw, we handled the data event only once. What happens when it's called twice? Should we join the data somehow? What do we need to do when it's binary? Is it possible to get corrupted data? And what happens to the stream when an error is emitted?
There are lots of questions. And we'll dive into them next time in Part II, along with the other types of streams.
If you want to get more info about the things that we've discussed, here are some links:
- Streams API (browser) — as I said at the start of this post, it's an experimental API, but it's still worth reading about.
- Stream and File System — Node.js docs.
- Backpressuring Streams — the behaviour we implemented above, when the downstream stream tells the upstream one “Wait! I can't eat more!”, is usually called backpressure. This article describes how it works in Node.js. It goes a bit beyond the current post, but I think it's possible to understand.
Not related to the topic directly, but still interesting:
- Corecursion — I haven't touched this topic here, but it's usually related to streams as math abstractions, and I'm not the right guy to explain it.
- Data vs Codata — once you see that streams are somehow codata (after reading the previous article), you may find this one also worth reading.
- Total functional programming — one more article with a lot of math-like terms and links (e.g. partial function).
Every piece of code that has been written here is available on GitHub.
9 August 2019
Streams, Part II
Let's continue. Last time we:
- implemented our own streams simulation,
- started to discuss Node.js Streams,
- tried to work with Readable Stream, messed everything up, but finally found the right way.
And we stopped with a question — what happens when more than one data event is emitted? To figure it out we don't need a big file; let's just set highWaterMark to a low value. Let's say 5. It means the stream will try to read just 5 bytes at a time. Here is how it looks:
In data.txt we have 50 “а” and a new line at the end:
Oh, again? What's wrong this time?
Actually, that's on me. I tricked you (probably), because I didn't set the encoding for reading, aaaand the “а”s in data.txt are Cyrillic, not Latin. But it's awesome, let's try to understand what's going on here.
First things first, we created a stream and set highWaterMark. There are no mistakes here. The stream worked exactly as we asked — it was reading 5 bytes at a time and emitting the data event, passing the read bytes into it.
Then, when the event was emitted, we got the data and appended it to content. But, as you remember from the previous post, when we don't set the encoding, we get data as a Buffer, not a string. Let's make sure:
Here is what we get. Each Buffer contains the bytes the stream read from the source. As you see, there are 5 bytes inside each buffer (except the last, which holds the newline character), just as we asked — five hex numbers representing parts of the string from data.txt. And the buffers oddly repeat with some sort of byte shift.
Now we're entering swampy terrain — the land of encodings. I like it, but now isn't the proper time to explain everything deeply, so I'm going to make the most important points clear here, and sometime in the future I'll write a more detailed post about it.
The file with data was saved in UTF-8, because it's 2019 and we always save files in UTF-8. The data inside the file is a sequence of Cyrillic letters. Each Cyrillic letter takes 2 bytes of memory in UTF-8. For example, letter “а” that we used there is coded as D0 B0:
While the Latin letter “a” is coded as 61 and takes just 1 byte to store:
(I've got these tables on unicode-table.com.)
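We can verify these byte lengths right from Node.js:

```javascript
// Cyrillic "а" takes two bytes in UTF-8, Latin "a" takes one
console.log(Buffer.from('а', 'utf8')); // <Buffer d0 b0>
console.log(Buffer.from('a', 'utf8')); // <Buffer 61>
```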
That's the main feature of UTF-8 — storing characters using the fewest possible bytes.
So now you probably understand that when we got a buffer with 5 bytes of data inside, it contained 2.5 letters. And after getting it, we were trying to concat the buffer to the string content. That caused wrong decoding during the conversion, and as a result we had “broken” characters at the end.
How to fix it? As I said before, I didn't set encoding explicitly that time. Let's fix it:
(We've got two new lines on lines 9 and 10 because we log data using console.log, which adds a newline character after the passed data. So, on line 9 we log the newline character from data.txt, and after that console.log adds one more.)
Awesome. We set the encoding and now everything works correctly. (Another solution is to concatenate buffers instead of strings and convert the resulting buffer to a string at the end.) But how does Node.js do it? Looking at the log, we see that it reads two or three characters at a time. Why?
Well, that's because, as we discussed before, highWaterMark isn't a constant value; it's our “advice” to Node.js, which it may adjust in difficult situations like this one.
What it does internally is try to detect incomplete multibyte characters at the end of the buffer and move them to the next buffer. It may sound tricky, but actually it isn't, because UTF-8 has a specific scheme, and it's possible to detect “broken” bytes by checking them against this scheme. There is a module that does the magic — String Decoder. We will check its code in the next article.
Now it's time to move to Writable Stream.
Writable Streams in Node.js
Let's assume we're creating a web server and we should return an HTML file for a user's request. Usually in real life there is a separate server that serves static files to the user (e.g. nginx), but this is just an example. It may look like this:
If we have index.html in the same directory with this file, this server sends it when we load http://localhost:3000/:
(It also can handle errors, but it should be obvious without screenshots.)
This HTML page is simple, and this code is actually not so bad for production. But what if we want to serve huge files? Let's say we want to create a file hosting service where anyone can upload and download any files. In this case the code above is terrible, because it loads the whole file into memory and only then sends it. But we know what to do, right? Let's rewrite it using streams!
Better? A little bit. We use streams but don't do it with respect. I'm sure you see lots of mistakes here, because we've already discussed them. Let's fix.
First of all, of course reading is faster than sending over the network, so we've actually got the same behaviour as before, because we ignore the return value of write and push data regardless of whether it can be sent. That leads to overflowing the buffer of the writable stream, and as a result we hold the whole file in memory. Let's be more careful and rewrite sendFile:
Now we know when we can write and when we can't. Also, as you see, we trigger the closing of the destination stream when the content of the source one is finished (the end event fired). It's a bit safer than checking the truthiness of content, and at the same time makes the code more readable (otherwise we'd need to split the condition inside the write function into two pieces, etc.).
But as you remember, we can use the data event instead of readable to make it handier:
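A data-based sendFile might be sketched like this (assuming the signature sendFile(file, res), where file is a readable source and res is a writable destination):

```javascript
function sendFile(file, res) {
  file.on('data', (chunk) => {
    // write returns false when the destination buffer is full
    if (!res.write(chunk)) {
      file.pause();                           // stop reading for a while
      res.once('drain', () => file.resume()); // continue after a flush
    }
  });
  file.on('end', () => res.end()); // close destination when source is done
}
```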
As you notice, Node.js writable streams have an API similar to the writable streams we implemented in Part I. We can call write to push data into the stream, listen for the drain event, check write's return value, etc. But of course there are more events and methods; you can find them in the docs. One of the new methods we use in our code is end. It closes the stream (and performs one more write before closing, if any data is passed as an argument), which makes it impossible to write more data and allows Node.js to destroy the stream once the data has fully left its buffer.
Also, we don't use stream.Writable here directly. We use http.ServerResponse, because res is an instance of it. It means it has custom methods (such as setHeader, which we called during our first try) and events. We will check some of them later, but now let's get back to the code.
So the code is clearer, but we have a new problem. It doesn't work properly with big files, because we assume that removing the listener for the data event pauses the source stream, but it doesn't! Here is a quote from the docs:
For backward compatibility reasons, removing 'data' event handlers will not automatically pause the stream.
So what should we do here? We can pause the source stream manually, then resume, etc. But it's quite exhausting, right? Let's use pipes! Of course Node.js has the same abstraction as we implemented in our home-made streams, but it's even better:
Awesome, huh? But of course there're pitfalls! Let's look at the whole file and compare it to the code that we wrote on our first try (without streams):
If you scroll above and look at that code, you will notice that there we handled errors during file reading and also set headers. Well, we can do it here too.
When we work with streams, we can always rely on the error event, because it's fired whenever an error occurs and streaming can't continue (reading or writing):
Now, if we run the server, then rename index.html to something else and try to open http://localhost:3000/, we get an error in the browser:
And log in the terminal:
This is a regular Node.js error object which has a code, so in real life there is more elaborate error handling, where we check the error code and choose the way of handling: just logging, sending an error to the user, restarting the server, etc.
Anyway, we still need to set Content-Type header for HTML files. Let's do it:
We could set the header on one of the source's events like open or ready, but actually it doesn't matter when we do it. We only need to set it before sending data, so we do it in a sync way, and that's fine. And we remove the header if an error occurs. In our case it's not so important, because the error string looks like HTML, but it's important for other file types (usually Content-Type is set dynamically using packages like mime).
Oof. Is there something that we haven't handled yet? Well, there is.
As we learned before, when one works with streams, one should carefully read the docs, because Stream descendants can emit more events for situations not foreseen by the original streams. And developers must be sure that they know all these events and that all of them are handled.
One of these events is close on http.ServerResponse. As the docs say, it “indicates that the underlying connection was terminated before response.end() was called or able to flush.” What it means for us is that the user can cancel the request while receiving the data.
But why is it important? Well, it sounds like it can lead to memory leaks, because some operations may not finish correctly. Let's try to confirm our guess.
First of all, let's replace our tiny index.html with something bigger. HTML file with “The Adventures of Sherlock Holmes“ seems suitable.
Then, let's add more event handlers. We know that we use fs.ReadStream as a source and http.ServerResponse as a destination. fs.ReadStream emits three more events: open, ready, close. We're not interested in ready, because it doesn't have special meaning, it's an alias for open event which was added to standardize events of Stream heirs. http.ServerResponse emits two more events: close and finish. So, let's log all of them:
Now let's run the server and open the page:
Here is what we see in the log:
We don't see “writable close” here, because it's emitted when the user aborts the connection. So let's try to emulate an abort. It's not easy to do right in the browser, but we can use curl:
We run it with --limit-rate 1K, which means “download 1 KB per second”, and cancel execution right after getting the first data chunk. Now we have a different log:
What happened? Well, when the user aborts the connection, the destination stream is closed, but the source one isn't. To force the source stream to close, we can destroy it manually:
Now everything works properly:
Let's make a conclusion.
First, we now know how to read files and send them without excessive memory usage.
Second, we have a flexible function that can get any streams as arguments. We can replace fs.createReadStream with our own streams implementation which may, let's say, resolve paths in our storage, join chunks of data from different disks, decrypt and decompress them, etc. And instead of http.ServerResponse we can pass any other Writable Stream (but we should remember that if these streams implement custom events they should be handled). That's because Streams and Event Emitters are quite universal and it's easy to implement your own class which would implement both of their interfaces.
But we still have some questions unanswered. Why do we need to destroy the readable stream manually in the example above? Is destroying the stream enough to remove all the associated objects? Are there more pitfalls? How do these actions look in real life? What about the other types of streams? We will cover them in the next parts of the series.
There aren't so many useful links this time.
We've slightly touched the http module, but it has many more methods and built-in objects. Some of them we could even use in our resulting code. E.g. we could replace status strings with http.STATUS_CODES.
Also, the http module is based on net. Its description is harder to understand, because it's more low-level. But it will help us answer some of the questions we raised above.
And the code is available on GitHub.
17 August 2019
Streams, Part II (appendix)
Earlier we solved some problems with Readable Streams and encodings and wrote a reasonably working function for piping from Readable to Writable Streams, which we used for sending the content of HTML files to clients.
There were two interesting things (besides UTF-8) that we skipped: the String Decoder module and write abortion. Let's deal with them, because they are worth it.
As I said in Part II, there is a built-in module, String Decoder, that does the encoding magic during reading, writing, sending requests, etc. Let's try it:
If we follow the code step by step, here is what happens.
On line 4 we tell the decoder to process 4 bytes: D0 B0 D0 B0. As you remember, D0 B0 means Cyrillic “а”. So we want to decode two “а”s. Well, the decoder does the job, checks that the array of bytes doesn't end with a corrupted character, and passes it through without any changes. That's why we have “аа” in the log.
Next, on line 5 we ask the decoder to process 3 bytes: D0 B0 D0. It checks the array and notices that there aren't enough bytes for one more UTF-8 symbol, because the trailing D0 isn't a complete UTF-8 sequence. It keeps D0 in its internal buffer. That's why we see a valid “а” in the log and don't see any corrupted characters.
Later, on line 6, we want to decode only 2 bytes: B0 D0. But the decoder has D0 in the buffer. So it prepends that byte to our two bytes. As a result it decodes D0 B0 D0, and the result is the same as on line 5: “а” in the log, D0 in the buffer.
Finally, on line 7 we try to decode just one byte: B0. But the decoder has D0 in the buffer from the previous decoding. Again, it prepends that very byte to the passed array, gets D0 B0, and decodes it as one more correct character. Voilà!
(Note: I want to clarify that String Decoder checks only the beginnings and ends of the passed arrays, because it can't restore broken symbols in the middle of them. It should be obvious, but the name of the module makes it sound like it can do any decoding magic.)
Internally this decoder module has some highly optimized code designed to be as fast as possible, because it shouldn't be a bottleneck during reading. E.g. there is code which normalizes the passed encoding:
It tries to guess as fast as possible which encoding you want to use. As you see, there's special “quick” handling for the most common options: utf8 and utf-8.
The second optimized part of the decoder is its buffer. Let's log it before and after each write and end operation:
Here is what we get:
Symbol(kNativeDecoder) is a unique name for the buffer (made using Symbol). It's a usual practice for Node.js internal objects to have properties like this when they're related to the C++ side, because it makes them safe from accidental overriding by the user, which could crash Node.js.
Yes, this buffer is related to the C++ side, because more than a year ago String Decoder was rewritten in C++ for the sake of speed. It's not a big deal, but it can clear up some weird moments in the log above.
I'm sure you can link this log to the described logic of our code, and now you can see the saved and restored bytes in the buffer. But there are two questions:
- First, why do the fourth and fifth buffers in the log contain D0 B0 instead of D0?
- Second, what are the ones at the end of the buffers?
They're not so important, but that strange behaviour becomes a pain in the ass once you notice it in the log.
The reason is the binding between C++ and JS. The buffer of a String Decoder instance maps to a fixed-size memory array. You can think of it as a malloc'd array in C++ (though actually it's a Uint8Array). And it's quite a useful feature when we don't want to waste memory.
So, any instance of String Decoder has an internal 7-byte buffer, where each byte is used for specific purpose:
- The first four bytes are character bytes (CB). They hold the bytes of the character that String Decoder tries to restore. It doesn't need to store more, because the max length of a UTF-8 character is 4 bytes (other encodings have even shorter character lengths).
- The fifth byte is the missing bytes counter (MBC). It changes when the decoder figures out how many more bytes it should read to get a correct character (probably).
- The sixth byte is the buffered bytes counter (BBC). Well, it's just the number of buffered bytes.
- The seventh byte is the encoding flag (EF).
Now we can fully understand what the code above does. Here is the log:
- When we create a String Decoder instance, it sets EF to 1, because the code of UTF-8 is 1.
- Then, we write 4 bytes; String Decoder checks that there are no corrupted bytes and does nothing.
- After that we write 3 bytes, and String Decoder finds the last byte incomplete and adds it to the buffer. It also sets MBC to 1 (because it should read one more byte to complete the character) and increases BBC (because it has written one byte to the buffer).
- Next, we write 2 more bytes. The decoder pushes the first one into the buffer to complete the corrupted character. It's faster than allocating a new buffer for this character. After that it prepends the completed character to the data chunk, but finds one more corrupted character and saves its byte in the buffer. That may not be obvious from the log above, but in the buffer only the first CB is valid; the second CB is just leftovers from the previous operations. We can be sure that nothing bad will happen, because BBC says that we have only one meaningful byte in the buffer. And of course MBC equals 1 too.
- Finally, we write one more byte. The decoder does the same job as in the previous step, but this time there are no more corrupted characters, so it sets MBC and BBC to 0.
To dig deeper, read the sources: string_decoder.js → string_decoder.h → string_decoder-inl.h → string_decoder.cc.
Let me remind you what happened last time. We were writing an algorithm for reading a file and sending it to the client over HTTP, when we suddenly realised that the user could abort the connection and break the stream piping, which might cause a memory leak. The code looked like this:
And when the user aborts the connection, close is emitted on dest, but nothing else happens. Let's dig into the sources and find out what's going on there.
Here is the source code of pipe method:
The most interesting lines are highlighted. As we can see, pipe handles lots of edge cases, but none of these handlers destroys the source stream. So even if an error appears on the destination stream, the source one isn't destroyed, because Node.js can't be sure that it should be (e.g. the source may be an infinite generator of data, and when an error occurs on the destination stream, something handles it and creates another consumer stream).
One more important point here — pipe is implemented on Readable Stream. It knows nothing about subclasses, their events and similar effects. So it can't detect such a thing as an “aborted HTTP request”.
That's why we should handle close and error events on our own. However, there are lots of packages that can do it for us. One of them is end-of-stream. Its implementation is easy to follow, so you can understand what it does straight from the repo description. Another one is pump. It allows you to pipe streams and handle their completion by passing a callback as an argument (e.g. for debugging or logging purposes).
These packages have become really popular (also check the whole collection), and as a result they were added to the Stream module as Stream.finished and Stream.pipeline (I'm sure you know about them, because you've already read the article about backpressure in streams, haven't you?). So instead of destroying the source stream manually, let's use pipeline instead of pipe:
Now, when we run this code and do the same curl-ceremony again, we get this:
We haven't seen this error before, because it's a new one and was added along with pipeline. We don't want to shut the whole server down because of an error like this, so let's handle it:
Now it works properly:
But why is this even important? If you check the code of pipe again, you will see that the source stream is unpiped, which means it isn't flowing and doesn't consume memory. So why should we care?
To illustrate the answer, let's run the original code (without destroying the stream) and cancel the request 10-20 times. After that run lsof -p PID, where PID is the process ID of node (you can find it by running ps ax | grep node):
lsof shows the list of open files. As you can see, our node instance keeps a lot of files open. You can run sysctl kern.maxfilesperproc to get the limit of open files per process:
24 576 means that each process can open no more than 24 576 files. Which in turn means that this implementation of the server can handle ≈24 550 aborted requests, and after that it will crash (≈24 550 because node opens some files internally for its own purposes).
(Note: there are some differences between OSs, so the limits in your OS may be read and set with other commands, such as ulimit.)
We can reach this limit to check what happens in that case. We could do it by running a script which starts and aborts curl in an infinite loop, like this:
But then we'd have to wait 6-7 hours for the error. Instead, we can just create two streams, pipe one into the other and close the consumer:
It's a slightly different case from the one described above, because here we close the destination stream even before writing has started, but it leads to the same behaviour, so it's fine. (Closing before reading or writing starts is also a common case; e.g. someone can send you a zero-sized file.)
Wait two seconds and:
EMFILE is the very error that occurs when Node.js can't open more files because of system limits.
But if we change pipe to pipeline, it works properly:
Conclusion: when you work with streams (or any other entities that you don't fully understand), always check edge cases and read the documentation and source code carefully.
There is no completely new information here, so there are no extra articles to read.
But while preparing this article I was thinking about the title and wanted to call it “Part 2.5”, but didn't know how to write that in Roman numerals. So I started searching and read on Wikipedia that the Romans had their own way of writing fractions. In my case it would be “Part IIS”, but I thought that looked too Microsoft-ish, and that's why I decided to call it “appendix”. Anyway, read the article about Roman numerals, it's cool.
The code is available on GitHub.
25 August 2019
Streams, Part III
After a thorough examination of Readable and Writable Streams in the previous posts, it's time to get into the remaining two types of streams: Duplex and Transform.
Sometimes we want an object that implements both the Readable and Writable interfaces, because it's handier than two separate ones, especially when we work with duplex entities such as network sockets. And it's easy to imagine in Node.js, because Readable and Writable Streams are different: they have different methods, different events, etc. Of course it isn't a coincidence; they were created this way exactly to make it possible to combine them into one object — the Duplex Stream.
Internally Duplex Stream is easy-peasy — it has ≈150 lines of code, and 30% of them are comments. Most of the lines look like this:
But this causes a problem — instanceof doesn't work properly. Let's try to emulate it:
Here is what we get:
As you can see, we can call the copied methods, but instanceof works only for the explicitly inherited prototype. We can even rewrite it using classes to make it a bit clearer:
But the result is the same:
Why? Because of the way instanceof works. As you may know, it tries to find the passed object in the prototype chain of the requested one. So let's check the chain:
There is no “Second” there, so instanceof doesn't work correctly. But let's check the Duplex class:
Let's look at the ECMAScript specification and find what it says about the instanceof algorithm:
The abstract operation InstanceofOperator(V, target) implements the generic algorithm for determining if ECMAScript value V is an instance of object target either by consulting target's @@hasInstance method or, if absent, determining whether the value of target's prototype property is present in V's prototype chain.
@@name is the notation for well-known Symbols. They're used across the spec and inside the implementations of different built-in constructs such as instanceof. So @@hasInstance means Symbol.hasInstance, and here's what the spec says about the usage of this symbol:
A method that determines if a constructor object recognizes an object as one of the constructor’s instances. Called by the semantics of the instanceof operator.
So, there are two things that help instanceof work: the object's Symbol.hasInstance method and its prototype chain. When the first is available, it's used; otherwise the prototype chain is checked. Let's try to play with Symbol.hasInstance and make an instance of Third work properly with instanceof:
Here we've added a static property to Second where we check whether the passed object is an instance of Third or just an instance of Second. Now, when we run it, we get this:
It works. instanceof recognizes Second as a parent class of an instance of Third. The same applies to instances of Second (which means we haven't broken the original algorithm), and doesn't apply to classes unrelated to Second (such as Date).
Even more, it works the same way for all of Third's heirs, because of object instanceof Third. It means that we can do this:
And get this:
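The whole experiment can be put together in a few lines. (A sketch: the class names First/Second/Third follow the post, while greet is a made-up method standing in for the copied Writable methods.)

```javascript
class First {}

class Second {
  greet() { return 'hello from Second'; }

  // The custom instanceof hook: recognize real instances of Second,
  // plus anything that inherits from Third (whose prototype only
  // *copies* Second's methods instead of inheriting them)
  static [Symbol.hasInstance](object) {
    return object instanceof Third ||
      Function.prototype[Symbol.hasInstance].call(Second, object);
  }
}

class Third extends First {}

// copy Second's prototype methods onto Third, like Duplex does with Writable
for (const name of Object.getOwnPropertyNames(Second.prototype)) {
  if (name !== 'constructor' && !Third.prototype[name]) {
    Third.prototype[name] = Second.prototype[name];
  }
}

class Fourth extends Third {} // an heir of Third

const third = new Third();
console.log(third.greet());                  // the copied method works
console.log(third instanceof Third);         // true
console.log(third instanceof Second);        // true, via Symbol.hasInstance
console.log(new Fourth() instanceof Second); // true for heirs too
console.log(new Date() instanceof Second);   // false for unrelated classes
```

Note that the second branch calls the default check via Function.prototype[Symbol.hasInstance], so plain instances of Second keep working.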
So, why were we talking about this? Because that's exactly how the Writable class handles the instanceof check for Duplex instances!
It has a few more checks just to stay closer to the specification, but the general idea is the same. And it makes it possible to write polymorphic interfaces which can accept both Duplex and Writable (or Readable) Stream instances as arguments.
Another important thing about Duplex Streams is that they have separate Readable and Writable States inside (you can see the writable one in the code above). A Duplex Stream isn't something with a combined buffer that does complex things. It's just an object that inherits from two sources and keeps two streams in one place.
To understand how this works, let's try the net.Socket class, which is one of Duplex's heirs. We can create a TCP server and inspect the socket object which is created internally:
When we run it and make a request to localhost:3000 (e.g. using curl), we get this:
We get a long dump of the internal state of the Socket instance, which, as we can see, is also an instance of Duplex Stream (true on the 2nd line) and has two states, Readable and Writable, separated and isolated from each other. Since it implements both interfaces, we can use them simultaneously:
Now, let's make a POST request to localhost:3000:
So, the write and end methods worked properly and we got an answer (which looks like valid HTTP, because we made it so; also, we could have used just end instead of the sequence of write and end). And here's what we got in the terminal with the running script:
The listener for the data event worked properly and received the data as the first argument. Well, we expected that.
So, we've implemented a sort of HTTP server, but without the http module. And that's totally fine, because reading from and writing to the socket is exactly the job the http module does internally (along with parsing headers, encoding and decoding data, etc.).
Here's one interesting thing that we've missed. Socket inherits from Duplex Stream, but the objects req and res, which you get when you pass a listener during HTTP server creation and which are used to pass the data to the socket, don't fully implement the Readable and Writable interfaces. Actually, req implements Readable Stream, while res implements just Stream — the legacy version of the stream module, which is an Event Emitter with a pipe method. It's not a big deal and you can still rely on piping into the res object, just be aware that it's not a Writable Stream and doesn't have all its methods.
Duplex Streams are extremely useful, but they aren't often used directly as instances of Duplex Stream. They see much more use as Transform Streams.
This kind of stream implements the Duplex interface, but overrides the internal _write and _read methods, and also adds _transform and _flush to the interface. These new methods connect the input to the output.
There are lots of built-in streams that inherit from Transform Stream. For example, almost every class of the crypto module implements the Transform interface. The whole zlib module is built around the idea of transforming input to output by compressing or decompressing the data. But I think the most famous example of Transform Streams is the infinite collection of gulp plugins. gulp is a streaming build system where each step of the build process is usually represented by a custom plugin, and each of these plugins is an instance of Transform Stream. So if you've ever tried gulp, you've used Transform Streams.
Let's implement a simple custom Transform Stream which will lowercase the passed text:
Note that we don't need to implement the _flush method, because the amount of our input data equals the amount of output data; we push the data inside _transform (implicitly, by passing it as the second argument of the callback).
Now, when we create input.txt with some text inside and run the script, we get output.txt with the same text in lowercase. (Note that we assume the source of data isn't a buffer; that's why we pass decodeStrings: false and use toLowerCase without type casting. In real life there should be some additional checks.)
When we talked about gulp plugins we missed the fact that they usually work with objects, don't they? How do we implement the same idea here? Well, easily. Every stream (any of them) has an objectMode option which allows it to stream objects instead of buffers or strings. Let's conclude our stream series with an example which uses all types of streams in object mode.
First, let's create a natural numbers generator wrapped in the Readable Stream interface:
Here we set objectMode to true as a default option and implemented the _read method, where we push an object with the only property, n (the next natural number), into the buffer.
Now we need a Writable Stream to consume the result. Let's just log the data to the console:
Finally, let's create a Transform Stream which will convert our decimal numbers to hexadecimal:
And combine them together:
Given that Transform Stream inherits from Duplex Stream, we may say that we've used all four types of streams in this example.
Throughout the posts about Streams we've discussed inheritance a lot, and internally it's usually done via util.inherits. But it's actually broken, and here is an interesting issue explaining the bug. I'm not sure it's possible to run into it in real life, but anyway.
Of course, it's always worth reading the docs — Duplex and Transform Streams.
And to learn more about duplex communications in general, read about them too.
1 September 2019
But actually Ryan had created Node.js two years earlier, in 2009. And since then he had been giving talks at conferences and meetups everywhere. Ryan had been a Node.js evangelist for years at that point. And it wasn't easy.
First, in January 2010, Isaac Schlueter created a package manager he called “npm” — “Node.js Package Manager” (though he has never agreed with this expansion due to trademark rights). It allowed sharing Node.js modules as packages. Every package in npm was structured in a certain way, which made it possible to search among them, communicate with authors, and easily publish and update. It used semantic versioning, so the way of maintaining packages was predefined. It was also convenient: no git cloning or copy-pasting files into folders. Just add a dependency to one file called package.json, run one command — npm install — and BOOM! The library now lives in your project and is available to use. This practicality made Node.js much more user-friendly and allowed developers to externalise the complexity of bootstrapping projects.
After all, Node.js in those days wasn't only unstable; it gave cryptic error messages when something went wrong, and its API was nothing like the LTS releases we're dealing with right now. Developers spent huge amounts of time tracking bugs and obscure behaviour of Node.js during their daily programming routines. npm made it possible to collaborate on this and also to find ways to improve Node.js APIs. It made development much less isolating and much less frustrating. And it helped build a community of developers, because anybody got the possibility to get in touch with the author of any package and discuss its features and bugs. As a result, when one engineer found the best solution for some Node.js problem, they shared it through npm and the rest used it, which reduced the number of bugs that everyone had to hunt on their own.
MongoDB and JSON
Another boost for Node.js's rise came from a small obscure database project that had been built around the same time as Node.js (February 2009), was also experimenting with V8, and was being evangelized around the tech community at that time — MongoDB. In early 2009 there were a few entrants competing for the title of “NoSQL Database of choice”: CouchDB, Redis, and others, but they all were getting a lukewarm reception (e.g. check the thread on HN about CouchDB). Most of the development world wasn't sold on the idea of NoSQL at that time. It was still a new idea (even though it had existed since the 1960s). Usually every popular framework community had its “default” database: the RoR community used PostgreSQL, PHP — MySQL, C# — MSSQL, Java — Oracle, etc. Why would any pragmatic developer want to switch their tooling around to embrace a new DB, especially one that worked so completely differently from anything they were used to? One that didn't support SQL queries. One that stored data in JSON — a format that was largely unfamiliar to most devs and unsupported in many languages at that time. Even the developers who worked a lot with third-party APIs had little use for JSON, as most APIs exposed themselves over XML. But in the late 2000s the meteoric rise of social media started to change that. APIs like Twitter's were released and became widely popular among developers. These APIs almost exclusively exposed themselves over REST with JSON. So in the late 2000s any developer who wanted to work with social media APIs had to become familiar with JSON. And the languages and frameworks they worked with had to begin supporting it.
So, as social media sites grew in popularity and standardised things like OAuth and REST, the use of JSON APIs spread, and soon most new public APIs were being released RESTfully with JSON. JSON became everyone's problem. And as more and more developers were fetching data in JSON, the idea and the attractiveness of a JSON-based document store started to make sense to them. Fetching data from the APIs and storing it in a traditional columnar data store took a lot of extra work. You had to make sure there was a column for every key and subkey of the JSON document, and if the document changed you had to change the DB as well. With a schema-less JSON document store like MongoDB, developers faced no such problems. They simply fetched the document and sent it to the DB as is. MongoDB just made their lives easier. But working with a JSON document on a backend with a language like PHP, Java or Ruby was still no picnic. These languages didn't natively understand nested object structures, and developers had to manage two ideas of JSON documents: the actual JSON representation of the objects, as they might be stored in MongoDB or fetched from the API, and the object structure that their language presented to them to manipulate the data. This mental juggling was painful, and instead of writing business logic they had to write helpers and wrappers for dealing with JSON. And this is where Node.js came out of the dark.
Joyent and key projects
And over the next twelve months that miracle happened. Built by early adopters and unrelated third parties, one project after another turned Node.js from a weird choice into the clear choice for many developers. These projects were: Express.js, Mongoose, AngularJS and Node for Windows.
Another key project, released in 2011, was Mongoose — an easy-to-use Node.js driver for MongoDB. With Mongoose anyone could plug MongoDB into a project in two minutes rather than two hours. This made building hello-world projects and fun hackathon projects in Node.js a lot more practical.
One more key development happened on the frontend, not the backend. In 2010 Google released AngularJS, the frontend framework that everyone blessed for two-way binding — the active syncing of the data objects on the client with those on the server and in the database. In AngularJS both Node.js and MongoDB found their champion. If a dev was working with AngularJS, it would be hard to work with a backend and database that didn't understand JSON, and the project would be very slow-going. But with Node.js + MongoDB + Mongoose the JSON object on the frontend could easily be synced with the backend and the database. For Node.js and MongoDB, embracing two-way binding became key. And the developers who chose this stack quickly gained a competitive edge for building real-time applications. Out of this advantage the “MEAN” stack was born.
MEAN stands for “MongoDB, Express.js, AngularJS, Node.js”, but it could just as easily stand for “Mongoose, Express.js, AngularJS, npm”. MEAN quickly became what RoR or LAMP were for previous generations of web developers — everything they needed to build any kind of application that was in demand in those days, with relative ease. (The actual name “MEAN” would appear later, in 2013, but the tools were used together right after their creation.)
But for those building in raw Node.js outside of the MEAN stack, Node.js remained cumbersome, unstable and hard to debug. These limitations were due to the fact that Node.js was a one-man project with a loose association of contributors, no sponsors, etc. In short, to become, as Ryan Dahl said, “the next PHP”, Node.js needed a corporate sponsor. In a blog post in 2010 Dahl explained the situation and introduced to the community a company called Joyent, which had hired him and was purchasing the name and the trademark of Node.js.
Joyent's stewardship promised stability, security and scalability of the Node.js core and the tooling surrounding it. Within a year Joyent had delivered on this promise and given the next boost to Node.js. In 2011 Joyent, in partnership with Microsoft, released Node.js 0.6.0, which included support for Windows, opening Node.js to a much wider audience than just OS X and Linux developers. While many developers built their software on Linux machines, they had to run it on Windows, and the Windows version of Node.js was crucial for them to adopt the platform and incorporate it into their development cycle. By the fall of 2011, around the time the video of Ryan Dahl's 2011 speech was making the rounds of the dev community, all these developments: Express.js, AngularJS, MongoDB and Node.js for Windows, had already taken place. And thanks to Joyent's governance of the project, Node.js was getting more stable and easier to use with every release. Additionally, JSON APIs were spreading like wildfire, making MEAN stacks and Node.js in general much more attractive. And Node.js still had its creator and evangelist Ryan Dahl pounding the pavement all over the Bay Area and abroad, spreading the gospel of Node.js.
Node.js Foundation and stable releases
By this time some big companies were using Node.js: Joyent and Microsoft (obviously), LinkedIn, Uber. Later other tech giants would move to Node.js: Netflix, PayPal, eBay, etc.
Anyway, by the end of 2011 all these forces had come together and produced exactly the miracle that Node.js needed. From that time on, the Node.js project grew and changed in largely predictable ways. In 2012 Dahl stepped aside and introduced Isaac Schlueter — the creator of npm — as the new project lead. Schlueter subsequently stepped down and promoted TJ Fontaine to take his place in 2014. Then in 2015 Fontaine left too, and due to a rift between groups of Node.js developers a fork of Node.js was created — io.js — and many major contributors migrated there, abandoning the core project. As a result, the governance of the project was taken out of Joyent's hands and given to a neutral organisation called the Node.js Foundation. This new organisation allowed all sides to work together and develop the project. At the time of this writing Node.js is still under the governance of the Node.js Foundation.
Node.js's popularity has grown enormously since its early days, but the face of the project has largely changed over time. Early pioneers like TJ Holowaychuk have since moved to other projects and platforms and been replaced by other stars like Sindre Sorhus, James Halliday, Addy Osmani, and so on. The project reached a significant level of stability in 2015 and released its official non-beta version. According to the generally accepted rules of semantic versioning, all versions of Node.js prior to that were actually just beta, non-public releases without LTS. In 2015 io.js released version 1.0, and once it was merged back into Node.js under the new umbrella organisation, Node.js was released as 4.0 the same year.
Node.js's rise to adoption was unexpected and impressive. But something happened during that rise, something not quite worth celebrating. Many of the new breed of Node.js developers, those who adopted the platform after all these events, adopted it not because of Node.js's runtime, I/O model, CLI or API, but because of its ecosystem, and because of the speed and ease with which they could build applications with tools like Express.js and the plethora of available npm modules, without having to learn Node.js itself or care much about how it worked. Today we have an entire generation of developers who know the MEAN stack, Mongoose, Underscore, AngularJS, Bootstrap and applications on Node.js using these tools, but know almost nothing about Node.js itself. You can see why: the ecosystem, not the API, converted Node.js from a pet project into a giant platform. npm, not Node.js, is what attracts most developers. MEAN, not Node.js, is what many devs specialise in. Mongoose, not MongoDB, is what many devs understand how to integrate with. This ecosystem of libraries, frameworks and projects propelled Node.js to stardom, but it doesn't mean we haven't lost something in return. It's especially ironic that just as the Node.js API was stabilising and finally became a fully-featured API providing so many features out of the box, developers mostly abandoned them in favour of libraries and frameworks built on top of it.
That's why we dived into the sources, looked under the hood of Node.js and tried to understand its core concepts and mechanisms. Of course we haven't discussed every nook and cranny, but most of the remaining parts are related to different topics: the OS's file and process management, networking, algorithms, etc. Someday we will check them all, find the difference between TCP and UDP, figure out why HTTP/2 is awesome (or why it's not), and understand how ZIP works, but now it's time to build something attractive and interesting, just to understand how to do it using Node.js and its frameworks.
This post wouldn't exist without the sources linked throughout the text, and without two more resources: A history of Node.js and The Node.js Master Class (this post is based on the video “The Story of Node.js” from that master class).