Visualize C Data-Types on Linux with Ruby
--
Data types in C are like containers for storing different kinds of data. There are various data types with various sizes, and the size depends on the target system and the compiler. Here is a list of the typical sizes with GCC or Clang on a 64-bit (x86-64) Linux system:
- char: 1 byte
- short: 2 bytes
- int: 4 bytes
- long: 8 bytes
- long long: 8 bytes
- float: 4 bytes
- double: 8 bytes
- long double: 16 bytes

On 32-bit systems, however, GCC and Clang make both int and long int (or long) 32 bits (4 bytes). The same applies to their unsigned counterparts, and long double shrinks as well (typically 12 bytes on 32-bit x86). So you can't assume long is going to be 8 bytes in all cases.
Intro
When I was learning C for the first time, I learnt about data types and how they work, but I had no idea about the effect they could have on system memory. If you use a long long instead of a char to store, say, 0 and 1, you will use 7 more bytes per variable. But that can't really be visualized with modern system monitors like gnome-system-monitor, htop, top, etc.
If you create a while loop running 5 million times and create a float or double inside it, you won't see a lot of memory used in your system monitor, because the value lives in an automatic (stack) variable that is reused on every iteration, so there is nothing to deallocate or free. For example, even though C doesn't have a GC by default, you can still run this and not run out of memory:
int main() {
    char *b = "Hello!";
    while (1) {
        char *a = "This is a String!";
        b = "Something!";
    }
}
But if you were to do things like this in Ruby:
GC.disable # Disable the Garbage Collection feature
while true
  b = "Something" # Create a new String and assign it to b
end
That will be disastrous, because you are creating the string "Something" on every iteration. Normally the GC collects strings and other objects that have no references and frees them from memory, but with the GC disabled you will soon use up all of your system's available memory and freeze your whole system in a very short amount of time.
So visualizing data types and knowing their real memory usage in C is sometimes hard. So I have come up with a way to visualize them.
My plan is to write C files with 100,000 variables inside each, generated by a small Ruby script.
To follow along, all you need is a Linux system with a decent processor, like an i3 or better; otherwise the generated sources will take a while to compile. And, as the title suggests, you also need the Ruby programming language.
Our Plan
- We are going to create large C files with the help of Ruby.
- In the C files we are going to declare a humongous number of variables of types like float, int, double, long double, etc.
- Each program will also measure its own memory usage at the end and print the amount of memory it is consuming before exiting. It uses Linux's /proc/PID/statm file for that.
- We are then going to compile and execute them to see how much memory they consume.
Spoiler alert… char uses the least and long double uses the most memory…
Creating the Main Ruby File
Using the Ruby language, we will create a C file that declares variables like this:
int _a=0,_b=0,_c=0…_n=0;
… and the generated file will be megabytes long! This is just for educational purposes. Don't use code like this in production!
So here is the code:
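The full script lives in the generate_sources.rb GitHub gist. Since the embed isn't reproduced inline here, below is a minimal sketch of the same idea; the placeholder variable names (_v0, _v1, …) and the exact file layout are illustrative assumptions and may differ from the actual gist:

```ruby
#!/usr/bin/env ruby
# generate_sources.rb -- minimal sketch, not the exact gist.
# Writes one C source per data type into cfiles/; each program declares
# N variables of that type and prints its own memory usage before exiting.
require 'fileutils'

N = Integer(ENV.fetch('N', 100_000))

TYPES = <<~'EOF'
  char % int % short % long % long long
  float % double % long double
EOF

# C helper that reads /proc/self/statm (the same function is explained below).
MEM_FN = <<~'CODE'
  #include <stdio.h>
  #include <unistd.h>

  unsigned long long calcMemUsage() {
      unsigned long long resident, shared;
      FILE *f = fopen("/proc/self/statm", "r");
      if (!f) return 0;
      if (fscanf(f, "%*llu %llu %llu", &resident, &shared) != 2) { fclose(f); return 0; }
      fclose(f);
      return (unsigned long long)((resident - shared) * sysconf(_SC_PAGESIZE));
  }
CODE

FileUtils.mkdir_p('cfiles')

TYPES.split(/[\n%]/).map(&:strip).reject(&:empty?).each do |type|
  # N declarations of the form: _v0 = 0, _v1 = 0, ...
  decls = (0...N).map { |i| "_v#{i} = 0" }.join(', ')

  File.write("cfiles/#{type}.c", <<~SRC)
    #{MEM_FN}
    int main() {
        #{type} #{decls};
        printf("Memory: %llu Bytes\\n", calcMemUsage());
        return 0;
    }
  SRC

  puts "Generated cfiles/#{type}.c"
end
```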
This code creates a directory called "cfiles" and the source files inside it. Here we are going with the following data types:
TYPES = <<~'EOF'
char % int % short % long % long long
float % double % long double
EOF
Newlines and % are the separators, so we are going with char, int, short, long, long long, float, double, and long double.
The memory usage is calculated by these lines:
unsigned long long calcMemUsage() {
    unsigned long long resident, shared;
    FILE *f = fopen("/proc/self/statm", "r");
    if (!f) return 0;
    if (fscanf(f, "%*llu %llu %llu", &resident, &shared) != 2) { fclose(f); return 0; }
    fclose(f);
    return (unsigned long long)((resident - shared) * sysconf(_SC_PAGESIZE));
}
The file /proc/self/statm contains memory usage info about the current program. All we need are the 2nd and the 3rd fields: the 2nd field contains the resident set size and the 3rd field contains the resident shared pages. If we subtract them, we get the memory used, but in multiples of the page size.
The page size is generally 4096 bytes, but we can call sysconf(_SC_PAGESIZE) to get the accurate value, and multiply that page size by resident - shared. For example, if resident is 250 pages and shared is 60 pages, the program is using (250 - 60) * 4096 = 778,240 bytes.
Now let’s compile and execute one binary file:
$ gcc -O0 cfiles/double.c -o double
$ ./double
Memory: 786432 Bytes
It shows the memory usage right away. Don't forget the -O0 flag to prevent optimizations by the compiler, GCC in this case.
It takes some time to compile the file. So why not generate a Makefile?
Generate Makefile
A Makefile contains rules made up of shell commands that tell the system how to build a target. A simple example of a Makefile can be:
# Filename: Makefile
all:
	echo "Hello World!"
If you run:
make all
It will print “Hello World!”
Don’t forget to indent with tabs instead of spaces.
Q. But why do we need Makefile here?
A. We need a Makefile so that we can compile the sources in parallel, one job per CPU thread. You can use make -j4 to compile 4 files at once. You can also compile them manually, or just compile in the background with &, and skip this entire section. But a Makefile is kind of clean!
But what we need is a boring, long Makefile containing rules for all the files to compile. We can also eliminate that monotonous work and let Ruby do it for us:
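The real generate_makefile.rb lives in its own gist. As a rough sketch (again, not the exact gist; the real one also lists a generate target and may name things differently), it scans cfiles/, writes one Makefile target per source, and suggests a -j value based on your CPU thread count:

```ruby
#!/usr/bin/env ruby
# generate_makefile.rb -- minimal sketch, not the exact gist.
# Emits a Makefile with one target per C file in cfiles/ and suggests
# a parallel "make -jN" based on the number of CPU threads.
require 'etc'
require 'fileutils'

FileUtils.mkdir_p('binaries')

sources = Dir['cfiles/*.c'].sort
abort 'Run generate_sources.rb first!' if sources.empty?

# "cfiles/long long.c" becomes the make target "long_long"
targets = sources.map { |src| File.basename(src, '.c').tr(' ', '_') }

File.open('Makefile', 'w') do |mk|
  mk.puts 'CC=gcc'
  mk.puts 'CFLAGS="-O0"'
  mk.puts
  mk.puts "all: #{targets.join(' ')}"
  mk.puts
  sources.zip(targets).each do |src, target|
    mk.puts "#{target}:"
    # Recipe lines in a Makefile must be indented with a tab
    mk.puts "\t$(CC) $(CFLAGS) \"#{src}\" -o \"binaries/#{target}\""
  end
end

puts 'Generated Makefile'
puts "Please Run \"make all -j#{Etc.nprocessors}\""
```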
Make sure to run generate_sources.rb before this.
We have run N=100_000 ruby generate_sources.rb before, so 100,000 variables were generated for each of short, int, float, double, long double, etc. (whatever we have in TYPES above). To change N to, say, 500,000, we can just run the scripts like this:
$ N=500_000 ruby generate_sources.rb
$ ruby generate_makefile.rb
So let's run this. Here is the Makefile it generates:
CC=gcc
CFLAGS="-O0"

all: generate char int short long long_long float double long_double

char:
	$(CC) $(CFLAGS) "cfiles/char.c" -o "binaries/char"
int:
	$(CC) $(CFLAGS) "cfiles/int.c" -o "binaries/int"
short:
	$(CC) $(CFLAGS) "cfiles/short.c" -o "binaries/short"
long:
	$(CC) $(CFLAGS) "cfiles/long.c" -o "binaries/long"
long_long:
	$(CC) $(CFLAGS) "cfiles/long long.c" -o "binaries/long_long"
float:
	$(CC) $(CFLAGS) "cfiles/float.c" -o "binaries/float"
double:
	$(CC) $(CFLAGS) "cfiles/double.c" -o "binaries/double"
long_double:
	$(CC) $(CFLAGS) "cfiles/long double.c" -o "binaries/long_double"
Generated Makefile
Please Run "make all -j4"
OK, as instructed in the last line of output from generate_makefile.rb, let's run:
$ make all -j4
It will compile the sources into binary files inside the "binaries" directory:
$ time make all -j4
gcc "-O0" "cfiles/char.c" -o "binaries/char"
gcc "-O0" "cfiles/int.c" -o "binaries/int"
gcc "-O0" "cfiles/short.c" -o "binaries/short"
gcc "-O0" "cfiles/long.c" -o "binaries/long"
gcc "-O0" "cfiles/long long.c" -o "binaries/long_long"
gcc "-O0" "cfiles/float.c" -o "binaries/float"
gcc "-O0" "cfiles/long double.c" -o "binaries/long_double"real 2m50.049s
user 10m1.554s
sys 0m1.147s
[ We can also omit "all" in "make all" ]
Right, so it took 2 minutes 50 seconds to compile on my system, which is a little bit old and runs a 3rd-generation i3 processor.
Execution
Now that we have the unoptimized (-O0) binaries, we can just run them one by one and check the memory usage! But that's boring, so let's run the files in one go with the help of our Bash (or maybe Zsh, Yash, etc.) shell:
$ for i in binaries/* ; do echo "$i: `./$i`" ; done
binaries/char: Memory: 73728 Bytes
binaries/double: Memory: 729088 Bytes
binaries/float: Memory: 282624 Bytes
binaries/int: Memory: 282624 Bytes
binaries/long: Memory: 790528 Bytes
binaries/long_double: Memory: 1622016 Bytes
binaries/long_long: Memory: 798720 Bytes
binaries/short: Memory: 262144 Bytes
Right, this is the memory used by 100,000 copies of short, int, float, double, long double, etc. Each C program prints the memory it is consuming by itself!
I have run it multiple times, here is the output:
The problem is that the measurement isn't really precise; it can be a few tens or even hundreds of bytes off. So to make it more precise, increase the value of N in generate_sources.rb from 100,000 to maybe 500,000? That way we will create 500K variables, but that will also increase the compile time, especially if your system is a little bit rusty like mine.
So here’s the chart you all have been waiting for:
The chart was created in LibreOffice Calc by plotting the averages of the values above.
Recap
So to recap, here are the steps I took:
- Created a new file called generate_sources.rb and pasted the content from the generate_sources.rb GitHub gist above.
- Created a new file called generate_makefile.rb and pasted the content from the generate_makefile.rb GitHub gist above.
- Ran N=100_000 ruby generate_sources.rb to generate the sources with 100,000 variables of each data type.
- Ran ruby generate_makefile.rb to generate the Makefile.
- Ran make all -j4, replacing -j4 with the number of CPU threads suggested by generate_makefile.rb.
- Executed the binaries with the bash one-liner for i in binaries/* ; do echo "$i: `./$i`" ; done. That showed the total memory used by the 100K variables of each data type.
So that is how we can visualize C data types in the real world with the help of a scripting language and a compiled language. You can also generate any ridiculously long program like this with a scripting language like Ruby, Perl, or Python.
If you are working on a C project that has a function returning 0 to 255, use an unsigned char; if it returns something like 0 to 32767, use a short, and so on. The reason is that if you use, say, a long instead of a char or short, you will use more memory. If you do that inside a loop that runs many billions of times, writing more bytes to memory will cost more time at some point. So use data types wisely.
Hope you learnt something new… Have a good day!