

Yes we are still alive... or dead for that matter.

The dangers of anime in your machine learning

Thu, 12 Sep 2019 16:19:20 +0800

I have been experimenting with machine-learning image upscaling via a program called ESRGAN, a worthy successor to the famous Waifu2x, which many people use to upscale anime art to twice its size. The default model ESRGAN came with, while good, did not handle cartoony/anime artwork particularly well. This all changed when kingdom akrillic released his manga109 model, which showed what the upscaler could do with a well-trained model. This opened the floodgates to other people trying machine learning for themselves.

The first model I trained tried to replicate Waifu2x's dataset, which leaned less towards manga pages and more towards manga/anime-styled artwork. The easiest method was to grab the Danbooru dataset that folks on the web have collated, containing hundreds of gigabytes of data that required a special torrent client to download. Unfortunately, not all of the pictures are of good quality, so the images had to be curated into a proper dataset with only the best. I called my attempt WaifuGAN in honour of Waifu2x and of the main content of the dataset.

The first version of WaifuGAN had JPEGs in the dataset, which resulted in the model learning the compression noise present in those pictures and reproducing it in the upscales, something definitely unwanted. The later attempts removed the JPEGs and used only PNGs, which gave better results. It seemed to produce comparable, if not better, results than Waifu2x… for the moment. Then I noticed some artifacting in the upscales:


Instead of thin or thick lines, certain parts of the outlines upscaled to double lines. This problem was also present in the manga109 model. I tried to troubleshoot where the model learned this behaviour, and why Waifu2x rarely had this problem in its upscales, by looking at the dataset. Then I looked at the low-res images that were generated for the training dataset, and came to a theory of why this happened: rim lighting.


Consider the illustration above, which at first glance makes for a good input image for training. Now observe the four parts highlighted below at 4x zoom:


For clarification, when preparing a training dataset, a pair of images called HR (Hi-Res / the original-sized image) and LR (Low-Res / the downscaled image) is created. The computer basically has to learn how to convert the LR picture into the HR picture as closely as possible. For a 4x upscaling model, the LR must be ¼ of the HR's width and height. The 1-pixel-wide outlines used in the HR are problematic enough on their own, but if you observe the images above, there is also another 1-pixel-wide line alongside the outlines, in a clearly brighter colour, before you reach the colour of the area fills.

This effect is called rim lighting because you add some brightness to the edges of objects. In traditional painting or 3D, this effect is achieved with a glow/highlight near the edges of the objects. This adds more contrast and helps define the shape of the subject matter. In anime art, it is often emulated as a line/strip of brighter colour on the inside of the outlines. So how does this stylised effect cause trouble in machine learning situations? Consider the LR tile generation of the tiles shown above:


These were generated using 4 different downscaling methods:

  1. Nearest neighbour
  2. Bicubic smoother
  3. Bicubic sharper
  4. Photoshop preserve detail 2.0

Observe how the outlines and rim lighting were scaled. Where the rim lighting is thick, it downscales correctly and is visible in the LR images. But where it is 1 pixel thick, downscaling to 25% size results in the highlight merging with the outline (!) in most cases, and these pairs then teach the model that such lines should be upscaled with a highlight added. Most models trained on anime paintings will exhibit the behaviour of adding highlights to some outlines or, in worst-case scenarios, turning the lines into double lines. To be succinct: Garbage In, Garbage Out.
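The merging can be reproduced with a toy example. The sketch below is only an illustration under assumptions: it works on a single row of grey values (the four values are made up), and it uses a plain 4-pixel average and a simple pick-one-pixel sampler as stand-ins for the real filters, so it will not match what Photoshop's bicubic modes or Preserve Details 2.0 actually output.

```python
def box_downscale(row, scale=4):
    """Average each group of `scale` pixels into one (a simple box filter)."""
    usable = len(row) - len(row) % scale
    return [round(sum(row[i:i + scale]) / scale) for i in range(0, usable, scale)]


def nearest_downscale(row, scale=4):
    """Keep one pixel from each group of `scale` (nearest-neighbour style)."""
    usable = len(row) - len(row) % scale
    return [row[i + scale // 2] for i in range(0, usable, scale)]


# Hypothetical grey values, chosen only for illustration.
BACKGROUND, OUTLINE, RIM, FILL = 255, 20, 230, 120

# HR row with a 1px outline followed by a 1px rim light, then the fill.
thin_hr = [BACKGROUND] * 6 + [OUTLINE, RIM] + [FILL] * 8

# HR row where the outline and the rim light are each 4px wide.
thick_hr = [BACKGROUND] * 4 + [OUTLINE] * 4 + [RIM] * 4 + [FILL] * 4

print(box_downscale(thin_hr))       # [255, 190, 120, 120]
print(nearest_downscale(thin_hr))   # [255, 20, 120, 120]
print(box_downscale(thick_hr))      # [255, 20, 230, 120]
print(nearest_downscale(thick_hr))  # [255, 20, 230, 120]
```

In the thin case the averaged LR has one muddy pixel (190) where the HR has a dark outline next to a bright rim, and the sampled LR keeps the outline but loses the rim entirely. Either way the LR shows a single line while the HR target shows a line plus a highlight, which is exactly the pairing that teaches the model to invent highlights. In the thick case both features survive as separate LR pixels.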


So what images would make better input? Stills taken from actual anime, or illustrations that mimic this style, are good. They usually don't resort to rim lighting because flat colours are preferred. And if rim lighting is indeed used, it tends to be reasonably thick as well. Otherwise, illustrations that either have thick outlines or do not use outlines at all are better. These matter if you need your model to be responsive to colour gradients instead of the flat colours that permeate anime screenshots.
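If you want a rough first pass before curating by eye, something like the following could flag candidates with thin outlines. This is a hypothetical heuristic, not part of ESRGAN or any existing tool: the function names, the darkness threshold of 64, and the rule "narrower than the scale factor counts as thin" are all assumptions, and it only scans rows, so vertical strokes are measured but horizontal ones are not.

```python
def thinnest_dark_run(row, threshold=64):
    """Return the length of the shortest run of dark pixels in a row, or None."""
    shortest = None
    run = 0
    # The appended sentinel is not dark, so it closes a run that ends the row.
    for value in list(row) + [threshold]:
        if value < threshold:
            run += 1
        elif run:
            shortest = run if shortest is None else min(shortest, run)
            run = 0
    return shortest


def has_thin_lines(rows, scale=4, threshold=64):
    """True if any row contains a dark run narrower than the scale factor."""
    for row in rows:
        width = thinnest_dark_run(row, threshold)
        if width is not None and width < scale:
            return True
    return False


# Toy greyscale rows: a 1px outline with rim light, and a 4px outline.
thin_image = [[255, 255, 20, 230, 120, 120, 120, 120]]
thick_image = [[255] * 4 + [20] * 4 + [120] * 4]

print(has_thin_lines(thin_image))   # True
print(has_thin_lines(thick_image))  # False
```

Treat a hit as a prompt to look at the image, not as a verdict: plenty of good training images contain small dark details that are not outlines at all.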


On another note, also avoid 'sketchy' illustrations such as the one above, because the line quality is inconsistent and the image itself sometimes outright contains double outlines, which will mess up your training data pronto.

Unless, of course, double lines are exactly what you want. Then everything discussed here is moot. :P

You can find ESRGAN models trained by me and a bunch of brilliant people over at the Upscale Wiki.

A dose of reality in your dinosaur theme park.

Sun, 05 Jun 2016 07:50:26 +0800


Do you liek horned bunniez?

Fri, 20 May 2016 17:37:42 +0800


The 2K Momohime Project is go!

Thu, 17 Sep 2015 23:03:59 +0800

Details here.

- Feed from Tumblr -