A 6-Stage Algorithm to Turn AI-Generated Pixel Art into the Real Thing
TL;DR
AI-generated pixel-art-style images have dot sizes and positions that are subtly off, which makes them unusable as-is in game engines and pixel art editors. Pixel Snapper converts such a broken grid into proper pixel art aligned to an exact grid in six stages: color quantization → edge detection → grid estimation → integration → Elastic Walker → majority-vote resampling. The design decisions in each stage (stable initialization via k-means++, outlier-robust estimation via the median, and so on) are quiet but effective tricks.
The “Almost-There” Problem in AI-Generated Pixel Art
Recent AI image generation has become surprisingly high-quality, but pixel art has its own peculiar problems. When you zoom into a generated image, the size and position of dots — which should align to a uniform grid — are subtly off. Anti-aliasing inserts intermediate colors at boundaries, and gradients give adjacent dots, which should be the same color, slightly different colors.
Visually it looks “right enough”, but things break down once you load it into SpriteFusion or Aseprite. If you try to use it as a sprite sheet for game development, slicing fails because the grid isn’t aligned.
Pixel Snapper solves this in six stages. Let’s walk through what each stage does and why those design choices were made.
The Overall Pipeline
First, let’s look at the big picture.
```mermaid
flowchart LR
    A[Input image] --> B[Stage 1\nColor quantization]
    B --> C[Stage 2\nEdge profile]
    C --> D[Stage 3\nGrid spacing estimation]
    D --> E[Stage 4\nSpacing integration]
    E --> F[Stage 5\nElastic Walker]
    F --> G[Stage 6\nResampling]
    G --> H[Output image]
```
The input is an “AI-generated pixel-art-style image”, and the output is “pixel art aligned to a precise grid”. Each stage takes the previous stage’s output as input, processed in pipeline fashion.
Stage 1: Color Quantization — Reducing Colors with k-means++
The first stage reduces the image’s color count.
In AI-generated images, adjacent pixels that should be the same color end up subtly different due to anti-aliasing or gradients. A “red dot” isn’t pure red; it’s scattered across dozens of variants — darker red, brighter red, orange-tinted red. If you try to detect edges in the next stage in this state, you’ll pick up tons of nonexistent boundaries.
The algorithm uses k-means++ clustering to consolidate all colors into 16 representatives by default. The procedure is:
- Collect RGB values only from opaque pixels ()
- With k-means++ initialization, pick initial representatives that are well-spread in color space
- Assign each pixel to the nearest representative, then update each representative to the mean of its members
- Repeat until convergence or up to 15 iterations
- Replace every pixel with its nearest representative
The k-means++ initialization is the key. Plain k-means has unstable results because centroid initialization is random, but k-means++ uses weighted sampling where “points farther from already-chosen centroids are more likely to be chosen next”. This produces well-spread initial placement in color space and stabilizes the result.
Convergence is declared when the centroid movement falls below . Given that each RGB component lies in , effectively means “no longer moving”.
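The quantization steps above can be sketched in plain Python. This is a minimal illustration, not Pixel Snapper's actual code: the function names, the `alpha_min` opacity cutoff, the `tol` convergence threshold, the fixed `seed`, and the choice to leave transparent pixels untouched are all assumptions made for the example.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two RGB triples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(p, centroids):
    """Index of the centroid closest to p."""
    return min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))

def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: each later seed is sampled with weight equal to
    its squared distance from the closest seed chosen so far."""
    seeds = [rng.choice(points)]
    d2 = [dist2(p, seeds[0]) for p in points]
    while len(seeds) < k and sum(d2) > 0:  # stop early if distinct colors run out
        chosen = rng.choices(points, weights=d2, k=1)[0]
        seeds.append(chosen)
        d2 = [min(old, dist2(p, chosen)) for p, old in zip(points, d2)]
    return seeds

def quantize(pixels, k=16, max_iter=15, tol=0.01, alpha_min=1, seed=0):
    """pixels: flat list of (r, g, b, a) tuples. Returns a new list.
    alpha_min and tol are hypothetical values chosen for this sketch."""
    rng = random.Random(seed)
    opaque = [p[:3] for p in pixels if p[3] >= alpha_min]
    if not opaque:
        return list(pixels)
    centroids = [tuple(map(float, s)) for s in kmeans_pp_init(opaque, k, rng)]
    for _ in range(max_iter):
        sums = [[0.0, 0.0, 0.0] for _ in centroids]
        counts = [0] * len(centroids)
        for p in opaque:  # assignment step
            i = nearest(p, centroids)
            counts[i] += 1
            for ch in range(3):
                sums[i][ch] += p[ch]
        # update step: mean of members (an empty cluster keeps its old centroid)
        updated = [tuple(s / n for s in sums[i]) if n else centroids[i]
                   for i, n in enumerate(counts)]
        moved = max(dist2(a, b) ** 0.5 for a, b in zip(centroids, updated))
        centroids = updated
        if moved < tol:  # largest centroid movement is negligible
            break
    out = []
    for p in pixels:
        if p[3] >= alpha_min:
            c = centroids[nearest(p[:3], centroids)]
            out.append((round(c[0]), round(c[1]), round(c[2]), p[3]))
        else:
            out.append(p)  # assumption: transparent pixels are left as-is
    return out
```

Calling `quantize(pixels, k=16)` on the flattened pixel list returns the same number of pixels, with every opaque one snapped to one of at most 16 colors.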
Stage 2: Edge Profile — Compress Color Changes into 1D
After cleaning up colors, the next step is to find “where the grid boundaries are likely to be”.
In clean pixel art, colors only change at grid lines. So columns and rows where color changes concentrate become candidate grid boundaries.
Concretely, each pixel is converted to grayscale, and the brightness difference (gradient) between adjacent pixels is computed. Grayscale conversion uses the weights from ITU-R BT.601:
$$Y = 0.299R + 0.587G + 0.114B$$
The human eye is most sensitive to green and least to blue, so this weighting is more perceptually natural than a simple average $(R + G + B) / 3$.
The gradient is computed by applying a $[-1, 0, 1]$ kernel. Subtracting the left neighbor from the right gives the horizontal change. This is the simplest edge detection filter, a simplified version of Sobel or Prewitt.
- Column profile: For each column, sum the absolute horizontal gradients across all rows
- Row profile: For each row, sum the absolute vertical gradients across all columns
The result is two 1D arrays. Larger values indicate higher likelihood of a grid boundary at that position. By collapsing 2D image data into 1D profiles (projections), subsequent peak detection becomes drastically simpler.
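As a rough sketch of the projection step, the snippet below computes both profiles for an image stored as a list of rows of RGB(A) tuples. It assumes a central-difference kernel `[-1, 0, 1]` (right neighbor minus left neighbor) and leaves the border columns and rows at zero; both the function names and the border handling are choices made for this example.

```python
def luma(p):
    """ITU-R BT.601 luma from an (r, g, b[, a]) tuple."""
    r, g, b = p[:3]
    return 0.299 * r + 0.587 * g + 0.114 * b

def edge_profiles(img):
    """img: list of rows, each a list of pixel tuples.
    Returns (col_profile, row_profile) as two 1D lists."""
    h, w = len(img), len(img[0])
    gray = [[luma(p) for p in row] for row in img]
    col_profile = [0.0] * w
    row_profile = [0.0] * h
    # Horizontal gradient: right neighbor minus left neighbor, summed per column.
    for y in range(h):
        for x in range(1, w - 1):
            col_profile[x] += abs(gray[y][x + 1] - gray[y][x - 1])
    # Vertical gradient: lower neighbor minus upper neighbor, summed per row.
    for y in range(1, h - 1):
        for x in range(w):
            row_profile[y] += abs(gray[y + 1][x] - gray[y - 1][x])
    return col_profile, row_profile
```

With this kernel, a single vertical edge shows up as raised values in the two columns on either side of it, which is worth keeping in mind when picking peaks from the profile.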
Stage 3: Grid Spacing Estimation — Take the Median of Peak Distances
With profiles in hand, estimate the width of one grid cell.
- With a threshold of 20% of the profile’s max, extract local maxima (points greater than their neighbors) above the threshold as peaks
- Remove peaks closer than 4px apart (noise reduction)
- Compute all distances between adjacent peaks
- Return the median of those distances as the grid width
The reason for using the median instead of the mean is robustness to outliers. At image edges or sprite boundaries, abnormally large gaps appear, and the mean would be pulled by them. For example, with intervals , the mean is but the median is . The actual grid width is likely close to 8, so the median is more accurate.
This procedure is applied to the column and row profiles independently to estimate grid widths along X and Y axes.
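The peak-picking and median steps might look like the sketch below. The function name and parameter names are hypothetical, and the rule for resolving two peaks that sit closer than 4px (keep the stronger one) is an assumption, since the text only says that such peaks are removed.

```python
from statistics import median

def estimate_spacing(profile, rel_threshold=0.2, min_gap=4):
    """Return the median distance between adjacent peaks, or None if
    fewer than two peaks are found."""
    if not profile or max(profile) <= 0:
        return None
    threshold = rel_threshold * max(profile)
    # Local maxima strictly greater than both neighbors and above the threshold.
    peaks = [i for i in range(1, len(profile) - 1)
             if profile[i] > threshold
             and profile[i] > profile[i - 1]
             and profile[i] > profile[i + 1]]
    kept = []
    for i in peaks:
        if kept and i - kept[-1] < min_gap:
            # Assumption: of two peaks that are too close, keep the stronger.
            if profile[i] > profile[kept[-1]]:
                kept[-1] = i
            continue
        kept.append(i)
    if len(kept) < 2:
        return None
    gaps = [b - a for a, b in zip(kept, kept[1:])]
    return median(gaps)
```

As an illustration with made-up numbers: peaks at positions 8, 16, 24, 32, and 64 give gaps of 8, 8, 8, and 32. Their mean is 14, dragged up by the single large gap, while the median is 8.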
Stage 4: Spacing Integration — Reconcile X and Y Axes
Pixel art grids are usually square, but Stage 3 may produce different values for X and Y. This stage merges them into one consistent value.
The processing branches into four cases.
| Case | Action |
|---|---|
| Both X and Y estimated, ratio within 1.8x | Use the average of both for both axes |
| Both estimated, ratio over 1.8x | Use the smaller for both axes (the larger is likely a misestimate) |
| Only one estimated | Use that value for both axes |
| Neither estimated | Fall back to (shorter side / 64) |
The 1.8x threshold is interesting. When the ratio exceeds 1.8x, it’s interpreted as “edge detection on one axis didn’t go well”, and the smaller value is taken as correct. The larger one likely picked up peaks that span multiple grid cells.
The “shorter side / 64” fallback is designed to give a reasonable initial estimate for typical pixel art grid sizes (8–64px).
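The four cases in the table translate almost directly into code. The sketch below uses `None` to mean "no estimate" and hypothetical parameter names; treating a ratio of exactly 1.8 as "within" is an assumption.

```python
def integrate_spacing(step_x, step_y, width, height,
                      max_ratio=1.8, fallback_cells=64):
    """Merge per-axis grid estimates into one value used for both axes.
    step_x / step_y are positive numbers, or None when estimation failed."""
    if step_x is not None and step_y is not None:
        small, large = min(step_x, step_y), max(step_x, step_y)
        if large / small <= max_ratio:
            step = (step_x + step_y) / 2   # estimates agree: average them
        else:
            step = small                   # larger one likely spans several cells
    elif step_x is not None or step_y is not None:
        step = step_x if step_x is not None else step_y
    else:
        step = min(width, height) / fallback_cells  # nothing detected
    return step, step
```

For example, estimates of 8 and 10 are averaged to 9, while estimates of 8 and 24 are treated as a misdetection on one axis and resolved to 8.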
Stage 5: Elastic Walker — Grid Lines That Snap to Edges
This is the heart of the algorithm.
The grid width from Stage 4 is only an “average spacing”; in AI-generated images, dot positions wobble, so cutting at strict equal spacing slices through the middle of dots. Elastic Walker is a greedy method that uses equal spacing as a baseline but adjusts to actual edges.
- Start at position 0; mark a position one grid width ahead as the “target”
- Center a “search window” of 35% of the grid width (minimum 2px) on the target
- Find the position with the highest profile value within the search window
- If that value exceeds 50% of the profile’s overall mean, place a cut there (snap to edge)
- Otherwise, place the cut at the target position itself (fall back to equal spacing)
- Repeat until reaching the image edge
The 35% search window width is a delicate balance. Too wide and you risk shifting by a whole grid cell; too narrow and you miss the edge.
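A minimal sketch of the walk along one axis follows. Several details are assumptions for illustration: the 35% window is read as a total width centered on the target (half on each side), the image edge is always appended as the final cut, and every cut is forced to advance by at least one pixel so the loop terminates. The function and parameter names are hypothetical.

```python
import math

def elastic_walk(profile, step, window_ratio=0.35, min_window=2.0,
                 strength_ratio=0.5):
    """Greedy walk along one axis. Returns cut positions from 0 to len(profile)."""
    limit = len(profile)
    if limit == 0 or step <= 0:
        return [0]
    mean_val = sum(profile) / limit
    half = max(step * window_ratio, min_window) / 2  # assumption: centered window
    cuts = [0]
    pos = 0.0
    while pos + step < limit:
        target = pos + step
        lo = max(math.ceil(target - half), cuts[-1] + 1)
        hi = min(math.floor(target + half), limit - 1)
        # Strongest profile value inside the window (None if the window is empty).
        best = max(range(lo, hi + 1), key=lambda i: profile[i], default=None)
        if best is not None and profile[best] > strength_ratio * mean_val:
            cut = best                                # snap to the edge
        else:
            cut = max(round(target), cuts[-1] + 1)    # fall back to equal spacing
        cuts.append(cut)
        pos = float(cut)
    if cuts[-1] != limit:
        cuts.append(limit)  # assumption: close the last cell at the image edge
    return cuts
```

With a grid width of 8 and edges that actually sit at 7, 16, and 24, the walk returns `[0, 7, 16, 24, 32]` instead of cutting at strict multiples of 8. On a completely flat profile it degrades to plain equal spacing.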
A two-pass stabilization follows.
- Pass 1 (per-axis stabilization): If an axis has too few cuts, supplement with information from the other axis
- Pass 2 (cross-axis verification): Compare cell widths between X and Y; if they differ by 1.8x or more, re-split based on the smaller one
Through this cross-verification, even if edge detection fails on one axis, the other axis’s result can compensate.
Stage 6: Resampling — Aggregate Each Cell into One Dot by Majority Vote
The last stage collapses each cell into a single pixel based on the grid determined in Stage 5.
For typical image downscaling, bilinear interpolation (a weighted average of surrounding pixels) is used, but in pixel art, intermediate colors are unwelcome. So majority voting is used.
- Set the output image size to “(column cuts - 1) × (row cuts - 1)”
- Tally the colors of all pixels inside each cell
- The most frequent color becomes the representative
- On ties, decide deterministically by lexicographic order of RGBA values
The advantage of majority voting is that the original colors stay intact. If a cell has 60 reds, 30 blues, and 10 greens, it becomes red. Averaging would create intermediate colors and blur edges; majority vote keeps things sharp.
The lexicographic tiebreaker is a quiet but important detail. With a random pick, results would change between runs, but lexicographic order always returns the same output for the same input (deterministic). Reproducibility is fundamental for both debugging and testing.
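The vote and its deterministic tiebreak fit in a few lines. In this sketch the image is a list of rows of RGBA tuples and the cuts are strictly increasing positions that include both image edges. Picking the lexicographically smallest color on a tie is an assumption (the text does not say which direction wins), as is the transparent fill for an empty cell.

```python
from collections import Counter

def resample(img, col_cuts, row_cuts):
    """Collapse each grid cell to its most frequent color.
    Output size is (len(col_cuts) - 1) x (len(row_cuts) - 1)."""
    out = []
    for y0, y1 in zip(row_cuts, row_cuts[1:]):
        row = []
        for x0, x1 in zip(col_cuts, col_cuts[1:]):
            counts = Counter(img[y][x]
                             for y in range(y0, y1)
                             for x in range(x0, x1))
            if not counts:
                row.append((0, 0, 0, 0))  # assumption: empty cell -> transparent
                continue
            # Highest count wins; ties go to the smallest RGBA tuple, so the
            # result never depends on iteration order or randomness.
            color, _ = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
            row.append(color)
        out.append(row)
    return out
```

Because Python compares tuples lexicographically, sorting on `(-count, color)` gives the tiebreak for free.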
Summary of Design Decisions
What’s common across all stages is “choosing robust methods”.
- k-means++: More stable results than random initialization
- Median: Outlier-resistant grid width estimation
- Elastic Walker: Equal spacing as baseline, with flexible adherence to edges
- Cross-verification: If one axis fails, the other compensates
- Majority vote: Avoids blur from averaging
AI-generated images have the property of being “approximately right but subtly off”. Rather than methods that assume perfect input, the algorithm consistently picks methods robust to noise and error — that, I think, is its design philosophy.
The six stages are individually simple, but combined as a pipeline they turn an unstable input (an AI-generated image) into stable pixel art output. It’s a practical and elegant approach built from basic image processing techniques: k-means clustering, edge detection, peak detection, and a greedy walk.
That’s all.