<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE article  PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "http://dtd.nlm.nih.gov/publishing/3.0/journalpublishing3.dtd"><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" dtd-version="3.0" xml:lang="en" article-type="research article"><front><journal-meta><journal-id journal-id-type="publisher-id">JCC</journal-id><journal-title-group><journal-title>Journal of Computer and Communications</journal-title></journal-title-group><issn pub-type="epub">2327-5219</issn><publisher><publisher-name>Scientific Research Publishing</publisher-name></publisher></journal-meta><article-meta><article-id pub-id-type="doi">10.4236/jcc.2022.106011</article-id><article-id pub-id-type="publisher-id">JCC-118235</article-id><article-categories><subj-group subj-group-type="heading"><subject>Articles</subject></subj-group><subj-group subj-group-type="Discipline-v2"><subject>Computer Science&amp;Communications</subject></subj-group></article-categories><title-group><article-title>
 
 
  Using U-Net to Detect Buildings in Satellite Images
 
</article-title></title-group><contrib-group><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Wang</surname><given-names>Eric</given-names></name><xref ref-type="aff" rid="aff1"><sup>1</sup></xref><xref ref-type="corresp" rid="cor1"><sup>*</sup></xref></contrib><contrib contrib-type="author" xlink:type="simple"><name name-style="western"><surname>Wang</surname><given-names>Dali</given-names></name><xref ref-type="aff" rid="aff2"><sup>2</sup></xref></contrib></contrib-group><aff id="aff2"><addr-line>Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, USA</addr-line></aff><aff id="aff1"><addr-line>Department of Computer Science, Stanford University, Stanford, USA</addr-line></aff><pub-date pub-type="epub"><day>09</day><month>06</month><year>2022</year></pub-date><volume>10</volume><issue>06</issue><fpage>132</fpage><lpage>138</lpage><history><date date-type="received"><day>2</day><month>May</month><year>2022</year></date><date date-type="rev-recd"><day>27</day><month>June</month><year>2022</year></date><date date-type="accepted"><day>30</day><month>June</month><year>2022</year></date></history><permissions><copyright-statement>&#169; Copyright 2014 by authors and Scientific Research Publishing Inc. </copyright-statement><copyright-year>2014</copyright-year><license><license-p>This work is licensed under the Creative Commons Attribution International License (CC BY). http://creativecommons.org/licenses/by/4.0/</license-p></license></permissions><abstract><p>
 
 
  This report presents a method that uses deep learning and a stochastic gradient descent algorithm to automatically detect buildings in satellite images. In this method, a convolutional neural network architecture called U-Net is trained to separate the building pixels from the rest of the image. The method applies a binary cross-entropy loss function, uses the ADAM algorithm for gradient descent optimization, and adopts Intersection over Union (IoU) for accuracy measurement. A steady decrease in loss and an overall increase in accuracy were observed during training and validation. Finally, visualization of the masks predicted by the trained model after 20 epochs showed that the U-Net model delivers over 60% IoU accuracy for detecting buildings in satellite images.
 
</p></abstract><kwd-group><kwd>U-Net</kwd><kwd>Satellite Images</kwd><kwd>Computer Vision</kwd><kwd>Object Detection</kwd></kwd-group></article-meta></front><body><sec id="s1"><title>1. Introduction</title><p>Satellite imagery is used in many technical fields, including meteorology, oceanography, fishing, agriculture, regional planning, education, intelligence, and warfare. This study uses satellite images for building detection. There are several reasons to focus on building detection in satellite imagery: 1) it can help planners decide where new buildings can go and how each existing building fits among its surrounding buildings, 2) it can be applied to city building simulation, where a holistic visualization of each section of the city helps to find ways to improve the current infrastructure, and 3) it can help with risk detection and management (e.g. natural disaster planning).</p><p>There is an extensive literature on detecting buildings in satellite imagery using traditional approaches. Feature-based approaches to characterize and detect buildings were developed in [<xref ref-type="bibr" rid="scirp.118235-ref1">1</xref>] [<xref ref-type="bibr" rid="scirp.118235-ref2">2</xref>]. A region-based technique for building detection was presented in [<xref ref-type="bibr" rid="scirp.118235-ref3">3</xref>]. A model to compute the contours of buildings was proposed in [<xref ref-type="bibr" rid="scirp.118235-ref4">4</xref>]. More recently, deep learning techniques have been applied to building detection. A scheme that combines deep learning with guided filters for efficient building detection from satellite images was proposed in [<xref ref-type="bibr" rid="scirp.118235-ref5">5</xref>]. A method for automatic airport detection in remote sensing images using convolutional neural networks was presented in [<xref ref-type="bibr" rid="scirp.118235-ref6">6</xref>]. 
Finally, an R-CNN-based method for building detection in remote sensing images was presented in [<xref ref-type="bibr" rid="scirp.118235-ref7">7</xref>].</p></sec><sec id="s2"><title>2. Method</title><sec id="s2_1"><title>2.1. Satellite Data and Mask Generation</title><p>Most satellite images come in a specialized data format (such as GeoTIFF or HDF [<xref ref-type="bibr" rid="scirp.118235-ref8">8</xref>] [<xref ref-type="bibr" rid="scirp.118235-ref9">9</xref>]) and require domain knowledge as well as a geospatial library (such as GDAL [<xref ref-type="bibr" rid="scirp.118235-ref10">10</xref>]) to process. To avoid this external software dependency, we first converted the original data from GeoTIFF into the common NumPy array format with channel-first ordering. The resulting dataset contains 16 images, each with a core area of 625 &#215; 625 pixels plus 92 padding pixels on each side of the core box, for a total size of 809 &#215; 809 pixels (625 + 2 &#215; 92 = 809). Each image contains 4 channels: the cyan, magenta, yellow, and key (black) color maps. Binary masks for these images (where 1 represents a building pixel and 0 represents anything else) were created manually. The images and masks were randomly split into three parts: 12 image/mask pairs for the training dataset, 2 for the validation dataset, and the final 2 for the testing dataset. An example input image and its corresponding mask are illustrated in <xref ref-type="fig" rid="fig1">Figure 1</xref>.</p></sec><sec id="s2_2"><title>2.2. Building Detection with U-Net</title><p>This study adopted a convolutional network called U-Net to detect buildings. This section briefly introduces the U-Net architecture and then illustrates the general workflow for training, validation, and testing. 
Several technical components, such as the loss function, the accuracy measurement, and the optimization algorithm, are explained in the sections that follow.</p><sec id="s2_2_1"><title>2.2.1. U-Net Model Architecture</title><p>Our primary segmentation tool, U-Net [<xref ref-type="bibr" rid="scirp.118235-ref11">11</xref>], is an architecture originally designed for biomedical image segmentation that has since become one of the most popular architectures for many types of semantic segmentation. It consists of a contracting (downsampling) path followed by an expanding (upsampling) path, linked by skip connections: connections that carry feature maps from the early layers of the network directly to the corresponding later layers. The skip connections restore spatial detail that would otherwise be lost during downsampling, which helps the network to localize objects precisely. Drawing the two paths side by side, with each skip connection joining layers of matching resolution and the two paths meeting in the middle, gives the architecture its signature “U” shape. In this study, a PyTorch implementation of U-Net is adopted [<xref ref-type="bibr" rid="scirp.118235-ref12">12</xref>].</p></sec><sec id="s2_2_2"><title>2.2.2. Workflow of Building Detection</title><p>The workflow of this study is presented in <xref ref-type="fig" rid="fig2">Figure 2</xref>. First, a U-Net model with randomly initialized weights takes the input images from the training dataset and produces a series of predicted masks. These masks are then compared with the labelled masks to calculate the loss. Next, the loss is used in the backpropagation procedure to adjust the weights of U-Net with a gradient descent algorithm. During training, the intermediate U-Net model is applied to the images in the validation dataset to produce validation masks. These masks are compared with the associated labelled masks to measure the prediction accuracy of the model. 
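</p><p>The loop described so far (forward pass, loss calculation, gradient-based weight update) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the code used in this study: a hypothetical two-parameter logistic model applied to each pixel stands in for U-Net, plain gradient descent stands in for ADAM, and the function names, the learning rate, and the toy data are invented for the example.</p><preformat>
```python
import math


def sigmoid(z):
    """Map a raw score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(targets, probs, eps=1e-7):
    """Mean binary cross-entropy between target labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(targets)


def train(pixels, labels, epochs=20, lr=0.5):
    """Fit a toy per-pixel logistic model (a stand-in for U-Net, assumption).

    Returns the weight, the bias, and the training loss recorded at each epoch.
    """
    w, b = 0.0, 0.0
    n = len(pixels)
    history = []
    for _ in range(epochs):
        # Forward pass: predicted probability that each pixel is a building.
        probs = [sigmoid(w * x + b) for x in pixels]
        # Compare the prediction with the labelled mask.
        history.append(bce_loss(labels, probs))
        # Gradient of the mean BCE with respect to w and b.
        grad_w = sum((p - y) * x for p, y, x in zip(probs, labels, pixels)) / n
        grad_b = sum(p - y for p, y in zip(probs, labels)) / n
        # Plain gradient descent update (the study itself used ADAM).
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, history


# Toy data (hypothetical): bright pixels are buildings, dark pixels are not.
pixels = [0.1, 0.2, 0.8, 0.9]
labels = [0, 0, 1, 1]
w, b, history = train(pixels, labels)
print(history[0], history[-1])  # the loss shrinks as training proceeds
```
</preformat><p>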
After training and validation, the final U-Net model was saved to disk, and a test utility was created to load the model, take the images of the testing dataset, and produce masks that show building pixels.</p></sec></sec><sec id="s2_3"><title>2.3. Loss Function</title><p>Our loss function for this project was Binary Cross Entropy [<xref ref-type="bibr" rid="scirp.118235-ref13">13</xref>], or BCE for short. It is a loss function used for binary (yes/no, A/B, 0/1) classification tasks and is given by Equation (1):</p><p><disp-formula><label>(1)</label><tex-math>\mathrm{Loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]</tex-math></disp-formula></p><p>where <inline-formula><tex-math>\hat{y}_i</tex-math></inline-formula> and <inline-formula><tex-math>y_i</tex-math></inline-formula> are the output scalar value and the target value of the <italic>i</italic>-th term, respectively, and <italic>N</italic> is the number of scalar values in the model output. We use BCE because it is cheap to compute and because minimizing it is equivalent to maximum likelihood estimation under a per-pixel Bernoulli model, an estimator that is consistent and statistically efficient under standard conditions.</p></sec><sec id="s2_4"><title>2.4. Accuracy Measurement</title><p>For accuracy measurement, the Intersection over Union (IoU) metric, which ranges from 0 (no overlap) to 1 (complete overlap), was used to quantify the similarity between the predicted masks and the ground truth masks. <xref ref-type="fig" rid="fig3">Figure 3</xref> illustrates IoU as the ratio of the area of overlap between the two regions to the area of their union.</p></sec><sec id="s2_5"><title>2.5. Optimization Procedure</title><p>Our method adopted Adaptive Moment Estimation, or ADAM, for stochastic gradient descent optimization. ADAM relies on estimates of the first and second moments of the gradient to adapt the learning rate of each parameter. 
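</p><p>The moment-based update can be sketched for a single scalar parameter as follows. This is a hedged illustration of the commonly published form of ADAM rather than the code used in this study: the bias-correction terms, the small constant eps, the default values of lr, beta1, and beta2, and the function name adam_step are assumptions that this report does not specify.</p><preformat>
```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one ADAM update to a scalar parameter.

    theta: current parameter value
    grad:  gradient of the loss at theta for the current mini-batch
    m, v:  running first and second moment estimates from the previous step
    t:     1-based step counter, used for bias correction
    """
    # Exponentially decaying averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction compensates for m and v starting at zero (assumed, standard form).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Step along the averaged gradient, scaled by its root mean square.
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# First step on f(x) = x * x starting from x = 1.0, where the gradient is 2.0.
theta_a, m_a, v_a = adam_step(1.0, 2.0, 0.0, 0.0, 1)
# Same step with the gradient rescaled by 100: the move is almost identical.
theta_b, m_b, v_b = adam_step(1.0, 200.0, 0.0, 0.0, 1)
print(theta_a, theta_b)
```
</preformat><p>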
ADAM costs more per step than standard gradient descent because it maintains two running averages for every parameter, but it has the added benefit of converging in circumstances where standard gradient descent may not, as its step size is invariant to rescaling of the gradient.</p><p>Mathematically, let <inline-formula><tex-math>m_t</tex-math></inline-formula> and <inline-formula><tex-math>v_t</tex-math></inline-formula> be estimates of the mean and the uncentered variance of the gradient <inline-formula><tex-math>g_t</tex-math></inline-formula> of the current mini-batch, and let <inline-formula><tex-math>\beta_1</tex-math></inline-formula> and <inline-formula><tex-math>\beta_2</tex-math></inline-formula> be the decay rates. ADAM can then be represented by Equations (2):</p><p><disp-formula><label>(2)</label><tex-math>\begin{array}{l} m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \\ v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \end{array}</tex-math></disp-formula></p><p>Both quantities are exponentially decaying averages. Each parameter is moved along the averaged gradient, scaled by the inverse square root of the averaged squared gradient, which smooths the descent and damps the jagged routes that plain gradient descent can take toward the minimum.</p></sec></sec><sec id="s3"><title>3. Experiment and Result</title><sec id="s3_1"><title>3.1. Software Packages and Computer Configuration</title><p>Python 3.8 was the main programming environment. Several packages were used, including PyTorch, Click, and NumPy for code development, as well as matplotlib for visualization. PyCharm and Jupyter Notebook were used for debugging and testing the code. All code development and experiments were conducted on a 2020 16-inch MacBook Pro with an 8-core i9 processor and 16 GB of DDR4 memory.</p></sec><sec id="s3_2"><title>3.2. Training and Validation Results</title><p>Training ran for 20 epochs on the training data (about 30 seconds per image, or roughly 2.5 hours in total) and yielded several insights into the loss and accuracy of the predicted masks.</p><p>The average of the 12 BCE values at each epoch was used to track the training loss over time. The loss decreased substantially over the course of training, as shown in the first graph of <xref ref-type="fig" rid="fig4">Figure 4</xref>.</p><p>For validation accuracy, the average of the 2 IoU values at each epoch was used to measure prediction quality. 
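</p><p>The per-epoch validation score can be sketched as below. This is a minimal sketch, not the implementation used in this study: it assumes that masks are stored as nested lists of 0 and 1 values, the function names iou and mean_iou are hypothetical, and returning 1.0 when both masks are empty is a convention chosen for this example.</p><preformat>
```python
def iou(pred, target):
    """Intersection over Union of two binary masks given as nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == 1 and t == 1:
                intersection += 1  # pixel marked as building in both masks
            if p == 1 or t == 1:
                union += 1         # pixel marked as building in at least one mask
    # Two empty masks agree perfectly (convention assumed for this sketch).
    return intersection / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over (predicted mask, ground truth mask) pairs."""
    scores = [iou(pred, target) for pred, target in pairs]
    return sum(scores) / len(scores)


# Two hypothetical validation pairs, mirroring the two validation images.
pred_a = [[1, 1], [0, 0]]
true_a = [[1, 0], [1, 0]]
pred_b = [[0, 1], [0, 1]]
true_b = [[0, 1], [0, 1]]
epoch_score = mean_iou([(pred_a, true_a), (pred_b, true_b)])
print(epoch_score)
```
</preformat><p>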
Due to the high variance of the data, a logarithmic trend line was plotted to show the general increase in accuracy over time (<xref ref-type="fig" rid="fig4">Figure 4</xref>, second graph).</p><p>After training and validation, the final model was saved to a separate folder, called saved models.</p></sec><sec id="s3_3"><title>3.3. Testing Result</title><p>After training and validation, the trained model was run on the test images. <xref ref-type="fig" rid="fig5">Figure 5</xref> shows the result of the model on a test image. The predicted mask appears to locate the buildings accurately, which suggests that the trained model fits the task well.</p></sec></sec><sec id="s4"><title>4. Discussion and Future Work</title><p>In this project, a U-shaped convolutional network was used to detect building pixels in satellite images with the PyTorch package for Python on a laptop computer. A common optimization algorithm (ADAM) was used for training with a fixed learning rate, and the Intersection over Union index was used to measure accuracy. The result is generally satisfactory, but there is room for improvement, especially as the IoU values still fluctuated considerably even after 20 epochs. Currently, training the model takes a long time, and the training process drained the laptop battery rapidly even though the laptop was connected to a power source the entire time. For future work, we would like to use a graphics processing unit (GPU) to accelerate the deep learning calculation. We will also explore the impact of different learning rates on model prediction accuracy.</p></sec><sec id="s5"><title>5. Data and Code Availability</title><p>The code related to this report is publicly available on GitHub at https://github.com/Ericw553/sat_detect. 
All the training, validation, and testing data are also available upon request by email.</p></sec><sec id="s6"><title>Acknowledgements</title><p>Sincere thanks to the members of JAMP for their professional performance, and special thanks to managing editor Hellen XU for a rare attitude of high quality.</p></sec><sec id="s7"><title>Conflicts of Interest</title><p>The authors declare no conflicts of interest regarding the publication of this paper.</p></sec><sec id="s8"><title>Cite this paper</title><p>Wang, E. and Wang, D. (2022) Using U-Net to Detect Buildings in Satellite Images. Journal of Computer and Communications, 10, 132-138. https://doi.org/10.4236/jcc.2022.106011</p></sec></body><back><ref-list><title>References</title><ref id="scirp.118235-ref1"><label>1</label><mixed-citation publication-type="other" xlink:type="simple">Weidner, U. and F&#246;rstner, W. (1995) Towards Automatic Building Extraction from High-Resolution Digital Elevation Models. ISPRS Journal of Photogrammetry and Remote Sensing, 50, 38-49. https://doi.org/10.1016/0924-2716(95)98236-S</mixed-citation></ref><ref id="scirp.118235-ref2"><label>2</label><mixed-citation publication-type="other" xlink:type="simple">Liu, W. and Prinet, V. (2005) Building Detection from High-Resolution Satellite Image Using Probability Model. Proceedings of 2005 IEEE International Geoscience and Remote Sensing Symposium, 6, 3888-3891.</mixed-citation></ref><ref id="scirp.118235-ref3"><label>3</label><mixed-citation publication-type="other" xlink:type="simple">Cui, S., Yan, Q., Liu, Z. and Li, M. (2008) Building Detection and Recognition from High Resolution Remotely Sensed Imagery. Proceedings of the XXIst ISPRS Congress, 37, 411-416.</mixed-citation></ref><ref id="scirp.118235-ref4"><label>4</label><mixed-citation publication-type="journal" xlink:type="simple"><name name-style="western"><surname>Theng</surname><given-names> L.B. </given-names></name>,<etal>et al</etal>. 
(<year>2006</year>)<article-title>Automatic Building Extraction from Satellite Imagery</article-title><source> Engineering Letters</source><volume> 13</volume>,<fpage> 255</fpage>-<lpage>259</lpage>.</mixed-citation></ref><ref id="scirp.118235-ref5"><label>5</label><mixed-citation publication-type="other" xlink:type="simple">Xu, Y., Wu, L., Xie, Z. and Chen, Z. (2018) Building Extraction in Very High Resolution Remote Sensing Imagery Using Deep Learning and Guided Filters. Remote Sensing, 10, 144. https://doi.org/10.3390/rs10010144</mixed-citation></ref><ref id="scirp.118235-ref6"><label>6</label><mixed-citation publication-type="other" xlink:type="simple">Chen, F., Ren, R., Van de Voorde, T., Xu, W., Zhou, G. and Zhou, Y. (2018) Fast Automatic Airport Detection in Remote Sensing Images Using Convolutional Neural Networks. Remote Sensing, 10, 443. https://doi.org/10.3390/rs10030443</mixed-citation></ref><ref id="scirp.118235-ref7"><label>7</label><mixed-citation publication-type="other" xlink:type="simple">Bai, T., Pang, Y., Wang, J., Han, K., Luo, J., Wang, H., Lin, J., Wu, J. and Zhang, H. (2020) An Optimized Faster R-CNN Method Based on DRNet and RoI Align for Building Detection in Remote Sensing Images. Remote Sensing, 12, 762. https://doi.org/10.3390/rs12050762</mixed-citation></ref><ref id="scirp.118235-ref8"><label>8</label><mixed-citation publication-type="other" xlink:type="simple">Ritter, N., Ruth, M., Grissom, B.B., Galang, G., Haller, J., Stephenson, G., Covington, S., Nagy, T., Moyers, J., Stickley, J., et al. (2000) GeoTIFF Format Specification GeoTIFF Revision 1.0. SPOT Image Corp, 1, 154-172.</mixed-citation></ref><ref id="scirp.118235-ref9"><label>9</label><mixed-citation publication-type="other" xlink:type="simple">Folk, M., Heber, G., Koziol, Q., Pourmal, E. and Robinson, D. (2011) An Overview of the HDF5 Technology Suite and Its Applications. Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases, pp. 
36-47.</mixed-citation></ref><ref id="scirp.118235-ref10"><label>10</label><mixed-citation publication-type="book" xlink:type="simple">Warmerdam, F. (2008) The Geospatial Data Abstraction Library. In: Hall, G.B. and Leahy, M.G., Eds., Open Source Approaches in Spatial Data Handling, Springer, 87-104. https://doi.org/10.1007/978-3-540-74831-1_5</mixed-citation></ref><ref id="scirp.118235-ref11"><label>11</label><mixed-citation publication-type="other" xlink:type="simple">Ronneberger, O., Fischer, P. and Brox, T. (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, Cham, 234-241.</mixed-citation></ref><ref id="scirp.118235-ref12"><label>12</label><mixed-citation publication-type="other" xlink:type="simple">Charles, P. (2018) Unet: Semantic Segmentation with PyTorch. https://github.com/milesial/Pytorch-UNet</mixed-citation></ref><ref id="scirp.118235-ref13"><label>13</label><mixed-citation publication-type="other" xlink:type="simple">Rubinstein, R.Y. and Kroese, D.P. (2004) The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning. Springer, Berlin.</mixed-citation></ref></ref-list></back></article>