Distinguishing between natural images (NIs) and computer-generated (CG) images with the naked eye is difficult. In this work, we propose an effective method based on a convolutional neural network (CNN) for this fundamental image forensic problem. Having observed the rather limited performance obtained by training existing CNNs from scratch or by fine-tuning pre-trained networks, we design and implement a new network suited to this task, with two cascaded convolutional layers at the bottom of the CNN. Our network can be easily adjusted to accommodate different sizes of input image patches while maintaining a fixed depth, a stable CNN structure, and good forensic performance. Considering the complexity of training CNNs and the specific requirements of image forensics, we introduce the so-called \emph{local-to-global} strategy in our proposed network: the CNN derives a forensic decision on each local patch, and a global decision on the full-sized image is then obtained via simple majority voting over the patch decisions. This strategy can also be used to improve the performance of existing methods that are based on hand-crafted features. Experimental results show that our method outperforms existing methods, especially in a challenging forensic scenario with NIs and CG images of heterogeneous origins. Our method is also robust against typical post-processing operations, such as resizing and JPEG compression. Unlike previous attempts to use CNNs for image forensics, we also try to understand what our CNN has learned about the differences between NIs and CG images, with the aid of suitable advanced visualization tools.
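As a minimal sketch of the local-to-global strategy, the majority vote over patch decisions could be written as below. It assumes binary patch labels and a hypothetical patch classifier standing in for the trained CNN. The names `majority_vote`, `classify_image`, and `toy_patch_classifier`, the label encoding, and the tie-breaking rule are illustrative assumptions and are not specified in the text above.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical label encoding (an assumption, not taken from the paper).
NI, CG = 0, 1


def majority_vote(patch_labels: Sequence[int], tie_label: int = CG) -> int:
    """Return the label predicted for most patches.

    Ties are resolved with `tie_label`; the tie rule is an assumption.
    """
    if not patch_labels:
        raise ValueError("need at least one patch label")
    counts = Counter(patch_labels)
    if counts[NI] == counts[CG]:
        return tie_label
    return NI if counts[NI] > counts[CG] else CG


def classify_image(
    patches: Sequence[Sequence[float]],
    patch_classifier: Callable[[Sequence[float]], int],
    tie_label: int = CG,
) -> int:
    """Local step: label every patch. Global step: majority vote."""
    labels = [patch_classifier(patch) for patch in patches]
    return majority_vote(labels, tie_label)


def toy_patch_classifier(patch: Sequence[float]) -> int:
    """Stand-in for the CNN: thresholds the mean patch value (toy rule only)."""
    return CG if sum(patch) / len(patch) > 0.5 else NI


if __name__ == "__main__":
    # Three toy "patches"; two fall below the threshold, one above.
    patches = [[0.1, 0.2], [0.3, 0.4], [0.9, 0.8]]
    print(classify_image(patches, toy_patch_classifier))
```

Because the vote only consumes per-patch labels, the same aggregation step could in principle sit on top of a classifier built on hand-crafted features, which is how the strategy extends to existing methods.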