Tuesday, October 23, 2007

Hi there. This blog is about old news stories, mainly public domain stuff.

**PLEASE, PLEASE, PLEASE DO NOT CONSIDER THIS A SPLOG OR BLOG WORTH DELETION! PLEASE READ BELOW FOR MORE INFO ABOUT WHY I'M CREATING THIS BLOG, AND IT MAY MAKE MORE SENSE TO YOU AS TO WHY IT'S HERE!**

I am starting this blog to post old news stories, mostly public domain ones from very old newspapers and similar sources. I'm doing this for two reasons. First, this blog's search capabilities are, in my opinion, better than those of the online homes of many of these old stories. Second, I plan to create mirrors of some of these stories on another website, and plan to use this blog to post links to the URLs on that site.

In each posting, I'll link to the main files online where I found the articles, and possibly to a mirror of each as well. I'm posting mirrors of this stuff because there's no telling how long the original links will stay up: some of them point to government websites that are considered beta at the moment, so the URLs may move around at some point in the future.

A few weeks back, I came across the National Digital Newspaper Program and its Chronicling America: Historic American Newspapers Beta Project.

This is a very awesome project. They are trying to create digital images, in PDF and text format, of public domain newspapers. This sort of stuff is very useful for people like me who like to read about history, and also for people doing college-level studies of the past and of news from the past.

One thing that I noticed when browsing around in the Chronicling America: Historic American Newspapers Beta Project is that while each page of the scanned papers is accessible as a PDF and as a text file, the text files appear to be just straight OCR output from the scanned PDF files. What this means is that a lot of the stories, when read as text files, don't make much sense. Sometimes the OCR didn't accurately translate the story and the result is garbled gibberish. More often than not, though, it translated what it could accurately but read the page from left to right across its full width instead of down each column of text, the way the newspapers were meant to be read. As a result, you can do searches in the Chronicling America Search Pages, but about all you can reliably search for is individual words, since phrase searches may or may not work due to all the OCR mistranslations and the text being moved around.
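To make the column problem a little more concrete, here is a small Python sketch. It is only an illustration under my own assumptions: the two-column "page" is made-up text, the function names are hypothetical, and this is not how the Chronicling America OCR actually works internally. It just shows why a phrase search can fail while a search for the individual words still succeeds when lines are read straight across the page.

```python
# Made-up two-column page: each tuple is one printed line,
# as (left column text, right column text). Illustrative only.
page_rows = [
    ("The mayor announced", "Heavy rains flooded"),
    ("a new bridge over", "the lower streets of"),
    ("the river on Monday.", "the town last night."),
]


def read_across(rows):
    """Join the text left to right across the full page width,
    mixing the two stories together line by line."""
    return " ".join(f"{left} {right}" for left, right in rows)


def read_by_column(rows):
    """Join the text the way a reader would: the whole left column
    first, then the whole right column."""
    left_text = " ".join(left for left, _ in rows)
    right_text = " ".join(right for _, right in rows)
    return f"{left_text} {right_text}"


def has_phrase(text, phrase):
    """True if the exact phrase appears in the text (case-insensitive)."""
    return phrase.lower() in text.lower()


def has_all_words(text, phrase):
    """True if every word of the phrase appears somewhere in the text,
    in any order (case-insensitive, periods ignored)."""
    words = set(text.lower().replace(".", "").split())
    return all(word in words for word in phrase.lower().split())


across = read_across(page_rows)
by_column = read_by_column(page_rows)
query = "announced a new bridge"

print(has_phrase(across, query))      # phrase search on the across-the-page text
print(has_all_words(across, query))   # single-word search on the same text
print(has_phrase(by_column, query))   # phrase search on the column-ordered text
```

In the across-the-page text, "announced" is followed by words from the other story, so the phrase never appears intact even though every word in it is still on the page. That is the same effect I ran into when searching the text files.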

What I'm going to try to do with this blog is create a copy of each news story in the Chronicling America Project, but actually correct the OCR's text translations and edit them so they are readable in English with the full phrases intact. Then I'll post a link to the original PDF source where I got the story, and also to the mirror of the PDF if and when I get the mirrors uploaded to my own website (I'll be using quatoless, at least at first, as a free host for the mirrors since they basically offer near unlimited storage and bandwidth for free).

This will be a time-consuming chore, but it is a worthwhile effort, if for no other reason than that it'll allow people working on scholarly research papers, and others trying to dig up more info from Chronicling America, to search this blog and get the info they want without having to deal with the sometimes arcane search results you'll get on the Project's main site. Eventually, I may expand this blog to other, mostly public domain, news sources as well, but for now, this is a huge project, so I'm going to try to stay focused on it.

I'll mostly only be able to work on this on weekends, so if you want to do something similar with your blog, please post a reply here and link your blog here. This is a worthy, educational project to undertake. It's not a splog or anything, even though it may have some similarities in some ways. Hopefully you can see why it's a worthwhile effort and won't report it as a blog that needs to be deleted, since it doesn't need to be. Thanks.
