process « hadoop « Java Database Q&A

1. Processing files with headers in Hadoop    stackoverflow.com

I want to process a lot of files in Hadoop -- each file has some header information, followed by a lot of records, each stored in a fixed number of bytes. ...
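The header-plus-fixed-width layout can be sketched without any Hadoop classes. A minimal sketch, assuming a hypothetical 4-byte header that holds the record size (the real header format isn't given in the question); a real job would wrap this logic in a custom RecordReader:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class FixedRecordDemo {
    // Hypothetical layout: a 4-byte header holding the record size,
    // followed by records of exactly that many bytes.
    static List<byte[]> readRecords(byte[] file) {
        ByteBuffer buf = ByteBuffer.wrap(file);
        int recordSize = buf.getInt();          // header: fixed record length
        List<byte[]> records = new ArrayList<>();
        if (recordSize <= 0) return records;    // guard against a bad header
        while (buf.remaining() >= recordSize) { // stop at a trailing partial record
            byte[] rec = new byte[recordSize];
            buf.get(rec);
            records.add(rec);
        }
        return records;
    }

    public static void main(String[] args) {
        // header says "3 bytes per record", then two records: "foo", "bar"
        byte[] file = {0, 0, 0, 3, 'f', 'o', 'o', 'b', 'a', 'r'};
        for (byte[] r : readRecords(file)) {
            System.out.println(new String(r)); // prints foo, then bar
        }
    }
}
```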

2. How to ensure MapReduce tasks are independent from each other?    stackoverflow.com

I'm curious, but how does MapReduce, Hadoop, etc., break a chunk of data into independently operated tasks? I'm having a hard time imagining how that can be, considering it is ...
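At its core the independence is a contract: each mapper sees only its own input split and shares no state with other mappers, and the framework groups intermediate keys before the reduce phase. A toy word count in plain Java (no Hadoop dependencies; names are illustrative) makes the contract concrete:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ToyMapReduce {
    // "map": each chunk is handled independently -- no shared state,
    // so chunks could run on different machines in any order.
    static List<String> map(String chunk) {
        List<String> out = new ArrayList<>();
        for (String w : chunk.toLowerCase().split("\\s+")) {
            if (!w.isEmpty()) out.add(w);
        }
        return out;
    }

    // "shuffle + reduce": group identical keys across all map outputs,
    // then count each group.
    static Map<String, Integer> reduce(List<List<String>> mapped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> part : mapped)
            for (String w : part)
                counts.merge(w, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        // Two "splits" that could be mapped on different machines.
        List<List<String>> mapped = new ArrayList<>();
        mapped.add(map("the quick fox"));
        mapped.add(map("the lazy dog"));
        System.out.println(reduce(mapped)); // {dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```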

3. Any Open Source Pregel like framework for distributed processing of large Graphs?    stackoverflow.com

Google has described a novel framework for distributed processing on massive graphs. http://portal.acm.org/citation.cfm?id=1582716.1582723 I wanted to know whether, similar to Hadoop (Map-Reduce), there are any open source implementations of this framework? ...

4. Hadoop: Processing large serialized objects    stackoverflow.com

I am working on an application to process (and merge) several large Java serialized objects (on the order of GBs) using the Hadoop framework. Hadoop stores and distributes blocks of a file ...

5. image processing with hadoop    stackoverflow.com

How do I read video frames in Hadoop?

6. How to process lines in a file in specific hadoop slave?    stackoverflow.com

We have a custom input format extending the FileInputFormat, which generates a separate split for each line in the input file. This file provides a host name in which the mapper ...

7. What is the process to compile Nutch into one Jar file (and run it)?    stackoverflow.com

I'm trying to run the Nutch crawler in a way that I can access all its functionality through one JAR file that contains all its dependencies. For instance,

java -jar nutch-all-1.2.jar -crawl <other ...

8. Hadoop map-reduce vs. Cascading: which is better on the basis of processing time?    stackoverflow.com

I have used Cascading as well as M/R; the Cascading job looks slow compared to M/R, about 25% to 50% slower to me. Is that true, or do I need to dig more ...

9. Hadoop for processing very large binary files    stackoverflow.com

I have a system I wish to distribute where I have a number of very large non-splittable binary files I wish to process in a distributed fashion. These are of the ...


10. Custom inputformat to process protobufs in hadoop 0.20    stackoverflow.com

I'd like to process protobufs using Hadoop, but am unsure where to start. I don't care about splitting large files. The protobufs are stored as binary data ... what class should I extend to make it ...
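The question leaves the on-disk layout open; a common convention (assumed here, not stated in the question) is length-delimited framing, where each message is preceded by a varint length, the encoding protobuf's `writeDelimitedTo` produces. The framing logic a custom RecordReader would need, sketched in plain Java:

```java
import java.util.ArrayList;
import java.util.List;

public class VarintFraming {
    // Decode a protobuf-style varint (little-endian base-128) at offset.
    // Returns {value, bytesConsumed}.
    static int[] readVarint(byte[] buf, int off) {
        int value = 0, shift = 0, i = off;
        while (true) {
            byte b = buf[i++];
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
        }
        return new int[]{value, i - off};
    }

    // Split a stream of length-prefixed messages into individual frames.
    static List<byte[]> split(byte[] buf) {
        List<byte[]> frames = new ArrayList<>();
        int pos = 0;
        while (pos < buf.length) {
            int[] v = readVarint(buf, pos);   // v[0] = frame length
            pos += v[1];
            byte[] frame = new byte[v[0]];
            System.arraycopy(buf, pos, frame, 0, v[0]);
            frames.add(frame);
            pos += v[0];
        }
        return frames;
    }

    public static void main(String[] args) {
        // two frames: a 3-byte "abc" and a 2-byte "de"
        byte[] stream = {3, 'a', 'b', 'c', 2, 'd', 'e'};
        for (byte[] f : split(stream)) System.out.println(new String(f)); // abc, de
    }
}
```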

11. Processing xml files with Hadoop    stackoverflow.com

I'm new to Hadoop. I know very little about it. My case is as follows: I have a set of xml files (700GB+) with the same schema.

<article>
 <title>some title</title>
 <abstract>some abstract</abstract>
 <year>2000</year>
 <id>E123456</id>
 ...
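Mahout's XmlInputFormat handles exactly this shape of input: it scans a split for a configured start tag and end tag and emits everything in between as one record. A stdlib-only sketch of that scan (tag names taken from the schema above; the real class also handles records straddling split boundaries):

```java
import java.util.ArrayList;
import java.util.List;

public class TagSplitter {
    // Extract every start...end block from a buffer -- the same
    // start-tag/end-tag scan XmlInputFormat performs on a split.
    static List<String> extract(String xml, String start, String end) {
        List<String> records = new ArrayList<>();
        int pos = 0;
        while (true) {
            int s = xml.indexOf(start, pos);
            if (s < 0) break;
            int e = xml.indexOf(end, s);
            if (e < 0) break;                 // unterminated record: stop
            records.add(xml.substring(s, e + end.length()));
            pos = e + end.length();
        }
        return records;
    }

    public static void main(String[] args) {
        String xml = "<articles><article><id>E1</id></article>"
                   + "<article><id>E2</id></article></articles>";
        for (String r : extract(xml, "<article>", "</article>"))
            System.out.println(r); // each <article>...</article> block
    }
}
```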

12. HDFS says file is still open, but process writing to it was killed    stackoverflow.com

I'm new to Hadoop, and I've spent the past couple of hours trying to google this issue, but I couldn't find anything that helped. My problem is that HDFS says the file is ...

13. Processing paragraphs in text files as single records with Hadoop    stackoverflow.com

Simplifying my problem a bit, I have a set of text files with "records" that are delimited by double newline characters. Like

'multiline text' 'empty line' ...
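One approach is a custom RecordReader that treats a blank line as the record delimiter (some Hadoop versions also expose `textinputformat.record.delimiter` for this). The core splitting logic, sketched in plain Java:

```java
import java.util.Arrays;
import java.util.List;

public class ParagraphSplit {
    // Treat runs of blank lines as record separators -- the behavior a
    // custom RecordReader would give, so each mapper receives one
    // whole paragraph as its value.
    static List<String> records(String text) {
        return Arrays.asList(text.trim().split("\\n\\s*\\n"));
    }

    public static void main(String[] args) {
        String file = "first record\nstill first\n\nsecond record\n\n\nthird";
        for (String r : records(file))
            System.out.println("[" + r + "]"); // three bracketed records
    }
}
```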

14. Processing large set of small files with Hadoop    stackoverflow.com

I am using the Hadoop example program WordCount to process a large set of small files/web pages (ca. 2-3 kB). Since this is far from the optimal file size for Hadoop, the ...
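The usual remedies are packing the small files into SequenceFiles (file name as key, contents as value), Hadoop archives (HAR), or CombineFileInputFormat. A stdlib-only sketch of the packing idea (this is not the real SequenceFile format, just the concept of one big keyed container instead of thousands of tiny files):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class SmallFilePack {
    // Pack many (name, content) pairs into one stream -- the idea behind
    // using a SequenceFile to avoid one map task per tiny file.
    static byte[] pack(Map<String, byte[]> files) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            for (Map.Entry<String, byte[]> e : files.entrySet()) {
                out.writeUTF(e.getKey());            // key: file name
                out.writeInt(e.getValue().length);   // value length
                out.write(e.getValue());             // value: file bytes
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);           // cannot happen in-memory
        }
    }

    static Map<String, byte[]> unpack(byte[] packed) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(packed));
            Map<String, byte[]> files = new LinkedHashMap<>();
            while (in.available() > 0) {
                String name = in.readUTF();
                byte[] body = new byte[in.readInt()];
                in.readFully(body);
                files.put(name, body);
            }
            return files;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("page1.html", "<html>a</html>".getBytes());
        files.put("page2.html", "<html>b</html>".getBytes());
        System.out.println(unpack(pack(files)).keySet()); // [page1.html, page2.html]
    }
}
```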

15. XML Processing in hadoop    stackoverflow.com

I have 200+ XML files in HDFS. I use the XmlInputFormat (from Mahout) to stream the elements. The mapper is able to get the XML contents and process them. ...

16. Possible to use Map Reduce and Hadoop to parallel process batch jobs?    stackoverflow.com

Our organization has hundreds of batch jobs that run overnight. Many of these jobs require 2, 3, 4 hours to complete; some even require up to 7 hours. Currently, these jobs ...


17. Task process exit with nonzero status of 126    stackoverflow.com

java.io.IOException: Task process exit with nonzero status of 126.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
HTTP ERROR: 410 Failed to retrieve stdout log for task: attempt_201107051558_0371_m_000000_0 RequestURI=/tasklog
I am getting the above error when ...

18. Hadoop streaming job failure: Task process exit with nonzero status of 137    stackoverflow.com

I've been banging my head on this one for a few days, and hope that somebody out there will have some insight. I've written a streaming map reduce job in perl that ...

19. How do I give the tasktracker/mapred user within hadoop permissions to modify files and execute processes?    stackoverflow.com

I'm running Hadoop, and within a mapper I'm executing a process and creating/editing files. Unfortunately I'm getting permissions errors for the mapred user, such as:

org.apache.hadoop.util.Shell$ExitCodeException: cp: cannot create directory
Anyone know ...

20. Running certain Hadoop Jobs only on a chosen node and not in the others, managing the process with Oozie    stackoverflow.com

Is that even possible? I've searched quite a bit and I'd say it's not, but I find it strange that such basic functionality has not been foreseen. If I have ...

21. hive reads from a table partition fail when the partition is overwritten by a different process    stackoverflow.com

Environment : 
hive : 0.6.x
hadoop : 0.20.x
I have a table "clicks" which is partitioned by "dt". The schema is
create table clicks (
 ... columns
)
PARTITIONED BY (dt string) 
ROW FORMAT DELIMITED ...

22. Hadoop, Mahout real-time processing alternative    stackoverflow.com

I intended to use Hadoop as a "computation cluster" in my project. However, then I read that Hadoop is not intended for real-time systems because of the overhead connected with the start of a ...

23. Using search logs for processing on Hadoop    stackoverflow.com

I am going to do a project on massive text processing with Hadoop. One thing I am thinking of is to use search log data sets to identify search patterns ...

24. Different ways of configuring the memory to the TaskTracker child process (Mapper and Reduce Tasks)    stackoverflow.com

What is the difference between setting mapred.job.map.memory.mb and setting -Xmx in mapred.child.java.opts to control the maximum memory used by a Mapper or Reducer task? Which one takes ...
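Briefly: mapred.job.map.memory.mb is the overall per-task memory limit the framework can schedule against and enforce on the whole task process, while -Xmx in mapred.child.java.opts caps only the child JVM's heap; the heap plus JVM overhead must fit under the overall limit. A mapred-site.xml sketch with illustrative values (the numbers are assumptions, not recommendations):

```xml
<!-- mapred-site.xml: illustrative values only -->
<property>
  <name>mapred.job.map.memory.mb</name>
  <value>2048</value> <!-- total memory the framework may enforce per map task -->
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1536m</value> <!-- JVM heap only; must fit under the limit above -->
</property>
```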

25. using pig/hive for data processing instead of direct java map reduce code?    stackoverflow.com

(Even more basic than Difference between Pig and Hive? Why have both?) I have a data processing pipeline written as several Java map-reduce tasks over Hadoop (my own custom code, derived ...